All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH review 0/11] General unprivileged mount support
@ 2016-07-02 17:18 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:18 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk


As well as in these patches the code is also available from:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

It has been a long time in coming but recently in the userns tree the
superblock has been expanded with a s_user_ns field indicating the user
namespace that owns a superblock.

The s_user_ns owner of a superblock has three implications.
- Only kuids and kgids that map into s_user_ns are allowed to be sent to a
 filesystem from the vfs.
- If the uid or gid on the filesystem does not map into s_user_ns i_uid
  is set to INVALID_UID and i_gid is set to INVALID_GID.
- The scope of permission checks can be changed from global to a
  capabilitiy check in s_user_ns.

The overall strategy is to handle as much of this as possible in the VFS
so that what is happening is consistent between filesystems and has
widespread review, and so that individual filesystems don't need to
duplicate code.

This set of patches inserts checks to ensure only kuids and kgids that
map into s_user_ns are sent to filesystems.

This set of patches updates the vfs to deal with potentially unmapped
uids and gids in the i_uid and i_gid fields.  The strategy adopted is to
deny any activity that causes inodes with unmapped uids or gid to be
written to disk except for a chown that causes makes i_uid and i_gid
to map into s_user_ns.

Relaxing of the capability checks and adding new filesystems that would
benefit from the changes is held off until the vfs support is complete.

I believe this work is complete so if there anything questionable you
see please let me know.

I have included linux-api because several system calls get new failure
modes mostly -EOVERFLOW, and that may need separate documenation and
review.

The target for this work is to enable fully unprivileged fuse mounts
and whichever filesystem results from the uid shifting work.  These
vfs changes should support all kinds of filesystems, but in practice
it is an open problem if it is possible to modify a block based
filesystem to be safe from people manipulating filesystem images in an
attempt to get the kernel to malfunction so only a very limited set of
additional filesystems is ever expected to be enabled by this work.

Eric W. Biederman (6):
      userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
      vfs: Verify acls are valid within superblock's s_user_ns.
      vfs: Don't modify inodes with a uid or gid unknown to the vfs
      vfs: Don't create inodes with a uid or gid unknown to the vfs
      quota: Ensure qids map to the filesystem
      quota: Handle quota data stored in s_user_ns.

Seth Forshee (5):
      fs: Refuse uid/gid changes which don't map into s_user_ns
      fs: Check for invalid i_uid in may_follow_link()
      cred: Reject inodes with invalid ids in set_create_file_as()
      evm: Translate user/group ids relative to s_user_ns when computing HMAC
      fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns

 drivers/staging/lustre/lustre/mdc/mdc_request.c |  2 +-
 fs/9p/acl.c                                     |  2 +-
 fs/attr.c                                       | 19 +++++++++
 fs/inode.c                                      |  7 ++++
 fs/namei.c                                      | 40 ++++++++++++++----
 fs/posix_acl.c                                  |  8 ++--
 fs/quota/dquot.c                                |  7 +++-
 fs/quota/quota.c                                | 14 +++----
 fs/quota/quota_tree.c                           | 23 +++++++----
 fs/quota/quota_v1.c                             |  6 ++-
 fs/quota/quota_v2.c                             | 10 +++--
 fs/xattr.c                                      |  7 ++++
 include/linux/fs.h                              | 55 ++++++++++++++-----------
 include/linux/posix_acl.h                       |  2 +-
 include/linux/quota.h                           | 10 +++++
 include/linux/uidgid.h                          |  4 +-
 kernel/cred.c                                   |  2 +
 security/integrity/evm/evm_crypto.c             |  4 +-
 18 files changed, 154 insertions(+), 68 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH review 0/11] General unprivileged mount support
@ 2016-07-02 17:18 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:18 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk


As well as in these patches the code is also available from:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

It has been a long time in coming but recently in the userns tree the
superblock has been expanded with a s_user_ns field indicating the user
namespace that owns a superblock.

The s_user_ns owner of a superblock has three implications.
- Only kuids and kgids that map into s_user_ns are allowed to be sent to a
 filesystem from the vfs.
- If the uid or gid on the filesystem does not map into s_user_ns i_uid
  is set to INVALID_UID and i_gid is set to INVALID_GID.
- The scope of permission checks can be changed from global to a
  capabilitiy check in s_user_ns.

The overall strategy is to handle as much of this as possible in the VFS
so that what is happening is consistent between filesystems and has
widespread review, and so that individual filesystems don't need to
duplicate code.

This set of patches inserts checks to ensure only kuids and kgids that
map into s_user_ns are sent to filesystems.

This set of patches updates the vfs to deal with potentially unmapped
uids and gids in the i_uid and i_gid fields.  The strategy adopted is to
deny any activity that causes inodes with unmapped uids or gid to be
written to disk except for a chown that causes makes i_uid and i_gid
to map into s_user_ns.

Relaxing of the capability checks and adding new filesystems that would
benefit from the changes is held off until the vfs support is complete.

I believe this work is complete so if there anything questionable you
see please let me know.

I have included linux-api because several system calls get new failure
modes mostly -EOVERFLOW, and that may need separate documenation and
review.

The target for this work is to enable fully unprivileged fuse mounts
and whichever filesystem results from the uid shifting work.  These
vfs changes should support all kinds of filesystems, but in practice
it is an open problem if it is possible to modify a block based
filesystem to be safe from people manipulating filesystem images in an
attempt to get the kernel to malfunction so only a very limited set of
additional filesystems is ever expected to be enabled by this work.

Eric W. Biederman (6):
      userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
      vfs: Verify acls are valid within superblock's s_user_ns.
      vfs: Don't modify inodes with a uid or gid unknown to the vfs
      vfs: Don't create inodes with a uid or gid unknown to the vfs
      quota: Ensure qids map to the filesystem
      quota: Handle quota data stored in s_user_ns.

Seth Forshee (5):
      fs: Refuse uid/gid changes which don't map into s_user_ns
      fs: Check for invalid i_uid in may_follow_link()
      cred: Reject inodes with invalid ids in set_create_file_as()
      evm: Translate user/group ids relative to s_user_ns when computing HMAC
      fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns

 drivers/staging/lustre/lustre/mdc/mdc_request.c |  2 +-
 fs/9p/acl.c                                     |  2 +-
 fs/attr.c                                       | 19 +++++++++
 fs/inode.c                                      |  7 ++++
 fs/namei.c                                      | 40 ++++++++++++++----
 fs/posix_acl.c                                  |  8 ++--
 fs/quota/dquot.c                                |  7 +++-
 fs/quota/quota.c                                | 14 +++----
 fs/quota/quota_tree.c                           | 23 +++++++----
 fs/quota/quota_v1.c                             |  6 ++-
 fs/quota/quota_v2.c                             | 10 +++--
 fs/xattr.c                                      |  7 ++++
 include/linux/fs.h                              | 55 ++++++++++++++-----------
 include/linux/posix_acl.h                       |  2 +-
 include/linux/quota.h                           | 10 +++++
 include/linux/uidgid.h                          |  4 +-
 kernel/cred.c                                   |  2 +
 security/integrity/evm/evm_crypto.c             |  4 +-
 18 files changed, 154 insertions(+), 68 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns
       [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-02 17:20   ` Eric W. Biederman
  2016-07-04  8:52   ` [PATCH review 0/11] General unprivileged mount support Jan Kara
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Add checks to notify_change to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW.

This is mandatory so that fileystems don't have to even think
of dealing with ia_uid and ia_gid that

--EWB Moved the test from inode_change_ok to notify_change

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0f6c88..dd723578ddce 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -255,6 +255,17 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
 		return 0;
 
+	/*
+	 * Verify that uid/gid changes are valid in the target
+	 * namespace of the superblock.
+	 */
+	if (ia_valid & ATTR_UID &&
+	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+		return -EOVERFLOW;
+	if (ia_valid & ATTR_GID &&
+	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns
  2016-07-02 17:18 ` Eric W. Biederman
  (?)
@ 2016-07-02 17:20 ` Eric W. Biederman
  2016-07-02 17:20   ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                     ` (8 more replies)
  -1 siblings, 9 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

Add checks to notify_change to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW.

This is mandatory so that fileystems don't have to even think
of dealing with ia_uid and ia_gid that

--EWB Moved the test from inode_change_ok to notify_change

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/attr.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0f6c88..dd723578ddce 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -255,6 +255,17 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
 		return 0;
 
+	/*
+	 * Verify that uid/gid changes are valid in the target
+	 * namespace of the superblock.
+	 */
+	if (ia_valid & ATTR_UID &&
+	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+		return -EOVERFLOW;
+	if (ia_valid & ATTR_GID &&
+	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 02/11] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
@ 2016-07-02 17:20       ` Eric W. Biederman
  2016-07-02 17:20     ` Eric W. Biederman
                         ` (7 subsequent siblings)
  8 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Refuse to admit any user namespace has a mapping of the INVALID_UID
and the INVALID_GID when !CONFIG_USER_NS.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/uidgid.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h
index 03835522dfcb..25e9d9216340 100644
--- a/include/linux/uidgid.h
+++ b/include/linux/uidgid.h
@@ -177,12 +177,12 @@ static inline gid_t from_kgid_munged(struct user_namespace *to, kgid_t kgid)
 
 static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
 {
-	return true;
+	return uid_valid(uid);
 }
 
 static inline bool kgid_has_mapping(struct user_namespace *ns, kgid_t gid)
 {
-	return true;
+	return gid_valid(gid);
 }
 
 #endif /* CONFIG_USER_NS */
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 02/11] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
@ 2016-07-02 17:20       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Refuse to admit any user namespace has a mapping of the INVALID_UID
and the INVALID_GID when !CONFIG_USER_NS.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/uidgid.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h
index 03835522dfcb..25e9d9216340 100644
--- a/include/linux/uidgid.h
+++ b/include/linux/uidgid.h
@@ -177,12 +177,12 @@ static inline gid_t from_kgid_munged(struct user_namespace *to, kgid_t kgid)
 
 static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
 {
-	return true;
+	return uid_valid(uid);
 }
 
 static inline bool kgid_has_mapping(struct user_namespace *ns, kgid_t gid)
 {
-	return true;
+	return gid_valid(gid);
 }
 
 #endif /* CONFIG_USER_NS */
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns.
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2016-07-02 17:20       ` Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
                       ` (7 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Update posix_acl_valid to verify that an acl is within a user namespace.

Update the callers of posix_acl_valid to pass in an appropriate
user namespace.  For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid.  For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 2 +-
 fs/9p/acl.c                                     | 2 +-
 fs/posix_acl.c                                  | 8 ++++----
 include/linux/posix_acl.h                       | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 86b7445365f4..40cf57fad581 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -434,7 +434,7 @@ static int mdc_unpack_acl(struct ptlrpc_request *req, struct lustre_md *md)
 		return rc;
 	}
 
-	rc = posix_acl_valid(acl);
+	rc = posix_acl_valid(&init_user_ns, acl);
 	if (rc) {
 		CERROR("validate acl: %d\n", rc);
 		posix_acl_release(acl);
diff --git a/fs/9p/acl.c b/fs/9p/acl.c
index 0576eaeb60b9..5b6a1743ea17 100644
--- a/fs/9p/acl.c
+++ b/fs/9p/acl.c
@@ -266,7 +266,7 @@ static int v9fs_xattr_set_acl(const struct xattr_handler *handler,
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 		else if (acl) {
-			retval = posix_acl_valid(acl);
+			retval = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (retval)
 				goto err_out;
 		}
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 8a4a266beff3..647c28180675 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -205,7 +205,7 @@ posix_acl_clone(const struct posix_acl *acl, gfp_t flags)
  * Check if an acl is valid. Returns 0 if it is, or -E... otherwise.
  */
 int
-posix_acl_valid(const struct posix_acl *acl)
+posix_acl_valid(struct user_namespace *user_ns, const struct posix_acl *acl)
 {
 	const struct posix_acl_entry *pa, *pe;
 	int state = ACL_USER_OBJ;
@@ -225,7 +225,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_USER:
 				if (state != ACL_USER)
 					return -EINVAL;
-				if (!uid_valid(pa->e_uid))
+				if (!kuid_has_mapping(user_ns, pa->e_uid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -240,7 +240,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_GROUP:
 				if (state != ACL_GROUP)
 					return -EINVAL;
-				if (!gid_valid(pa->e_gid))
+				if (!kgid_has_mapping(user_ns, pa->e_gid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -845,7 +845,7 @@ posix_acl_xattr_set(const struct xattr_handler *handler,
 			return PTR_ERR(acl);
 
 		if (acl) {
-			ret = posix_acl_valid(acl);
+			ret = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (ret)
 				goto out;
 		}
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 5b5a80cc5926..1af6438fde3e 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -81,7 +81,7 @@ posix_acl_release(struct posix_acl *acl)
 
 extern void posix_acl_init(struct posix_acl *, int);
 extern struct posix_acl *posix_acl_alloc(int, gfp_t);
-extern int posix_acl_valid(const struct posix_acl *);
+extern int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
 extern int posix_acl_permission(struct inode *, const struct posix_acl *, int);
 extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
 extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns.
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
@ 2016-07-02 17:20   ` Eric W. Biederman
  2016-07-02 17:20     ` Eric W. Biederman
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Update posix_acl_valid to verify that an acl is within a user namespace.

Update the callers of posix_acl_valid to pass in an appropriate
user namespace.  For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid.  For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 2 +-
 fs/9p/acl.c                                     | 2 +-
 fs/posix_acl.c                                  | 8 ++++----
 include/linux/posix_acl.h                       | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 86b7445365f4..40cf57fad581 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -434,7 +434,7 @@ static int mdc_unpack_acl(struct ptlrpc_request *req, struct lustre_md *md)
 		return rc;
 	}
 
-	rc = posix_acl_valid(acl);
+	rc = posix_acl_valid(&init_user_ns, acl);
 	if (rc) {
 		CERROR("validate acl: %d\n", rc);
 		posix_acl_release(acl);
diff --git a/fs/9p/acl.c b/fs/9p/acl.c
index 0576eaeb60b9..5b6a1743ea17 100644
--- a/fs/9p/acl.c
+++ b/fs/9p/acl.c
@@ -266,7 +266,7 @@ static int v9fs_xattr_set_acl(const struct xattr_handler *handler,
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 		else if (acl) {
-			retval = posix_acl_valid(acl);
+			retval = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (retval)
 				goto err_out;
 		}
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 8a4a266beff3..647c28180675 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -205,7 +205,7 @@ posix_acl_clone(const struct posix_acl *acl, gfp_t flags)
  * Check if an acl is valid. Returns 0 if it is, or -E... otherwise.
  */
 int
-posix_acl_valid(const struct posix_acl *acl)
+posix_acl_valid(struct user_namespace *user_ns, const struct posix_acl *acl)
 {
 	const struct posix_acl_entry *pa, *pe;
 	int state = ACL_USER_OBJ;
@@ -225,7 +225,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_USER:
 				if (state != ACL_USER)
 					return -EINVAL;
-				if (!uid_valid(pa->e_uid))
+				if (!kuid_has_mapping(user_ns, pa->e_uid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -240,7 +240,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_GROUP:
 				if (state != ACL_GROUP)
 					return -EINVAL;
-				if (!gid_valid(pa->e_gid))
+				if (!kgid_has_mapping(user_ns, pa->e_gid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -845,7 +845,7 @@ posix_acl_xattr_set(const struct xattr_handler *handler,
 			return PTR_ERR(acl);
 
 		if (acl) {
-			ret = posix_acl_valid(acl);
+			ret = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (ret)
 				goto out;
 		}
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 5b5a80cc5926..1af6438fde3e 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -81,7 +81,7 @@ posix_acl_release(struct posix_acl *acl)
 
 extern void posix_acl_init(struct posix_acl *, int);
 extern struct posix_acl *posix_acl_alloc(int, gfp_t);
-extern int posix_acl_valid(const struct posix_acl *);
+extern int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
 extern int posix_acl_permission(struct inode *, const struct posix_acl *, int);
 extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
 extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link()
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2016-07-02 17:20       ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 05/11] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 757a32725d92..8701bd9a5270 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -901,6 +901,7 @@ static inline int may_follow_link(struct nameidata *nd)
 {
 	const struct inode *inode;
 	const struct inode *parent;
+	kuid_t puid;
 
 	if (!sysctl_protected_symlinks)
 		return 0;
@@ -916,7 +917,8 @@ static inline int may_follow_link(struct nameidata *nd)
 		return 0;
 
 	/* Allowed if parent directory and link owner match. */
-	if (uid_eq(parent->i_uid, inode->i_uid))
+	puid = parent->i_uid;
+	if (uid_valid(puid) && uid_eq(puid, inode->i_uid))
 		return 0;
 
 	if (nd->flags & LOOKUP_RCU)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link()
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 757a32725d92..8701bd9a5270 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -901,6 +901,7 @@ static inline int may_follow_link(struct nameidata *nd)
 {
 	const struct inode *inode;
 	const struct inode *parent;
+	kuid_t puid;
 
 	if (!sysctl_protected_symlinks)
 		return 0;
@@ -916,7 +917,8 @@ static inline int may_follow_link(struct nameidata *nd)
 		return 0;
 
 	/* Allowed if parent directory and link owner match. */
-	if (uid_eq(parent->i_uid, inode->i_uid))
+	puid = parent->i_uid;
+	if (uid_valid(puid) && uid_eq(puid, inode->i_uid))
 		return 0;
 
 	if (nd->flags & LOOKUP_RCU)
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link()
@ 2016-07-02 17:20     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 757a32725d92..8701bd9a5270 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -901,6 +901,7 @@ static inline int may_follow_link(struct nameidata *nd)
 {
 	const struct inode *inode;
 	const struct inode *parent;
+	kuid_t puid;
 
 	if (!sysctl_protected_symlinks)
 		return 0;
@@ -916,7 +917,8 @@ static inline int may_follow_link(struct nameidata *nd)
 		return 0;
 
 	/* Allowed if parent directory and link owner match. */
-	if (uid_eq(parent->i_uid, inode->i_uid))
+	puid = parent->i_uid;
+	if (uid_valid(puid) && uid_eq(puid, inode->i_uid))
 		return 0;
 
 	if (nd->flags & LOOKUP_RCU)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 05/11] cred: Reject inodes with invalid ids in set_create_file_as()
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                       ` (2 preceding siblings ...)
  2016-07-02 17:20     ` [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/cred.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index 0c0cd8a62285..5f264fb5737d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -689,6 +689,8 @@ EXPORT_SYMBOL(set_security_override_from_ctx);
  */
 int set_create_files_as(struct cred *new, struct inode *inode)
 {
+	if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+		return -EINVAL;
 	new->fsuid = inode->i_uid;
 	new->fsgid = inode->i_gid;
 	return security_kernel_create_files_as(new, inode);
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 05/11] cred: Reject inodes with invalid ids in set_create_file_as()
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
  2016-07-02 17:20   ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
  2016-07-02 17:20     ` Eric W. Biederman
@ 2016-07-02 17:20   ` Eric W. Biederman
  2016-07-02 17:20     ` Eric W. Biederman
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/cred.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index 0c0cd8a62285..5f264fb5737d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -689,6 +689,8 @@ EXPORT_SYMBOL(set_security_override_from_ctx);
  */
 int set_create_files_as(struct cred *new, struct inode *inode)
 {
+	if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+		return -EINVAL;
 	new->fsuid = inode->i_uid;
 	new->fsgid = inode->i_gid;
 	return security_kernel_create_files_as(new, inode);
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                       ` (3 preceding siblings ...)
  2016-07-02 17:20     ` [PATCH review 05/11] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 07/11] vfs: Don't create " Eric W. Biederman
                       ` (4 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk.  In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set.  Reads of the inodes appear safe and
useful, but any write or modification is disallowed.  The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values.  After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr.  As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed.  evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c          |  8 ++++++++
 fs/inode.c         |  7 +++++++
 fs/namei.c         | 26 +++++++++++++++++++++-----
 fs/xattr.c         |  7 +++++++
 include/linux/fs.h |  5 +++++
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index dd723578ddce..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -266,6 +266,14 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
 		return -EOVERFLOW;
 
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21b30ce..c0ebb97fb085 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1617,6 +1617,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/namei.c b/fs/namei.c
index 8701bd9a5270..840201c4c290 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -410,6 +410,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -2759,10 +2767,11 @@ EXPORT_SYMBOL(__check_sticky);
  *	c. have CAP_FOWNER capability
  *  6. If the victim is append-only or immutable we can't do antyhing with
  *     links pointing to it.
- *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ *  7. If the victim has an unknown uid or gid we can't change the inode.
+ *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2784,7 +2793,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -4190,6 +4199,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 375e37f42cdf..cb25ceb6d1ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,11 @@ struct super_operations {
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
 
+static inline bool HAS_UNMAPPED_ID(struct inode *inode)
+{
+	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk.  In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set.  Reads of the inodes appear safe and
useful, but any write or modification is disallowed.  The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values.  After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr.  As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed.  evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/attr.c          |  8 ++++++++
 fs/inode.c         |  7 +++++++
 fs/namei.c         | 26 +++++++++++++++++++++-----
 fs/xattr.c         |  7 +++++++
 include/linux/fs.h |  5 +++++
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index dd723578ddce..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -266,6 +266,14 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
 		return -EOVERFLOW;
 
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21b30ce..c0ebb97fb085 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1617,6 +1617,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/namei.c b/fs/namei.c
index 8701bd9a5270..840201c4c290 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -410,6 +410,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -2759,10 +2767,11 @@ EXPORT_SYMBOL(__check_sticky);
  *	c. have CAP_FOWNER capability
  *  6. If the victim is append-only or immutable we can't do antyhing with
  *     links pointing to it.
- *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ *  7. If the victim has an unknown uid or gid we can't change the inode.
+ *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2784,7 +2793,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -4190,6 +4199,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 375e37f42cdf..cb25ceb6d1ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,11 @@ struct super_operations {
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
 
+static inline bool HAS_UNMAPPED_ID(struct inode *inode)
+{
+	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs
@ 2016-07-02 17:20     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk.  In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set.  Reads of the inodes appear safe and
useful, but any write or modification is disallowed.  The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values.  After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr.  As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed.  evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c          |  8 ++++++++
 fs/inode.c         |  7 +++++++
 fs/namei.c         | 26 +++++++++++++++++++++-----
 fs/xattr.c         |  7 +++++++
 include/linux/fs.h |  5 +++++
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index dd723578ddce..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -266,6 +266,14 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
 		return -EOVERFLOW;
 
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21b30ce..c0ebb97fb085 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1617,6 +1617,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/namei.c b/fs/namei.c
index 8701bd9a5270..840201c4c290 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -410,6 +410,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -2759,10 +2767,11 @@ EXPORT_SYMBOL(__check_sticky);
  *	c. have CAP_FOWNER capability
  *  6. If the victim is append-only or immutable we can't do antyhing with
  *     links pointing to it.
- *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ *  7. If the victim has an unknown uid or gid we can't change the inode.
+ *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2784,7 +2793,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -4190,6 +4199,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 375e37f42cdf..cb25ceb6d1ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,11 @@ struct super_operations {
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
 
+static inline bool HAS_UNMAPPED_ID(struct inode *inode)
+{
+	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                       ` (4 preceding siblings ...)
  2016-07-02 17:20     ` [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 08/11] quota: Ensure qids map to the filesystem Eric W. Biederman
                       ` (3 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

It is expected that filesystems can not represent uids and gids from
outside of their user namespace.  Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 840201c4c290..629823f19a6a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2814,16 +2814,22 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  *  1. We can't do it if child already exists (open has special treatment for
  *     this case, but since we are inlined it's OK)
  *  2. We can't do it if dir is read-only (done in permission())
- *  3. We should have write and exec permissions on dir
- *  4. We can't do it if dir is immutable (done in permission())
+ *  3. We can't do it if the fs can't represent the fsuid or fsgid.
+ *  4. We should have write and exec permissions on dir
+ *  5. We can't do it if dir is immutable (done in permission())
  */
 static inline int may_create(struct inode *dir, struct dentry *child)
 {
+	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
 	if (child->d_inode)
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
+	s_user_ns = dir->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
                     ` (3 preceding siblings ...)
  2016-07-02 17:20     ` Eric W. Biederman
@ 2016-07-02 17:20   ` Eric W. Biederman
       [not found]     ` <20160702172035.19568-7-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2016-07-04  7:59       ` Jan Kara
  2016-07-02 17:20   ` [PATCH review 08/11] quota: Ensure qids map to the filesystem Eric W. Biederman
                     ` (3 subsequent siblings)
  8 siblings, 2 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

It is expected that filesystems can not represent uids and gids from
outside of their user namespace.  Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 840201c4c290..629823f19a6a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2814,16 +2814,22 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  *  1. We can't do it if child already exists (open has special treatment for
  *     this case, but since we are inlined it's OK)
  *  2. We can't do it if dir is read-only (done in permission())
- *  3. We should have write and exec permissions on dir
- *  4. We can't do it if dir is immutable (done in permission())
+ *  3. We can't do it if the fs can't represent the fsuid or fsgid.
+ *  4. We should have write and exec permissions on dir
+ *  5. We can't do it if dir is immutable (done in permission())
  */
 static inline int may_create(struct inode *dir, struct dentry *child)
 {
+	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
 	if (child->d_inode)
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
+	s_user_ns = dir->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
 
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 08/11] quota: Ensure qids map to the filesystem
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                       ` (5 preceding siblings ...)
  2016-07-02 17:20     ` [PATCH review 07/11] vfs: Don't create " Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 09/11] quota: Handle quota data stored in s_user_ns Eric W. Biederman
                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Introduce the helper qid_has_mapping and use it to ensure that the
quota system only considers qids that map to the filesystems
s_user_ns.

In practice for quota supporting filesystems today this is the exact
same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
map into init_user_ns.

Replace the qid_valid calls with qid_has_mapping as values come in
from userspace.  This is harmless today and it prepares the quota
system to work on filesystems with quotas but mounted by unprivileged
users.

Call qid_has_mapping from dqget.  This ensures the passed in qid has a
prepresentation on the underlying filesystem.  Previously this was
unnecessary as filesystesm never had qids that could not map.  With
the introduction of filesystems outside of s_user_ns this will not
remain true.

All of this ensures the quota code never has to deal with qids that
don't map to the underlying filesystem.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/dquot.c      |  3 +++
 fs/quota/quota.c      | 12 ++++++------
 include/linux/quota.h | 10 ++++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980d0119..74706b6aa747 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
 	unsigned int hashent = hashfn(sb, qid);
 	struct dquot *dquot, *empty = NULL;
 
+	if (!qid_has_mapping(sb->s_user_ns, qid))
+		return ERR_PTR(-EINVAL);
+
         if (!sb_has_quota_active(sb, qid.type))
 		return ERR_PTR(-ESRCH);
 we_slept:
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 0f10ee9892ce..73f6f4cf0a21 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
 	if (ret)
@@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
 	if (ret)
@@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	copy_from_if_dqblk(&fdq, &idq);
 	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
@@ -581,7 +581,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
 	if (from_kqid(&init_user_ns, qid) == 0 &&
@@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
 	if (ret)
@@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
 	if (ret)
diff --git a/include/linux/quota.h b/include/linux/quota.h
index 9dfb6bce8c9e..1db16ee39b31 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -179,6 +179,16 @@ static inline struct kqid make_kqid_projid(kprojid_t projid)
 	return kqid;
 }
 
+/**
+ *	qid_has_mapping - Report if a qid maps into a user namespace.
+ *	@ns:  The user namespace to see if a value maps into.
+ *	@qid: The kernel internal quota identifier to test.
+ */
+static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid)
+{
+	return from_kqid(ns, qid) != (qid_t) -1;
+}
+
 
 extern spinlock_t dq_data_lock;
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 08/11] quota: Ensure qids map to the filesystem
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
                     ` (4 preceding siblings ...)
  2016-07-02 17:20   ` [PATCH review 07/11] vfs: Don't create " Eric W. Biederman
@ 2016-07-02 17:20   ` Eric W. Biederman
  2016-07-02 17:20     ` Eric W. Biederman
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Introduce the helper qid_has_mapping and use it to ensure that the
quota system only considers qids that map to the filesystems
s_user_ns.

In practice for quota supporting filesystems today this is the exact
same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
map into init_user_ns.

Replace the qid_valid calls with qid_has_mapping as values come in
from userspace.  This is harmless today and it prepares the quota
system to work on filesystems with quotas but mounted by unprivileged
users.

Call qid_has_mapping from dqget.  This ensures the passed in qid has a
prepresentation on the underlying filesystem.  Previously this was
unnecessary as filesystesm never had qids that could not map.  With
the introduction of filesystems outside of s_user_ns this will not
remain true.

All of this ensures the quota code never has to deal with qids that
don't map to the underlying filesystem.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/quota/dquot.c      |  3 +++
 fs/quota/quota.c      | 12 ++++++------
 include/linux/quota.h | 10 ++++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980d0119..74706b6aa747 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
 	unsigned int hashent = hashfn(sb, qid);
 	struct dquot *dquot, *empty = NULL;
 
+	if (!qid_has_mapping(sb->s_user_ns, qid))
+		return ERR_PTR(-EINVAL);
+
         if (!sb_has_quota_active(sb, qid.type))
 		return ERR_PTR(-ESRCH);
 we_slept:
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 0f10ee9892ce..73f6f4cf0a21 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
 	if (ret)
@@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
 	if (ret)
@@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	copy_from_if_dqblk(&fdq, &idq);
 	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
@@ -581,7 +581,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
 	if (from_kqid(&init_user_ns, qid) == 0 &&
@@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
 	if (ret)
@@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
 	if (ret)
diff --git a/include/linux/quota.h b/include/linux/quota.h
index 9dfb6bce8c9e..1db16ee39b31 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -179,6 +179,16 @@ static inline struct kqid make_kqid_projid(kprojid_t projid)
 	return kqid;
 }
 
+/**
+ *	qid_has_mapping - Report if a qid maps into a user namespace.
+ *	@ns:  The user namespace to see if a value maps into.
+ *	@qid: The kernel internal quota identifier to test.
+ */
+static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid)
+{
+	return from_kqid(ns, qid) != (qid_t) -1;
+}
+
 
 extern spinlock_t dq_data_lock;
 
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                       ` (6 preceding siblings ...)
  2016-07-02 17:20     ` [PATCH review 08/11] quota: Ensure qids map to the filesystem Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 10/11] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
  2016-07-02 17:20       ` Eric W. Biederman
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

For the dquot hashfn translate the qid into sb->s_user_ns to remove
extraneous bits before hashing.

When dealing with generic quota files in quota_tree.c, quota_v1.c, and
quota_v2.c before writing quotas to the file encode them in
sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

One complication comes up in qtree_get_next_id, as it is possible in
principle for id's to be present in the quota file that do not have a
mapping in the filesystems user namespace.  If that is the case the
code is modified to skip the unmapped ids.

Inspired-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/dquot.c      |  4 ++--
 fs/quota/quota.c      |  2 +-
 fs/quota/quota_tree.c | 23 ++++++++++++++---------
 fs/quota/quota_v1.c   |  6 ++++--
 fs/quota/quota_v2.c   | 10 ++++++----
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..9b39925f34f4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -252,7 +252,7 @@ static int __dquot_initialize(struct inode *inode, int type);
 static inline unsigned int
 hashfn(const struct super_block *sb, struct kqid qid)
 {
-	unsigned int id = from_kqid(&init_user_ns, qid);
+	unsigned int id = from_kqid(sb->s_user_ns, qid);
 	int type = qid.type;
 	unsigned long tmp;
 
@@ -748,7 +748,7 @@ void dqput(struct dquot *dquot)
 	if (!atomic_read(&dquot->dq_count)) {
 		quota_error(dquot->dq_sb, "trying to free free dquot of %s %d",
 			    quotatypes[dquot->dq_id.type],
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		BUG();
 	}
 #endif
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c
index 0738972e8d3f..4f8f6bdc698c 100644
--- a/fs/quota/quota_tree.c
+++ b/fs/quota/quota_tree.c
@@ -34,7 +34,7 @@ static int __get_index(struct qtree_mem_dqinfo *info, qid_t id, int depth)
 
 static int get_index(struct qtree_mem_dqinfo *info, struct kqid qid, int depth)
 {
-	qid_t id = from_kqid(&init_user_ns, qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, qid);
 
 	return __get_index(info, id, depth);
 }
@@ -554,7 +554,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info,
 	if (i == qtree_dqstr_in_blk(info)) {
 		quota_error(dquot->dq_sb,
 			    "Quota for id %u referenced but not present",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		ret = -EIO;
 		goto out_buf;
 	} else {
@@ -624,7 +624,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 			if (offset < 0)
 				quota_error(sb,"Can't read quota structure "
 					    "for id %u",
-					    from_kqid(&init_user_ns,
+					    from_kqid(sb->s_user_ns,
 						      dquot->dq_id));
 			dquot->dq_off = 0;
 			set_bit(DQ_FAKE_B, &dquot->dq_flags);
@@ -643,7 +643,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 		if (ret >= 0)
 			ret = -EIO;
 		quota_error(sb, "Error while reading quota structure for id %u",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(sb->s_user_ns, dquot->dq_id));
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		kfree(ddquot);
@@ -721,13 +721,18 @@ out_buf:
 
 int qtree_get_next_id(struct qtree_mem_dqinfo *info, struct kqid *qid)
 {
-	qid_t id = from_kqid(&init_user_ns, *qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, *qid);
+	struct kqid next_id;
 	int ret;
 
-	ret = find_next_id(info, &id, QT_TREEOFF, 0);
-	if (ret < 0)
-		return ret;
-	*qid = make_kqid(&init_user_ns, qid->type, id);
+	do {
+		ret = find_next_id(info, &id, QT_TREEOFF, 0);
+		if (ret < 0)
+			return ret;
+		next_id = make_kqid(info->dqi_sb->s_user_ns, qid->type, id);
+	} while (!qid_valid(next_id)); /* Loop until a mapped id is found */
+
+	*qid = next_id;
 	return 0;
 }
 EXPORT_SYMBOL(qtree_get_next_id);
diff --git a/fs/quota/quota_v1.c b/fs/quota/quota_v1.c
index 8fe79beced5c..087e7743b996 100644
--- a/fs/quota/quota_v1.c
+++ b/fs/quota/quota_v1.c
@@ -54,6 +54,7 @@ static void v1_mem2disk_dqblk(struct v1_disk_dqblk *d, struct mem_dqblk *m)
 
 static int v1_read_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	int type = dquot->dq_id.type;
 	struct v1_disk_dqblk dqblk;
 
@@ -64,7 +65,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 	memset(&dqblk, 0, sizeof(struct v1_disk_dqblk));
 	dquot->dq_sb->s_op->quota_read(dquot->dq_sb, type, (char *)&dqblk,
 			sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id));
 
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 &&
@@ -79,6 +80,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 
 static int v1_commit_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	short type = dquot->dq_id.type;
 	ssize_t ret;
 	struct v1_disk_dqblk dqblk;
@@ -95,7 +97,7 @@ static int v1_commit_dqblk(struct dquot *dquot)
 	if (sb_dqopt(dquot->dq_sb)->files[type])
 		ret = dquot->dq_sb->s_op->quota_write(dquot->dq_sb, type,
 			(char *)&dqblk, sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kquid(dq_user_ns, dquot->dq_id)));
 	if (ret != sizeof(struct v1_disk_dqblk)) {
 		quota_error(dquot->dq_sb, "dquota write failed");
 		if (ret >= 0)
diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index ca71bf881ad1..4ecf91611a87 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -199,6 +199,7 @@ static void v2r0_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r0_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -212,7 +213,7 @@ static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le32(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_qid));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -225,7 +226,7 @@ static int v2r0_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, quot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
@@ -252,6 +253,7 @@ static void v2r1_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r1_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -265,7 +267,7 @@ static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le64(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -278,7 +280,7 @@ static int v2r1_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20     ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

For the dquot hashfn translate the qid into sb->s_user_ns to remove
extraneous bits before hashing.

When dealing with generic quota files in quota_tree.c, quota_v1.c, and
quota_v2.c before writing quotas to the file encode them in
sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

One complication comes up in qtree_get_next_id, as it is possible in
principle for id's to be present in the quota file that do not have a
mapping in the filesystems user namespace.  If that is the case the
code is modified to skip the unmapped ids.

Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/quota/dquot.c      |  4 ++--
 fs/quota/quota.c      |  2 +-
 fs/quota/quota_tree.c | 23 ++++++++++++++---------
 fs/quota/quota_v1.c   |  6 ++++--
 fs/quota/quota_v2.c   | 10 ++++++----
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..9b39925f34f4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -252,7 +252,7 @@ static int __dquot_initialize(struct inode *inode, int type);
 static inline unsigned int
 hashfn(const struct super_block *sb, struct kqid qid)
 {
-	unsigned int id = from_kqid(&init_user_ns, qid);
+	unsigned int id = from_kqid(sb->s_user_ns, qid);
 	int type = qid.type;
 	unsigned long tmp;
 
@@ -748,7 +748,7 @@ void dqput(struct dquot *dquot)
 	if (!atomic_read(&dquot->dq_count)) {
 		quota_error(dquot->dq_sb, "trying to free free dquot of %s %d",
 			    quotatypes[dquot->dq_id.type],
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		BUG();
 	}
 #endif
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c
index 0738972e8d3f..4f8f6bdc698c 100644
--- a/fs/quota/quota_tree.c
+++ b/fs/quota/quota_tree.c
@@ -34,7 +34,7 @@ static int __get_index(struct qtree_mem_dqinfo *info, qid_t id, int depth)
 
 static int get_index(struct qtree_mem_dqinfo *info, struct kqid qid, int depth)
 {
-	qid_t id = from_kqid(&init_user_ns, qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, qid);
 
 	return __get_index(info, id, depth);
 }
@@ -554,7 +554,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info,
 	if (i == qtree_dqstr_in_blk(info)) {
 		quota_error(dquot->dq_sb,
 			    "Quota for id %u referenced but not present",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		ret = -EIO;
 		goto out_buf;
 	} else {
@@ -624,7 +624,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 			if (offset < 0)
 				quota_error(sb,"Can't read quota structure "
 					    "for id %u",
-					    from_kqid(&init_user_ns,
+					    from_kqid(sb->s_user_ns,
 						      dquot->dq_id));
 			dquot->dq_off = 0;
 			set_bit(DQ_FAKE_B, &dquot->dq_flags);
@@ -643,7 +643,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 		if (ret >= 0)
 			ret = -EIO;
 		quota_error(sb, "Error while reading quota structure for id %u",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(sb->s_user_ns, dquot->dq_id));
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		kfree(ddquot);
@@ -721,13 +721,18 @@ out_buf:
 
 int qtree_get_next_id(struct qtree_mem_dqinfo *info, struct kqid *qid)
 {
-	qid_t id = from_kqid(&init_user_ns, *qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, *qid);
+	struct kqid next_id;
 	int ret;
 
-	ret = find_next_id(info, &id, QT_TREEOFF, 0);
-	if (ret < 0)
-		return ret;
-	*qid = make_kqid(&init_user_ns, qid->type, id);
+	do {
+		ret = find_next_id(info, &id, QT_TREEOFF, 0);
+		if (ret < 0)
+			return ret;
+		next_id = make_kqid(info->dqi_sb->s_user_ns, qid->type, id);
+	} while (!qid_valid(next_id)); /* Loop until a mapped id is found */
+
+	*qid = next_id;
 	return 0;
 }
 EXPORT_SYMBOL(qtree_get_next_id);
diff --git a/fs/quota/quota_v1.c b/fs/quota/quota_v1.c
index 8fe79beced5c..087e7743b996 100644
--- a/fs/quota/quota_v1.c
+++ b/fs/quota/quota_v1.c
@@ -54,6 +54,7 @@ static void v1_mem2disk_dqblk(struct v1_disk_dqblk *d, struct mem_dqblk *m)
 
 static int v1_read_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	int type = dquot->dq_id.type;
 	struct v1_disk_dqblk dqblk;
 
@@ -64,7 +65,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 	memset(&dqblk, 0, sizeof(struct v1_disk_dqblk));
 	dquot->dq_sb->s_op->quota_read(dquot->dq_sb, type, (char *)&dqblk,
 			sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id));
 
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 &&
@@ -79,6 +80,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 
 static int v1_commit_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	short type = dquot->dq_id.type;
 	ssize_t ret;
 	struct v1_disk_dqblk dqblk;
@@ -95,7 +97,7 @@ static int v1_commit_dqblk(struct dquot *dquot)
 	if (sb_dqopt(dquot->dq_sb)->files[type])
 		ret = dquot->dq_sb->s_op->quota_write(dquot->dq_sb, type,
 			(char *)&dqblk, sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kquid(dq_user_ns, dquot->dq_id)));
 	if (ret != sizeof(struct v1_disk_dqblk)) {
 		quota_error(dquot->dq_sb, "dquota write failed");
 		if (ret >= 0)
diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index ca71bf881ad1..4ecf91611a87 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -199,6 +199,7 @@ static void v2r0_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r0_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -212,7 +213,7 @@ static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le32(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_qid));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -225,7 +226,7 @@ static int v2r0_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, quot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
@@ -252,6 +253,7 @@ static void v2r1_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r1_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -265,7 +267,7 @@ static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le64(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -278,7 +280,7 @@ static int v2r1_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-02 17:20     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

For the dquot hashfn translate the qid into sb->s_user_ns to remove
extraneous bits before hashing.

When dealing with generic quota files in quota_tree.c, quota_v1.c, and
quota_v2.c before writing quotas to the file encode them in
sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

One complication comes up in qtree_get_next_id, as it is possible in
principle for id's to be present in the quota file that do not have a
mapping in the filesystems user namespace.  If that is the case the
code is modified to skip the unmapped ids.

Inspired-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/dquot.c      |  4 ++--
 fs/quota/quota.c      |  2 +-
 fs/quota/quota_tree.c | 23 ++++++++++++++---------
 fs/quota/quota_v1.c   |  6 ++++--
 fs/quota/quota_v2.c   | 10 ++++++----
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..9b39925f34f4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -252,7 +252,7 @@ static int __dquot_initialize(struct inode *inode, int type);
 static inline unsigned int
 hashfn(const struct super_block *sb, struct kqid qid)
 {
-	unsigned int id = from_kqid(&init_user_ns, qid);
+	unsigned int id = from_kqid(sb->s_user_ns, qid);
 	int type = qid.type;
 	unsigned long tmp;
 
@@ -748,7 +748,7 @@ void dqput(struct dquot *dquot)
 	if (!atomic_read(&dquot->dq_count)) {
 		quota_error(dquot->dq_sb, "trying to free free dquot of %s %d",
 			    quotatypes[dquot->dq_id.type],
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		BUG();
 	}
 #endif
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c
index 0738972e8d3f..4f8f6bdc698c 100644
--- a/fs/quota/quota_tree.c
+++ b/fs/quota/quota_tree.c
@@ -34,7 +34,7 @@ static int __get_index(struct qtree_mem_dqinfo *info, qid_t id, int depth)
 
 static int get_index(struct qtree_mem_dqinfo *info, struct kqid qid, int depth)
 {
-	qid_t id = from_kqid(&init_user_ns, qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, qid);
 
 	return __get_index(info, id, depth);
 }
@@ -554,7 +554,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info,
 	if (i == qtree_dqstr_in_blk(info)) {
 		quota_error(dquot->dq_sb,
 			    "Quota for id %u referenced but not present",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		ret = -EIO;
 		goto out_buf;
 	} else {
@@ -624,7 +624,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 			if (offset < 0)
 				quota_error(sb,"Can't read quota structure "
 					    "for id %u",
-					    from_kqid(&init_user_ns,
+					    from_kqid(sb->s_user_ns,
 						      dquot->dq_id));
 			dquot->dq_off = 0;
 			set_bit(DQ_FAKE_B, &dquot->dq_flags);
@@ -643,7 +643,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 		if (ret >= 0)
 			ret = -EIO;
 		quota_error(sb, "Error while reading quota structure for id %u",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(sb->s_user_ns, dquot->dq_id));
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		kfree(ddquot);
@@ -721,13 +721,18 @@ out_buf:
 
 int qtree_get_next_id(struct qtree_mem_dqinfo *info, struct kqid *qid)
 {
-	qid_t id = from_kqid(&init_user_ns, *qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, *qid);
+	struct kqid next_id;
 	int ret;
 
-	ret = find_next_id(info, &id, QT_TREEOFF, 0);
-	if (ret < 0)
-		return ret;
-	*qid = make_kqid(&init_user_ns, qid->type, id);
+	do {
+		ret = find_next_id(info, &id, QT_TREEOFF, 0);
+		if (ret < 0)
+			return ret;
+		next_id = make_kqid(info->dqi_sb->s_user_ns, qid->type, id);
+	} while (!qid_valid(next_id)); /* Loop until a mapped id is found */
+
+	*qid = next_id;
 	return 0;
 }
 EXPORT_SYMBOL(qtree_get_next_id);
diff --git a/fs/quota/quota_v1.c b/fs/quota/quota_v1.c
index 8fe79beced5c..087e7743b996 100644
--- a/fs/quota/quota_v1.c
+++ b/fs/quota/quota_v1.c
@@ -54,6 +54,7 @@ static void v1_mem2disk_dqblk(struct v1_disk_dqblk *d, struct mem_dqblk *m)
 
 static int v1_read_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	int type = dquot->dq_id.type;
 	struct v1_disk_dqblk dqblk;
 
@@ -64,7 +65,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 	memset(&dqblk, 0, sizeof(struct v1_disk_dqblk));
 	dquot->dq_sb->s_op->quota_read(dquot->dq_sb, type, (char *)&dqblk,
 			sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id));
 
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 &&
@@ -79,6 +80,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 
 static int v1_commit_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	short type = dquot->dq_id.type;
 	ssize_t ret;
 	struct v1_disk_dqblk dqblk;
@@ -95,7 +97,7 @@ static int v1_commit_dqblk(struct dquot *dquot)
 	if (sb_dqopt(dquot->dq_sb)->files[type])
 		ret = dquot->dq_sb->s_op->quota_write(dquot->dq_sb, type,
 			(char *)&dqblk, sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kquid(dq_user_ns, dquot->dq_id)));
 	if (ret != sizeof(struct v1_disk_dqblk)) {
 		quota_error(dquot->dq_sb, "dquota write failed");
 		if (ret >= 0)
diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index ca71bf881ad1..4ecf91611a87 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -199,6 +199,7 @@ static void v2r0_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r0_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -212,7 +213,7 @@ static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le32(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_qid));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -225,7 +226,7 @@ static int v2r0_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, quot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
@@ -252,6 +253,7 @@ static void v2r1_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r1_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -265,7 +267,7 @@ static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le64(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -278,7 +280,7 @@ static int v2r1_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 10/11] evm: Translate user/group ids relative to s_user_ns when computing HMAC
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                       ` (7 preceding siblings ...)
  2016-07-02 17:20     ` [PATCH review 09/11] quota: Handle quota data stored in s_user_ns Eric W. Biederman
@ 2016-07-02 17:20     ` Eric W. Biederman
  2016-07-02 17:20       ` Eric W. Biederman
  9 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The EVM HMAC should be calculated using the on disk user and
group ids, so the k[ug]ids in the inode must be translated
relative to the s_user_ns of the inode's super block.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 security/integrity/evm/evm_crypto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index 30b6b7d0429f..11c1d30bd705 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -151,8 +151,8 @@ static void hmac_add_misc(struct shash_desc *desc, struct inode *inode,
 	memset(&hmac_misc, 0, sizeof(hmac_misc));
 	hmac_misc.ino = inode->i_ino;
 	hmac_misc.generation = inode->i_generation;
-	hmac_misc.uid = from_kuid(&init_user_ns, inode->i_uid);
-	hmac_misc.gid = from_kgid(&init_user_ns, inode->i_gid);
+	hmac_misc.uid = from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+	hmac_misc.gid = from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
 	hmac_misc.mode = inode->i_mode;
 	crypto_shash_update(desc, (const u8 *)&hmac_misc, sizeof(hmac_misc));
 	if (evm_hmac_attrs & EVM_ATTR_FSUUID)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 10/11] evm: Translate user/group ids relative to s_user_ns when computing HMAC
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
                     ` (6 preceding siblings ...)
  2016-07-02 17:20     ` Eric W. Biederman
@ 2016-07-02 17:20   ` Eric W. Biederman
       [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  8 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

The EVM HMAC should be calculated using the on disk user and
group ids, so the k[ug]ids in the inode must be translated
relative to the s_user_ns of the inode's super block.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 security/integrity/evm/evm_crypto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index 30b6b7d0429f..11c1d30bd705 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -151,8 +151,8 @@ static void hmac_add_misc(struct shash_desc *desc, struct inode *inode,
 	memset(&hmac_misc, 0, sizeof(hmac_misc));
 	hmac_misc.ino = inode->i_ino;
 	hmac_misc.generation = inode->i_generation;
-	hmac_misc.uid = from_kuid(&init_user_ns, inode->i_uid);
-	hmac_misc.gid = from_kgid(&init_user_ns, inode->i_gid);
+	hmac_misc.uid = from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+	hmac_misc.gid = from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
 	hmac_misc.mode = inode->i_mode;
 	crypto_shash_update(desc, (const u8 *)&hmac_misc, sizeof(hmac_misc));
 	if (evm_hmac_attrs & EVM_ATTR_FSUUID)
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 11/11] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
@ 2016-07-02 17:20       ` Eric W. Biederman
  2016-07-02 17:20     ` Eric W. Biederman
                         ` (7 subsequent siblings)
  8 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

For filesystems mounted from a user namespace on-disk ids should
be translated relative to s_users_ns rather than init_user_ns.

When an id in the filesystem doesn't exist in s_user_ns the
associated id in the inode will be set to INVALID_[UG]ID, which
turns these into de facto "nobody" ids. This actually maps pretty
well into the way most code already works, and those places where
it didn't were fixed in previous patches. Moving forward vfs code
needs to be careful to handle instances where ids in inodes may
be invalid.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/fs.h | 50 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb25ceb6d1ef..8aa9b72e0bc5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -831,31 +831,6 @@ static inline void i_size_write(struct inode *inode, loff_t i_size)
 #endif
 }
 
-/* Helper functions so that in most cases filesystems will
- * not need to deal directly with kuid_t and kgid_t and can
- * instead deal with the raw numeric values that are stored
- * in the filesystem.
- */
-static inline uid_t i_uid_read(const struct inode *inode)
-{
-	return from_kuid(&init_user_ns, inode->i_uid);
-}
-
-static inline gid_t i_gid_read(const struct inode *inode)
-{
-	return from_kgid(&init_user_ns, inode->i_gid);
-}
-
-static inline void i_uid_write(struct inode *inode, uid_t uid)
-{
-	inode->i_uid = make_kuid(&init_user_ns, uid);
-}
-
-static inline void i_gid_write(struct inode *inode, gid_t gid)
-{
-	inode->i_gid = make_kgid(&init_user_ns, gid);
-}
-
 static inline unsigned iminor(const struct inode *inode)
 {
 	return MINOR(inode->i_rdev);
@@ -1461,6 +1436,31 @@ struct super_block {
 	struct list_head	s_inodes;	/* all inodes */
 };
 
+/* Helper functions so that in most cases filesystems will
+ * not need to deal directly with kuid_t and kgid_t and can
+ * instead deal with the raw numeric values that are stored
+ * in the filesystem.
+ */
+static inline uid_t i_uid_read(const struct inode *inode)
+{
+	return from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+}
+
+static inline gid_t i_gid_read(const struct inode *inode)
+{
+	return from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
+}
+
+static inline void i_uid_write(struct inode *inode, uid_t uid)
+{
+	inode->i_uid = make_kuid(inode->i_sb->s_user_ns, uid);
+}
+
+static inline void i_gid_write(struct inode *inode, gid_t gid)
+{
+	inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
+}
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 11/11] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
@ 2016-07-02 17:20       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:20 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Linux Containers, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

For filesystems mounted from a user namespace on-disk ids should
be translated relative to s_users_ns rather than init_user_ns.

When an id in the filesystem doesn't exist in s_user_ns the
associated id in the inode will be set to INVALID_[UG]ID, which
turns these into de facto "nobody" ids. This actually maps pretty
well into the way most code already works, and those places where
it didn't were fixed in previous patches. Moving forward vfs code
needs to be careful to handle instances where ids in inodes may
be invalid.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/fs.h | 50 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb25ceb6d1ef..8aa9b72e0bc5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -831,31 +831,6 @@ static inline void i_size_write(struct inode *inode, loff_t i_size)
 #endif
 }
 
-/* Helper functions so that in most cases filesystems will
- * not need to deal directly with kuid_t and kgid_t and can
- * instead deal with the raw numeric values that are stored
- * in the filesystem.
- */
-static inline uid_t i_uid_read(const struct inode *inode)
-{
-	return from_kuid(&init_user_ns, inode->i_uid);
-}
-
-static inline gid_t i_gid_read(const struct inode *inode)
-{
-	return from_kgid(&init_user_ns, inode->i_gid);
-}
-
-static inline void i_uid_write(struct inode *inode, uid_t uid)
-{
-	inode->i_uid = make_kuid(&init_user_ns, uid);
-}
-
-static inline void i_gid_write(struct inode *inode, gid_t gid)
-{
-	inode->i_gid = make_kgid(&init_user_ns, gid);
-}
-
 static inline unsigned iminor(const struct inode *inode)
 {
 	return MINOR(inode->i_rdev);
@@ -1461,6 +1436,31 @@ struct super_block {
 	struct list_head	s_inodes;	/* all inodes */
 };
 
+/* Helper functions so that in most cases filesystems will
+ * not need to deal directly with kuid_t and kgid_t and can
+ * instead deal with the raw numeric values that are stored
+ * in the filesystem.
+ */
+static inline uid_t i_uid_read(const struct inode *inode)
+{
+	return from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+}
+
+static inline gid_t i_gid_read(const struct inode *inode)
+{
+	return from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
+}
+
+static inline void i_uid_write(struct inode *inode, uid_t uid)
+{
+	inode->i_uid = make_kuid(inode->i_sb->s_user_ns, uid);
+}
+
+static inline void i_gid_write(struct inode *inode, gid_t gid)
+{
+	inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
+}
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]     ` <20160702172035.19568-9-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-02 17:33       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:33 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni


In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

For the dquot hashfn translate the qid into sb->s_user_ns to remove
extraneous bits before hashing.

When dealing with generic quota files in quota_tree.c, quota_v1.c, and
quota_v2.c before writing quotas to the file encode them in
sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

One complication comes up in qtree_get_next_id, as it is possible in
principle for id's to be present in the quota file that do not have a
mapping in the filesystems user namespace.  If that is the case the
code is modified to skip the unmapped ids.

Inspired-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

Grr. My kconfig was wrong and typo's crept into my previous version
of this patch.  Thank you kbuild test robot from catching this.

 fs/quota/dquot.c      |  4 ++--
 fs/quota/quota.c      |  2 +-
 fs/quota/quota_tree.c | 23 ++++++++++++++---------
 fs/quota/quota_v1.c   |  6 ++++--
 fs/quota/quota_v2.c   | 10 ++++++----
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..9b39925f34f4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -252,7 +252,7 @@ static int __dquot_initialize(struct inode *inode, int type);
 static inline unsigned int
 hashfn(const struct super_block *sb, struct kqid qid)
 {
-	unsigned int id = from_kqid(&init_user_ns, qid);
+	unsigned int id = from_kqid(sb->s_user_ns, qid);
 	int type = qid.type;
 	unsigned long tmp;
 
@@ -748,7 +748,7 @@ void dqput(struct dquot *dquot)
 	if (!atomic_read(&dquot->dq_count)) {
 		quota_error(dquot->dq_sb, "trying to free free dquot of %s %d",
 			    quotatypes[dquot->dq_id.type],
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		BUG();
 	}
 #endif
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c
index 0738972e8d3f..4f8f6bdc698c 100644
--- a/fs/quota/quota_tree.c
+++ b/fs/quota/quota_tree.c
@@ -34,7 +34,7 @@ static int __get_index(struct qtree_mem_dqinfo *info, qid_t id, int depth)
 
 static int get_index(struct qtree_mem_dqinfo *info, struct kqid qid, int depth)
 {
-	qid_t id = from_kqid(&init_user_ns, qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, qid);
 
 	return __get_index(info, id, depth);
 }
@@ -554,7 +554,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info,
 	if (i == qtree_dqstr_in_blk(info)) {
 		quota_error(dquot->dq_sb,
 			    "Quota for id %u referenced but not present",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		ret = -EIO;
 		goto out_buf;
 	} else {
@@ -624,7 +624,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 			if (offset < 0)
 				quota_error(sb,"Can't read quota structure "
 					    "for id %u",
-					    from_kqid(&init_user_ns,
+					    from_kqid(sb->s_user_ns,
 						      dquot->dq_id));
 			dquot->dq_off = 0;
 			set_bit(DQ_FAKE_B, &dquot->dq_flags);
@@ -643,7 +643,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 		if (ret >= 0)
 			ret = -EIO;
 		quota_error(sb, "Error while reading quota structure for id %u",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(sb->s_user_ns, dquot->dq_id));
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		kfree(ddquot);
@@ -721,13 +721,18 @@ out_buf:
 
 int qtree_get_next_id(struct qtree_mem_dqinfo *info, struct kqid *qid)
 {
-	qid_t id = from_kqid(&init_user_ns, *qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, *qid);
+	struct kqid next_id;
 	int ret;
 
-	ret = find_next_id(info, &id, QT_TREEOFF, 0);
-	if (ret < 0)
-		return ret;
-	*qid = make_kqid(&init_user_ns, qid->type, id);
+	do {
+		ret = find_next_id(info, &id, QT_TREEOFF, 0);
+		if (ret < 0)
+			return ret;
+		next_id = make_kqid(info->dqi_sb->s_user_ns, qid->type, id);
+	} while (!qid_valid(next_id)); /* Loop until a mapped id is found */
+
+	*qid = next_id;
 	return 0;
 }
 EXPORT_SYMBOL(qtree_get_next_id);
diff --git a/fs/quota/quota_v1.c b/fs/quota/quota_v1.c
index 8fe79beced5c..a354f8bed96a 100644
--- a/fs/quota/quota_v1.c
+++ b/fs/quota/quota_v1.c
@@ -54,6 +54,7 @@ static void v1_mem2disk_dqblk(struct v1_disk_dqblk *d, struct mem_dqblk *m)
 
 static int v1_read_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	int type = dquot->dq_id.type;
 	struct v1_disk_dqblk dqblk;
 
@@ -64,7 +65,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 	memset(&dqblk, 0, sizeof(struct v1_disk_dqblk));
 	dquot->dq_sb->s_op->quota_read(dquot->dq_sb, type, (char *)&dqblk,
 			sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id)));
 
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 &&
@@ -79,6 +80,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 
 static int v1_commit_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	short type = dquot->dq_id.type;
 	ssize_t ret;
 	struct v1_disk_dqblk dqblk;
@@ -95,7 +97,7 @@ static int v1_commit_dqblk(struct dquot *dquot)
 	if (sb_dqopt(dquot->dq_sb)->files[type])
 		ret = dquot->dq_sb->s_op->quota_write(dquot->dq_sb, type,
 			(char *)&dqblk, sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id)));
 	if (ret != sizeof(struct v1_disk_dqblk)) {
 		quota_error(dquot->dq_sb, "dquota write failed");
 		if (ret >= 0)
diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index ca71bf881ad1..54e27b296656 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -199,6 +199,7 @@ static void v2r0_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r0_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -212,7 +213,7 @@ static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le32(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -225,7 +226,7 @@ static int v2r0_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
@@ -252,6 +253,7 @@ static void v2r1_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r1_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -265,7 +267,7 @@ static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le64(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -278,7 +280,7 @@ static int v2r1_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]     ` <20160702172035.19568-9-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-02 17:33       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:33 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni


In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

For the dquot hashfn translate the qid into sb->s_user_ns to remove
extraneous bits before hashing.

When dealing with generic quota files in quota_tree.c, quota_v1.c, and
quota_v2.c before writing quotas to the file encode them in
sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

One complication comes up in qtree_get_next_id, as it is possible in
principle for id's to be present in the quota file that do not have a
mapping in the filesystems user namespace.  If that is the case the
code is modified to skip the unmapped ids.

Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Grr. My kconfig was wrong and typo's crept into my previous version
of this patch.  Thank you kbuild test robot from catching this.

 fs/quota/dquot.c      |  4 ++--
 fs/quota/quota.c      |  2 +-
 fs/quota/quota_tree.c | 23 ++++++++++++++---------
 fs/quota/quota_v1.c   |  6 ++++--
 fs/quota/quota_v2.c   | 10 ++++++----
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..9b39925f34f4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -252,7 +252,7 @@ static int __dquot_initialize(struct inode *inode, int type);
 static inline unsigned int
 hashfn(const struct super_block *sb, struct kqid qid)
 {
-	unsigned int id = from_kqid(&init_user_ns, qid);
+	unsigned int id = from_kqid(sb->s_user_ns, qid);
 	int type = qid.type;
 	unsigned long tmp;
 
@@ -748,7 +748,7 @@ void dqput(struct dquot *dquot)
 	if (!atomic_read(&dquot->dq_count)) {
 		quota_error(dquot->dq_sb, "trying to free free dquot of %s %d",
 			    quotatypes[dquot->dq_id.type],
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		BUG();
 	}
 #endif
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c
index 0738972e8d3f..4f8f6bdc698c 100644
--- a/fs/quota/quota_tree.c
+++ b/fs/quota/quota_tree.c
@@ -34,7 +34,7 @@ static int __get_index(struct qtree_mem_dqinfo *info, qid_t id, int depth)
 
 static int get_index(struct qtree_mem_dqinfo *info, struct kqid qid, int depth)
 {
-	qid_t id = from_kqid(&init_user_ns, qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, qid);
 
 	return __get_index(info, id, depth);
 }
@@ -554,7 +554,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info,
 	if (i == qtree_dqstr_in_blk(info)) {
 		quota_error(dquot->dq_sb,
 			    "Quota for id %u referenced but not present",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		ret = -EIO;
 		goto out_buf;
 	} else {
@@ -624,7 +624,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 			if (offset < 0)
 				quota_error(sb,"Can't read quota structure "
 					    "for id %u",
-					    from_kqid(&init_user_ns,
+					    from_kqid(sb->s_user_ns,
 						      dquot->dq_id));
 			dquot->dq_off = 0;
 			set_bit(DQ_FAKE_B, &dquot->dq_flags);
@@ -643,7 +643,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 		if (ret >= 0)
 			ret = -EIO;
 		quota_error(sb, "Error while reading quota structure for id %u",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(sb->s_user_ns, dquot->dq_id));
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		kfree(ddquot);
@@ -721,13 +721,18 @@ out_buf:
 
 int qtree_get_next_id(struct qtree_mem_dqinfo *info, struct kqid *qid)
 {
-	qid_t id = from_kqid(&init_user_ns, *qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, *qid);
+	struct kqid next_id;
 	int ret;
 
-	ret = find_next_id(info, &id, QT_TREEOFF, 0);
-	if (ret < 0)
-		return ret;
-	*qid = make_kqid(&init_user_ns, qid->type, id);
+	do {
+		ret = find_next_id(info, &id, QT_TREEOFF, 0);
+		if (ret < 0)
+			return ret;
+		next_id = make_kqid(info->dqi_sb->s_user_ns, qid->type, id);
+	} while (!qid_valid(next_id)); /* Loop until a mapped id is found */
+
+	*qid = next_id;
 	return 0;
 }
 EXPORT_SYMBOL(qtree_get_next_id);
diff --git a/fs/quota/quota_v1.c b/fs/quota/quota_v1.c
index 8fe79beced5c..a354f8bed96a 100644
--- a/fs/quota/quota_v1.c
+++ b/fs/quota/quota_v1.c
@@ -54,6 +54,7 @@ static void v1_mem2disk_dqblk(struct v1_disk_dqblk *d, struct mem_dqblk *m)
 
 static int v1_read_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	int type = dquot->dq_id.type;
 	struct v1_disk_dqblk dqblk;
 
@@ -64,7 +65,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 	memset(&dqblk, 0, sizeof(struct v1_disk_dqblk));
 	dquot->dq_sb->s_op->quota_read(dquot->dq_sb, type, (char *)&dqblk,
 			sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id)));
 
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 &&
@@ -79,6 +80,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 
 static int v1_commit_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	short type = dquot->dq_id.type;
 	ssize_t ret;
 	struct v1_disk_dqblk dqblk;
@@ -95,7 +97,7 @@ static int v1_commit_dqblk(struct dquot *dquot)
 	if (sb_dqopt(dquot->dq_sb)->files[type])
 		ret = dquot->dq_sb->s_op->quota_write(dquot->dq_sb, type,
 			(char *)&dqblk, sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id)));
 	if (ret != sizeof(struct v1_disk_dqblk)) {
 		quota_error(dquot->dq_sb, "dquota write failed");
 		if (ret >= 0)
diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index ca71bf881ad1..54e27b296656 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -199,6 +199,7 @@ static void v2r0_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r0_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -212,7 +213,7 @@ static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le32(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -225,7 +226,7 @@ static int v2r0_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
@@ -252,6 +253,7 @@ static void v2r1_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r1_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -265,7 +267,7 @@ static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le64(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -278,7 +280,7 @@ static int v2r1_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-02 17:33       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-02 17:33 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni


In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

For the dquot hashfn translate the qid into sb->s_user_ns to remove
extraneous bits before hashing.

When dealing with generic quota files in quota_tree.c, quota_v1.c, and
quota_v2.c before writing quotas to the file encode them in
sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

One complication comes up in qtree_get_next_id, as it is possible in
principle for id's to be present in the quota file that do not have a
mapping in the filesystems user namespace.  If that is the case the
code is modified to skip the unmapped ids.

Inspired-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

Grr. My kconfig was wrong and typo's crept into my previous version
of this patch.  Thank you kbuild test robot from catching this.

 fs/quota/dquot.c      |  4 ++--
 fs/quota/quota.c      |  2 +-
 fs/quota/quota_tree.c | 23 ++++++++++++++---------
 fs/quota/quota_v1.c   |  6 ++++--
 fs/quota/quota_v2.c   | 10 ++++++----
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..9b39925f34f4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -252,7 +252,7 @@ static int __dquot_initialize(struct inode *inode, int type);
 static inline unsigned int
 hashfn(const struct super_block *sb, struct kqid qid)
 {
-	unsigned int id = from_kqid(&init_user_ns, qid);
+	unsigned int id = from_kqid(sb->s_user_ns, qid);
 	int type = qid.type;
 	unsigned long tmp;
 
@@ -748,7 +748,7 @@ void dqput(struct dquot *dquot)
 	if (!atomic_read(&dquot->dq_count)) {
 		quota_error(dquot->dq_sb, "trying to free free dquot of %s %d",
 			    quotatypes[dquot->dq_id.type],
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		BUG();
 	}
 #endif
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c
index 0738972e8d3f..4f8f6bdc698c 100644
--- a/fs/quota/quota_tree.c
+++ b/fs/quota/quota_tree.c
@@ -34,7 +34,7 @@ static int __get_index(struct qtree_mem_dqinfo *info, qid_t id, int depth)
 
 static int get_index(struct qtree_mem_dqinfo *info, struct kqid qid, int depth)
 {
-	qid_t id = from_kqid(&init_user_ns, qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, qid);
 
 	return __get_index(info, id, depth);
 }
@@ -554,7 +554,7 @@ static loff_t find_block_dqentry(struct qtree_mem_dqinfo *info,
 	if (i == qtree_dqstr_in_blk(info)) {
 		quota_error(dquot->dq_sb,
 			    "Quota for id %u referenced but not present",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id));
 		ret = -EIO;
 		goto out_buf;
 	} else {
@@ -624,7 +624,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 			if (offset < 0)
 				quota_error(sb,"Can't read quota structure "
 					    "for id %u",
-					    from_kqid(&init_user_ns,
+					    from_kqid(sb->s_user_ns,
 						      dquot->dq_id));
 			dquot->dq_off = 0;
 			set_bit(DQ_FAKE_B, &dquot->dq_flags);
@@ -643,7 +643,7 @@ int qtree_read_dquot(struct qtree_mem_dqinfo *info, struct dquot *dquot)
 		if (ret >= 0)
 			ret = -EIO;
 		quota_error(sb, "Error while reading quota structure for id %u",
-			    from_kqid(&init_user_ns, dquot->dq_id));
+			    from_kqid(sb->s_user_ns, dquot->dq_id));
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		kfree(ddquot);
@@ -721,13 +721,18 @@ out_buf:
 
 int qtree_get_next_id(struct qtree_mem_dqinfo *info, struct kqid *qid)
 {
-	qid_t id = from_kqid(&init_user_ns, *qid);
+	qid_t id = from_kqid(info->dqi_sb->s_user_ns, *qid);
+	struct kqid next_id;
 	int ret;
 
-	ret = find_next_id(info, &id, QT_TREEOFF, 0);
-	if (ret < 0)
-		return ret;
-	*qid = make_kqid(&init_user_ns, qid->type, id);
+	do {
+		ret = find_next_id(info, &id, QT_TREEOFF, 0);
+		if (ret < 0)
+			return ret;
+		next_id = make_kqid(info->dqi_sb->s_user_ns, qid->type, id);
+	} while (!qid_valid(next_id)); /* Loop until a mapped id is found */
+
+	*qid = next_id;
 	return 0;
 }
 EXPORT_SYMBOL(qtree_get_next_id);
diff --git a/fs/quota/quota_v1.c b/fs/quota/quota_v1.c
index 8fe79beced5c..a354f8bed96a 100644
--- a/fs/quota/quota_v1.c
+++ b/fs/quota/quota_v1.c
@@ -54,6 +54,7 @@ static void v1_mem2disk_dqblk(struct v1_disk_dqblk *d, struct mem_dqblk *m)
 
 static int v1_read_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	int type = dquot->dq_id.type;
 	struct v1_disk_dqblk dqblk;
 
@@ -64,7 +65,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 	memset(&dqblk, 0, sizeof(struct v1_disk_dqblk));
 	dquot->dq_sb->s_op->quota_read(dquot->dq_sb, type, (char *)&dqblk,
 			sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id)));
 
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 &&
@@ -79,6 +80,7 @@ static int v1_read_dqblk(struct dquot *dquot)
 
 static int v1_commit_dqblk(struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	short type = dquot->dq_id.type;
 	ssize_t ret;
 	struct v1_disk_dqblk dqblk;
@@ -95,7 +97,7 @@ static int v1_commit_dqblk(struct dquot *dquot)
 	if (sb_dqopt(dquot->dq_sb)->files[type])
 		ret = dquot->dq_sb->s_op->quota_write(dquot->dq_sb, type,
 			(char *)&dqblk, sizeof(struct v1_disk_dqblk),
-			v1_dqoff(from_kqid(&init_user_ns, dquot->dq_id)));
+			v1_dqoff(from_kqid(dq_user_ns, dquot->dq_id)));
 	if (ret != sizeof(struct v1_disk_dqblk)) {
 		quota_error(dquot->dq_sb, "dquota write failed");
 		if (ret >= 0)
diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index ca71bf881ad1..54e27b296656 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -199,6 +199,7 @@ static void v2r0_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r0_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -212,7 +213,7 @@ static void v2r0_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le32(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -225,7 +226,7 @@ static int v2r0_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
@@ -252,6 +253,7 @@ static void v2r1_disk2memdqb(struct dquot *dquot, void *dp)
 
 static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 {
+	struct user_namespace *dq_user_ns = dquot->dq_sb->s_user_ns;
 	struct v2r1_disk_dqblk *d = dp;
 	struct mem_dqblk *m = &dquot->dq_dqb;
 	struct qtree_mem_dqinfo *info =
@@ -265,7 +267,7 @@ static void v2r1_mem2diskdqb(void *dp, struct dquot *dquot)
 	d->dqb_bsoftlimit = cpu_to_le64(v2_stoqb(m->dqb_bsoftlimit));
 	d->dqb_curspace = cpu_to_le64(m->dqb_curspace);
 	d->dqb_btime = cpu_to_le64(m->dqb_btime);
-	d->dqb_id = cpu_to_le32(from_kqid(&init_user_ns, dquot->dq_id));
+	d->dqb_id = cpu_to_le32(from_kqid(dq_user_ns, dquot->dq_id));
 	if (qtree_entry_unused(info, dp))
 		d->dqb_itime = cpu_to_le64(1);
 }
@@ -278,7 +280,7 @@ static int v2r1_is_id(void *dp, struct dquot *dquot)
 
 	if (qtree_entry_unused(info, dp))
 		return 0;
-	return qid_eq(make_kqid(&init_user_ns, dquot->dq_id.type,
+	return qid_eq(make_kqid(dquot->dq_sb->s_user_ns, dquot->dq_id.type,
 				le32_to_cpu(d->dqb_id)),
 		      dquot->dq_id);
 }
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]     ` <20160702172035.19568-7-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-04  7:59       ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  7:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
> It is expected that filesystems can not represent uids and gids from
> outside of their user namespace.  Keep things simple by not even
> trying to create filesystem nodes with non-sense uids and gids.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
directly create files in this filesystem. EOVERFLOW error will at least
hint us where the problem is but still I'm suspecting this is going to
create hard to debug configuration issues... I'm not sure if we can do
anything about this but I wanted to point it out.

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]     ` <20160702172035.19568-7-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-04  7:59       ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  7:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
> It is expected that filesystems can not represent uids and gids from
> outside of their user namespace.  Keep things simple by not even
> trying to create filesystem nodes with non-sense uids and gids.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
directly create files in this filesystem. EOVERFLOW error will at least
hint us where the problem is but still I'm suspecting this is going to
create hard to debug configuration issues... I'm not sure if we can do
anything about this but I wanted to point it out.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
@ 2016-07-04  7:59       ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  7:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
> It is expected that filesystems can not represent uids and gids from
> outside of their user namespace.  Keep things simple by not even
> trying to create filesystem nodes with non-sense uids and gids.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
directly create files in this filesystem. EOVERFLOW error will at least
hint us where the problem is but still I'm suspecting this is going to
create hard to debug configuration issues... I'm not sure if we can do
anything about this but I wanted to point it out.

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2016-07-02 17:20   ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
@ 2016-07-04  8:52   ` Jan Kara
  2016-07-06 13:44   ` Andy Lutomirski
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  8:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
> 
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> 
> It has been a long time in coming but recently in the userns tree the
> superblock has been expanded with a s_user_ns field indicating the user
> namespace that owns a superblock.
> 
> The s_user_ns owner of a superblock has three implications.
> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
>  filesystem from the vfs.
> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
>   is set to INVALID_UID and i_gid is set to INVALID_GID.
> - The scope of permission checks can be changed from global to a
>   capabilitiy check in s_user_ns.

OK, to check that I understand it right:

So the uids and gids that are stored on disk are still expected to be in
the initial id namespace, aren't they?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-02 17:18 ` Eric W. Biederman
  (?)
  (?)
@ 2016-07-04  8:52 ` Jan Kara
  2016-07-04 16:27     ` Eric W. Biederman
       [not found]   ` <20160704085220.GC5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  -1 siblings, 2 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  8:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
> 
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> 
> It has been a long time in coming but recently in the userns tree the
> superblock has been expanded with a s_user_ns field indicating the user
> namespace that owns a superblock.
> 
> The s_user_ns owner of a superblock has three implications.
> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
>  filesystem from the vfs.
> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
>   is set to INVALID_UID and i_gid is set to INVALID_GID.
> - The scope of permission checks can be changed from global to a
>   capabilitiy check in s_user_ns.

OK, to check that I understand it right:

So the uids and gids that are stored on disk are still expected to be in
the initial id namespace, aren't they?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]       ` <87mvm03pxy.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-04  9:11         ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  9:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
> In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
> the filesystems notion of id 0.

Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
have to be aware of the namespace the filesystem is mounted in to be able
to perform the desired operation (and if it gets is wrong, there's
possibility it would just silently set the timers for some user instead of
for all users).

> For the dquot hashfn translate the qid into sb->s_user_ns to remove
> extraneous bits before hashing.
> 
> When dealing with generic quota files in quota_tree.c, quota_v1.c, and
> quota_v2.c before writing quotas to the file encode them in
> sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

And here the sb->s_user_ns becomes part of the on-disk format because the
numerical ID value is used to compute the position in on-disk data
structures. This seems to be in conflict with the idea that anything that
is stored on disk is stored in the initial user namespace.

> One complication comes up in qtree_get_next_id, as it is possible in
> principle for id's to be present in the quota file that do not have a
> mapping in the filesystems user namespace.  If that is the case the
> code is modified to skip the unmapped ids.

So I'm not 100% sure how this is going to work. qtree_get_next_id() is used
as an interface to report quota information for the whole filesystem. So
skipping ids without mappings makes sense. However userspace uses the
interface as:

id = 0;
while (get_next_quota(id, &result)) {
	do stuff;
	id = result.dq_id + 1;
}

If sb->s_user_ns is the current namespace of the process, I suppose this is
going to work as expected. If the namespace of the process is different
(including the init_user_ns), I expect this breaks, doesn't it?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]       ` <87mvm03pxy.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-04  9:11         ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  9:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni

On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
> In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
> the filesystems notion of id 0.

Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
have to be aware of the namespace the filesystem is mounted in to be able
to perform the desired operation (and if it gets is wrong, there's
possibility it would just silently set the timers for some user instead of
for all users).

> For the dquot hashfn translate the qid into sb->s_user_ns to remove
> extraneous bits before hashing.
> 
> When dealing with generic quota files in quota_tree.c, quota_v1.c, and
> quota_v2.c before writing quotas to the file encode them in
> sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

And here the sb->s_user_ns becomes part of the on-disk format because the
numerical ID value is used to compute the position in on-disk data
structures. This seems to be in conflict with the idea that anything that
is stored on disk is stored in the initial user namespace.

> One complication comes up in qtree_get_next_id, as it is possible in
> principle for id's to be present in the quota file that do not have a
> mapping in the filesystems user namespace.  If that is the case the
> code is modified to skip the unmapped ids.

So I'm not 100% sure how this is going to work. qtree_get_next_id() is used
as an interface to report quota information for the whole filesystem. So
skipping ids without mappings makes sense. However userspace uses the
interface as:

id = 0;
while (get_next_quota(id, &result)) {
	do stuff;
	id = result.dq_id + 1;
}

If sb->s_user_ns is the current namespace of the process, I suppose this is
going to work as expected. If the namespace of the process is different
(including the init_user_ns), I expect this breaks, doesn't it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-04  9:11         ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-04  9:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
> In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
> the filesystems notion of id 0.

Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
have to be aware of the namespace the filesystem is mounted in to be able
to perform the desired operation (and if it gets is wrong, there's
possibility it would just silently set the timers for some user instead of
for all users).

> For the dquot hashfn translate the qid into sb->s_user_ns to remove
> extraneous bits before hashing.
> 
> When dealing with generic quota files in quota_tree.c, quota_v1.c, and
> quota_v2.c before writing quotas to the file encode them in
> sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.

And here the sb->s_user_ns becomes part of the on-disk format because the
numerical ID value is used to compute the position in on-disk data
structures. This seems to be in conflict with the idea that anything that
is stored on disk is stored in the initial user namespace.

> One complication comes up in qtree_get_next_id, as it is possible in
> principle for id's to be present in the quota file that do not have a
> mapping in the filesystems user namespace.  If that is the case the
> code is modified to skip the unmapped ids.

So I'm not 100% sure how this is going to work. qtree_get_next_id() is used
as an interface to report quota information for the whole filesystem. So
skipping ids without mappings makes sense. However userspace uses the
interface as:

id = 0;
while (get_next_quota(id, &result)) {
	do stuff;
	id = result.dq_id + 1;
}

If sb->s_user_ns is the current namespace of the process, I suppose this is
going to work as expected. If the namespace of the process is different
(including the init_user_ns), I expect this breaks, doesn't it?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found]   ` <20160704085220.GC5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-04 16:27     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-04 16:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
>> 
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>> 
>> It has been a long time in coming but recently in the userns tree the
>> superblock has been expanded with a s_user_ns field indicating the user
>> namespace that owns a superblock.
>> 
>> The s_user_ns owner of a superblock has three implications.
>> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
>>  filesystem from the vfs.
>> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
>>   is set to INVALID_UID and i_gid is set to INVALID_GID.
>> - The scope of permission checks can be changed from global to a
>>   capabilitiy check in s_user_ns.
>
> OK, to check that I understand it right:
>
> So the uids and gids that are stored on disk are still expected to be in
> the initial id namespace, aren't they?

No.

The general expectation is that the ids on disk are store in s_user_ns.

Id's that don't map to the initial id namespace get stored in the
generic data structures as INVALID_UID and INVALID_GID.

In practice I don't expect anyone will set up a situation knowingly
where id's don't map, but the case has to be handled because mistakes
and malicious code happens.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found]   ` <20160704085220.GC5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-04 16:27     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-04 16:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

Jan Kara <jack@suse.cz> writes:

> On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
>> 
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>> 
>> It has been a long time in coming but recently in the userns tree the
>> superblock has been expanded with a s_user_ns field indicating the user
>> namespace that owns a superblock.
>> 
>> The s_user_ns owner of a superblock has three implications.
>> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
>>  filesystem from the vfs.
>> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
>>   is set to INVALID_UID and i_gid is set to INVALID_GID.
>> - The scope of permission checks can be changed from global to a
>>   capabilitiy check in s_user_ns.
>
> OK, to check that I understand it right:
>
> So the uids and gids that are stored on disk are still expected to be in
> the initial id namespace, aren't they?

No.

The general expectation is that the ids on disk are store in s_user_ns.

Id's that don't map to the initial id namespace get stored in the
generic data structures as INVALID_UID and INVALID_GID.

In practice I don't expect anyone will set up a situation knowingly
where id's don't map, but the case has to be handled because mistakes
and malicious code happens.

Eric


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-04 16:27     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-04 16:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Linux Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jann Horn,
	Michael Kerrisk

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
>> 
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>> 
>> It has been a long time in coming but recently in the userns tree the
>> superblock has been expanded with a s_user_ns field indicating the user
>> namespace that owns a superblock.
>> 
>> The s_user_ns owner of a superblock has three implications.
>> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
>>  filesystem from the vfs.
>> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
>>   is set to INVALID_UID and i_gid is set to INVALID_GID.
>> - The scope of permission checks can be changed from global to a
>>   capabilitiy check in s_user_ns.
>
> OK, to check that I understand it right:
>
> So the uids and gids that are stored on disk are still expected to be in
> the initial id namespace, aren't they?

No.

The general expectation is that the ids on disk are store in s_user_ns.

Id's that don't map to the initial id namespace get stored in the
generic data structures as INVALID_UID and INVALID_GID.

In practice I don't expect anyone will set up a situation knowingly
where id's don't map, but the case has to be handled because mistakes
and malicious code happens.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]         ` <20160704091100.GD5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-05 14:48           ` Seth Forshee
  2016-07-05 15:34           ` Eric W. Biederman
  1 sibling, 0 replies; 134+ messages in thread
From: Seth Forshee @ 2016-07-05 14:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni,
	Michael Kerrisk

On Mon, Jul 04, 2016 at 11:11:00AM +0200, Jan Kara wrote:
> On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
> > In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
> > the filesystems notion of id 0.
> 
> Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
> have to be aware of the namespace the filesystem is mounted in to be able
> to perform the desired operation (and if it gets is wrong, there's
> possibility it would just silently set the timers for some user instead of
> for all users).

Generally userspace does not need to be aware of the namespace. The user
id passed from userspace is translated based on its namespace, and if
that kqid doesn't map into s_user_ns the Q_XSETQLIM operation fails.

But it requires going to some trouble and having CAP_SYS_ADMIN towards
the relevant namespaces to give processes not in s_user_ns visibility to
the mount, so that isn't going to be a common scenario. If some user
does set up such a scenario then it doesn't seem to be asking too much
for them to be aware of the limitations.

Thanks,
Seth

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-04  9:11         ` Jan Kara
  (?)
  (?)
@ 2016-07-05 14:48         ` Seth Forshee
  -1 siblings, 0 replies; 134+ messages in thread
From: Seth Forshee @ 2016-07-05 14:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric W. Biederman, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni

On Mon, Jul 04, 2016 at 11:11:00AM +0200, Jan Kara wrote:
> On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
> > In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
> > the filesystems notion of id 0.
> 
> Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
> have to be aware of the namespace the filesystem is mounted in to be able
> to perform the desired operation (and if it gets is wrong, there's
> possibility it would just silently set the timers for some user instead of
> for all users).

Generally userspace does not need to be aware of the namespace. The user
id passed from userspace is translated based on its namespace, and if
that kqid doesn't map into s_user_ns the Q_XSETQLIM operation fails.

But it requires going to some trouble and having CAP_SYS_ADMIN towards
the relevant namespaces to give processes not in s_user_ns visibility to
the mount, so that isn't going to be a common scenario. If some user
does set up such a scenario then it doesn't seem to be asking too much
for them to be aware of the limitations.

Thanks,
Seth


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
  2016-07-04  7:59       ` Jan Kara
@ 2016-07-05 14:55           ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 14:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
>> It is expected that filesystems can not represent uids and gids from
>> outside of their user namespace.  Keep things simple by not even
>> trying to create filesystem nodes with non-sense uids and gids.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>
> So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
> directly create files in this filesystem. EOVERFLOW error will at least
> hint us where the problem is but still I'm suspecting this is going to
> create hard to debug configuration issues... I'm not sure if we can do
> anything about this but I wanted to point it out.

A reasonable point.

A couple of details.

- If there is no uid or gid 0 inside of a user namespace there is
  no root user in that namespace so this is a moot point.

- If we are talking about uid 0 in the initial user namespace the
  scenario you are worried about can occur, but you have to work at it
  to get into that situation.

To create a superblock has s_user_ns != &init_user_ns something like
the following has to occur.

setuid(1000);
unshare(CLONE_NEWUSER);
unshare(CLONE_NEWNS);
mount(....);

At which point the oridinary root user can not see the filesystem
with s_user_ns != &init_user_ns as it is located in another mount
namespace.

Which means root has to use setns to get into that mount namespace,
so there are going to be larger hints than -EOVERFLOW.

At the same time if there are better error codes that make it clearer
what is happening I am all ears.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
@ 2016-07-05 14:55           ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 14:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

Jan Kara <jack@suse.cz> writes:

> On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
>> It is expected that filesystems can not represent uids and gids from
>> outside of their user namespace.  Keep things simple by not even
>> trying to create filesystem nodes with non-sense uids and gids.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
> directly create files in this filesystem. EOVERFLOW error will at least
> hint us where the problem is but still I'm suspecting this is going to
> create hard to debug configuration issues... I'm not sure if we can do
> anything about this but I wanted to point it out.

A reasonable point.

A couple of details.

- If there is no uid or gid 0 inside of a user namespace there is
  no root user in that namespace so this is a moot point.

- If we are talking about uid 0 in the initial user namespace the
  scenario you are worried about can occur, but you have to work at it
  to get into that situation.

To create a superblock has s_user_ns != &init_user_ns something like
the following has to occur.

setuid(1000);
unshare(CLONE_NEWUSER);
unshare(CLONE_NEWNS);
mount(....);

At which point the oridinary root user can not see the filesystem
with s_user_ns != &init_user_ns as it is located in another mount
namespace.

Which means root has to use setns to get into that mount namespace,
so there are going to be larger hints than -EOVERFLOW.

At the same time if there are better error codes that make it clearer
what is happening I am all ears.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]         ` <20160704091100.GD5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  2016-07-05 14:48           ` Seth Forshee
@ 2016-07-05 15:34           ` Eric W. Biederman
  1 sibling, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 15:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
>> In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
>> the filesystems notion of id 0.
>
> Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
> have to be aware of the namespace the filesystem is mounted in to be able
> to perform the desired operation (and if it gets is wrong, there's
> possibility it would just silently set the timers for some user instead of
> for all users).

In general yes.  Because we are talking about the uid of the root user
in the container that mounted the filesystem.  It is the uid that is
stored as 0 on the filesystem.  So in the small everything looks normal.

>> For the dquot hashfn translate the qid into sb->s_user_ns to remove
>> extraneous bits before hashing.
>> 
>> When dealing with generic quota files in quota_tree.c, quota_v1.c, and
>> quota_v2.c before writing quotas to the file encode them in
>> sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.
>
> And here the sb->s_user_ns becomes part of the on-disk format because the
> numerical ID value is used to compute the position in on-disk data
> structures. This seems to be in conflict with the idea that anything that
> is stored on disk is stored in the initial user namespace.

Up to now it has been the case that all filesystems with a backing store
that the kernel could mount stored their data in the initial user
namespace.  This set of changes is all about making it so that a
filesystem with uids and gids stored in s_user_ns can be supported
safely by the kernel.

>> One complication comes up in qtree_get_next_id, as it is possible in
>> principle for id's to be present in the quota file that do not have a
>> mapping in the filesystems user namespace.  If that is the case the
>> code is modified to skip the unmapped ids.
>
> So I'm not 100% sure how this is going to work. qtree_get_next_id() is used
> as an interface to report quota information for the whole filesystem. So
> skipping ids without mappings makes sense. However userspace uses the
> interface as:
>
> id = 0;
> while (get_next_quota(id, &result)) {
> 	do stuff;
> 	id = result.dq_id + 1;
> }
>
> If sb->s_user_ns is the current namespace of the process, I suppose this is
> going to work as expected. If the namespace of the process is different
> (including the init_user_ns), I expect this breaks, doesn't it?

Possibly.  And this is a good question.

Where this will break initially in that scenario is in:
static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
			  void __user *addr)
{
        ...

	if (!sb->s_qcop->get_nextdqblk)
		return -ENOSYS;
	qid = make_kqid(current_user_ns(), type, id);
	if (!qid_has_mapping(sb->s_user_ns, qid))
		return -EINVAL;
	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
	if (ret)

If userspace actually has to perform the increment.  The way
I read the code in 

Now it has occurred to me that I am probably getting ahead of things
with actually enabling all of this for the quota subsystem as I don't
think we have fileystems that use the vfs quota support on our short
term list.   I don't think these patches are wrong except possibily
in handling a quota interface with unmapped ids.  Which is sufficiently
weird I don't know if there is something better we can do.

Given that we are not likely to use quotas for a while I am happy
to add a test in vfs_load_quota_inode (the guts of quota_on, and
quota_enable) that tests if s_user_ns != &init_user_ns and just fails.
Then the quota changes can be left until we figure out how to support a
filesystem that uses quota support.  That should ensure better testing
coverage if nothing else.

The more I think of it the more I think that sounds like wisdom.
Dropping this patch and replacing it by one that just does:

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index d8fb0fd3ff6f..9c9890fe18b7 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
                error = -EINVAL;
                goto out_fmt;
        }
+       /* Filesystems outside of init_user_ns not yet supported */
+       if (sb->s_user_ns != &init_user_ns) {
+               error = -EINVAL;
+               goto out_fmt;
+       }
        /* Usage always has to be set... */
        if (!(flags & DQUOT_USAGE_ENABLED)) {
                error = -EINVAL;


seems a lot more appropriate at this point.  That is enough to give a
great big hint there is something that needs to be done but won't
embrittle the code with untested corner cases.

Especially since the code probably needs testing and a very close review
to see what problems a hostile quota file can create.  I the quota files
are simple enough we can't get into to much trouble but that is an
important consideration at this point.

So I am going to go respin this patchset replacing this change.

Eric

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-04  9:11         ` Jan Kara
                           ` (2 preceding siblings ...)
  (?)
@ 2016-07-05 15:34         ` Eric W. Biederman
       [not found]           ` <87d1msumhy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2016-07-05 20:57             ` Dave Chinner
  -1 siblings, 2 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 15:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni

Jan Kara <jack@suse.cz> writes:

> On Sat 02-07-16 12:33:29, Eric W. Biederman wrote:
>> In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
>> the filesystems notion of id 0.
>
> Hum, is it really usable? Basically the tool calling Q_XSETQLIMIT would
> have to be aware of the namespace the filesystem is mounted in to be able
> to perform the desired operation (and if it gets is wrong, there's
> possibility it would just silently set the timers for some user instead of
> for all users).

In general yes.  Because we are talking about the uid of the root user
in the container that mounted the filesystem.  It is the uid that is
stored as 0 on the filesystem.  So in the small everything looks normal.

>> For the dquot hashfn translate the qid into sb->s_user_ns to remove
>> extraneous bits before hashing.
>> 
>> When dealing with generic quota files in quota_tree.c, quota_v1.c, and
>> quota_v2.c before writing quotas to the file encode them in
>> sb->s_user_ns, and decode ids read from the file from sb->s_user_ns.
>
> And here the sb->s_user_ns becomes part of the on-disk format because the
> numerical ID value is used to compute the position in on-disk data
> structures. This seems to be in conflict with the idea that anything that
> is stored on disk is stored in the initial user namespace.

Up to now it has been the case that all filesystems with a backing store
that the kernel could mount stored their data in the initial user
namespace.  This set of changes is all about making it so that a
filesystem with uids and gids stored in s_user_ns can be supported
safely by the kernel.

>> One complication comes up in qtree_get_next_id, as it is possible in
>> principle for id's to be present in the quota file that do not have a
>> mapping in the filesystems user namespace.  If that is the case the
>> code is modified to skip the unmapped ids.
>
> So I'm not 100% sure how this is going to work. qtree_get_next_id() is used
> as an interface to report quota information for the whole filesystem. So
> skipping ids without mappings makes sense. However userspace uses the
> interface as:
>
> id = 0;
> while (get_next_quota(id, &result)) {
> 	do stuff;
> 	id = result.dq_id + 1;
> }
>
> If sb->s_user_ns is the current namespace of the process, I suppose this is
> going to work as expected. If the namespace of the process is different
> (including the init_user_ns), I expect this breaks, doesn't it?

Possibly.  And this is a good question.

Where this will break initially in that scenario is in:
static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
			  void __user *addr)
{
        ...

	if (!sb->s_qcop->get_nextdqblk)
		return -ENOSYS;
	qid = make_kqid(current_user_ns(), type, id);
	if (!qid_has_mapping(sb->s_user_ns, qid))
		return -EINVAL;
	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
	if (ret)

If userspace actually has to perform the increment.  The way
I read the code in 

Now it has occurred to me that I am probably getting ahead of things
with actually enabling all of this for the quota subsystem as I don't
think we have fileystems that use the vfs quota support on our short
term list.   I don't think these patches are wrong except possibily
in handling a quota interface with unmapped ids.  Which is sufficiently
weird I don't know if there is something better we can do.

Given that we are not likely to use quotas for a while I am happy
to add a test in vfs_load_quota_inode (the guts of quota_on, and
quota_enable) that tests if s_user_ns != &init_user_ns and just fails.
Then the quota changes can be left until we figure out how to support a
filesystem that uses quota support.  That should ensure better testing
coverage if nothing else.

The more I think of it the more I think that sounds like wisdom.
Dropping this patch and replacing it by one that just does:

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index d8fb0fd3ff6f..9c9890fe18b7 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
                error = -EINVAL;
                goto out_fmt;
        }
+       /* Filesystems outside of init_user_ns not yet supported */
+       if (sb->s_user_ns != &init_user_ns) {
+               error = -EINVAL;
+               goto out_fmt;
+       }
        /* Usage always has to be set... */
        if (!(flags & DQUOT_USAGE_ENABLED)) {
                error = -EINVAL;


seems a lot more appropriate at this point.  That is enough to give a
great big hint there is something that needs to be done but won't
embrittle the code with untested corner cases.

Especially since the code probably needs testing and a very close review
to see what problems a hostile quota file can create.  I the quota files
are simple enough we can't get into to much trouble but that is an
important consideration at this point.

So I am going to go respin this patchset replacing this change.

Eric

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]           ` <87d1msumhy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-05 20:57             ` Dave Chinner
  0 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-05 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> The more I think of it the more I think that sounds like wisdom.
> Dropping this patch and replacing it by one that just does:
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index d8fb0fd3ff6f..9c9890fe18b7 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>                 error = -EINVAL;
>                 goto out_fmt;
>         }
> +       /* Filesystems outside of init_user_ns not yet supported */
> +       if (sb->s_user_ns != &init_user_ns) {
> +               error = -EINVAL;
> +               goto out_fmt;
> +       }
>         /* Usage always has to be set... */
>         if (!(flags & DQUOT_USAGE_ENABLED)) {
>                 error = -EINVAL;
> 
> 
> seems a lot more appropriate at this point.  That is enough to give a
> great big hint there is something that needs to be done but won't
> embrittle the code with untested corner cases.

You'll need to propagate that to all filesystems that have their own
quota implemenation, too.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]           ` <87d1msumhy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-05 20:57             ` Dave Chinner
  0 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-05 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Seth Forshee, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni

On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> The more I think of it the more I think that sounds like wisdom.
> Dropping this patch and replacing it by one that just does:
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index d8fb0fd3ff6f..9c9890fe18b7 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>                 error = -EINVAL;
>                 goto out_fmt;
>         }
> +       /* Filesystems outside of init_user_ns not yet supported */
> +       if (sb->s_user_ns != &init_user_ns) {
> +               error = -EINVAL;
> +               goto out_fmt;
> +       }
>         /* Usage always has to be set... */
>         if (!(flags & DQUOT_USAGE_ENABLED)) {
>                 error = -EINVAL;
> 
> 
> seems a lot more appropriate at this point.  That is enough to give a
> great big hint there is something that needs to be done but won't
> embrittle the code with untested corner cases.

You'll need to propagate that to all filesystems that have their own
quota implemenation, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-05 20:57             ` Dave Chinner
  0 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-05 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Seth Forshee, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> The more I think of it the more I think that sounds like wisdom.
> Dropping this patch and replacing it by one that just does:
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index d8fb0fd3ff6f..9c9890fe18b7 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>                 error = -EINVAL;
>                 goto out_fmt;
>         }
> +       /* Filesystems outside of init_user_ns not yet supported */
> +       if (sb->s_user_ns != &init_user_ns) {
> +               error = -EINVAL;
> +               goto out_fmt;
> +       }
>         /* Usage always has to be set... */
>         if (!(flags & DQUOT_USAGE_ENABLED)) {
>                 error = -EINVAL;
> 
> 
> seems a lot more appropriate at this point.  That is enough to give a
> great big hint there is something that needs to be done but won't
> embrittle the code with untested corner cases.

You'll need to propagate that to all filesystems that have their own
quota implemenation, too.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-05 20:57             ` Dave Chinner
  (?)
  (?)
@ 2016-07-05 21:28             ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 21:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> writes:

> On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
>> The more I think of it the more I think that sounds like wisdom.
>> Dropping this patch and replacing it by one that just does:
>> 
>> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
>> index d8fb0fd3ff6f..9c9890fe18b7 100644
>> --- a/fs/quota/dquot.c
>> +++ b/fs/quota/dquot.c
>> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>>                 error = -EINVAL;
>>                 goto out_fmt;
>>         }
>> +       /* Filesystems outside of init_user_ns not yet supported */
>> +       if (sb->s_user_ns != &init_user_ns) {
>> +               error = -EINVAL;
>> +               goto out_fmt;
>> +       }
>>         /* Usage always has to be set... */
>>         if (!(flags & DQUOT_USAGE_ENABLED)) {
>>                 error = -EINVAL;
>> 
>> 
>> seems a lot more appropriate at this point.  That is enough to give a
>> great big hint there is something that needs to be done but won't
>> embrittle the code with untested corner cases.
>
> You'll need to propagate that to all filesystems that have their own
> quota implemenation, too.

All of the filesytems that have their own quota implementations omit the
flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
protected.

The above change is extending to the generic implementation of quota
files the same protection that is already enjoyed by filesystems in
general.  Recognition that being attacked by malicious users is a
difficult thing to defend against, and not enabling the code until there
is a reasonable certainty the code is robust against maliciuos attack
and everyone involved with maintaining the code is comfortable about the
situtation.

I have every intention of keeping all of the changes to the generic
parts of the quota layer so that filesystems who wish to implement
unprivileged mounts and implement quotas may proceed if they so desire.

Eric

p.s.  I  expect the the generic quota implementation is simple enough
that it is not particularly suseptible to problems caused by malicious
data.  But I don't currently care enough to verify and test that
assumption so this is very much the wrong time for me to be enabling the
feature.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-05 20:57             ` Dave Chinner
@ 2016-07-05 21:28               ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 21:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Seth Forshee, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni

Dave Chinner <david@fromorbit.com> writes:

> On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
>> The more I think of it the more I think that sounds like wisdom.
>> Dropping this patch and replacing it by one that just does:
>> 
>> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
>> index d8fb0fd3ff6f..9c9890fe18b7 100644
>> --- a/fs/quota/dquot.c
>> +++ b/fs/quota/dquot.c
>> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>>                 error = -EINVAL;
>>                 goto out_fmt;
>>         }
>> +       /* Filesystems outside of init_user_ns not yet supported */
>> +       if (sb->s_user_ns != &init_user_ns) {
>> +               error = -EINVAL;
>> +               goto out_fmt;
>> +       }
>>         /* Usage always has to be set... */
>>         if (!(flags & DQUOT_USAGE_ENABLED)) {
>>                 error = -EINVAL;
>> 
>> 
>> seems a lot more appropriate at this point.  That is enough to give a
>> great big hint there is something that needs to be done but won't
>> embrittle the code with untested corner cases.
>
> You'll need to propagate that to all filesystems that have their own
> quota implemenation, too.

All of the filesytems that have their own quota implementations omit the
flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
protected.

The above change is extending to the generic implementation of quota
files the same protection that is already enjoyed by filesystems in
general.  Recognition that being attacked by malicious users is a
difficult thing to defend against, and not enabling the code until there
is a reasonable certainty the code is robust against maliciuos attack
and everyone involved with maintaining the code is comfortable about the
situtation.

I have every intention of keeping all of the changes to the generic
parts of the quota layer so that filesystems who wish to implement
unprivileged mounts and implement quotas may proceed if they so desire.

Eric

p.s.  I  expect the the generic quota implementation is simple enough
that it is not particularly suseptible to problems caused by malicious
data.  But I don't currently care enough to verify and test that
assumption so this is very much the wrong time for me to be enabling the
feature.


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-05 21:28               ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-05 21:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Seth Forshee, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> writes:

> On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
>> The more I think of it the more I think that sounds like wisdom.
>> Dropping this patch and replacing it by one that just does:
>> 
>> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
>> index d8fb0fd3ff6f..9c9890fe18b7 100644
>> --- a/fs/quota/dquot.c
>> +++ b/fs/quota/dquot.c
>> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>>                 error = -EINVAL;
>>                 goto out_fmt;
>>         }
>> +       /* Filesystems outside of init_user_ns not yet supported */
>> +       if (sb->s_user_ns != &init_user_ns) {
>> +               error = -EINVAL;
>> +               goto out_fmt;
>> +       }
>>         /* Usage always has to be set... */
>>         if (!(flags & DQUOT_USAGE_ENABLED)) {
>>                 error = -EINVAL;
>> 
>> 
>> seems a lot more appropriate at this point.  That is enough to give a
>> great big hint there is something that needs to be done but won't
>> embrittle the code with untested corner cases.
>
> You'll need to propagate that to all filesystems that have their own
> quota implemenation, too.

All of the filesytems that have their own quota implementations omit the
flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
protected.

The above change is extending to the generic implementation of quota
files the same protection that is already enjoyed by filesystems in
general.  Recognition that being attacked by malicious users is a
difficult thing to defend against, and not enabling the code until there
is a reasonable certainty the code is robust against maliciuos attack
and everyone involved with maintaining the code is comfortable about the
situtation.

I have every intention of keeping all of the changes to the generic
parts of the quota layer so that filesystems who wish to implement
unprivileged mounts and implement quotas may proceed if they so desire.

Eric

p.s.  I  expect the the generic quota implementation is simple enough
that it is not particularly suseptible to problems caused by malicious
data.  But I don't currently care enough to verify and test that
assumption so this is very much the wrong time for me to be enabling the
feature.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-05 21:28               ` Eric W. Biederman
@ 2016-07-06  6:35                   ` Dave Chinner
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-06  6:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Tue, Jul 05, 2016 at 04:28:53PM -0500, Eric W. Biederman wrote:
> Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> writes:
> 
> > On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> >> The more I think of it the more I think that sounds like wisdom.
> >> Dropping this patch and replacing it by one that just does:
> >> 
> >> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> >> index d8fb0fd3ff6f..9c9890fe18b7 100644
> >> --- a/fs/quota/dquot.c
> >> +++ b/fs/quota/dquot.c
> >> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
> >>                 error = -EINVAL;
> >>                 goto out_fmt;
> >>         }
> >> +       /* Filesystems outside of init_user_ns not yet supported */
> >> +       if (sb->s_user_ns != &init_user_ns) {
> >> +               error = -EINVAL;
> >> +               goto out_fmt;
> >> +       }
> >>         /* Usage always has to be set... */
> >>         if (!(flags & DQUOT_USAGE_ENABLED)) {
> >>                 error = -EINVAL;
> >> 
> >> 
> >> seems a lot more appropriate at this point.  That is enough to give a
> >> great big hint there is something that needs to be done but won't
> >> embrittle the code with untested corner cases.
> >
> > You'll need to propagate that to all filesystems that have their own
> > quota implemenation, too.
> 
> All of the filesytems that have their own quota implementations omit the
> flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
> protected.

Which is the same situation currently for every filesystem that
supports quotas, regardless of the implementation infrastructure.

> p.s.  I  expect the the generic quota implementation is simple enough
> that it is not particularly suseptible to problems caused by malicious
> data.  But I don't currently care enough to verify and test that
> assumption so this is very much the wrong time for me to be enabling the
> feature.

All the more reason you should be adding the same guard to all the
other filesystems....

All i'm asking you to do is to make this check in a way that all
filesystems that implement quotas will execute it. Don't leave
landmines with security implications around - make sure all
filesystems have the same protections.

-Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-06  6:35                   ` Dave Chinner
  0 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-06  6:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Seth Forshee, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Michael Kerrisk, linux-fsdevel,
	Djalal Harouni

On Tue, Jul 05, 2016 at 04:28:53PM -0500, Eric W. Biederman wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> >> The more I think of it the more I think that sounds like wisdom.
> >> Dropping this patch and replacing it by one that just does:
> >> 
> >> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> >> index d8fb0fd3ff6f..9c9890fe18b7 100644
> >> --- a/fs/quota/dquot.c
> >> +++ b/fs/quota/dquot.c
> >> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
> >>                 error = -EINVAL;
> >>                 goto out_fmt;
> >>         }
> >> +       /* Filesystems outside of init_user_ns not yet supported */
> >> +       if (sb->s_user_ns != &init_user_ns) {
> >> +               error = -EINVAL;
> >> +               goto out_fmt;
> >> +       }
> >>         /* Usage always has to be set... */
> >>         if (!(flags & DQUOT_USAGE_ENABLED)) {
> >>                 error = -EINVAL;
> >> 
> >> 
> >> seems a lot more appropriate at this point.  That is enough to give a
> >> great big hint there is something that needs to be done but won't
> >> embrittle the code with untested corner cases.
> >
> > You'll need to propagate that to all filesystems that have their own
> > quota implemenation, too.
> 
> All of the filesytems that have their own quota implementations omit the
> flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
> protected.

Which is the same situation currently for every filesystem that
supports quotas, regardless of the implementation infrastructure.

> p.s.  I  expect the the generic quota implementation is simple enough
> that it is not particularly suseptible to problems caused by malicious
> data.  But I don't currently care enough to verify and test that
> assumption so this is very much the wrong time for me to be enabling the
> feature.

All the more reason you should be adding the same guard to all the
other filesystems....

All i'm asking you to do is to make this check in a way that all
filesystems that implement quotas will execute it. Don't leave
landmines with security implications around - make sure all
filesystems have the same protections.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-06  6:35                   ` Dave Chinner
@ 2016-07-06  8:25                     ` Jan Kara
  -1 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06  8:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Djalal Harouni, Michael Kerrisk

On Wed 06-07-16 16:35:04, Dave Chinner wrote:
> On Tue, Jul 05, 2016 at 04:28:53PM -0500, Eric W. Biederman wrote:
> > Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> writes:
> > 
> > > On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> > >> The more I think of it the more I think that sounds like wisdom.
> > >> Dropping this patch and replacing it by one that just does:
> > >> 
> > >> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> > >> index d8fb0fd3ff6f..9c9890fe18b7 100644
> > >> --- a/fs/quota/dquot.c
> > >> +++ b/fs/quota/dquot.c
> > >> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
> > >>                 error = -EINVAL;
> > >>                 goto out_fmt;
> > >>         }
> > >> +       /* Filesystems outside of init_user_ns not yet supported */
> > >> +       if (sb->s_user_ns != &init_user_ns) {
> > >> +               error = -EINVAL;
> > >> +               goto out_fmt;
> > >> +       }
> > >>         /* Usage always has to be set... */
> > >>         if (!(flags & DQUOT_USAGE_ENABLED)) {
> > >>                 error = -EINVAL;
> > >> 
> > >> 
> > >> seems a lot more appropriate at this point.  That is enough to give a
> > >> great big hint there is something that needs to be done but won't
> > >> embrittle the code with untested corner cases.
> > >
> > > You'll need to propagate that to all filesystems that have their own
> > > quota implemenation, too.
> > 
> > All of the filesytems that have their own quota implementations omit the
> > flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
> > protected.
> 
> Which is the same situation currently for every filesystem that
> supports quotas, regardless of the implementation infrastructure.
> 
> > p.s.  I  expect the the generic quota implementation is simple enough
> > that it is not particularly suseptible to problems caused by malicious
> > data.  But I don't currently care enough to verify and test that
> > assumption so this is very much the wrong time for me to be enabling the
> > feature.
> 
> All the more reason you should be adding the same guard to all the
> other filesystems....
> 
> All i'm asking you to do is to make this check in a way that all
> filesystems that implement quotas will execute it. Don't leave
> landmines with security implications around - make sure all
> filesystems have the same protections.

Well, I'm not sure I follow you here. VFS quotas are a generic code used by
a few filesystems. So I can imagine that someone would decide to enable
FS_USERNS_MOUNT for one of those filesystems without thinking about quotas
and then Eric's check would trigger and possibly save use from some
problems.

When someone decides to enable FS_USERNS_MOUNT for XFS, he will have
presumably made sure all parts of XFS are safe, including its quota
implementation.

I don't want to stop you or Eric in adding an extra check in XFS, I just
have hard time to see how that check would trigger and how XFS quota is
different from other XFS parts...

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
@ 2016-07-06  8:25                     ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06  8:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric W. Biederman, Jan Kara, Seth Forshee, Jann Horn, Linux API,
	Linux Containers, Andy Lutomirski, James Bottomley,
	Michael Kerrisk, linux-fsdevel, Djalal Harouni

On Wed 06-07-16 16:35:04, Dave Chinner wrote:
> On Tue, Jul 05, 2016 at 04:28:53PM -0500, Eric W. Biederman wrote:
> > Dave Chinner <david@fromorbit.com> writes:
> > 
> > > On Tue, Jul 05, 2016 at 10:34:49AM -0500, Eric W. Biederman wrote:
> > >> The more I think of it the more I think that sounds like wisdom.
> > >> Dropping this patch and replacing it by one that just does:
> > >> 
> > >> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> > >> index d8fb0fd3ff6f..9c9890fe18b7 100644
> > >> --- a/fs/quota/dquot.c
> > >> +++ b/fs/quota/dquot.c
> > >> @@ -2273,6 +2273,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
> > >>                 error = -EINVAL;
> > >>                 goto out_fmt;
> > >>         }
> > >> +       /* Filesystems outside of init_user_ns not yet supported */
> > >> +       if (sb->s_user_ns != &init_user_ns) {
> > >> +               error = -EINVAL;
> > >> +               goto out_fmt;
> > >> +       }
> > >>         /* Usage always has to be set... */
> > >>         if (!(flags & DQUOT_USAGE_ENABLED)) {
> > >>                 error = -EINVAL;
> > >> 
> > >> 
> > >> seems a lot more appropriate at this point.  That is enough to give a
> > >> great big hint there is something that needs to be done but won't
> > >> embrittle the code with untested corner cases.
> > >
> > > You'll need to propagate that to all filesystems that have their own
> > > quota implemenation, too.
> > 
> > All of the filesytems that have their own quota implementations omit the
> > flag FS_USERNS_MOUNT in fs_flags in struct filesystem so they are
> > protected.
> 
> Which is the same situation currently for every filesystem that
> supports quotas, regardless of the implementation infrastructure.
> 
> > p.s.  I  expect the the generic quota implementation is simple enough
> > that it is not particularly suseptible to problems caused by malicious
> > data.  But I don't currently care enough to verify and test that
> > assumption so this is very much the wrong time for me to be enabling the
> > feature.
> 
> All the more reason you should be adding the same guard to all the
> other filesystems....
> 
> All i'm asking you to do is to make this check in a way that all
> filesystems that implement quotas will execute it. Don't leave
> landmines with security implications around - make sure all
> filesystems have the same protections.

Well, I'm not sure I follow you here. VFS quotas are a generic code used by
a few filesystems. So I can imagine that someone would decide to enable
FS_USERNS_MOUNT for one of those filesystems without thinking about quotas
and then Eric's check would trigger and possibly save use from some
problems.

When someone decides to enable FS_USERNS_MOUNT for XFS, he will have
presumably made sure all parts of XFS are safe, including its quota
implementation.

I don't want to stop you or Eric in adding an extra check in XFS, I just
have hard time to see how that check would trigger and how XFS quota is
different from other XFS parts...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-04 16:27     ` Eric W. Biederman
@ 2016-07-06  8:54         ` Jan Kara
  -1 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06  8:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> 
> > On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
> >> 
> >> As well as in these patches the code is also available from:
> >> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> >> 
> >> It has been a long time in coming but recently in the userns tree the
> >> superblock has been expanded with a s_user_ns field indicating the user
> >> namespace that owns a superblock.
> >> 
> >> The s_user_ns owner of a superblock has three implications.
> >> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
> >>  filesystem from the vfs.
> >> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
> >>   is set to INVALID_UID and i_gid is set to INVALID_GID.
> >> - The scope of permission checks can be changed from global to a
> >>   capabilitiy check in s_user_ns.
> >
> > OK, to check that I understand it right:
> >
> > So the uids and gids that are stored on disk are still expected to be in
> > the initial id namespace, aren't they?
> 
> No.
> 
> The general expectation is that the ids on disk are store in s_user_ns.
>
> Id's that don't map to the initial id namespace get stored in the
> generic data structures as INVALID_UID and INVALID_GID.

> In practice I don't expect anyone will set up a situation knowingly
> where id's don't map, but the case has to be handled because mistakes
> and malicious code happens.

OK, thanks for explanation. But then the namespace the filesystem is
mounted with essentially becomes part of the on-disk format, doesn't it?
Because if someone mounts the media from a different namespace, suddently
the UID/GIDs may map to different users in initial user namespace and
consequences may be weird, right? Shouldn't it thus be somehow stored
together with the filesystem to make things more robust?

I don't remember the indented uses for user-ns mounts so I may be just
wrong. But my experience tells me that external data (such as user
namespace ID mappings in your case) that modify meaning of on-disk format
tend to cause maintenance difficulties in the long run... Because someone
*will* have the idea of migrating these fs images between containers /
machines and then they have to make sure mappings get migrated as well and
it all becomes cumbersome.

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06  8:54         ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06  8:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Seth Forshee, Linux Containers, linux-fsdevel,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
> >> 
> >> As well as in these patches the code is also available from:
> >> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> >> 
> >> It has been a long time in coming but recently in the userns tree the
> >> superblock has been expanded with a s_user_ns field indicating the user
> >> namespace that owns a superblock.
> >> 
> >> The s_user_ns owner of a superblock has three implications.
> >> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
> >>  filesystem from the vfs.
> >> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
> >>   is set to INVALID_UID and i_gid is set to INVALID_GID.
> >> - The scope of permission checks can be changed from global to a
> >>   capabilitiy check in s_user_ns.
> >
> > OK, to check that I understand it right:
> >
> > So the uids and gids that are stored on disk are still expected to be in
> > the initial id namespace, aren't they?
> 
> No.
> 
> The general expectation is that the ids on disk are store in s_user_ns.
>
> Id's that don't map to the initial id namespace get stored in the
> generic data structures as INVALID_UID and INVALID_GID.

> In practice I don't expect anyone will set up a situation knowingly
> where id's don't map, but the case has to be handled because mistakes
> and malicious code happens.

OK, thanks for explanation. But then the namespace the filesystem is
mounted with essentially becomes part of the on-disk format, doesn't it?
Because if someone mounts the media from a different namespace, suddently
the UID/GIDs may map to different users in initial user namespace and
consequences may be weird, right? Shouldn't it thus be somehow stored
together with the filesystem to make things more robust?

I don't remember the indented uses for user-ns mounts so I may be just
wrong. But my experience tells me that external data (such as user
namespace ID mappings in your case) that modify meaning of on-disk format
tend to cause maintenance difficulties in the long run... Because someone
*will* have the idea of migrating these fs images between containers /
machines and then they have to make sure mappings get migrated as well and
it all becomes cumbersome.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
  2016-07-05 14:55           ` Eric W. Biederman
@ 2016-07-06  9:07               ` Jan Kara
  -1 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06  9:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

On Tue 05-07-16 09:55:18, Eric W. Biederman wrote:
> Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> 
> > On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
> >> It is expected that filesystems can not represent uids and gids from
> >> outside of their user namespace.  Keep things simple by not even
> >> trying to create filesystem nodes with non-sense uids and gids.
> >> 
> >> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> >
> > So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
> > directly create files in this filesystem. EOVERFLOW error will at least
> > hint us where the problem is but still I'm suspecting this is going to
> > create hard to debug configuration issues... I'm not sure if we can do
> > anything about this but I wanted to point it out.
> 
> A reasonable point.
> 
> A couple of details.
> 
> - If there is no uid or gid 0 inside of a user namespace there is
>   no root user in that namespace so this is a moot point.
> 
> - If we are talking about uid 0 in the initial user namespace the
>   scenario you are worried about can occur, but you have to work at it
>   to get into that situation.
> 
> To create a superblock has s_user_ns != &init_user_ns something like
> the following has to occur.
> 
> setuid(1000);
> unshare(CLONE_NEWUSER);
> unshare(CLONE_NEWNS);
> mount(....);
> 
> At which point the oridinary root user can not see the filesystem
> with s_user_ns != &init_user_ns as it is located in another mount
> namespace.
> 
> Which means root has to use setns to get into that mount namespace,
> so there are going to be larger hints than -EOVERFLOW.

OK, I see. Thanks for explanation. The inability of the admin (UID 0 in
init_user_ns) to easily see and change any filesystem mounted in the system
makes me somewhat nervous ;). But I guess the complexity is the price for
the flexibility...

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
@ 2016-07-06  9:07               ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06  9:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Seth Forshee, Linux Containers, linux-fsdevel,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Tue 05-07-16 09:55:18, Eric W. Biederman wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Sat 02-07-16 12:20:31, Eric W. Biederman wrote:
> >> It is expected that filesystems can not represent uids and gids from
> >> outside of their user namespace.  Keep things simple by not even
> >> trying to create filesystem nodes with non-sense uids and gids.
> >> 
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >
> > So if we have sb->s_user_ns that doesn't map UID and GID 0, root cannot
> > directly create files in this filesystem. EOVERFLOW error will at least
> > hint us where the problem is but still I'm suspecting this is going to
> > create hard to debug configuration issues... I'm not sure if we can do
> > anything about this but I wanted to point it out.
> 
> A reasonable point.
> 
> A couple of details.
> 
> - If there is no uid or gid 0 inside of a user namespace there is
>   no root user in that namespace so this is a moot point.
> 
> - If we are talking about uid 0 in the initial user namespace the
>   scenario you are worried about can occur, but you have to work at it
>   to get into that situation.
> 
> To create a superblock has s_user_ns != &init_user_ns something like
> the following has to occur.
> 
> setuid(1000);
> unshare(CLONE_NEWUSER);
> unshare(CLONE_NEWNS);
> mount(....);
> 
> At which point the oridinary root user can not see the filesystem
> with s_user_ns != &init_user_ns as it is located in another mount
> namespace.
> 
> Which means root has to use setns to get into that mount namespace,
> so there are going to be larger hints than -EOVERFLOW.

OK, I see. Thanks for explanation. The inability of the admin (UID 0 in
init_user_ns) to easily see and change any filesystem mounted in the system
makes me somewhat nervous ;). But I guess the complexity is the price for
the flexibility...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
  2016-07-02 17:20   ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
  2016-07-04  8:52   ` [PATCH review 0/11] General unprivileged mount support Jan Kara
@ 2016-07-06 13:44   ` Andy Lutomirski
  2016-07-06 14:01   ` Andy Lutomirski
  2016-07-06 18:10     ` Eric W. Biederman
  4 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2016-07-06 13:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	James Bottomley, Seth Forshee, Michael Kerrisk, Linux FS Devel,
	Djalal Harouni

On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>

Minor nit: in "fs: Refuse uid/gid changes which don't map into
s_user_ns", the changelog ends in the middle of a sentence.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-06 13:44   ` Andy Lutomirski
  2016-07-04  8:52   ` [PATCH review 0/11] General unprivileged mount support Jan Kara
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2016-07-06 13:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers, Linux FS Devel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn, Jan Kara,
	Jann Horn, Michael Kerrisk

On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>

Minor nit: in "fs: Refuse uid/gid changes which don't map into
s_user_ns", the changelog ends in the middle of a sentence.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06 13:44   ` Andy Lutomirski
  0 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2016-07-06 13:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers, Linux FS Devel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn, Jan Kara,
	Jann Horn, Michael Kerrisk

On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>

Minor nit: in "fs: Refuse uid/gid changes which don't map into
s_user_ns", the changelog ends in the middle of a sentence.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found]         ` <20160706085440.GD14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-06 13:54           ` Seth Forshee
  0 siblings, 0 replies; 134+ messages in thread
From: Seth Forshee @ 2016-07-06 13:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	Djalal Harouni

On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> > 
> > > On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
> > >> 
> > >> As well as in these patches the code is also available from:
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> > >> 
> > >> It has been a long time in coming but recently in the userns tree the
> > >> superblock has been expanded with a s_user_ns field indicating the user
> > >> namespace that owns a superblock.
> > >> 
> > >> The s_user_ns owner of a superblock has three implications.
> > >> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
> > >>  filesystem from the vfs.
> > >> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
> > >>   is set to INVALID_UID and i_gid is set to INVALID_GID.
> > >> - The scope of permission checks can be changed from global to a
> > >>   capabilitiy check in s_user_ns.
> > >
> > > OK, to check that I understand it right:
> > >
> > > So the uids and gids that are stored on disk are still expected to be in
> > > the initial id namespace, aren't they?
> > 
> > No.
> > 
> > The general expectation is that the ids on disk are store in s_user_ns.
> >
> > Id's that don't map to the initial id namespace get stored in the
> > generic data structures as INVALID_UID and INVALID_GID.
> 
> > In practice I don't expect anyone will set up a situation knowingly
> > where id's don't map, but the case has to be handled because mistakes
> > and malicious code happens.
> 
> OK, thanks for explanation. But then the namespace the filesystem is
> mounted with essentially becomes part of the on-disk format, doesn't it?
> Because if someone mounts the media from a different namespace, suddently
> the UID/GIDs may map to different users in initial user namespace and
> consequences may be weird, right? Shouldn't it thus be somehow stored
> together with the filesystem to make things more robust?
> 
> I don't remember the indented uses for user-ns mounts so I may be just
> wrong. But my experience tells me that external data (such as user
> namespace ID mappings in your case) that modify meaning of on-disk format
> tend to cause maintenance difficulties in the long run... Because someone
> *will* have the idea of migrating these fs images between containers /
> machines and then they have to make sure mappings get migrated as well and
> it all becomes cumbersome.

The intended use case for this is containers, with the idea being that I
as a user will get the same behavior in the container as I would in
init_user_ns without needing any userspace modifications to achieve
that.

So if I have a filesystem that contains uid 0 and I mount it in my
container, I should see uid 0. If I mount the same bits in another
container with a different uid mapping I should also see uid 0.

If I mkfs a new filesystem in my container then mount it, the root
directory of the fs is owned by uid 0 in my container without any
modifications to mkfs.

I'd argue that this makes it easier to migrate a disk between containers
because the ids in the disk show up the same within the container
regardless of the id mapping. If someone wants to mount a filesystem in
one container and also access it in another container with a completely
different id mapping, well I don't think that's ever going to work well.

Seth

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06  8:54         ` Jan Kara
  (?)
@ 2016-07-06 13:54         ` Seth Forshee
  2016-07-06 14:22             ` Jan Kara
  2016-07-06 14:22           ` Jan Kara
  -1 siblings, 2 replies; 134+ messages in thread
From: Seth Forshee @ 2016-07-06 13:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric W. Biederman, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > Jan Kara <jack@suse.cz> writes:
> > 
> > > On Sat 02-07-16 12:18:08, Eric W. Biederman wrote:
> > >> 
> > >> As well as in these patches the code is also available from:
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
> > >> 
> > >> It has been a long time in coming but recently in the userns tree the
> > >> superblock has been expanded with a s_user_ns field indicating the user
> > >> namespace that owns a superblock.
> > >> 
> > >> The s_user_ns owner of a superblock has three implications.
> > >> - Only kuids and kgids that map into s_user_ns are allowed to be sent to a
> > >>  filesystem from the vfs.
> > >> - If the uid or gid on the filesystem does not map into s_user_ns i_uid
> > >>   is set to INVALID_UID and i_gid is set to INVALID_GID.
> > >> - The scope of permission checks can be changed from global to a
> > >>   capabilitiy check in s_user_ns.
> > >
> > > OK, to check that I understand it right:
> > >
> > > So the uids and gids that are stored on disk are still expected to be in
> > > the initial id namespace, aren't they?
> > 
> > No.
> > 
> > The general expectation is that the ids on disk are store in s_user_ns.
> >
> > Id's that don't map to the initial id namespace get stored in the
> > generic data structures as INVALID_UID and INVALID_GID.
> 
> > In practice I don't expect anyone will set up a situation knowingly
> > where id's don't map, but the case has to be handled because mistakes
> > and malicious code happens.
> 
> OK, thanks for explanation. But then the namespace the filesystem is
> mounted with essentially becomes part of the on-disk format, doesn't it?
> Because if someone mounts the media from a different namespace, suddently
> the UID/GIDs may map to different users in initial user namespace and
> consequences may be weird, right? Shouldn't it thus be somehow stored
> together with the filesystem to make things more robust?
> 
> I don't remember the indented uses for user-ns mounts so I may be just
> wrong. But my experience tells me that external data (such as user
> namespace ID mappings in your case) that modify meaning of on-disk format
> tend to cause maintenance difficulties in the long run... Because someone
> *will* have the idea of migrating these fs images between containers /
> machines and then they have to make sure mappings get migrated as well and
> it all becomes cumbersome.

The intended use case for this is containers, with the idea being that I
as a user will get the same behavior in the container as I would in
init_user_ns without needing any userspace modifications to achieve
that.

So if I have a filesystem that contains uid 0 and I mount it in my
container, I should see uid 0. If I mount the same bits in another
container with a different uid mapping I should also see uid 0.

If I mkfs a new filesystem in my container then mount it, the root
directory of the fs is owned by uid 0 in my container without any
modifications to mkfs.

I'd argue that this makes it easier to migrate a disk between containers
because the ids in the disk show up the same within the container
regardless of the id mapping. If someone wants to mount a filesystem in
one container and also access it in another container with a completely
different id mapping, well I don't think that's ever going to work well.

Seth


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
                     ` (2 preceding siblings ...)
  2016-07-06 13:44   ` Andy Lutomirski
@ 2016-07-06 14:01   ` Andy Lutomirski
  2016-07-06 18:10     ` Eric W. Biederman
  4 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2016-07-06 14:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	James Bottomley, Seth Forshee, Michael Kerrisk, Linux FS Devel,
	Djalal Harouni

On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>
> It has been a long time in coming but recently in the userns tree the
> superblock has been expanded with a s_user_ns field indicating the user
> namespace that owns a superblock.
>

It would be nice if global root could mount otherwise-unsafe
filesystems with s_user_ns set to a non-init namespace.  Do you have
plans to allow that?

--Andy

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-07-06 14:01   ` Andy Lutomirski
  2016-07-04  8:52   ` [PATCH review 0/11] General unprivileged mount support Jan Kara
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2016-07-06 14:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers, Linux FS Devel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn, Jan Kara,
	Jann Horn, Michael Kerrisk

On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>
> It has been a long time in coming but recently in the userns tree the
> superblock has been expanded with a s_user_ns field indicating the user
> namespace that owns a superblock.
>

It would be nice if global root could mount otherwise-unsafe
filesystems with s_user_ns set to a non-init namespace.  Do you have
plans to allow that?

--Andy

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06 14:01   ` Andy Lutomirski
  0 siblings, 0 replies; 134+ messages in thread
From: Andy Lutomirski @ 2016-07-06 14:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Linux Containers, Linux FS Devel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn, Jan Kara,
	Jann Horn, Michael Kerrisk

On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> As well as in these patches the code is also available from:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>
> It has been a long time in coming but recently in the userns tree the
> superblock has been expanded with a s_user_ns field indicating the user
> namespace that owns a superblock.
>

It would be nice if global root could mount otherwise-unsafe
filesystems with s_user_ns set to a non-init namespace.  Do you have
plans to allow that?

--Andy

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 13:54         ` Seth Forshee
  2016-07-06 14:22             ` Jan Kara
@ 2016-07-06 14:22           ` Jan Kara
  1 sibling, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06 14:22 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, James Bottomley, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	Djalal Harouni

On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > I don't remember the indented uses for user-ns mounts so I may be just
> > wrong. But my experience tells me that external data (such as user
> > namespace ID mappings in your case) that modify meaning of on-disk format
> > tend to cause maintenance difficulties in the long run... Because someone
> > *will* have the idea of migrating these fs images between containers /
> > machines and then they have to make sure mappings get migrated as well and
> > it all becomes cumbersome.
> 
> The intended use case for this is containers, with the idea being that I
> as a user will get the same behavior in the container as I would in
> init_user_ns without needing any userspace modifications to achieve
> that.
> 
> So if I have a filesystem that contains uid 0 and I mount it in my
> container, I should see uid 0. If I mount the same bits in another
> container with a different uid mapping I should also see uid 0.
> 
> If I mkfs a new filesystem in my container then mount it, the root
> directory of the fs is owned by uid 0 in my container without any
> modifications to mkfs.
> 
> I'd argue that this makes it easier to migrate a disk between containers
> because the ids in the disk show up the same within the container
> regardless of the id mapping. If someone wants to mount a filesystem in
> one container and also access it in another container with a completely
> different id mapping, well I don't think that's ever going to work well.

OK, I see how this is supposed to work. However you assume here that both
containers have the same set of valid UIDs, don't you? If that is not the
case, the mounted image will not be usable in the other container, right?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 13:54         ` Seth Forshee
@ 2016-07-06 14:22             ` Jan Kara
  2016-07-06 14:22           ` Jan Kara
  1 sibling, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06 14:22 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Eric W. Biederman, Linux Containers, linux-fsdevel,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > I don't remember the indented uses for user-ns mounts so I may be just
> > wrong. But my experience tells me that external data (such as user
> > namespace ID mappings in your case) that modify meaning of on-disk format
> > tend to cause maintenance difficulties in the long run... Because someone
> > *will* have the idea of migrating these fs images between containers /
> > machines and then they have to make sure mappings get migrated as well and
> > it all becomes cumbersome.
> 
> The intended use case for this is containers, with the idea being that I
> as a user will get the same behavior in the container as I would in
> init_user_ns without needing any userspace modifications to achieve
> that.
> 
> So if I have a filesystem that contains uid 0 and I mount it in my
> container, I should see uid 0. If I mount the same bits in another
> container with a different uid mapping I should also see uid 0.
> 
> If I mkfs a new filesystem in my container then mount it, the root
> directory of the fs is owned by uid 0 in my container without any
> modifications to mkfs.
> 
> I'd argue that this makes it easier to migrate a disk between containers
> because the ids in the disk show up the same within the container
> regardless of the id mapping. If someone wants to mount a filesystem in
> one container and also access it in another container with a completely
> different id mapping, well I don't think that's ever going to work well.

OK, I see how this is supposed to work. However you assume here that both
containers have the same set of valid UIDs, don't you? If that is not the
case, the mounted image will not be usable in the other container, right?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06 14:22             ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-06 14:22 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Jan Kara, Eric W. Biederman, Linux Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jann Horn,
	Michael Kerrisk

On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > I don't remember the indented uses for user-ns mounts so I may be just
> > wrong. But my experience tells me that external data (such as user
> > namespace ID mappings in your case) that modify meaning of on-disk format
> > tend to cause maintenance difficulties in the long run... Because someone
> > *will* have the idea of migrating these fs images between containers /
> > machines and then they have to make sure mappings get migrated as well and
> > it all becomes cumbersome.
> 
> The intended use case for this is containers, with the idea being that I
> as a user will get the same behavior in the container as I would in
> init_user_ns without needing any userspace modifications to achieve
> that.
> 
> So if I have a filesystem that contains uid 0 and I mount it in my
> container, I should see uid 0. If I mount the same bits in another
> container with a different uid mapping I should also see uid 0.
> 
> If I mkfs a new filesystem in my container then mount it, the root
> directory of the fs is owned by uid 0 in my container without any
> modifications to mkfs.
> 
> I'd argue that this makes it easier to migrate a disk between containers
> because the ids in the disk show up the same within the container
> regardless of the id mapping. If someone wants to mount a filesystem in
> one container and also access it in another container with a completely
> different id mapping, well I don't think that's ever going to work well.

OK, I see how this is supposed to work. However you assume here that both
containers have the same set of valid UIDs, don't you? If that is not the
case, the mounted image will not be usable in the other container, right?

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found]             ` <20160706142255.GB21164-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-06 14:46               ` Seth Forshee
  2016-07-06 15:01               ` Eric W. Biederman
  2016-07-06 15:23               ` James Bottomley
  2 siblings, 0 replies; 134+ messages in thread
From: Seth Forshee @ 2016-07-06 14:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	Djalal Harouni

On Wed, Jul 06, 2016 at 04:22:55PM +0200, Jan Kara wrote:
> On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> > On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > > I don't remember the indented uses for user-ns mounts so I may be just
> > > wrong. But my experience tells me that external data (such as user
> > > namespace ID mappings in your case) that modify meaning of on-disk format
> > > tend to cause maintenance difficulties in the long run... Because someone
> > > *will* have the idea of migrating these fs images between containers /
> > > machines and then they have to make sure mappings get migrated as well and
> > > it all becomes cumbersome.
> > 
> > The intended use case for this is containers, with the idea being that I
> > as a user will get the same behavior in the container as I would in
> > init_user_ns without needing any userspace modifications to achieve
> > that.
> > 
> > So if I have a filesystem that contains uid 0 and I mount it in my
> > container, I should see uid 0. If I mount the same bits in another
> > container with a different uid mapping I should also see uid 0.
> > 
> > If I mkfs a new filesystem in my container then mount it, the root
> > directory of the fs is owned by uid 0 in my container without any
> > modifications to mkfs.
> > 
> > I'd argue that this makes it easier to migrate a disk between containers
> > because the ids in the disk show up the same within the container
> > regardless of the id mapping. If someone wants to mount a filesystem in
> > one container and also access it in another container with a completely
> > different id mapping, well I don't think that's ever going to work well.
> 
> OK, I see how this is supposed to work. However you assume here that both
> containers have the same set of valid UIDs, don't you? If that is not the
> case, the mounted image will not be usable in the other container, right?

It's possible of course. I'd expect anyone wanting to use this in
practice to set up their containers with appropriate mappings. Full OS
style containers will at least need some minimal uid mapping to work
properly.

Seth

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 14:22             ` Jan Kara
  (?)
@ 2016-07-06 14:46             ` Seth Forshee
  -1 siblings, 0 replies; 134+ messages in thread
From: Seth Forshee @ 2016-07-06 14:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric W. Biederman, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Wed, Jul 06, 2016 at 04:22:55PM +0200, Jan Kara wrote:
> On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> > On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > > I don't remember the indented uses for user-ns mounts so I may be just
> > > wrong. But my experience tells me that external data (such as user
> > > namespace ID mappings in your case) that modify meaning of on-disk format
> > > tend to cause maintenance difficulties in the long run... Because someone
> > > *will* have the idea of migrating these fs images between containers /
> > > machines and then they have to make sure mappings get migrated as well and
> > > it all becomes cumbersome.
> > 
> > The intended use case for this is containers, with the idea being that I
> > as a user will get the same behavior in the container as I would in
> > init_user_ns without needing any userspace modifications to achieve
> > that.
> > 
> > So if I have a filesystem that contains uid 0 and I mount it in my
> > container, I should see uid 0. If I mount the same bits in another
> > container with a different uid mapping I should also see uid 0.
> > 
> > If I mkfs a new filesystem in my container then mount it, the root
> > directory of the fs is owned by uid 0 in my container without any
> > modifications to mkfs.
> > 
> > I'd argue that this makes it easier to migrate a disk between containers
> > because the ids in the disk show up the same within the container
> > regardless of the id mapping. If someone wants to mount a filesystem in
> > one container and also access it in another container with a completely
> > different id mapping, well I don't think that's ever going to work well.
> 
> OK, I see how this is supposed to work. However you assume here that both
> containers have the same set of valid UIDs, don't you? If that is not the
> case, the mounted image will not be usable in the other container, right?

It's possible of course. I'd expect anyone wanting to use this in
practice to set up their containers with appropriate mappings. Full OS
style containers will at least need some minimal uid mapping to work
properly.

Seth

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found]             ` <20160706142255.GB21164-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  2016-07-06 14:46               ` Seth Forshee
@ 2016-07-06 15:01               ` Eric W. Biederman
  2016-07-06 15:23               ` James Bottomley
  2 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Wed 06-07-16 08:54:46, Seth Forshee wrote:
>> On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
>> > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
>> > I don't remember the indented uses for user-ns mounts so I may be just
>> > wrong. But my experience tells me that external data (such as user
>> > namespace ID mappings in your case) that modify meaning of on-disk format
>> > tend to cause maintenance difficulties in the long run... Because someone
>> > *will* have the idea of migrating these fs images between containers /
>> > machines and then they have to make sure mappings get migrated as well and
>> > it all becomes cumbersome.
>> 
>> The intended use case for this is containers, with the idea being that I
>> as a user will get the same behavior in the container as I would in
>> init_user_ns without needing any userspace modifications to achieve
>> that.
>> 
>> So if I have a filesystem that contains uid 0 and I mount it in my
>> container, I should see uid 0. If I mount the same bits in another
>> container with a different uid mapping I should also see uid 0.
>> 
>> If I mkfs a new filesystem in my container then mount it, the root
>> directory of the fs is owned by uid 0 in my container without any
>> modifications to mkfs.
>> 
>> I'd argue that this makes it easier to migrate a disk between containers
>> because the ids in the disk show up the same within the container
>> regardless of the id mapping. If someone wants to mount a filesystem in
>> one container and also access it in another container with a completely
>> different id mapping, well I don't think that's ever going to work well.
>
> OK, I see how this is supposed to work. However you assume here that both
> containers have the same set of valid UIDs, don't you? If that is not the
> case, the mounted image will not be usable in the other container, right?

In practice most uids and gids are going to be < 2^16.  We currently
have a 32bit id space so it isn't hard to map all of the ids.

In practice this is exactly the same problem as moving filesystems
between systems and ensuring that the /etc/passwd and /etc/group entries
that describe the users and groups on the two systems are similiar
enough not to cause problems.

The difference between the classic case of moving filesystems between
systems and the case with user namespaces is that when there are
differences they can be encapuslated in a user namespace.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 14:22             ` Jan Kara
                               ` (2 preceding siblings ...)
  (?)
@ 2016-07-06 15:01             ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

Jan Kara <jack@suse.cz> writes:

> On Wed 06-07-16 08:54:46, Seth Forshee wrote:
>> On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
>> > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
>> > I don't remember the indented uses for user-ns mounts so I may be just
>> > wrong. But my experience tells me that external data (such as user
>> > namespace ID mappings in your case) that modify meaning of on-disk format
>> > tend to cause maintenance difficulties in the long run... Because someone
>> > *will* have the idea of migrating these fs images between containers /
>> > machines and then they have to make sure mappings get migrated as well and
>> > it all becomes cumbersome.
>> 
>> The intended use case for this is containers, with the idea being that I
>> as a user will get the same behavior in the container as I would in
>> init_user_ns without needing any userspace modifications to achieve
>> that.
>> 
>> So if I have a filesystem that contains uid 0 and I mount it in my
>> container, I should see uid 0. If I mount the same bits in another
>> container with a different uid mapping I should also see uid 0.
>> 
>> If I mkfs a new filesystem in my container then mount it, the root
>> directory of the fs is owned by uid 0 in my container without any
>> modifications to mkfs.
>> 
>> I'd argue that this makes it easier to migrate a disk between containers
>> because the ids in the disk show up the same within the container
>> regardless of the id mapping. If someone wants to mount a filesystem in
>> one container and also access it in another container with a completely
>> different id mapping, well I don't think that's ever going to work well.
>
> OK, I see how this is supposed to work. However you assume here that both
> containers have the same set of valid UIDs, don't you? If that is not the
> case, the mounted image will not be usable in the other container, right?

In practice most uids and gids are going to be < 2^16.  We currently
have a 32bit id space so it isn't hard to map all of the ids.

In practice this is exactly the same problem as moving filesystems
between systems and ensuring that the /etc/passwd and /etc/group entries
that describe the users and groups on the two systems are similiar
enough not to cause problems.

The difference between the classic case of moving filesystems between
systems and the case with user namespaces is that when there are
differences they can be encapuslated in a user namespace.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 14:01   ` Andy Lutomirski
@ 2016-07-06 15:19     ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	James Bottomley, Seth Forshee, Michael Kerrisk, Linux FS Devel,
	Djalal Harouni

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>>
>> It has been a long time in coming but recently in the userns tree the
>> superblock has been expanded with a s_user_ns field indicating the user
>> namespace that owns a superblock.
>>
>
> It would be nice if global root could mount otherwise-unsafe
> filesystems with s_user_ns set to a non-init namespace.  Do you have
> plans to allow that?

Looking at the code global root would pass the permission checks that
are present in my tree today.  At the same time I don't have a provision
for global root to specify the user namespace.

The practical limitation is that the filesystems need a little bit of
gentle massaging to handle the case when s_user_ns != &init_user_ns.
I think i_uid_read and i_uid_write already provide 90% of that.

The other limitation is that s_user_ns is defined as the owner of the
filesystem and as the default translation.  Which will mean root in
s_user_ns will be trusted to remount the filesystem.

The primary target right now is a safe unprivileged mount of fuse.


I am not opposed to extensions allowing the global root to mount
otherwise-unsafe filesystems with s_user_ns != &init_user_ns as long as
it is for cases where we can trust the underlying filesystem image and
don't mind giving root in s_user_ns all interesting permissions over the
filesystem.  I think that could be an very interesting intermediary step
in getting filesystems supported.  Still my focus is true unprivileged
mounts, there are a huge number of little details that go into getting
that right, and my poor brain can't handle looking at cases beyond that.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06 15:19     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Seth Forshee, Linux Containers, Linux FS Devel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn, Jan Kara,
	Jann Horn, Michael Kerrisk

Andy Lutomirski <luto@amacapital.net> writes:

> On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>>
>> It has been a long time in coming but recently in the userns tree the
>> superblock has been expanded with a s_user_ns field indicating the user
>> namespace that owns a superblock.
>>
>
> It would be nice if global root could mount otherwise-unsafe
> filesystems with s_user_ns set to a non-init namespace.  Do you have
> plans to allow that?

Looking at the code global root would pass the permission checks that
are present in my tree today.  At the same time I don't have a provision
for global root to specify the user namespace.

The practical limitation is that the filesystems need a little bit of
gentle massaging to handle the case when s_user_ns != &init_user_ns.
I think i_uid_read and i_uid_write already provide 90% of that.

The other limitation is that s_user_ns is defined as the owner of the
filesystem and as the default translation.  Which will mean root in
s_user_ns will be trusted to remount the filesystem.

The primary target right now is a safe unprivileged mount of fuse.


I am not opposed to extensions allowing the global root to mount
otherwise-unsafe filesystems with s_user_ns != &init_user_ns as long as
it is for cases where we can trust the underlying filesystem image and
don't mind giving root in s_user_ns all interesting permissions over the
filesystem.  I think that could be an very interesting intermediary step
in getting filesystems supported.  Still my focus is true unprivileged
mounts, there are a huge number of little details that go into getting
that right, and my poor brain can't handle looking at cases beyond that.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 13:44   ` Andy Lutomirski
@ 2016-07-06 15:21       ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	James Bottomley, Seth Forshee, Michael Kerrisk, Linux FS Devel,
	Djalal Harouni

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>>
>
> Minor nit: in "fs: Refuse uid/gid changes which don't map into
> s_user_ns", the changelog ends in the middle of a sentence.

This text?
> For filesystems mounted from a user namespace on-disk ids should
> be translated relative to s_users_ns rather than init_user_ns.

> When an id in the filesystem doesn't exist in s_user_ns the
> associated id in the inode will be set to INVALID_[UG]ID, which
> turns these into de facto "nobody" ids. This actually maps pretty
> well into the way most code already works, and those places where
> it didn't were fixed in previous patches. Moving forward vfs code
> needs to be careful to handle instances where ids in inodes may
> be invalid.


I don't see the changelog problem you are talking about.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06 15:21       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Seth Forshee, Linux Containers, Linux FS Devel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn, Jan Kara,
	Jann Horn, Michael Kerrisk

Andy Lutomirski <luto@amacapital.net> writes:

> On Sat, Jul 2, 2016 at 10:18 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> As well as in these patches the code is also available from:
>> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing
>>
>
> Minor nit: in "fs: Refuse uid/gid changes which don't map into
> s_user_ns", the changelog ends in the middle of a sentence.

This text?
> For filesystems mounted from a user namespace on-disk ids should
> be translated relative to s_users_ns rather than init_user_ns.

> When an id in the filesystem doesn't exist in s_user_ns the
> associated id in the inode will be set to INVALID_[UG]ID, which
> turns these into de facto "nobody" ids. This actually maps pretty
> well into the way most code already works, and those places where
> it didn't were fixed in previous patches. Moving forward vfs code
> needs to be careful to handle instances where ids in inodes may
> be invalid.


I don't see the changelog problem you are talking about.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
       [not found]             ` <20160706142255.GB21164-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
  2016-07-06 14:46               ` Seth Forshee
  2016-07-06 15:01               ` Eric W. Biederman
@ 2016-07-06 15:23               ` James Bottomley
  2 siblings, 0 replies; 134+ messages in thread
From: James Bottomley @ 2016-07-06 15:23 UTC (permalink / raw)
  To: Jan Kara, Seth Forshee
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Michael Kerrisk, Djalal Harouni

On Wed, 2016-07-06 at 16:22 +0200, Jan Kara wrote:
> On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> > On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > > I don't remember the indented uses for user-ns mounts so I may be 
> > > just wrong. But my experience tells me that external data (such 
> > > as user namespace ID mappings in your case) that modify meaning 
> > > of on-disk format tend to cause maintenance difficulties in the 
> > > long run... Because someone *will* have the idea of migrating 
> > > these fs images between containers / machines and then they have 
> > > to make sure mappings get migrated as well and it all becomes
> > > cumbersome.
> > 
> > The intended use case for this is containers, with the idea being 
> > that I as a user will get the same behavior in the container as I 
> > would in init_user_ns without needing any userspace modifications 
> > to achieve that.
> > 
> > So if I have a filesystem that contains uid 0 and I mount it in my
> > container, I should see uid 0. If I mount the same bits in another
> > container with a different uid mapping I should also see uid 0.
> > 
> > If I mkfs a new filesystem in my container then mount it, the root
> > directory of the fs is owned by uid 0 in my container without any
> > modifications to mkfs.
> > 
> > I'd argue that this makes it easier to migrate a disk between 
> > containers because the ids in the disk show up the same within the 
> > container regardless of the id mapping. If someone wants to mount a
> > filesystem in one container and also access it in another container 
> > with a completely different id mapping, well I don't think that's 
> > ever going to work well.
> 
> OK, I see how this is supposed to work. However you assume here that 
> both containers have the same set of valid UIDs, don't you? If that 
> is not the case, the mounted image will not be usable in the other 
> container, right?

You can always set it up wrongly is the rule of containers.  Because
the virtualizations are so granular, there are many possible
configurations which don't make sense in the real world.

The main use case for this is operating system images.  For them we
have a set of known UID/GIDs in the image (usually 0-1000 plus the
nobody/nogroups for both).  Using this scheme, we'd set up the
container in a userns that mapped all these ids to something
unprivileged and then set up a s_user_ns to do the same for the mount
location of the image meaning that the unprivileged container can now
manipulate the image.

There are several self contained proposals on linux-fsdevel for doing
this, like shifts, which is what I'm currently using to manipulate
images, so for me what it does is allows me to get rid of all the
credential shifting when performing operations on the underlying
filesystem.  In fact, I think it pretty much allows me to get rid of a
lot of the upper/lower filesystem distinction in shiftfs and I'd get
quotas and other stuff I ignored for free.

However, any of the other uid/gid shifting proposals can also use this
as the engine.

The point here is, this patch set is simply mechanism; it requires a
glue layer (like shiftfs, fuse or the vfs remapping proposal) to
activate it.  The activation decides how much exposure to the
underlying filesystem there is, so with shiftfs, there's none, it's a
purely volatile system crafted for chosen images.  However, it's fully
possible to come up with an activation where the filesystem would
decide (through some on disk format information) to declare the image
to be safely remapped in this uid/gid range and then we could allow it
to be mounted unprivileged (without a capability check) into a user_ns
that matched the mapping.  This latter is a bit of a fantasy since
container images are currently little more than tar files and we have
no extant way to connect them to linux fs formats, but once the
possibility exists, whose to say this won't change?

James

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 14:22             ` Jan Kara
                               ` (3 preceding siblings ...)
  (?)
@ 2016-07-06 15:23             ` James Bottomley
       [not found]               ` <1467818630.2369.21.camel-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org>
  -1 siblings, 1 reply; 134+ messages in thread
From: James Bottomley @ 2016-07-06 15:23 UTC (permalink / raw)
  To: Jan Kara, Seth Forshee
  Cc: Eric W. Biederman, Linux Containers, linux-fsdevel, Linux API,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jann Horn,
	Michael Kerrisk

On Wed, 2016-07-06 at 16:22 +0200, Jan Kara wrote:
> On Wed 06-07-16 08:54:46, Seth Forshee wrote:
> > On Wed, Jul 06, 2016 at 10:54:40AM +0200, Jan Kara wrote:
> > > On Mon 04-07-16 11:27:46, Eric W. Biederman wrote:
> > > I don't remember the indented uses for user-ns mounts so I may be 
> > > just wrong. But my experience tells me that external data (such 
> > > as user namespace ID mappings in your case) that modify meaning 
> > > of on-disk format tend to cause maintenance difficulties in the 
> > > long run... Because someone *will* have the idea of migrating 
> > > these fs images between containers / machines and then they have 
> > > to make sure mappings get migrated as well and it all becomes
> > > cumbersome.
> > 
> > The intended use case for this is containers, with the idea being 
> > that I as a user will get the same behavior in the container as I 
> > would in init_user_ns without needing any userspace modifications 
> > to achieve that.
> > 
> > So if I have a filesystem that contains uid 0 and I mount it in my
> > container, I should see uid 0. If I mount the same bits in another
> > container with a different uid mapping I should also see uid 0.
> > 
> > If I mkfs a new filesystem in my container then mount it, the root
> > directory of the fs is owned by uid 0 in my container without any
> > modifications to mkfs.
> > 
> > I'd argue that this makes it easier to migrate a disk between 
> > containers because the ids in the disk show up the same within the 
> > container regardless of the id mapping. If someone wants to mount a
> > filesystem in one container and also access it in another container 
> > with a completely different id mapping, well I don't think that's 
> > ever going to work well.
> 
> OK, I see how this is supposed to work. However you assume here that 
> both containers have the same set of valid UIDs, don't you? If that 
> is not the case, the mounted image will not be usable in the other 
> container, right?

You can always set it up wrongly is the rule of containers.  Because
the virtualizations are so granular, there are many possible
configurations which don't make sense in the real world.

The main use case for this is operating system images.  For them we
have a set of known UID/GIDs in the image (usually 0-1000 plus the
nobody/nogroups for both).  Using this scheme, we'd set up the
container in a userns that mapped all these ids to something
unprivileged and then set up a s_user_ns to do the same for the mount
location of the image meaning that the unprivileged container can now
manipulate the image.

There are several self contained proposals on linux-fsdevel for doing
this, like shifts, which is what I'm currently using to manipulate
images, so for me what it does is allows me to get rid of all the
credential shifting when performing operations on the underlying
filesystem.  In fact, I think it pretty much allows me to get rid of a
lot of the upper/lower filesystem distinction in shiftfs and I'd get
quotas and other stuff I ignored for free.

However, any of the other uid/gid shifting proposals can also use this
as the engine.

The point here is, this patch set is simply mechanism; it requires a
glue layer (like shiftfs, fuse or the vfs remapping proposal) to
activate it.  The activation decides how much exposure to the
underlying filesystem there is, so with shiftfs, there's none, it's a
purely volatile system crafted for chosen images.  However, it's fully
possible to come up with an activation where the filesystem would
decide (through some on disk format information) to declare the image
to be safely remapped in this uid/gid range and then we could allow it
to be mounted unprivileged (without a capability check) into a user_ns
that matched the mapping.  This latter is a bit of a fantasy since
container images are currently little more than tar files and we have
no extant way to connect them to linux fs formats, but once the
possibility exists, whose to say this won't change?

James




^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]               ` <20160706090705.GE14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-06 15:37                 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Tue 05-07-16 09:55:18, Eric W. Biederman wrote:

>> To create a superblock has s_user_ns != &init_user_ns something like
>> the following has to occur.
>> 
>> setuid(1000);
>> unshare(CLONE_NEWUSER);
>> unshare(CLONE_NEWNS);
>> mount(....);
>> 
>> At which point the oridinary root user can not see the filesystem
>> with s_user_ns != &init_user_ns as it is located in another mount
>> namespace.
>> 
>> Which means root has to use setns to get into that mount namespace,
>> so there are going to be larger hints than -EOVERFLOW.
>
> OK, I see. Thanks for explanation. The inability of the admin (UID 0 in
> init_user_ns) to easily see and change any filesystem mounted in the system
> makes me somewhat nervous ;). But I guess the complexity is the price for
> the flexibility...

Certainly the price of not allowing unprivileged users to confuse
existing setuid 0 and setcap executables.

There are days I really wish I could do things the plan9 way and throw
out any legacy baggage that makes things hard to implement.  As I recall
plan9 threw out setuid, setgid and only supported the 9p filesystem to
make this all simple and almost safe.

Wonderfully and horribly I have to keep a lot more real world code
usable in this framework.

All of that said with the existence of CRIU I expect we will ultimately
have all of the interfaces needed for adminstration applications to see
everything that is going on.  Unfortunately sometimes those interfaces
and tools lag behind the rest of the work.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]               ` <20160706090705.GE14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-06 15:37                 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Linux Containers, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

Jan Kara <jack@suse.cz> writes:

> On Tue 05-07-16 09:55:18, Eric W. Biederman wrote:

>> To create a superblock has s_user_ns != &init_user_ns something like
>> the following has to occur.
>> 
>> setuid(1000);
>> unshare(CLONE_NEWUSER);
>> unshare(CLONE_NEWNS);
>> mount(....);
>> 
>> At which point the oridinary root user can not see the filesystem
>> with s_user_ns != &init_user_ns as it is located in another mount
>> namespace.
>> 
>> Which means root has to use setns to get into that mount namespace,
>> so there are going to be larger hints than -EOVERFLOW.
>
> OK, I see. Thanks for explanation. The inability of the admin (UID 0 in
> init_user_ns) to easily see and change any filesystem mounted in the system
> makes me somewhat nervous ;). But I guess the complexity is the price for
> the flexibility...

Certainly the price of not allowing unprivileged users to confuse
existing setuid 0 and setcap executables.

There are days I really wish I could do things the plan9 way and throw
out any legacy baggage that makes things hard to implement.  As I recall
plan9 threw out setuid, setgid and only supported the 9p filesystem to
make this all simple and almost safe.

Wonderfully and horribly I have to keep a lot more real world code
usable in this framework.

All of that said with the existence of CRIU I expect we will ultimately
have all of the interfaces needed for adminstration applications to see
everything that is going on.  Unfortunately sometimes those interfaces
and tools lag behind the rest of the work.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 07/11] vfs: Don't create inodes with a uid or gid unknown to the vfs
@ 2016-07-06 15:37                 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 15:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Seth Forshee, Linux Containers,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jann Horn,
	Michael Kerrisk

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Tue 05-07-16 09:55:18, Eric W. Biederman wrote:

>> To create a superblock has s_user_ns != &init_user_ns something like
>> the following has to occur.
>> 
>> setuid(1000);
>> unshare(CLONE_NEWUSER);
>> unshare(CLONE_NEWNS);
>> mount(....);
>> 
>> At which point the oridinary root user can not see the filesystem
>> with s_user_ns != &init_user_ns as it is located in another mount
>> namespace.
>> 
>> Which means root has to use setns to get into that mount namespace,
>> so there are going to be larger hints than -EOVERFLOW.
>
> OK, I see. Thanks for explanation. The inability of the admin (UID 0 in
> init_user_ns) to easily see and change any filesystem mounted in the system
> makes me somewhat nervous ;). But I guess the complexity is the price for
> the flexibility...

Certainly the price of not allowing unprivileged users to confuse
existing setuid 0 and setcap executables.

There are days I really wish I could do things the plan9 way and throw
out any legacy baggage that makes things hard to implement.  As I recall
plan9 threw out setuid, setgid and only supported the 9p filesystem to
make this all simple and almost safe.

Wonderfully and horribly I have to keep a lot more real world code
usable in this framework.

All of that said with the existence of CRIU I expect we will ultimately
have all of the interfaces needed for adminstration applications to see
everything that is going on.  Unfortunately sometimes those interfaces
and tools lag behind the rest of the work.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
  2016-07-06 15:23             ` James Bottomley
@ 2016-07-06 16:35                   ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 16:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Jann Horn, Linux API, Linux Containers,
	Andy Lutomirski, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

James Bottomley <James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org> writes:

> The point here is, this patch set is simply mechanism; it requires a
> glue layer (like shiftfs, fuse or the vfs remapping proposal) to
> activate it.

Well ext4 can be used directly and Seth has basic patches for that
support.  The modifications needed are quite modest.  The problem with
ext4 is that a malicious ext4 filesystem image might be able to do
something nasty to the kernel.  How to create a maintainable high
performance filesystem that can guard against malicious filesystem
images is an open problem right now.  Which makes ext4 a poor target
for unprivileged mounts.

Fuse is a good target because guarding against malicious input from
userspace is part of it's orginial design.

The new novel mechanism is handling INVALID_UID and INVALID_GID at the
VFS layer so that filesystems who have translations in play (which
will be anything not mounted by the global root) won't have to get
all of the weird corner cases right on their own.

To that end I will be very interested to see what shiftfs looks like on
top of all of this.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 0/11] General unprivileged mount support
@ 2016-07-06 16:35                   ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 16:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Seth Forshee, Linux Containers, linux-fsdevel,
	Linux API, Djalal Harouni, Serge E. Hallyn, Andy Lutomirski,
	Jann Horn, Michael Kerrisk

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> The point here is, this patch set is simply mechanism; it requires a
> glue layer (like shiftfs, fuse or the vfs remapping proposal) to
> activate it.

Well ext4 can be used directly and Seth has basic patches for that
support.  The modifications needed are quite modest.  The problem with
ext4 is that a malicious ext4 filesystem image might be able to do
something nasty to the kernel.  How to create a maintainable high
performance filesystem that can guard against malicious filesystem
images is an open problem right now.  Which makes ext4 a poor target
for unprivileged mounts.

Fuse is a good target because guarding against malicious input from
userspace is part of it's orginial design.

The new novel mechanism is handling INVALID_UID and INVALID_GID at the
VFS layer so that filesystems who have translations in play (which
will be anything not mounted by the global root) won't have to get
all of the weird corner cases right on their own.

To that end I will be very interested to see what shiftfs looks like on
top of all of this.

Eric


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
       [not found]                     ` <20160706082545.GC14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-06 17:51                       ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 17:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Linux API, Linux Containers, Dave Chinner,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Wed 06-07-16 16:35:04, Dave Chinner wrote:

>> All the more reason you should be adding the same guard to all the
>> other filesystems....
>>
>> All i'm asking you to do is to make this check in a way that all
>> filesystems that implement quotas will execute it. Don't leave
>> landmines with security implications around - make sure all
>> filesystems have the same protections.

That is exactly what I am doing.  Ensuring people don't think the
generic quota file code has been closely audited and reviewed and deemed
safe against malicious users.

> Well, I'm not sure I follow you here. VFS quotas are a generic code used by
> a few filesystems. So I can imagine that someone would decide to enable
> FS_USERNS_MOUNT for one of those filesystems without thinking about quotas
> and then Eric's check would trigger and possibly save use from some
> problems.
>
> When someone decides to enable FS_USERNS_MOUNT for XFS, he will have
> presumably made sure all parts of XFS are safe, including its quota
> implementation.
>
> I don't want to stop you or Eric in adding an extra check in XFS, I just
> have hard time to see how that check would trigger and how XFS quota is
> different from other XFS parts...

Exactly.

While it is true that the quota file code has not been deemed safe for
attack by malicious users on any filesystem.  It only matters if the
entire filesystem has been deemed safe by setting FS_USERNS_MOUNT.  So
the only possible avenue of confusion I can see is with the default
quota file code in dquot.c and company which I am addressing.

The fully generic parts of quota in quota.c that every one uses will
handle all of the weird s_user_ns != &init_user_ns cases at the end
of this patchset so there is nothing to worry about there either.

So I don't see additional checks worth adding anywhere.  Dave if you want
to send me an XFS patch or point out where such a check would belong in
XFS I won't be opposed to carrying it in my tree along with the rest of
this change.   I probably will be puzzled about what makes that code
need an extra check but I won't be opposed to carring such a patch.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH v2 review 09/11] quota: Handle quota data stored in s_user_ns.
  2016-07-06  8:25                     ` Jan Kara
  (?)
@ 2016-07-06 17:51                     ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 17:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Seth Forshee, Jann Horn, Linux API,
	Linux Containers, Andy Lutomirski, James Bottomley,
	Michael Kerrisk, linux-fsdevel, Djalal Harouni

Jan Kara <jack@suse.cz> writes:

> On Wed 06-07-16 16:35:04, Dave Chinner wrote:

>> All the more reason you should be adding the same guard to all the
>> other filesystems....
>>
>> All i'm asking you to do is to make this check in a way that all
>> filesystems that implement quotas will execute it. Don't leave
>> landmines with security implications around - make sure all
>> filesystems have the same protections.

That is exactly what I am doing.  Ensuring people don't think the
generic quota file code has been closely audited and reviewed and deemed
safe against malicious users.

> Well, I'm not sure I follow you here. VFS quotas are a generic code used by
> a few filesystems. So I can imagine that someone would decide to enable
> FS_USERNS_MOUNT for one of those filesystems without thinking about quotas
> and then Eric's check would trigger and possibly save use from some
> problems.
>
> When someone decides to enable FS_USERNS_MOUNT for XFS, he will have
> presumably made sure all parts of XFS are safe, including its quota
> implementation.
>
> I don't want to stop you or Eric in adding an extra check in XFS, I just
> have hard time to see how that check would trigger and how XFS quota is
> different from other XFS parts...

Exactly.

While it is true that the quota file code has not been deemed safe for
attack by malicious users on any filesystem.  It only matters if the
entire filesystem has been deemed safe by setting FS_USERNS_MOUNT.  So
the only possible avenue of confusion I can see is with the default
quota file code in dquot.c and company which I am addressing.

The fully generic parts of quota in quota.c that every one uses will
handle all of the weird s_user_ns != &init_user_ns cases at the end
of this patchset so there is nothing to worry about there either.

So I don't see additional checks worth adding anywhere.  Dave if you want
to send me an XFS patch or point out where such a check would belong in
XFS I won't be opposed to carrying it in my tree along with the rest of
this change.   I probably will be puzzled about what makes that code
need an extra check but I won't be opposed to carring such a patch.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH review 0/12] General unprivileged mount support v2
  2016-07-02 17:18 ` Eric W. Biederman
@ 2016-07-06 18:10     ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:10 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni


As well as in these patches the code is also available from:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

v2 is slightly simplified.  During the review I realized the previous
quota changes while perhaps not wrong were in a part of the code I don't
want people to think is ready for use on unprivileged filesystems yet.

It has been a long time in coming but recently in the userns tree the
superblock has been expanded with a s_user_ns field indicating the user
namespace that owns a superblock.

The s_user_ns owner of a superblock has three implications.
- Only kuids and kgids that map into s_user_ns are allowed to be sent to a
 filesystem from the vfs.
- If the uid or gid on the filesystem does not map into s_user_ns i_uid
  is set to INVALID_UID and i_gid is set to INVALID_GID.
- The scope of permission checks can be changed from global to a
  capabilitiy check in s_user_ns.

The overall strategy is to handle as much of this as possible in the VFS
so that what is happening is consistent between filesystems and has
widespread review, and so that individual filesystems don't need to
duplicate code.

This set of patches inserts checks to ensure only kuids and kgids that
map into s_user_ns are sent to filesystems.

This set of patches updates the vfs to deal with potentially unmapped
uids and gids in the i_uid and i_gid fields.  The strategy adopted is to
deny any activity that causes inodes with unmapped uids or gid to be
written to disk except for a chown that causes makes i_uid and i_gid
to map into s_user_ns.

Relaxing of the capability checks and adding new filesystems that would
benefit from the changes is held off until the vfs support is complete.

I believe this work is complete so if there anything questionable you
see please let me know.

I have included linux-api because several system calls get new failure
modes mostly -EOVERFLOW, and that may need separate documenation and
review.

The target for this work is to enable fully unprivileged fuse mounts
and whichever filesystem results from the uid shifting work.  These
vfs changes should support all kinds of filesystems, but in practice
it is an open problem if it is possible to modify a block based
filesystem to be safe from people manipulating filesystem images in an
attempt to get the kernel to malfunction so only a very limited set of
additional filesystems is ever expected to be enabled by this work.

Eric W. Biederman (7):
      userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
      vfs: Verify acls are valid within superblock's s_user_ns.
      vfs: Don't modify inodes with a uid or gid unknown to the vfs
      vfs: Don't create inodes with a uid or gid unknown to the vfs
      quota: Ensure qids map to the filesystem
      quota: Handle quota data stored in s_user_ns in quota_setxquota
      dquot: For now explicitly don't support filesystems outside of init_user_ns

Seth Forshee (5):
      fs: Refuse uid/gid changes which don't map into s_user_ns
      fs: Check for invalid i_uid in may_follow_link()
      cred: Reject inodes with invalid ids in set_create_file_as()
      evm: Translate user/group ids relative to s_user_ns when computing HMAC
      fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns


 drivers/staging/lustre/lustre/mdc/mdc_request.c |  2 +-
 fs/9p/acl.c                                     |  2 +-
 fs/attr.c                                       | 19 +++++++++
 fs/inode.c                                      |  7 ++++
 fs/namei.c                                      | 40 ++++++++++++++----
 fs/posix_acl.c                                  |  8 ++--
 fs/quota/dquot.c                                |  8 ++++
 fs/quota/quota.c                                | 14 +++----
 fs/xattr.c                                      |  7 ++++
 include/linux/fs.h                              | 55 ++++++++++++++-----------
 include/linux/posix_acl.h                       |  2 +-
 include/linux/quota.h                           | 10 +++++
 include/linux/uidgid.h                          |  4 +-
 kernel/cred.c                                   |  2 +
 security/integrity/evm/evm_crypto.c             |  4 +-
 15 files changed, 133 insertions(+), 51 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH review 0/12] General unprivileged mount support v2
@ 2016-07-06 18:10     ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:10 UTC (permalink / raw)
  To: Linux Containers
  Cc: linux-fsdevel, Linux API, James Bottomley, Djalal Harouni,
	Serge E. Hallyn, Andy Lutomirski, Jan Kara, Jann Horn,
	Michael Kerrisk, Seth Forshee


As well as in these patches the code is also available from:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

v2 is slightly simplified.  During the review I realized the previous
quota changes while perhaps not wrong were in a part of the code I don't
want people to think is ready for use on unprivileged filesystems yet.

It has been a long time in coming but recently in the userns tree the
superblock has been expanded with a s_user_ns field indicating the user
namespace that owns a superblock.

The s_user_ns owner of a superblock has three implications.
- Only kuids and kgids that map into s_user_ns are allowed to be sent to a
 filesystem from the vfs.
- If the uid or gid on the filesystem does not map into s_user_ns i_uid
  is set to INVALID_UID and i_gid is set to INVALID_GID.
- The scope of permission checks can be changed from global to a
  capabilitiy check in s_user_ns.

The overall strategy is to handle as much of this as possible in the VFS
so that what is happening is consistent between filesystems and has
widespread review, and so that individual filesystems don't need to
duplicate code.

This set of patches inserts checks to ensure only kuids and kgids that
map into s_user_ns are sent to filesystems.

This set of patches updates the vfs to deal with potentially unmapped
uids and gids in the i_uid and i_gid fields.  The strategy adopted is to
deny any activity that causes inodes with unmapped uids or gid to be
written to disk except for a chown that causes makes i_uid and i_gid
to map into s_user_ns.

Relaxing of the capability checks and adding new filesystems that would
benefit from the changes is held off until the vfs support is complete.

I believe this work is complete so if there anything questionable you
see please let me know.

I have included linux-api because several system calls get new failure
modes mostly -EOVERFLOW, and that may need separate documenation and
review.

The target for this work is to enable fully unprivileged fuse mounts
and whichever filesystem results from the uid shifting work.  These
vfs changes should support all kinds of filesystems, but in practice
it is an open problem if it is possible to modify a block based
filesystem to be safe from people manipulating filesystem images in an
attempt to get the kernel to malfunction so only a very limited set of
additional filesystems is ever expected to be enabled by this work.

Eric W. Biederman (7):
      userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
      vfs: Verify acls are valid within superblock's s_user_ns.
      vfs: Don't modify inodes with a uid or gid unknown to the vfs
      vfs: Don't create inodes with a uid or gid unknown to the vfs
      quota: Ensure qids map to the filesystem
      quota: Handle quota data stored in s_user_ns in quota_setxquota
      dquot: For now explicitly don't support filesystems outside of init_user_ns

Seth Forshee (5):
      fs: Refuse uid/gid changes which don't map into s_user_ns
      fs: Check for invalid i_uid in may_follow_link()
      cred: Reject inodes with invalid ids in set_create_file_as()
      evm: Translate user/group ids relative to s_user_ns when computing HMAC
      fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns


 drivers/staging/lustre/lustre/mdc/mdc_request.c |  2 +-
 fs/9p/acl.c                                     |  2 +-
 fs/attr.c                                       | 19 +++++++++
 fs/inode.c                                      |  7 ++++
 fs/namei.c                                      | 40 ++++++++++++++----
 fs/posix_acl.c                                  |  8 ++--
 fs/quota/dquot.c                                |  8 ++++
 fs/quota/quota.c                                | 14 +++----
 fs/xattr.c                                      |  7 ++++
 include/linux/fs.h                              | 55 ++++++++++++++-----------
 include/linux/posix_acl.h                       |  2 +-
 include/linux/quota.h                           | 10 +++++
 include/linux/uidgid.h                          |  4 +-
 kernel/cred.c                                   |  2 +
 security/integrity/evm/evm_crypto.c             |  4 +-
 15 files changed, 133 insertions(+), 51 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH review 01/12] fs: Refuse uid/gid changes which don't map into s_user_ns
  2016-07-06 18:10     ` Eric W. Biederman
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Add checks to notify_change to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW.

This is mandatory so that fileystems don't have to even think
of dealing with ia_uid and ia_gid that

--EWB Moved the test from inode_change_ok to notify_change

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0f6c88..dd723578ddce 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -255,6 +255,17 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
 		return 0;
 
+	/*
+	 * Verify that uid/gid changes are valid in the target
+	 * namespace of the superblock.
+	 */
+	if (ia_valid & ATTR_UID &&
+	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+		return -EOVERFLOW;
+	if (ia_valid & ATTR_GID &&
+	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 01/12] fs: Refuse uid/gid changes which don't map into s_user_ns
@ 2016-07-06 18:12         ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

Add checks to notify_change to verify that uid and gid changes
will map into the superblock's user namespace. If they do not
fail with -EOVERFLOW.

This is mandatory so that fileystems don't have to even think
of dealing with ia_uid and ia_gid that

--EWB Moved the test from inode_change_ok to notify_change

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/attr.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 25b24d0f6c88..dd723578ddce 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -255,6 +255,17 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
 		return 0;
 
+	/*
+	 * Verify that uid/gid changes are valid in the target
+	 * namespace of the superblock.
+	 */
+	if (ia_valid & ATTR_UID &&
+	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+		return -EOVERFLOW;
+	if (ia_valid & ATTR_GID &&
+	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 02/12] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                             ` (9 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

Refuse to admit any user namespace has a mapping of the INVALID_UID
and the INVALID_GID when !CONFIG_USER_NS.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/uidgid.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h
index 03835522dfcb..25e9d9216340 100644
--- a/include/linux/uidgid.h
+++ b/include/linux/uidgid.h
@@ -177,12 +177,12 @@ static inline gid_t from_kgid_munged(struct user_namespace *to, kgid_t kgid)
 
 static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
 {
-	return true;
+	return uid_valid(uid);
 }
 
 static inline bool kgid_has_mapping(struct user_namespace *ns, kgid_t gid)
 {
-	return true;
+	return gid_valid(gid);
 }
 
 #endif /* CONFIG_USER_NS */
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 02/12] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
  2016-07-06 18:12         ` Eric W. Biederman
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Refuse to admit any user namespace has a mapping of the INVALID_UID
and the INVALID_GID when !CONFIG_USER_NS.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/uidgid.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/uidgid.h b/include/linux/uidgid.h
index 03835522dfcb..25e9d9216340 100644
--- a/include/linux/uidgid.h
+++ b/include/linux/uidgid.h
@@ -177,12 +177,12 @@ static inline gid_t from_kgid_munged(struct user_namespace *to, kgid_t kgid)
 
 static inline bool kuid_has_mapping(struct user_namespace *ns, kuid_t uid)
 {
-	return true;
+	return uid_valid(uid);
 }
 
 static inline bool kgid_has_mapping(struct user_namespace *ns, kgid_t gid)
 {
-	return true;
+	return gid_valid(gid);
 }
 
 #endif /* CONFIG_USER_NS */
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns.
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2016-07-06 18:12           ` [PATCH review 02/12] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 04/12] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
                             ` (8 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

Update posix_acl_valid to verify that an acl is within a user namespace.

Update the callers of posix_acl_valid to pass in an appropriate
user namespace.  For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid.  For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 2 +-
 fs/9p/acl.c                                     | 2 +-
 fs/posix_acl.c                                  | 8 ++++----
 include/linux/posix_acl.h                       | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 86b7445365f4..40cf57fad581 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -434,7 +434,7 @@ static int mdc_unpack_acl(struct ptlrpc_request *req, struct lustre_md *md)
 		return rc;
 	}
 
-	rc = posix_acl_valid(acl);
+	rc = posix_acl_valid(&init_user_ns, acl);
 	if (rc) {
 		CERROR("validate acl: %d\n", rc);
 		posix_acl_release(acl);
diff --git a/fs/9p/acl.c b/fs/9p/acl.c
index 0576eaeb60b9..5b6a1743ea17 100644
--- a/fs/9p/acl.c
+++ b/fs/9p/acl.c
@@ -266,7 +266,7 @@ static int v9fs_xattr_set_acl(const struct xattr_handler *handler,
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 		else if (acl) {
-			retval = posix_acl_valid(acl);
+			retval = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (retval)
 				goto err_out;
 		}
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 8a4a266beff3..647c28180675 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -205,7 +205,7 @@ posix_acl_clone(const struct posix_acl *acl, gfp_t flags)
  * Check if an acl is valid. Returns 0 if it is, or -E... otherwise.
  */
 int
-posix_acl_valid(const struct posix_acl *acl)
+posix_acl_valid(struct user_namespace *user_ns, const struct posix_acl *acl)
 {
 	const struct posix_acl_entry *pa, *pe;
 	int state = ACL_USER_OBJ;
@@ -225,7 +225,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_USER:
 				if (state != ACL_USER)
 					return -EINVAL;
-				if (!uid_valid(pa->e_uid))
+				if (!kuid_has_mapping(user_ns, pa->e_uid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -240,7 +240,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_GROUP:
 				if (state != ACL_GROUP)
 					return -EINVAL;
-				if (!gid_valid(pa->e_gid))
+				if (!kgid_has_mapping(user_ns, pa->e_gid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -845,7 +845,7 @@ posix_acl_xattr_set(const struct xattr_handler *handler,
 			return PTR_ERR(acl);
 
 		if (acl) {
-			ret = posix_acl_valid(acl);
+			ret = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (ret)
 				goto out;
 		}
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 5b5a80cc5926..1af6438fde3e 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -81,7 +81,7 @@ posix_acl_release(struct posix_acl *acl)
 
 extern void posix_acl_init(struct posix_acl *, int);
 extern struct posix_acl *posix_acl_alloc(int, gfp_t);
-extern int posix_acl_valid(const struct posix_acl *);
+extern int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
 extern int posix_acl_permission(struct inode *, const struct posix_acl *, int);
 extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
 extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns.
  2016-07-06 18:12         ` Eric W. Biederman
  (?)
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Update posix_acl_valid to verify that an acl is within a user namespace.

Update the callers of posix_acl_valid to pass in an appropriate
user namespace.  For posix_acl_xattr_set and v9fs_xattr_set_acl pass in
inode->i_sb->s_user_ns to posix_acl_valid.  For md_unpack_acl pass in
&init_user_ns as no inode or superblock is in sight.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 2 +-
 fs/9p/acl.c                                     | 2 +-
 fs/posix_acl.c                                  | 8 ++++----
 include/linux/posix_acl.h                       | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 86b7445365f4..40cf57fad581 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -434,7 +434,7 @@ static int mdc_unpack_acl(struct ptlrpc_request *req, struct lustre_md *md)
 		return rc;
 	}
 
-	rc = posix_acl_valid(acl);
+	rc = posix_acl_valid(&init_user_ns, acl);
 	if (rc) {
 		CERROR("validate acl: %d\n", rc);
 		posix_acl_release(acl);
diff --git a/fs/9p/acl.c b/fs/9p/acl.c
index 0576eaeb60b9..5b6a1743ea17 100644
--- a/fs/9p/acl.c
+++ b/fs/9p/acl.c
@@ -266,7 +266,7 @@ static int v9fs_xattr_set_acl(const struct xattr_handler *handler,
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 		else if (acl) {
-			retval = posix_acl_valid(acl);
+			retval = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (retval)
 				goto err_out;
 		}
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 8a4a266beff3..647c28180675 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -205,7 +205,7 @@ posix_acl_clone(const struct posix_acl *acl, gfp_t flags)
  * Check if an acl is valid. Returns 0 if it is, or -E... otherwise.
  */
 int
-posix_acl_valid(const struct posix_acl *acl)
+posix_acl_valid(struct user_namespace *user_ns, const struct posix_acl *acl)
 {
 	const struct posix_acl_entry *pa, *pe;
 	int state = ACL_USER_OBJ;
@@ -225,7 +225,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_USER:
 				if (state != ACL_USER)
 					return -EINVAL;
-				if (!uid_valid(pa->e_uid))
+				if (!kuid_has_mapping(user_ns, pa->e_uid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -240,7 +240,7 @@ posix_acl_valid(const struct posix_acl *acl)
 			case ACL_GROUP:
 				if (state != ACL_GROUP)
 					return -EINVAL;
-				if (!gid_valid(pa->e_gid))
+				if (!kgid_has_mapping(user_ns, pa->e_gid))
 					return -EINVAL;
 				needs_mask = 1;
 				break;
@@ -845,7 +845,7 @@ posix_acl_xattr_set(const struct xattr_handler *handler,
 			return PTR_ERR(acl);
 
 		if (acl) {
-			ret = posix_acl_valid(acl);
+			ret = posix_acl_valid(inode->i_sb->s_user_ns, acl);
 			if (ret)
 				goto out;
 		}
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 5b5a80cc5926..1af6438fde3e 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -81,7 +81,7 @@ posix_acl_release(struct posix_acl *acl)
 
 extern void posix_acl_init(struct posix_acl *, int);
 extern struct posix_acl *posix_acl_alloc(int, gfp_t);
-extern int posix_acl_valid(const struct posix_acl *);
+extern int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
 extern int posix_acl_permission(struct inode *, const struct posix_acl *, int);
 extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
 extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 04/12] fs: Check for invalid i_uid in may_follow_link()
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2016-07-06 18:12           ` [PATCH review 02/12] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 05/12] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
                             ` (7 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 757a32725d92..8701bd9a5270 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -901,6 +901,7 @@ static inline int may_follow_link(struct nameidata *nd)
 {
 	const struct inode *inode;
 	const struct inode *parent;
+	kuid_t puid;
 
 	if (!sysctl_protected_symlinks)
 		return 0;
@@ -916,7 +917,8 @@ static inline int may_follow_link(struct nameidata *nd)
 		return 0;
 
 	/* Allowed if parent directory and link owner match. */
-	if (uid_eq(parent->i_uid, inode->i_uid))
+	puid = parent->i_uid;
+	if (uid_valid(puid) && uid_eq(puid, inode->i_uid))
 		return 0;
 
 	if (nd->flags & LOOKUP_RCU)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 04/12] fs: Check for invalid i_uid in may_follow_link()
  2016-07-06 18:12         ` Eric W. Biederman
                           ` (2 preceding siblings ...)
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

Filesystem uids which don't map into a user namespace may result
in inode->i_uid being INVALID_UID. A symlink and its parent
could have different owners in the filesystem can both get
mapped to INVALID_UID, which may result in following a symlink
when this would not have otherwise been permitted when protected
symlinks are enabled.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 757a32725d92..8701bd9a5270 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -901,6 +901,7 @@ static inline int may_follow_link(struct nameidata *nd)
 {
 	const struct inode *inode;
 	const struct inode *parent;
+	kuid_t puid;
 
 	if (!sysctl_protected_symlinks)
 		return 0;
@@ -916,7 +917,8 @@ static inline int may_follow_link(struct nameidata *nd)
 		return 0;
 
 	/* Allowed if parent directory and link owner match. */
-	if (uid_eq(parent->i_uid, inode->i_uid))
+	puid = parent->i_uid;
+	if (uid_valid(puid) && uid_eq(puid, inode->i_uid))
 		return 0;
 
 	if (nd->flags & LOOKUP_RCU)
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 05/12] cred: Reject inodes with invalid ids in set_create_file_as()
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (2 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 04/12] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
                             ` (6 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/cred.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index 0c0cd8a62285..5f264fb5737d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -689,6 +689,8 @@ EXPORT_SYMBOL(set_security_override_from_ctx);
  */
 int set_create_files_as(struct cred *new, struct inode *inode)
 {
+	if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+		return -EINVAL;
 	new->fsuid = inode->i_uid;
 	new->fsgid = inode->i_gid;
 	return security_kernel_create_files_as(new, inode);
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 05/12] cred: Reject inodes with invalid ids in set_create_file_as()
  2016-07-06 18:12         ` Eric W. Biederman
                           ` (3 preceding siblings ...)
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

Using INVALID_[UG]ID for the LSM file creation context doesn't
make sense, so return an error if the inode passed to
set_create_file_as() has an invalid id.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/cred.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cred.c b/kernel/cred.c
index 0c0cd8a62285..5f264fb5737d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -689,6 +689,8 @@ EXPORT_SYMBOL(set_security_override_from_ctx);
  */
 int set_create_files_as(struct cred *new, struct inode *inode)
 {
+	if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+		return -EINVAL;
 	new->fsuid = inode->i_uid;
 	new->fsgid = inode->i_gid;
 	return security_kernel_create_files_as(new, inode);
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (3 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 05/12] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 07/12] vfs: Don't create " Eric W. Biederman
                             ` (5 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk.  In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set.  Reads of the inodes appear safe and
useful, but any write or modification is disallowed.  The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values.  After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr.  As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed.  evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c          |  8 ++++++++
 fs/inode.c         |  7 +++++++
 fs/namei.c         | 26 +++++++++++++++++++++-----
 fs/xattr.c         |  7 +++++++
 include/linux/fs.h |  5 +++++
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index dd723578ddce..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -266,6 +266,14 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
 		return -EOVERFLOW;
 
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21b30ce..c0ebb97fb085 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1617,6 +1617,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/namei.c b/fs/namei.c
index 8701bd9a5270..840201c4c290 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -410,6 +410,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -2759,10 +2767,11 @@ EXPORT_SYMBOL(__check_sticky);
  *	c. have CAP_FOWNER capability
  *  6. If the victim is append-only or immutable we can't do antyhing with
  *     links pointing to it.
- *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ *  7. If the victim has an unknown uid or gid we can't change the inode.
+ *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2784,7 +2793,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -4190,6 +4199,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 375e37f42cdf..cb25ceb6d1ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,11 @@ struct super_operations {
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
 
+static inline bool HAS_UNMAPPED_ID(struct inode *inode)
+{
+	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                             ` (9 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk.  In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set.  Reads of the inodes appear safe and
useful, but any write or modification is disallowed.  The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values.  After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr.  As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed.  evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/attr.c          |  8 ++++++++
 fs/inode.c         |  7 +++++++
 fs/namei.c         | 26 +++++++++++++++++++++-----
 fs/xattr.c         |  7 +++++++
 include/linux/fs.h |  5 +++++
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index dd723578ddce..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -266,6 +266,14 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
 		return -EOVERFLOW;
 
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21b30ce..c0ebb97fb085 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1617,6 +1617,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/namei.c b/fs/namei.c
index 8701bd9a5270..840201c4c290 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -410,6 +410,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -2759,10 +2767,11 @@ EXPORT_SYMBOL(__check_sticky);
  *	c. have CAP_FOWNER capability
  *  6. If the victim is append-only or immutable we can't do antyhing with
  *     links pointing to it.
- *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ *  7. If the victim has an unknown uid or gid we can't change the inode.
+ *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2784,7 +2793,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -4190,6 +4199,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 375e37f42cdf..cb25ceb6d1ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,11 @@ struct super_operations {
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
 
+static inline bool HAS_UNMAPPED_ID(struct inode *inode)
+{
+	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs
@ 2016-07-06 18:12           ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

When a filesystem outside of init_user_ns is mounted it could have
uids and gids stored in it that do not map to init_user_ns.

The plan is to allow those filesystems to set i_uid to INVALID_UID and
i_gid to INVALID_GID for unmapped uids and gids and then to handle
that strange case in the vfs to ensure there is consistent robust
handling of the weirdness.

Upon a careful review of the vfs and filesystems about the only case
where there is any possibility of confusion or trouble is when the
inode is written back to disk.  In that case filesystems typically
read the inode->i_uid and inode->i_gid and write them to disk even
when just an inode timestamp is being updated.

Which leads to a rule that is very simple to implement and understand
inodes whose i_uid or i_gid is not valid may not be written.

In dealing with access times this means treat those inodes as if the
inode flag S_NOATIME was set.  Reads of the inodes appear safe and
useful, but any write or modification is disallowed.  The only inode
write that is allowed is a chown that sets the uid and gid on the
inode to valid values.  After such a chown the inode is normal and may
be treated as such.

Denying all writes to inodes with uids or gids unknown to the vfs also
prevents several oddball cases where corruption would have occurred
because the vfs does not have complete information.

One problem case that is prevented is attempting to use the gid of a
directory for new inodes where the directories sgid bit is set but the
directories gid is not mapped.

Another problem case avoided is attempting to update the evm hash
after setxattr, removexattr, and setattr.  As the evm hash includeds
the inode->i_uid or inode->i_gid not knowning the uid or gid prevents
a correct evm hash from being computed.  evm hash verification also
fails when i_uid or i_gid is unknown but that is essentially harmless
as it does not cause filesystem corruption.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/attr.c          |  8 ++++++++
 fs/inode.c         |  7 +++++++
 fs/namei.c         | 26 +++++++++++++++++++++-----
 fs/xattr.c         |  7 +++++++
 include/linux/fs.h |  5 +++++
 5 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index dd723578ddce..42bb42bb3c72 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -266,6 +266,14 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
 		return -EOVERFLOW;
 
+	/* Don't allow modifications of files with invalid uids or
+	 * gids unless those uids & gids are being made valid.
+	 */
+	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
+		return -EOVERFLOW;
+	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
+		return -EOVERFLOW;
+
 	error = security_inode_setattr(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/inode.c b/fs/inode.c
index 4ccbc21b30ce..c0ebb97fb085 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1617,6 +1617,13 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 
 	if (inode->i_flags & S_NOATIME)
 		return false;
+
+	/* Atime updates will likely cause i_uid and i_gid to be written
+	 * back improprely if their true value is unknown to the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return false;
+
 	if (IS_NOATIME(inode))
 		return false;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
diff --git a/fs/namei.c b/fs/namei.c
index 8701bd9a5270..840201c4c290 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -410,6 +410,14 @@ int __inode_permission(struct inode *inode, int mask)
 		 */
 		if (IS_IMMUTABLE(inode))
 			return -EACCES;
+
+		/*
+		 * Updating mtime will likely cause i_uid and i_gid to be
+		 * written back improperly if their true value is unknown
+		 * to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EACCES;
 	}
 
 	retval = do_inode_permission(inode, mask);
@@ -2759,10 +2767,11 @@ EXPORT_SYMBOL(__check_sticky);
  *	c. have CAP_FOWNER capability
  *  6. If the victim is append-only or immutable we can't do antyhing with
  *     links pointing to it.
- *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
- *  8. If we were asked to remove a non-directory and victim isn't one - EISDIR.
- *  9. We can't remove a root or mountpoint.
- * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ *  7. If the victim has an unknown uid or gid we can't change the inode.
+ *  8. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  9. If we were asked to remove a non-directory and victim isn't one - EISDIR.
+ * 10. We can't remove a root or mountpoint.
+ * 11. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
 static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
@@ -2784,7 +2793,7 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 		return -EPERM;
 
 	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
-	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode))
+	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
 		return -EPERM;
 	if (isdir) {
 		if (!d_is_dir(victim))
@@ -4190,6 +4199,13 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	 */
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
+	/*
+	 * Updating the link count will likely cause i_uid and i_gid to
+	 * be writen back improperly if their true value is unknown to
+	 * the vfs.
+	 */
+	if (HAS_UNMAPPED_ID(inode))
+		return -EPERM;
 	if (!dir->i_op->link)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
diff --git a/fs/xattr.c b/fs/xattr.c
index 4beafc43daa5..c243905835ab 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -38,6 +38,13 @@ xattr_permission(struct inode *inode, const char *name, int mask)
 	if (mask & MAY_WRITE) {
 		if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
 			return -EPERM;
+		/*
+		 * Updating an xattr will likely cause i_uid and i_gid
+		 * to be writen back improperly if their true value is
+		 * unknown to the vfs.
+		 */
+		if (HAS_UNMAPPED_ID(inode))
+			return -EPERM;
 	}
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 375e37f42cdf..cb25ceb6d1ef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1874,6 +1874,11 @@ struct super_operations {
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
 
+static inline bool HAS_UNMAPPED_ID(struct inode *inode)
+{
+	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 07/12] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (4 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 08/12] quota: Ensure qids map to the filesystem Eric W. Biederman
                             ` (4 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

It is expected that filesystems can not represent uids and gids from
outside of their user namespace.  Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 840201c4c290..629823f19a6a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2814,16 +2814,22 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  *  1. We can't do it if child already exists (open has special treatment for
  *     this case, but since we are inlined it's OK)
  *  2. We can't do it if dir is read-only (done in permission())
- *  3. We should have write and exec permissions on dir
- *  4. We can't do it if dir is immutable (done in permission())
+ *  3. We can't do it if the fs can't represent the fsuid or fsgid.
+ *  4. We should have write and exec permissions on dir
+ *  5. We can't do it if dir is immutable (done in permission())
  */
 static inline int may_create(struct inode *dir, struct dentry *child)
 {
+	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
 	if (child->d_inode)
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
+	s_user_ns = dir->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 07/12] vfs: Don't create inodes with a uid or gid unknown to the vfs
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                             ` (9 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

It is expected that filesystems can not represent uids and gids from
outside of their user namespace.  Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.

Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namei.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 840201c4c290..629823f19a6a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2814,16 +2814,22 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  *  1. We can't do it if child already exists (open has special treatment for
  *     this case, but since we are inlined it's OK)
  *  2. We can't do it if dir is read-only (done in permission())
- *  3. We should have write and exec permissions on dir
- *  4. We can't do it if dir is immutable (done in permission())
+ *  3. We can't do it if the fs can't represent the fsuid or fsgid.
+ *  4. We should have write and exec permissions on dir
+ *  5. We can't do it if dir is immutable (done in permission())
  */
 static inline int may_create(struct inode *dir, struct dentry *child)
 {
+	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
 	if (child->d_inode)
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
+	s_user_ns = dir->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
 
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 07/12] vfs: Don't create inodes with a uid or gid unknown to the vfs
@ 2016-07-06 18:12           ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

It is expected that filesystems can not represent uids and gids from
outside of their user namespace.  Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.

Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namei.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 840201c4c290..629823f19a6a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2814,16 +2814,22 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
  *  1. We can't do it if child already exists (open has special treatment for
  *     this case, but since we are inlined it's OK)
  *  2. We can't do it if dir is read-only (done in permission())
- *  3. We should have write and exec permissions on dir
- *  4. We can't do it if dir is immutable (done in permission())
+ *  3. We can't do it if the fs can't represent the fsuid or fsgid.
+ *  4. We should have write and exec permissions on dir
+ *  5. We can't do it if dir is immutable (done in permission())
  */
 static inline int may_create(struct inode *dir, struct dentry *child)
 {
+	struct user_namespace *s_user_ns;
 	audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
 	if (child->d_inode)
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
+	s_user_ns = dir->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
+	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 08/12] quota: Ensure qids map to the filesystem
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (5 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 07/12] vfs: Don't create " Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 09/12] quota: Handle quota data stored in s_user_ns in quota_setxquota Eric W. Biederman
                             ` (3 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

Introduce the helper qid_has_mapping and use it to ensure that the
quota system only considers qids that map to the filesystems
s_user_ns.

In practice for quota supporting filesystems today this is the exact
same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
map into init_user_ns.

Replace the qid_valid calls with qid_has_mapping as values come in
from userspace.  This is harmless today and it prepares the quota
system to work on filesystems with quotas but mounted by unprivileged
users.

Call qid_has_mapping from dqget.  This ensures the passed in qid has a
prepresentation on the underlying filesystem.  Previously this was
unnecessary as filesystesm never had qids that could not map.  With
the introduction of filesystems outside of s_user_ns this will not
remain true.

All of this ensures the quota code never has to deal with qids that
don't map to the underlying filesystem.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/dquot.c      |  3 +++
 fs/quota/quota.c      | 12 ++++++------
 include/linux/quota.h | 10 ++++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980d0119..74706b6aa747 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
 	unsigned int hashent = hashfn(sb, qid);
 	struct dquot *dquot, *empty = NULL;
 
+	if (!qid_has_mapping(sb->s_user_ns, qid))
+		return ERR_PTR(-EINVAL);
+
         if (!sb_has_quota_active(sb, qid.type))
 		return ERR_PTR(-ESRCH);
 we_slept:
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 0f10ee9892ce..73f6f4cf0a21 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
 	if (ret)
@@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
 	if (ret)
@@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	copy_from_if_dqblk(&fdq, &idq);
 	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
@@ -581,7 +581,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
 	if (from_kqid(&init_user_ns, qid) == 0 &&
@@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
 	if (ret)
@@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
 	if (ret)
diff --git a/include/linux/quota.h b/include/linux/quota.h
index 9dfb6bce8c9e..1db16ee39b31 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -179,6 +179,16 @@ static inline struct kqid make_kqid_projid(kprojid_t projid)
 	return kqid;
 }
 
+/**
+ *	qid_has_mapping - Report if a qid maps into a user namespace.
+ *	@ns:  The user namespace to see if a value maps into.
+ *	@qid: The kernel internal quota identifier to test.
+ */
+static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid)
+{
+	return from_kqid(ns, qid) != (qid_t) -1;
+}
+
 
 extern spinlock_t dq_data_lock;
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 08/12] quota: Ensure qids map to the filesystem
  2016-07-06 18:12         ` Eric W. Biederman
                           ` (6 preceding siblings ...)
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
       [not found]           ` <20160706181212.16267-8-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  -1 siblings, 1 reply; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Introduce the helper qid_has_mapping and use it to ensure that the
quota system only considers qids that map to the filesystems
s_user_ns.

In practice for quota supporting filesystems today this is the exact
same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
map into init_user_ns.

Replace the qid_valid calls with qid_has_mapping as values come in
from userspace.  This is harmless today and it prepares the quota
system to work on filesystems with quotas but mounted by unprivileged
users.

Call qid_has_mapping from dqget.  This ensures the passed in qid has a
prepresentation on the underlying filesystem.  Previously this was
unnecessary as filesystesm never had qids that could not map.  With
the introduction of filesystems outside of s_user_ns this will not
remain true.

All of this ensures the quota code never has to deal with qids that
don't map to the underlying filesystem.

Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/quota/dquot.c      |  3 +++
 fs/quota/quota.c      | 12 ++++++------
 include/linux/quota.h | 10 ++++++++++
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index ff21980d0119..74706b6aa747 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
 	unsigned int hashent = hashfn(sb, qid);
 	struct dquot *dquot, *empty = NULL;
 
+	if (!qid_has_mapping(sb->s_user_ns, qid))
+		return ERR_PTR(-EINVAL);
+
         if (!sb_has_quota_active(sb, qid.type))
 		return ERR_PTR(-ESRCH);
 we_slept:
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 0f10ee9892ce..73f6f4cf0a21 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
 	if (ret)
@@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
 	if (ret)
@@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	copy_from_if_dqblk(&fdq, &idq);
 	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
@@ -581,7 +581,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->set_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
 	if (from_kqid(&init_user_ns, qid) == 0 &&
@@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_dqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
 	if (ret)
@@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
 	if (!sb->s_qcop->get_nextdqblk)
 		return -ENOSYS;
 	qid = make_kqid(current_user_ns(), type, id);
-	if (!qid_valid(qid))
+	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
 	if (ret)
diff --git a/include/linux/quota.h b/include/linux/quota.h
index 9dfb6bce8c9e..1db16ee39b31 100644
--- a/include/linux/quota.h
+++ b/include/linux/quota.h
@@ -179,6 +179,16 @@ static inline struct kqid make_kqid_projid(kprojid_t projid)
 	return kqid;
 }
 
+/**
+ *	qid_has_mapping - Report if a qid maps into a user namespace.
+ *	@ns:  The user namespace to see if a value maps into.
+ *	@qid: The kernel internal quota identifier to test.
+ */
+static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid)
+{
+	return from_kqid(ns, qid) != (qid_t) -1;
+}
+
 
 extern spinlock_t dq_data_lock;
 
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 09/12] quota: Handle quota data stored in s_user_ns in quota_setxquota
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (6 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 08/12] quota: Ensure qids map to the filesystem Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns Eric W. Biederman
                             ` (2 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Inspired-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/quota.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 09/12] quota: Handle quota data stored in s_user_ns in quota_setxquota
  2016-07-06 18:12         ` Eric W. Biederman
                           ` (7 preceding siblings ...)
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
the filesystems notion of id 0.

Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/quota/quota.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 73f6f4cf0a21..35df08ee9c97 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -584,7 +584,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
 	if (!qid_has_mapping(sb->s_user_ns, qid))
 		return -EINVAL;
 	/* Are we actually setting timer / warning limits for all users? */
-	if (from_kqid(&init_user_ns, qid) == 0 &&
+	if (from_kqid(sb->s_user_ns, qid) == 0 &&
 	    fdq.d_fieldmask & (FS_DQ_WARNS_MASK | FS_DQ_TIMER_MASK)) {
 		struct qc_info qinfo;
 		int ret;
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (7 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 09/12] quota: Handle quota data stored in s_user_ns in quota_setxquota Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 12/12] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns Eric W. Biederman
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

Mostly supporting filesystems outside of init_user_ns is
s/&init_usre_ns/dquot->dq_sb->s_user_ns/.  An actual need for
supporting quotas on filesystems outside of s_user_ns is quite a ways
away and to be done responsibily needs an audit on what can happen
with hostile quota files.  Until that audit is complete don't attempt
to support quota files on filesystems outside of s_user_ns.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/dquot.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..87197d13cc76 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -2271,6 +2271,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
 		error = -EINVAL;
 		goto out_fmt;
 	}
+	/* Filesystems outside of init_user_ns not yet supported */
+	if (sb->s_user_ns != &init_user_ns) {
+		error = -EINVAL;
+		goto out_fmt;
+	}
 	/* Usage always has to be set... */
 	if (!(flags & DQUOT_USAGE_ENABLED)) {
 		error = -EINVAL;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                             ` (9 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

Mostly supporting filesystems outside of init_user_ns is
s/&init_usre_ns/dquot->dq_sb->s_user_ns/.  An actual need for
supporting quotas on filesystems outside of s_user_ns is quite a ways
away and to be done responsibily needs an audit on what can happen
with hostile quota files.  Until that audit is complete don't attempt
to support quota files on filesystems outside of s_user_ns.

Cc: Jan Kara <jack@suse.cz>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/quota/dquot.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..87197d13cc76 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -2271,6 +2271,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
 		error = -EINVAL;
 		goto out_fmt;
 	}
+	/* Filesystems outside of init_user_ns not yet supported */
+	if (sb->s_user_ns != &init_user_ns) {
+		error = -EINVAL;
+		goto out_fmt;
+	}
 	/* Usage always has to be set... */
 	if (!(flags & DQUOT_USAGE_ENABLED)) {
 		error = -EINVAL;
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns
@ 2016-07-06 18:12           ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

Mostly supporting filesystems outside of init_user_ns is
s/&init_usre_ns/dquot->dq_sb->s_user_ns/.  An actual need for
supporting quotas on filesystems outside of s_user_ns is quite a ways
away and to be done responsibily needs an audit on what can happen
with hostile quota files.  Until that audit is complete don't attempt
to support quota files on filesystems outside of s_user_ns.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/quota/dquot.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 74706b6aa747..87197d13cc76 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -2271,6 +2271,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
 		error = -EINVAL;
 		goto out_fmt;
 	}
+	/* Filesystems outside of init_user_ns not yet supported */
+	if (sb->s_user_ns != &init_user_ns) {
+		error = -EINVAL;
+		goto out_fmt;
+	}
 	/* Usage always has to be set... */
 	if (!(flags & DQUOT_USAGE_ENABLED)) {
 		error = -EINVAL;
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (8 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 12/12] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns Eric W. Biederman
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The EVM HMAC should be calculated using the on disk user and
group ids, so the k[ug]ids in the inode must be translated
relative to the s_user_ns of the inode's super block.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 security/integrity/evm/evm_crypto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index 30b6b7d0429f..11c1d30bd705 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -151,8 +151,8 @@ static void hmac_add_misc(struct shash_desc *desc, struct inode *inode,
 	memset(&hmac_misc, 0, sizeof(hmac_misc));
 	hmac_misc.ino = inode->i_ino;
 	hmac_misc.generation = inode->i_generation;
-	hmac_misc.uid = from_kuid(&init_user_ns, inode->i_uid);
-	hmac_misc.gid = from_kgid(&init_user_ns, inode->i_gid);
+	hmac_misc.uid = from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+	hmac_misc.gid = from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
 	hmac_misc.mode = inode->i_mode;
 	crypto_shash_update(desc, (const u8 *)&hmac_misc, sizeof(hmac_misc));
 	if (evm_hmac_attrs & EVM_ATTR_FSUUID)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-06 18:12           ` Eric W. Biederman
  2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
                             ` (9 subsequent siblings)
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

The EVM HMAC should be calculated using the on disk user and
group ids, so the k[ug]ids in the inode must be translated
relative to the s_user_ns of the inode's super block.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 security/integrity/evm/evm_crypto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index 30b6b7d0429f..11c1d30bd705 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -151,8 +151,8 @@ static void hmac_add_misc(struct shash_desc *desc, struct inode *inode,
 	memset(&hmac_misc, 0, sizeof(hmac_misc));
 	hmac_misc.ino = inode->i_ino;
 	hmac_misc.generation = inode->i_generation;
-	hmac_misc.uid = from_kuid(&init_user_ns, inode->i_uid);
-	hmac_misc.gid = from_kgid(&init_user_ns, inode->i_gid);
+	hmac_misc.uid = from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+	hmac_misc.gid = from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
 	hmac_misc.mode = inode->i_mode;
 	crypto_shash_update(desc, (const u8 *)&hmac_misc, sizeof(hmac_misc));
 	if (evm_hmac_attrs & EVM_ATTR_FSUUID)
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC
@ 2016-07-06 18:12           ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

The EVM HMAC should be calculated using the on disk user and
group ids, so the k[ug]ids in the inode must be translated
relative to the s_user_ns of the inode's super block.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 security/integrity/evm/evm_crypto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index 30b6b7d0429f..11c1d30bd705 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -151,8 +151,8 @@ static void hmac_add_misc(struct shash_desc *desc, struct inode *inode,
 	memset(&hmac_misc, 0, sizeof(hmac_misc));
 	hmac_misc.ino = inode->i_ino;
 	hmac_misc.generation = inode->i_generation;
-	hmac_misc.uid = from_kuid(&init_user_ns, inode->i_uid);
-	hmac_misc.gid = from_kgid(&init_user_ns, inode->i_gid);
+	hmac_misc.uid = from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+	hmac_misc.gid = from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
 	hmac_misc.mode = inode->i_mode;
 	crypto_shash_update(desc, (const u8 *)&hmac_misc, sizeof(hmac_misc));
 	if (evm_hmac_attrs & EVM_ATTR_FSUUID)
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 12/12] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
       [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (9 preceding siblings ...)
  2016-07-06 18:12           ` [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
@ 2016-07-06 18:12           ` Eric W. Biederman
  10 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Jan Kara, Linux API, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

For filesystems mounted from a user namespace on-disk ids should
be translated relative to s_users_ns rather than init_user_ns.

When an id in the filesystem doesn't exist in s_user_ns the
associated id in the inode will be set to INVALID_[UG]ID, which
turns these into de facto "nobody" ids. This actually maps pretty
well into the way most code already works, and those places where
it didn't were fixed in previous patches. Moving forward vfs code
needs to be careful to handle instances where ids in inodes may
be invalid.

Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/fs.h | 50 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb25ceb6d1ef..8aa9b72e0bc5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -831,31 +831,6 @@ static inline void i_size_write(struct inode *inode, loff_t i_size)
 #endif
 }
 
-/* Helper functions so that in most cases filesystems will
- * not need to deal directly with kuid_t and kgid_t and can
- * instead deal with the raw numeric values that are stored
- * in the filesystem.
- */
-static inline uid_t i_uid_read(const struct inode *inode)
-{
-	return from_kuid(&init_user_ns, inode->i_uid);
-}
-
-static inline gid_t i_gid_read(const struct inode *inode)
-{
-	return from_kgid(&init_user_ns, inode->i_gid);
-}
-
-static inline void i_uid_write(struct inode *inode, uid_t uid)
-{
-	inode->i_uid = make_kuid(&init_user_ns, uid);
-}
-
-static inline void i_gid_write(struct inode *inode, gid_t gid)
-{
-	inode->i_gid = make_kgid(&init_user_ns, gid);
-}
-
 static inline unsigned iminor(const struct inode *inode)
 {
 	return MINOR(inode->i_rdev);
@@ -1461,6 +1436,31 @@ struct super_block {
 	struct list_head	s_inodes;	/* all inodes */
 };
 
+/* Helper functions so that in most cases filesystems will
+ * not need to deal directly with kuid_t and kgid_t and can
+ * instead deal with the raw numeric values that are stored
+ * in the filesystem.
+ */
+static inline uid_t i_uid_read(const struct inode *inode)
+{
+	return from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+}
+
+static inline gid_t i_gid_read(const struct inode *inode)
+{
+	return from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
+}
+
+static inline void i_uid_write(struct inode *inode, uid_t uid)
+{
+	inode->i_uid = make_kuid(inode->i_sb->s_user_ns, uid);
+}
+
+static inline void i_gid_write(struct inode *inode, gid_t gid)
+{
+	inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
+}
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH review 12/12] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
  2016-07-06 18:12         ` Eric W. Biederman
                           ` (11 preceding siblings ...)
  (?)
@ 2016-07-06 18:12         ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-06 18:12 UTC (permalink / raw)
  To: Linux Containers
  Cc: Seth Forshee, linux-fsdevel, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jan Kara,
	Jann Horn, Michael Kerrisk

From: Seth Forshee <seth.forshee@canonical.com>

For filesystems mounted from a user namespace on-disk ids should
be translated relative to s_users_ns rather than init_user_ns.

When an id in the filesystem doesn't exist in s_user_ns the
associated id in the inode will be set to INVALID_[UG]ID, which
turns these into de facto "nobody" ids. This actually maps pretty
well into the way most code already works, and those places where
it didn't were fixed in previous patches. Moving forward vfs code
needs to be careful to handle instances where ids in inodes may
be invalid.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/fs.h | 50 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cb25ceb6d1ef..8aa9b72e0bc5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -831,31 +831,6 @@ static inline void i_size_write(struct inode *inode, loff_t i_size)
 #endif
 }
 
-/* Helper functions so that in most cases filesystems will
- * not need to deal directly with kuid_t and kgid_t and can
- * instead deal with the raw numeric values that are stored
- * in the filesystem.
- */
-static inline uid_t i_uid_read(const struct inode *inode)
-{
-	return from_kuid(&init_user_ns, inode->i_uid);
-}
-
-static inline gid_t i_gid_read(const struct inode *inode)
-{
-	return from_kgid(&init_user_ns, inode->i_gid);
-}
-
-static inline void i_uid_write(struct inode *inode, uid_t uid)
-{
-	inode->i_uid = make_kuid(&init_user_ns, uid);
-}
-
-static inline void i_gid_write(struct inode *inode, gid_t gid)
-{
-	inode->i_gid = make_kgid(&init_user_ns, gid);
-}
-
 static inline unsigned iminor(const struct inode *inode)
 {
 	return MINOR(inode->i_rdev);
@@ -1461,6 +1436,31 @@ struct super_block {
 	struct list_head	s_inodes;	/* all inodes */
 };
 
+/* Helper functions so that in most cases filesystems will
+ * not need to deal directly with kuid_t and kgid_t and can
+ * instead deal with the raw numeric values that are stored
+ * in the filesystem.
+ */
+static inline uid_t i_uid_read(const struct inode *inode)
+{
+	return from_kuid(inode->i_sb->s_user_ns, inode->i_uid);
+}
+
+static inline gid_t i_gid_read(const struct inode *inode)
+{
+	return from_kgid(inode->i_sb->s_user_ns, inode->i_gid);
+}
+
+static inline void i_uid_write(struct inode *inode, uid_t uid)
+{
+	inode->i_uid = make_kuid(inode->i_sb->s_user_ns, uid);
+}
+
+static inline void i_gid_write(struct inode *inode, gid_t gid)
+{
+	inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
+}
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
2.8.3


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns
       [not found]           ` <20160706181212.16267-10-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2016-07-11 10:09             ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-11 10:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

On Wed 06-07-16 13:12:10, Eric W. Biederman wrote:
> Mostly supporting filesystems outside of init_user_ns is
> s/&init_usre_ns/dquot->dq_sb->s_user_ns/.  An actual need for
> supporting quotas on filesystems outside of s_user_ns is quite a ways
> away and to be done responsibily needs an audit on what can happen
> with hostile quota files.  Until that audit is complete don't attempt
> to support quota files on filesystems outside of s_user_ns.
> 
> Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Looks good. You can add:

Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>

								Honza

> ---
>  fs/quota/dquot.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 74706b6aa747..87197d13cc76 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2271,6 +2271,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>  		error = -EINVAL;
>  		goto out_fmt;
>  	}
> +	/* Filesystems outside of init_user_ns not yet supported */
> +	if (sb->s_user_ns != &init_user_ns) {
> +		error = -EINVAL;
> +		goto out_fmt;
> +	}
>  	/* Usage always has to be set... */
>  	if (!(flags & DQUOT_USAGE_ENABLED)) {
>  		error = -EINVAL;
> -- 
> 2.8.3
> 
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns
  2016-07-06 18:12           ` Eric W. Biederman
  (?)
@ 2016-07-11 10:09           ` Jan Kara
  -1 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-11 10:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Seth Forshee, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

On Wed 06-07-16 13:12:10, Eric W. Biederman wrote:
> Mostly supporting filesystems outside of init_user_ns is
> s/&init_usre_ns/dquot->dq_sb->s_user_ns/.  An actual need for
> supporting quotas on filesystems outside of s_user_ns is quite a ways
> away and to be done responsibily needs an audit on what can happen
> with hostile quota files.  Until that audit is complete don't attempt
> to support quota files on filesystems outside of s_user_ns.
> 
> Cc: Jan Kara <jack@suse.cz>
> Acked-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Looks good. You can add:

Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/quota/dquot.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 74706b6aa747..87197d13cc76 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2271,6 +2271,11 @@ static int vfs_load_quota_inode(struct inode *inode, int type, int format_id,
>  		error = -EINVAL;
>  		goto out_fmt;
>  	}
> +	/* Filesystems outside of init_user_ns not yet supported */
> +	if (sb->s_user_ns != &init_user_ns) {
> +		error = -EINVAL;
> +		goto out_fmt;
> +	}
>  	/* Usage always has to be set... */
>  	if (!(flags & DQUOT_USAGE_ENABLED)) {
>  		error = -EINVAL;
> -- 
> 2.8.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
  2016-07-06 18:12         ` [PATCH review 08/12] quota: Ensure qids map to the filesystem Eric W. Biederman
@ 2016-07-11 10:14               ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-11 10:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

On Wed 06-07-16 13:12:08, Eric W. Biederman wrote:
> Introduce the helper qid_has_mapping and use it to ensure that the
> quota system only considers qids that map to the filesystems
> s_user_ns.
> 
> In practice for quota supporting filesystems today this is the exact
> same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
> map into init_user_ns.
> 
> Replace the qid_valid calls with qid_has_mapping as values come in
> from userspace.  This is harmless today and it prepares the quota
> system to work on filesystems with quotas but mounted by unprivileged
> users.
> 
> Call qid_has_mapping from dqget.  This ensures the passed in qid has a
> prepresentation on the underlying filesystem.  Previously this was
> unnecessary as filesystesm never had qids that could not map.  With
> the introduction of filesystems outside of s_user_ns this will not
> remain true.
> 
> All of this ensures the quota code never has to deal with qids that
> don't map to the underlying filesystem.

Does this and the following patch make sense when we actually don't allow
quotas when s_user_ns is set? If things are untested, they will get broken
so I think it is wiser to do the conversion only once we decide quota
should be supported for namespaces != init_user_ns. Or do I miss something?

								Honza

> 
> Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> Acked-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/quota/dquot.c      |  3 +++
>  fs/quota/quota.c      | 12 ++++++------
>  include/linux/quota.h | 10 ++++++++++
>  3 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index ff21980d0119..74706b6aa747 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
>  	unsigned int hashent = hashfn(sb, qid);
>  	struct dquot *dquot, *empty = NULL;
>  
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
> +		return ERR_PTR(-EINVAL);
> +
>          if (!sb_has_quota_active(sb, qid.type))
>  		return ERR_PTR(-ESRCH);
>  we_slept:
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 0f10ee9892ce..73f6f4cf0a21 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
>  	if (ret)
> @@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_nextdqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
>  	if (ret)
> @@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->set_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	copy_from_if_dqblk(&fdq, &idq);
>  	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
> @@ -581,7 +581,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->set_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	/* Are we actually setting timer / warning limits for all users? */
>  	if (from_kqid(&init_user_ns, qid) == 0 &&
> @@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
>  	if (ret)
> @@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_nextdqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
>  	if (ret)
> diff --git a/include/linux/quota.h b/include/linux/quota.h
> index 9dfb6bce8c9e..1db16ee39b31 100644
> --- a/include/linux/quota.h
> +++ b/include/linux/quota.h
> @@ -179,6 +179,16 @@ static inline struct kqid make_kqid_projid(kprojid_t projid)
>  	return kqid;
>  }
>  
> +/**
> + *	qid_has_mapping - Report if a qid maps into a user namespace.
> + *	@ns:  The user namespace to see if a value maps into.
> + *	@qid: The kernel internal quota identifier to test.
> + */
> +static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid)
> +{
> +	return from_kqid(ns, qid) != (qid_t) -1;
> +}
> +
>  
>  extern spinlock_t dq_data_lock;
>  
> -- 
> 2.8.3
> 
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
@ 2016-07-11 10:14               ` Jan Kara
  0 siblings, 0 replies; 134+ messages in thread
From: Jan Kara @ 2016-07-11 10:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Seth Forshee, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jan Kara, Jann Horn, Michael Kerrisk

On Wed 06-07-16 13:12:08, Eric W. Biederman wrote:
> Introduce the helper qid_has_mapping and use it to ensure that the
> quota system only considers qids that map to the filesystems
> s_user_ns.
> 
> In practice for quota supporting filesystems today this is the exact
> same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
> map into init_user_ns.
> 
> Replace the qid_valid calls with qid_has_mapping as values come in
> from userspace.  This is harmless today and it prepares the quota
> system to work on filesystems with quotas but mounted by unprivileged
> users.
> 
> Call qid_has_mapping from dqget.  This ensures the passed in qid has a
> prepresentation on the underlying filesystem.  Previously this was
> unnecessary as filesystesm never had qids that could not map.  With
> the introduction of filesystems outside of s_user_ns this will not
> remain true.
> 
> All of this ensures the quota code never has to deal with qids that
> don't map to the underlying filesystem.

Does this and the following patch make sense when we actually don't allow
quotas when s_user_ns is set? If things are untested, they will get broken
so I think it is wiser to do the conversion only once we decide quota
should be supported for namespaces != init_user_ns. Or do I miss something?

								Honza

> 
> Cc: Jan Kara <jack@suse.cz>
> Acked-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/quota/dquot.c      |  3 +++
>  fs/quota/quota.c      | 12 ++++++------
>  include/linux/quota.h | 10 ++++++++++
>  3 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index ff21980d0119..74706b6aa747 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -841,6 +841,9 @@ struct dquot *dqget(struct super_block *sb, struct kqid qid)
>  	unsigned int hashent = hashfn(sb, qid);
>  	struct dquot *dquot, *empty = NULL;
>  
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
> +		return ERR_PTR(-EINVAL);
> +
>          if (!sb_has_quota_active(sb, qid.type))
>  		return ERR_PTR(-ESRCH);
>  we_slept:
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 0f10ee9892ce..73f6f4cf0a21 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -211,7 +211,7 @@ static int quota_getquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_dqblk(sb, qid, &fdq);
>  	if (ret)
> @@ -237,7 +237,7 @@ static int quota_getnextquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_nextdqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &fdq);
>  	if (ret)
> @@ -288,7 +288,7 @@ static int quota_setquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->set_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	copy_from_if_dqblk(&fdq, &idq);
>  	return sb->s_qcop->set_dqblk(sb, qid, &fdq);
> @@ -581,7 +581,7 @@ static int quota_setxquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->set_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	/* Are we actually setting timer / warning limits for all users? */
>  	if (from_kqid(&init_user_ns, qid) == 0 &&
> @@ -642,7 +642,7 @@ static int quota_getxquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_dqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_dqblk(sb, qid, &qdq);
>  	if (ret)
> @@ -669,7 +669,7 @@ static int quota_getnextxquota(struct super_block *sb, int type, qid_t id,
>  	if (!sb->s_qcop->get_nextdqblk)
>  		return -ENOSYS;
>  	qid = make_kqid(current_user_ns(), type, id);
> -	if (!qid_valid(qid))
> +	if (!qid_has_mapping(sb->s_user_ns, qid))
>  		return -EINVAL;
>  	ret = sb->s_qcop->get_nextdqblk(sb, &qid, &qdq);
>  	if (ret)
> diff --git a/include/linux/quota.h b/include/linux/quota.h
> index 9dfb6bce8c9e..1db16ee39b31 100644
> --- a/include/linux/quota.h
> +++ b/include/linux/quota.h
> @@ -179,6 +179,16 @@ static inline struct kqid make_kqid_projid(kprojid_t projid)
>  	return kqid;
>  }
>  
> +/**
> + *	qid_has_mapping - Report if a qid maps into a user namespace.
> + *	@ns:  The user namespace to see if a value maps into.
> + *	@qid: The kernel internal quota identifier to test.
> + */
> +static inline bool qid_has_mapping(struct user_namespace *ns, struct kqid qid)
> +{
> +	return from_kqid(ns, qid) != (qid_t) -1;
> +}
> +
>  
>  extern spinlock_t dq_data_lock;
>  
> -- 
> 2.8.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
       [not found]               ` <20160711101424.GH12410-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-11 18:12                 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-11 18:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux API, Linux Containers, Andy Lutomirski, James Bottomley,
	Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Wed 06-07-16 13:12:08, Eric W. Biederman wrote:
>> Introduce the helper qid_has_mapping and use it to ensure that the
>> quota system only considers qids that map to the filesystems
>> s_user_ns.
>> 
>> In practice for quota supporting filesystems today this is the exact
>> same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
>> map into init_user_ns.
>> 
>> Replace the qid_valid calls with qid_has_mapping as values come in
>> from userspace.  This is harmless today and it prepares the quota
>> system to work on filesystems with quotas but mounted by unprivileged
>> users.
>> 
>> Call qid_has_mapping from dqget.  This ensures the passed in qid has a
>> prepresentation on the underlying filesystem.  Previously this was
>> unnecessary as filesystesm never had qids that could not map.  With
>> the introduction of filesystems outside of s_user_ns this will not
>> remain true.
>> 
>> All of this ensures the quota code never has to deal with qids that
>> don't map to the underlying filesystem.
>
> Does this and the following patch make sense when we actually don't allow
> quotas when s_user_ns is set? If things are untested, they will get broken
> so I think it is wiser to do the conversion only once we decide quota
> should be supported for namespaces != init_user_ns. Or do I miss
> something?

All of the code will get exercised in the success case.  So I don't
think there is a strong chance of bitrot.

The patches are trivially correct so I am not concerned about them.

What the patches add are a logical consistency with the rest of the vfs.
In my tree the rest of the vfs now handles uids and gids that can not be
mapped to a filesystem and returns an error.  In my tree the rest of the
vfs now effectively assumes that uids and gids are stored in s_user_ns
on a filesystem.

So I freely admit there is a slight chance that some bit-rot will occur
and the quota code will be changed in such a way that those assumptions
are broken.  Bugs happen.   I think it much less likely if all of the
generic code in the vfs makes the same assumptions.



The place where I am concerned about thorough review and testing is
someone poisoning quota files and then the kernel trying to use them.
In the preliminary work we have done in other places in the kernel and
for other filesystems there almost always winds up being some way to
confuse the kernel and get it to misbave if you can poison the disk
based inputs.  As poison disk based inputs is not something filesystems
are stronlgy concerned about.  In most cases the disk the filesystem
resides on is in the box and therefore under control of the OS at all
times.  Dave Chinner has even said he will never consider handling
poisoned disk based inputs for XFS as the run time cost is too high.


Between actually finding issues that can cause problems, and the
increased amount of kernel code executed (and thus the increase in
kernel attack surface) I am very paranoid about enabling code that
trusts data that could be poisoned data from a hostile party.


At the same time I am very uncomfortable with the fact the kernel does
not protect against malicious disks and poisoned disk images.  As
poisoned disk images are a well known exploit vector in the wild.  A
well known and demonstrated attack vector that works is to leave a usb
stick in a public place, and helpful people will place it into their
computer to try and figure out who it belongs to.  In trying to be
helpful their computer will unbeknownst to them start executing code
that does not serve the interests of the computer owner.  I hate that we
can not currently protect people from shenanigans like that.

So I intend to be as responsible and as careful as I can, but also to
push forward in the hopes that we can eventually not have to worry about
our computers betraying us when we as human beings perform reasonable
actions.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
       [not found]               ` <20160711101424.GH12410-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-07-11 18:12                 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-11 18:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux Containers, Seth Forshee, linux-fsdevel, Linux API,
	James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

Jan Kara <jack@suse.cz> writes:

> On Wed 06-07-16 13:12:08, Eric W. Biederman wrote:
>> Introduce the helper qid_has_mapping and use it to ensure that the
>> quota system only considers qids that map to the filesystems
>> s_user_ns.
>> 
>> In practice for quota supporting filesystems today this is the exact
>> same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
>> map into init_user_ns.
>> 
>> Replace the qid_valid calls with qid_has_mapping as values come in
>> from userspace.  This is harmless today and it prepares the quota
>> system to work on filesystems with quotas but mounted by unprivileged
>> users.
>> 
>> Call qid_has_mapping from dqget.  This ensures the passed in qid has a
>> prepresentation on the underlying filesystem.  Previously this was
>> unnecessary as filesystesm never had qids that could not map.  With
>> the introduction of filesystems outside of s_user_ns this will not
>> remain true.
>> 
>> All of this ensures the quota code never has to deal with qids that
>> don't map to the underlying filesystem.
>
> Does this and the following patch make sense when we actually don't allow
> quotas when s_user_ns is set? If things are untested, they will get broken
> so I think it is wiser to do the conversion only once we decide quota
> should be supported for namespaces != init_user_ns. Or do I miss
> something?

All of the code will get exercised in the success case.  So I don't
think there is a strong chance of bitrot.

The patches are trivially correct so I am not concerned about them.

What the patches add are a logical consistency with the rest of the vfs.
In my tree the rest of the vfs now handles uids and gids that can not be
mapped to a filesystem and returns an error.  In my tree the rest of the
vfs now effectively assumes that uids and gids are stored in s_user_ns
on a filesystem.

So I freely admit there is a slight chance that some bit-rot will occur
and the quota code will be changed in such a way that those assumptions
are broken.  Bugs happen.   I think it much less likely if all of the
generic code in the vfs makes the same assumptions.



The place where I am concerned about thorough review and testing is
someone poisoning quota files and then the kernel trying to use them.
In the preliminary work we have done in other places in the kernel and
for other filesystems there almost always winds up being some way to
confuse the kernel and get it to misbave if you can poison the disk
based inputs.  As poison disk based inputs is not something filesystems
are stronlgy concerned about.  In most cases the disk the filesystem
resides on is in the box and therefore under control of the OS at all
times.  Dave Chinner has even said he will never consider handling
poisoned disk based inputs for XFS as the run time cost is too high.


Between actually finding issues that can cause problems, and the
increased amount of kernel code executed (and thus the increase in
kernel attack surface) I am very paranoid about enabling code that
trusts data that could be poisoned data from a hostile party.


At the same time I am very uncomfortable with the fact the kernel does
not protect against malicious disks and poisoned disk images.  As
poisoned disk images are a well known exploit vector in the wild.  A
well known and demonstrated attack vector that works is to leave a usb
stick in a public place, and helpful people will place it into their
computer to try and figure out who it belongs to.  In trying to be
helpful their computer will unbeknownst to them start executing code
that does not serve the interests of the computer owner.  I hate that we
can not currently protect people from shenanigans like that.

So I intend to be as responsible and as careful as I can, but also to
push forward in the hopes that we can eventually not have to worry about
our computers betraying us when we as human beings perform reasonable
actions.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
@ 2016-07-11 18:12                 ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-11 18:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux Containers, Seth Forshee,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux API, James Bottomley,
	Djalal Harouni, Serge E. Hallyn, Andy Lutomirski, Jann Horn,
	Michael Kerrisk

Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:

> On Wed 06-07-16 13:12:08, Eric W. Biederman wrote:
>> Introduce the helper qid_has_mapping and use it to ensure that the
>> quota system only considers qids that map to the filesystems
>> s_user_ns.
>> 
>> In practice for quota supporting filesystems today this is the exact
>> same check as qid_valid.  As only 0xffffffff aka (qid_t)-1 does not
>> map into init_user_ns.
>> 
>> Replace the qid_valid calls with qid_has_mapping as values come in
>> from userspace.  This is harmless today and it prepares the quota
>> system to work on filesystems with quotas but mounted by unprivileged
>> users.
>> 
>> Call qid_has_mapping from dqget.  This ensures the passed in qid has a
>> prepresentation on the underlying filesystem.  Previously this was
>> unnecessary as filesystesm never had qids that could not map.  With
>> the introduction of filesystems outside of s_user_ns this will not
>> remain true.
>> 
>> All of this ensures the quota code never has to deal with qids that
>> don't map to the underlying filesystem.
>
> Does this and the following patch make sense when we actually don't allow
> quotas when s_user_ns is set? If things are untested, they will get broken
> so I think it is wiser to do the conversion only once we decide quota
> should be supported for namespaces != init_user_ns. Or do I miss
> something?

All of the code will get exercised in the success case.  So I don't
think there is a strong chance of bitrot.

The patches are trivially correct so I am not concerned about them.

What the patches add are a logical consistency with the rest of the vfs.
In my tree the rest of the vfs now handles uids and gids that can not be
mapped to a filesystem and returns an error.  In my tree the rest of the
vfs now effectively assumes that uids and gids are stored in s_user_ns
on a filesystem.

So I freely admit there is a slight chance that some bit-rot will occur
and the quota code will be changed in such a way that those assumptions
are broken.  Bugs happen.   I think it much less likely if all of the
generic code in the vfs makes the same assumptions.



The place where I am concerned about thorough review and testing is
someone poisoning quota files and then the kernel trying to use them.
In the preliminary work we have done in other places in the kernel and
for other filesystems there almost always winds up being some way to
confuse the kernel and get it to misbave if you can poison the disk
based inputs.  As poison disk based inputs is not something filesystems
are stronlgy concerned about.  In most cases the disk the filesystem
resides on is in the box and therefore under control of the OS at all
times.  Dave Chinner has even said he will never consider handling
poisoned disk based inputs for XFS as the run time cost is too high.


Between actually finding issues that can cause problems, and the
increased amount of kernel code executed (and thus the increase in
kernel attack surface) I am very paranoid about enabling code that
trusts data that could be poisoned data from a hostile party.


At the same time I am very uncomfortable with the fact the kernel does
not protect against malicious disks and poisoned disk images.  As
poisoned disk images are a well known exploit vector in the wild.  A
well known and demonstrated attack vector that works is to leave a usb
stick in a public place, and helpful people will place it into their
computer to try and figure out who it belongs to.  In trying to be
helpful their computer will unbeknownst to them start executing code
that does not serve the interests of the computer owner.  I hate that we
can not currently protect people from shenanigans like that.

So I intend to be as responsible and as careful as I can, but also to
push forward in the hopes that we can eventually not have to worry about
our computers betraying us when we as human beings perform reasonable
actions.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
  2016-07-11 18:12                 ` Eric W. Biederman
@ 2016-07-13  1:34                     ` Dave Chinner
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-13  1:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
> The place where I am concerned about thorough review and testing is
> someone poisoning quota files and then the kernel trying to use them.
> In the preliminary work we have done in other places in the kernel and
> for other filesystems there almost always winds up being some way to
> confuse the kernel and get it to misbave if you can poison the disk
> based inputs.  As poison disk based inputs is not something filesystems
> are stronlgy concerned about.  In most cases the disk the filesystem
> resides on is in the box and therefore under control of the OS at all
> times.  Dave Chinner has even said he will never consider handling
> poisoned disk based inputs for XFS as the run time cost is too high.

I didn't say that. I said that comprehensive checks to catch all
possible malicious inputs is too expensive to consider a viable
solution for allowing user-mounts of arbitrary filesystem images
through the kernel.

We already have runtime validation that bounds check most on-disk
fields when they are read - that deals with fuzz based poisoning
testing quite well and provides some protection against directed
structural attacks as well. IOWs, we already handle a large scope of
poisoned inputs safely, but it's not comprehensive because we can't
easily determine cross-object reference validity within the
format-determined limits.

e.g. we can check that the number of records in a btree block is
within the valid bounds of a block, but we cannot determine that the
record count has been incremented by 1 and a bogus record has been
inserted somewhere inside the block and the CRC recalculated to
match the modification. We can also check the records themselves for
being within bound (e.g. we can check a freespace record points to
block within range and of valid length) but we can't check that the
extent is actually free space. That requires doing a full filesystem
traversal to determine if the extent is actually free space or not.

Of course, we could look up the rmap tree (if we have one) to
determine if the space is actually used or not, but an attacker can
also insert/remove records in that tree, too, so if we can't trust
the free space tree, we can't trust the rmap tree either.
Hence we have to fall back to brute force validation if we
want to be certain that the metadata has not been tampered with.

To bring this back to quota files, the only way to validate that a
quota file has not been tampered with is to run a quotacheck on the
filesystem once it has been mounted. This requires visiting every
inode in the filesystem, so it an expensive operation. Only XFS has
this functionality in kernel, so for untrusted mounts we could
simply run it on every mount that has quotas enabled. Of course,
users won't care that mounting their filesystem now takes several
minutes (hours, even, when we have millions of inodes in the fs)
while these checks are run...

Detecting malicious corruptions that specifically manipulate the
on-disk structure within the bounds of format validity are difficult
to detect and costly to protect against. We'd need to move large
parts of fsck into the kernel and run it to validate every piece of
metadata read into the kernel. Then we've got a much larger attack
surface in the kernel (all the validity checking code needs to be
robust against invalid structures, too!), a lot more complexity
(more bugs!) and a lot of additional runtime overhead (slow
filesystem = unhappy users!). It's just not a practical solution to
the problem.

> Between actually finding issues that can cause problems, and the
> increased amount of kernel code executed (and thus the increase in
> kernel attack surface) I am very paranoid about enabling code that
> trusts data that could be poisoned data from a hostile party.
>
> At the same time I am very uncomfortable with the fact the kernel does
> not protect against malicious disks and poisoned disk images.  As
> poisoned disk images are a well known exploit vector in the wild.  A
> well known and demonstrated attack vector that works is to leave a usb
> stick in a public place, and helpful people will place it into their
> computer to try and figure out who it belongs to.  In trying to be
> helpful their computer will unbeknownst to them start executing code
> that does not serve the interests of the computer owner.  I hate that we
> can not currently protect people from shenanigans like that.

Yes, we know all about these problems. Unfortunately someone
appears to not be listening when they being repeatedly told that
hardening all the kernel filesystem implementations against poisoned
images is simply not a viable solution to the problem.

Move the parsing of untrusted structures out of the kernel - work
with the various filesystem teams to build viable FUSE
implementations (where it's much easier to incorporate parts of the
userspace fsck code) and provide the FUSE filesystems to container
users wanting to mount their own filesystem images.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
@ 2016-07-13  1:34                     ` Dave Chinner
  0 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-13  1:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Linux Containers, Seth Forshee, linux-fsdevel,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
> The place where I am concerned about thorough review and testing is
> someone poisoning quota files and then the kernel trying to use them.
> In the preliminary work we have done in other places in the kernel and
> for other filesystems there almost always winds up being some way to
> confuse the kernel and get it to misbave if you can poison the disk
> based inputs.  As poison disk based inputs is not something filesystems
> are stronlgy concerned about.  In most cases the disk the filesystem
> resides on is in the box and therefore under control of the OS at all
> times.  Dave Chinner has even said he will never consider handling
> poisoned disk based inputs for XFS as the run time cost is too high.

I didn't say that. I said that comprehensive checks to catch all
possible malicious inputs is too expensive to consider a viable
solution for allowing user-mounts of arbitrary filesystem images
through the kernel.

We already have runtime validation that bounds check most on-disk
fields when they are read - that deals with fuzz based poisoning
testing quite well and provides some protection against directed
structural attacks as well. IOWs, we already handle a large scope of
poisoned inputs safely, but it's not comprehensive because we can't
easily determine cross-object reference validity within the
format-determined limits.

e.g. we can check that the number of records in a btree block is
within the valid bounds of a block, but we cannot determine that the
record count has been incremented by 1 and a bogus record has been
inserted somewhere inside the block and the CRC recalculated to
match the modification. We can also check the records themselves for
being within bound (e.g. we can check a freespace record points to
block within range and of valid length) but we can't check that the
extent is actually free space. That requires doing a full filesystem
traversal to determine if the extent is actually free space or not.

Of course, we could look up the rmap tree (if we have one) to
determine if the space is actually used or not, but an attacker can
also insert/remove records in that tree, too, so if we can't trust
the free space tree, we can't trust the rmap tree either.
Hence we have to fall back to brute force validation if we
want to be certain that the metadata has not been tampered with.

To bring this back to quota files, the only way to validate that a
quota file has not been tampered with is to run a quotacheck on the
filesystem once it has been mounted. This requires visiting every
inode in the filesystem, so it an expensive operation. Only XFS has
this functionality in kernel, so for untrusted mounts we could
simply run it on every mount that has quotas enabled. Of course,
users won't care that mounting their filesystem now takes several
minutes (hours, even, when we have millions of inodes in the fs)
while these checks are run...

Detecting malicious corruptions that specifically manipulate the
on-disk structure within the bounds of format validity are difficult
to detect and costly to protect against. We'd need to move large
parts of fsck into the kernel and run it to validate every piece of
metadata read into the kernel. Then we've got a much larger attack
surface in the kernel (all the validity checking code needs to be
robust against invalid structures, too!), a lot more complexity
(more bugs!) and a lot of additional runtime overhead (slow
filesystem = unhappy users!). It's just not a practical solution to
the problem.

> Between actually finding issues that can cause problems, and the
> increased amount of kernel code executed (and thus the increase in
> kernel attack surface) I am very paranoid about enabling code that
> trusts data that could be poisoned data from a hostile party.
>
> At the same time I am very uncomfortable with the fact the kernel does
> not protect against malicious disks and poisoned disk images.  As
> poisoned disk images are a well known exploit vector in the wild.  A
> well known and demonstrated attack vector that works is to leave a usb
> stick in a public place, and helpful people will place it into their
> computer to try and figure out who it belongs to.  In trying to be
> helpful their computer will unbeknownst to them start executing code
> that does not serve the interests of the computer owner.  I hate that we
> can not currently protect people from shenanigans like that.

Yes, we know all about these problems. Unfortunately someone
appears to not be listening when they being repeatedly told that
hardening all the kernel filesystem implementations against poisoned
images is simply not a viable solution to the problem.

Move the parsing of untrusted structures out of the kernel - work
with the various filesystem teams to build viable FUSE
implementations (where it's much easier to incorporate parts of the
userspace fsck code) and provide the FUSE filesystems to container
users wanting to mount their own filesystem images.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
  2016-07-13  1:34                     ` Dave Chinner
@ 2016-07-13  3:45                       ` Dave Chinner
  -1 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-13  3:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Jann Horn, Djalal Harouni

On Wed, Jul 13, 2016 at 11:34:36AM +1000, Dave Chinner wrote:
> On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
> > The place where I am concerned about thorough review and testing is
> > someone poisoning quota files and then the kernel trying to use them.
> > In the preliminary work we have done in other places in the kernel and
> > for other filesystems there almost always winds up being some way to
> > confuse the kernel and get it to misbave if you can poison the disk
> > based inputs.  As poison disk based inputs is not something filesystems
> > are stronlgy concerned about.  In most cases the disk the filesystem
> > resides on is in the box and therefore under control of the OS at all
> > times.  Dave Chinner has even said he will never consider handling
> > poisoned disk based inputs for XFS as the run time cost is too high.
> 
> I didn't say that. I said that comprehensive checks to catch all
> possible malicious inputs is too expensive to consider a viable
> solution for allowing user-mounts of arbitrary filesystem images
> through the kernel.
[.....]
> To bring this back to quota files, the only way to validate that a
> quota file has not been tampered with is to run a quotacheck on the
> filesystem once it has been mounted. This requires visiting every
> inode in the filesystem, so it an expensive operation. Only XFS has
> this functionality in kernel, so for untrusted mounts we could
> simply run it on every mount that has quotas enabled. Of course,
> users won't care that mounting their filesystem now takes several
> minutes (hours, even, when we have millions of inodes in the fs)
> while these checks are run...

So, over lunch I realised the problem with this. quotacheck is
verifying the contents of the quota file, but we haven't verified
the structure of the quota file to begin with. Hence just enabling
quotas could cause the filesystem to do bad things in the kernel on
mount if the quota file metadata has been tampered with.

IOWs, it's not just quota data parsing that we have to be concerned
with here - parsing the quota file structure itself could be an
attack vector that triggers on mount.

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
@ 2016-07-13  3:45                       ` Dave Chinner
  0 siblings, 0 replies; 134+ messages in thread
From: Dave Chinner @ 2016-07-13  3:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, Linux Containers, Seth Forshee, linux-fsdevel,
	Linux API, James Bottomley, Djalal Harouni, Serge E. Hallyn,
	Andy Lutomirski, Jann Horn, Michael Kerrisk

On Wed, Jul 13, 2016 at 11:34:36AM +1000, Dave Chinner wrote:
> On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
> > The place where I am concerned about thorough review and testing is
> > someone poisoning quota files and then the kernel trying to use them.
> > In the preliminary work we have done in other places in the kernel and
> > for other filesystems there almost always winds up being some way to
> > confuse the kernel and get it to misbave if you can poison the disk
> > based inputs.  As poison disk based inputs is not something filesystems
> > are stronlgy concerned about.  In most cases the disk the filesystem
> > resides on is in the box and therefore under control of the OS at all
> > times.  Dave Chinner has even said he will never consider handling
> > poisoned disk based inputs for XFS as the run time cost is too high.
> 
> I didn't say that. I said that comprehensive checks to catch all
> possible malicious inputs is too expensive to consider a viable
> solution for allowing user-mounts of arbitrary filesystem images
> through the kernel.
[.....]
> To bring this back to quota files, the only way to validate that a
> quota file has not been tampered with is to run a quotacheck on the
> filesystem once it has been mounted. This requires visiting every
> inode in the filesystem, so it an expensive operation. Only XFS has
> this functionality in kernel, so for untrusted mounts we could
> simply run it on every mount that has quotas enabled. Of course,
> users won't care that mounting their filesystem now takes several
> minutes (hours, even, when we have millions of inodes in the fs)
> while these checks are run...

So, over lunch I realised the problem with this. quotacheck is
verifying the contents of the quota file, but we haven't verified
the structure of the quota file to begin with. Hence just enabling
quotas could cause the filesystem to do bad things in the kernel on
mount if the quota file metadata has been tampered with.

IOWs, it's not just quota data parsing that we have to be concerned
with here - parsing the quota file structure itself could be an
attack vector that triggers on mount.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
  2016-07-13  1:34                     ` Dave Chinner
@ 2016-07-13  5:43                       ` Jann Horn
  -1 siblings, 0 replies; 134+ messages in thread
From: Jann Horn @ 2016-07-13  5:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Linux API, Linux Containers, Andy Lutomirski,
	James Bottomley, Seth Forshee, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	Djalal Harouni


[-- Attachment #1.1: Type: text/plain, Size: 2537 bytes --]

On Wed, Jul 13, 2016 at 11:34:36AM +1000, Dave Chinner wrote:
> On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
> > The place where I am concerned about thorough review and testing is
> > someone poisoning quota files and then the kernel trying to use them.
> > In the preliminary work we have done in other places in the kernel and
> > for other filesystems there almost always winds up being some way to
> > confuse the kernel and get it to misbave if you can poison the disk
> > based inputs.  As poison disk based inputs is not something filesystems
> > are stronlgy concerned about.  In most cases the disk the filesystem
> > resides on is in the box and therefore under control of the OS at all
> > times.  Dave Chinner has even said he will never consider handling
> > poisoned disk based inputs for XFS as the run time cost is too high.
> 
> I didn't say that. I said that comprehensive checks to catch all
> possible malicious inputs is too expensive to consider a viable
> solution for allowing user-mounts of arbitrary filesystem images
> through the kernel.
> 
[...]
> 
> To bring this back to quota files, the only way to validate that a
> quota file has not been tampered with is to run a quotacheck on the
> filesystem once it has been mounted. This requires visiting every
> inode in the filesystem, so it an expensive operation. Only XFS has
> this functionality in kernel, so for untrusted mounts we could
> simply run it on every mount that has quotas enabled. Of course,
> users won't care that mounting their filesystem now takes several
> minutes (hours, even, when we have millions of inodes in the fs)
> while these checks are run...
> 
> Detecting malicious corruptions that specifically manipulate the
> on-disk structure within the bounds of format validity are difficult
> to detect and costly to protect against. We'd need to move large
> parts of fsck into the kernel and run it to validate every piece of
> metadata read into the kernel. Then we've got a much larger attack
> surface in the kernel (all the validity checking code needs to be
> robust against invalid structures, too!), a lot more complexity
> (more bugs!) and a lot of additional runtime overhead (slow
> filesystem = unhappy users!). It's just not a practical solution to
> the problem.

And ideally, you'd want to also guard against an evil disk that
suddenly changes its contents after you've run fsck on it, and you
can't easily do that without making things complicated.

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 205 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
@ 2016-07-13  5:43                       ` Jann Horn
  0 siblings, 0 replies; 134+ messages in thread
From: Jann Horn @ 2016-07-13  5:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric W. Biederman, Jan Kara, Linux Containers, Seth Forshee,
	linux-fsdevel, Linux API, James Bottomley, Djalal Harouni,
	Serge E. Hallyn, Andy Lutomirski, Michael Kerrisk

[-- Attachment #1: Type: text/plain, Size: 2537 bytes --]

On Wed, Jul 13, 2016 at 11:34:36AM +1000, Dave Chinner wrote:
> On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
> > The place where I am concerned about thorough review and testing is
> > someone poisoning quota files and then the kernel trying to use them.
> > In the preliminary work we have done in other places in the kernel and
> > for other filesystems there almost always winds up being some way to
> > confuse the kernel and get it to misbave if you can poison the disk
> > based inputs.  As poison disk based inputs is not something filesystems
> > are stronlgy concerned about.  In most cases the disk the filesystem
> > resides on is in the box and therefore under control of the OS at all
> > times.  Dave Chinner has even said he will never consider handling
> > poisoned disk based inputs for XFS as the run time cost is too high.
> 
> I didn't say that. I said that comprehensive checks to catch all
> possible malicious inputs is too expensive to consider a viable
> solution for allowing user-mounts of arbitrary filesystem images
> through the kernel.
> 
[...]
> 
> To bring this back to quota files, the only way to validate that a
> quota file has not been tampered with is to run a quotacheck on the
> filesystem once it has been mounted. This requires visiting every
> inode in the filesystem, so it an expensive operation. Only XFS has
> this functionality in kernel, so for untrusted mounts we could
> simply run it on every mount that has quotas enabled. Of course,
> users won't care that mounting their filesystem now takes several
> minutes (hours, even, when we have millions of inodes in the fs)
> while these checks are run...
> 
> Detecting malicious corruptions that specifically manipulate the
> on-disk structure within the bounds of format validity are difficult
> to detect and costly to protect against. We'd need to move large
> parts of fsck into the kernel and run it to validate every piece of
> metadata read into the kernel. Then we've got a much larger attack
> surface in the kernel (all the validity checking code needs to be
> robust against invalid structures, too!), a lot more complexity
> (more bugs!) and a lot of additional runtime overhead (slow
> filesystem = unhappy users!). It's just not a practical solution to
> the problem.

And ideally, you'd want to also guard against an evil disk that
suddenly changes its contents after you've run fsck on it, and you
can't easily do that without making things complicated.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
       [not found]                       ` <20160713054358.GB28635-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
@ 2016-07-14 17:03                         ` Eric W. Biederman
  0 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-14 17:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jan Kara, Linux API, Linux Containers, Dave Chinner,
	Andy Lutomirski, James Bottomley, Seth Forshee, Michael Kerrisk,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Djalal Harouni

Jann Horn <jann-XZ1E9jl8jIdeoWH0uzbU5w@public.gmane.org> writes:

> On Wed, Jul 13, 2016 at 11:34:36AM +1000, Dave Chinner wrote:
>> On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
>> > The place where I am concerned about thorough review and testing is
>> > someone poisoning quota files and then the kernel trying to use them.
>> > In the preliminary work we have done in other places in the kernel and
>> > for other filesystems there almost always winds up being some way to
>> > confuse the kernel and get it to misbave if you can poison the disk
>> > based inputs.  As poison disk based inputs is not something filesystems
>> > are stronlgy concerned about.  In most cases the disk the filesystem
>> > resides on is in the box and therefore under control of the OS at all
>> > times.  Dave Chinner has even said he will never consider handling
>> > poisoned disk based inputs for XFS as the run time cost is too high.
>> 
>> I didn't say that. I said that comprehensive checks to catch all
>> possible malicious inputs is too expensive to consider a viable
>> solution for allowing user-mounts of arbitrary filesystem images
>> through the kernel.

Dave you said that speaking as the XFS maintainer.  So I take that to be
your position and to refer to XFS.

> [...]
>> 
>> To bring this back to quota files, the only way to validate that a
>> quota file has not been tampered with is to run a quotacheck on the
>> filesystem once it has been mounted. This requires visiting every
>> inode in the filesystem, so it an expensive operation. Only XFS has
>> this functionality in kernel, so for untrusted mounts we could
>> simply run it on every mount that has quotas enabled. Of course,
>> users won't care that mounting their filesystem now takes several
>> minutes (hours, even, when we have millions of inodes in the fs)
>> while these checks are run...
>>
>> Detecting malicious corruptions that specifically manipulate the
>> on-disk structure within the bounds of format validity are difficult
>> to detect and costly to protect against. We'd need to move large
>> parts of fsck into the kernel and run it to validate every piece of
>> metadata read into the kernel. Then we've got a much larger attack
>> surface in the kernel (all the validity checking code needs to be
>> robust against invalid structures, too!), a lot more complexity
>> (more bugs!) and a lot of additional runtime overhead (slow
>> filesystem = unhappy users!). It's just not a practical solution to
>> the problem.

This critiqute as I read it confuses data integrity and safety from
privilege escalation.  If a filesystem image or it's backing store are
malicious there is no need to be concerned about data integrity.

> And ideally, you'd want to also guard against an evil disk that
> suddenly changes its contents after you've run fsck on it, and you
> can't easily do that without making things complicated.

I agree an evil disk definitely should be part of any threat analysis.

I also agree that anything that is complicated is not a practical
as complicated means lots of code, and bugs are in general a function
of the amount of code.

At the same time my only concern when analyzing something for safety
against a malicious filesystem is can the malicious data cause the
kernel to misbehave.  This includes things like stack overflows,
and memory corruption.

In the specific case of a quota file, if there is a quota file that has
little to no resemblence to reality but the kernel doesn't misbehave, I
don't think that is a problem from an unprivileged mount perspective.
On the flip side if a malicious quota file allows an in kernel quota
variable to go negative and that causes the kernel to misbehave that is
a show stopper for allowing an unprivileged mount.

Personally I see the quantity of code in current filesystems as making
it hard to have a low enough probability of problems to allow
unprivileged mounts.

Changes for all of the weird cases a backing store for unprivileged
mounts brings with it have been added to the VFS because that is the
right place to implement them.  Working at a filesystem independent
level allows all of the right people involved in the review, and it
allows clean and general solutions to the weird cases that come up
with a uid or gid does not map into the kernel.


All of that said the only filesystem with backing store that I see as a
reasonable target to support right now is fuse.  As fuse was designed
from the get go to support filesystems from unprivileged users.

Will fuse someday support quotas?  I don't know.  But the there is no
extra cost to support that case in fs/quota/quota.c so I have added the
necessary code.

Eric

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH review 08/12] quota: Ensure qids map to the filesystem
  2016-07-13  5:43                       ` Jann Horn
  (?)
@ 2016-07-14 17:03                       ` Eric W. Biederman
  -1 siblings, 0 replies; 134+ messages in thread
From: Eric W. Biederman @ 2016-07-14 17:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Dave Chinner, Jan Kara, Linux Containers, Seth Forshee,
	linux-fsdevel, Linux API, James Bottomley, Djalal Harouni,
	Serge E. Hallyn, Andy Lutomirski, Michael Kerrisk

Jann Horn <jann@thejh.net> writes:

> On Wed, Jul 13, 2016 at 11:34:36AM +1000, Dave Chinner wrote:
>> On Mon, Jul 11, 2016 at 01:12:49PM -0500, Eric W. Biederman wrote:
>> > The place where I am concerned about thorough review and testing is
>> > someone poisoning quota files and then the kernel trying to use them.
>> > In the preliminary work we have done in other places in the kernel and
>> > for other filesystems there almost always winds up being some way to
>> > confuse the kernel and get it to misbave if you can poison the disk
>> > based inputs.  As poison disk based inputs is not something filesystems
>> > are stronlgy concerned about.  In most cases the disk the filesystem
>> > resides on is in the box and therefore under control of the OS at all
>> > times.  Dave Chinner has even said he will never consider handling
>> > poisoned disk based inputs for XFS as the run time cost is too high.
>> 
>> I didn't say that. I said that comprehensive checks to catch all
>> possible malicious inputs is too expensive to consider a viable
>> solution for allowing user-mounts of arbitrary filesystem images
>> through the kernel.

Dave you said that speaking as the XFS maintainer.  So I take that to be
your position and to refer to XFS.

> [...]
>> 
>> To bring this back to quota files, the only way to validate that a
>> quota file has not been tampered with is to run a quotacheck on the
>> filesystem once it has been mounted. This requires visiting every
>> inode in the filesystem, so it an expensive operation. Only XFS has
>> this functionality in kernel, so for untrusted mounts we could
>> simply run it on every mount that has quotas enabled. Of course,
>> users won't care that mounting their filesystem now takes several
>> minutes (hours, even, when we have millions of inodes in the fs)
>> while these checks are run...
>>
>> Detecting malicious corruptions that specifically manipulate the
>> on-disk structure within the bounds of format validity are difficult
>> to detect and costly to protect against. We'd need to move large
>> parts of fsck into the kernel and run it to validate every piece of
>> metadata read into the kernel. Then we've got a much larger attack
>> surface in the kernel (all the validity checking code needs to be
>> robust against invalid structures, too!), a lot more complexity
>> (more bugs!) and a lot of additional runtime overhead (slow
>> filesystem = unhappy users!). It's just not a practical solution to
>> the problem.

This critiqute as I read it confuses data integrity and safety from
privilege escalation.  If a filesystem image or it's backing store are
malicious there is no need to be concerned about data integrity.

> And ideally, you'd want to also guard against an evil disk that
> suddenly changes its contents after you've run fsck on it, and you
> can't easily do that without making things complicated.

I agree an evil disk definitely should be part of any threat analysis.

I also agree that anything that is complicated is not a practical
as complicated means lots of code, and bugs are in general a function
of the amount of code.

At the same time my only concern when analyzing something for safety
against a malicious filesystem is can the malicious data cause the
kernel to misbehave.  This includes things like stack overflows,
and memory corruption.

In the specific case of a quota file, if there is a quota file that has
little to no resemblence to reality but the kernel doesn't misbehave, I
don't think that is a problem from an unprivileged mount perspective.
On the flip side if a malicious quota file allows an in kernel quota
variable to go negative and that causes the kernel to misbehave that is
a show stopper for allowing an unprivileged mount.

Personally I see the quantity of code in current filesystems as making
it hard to have a low enough probability of problems to allow
unprivileged mounts.

Changes for all of the weird cases a backing store for unprivileged
mounts brings with it have been added to the VFS because that is the
right place to implement them.  Working at a filesystem independent
level allows all of the right people involved in the review, and it
allows clean and general solutions to the weird cases that come up
with a uid or gid does not map into the kernel.


All of that said the only filesystem with backing store that I see as a
reasonable target to support right now is fuse.  As fuse was designed
from the get go to support filesystems from unprivileged users.

Will fuse someday support quotas?  I don't know.  But the there is no
extra cost to support that case in fs/quota/quota.c so I have added the
necessary code.

Eric


^ permalink raw reply	[flat|nested] 134+ messages in thread

end of thread, other threads:[~2016-07-14 17:16 UTC | newest]

Thread overview: 134+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-02 17:18 [PATCH review 0/11] General unprivileged mount support Eric W. Biederman
2016-07-02 17:18 ` Eric W. Biederman
2016-07-02 17:20 ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
2016-07-02 17:20     ` Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 05/11] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
2016-07-02 17:20     ` Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 07/11] vfs: Don't create " Eric W. Biederman
     [not found]     ` <20160702172035.19568-7-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-07-04  7:59       ` Jan Kara
2016-07-04  7:59     ` Jan Kara
2016-07-04  7:59       ` Jan Kara
     [not found]       ` <20160704075919.GA5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-05 14:55         ` Eric W. Biederman
2016-07-05 14:55           ` Eric W. Biederman
     [not found]           ` <87zipwxhgp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-06  9:07             ` Jan Kara
2016-07-06  9:07               ` Jan Kara
2016-07-06 15:37               ` Eric W. Biederman
2016-07-06 15:37                 ` Eric W. Biederman
     [not found]               ` <20160706090705.GE14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-06 15:37                 ` Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 08/11] quota: Ensure qids map to the filesystem Eric W. Biederman
2016-07-02 17:20   ` [PATCH review 09/11] quota: Handle quota data stored in s_user_ns Eric W. Biederman
2016-07-02 17:20     ` Eric W. Biederman
     [not found]     ` <20160702172035.19568-9-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-07-02 17:33       ` [PATCH v2 " Eric W. Biederman
2016-07-02 17:33     ` Eric W. Biederman
2016-07-02 17:33       ` Eric W. Biederman
2016-07-04  9:11       ` Jan Kara
2016-07-04  9:11         ` Jan Kara
     [not found]         ` <20160704091100.GD5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-05 14:48           ` Seth Forshee
2016-07-05 15:34           ` Eric W. Biederman
2016-07-05 14:48         ` Seth Forshee
2016-07-05 15:34         ` Eric W. Biederman
     [not found]           ` <87d1msumhy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-05 20:57             ` Dave Chinner
2016-07-05 20:57           ` Dave Chinner
2016-07-05 20:57             ` Dave Chinner
2016-07-05 21:28             ` Eric W. Biederman
2016-07-05 21:28               ` Eric W. Biederman
     [not found]               ` <8737nnrcyy.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-06  6:35                 ` Dave Chinner
2016-07-06  6:35                   ` Dave Chinner
2016-07-06  8:25                   ` Jan Kara
2016-07-06  8:25                     ` Jan Kara
2016-07-06 17:51                     ` Eric W. Biederman
     [not found]                     ` <20160706082545.GC14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-06 17:51                       ` Eric W. Biederman
2016-07-05 21:28             ` Eric W. Biederman
     [not found]       ` <87mvm03pxy.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-04  9:11         ` Jan Kara
2016-07-02 17:20   ` [PATCH review 10/11] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
     [not found]   ` <20160702172035.19568-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-07-02 17:20     ` [PATCH review 02/11] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS Eric W. Biederman
2016-07-02 17:20       ` Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 03/11] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 04/11] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 05/11] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 06/11] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 07/11] vfs: Don't create " Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 08/11] quota: Ensure qids map to the filesystem Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 09/11] quota: Handle quota data stored in s_user_ns Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 10/11] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
2016-07-02 17:20     ` [PATCH review 11/11] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns Eric W. Biederman
2016-07-02 17:20       ` Eric W. Biederman
2016-07-04  8:52 ` [PATCH review 0/11] General unprivileged mount support Jan Kara
2016-07-04 16:27   ` Eric W. Biederman
2016-07-04 16:27     ` Eric W. Biederman
     [not found]     ` <87h9c52wsd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-06  8:54       ` Jan Kara
2016-07-06  8:54         ` Jan Kara
2016-07-06 13:54         ` Seth Forshee
2016-07-06 14:22           ` Jan Kara
2016-07-06 14:22             ` Jan Kara
2016-07-06 14:46             ` Seth Forshee
     [not found]             ` <20160706142255.GB21164-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-06 14:46               ` Seth Forshee
2016-07-06 15:01               ` Eric W. Biederman
2016-07-06 15:23               ` James Bottomley
2016-07-06 15:01             ` Eric W. Biederman
2016-07-06 15:23             ` James Bottomley
     [not found]               ` <1467818630.2369.21.camel-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org>
2016-07-06 16:35                 ` Eric W. Biederman
2016-07-06 16:35                   ` Eric W. Biederman
2016-07-06 14:22           ` Jan Kara
     [not found]         ` <20160706085440.GD14067-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-06 13:54           ` Seth Forshee
     [not found]   ` <20160704085220.GC5200-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-04 16:27     ` Eric W. Biederman
2016-07-06 13:44 ` Andy Lutomirski
2016-07-06 13:44   ` Andy Lutomirski
     [not found]   ` <CALCETrVof174gPCZnD2Z-RMjR-P=NcA0mYCU9ki6=o9hpFL-BA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-07-06 15:21     ` Eric W. Biederman
2016-07-06 15:21       ` Eric W. Biederman
2016-07-06 14:01 ` Andy Lutomirski
2016-07-06 14:01   ` Andy Lutomirski
2016-07-06 15:19   ` Eric W. Biederman
2016-07-06 15:19     ` Eric W. Biederman
     [not found] ` <87ziq03qnj.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-02 17:20   ` [PATCH review 01/11] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
2016-07-04  8:52   ` [PATCH review 0/11] General unprivileged mount support Jan Kara
2016-07-06 13:44   ` Andy Lutomirski
2016-07-06 14:01   ` Andy Lutomirski
2016-07-06 18:10   ` [PATCH review 0/12] General unprivileged mount support v2 Eric W. Biederman
2016-07-06 18:10     ` Eric W. Biederman
     [not found]     ` <874m82bptc.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-06 18:12       ` [PATCH review 01/12] fs: Refuse uid/gid changes which don't map into s_user_ns Eric W. Biederman
2016-07-06 18:12         ` Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 02/12] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 04/12] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 05/12] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
2016-07-06 18:12           ` Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 07/12] vfs: Don't create " Eric W. Biederman
2016-07-06 18:12           ` Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 08/12] quota: Ensure qids map to the filesystem Eric W. Biederman
     [not found]           ` <20160706181212.16267-8-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-07-11 10:14             ` Jan Kara
2016-07-11 10:14               ` Jan Kara
2016-07-11 18:12               ` Eric W. Biederman
2016-07-11 18:12                 ` Eric W. Biederman
     [not found]                 ` <878tx8dowu.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-13  1:34                   ` Dave Chinner
2016-07-13  1:34                     ` Dave Chinner
2016-07-13  3:45                     ` Dave Chinner
2016-07-13  3:45                       ` Dave Chinner
2016-07-13  5:43                     ` Jann Horn
2016-07-13  5:43                       ` Jann Horn
2016-07-14 17:03                       ` Eric W. Biederman
     [not found]                       ` <20160713054358.GB28635-J1fxOzX/cBvk1uMJSBkQmQ@public.gmane.org>
2016-07-14 17:03                         ` Eric W. Biederman
     [not found]               ` <20160711101424.GH12410-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-07-11 18:12                 ` Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 09/12] quota: Handle quota data stored in s_user_ns in quota_setxquota Eric W. Biederman
2016-07-06 18:12         ` [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns Eric W. Biederman
2016-07-06 18:12           ` Eric W. Biederman
2016-07-11 10:09           ` Jan Kara
     [not found]           ` <20160706181212.16267-10-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-07-11 10:09             ` Jan Kara
2016-07-06 18:12         ` [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
2016-07-06 18:12           ` Eric W. Biederman
     [not found]         ` <20160706181212.16267-1-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2016-07-06 18:12           ` [PATCH review 02/12] userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 03/12] vfs: Verify acls are valid within superblock's s_user_ns Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 04/12] fs: Check for invalid i_uid in may_follow_link() Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 05/12] cred: Reject inodes with invalid ids in set_create_file_as() Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 06/12] vfs: Don't modify inodes with a uid or gid unknown to the vfs Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 07/12] vfs: Don't create " Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 08/12] quota: Ensure qids map to the filesystem Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 09/12] quota: Handle quota data stored in s_user_ns in quota_setxquota Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 10/12] dquot: For now explicitly don't support filesystems outside of init_user_ns Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 11/12] evm: Translate user/group ids relative to s_user_ns when computing HMAC Eric W. Biederman
2016-07-06 18:12           ` [PATCH review 12/12] fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns Eric W. Biederman
2016-07-06 18:12         ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.