* [PATCH v5 00/11] FUSE mounts from non-init user namespaces
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park

This patchset v5 is based on work by Seth Forshee and Eric Biederman.
The previous version of this patchset was v4:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html

At the moment, filesystems backed by a physical medium can only be mounted
by real root in the initial user namespace. This restriction exists
because allowing root in a non-init user namespace to mount such a
filesystem would effectively give that user control over the underlying
source of the filesystem. In the case of FUSE, the source would be any
underlying device.

However, in many use cases such as containers, it's necessary to allow
filesystems to be mounted from non-init user namespaces. The goal of this
patchset is to allow FUSE filesystems to be mounted from non-init user
namespaces. Support for other filesystems like ext4 is not in the
scope of this patchset.

Let me describe how to test mounting from non-init user namespaces. The
tests use sshfs, a userspace filesystem based on FUSE with ssh as its
backend. The test system is Fedora 27.

====
$ sudo dnf install -y sshfs
$ sudo mkdir -p /mnt/userns

### workaround to get past the sshfs permission checks
$ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies

$ unshare -U -r -m
# sshfs root@localhost: /mnt/userns

### You can see sshfs being mounted from a non-init user namespace
# mount | grep sshfs
root@localhost: on /mnt/userns type fuse.sshfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0)

# touch /mnt/userns/test
# ls -l /mnt/userns/test
-rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
====

Open another terminal and check the mountpoint from outside the namespace.

====
$ grep userns /proc/$(pidof sshfs)/mountinfo
131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs root@localhost: rw,user_id=0,group_id=0
====

After all tests are done, you can unmount the filesystem
inside the namespace.

====
# fusermount -u /mnt/userns
====

Changes since v4:
 * Remove other parts like ext4 to keep the patchset minimal for FUSE
 * Add and change commit messages
 * Describe how to test non-init user namespaces

TODO:
 * Think through potential security implications. There are 2 patches
   being prepared for security issues. One is "ima: define a new policy
   option named force" by Mimi Zohar, which adds an option to specify
   that the results should not be cached:
   https://marc.info/?l=linux-integrity&m=151275680115856&w=2
   The other one is to basically prevent FUSE results from being cached,
   which is still in progress.

 * Test IMA/LSMs. Details are written in
   https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md

Patches 1-2 add a mask argument to lookup_bdev() so that callers can
request an additional inode permission check.

Patches 3-7 allow the superblock owner to change ownership of inodes, and
deal with additional capability checks w.r.t. user namespaces.

Patches 8-10 allow FUSE filesystems to be mounted outside of the init
user namespace.

Patch 11 handles a corner case of non-root users in EVM.

The patchset is also available in our github repo:
  https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1


Eric W. Biederman (1):
  fs: Allow superblock owner to change ownership of inodes

Seth Forshee (10):
  block_dev: Support checking inode permissions in lookup_bdev()
  mtd: Check permissions towards mtd block device inode when mounting
  fs: Don't remove suid for CAP_FSETID for userns root
  fs: Allow superblock owner to access do_remount_sb()
  capabilities: Allow privileged user in s_user_ns to set security.*
    xattrs
  fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  fuse: Support fuse filesystems outside of init_user_ns
  fuse: Restrict allow_other to the superblock's namespace or a
    descendant
  fuse: Allow user namespace mounts
  evm: Don't update hmacs in user ns mounts

 drivers/md/bcache/super.c           |  2 +-
 drivers/md/dm-table.c               |  2 +-
 drivers/mtd/mtdsuper.c              |  6 +++++-
 fs/attr.c                           | 34 ++++++++++++++++++++++++++--------
 fs/block_dev.c                      | 13 ++++++++++---
 fs/fuse/cuse.c                      |  3 ++-
 fs/fuse/dev.c                       | 11 ++++++++---
 fs/fuse/dir.c                       | 16 ++++++++--------
 fs/fuse/fuse_i.h                    |  6 +++++-
 fs/fuse/inode.c                     | 35 +++++++++++++++++++++--------------
 fs/inode.c                          |  6 ++++--
 fs/ioctl.c                          |  4 ++--
 fs/namespace.c                      |  4 ++--
 fs/proc/base.c                      |  7 +++++++
 fs/proc/generic.c                   |  7 +++++++
 fs/proc/proc_sysctl.c               |  7 +++++++
 fs/quota/quota.c                    |  2 +-
 include/linux/fs.h                  |  2 +-
 kernel/user_namespace.c             |  1 +
 security/commoncap.c                |  8 ++++++--
 security/integrity/evm/evm_crypto.c |  3 ++-
 21 files changed, 127 insertions(+), 52 deletions(-)

-- 
2.13.6


* [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, dm-devel,
	linux-bcache, linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara,
	Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

When looking up a block device by path, no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions towards the inode, for example when allowing
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights required towards the inode at the specified
path. The check is always skipped if the user has CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.
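
For illustration only, the gate described above can be modelled in plain
userspace C. This is a sketch under stated assumptions, not the kernel
code: bdev_lookup_checks_inode() and its has_cap_sys_admin parameter are
hypothetical names standing in for the in-kernel capable(CAP_SYS_ADMIN)
call, and the MAY_* values are copied from include/linux/fs.h so that the
snippet builds outside the kernel tree.

```c
#include <stdbool.h>

/* Access bits as defined in the kernel's include/linux/fs.h, repeated
 * here only so that the sketch compiles in userspace. */
#define MAY_EXEC  0x1
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/*
 * Hypothetical helper, not part of the patch. Returns true when
 * lookup_bdev() would go on to check access rights on the inode:
 * only for a non-zero mask, and only when the caller lacks
 * CAP_SYS_ADMIN (modelled here as a plain boolean).
 */
bool bdev_lookup_checks_inode(int mask, bool has_cap_sys_admin)
{
	return mask != 0 && !has_cap_sys_admin;
}
```

With a mask of 0 the helper returns false for every caller, which is the
"no functional change" case for the existing callers.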

Patch v4 is available: https://patchwork.kernel.org/patch/8943601/

Cc: dm-devel@redhat.com
Cc: linux-bcache@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mtd@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 drivers/md/bcache/super.c |  2 +-
 drivers/md/dm-table.c     |  2 +-
 drivers/mtd/mtdsuper.c    |  2 +-
 fs/block_dev.c            | 13 ++++++++++---
 fs/quota/quota.c          |  2 +-
 include/linux/fs.h        |  2 +-
 6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928..acc9d56c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 				  sb);
 	if (IS_ERR(bdev)) {
 		if (bdev == ERR_PTR(-EBUSY)) {
-			bdev = lookup_bdev(strim(path));
+			bdev = lookup_bdev(strim(path), 0);
 			mutex_lock(&bch_register_lock);
 			if (!IS_ERR(bdev) && bch_is_open(bdev))
 				err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88130b5d..bca5eaf4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
 	dev_t dev;
 	struct block_device *bdev;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		dev = name_to_dev_t(path);
 	else {
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index e43fea89..4a4d40c0 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name);
+	bdev = lookup_bdev(dev_name, 0);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb..5ca06095 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
 	struct block_device *bdev;
 	int err;
 
-	bdev = lookup_bdev(path);
+	bdev = lookup_bdev(path, 0);
 	if (IS_ERR(bdev))
 		return bdev;
 
@@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
 /**
  * lookup_bdev  - lookup a struct block_device by name
  * @pathname:	special file representing the block device
+ * @mask:	rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
  * Get a reference to the blockdevice at @pathname in the current
  * namespace if possible and return it.  Return ERR_PTR(error)
- * otherwise.
+ * otherwise.  If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
  */
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
 {
 	struct block_device *bdev;
 	struct inode *inode;
@@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
 		return ERR_PTR(error);
 
 	inode = d_backing_inode(path.dentry);
+	if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+		error = __inode_permission(inode, mask);
+		if (error)
+			goto fail;
+	}
 	error = -ENOTBLK;
 	if (!S_ISBLK(inode->i_mode))
 		goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 43612e2a..e5d47955 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
 
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	bdev = lookup_bdev(tmp->name);
+	bdev = lookup_bdev(tmp->name, 0);
 	putname(tmp);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2995a271..fce19c49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2551,7 +2551,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
 #define BLKDEV_MAJOR_MAX	512
 extern const char *__bdevname(dev_t, char *buffer);
 extern const char *bdevname(struct block_device *bdev, char *buffer);
-extern struct block_device *lookup_bdev(const char *);
+extern struct block_device *lookup_bdev(const char *, int mask);
 extern void blkdev_show(struct seq_file *,off_t);
 
 #else
-- 
2.13.6


* [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-mtd

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users should not be able to mount mtd block devices
when they lack sufficient privileges towards the block device
inode.  Update mount_mtd() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.
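
As a rough userspace sketch of how the required access could be derived
from the mount flags: mtd_lookup_mask() is a hypothetical name, and the
MAY_* and MS_RDONLY values are repeated from the kernel headers purely so
that the snippet is self-contained. A read-only mount asks only for read
access to the inode, anything else also asks for write access.

```c
/* Values repeated from the kernel headers so the sketch builds on its
 * own; in the kernel they come from include/linux/fs.h. */
#define MAY_WRITE 0x2
#define MAY_READ  0x4
#ifndef MS_RDONLY
#define MS_RDONLY 1	/* mount read-only */
#endif

/*
 * Hypothetical helper, not part of the patch: compute the mask that
 * would be handed to lookup_bdev() for a given set of mount flags.
 */
int mtd_lookup_mask(unsigned long flags)
{
	int perm = MAY_READ;

	/* Writable mounts additionally need write access to the inode. */
	if (!(flags & MS_RDONLY))
		perm |= MAY_WRITE;
	return perm;
}
```

Because the resulting mask is never 0, an unprivileged caller always goes
through the inode permission check, while a CAP_SYS_ADMIN caller still
skips it inside lookup_bdev().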

Patch v3 is available: https://patchwork.kernel.org/patch/7640011/

Cc: linux-mtd@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 drivers/mtd/mtdsuper.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 4a4d40c0..3c8734f3 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 #ifdef CONFIG_BLOCK
 	struct block_device *bdev;
 	int ret, major;
+	int perm;
 #endif
 	int mtdnr;
 
@@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
 	/* try the old way - the hack where we allowed users to mount
 	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
 	 */
-	bdev = lookup_bdev(dev_name, 0);
+	perm = MAY_READ;
+	if (!(flags & MS_RDONLY))
+		perm |= MAY_WRITE;
+	bdev = lookup_bdev(dev_name, perm);
 	if (IS_ERR(bdev)) {
 		ret = PTR_ERR(bdev);
 		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
-- 
2.13.6


* [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
From: Dongsu Park @ 2017-12-22 14:32 UTC
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro, Luis R. Rodriguez, Kees Cook

From: Eric W. Biederman <ebiederm@xmission.com>

Allow users with CAP_CHOWN over the superblock of a filesystem to
chown files.  Ordinarily the capable_wrt_inode_uidgid check is
sufficient to allow access to files, but when the underlying filesystem
has uids or gids that don't map to the current user namespace it is
not enough, so the chown permission checks need to be extended to
allow this case.

Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified, as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.

Once chown has been called the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.

For the proc filesystem this relaxation of permissions is not safe, as
some files are owned by users (particularly GLOBAL_ROOT_UID) outside
of the control of the mounter of the proc and that would be unsafe to
grant chown access to.  So update setattr on proc to disallow changing
files whose uids or gids are outside of proc's s_user_ns.
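
To make the extended ownership check easier to follow, here is a small
userspace model of the decision order. It is only a sketch: struct
chown_request and chown_ok_model() are hypothetical, and the two boolean
fields stand in for the results of capable_wrt_inode_uidgid(inode,
CAP_CHOWN) and ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN), which
cannot be called outside the kernel.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the inputs the check consults. */
struct chown_request {
	unsigned int fsuid;	/* caller's filesystem uid */
	unsigned int inode_uid;	/* current owner of the inode */
	unsigned int new_uid;	/* requested new owner */
	bool cap_wrt_inode;	/* stands in for capable_wrt_inode_uidgid() */
	bool cap_in_sb_userns;	/* stands in for ns_capable() on s_user_ns */
};

/* Model of the chown decision: any one of the three conditions is enough. */
bool chown_ok_model(const struct chown_request *r)
{
	/* The owner may "chown" a file to the uid it already has. */
	if (r->fsuid == r->inode_uid && r->new_uid == r->inode_uid)
		return true;
	/* Existing rule: CAP_CHOWN with respect to a mapped inode. */
	if (r->cap_wrt_inode)
		return true;
	/* New rule: CAP_CHOWN in the user namespace owning the superblock. */
	if (r->cap_in_sb_userns)
		return true;
	return false;
}
```

The third test is what covers inodes whose uid or gid do not map into the
caller's user namespace, where the second test alone would fail.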

The original version of this patch was written by Seth Forshee.  I
have rewritten and rethought this patch enough so it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, and finding the potential gotchas and putting up with my
semi-paranoid feedback.

Patch v4 is available: https://patchwork.kernel.org/patch/8944611/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Inspired-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
[saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/attr.c             | 34 ++++++++++++++++++++++++++--------
 fs/proc/base.c        |  7 +++++++
 fs/proc/generic.c     |  7 +++++++
 fs/proc/proc_sysctl.c |  7 +++++++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6f..bf8e94f3 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,30 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+	if (uid_eq(current_fsuid(), inode->i_uid) &&
+	    uid_eq(uid, inode->i_uid))
+		return true;
+	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+		return true;
+	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+		return true;
+	return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+	if (uid_eq(current_fsuid(), inode->i_uid) &&
+	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+		return true;
+	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+		return true;
+	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+		return true;
+	return false;
+}
+
 /**
  * setattr_prepare - check if attribute changes to a dentry are allowed
  * @dentry:	dentry to check
@@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 		goto kill_priv;
 
 	/* Make sure a caller can chown. */
-	if ((ia_valid & ATTR_UID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
 		return -EPERM;
 
 	/* Make sure caller can chgrp. */
-	if ((ia_valid & ATTR_GID) &&
-	    (!uid_eq(current_fsuid(), inode->i_uid) ||
-	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
-	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
 		return -EPERM;
 
 	/* Make sure a caller can chmod. */
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb9..9d50ec92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	int error;
 	struct inode *inode = d_inode(dentry);
+	struct user_namespace *s_user_ns;
 
 	if (attr->ia_valid & ATTR_MODE)
 		return -EPERM;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, attr);
 	if (error)
 		return error;
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 793a6757..527d46c8 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
 {
 	struct inode *inode = d_inode(dentry);
 	struct proc_dir_entry *de = PDE(inode);
+	struct user_namespace *s_user_ns;
 	int error;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, iattr);
 	if (error)
 		return error;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c5cbbdff..0f9562d1 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
 static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
+	struct user_namespace *s_user_ns;
 	int error;
 
 	if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
 		return -EPERM;
 
+	/* Don't let anyone mess with weird proc files */
+	s_user_ns = inode->i_sb->s_user_ns;
+	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+	    !kgid_has_mapping(s_user_ns, inode->i_gid))
+		return -EPERM;
+
 	error = setattr_prepare(dentry, attr);
 	if (error)
 		return error;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
  2017-12-22 14:32 [PATCH v5 00/11] FUSE mounts from non-init user namespaces Dongsu Park
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
       [not found]   ` <ddf1fb9b5001e633e0022dee7fecb0ef431e851f.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  2017-12-23  3:26   ` Serge E. Hallyn
  2017-12-22 14:32 ` [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb() Dongsu Park
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro, Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

Expand the check in should_remove_suid() to keep privileges for
CAP_FSETID in s_user_ns rather than init_user_ns.

Patch v4 is available: https://patchwork.kernel.org/patch/8944621/

--EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/inode.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fd401028..6459a437 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
  */
 int should_remove_suid(struct dentry *dentry)
 {
-	umode_t mode = d_inode(dentry)->i_mode;
+	struct inode *inode = d_inode(dentry);
+	umode_t mode = inode->i_mode;
 	int kill = 0;
 
 	/* suid always must be killed */
@@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
 	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
 		kill |= ATTR_KILL_SGID;
 
-	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
+	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
+		     S_ISREG(mode)))
 		return kill;
 
 	return 0;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()
  2017-12-22 14:32 [PATCH v5 00/11] FUSE mounts from non-init user namespaces Dongsu Park
                   ` (2 preceding siblings ...)
  2017-12-22 14:32 ` [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root Dongsu Park
@ 2017-12-22 14:32 ` Dongsu Park
       [not found]   ` <8dd484dceb9e96e5b67f21b8a0cf333753985e89.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  2017-12-23  3:30   ` Serge E. Hallyn
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro, Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

Superblock-level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read-only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.
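
What "privileged towards the userns" means can be sketched in userspace
as a walk up the namespace tree. This is a simplified model, not the
kernel implementation: model_userns, model_cred and model_ns_capable()
are hypothetical names, and only one capability bit is tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: a parent link and the owner's uid. */
struct model_userns {
	const struct model_userns *parent; /* NULL for the initial namespace */
	unsigned int owner;                /* uid (in the parent) that created it */
};

struct model_cred {
	const struct model_userns *user_ns; /* namespace the caller lives in */
	unsigned int euid;
	bool cap_sys_admin;                 /* CAP_SYS_ADMIN in the caller's own namespace */
};

/* Does the caller hold CAP_SYS_ADMIN with respect to the target namespace? */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_userns *target)
{
	const struct model_userns *ns;

	assert(cred != NULL && target != NULL);

	for (ns = target; ns != NULL; ns = ns->parent) {
		/* Same namespace, or an ancestor reached by walking up. */
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* The creator of a namespace is privileged over it from the parent. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
	}
	/* The caller's namespace is not the target or one of its ancestors. */
	return false;
}
```

In this model, root inside a child namespace is capable towards a
superblock whose s_user_ns is that child, but not towards one owned by
the initial namespace, which matches the intent of replacing capable()
with ns_capable(sb->s_user_ns, ...).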

Patch v4 is available: https://patchwork.kernel.org/patch/8944631/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e158ec6b..830040d7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
 		 * Special case for "unmounting" root ...
 		 * we just try to remount it readonly.
 		 */
-		if (!capable(CAP_SYS_ADMIN))
+		if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 			return -EPERM;
 		down_write(&sb->s_umount);
 		if (!sb_rdonly(sb))
@@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
 	down_write(&sb->s_umount);
 	if (ms_flags & MS_BIND)
 		err = change_mount_flags(path->mnt, ms_flags);
-	else if (!capable(CAP_SYS_ADMIN))
+	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		err = -EPERM;
 	else
 		err = do_remount_sb(sb, sb_flags, data, 0);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs
@ 2017-12-22 14:32     ` Dongsu Park
  0 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-security-module,
	James Morris, Serge Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.
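
The shape of the commoncap check can be sketched in userspace as
follows. The MODEL_* strings are local copies of the "security." prefix
and the file-capabilities xattr name, and the boolean argument is a
hypothetical stand-in for ns_capable(s_user_ns, CAP_SYS_ADMIN); LSM
hooks such as the SELinux and Smack checks mentioned above are not
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SECURITY_PREFIX "security."
#define MODEL_NAME_CAPS       "security.capability"

/* Returns 0 if the capability layer lets the xattr write through, else -EPERM. */
static int model_setxattr_check(const char *name, bool admin_in_sb_userns)
{
	assert(name != NULL);

	/* Names outside the security.* namespace are not this check's business. */
	if (strncmp(name, MODEL_SECURITY_PREFIX,
		    sizeof(MODEL_SECURITY_PREFIX) - 1) != 0)
		return 0;

	/* File capabilities are validated elsewhere, so pass them through here. */
	if (strcmp(name, MODEL_NAME_CAPS) == 0)
		return 0;

	/* Other security.* names need CAP_SYS_ADMIN over the superblock's userns. */
	return admin_in_sb_userns ? 0 : -EPERM;
}
```

Under these assumptions, "security.selinux" is refused for a caller
without the capability, while "user.comment" and "security.capability"
pass this particular check regardless.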

Patch v4 is available: https://patchwork.kernel.org/patch/8944641/

Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.morris@oracle.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 security/commoncap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 4f8e0934..dd0afef9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
+	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
 	/* Ignore non-security xattrs */
 	if (strncmp(name, XATTR_SECURITY_PREFIX,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
 	if (strcmp(name, XATTR_NAME_CAPS) == 0)
 		return 0;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
+	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
 	/* Ignore non-security xattrs */
 	if (strncmp(name, XATTR_SECURITY_PREFIX,
 			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
 		return 0;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
-- 
2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
@ 2017-12-22 14:32     ` Dongsu Park
  0 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-fsdevel,
	Alexander Viro

From: Seth Forshee <seth.forshee@canonical.com>

The user in control of a superblock should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns rather than in
init_user_ns.
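
For reference, this is roughly how userspace reaches the two ioctls.
The helper name fs_freeze_ioctl() is made up for this sketch; FIFREEZE
and FITHAW come from <linux/fs.h>. Whether the call succeeds depends on
the caller's capabilities and on the filesystem supporting freeze.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Freeze (freeze != 0) or thaw (freeze == 0) the filesystem containing
 * the given path. Returns 0 on success or a negative errno value.
 */
static int fs_freeze_ioctl(const char *mountpoint, int freeze)
{
	int fd, ret = 0;

	assert(mountpoint != NULL);

	fd = open(mountpoint, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* A caller without the required CAP_SYS_ADMIN gets EPERM here. */
	if (ioctl(fd, freeze ? FIFREEZE : FITHAW, 0) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

A caller would typically pair the two, for example freezing with
fs_freeze_ioctl("/mnt/userns", 1), taking a snapshot, then thawing with
fs_freeze_ioctl("/mnt/userns", 0).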

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb..8c628a8d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* If filesystem doesn't support freeze feature, return. */
@@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
 {
 	struct super_block *sb = file_inode(filp)->i_sb;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	/* Thaw */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
       [not found] ` <cover.1512741134.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2017-12-22 14:32     ` Dongsu Park
@ 2017-12-22 14:32   ` Dongsu Park
  2017-12-22 14:32   ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Seth Forshee, Alban Crequy, Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the checks in setattr_prepare (formerly inode_change_ok)
for ownership changes. Either restriction could be relaxed in the
future if needed.

For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.
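
The translation step can be sketched in userspace like this. It is a
simplified model under stated assumptions: model_uid_extent is a
hypothetical single-extent map, the two helpers only imitate the roles
of from_kuid() and make_kuid(), and an unmapped id is reported as
(uid_t)-1 so that a request from an unmappable caller can be rejected
with -EOVERFLOW.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical map: namespace uids [ns_first, +count) <-> kernel uids [kernel_first, +count). */
struct model_uid_extent {
	uid_t ns_first;
	uid_t kernel_first;
	uid_t count;
};

/* Kernel uid -> uid as seen by the fuse server; (uid_t)-1 if unmapped. */
static uid_t model_from_kuid(const struct model_uid_extent *map, uid_t kuid)
{
	assert(map != NULL);
	if (kuid >= map->kernel_first && kuid - map->kernel_first < map->count)
		return map->ns_first + (kuid - map->kernel_first);
	return (uid_t)-1;
}

/* Uid reported by the fuse server -> kernel uid; (uid_t)-1 if unmapped. */
static uid_t model_make_kuid(const struct model_uid_extent *map, uid_t uid)
{
	assert(map != NULL);
	if (uid >= map->ns_first && uid - map->ns_first < map->count)
		return map->kernel_first + (uid - map->ns_first);
	return (uid_t)-1;
}

/* Fill the uid of an outgoing request, refusing callers the server cannot name. */
static int model_fill_request_uid(const struct model_uid_extent *map,
				  uid_t caller_kuid, uid_t *out)
{
	uid_t uid = model_from_kuid(map, caller_kuid);

	assert(out != NULL);
	if (uid == (uid_t)-1)
		return -EOVERFLOW;
	*out = uid;
	return 0;
}
```

With a map of namespace uids 0 to 65535 onto kernel uids 100000 to
165535, a caller with kernel uid 101000 appears to the server as uid
1000, and a caller with kernel uid 0 is refused because the server's
namespace has no name for it.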

Patch v4 is available: https://patchwork.kernel.org/patch/8944661/

Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
---
 fs/fuse/cuse.c   |  3 ++-
 fs/fuse/dev.c    | 11 ++++++++---
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803..b1b83259 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	fuse_conn_init(&cc->fc, current_user_ns());
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 17f0d05b..0f780e16 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 }
 
@@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
+	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 
 	return req;
 
@@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
+	if (task_active_pid_ns(current) != fc->pid_ns ||
+	    current_user_ns() != fc->user_ns) {
 		rcu_read_lock();
 		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
 		rcu_read_unlock();
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382..ad1cfac1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d5773ca6..364e65c8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2f504d61..7f6b2e55 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.13.6


* [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
       [not found] ` <cover.1512741134.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2017-12-22 14:32   ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2017-12-22 14:32   ` Dongsu Park
  2017-12-22 14:32   ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Miklos Szeredi, containers, Seth Forshee, Alban Crequy,
	Eric W . Biederman, Sargun Dhillon, linux-fsdevel

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
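
The "same userns or a descendant" test can be sketched in userspace C
as follows. This is a toy model that assumes each namespace records its
parent and its nesting level; struct toy_ns_node, the toy_* functions
and the sample namespaces are hypothetical, and the walk only
approximates what current_in_userns(fc->user_ns) is meant to decide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified user namespace: parent link plus depth. */
struct toy_ns_node {
	const struct toy_ns_node *parent; /* NULL for the initial namespace */
	int level;                        /* 0 for the initial namespace */
};

/*
 * Returns true if 'child' is 'ancestor' itself or is nested below it.
 * Walk up from 'child' until reaching the ancestor's depth, then
 * compare the pointers.
 */
bool toy_in_userns(const struct toy_ns_node *ancestor,
		   const struct toy_ns_node *child)
{
	const struct toy_ns_node *ns;

	assert(ancestor != NULL && child != NULL);
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/* With allow_other set, access depends only on the caller's namespace. */
bool toy_allow_other_permits(const struct toy_ns_node *mount_ns,
			     const struct toy_ns_node *caller_ns)
{
	return toy_in_userns(mount_ns, caller_ns);
}

/* Sample hierarchy (an assumption): init -> container -> nested,
 * plus an unrelated sibling created directly under init. */
const struct toy_ns_node toy_init_ns   = { NULL, 0 };
const struct toy_ns_node toy_container = { &toy_init_ns, 1 };
const struct toy_ns_node toy_nested    = { &toy_container, 2 };
const struct toy_ns_node toy_sibling   = { &toy_init_ns, 1 };
```

For a mount made in toy_container, callers in toy_container and
toy_nested are permitted, while callers in toy_init_ns or toy_sibling
are not, which is the behaviour the commit message asks for.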

Patch v4 is available: https://patchwork.kernel.org/patch/8944671/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1..d41559a0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4c..492c255e 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.13.6


* [PATCH 10/11] fuse: Allow user namespace mounts
       [not found] ` <cover.1512741134.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2017-12-22 14:32   ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
@ 2017-12-22 14:32   ` Dongsu Park
  2017-12-22 14:32   ` [PATCH 11/11] evm: Don't update hmacs in user ns mounts Dongsu Park
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Miklos Szeredi, containers, Seth Forshee, Alban Crequy,
	Eric W . Biederman, Sargun Dhillon, linux-fsdevel

From: Seth Forshee <seth.forshee@canonical.com>

To allow fuse to be mounted from non-init user namespaces, the
FS_USERNS_MOUNT flag must be set in fs_flags.
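
The effect of the flag can be modelled roughly in userspace C. The
sketch below is a deliberate simplification: toy_may_mount() and its
two boolean inputs are hypothetical, the TOY_* bit values are only
illustrative, and the real mount path performs further checks that are
not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits for a filesystem type (assumed values). */
#define TOY_FS_REQUIRES_DEV  1
#define TOY_FS_HAS_SUBTYPE   4
#define TOY_FS_USERNS_MOUNT  8

/*
 * Simplified permission decision for mounting a filesystem type:
 *  - a caller with CAP_SYS_ADMIN in the initial namespace may mount;
 *  - otherwise the type must opt in with TOY_FS_USERNS_MOUNT and the
 *    caller needs CAP_SYS_ADMIN in its own user namespace.
 * Returns 0 if allowed, -EPERM if not.
 */
int toy_may_mount(int fs_flags, bool has_global_cap_sys_admin,
		  bool has_userns_cap_sys_admin)
{
	if (has_global_cap_sys_admin)
		return 0;
	if ((fs_flags & TOY_FS_USERNS_MOUNT) && has_userns_cap_sys_admin)
		return 0;
	return -EPERM;
}
```

In this model, root inside a user namespace is refused for a type
flagged only TOY_FS_HAS_SUBTYPE and is allowed once
TOY_FS_USERNS_MOUNT is added, which is the change this patch makes for
both "fuse" and "fuseblk".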

Patch v4 is available: https://patchwork.kernel.org/patch/8944681/

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
[dongsu: add a simple commit message]
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 fs/fuse/inode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7f6b2e55..8c98edee 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
 static struct file_system_type fuse_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "fuse",
-	.fs_flags	= FS_HAS_SUBTYPE,
+	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 	.mount		= fuse_mount,
 	.kill_sb	= fuse_kill_sb_anon,
 };
@@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
 	.name		= "fuseblk",
 	.mount		= fuse_mount_blk,
 	.kill_sb	= fuse_kill_sb_blk,
-	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
 };
 MODULE_ALIAS_FS("fuseblk");
 
-- 
2.13.6


* [PATCH 11/11] evm: Don't update hmacs in user ns mounts
       [not found] ` <cover.1512741134.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2017-12-22 14:32   ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
@ 2017-12-22 14:32   ` Dongsu Park
  2017-12-25  7:05   ` [PATCH v5 00/11] FUSE mounts from non-init user namespaces Eric W. Biederman
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Miklos Szeredi, containers, Seth Forshee, linux-security-module,
	Alban Crequy, Eric W . Biederman, James Morris, Sargun Dhillon,
	linux-integrity, Mimi Zohar

From: Seth Forshee <seth.forshee@canonical.com>

The kernel should not calculate new hmacs for mounts done by
non-root users. Update evm_calc_hmac_or_hash() to refuse to
calculate new hmacs for mounts from non-init user namespaces.

Cc: linux-integrity@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.morris@oracle.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 security/integrity/evm/evm_crypto.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index bcd64baf..729f4545 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
 	int error;
 	int size;
 
-	if (!(inode->i_opflags & IOP_XATTR))
+	if (!(inode->i_opflags & IOP_XATTR) ||
+	    inode->i_sb->s_user_ns != &init_user_ns)
 		return -EOPNOTSUPP;
 
 	desc = init_desc(type);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 11/11] evm: Don't update hmacs in user ns mounts
  2017-12-22 14:32 [PATCH v5 00/11] FUSE mounts from non-init user namespaces Dongsu Park
@ 2017-12-22 14:32   ` Dongsu Park
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-22 14:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, Dongsu Park, linux-integrity,
	linux-security-module, James Morris, Mimi Zohar, Serge E. Hallyn

From: Seth Forshee <seth.forshee@canonical.com>

The kernel should not calculate new hmacs for mounts done by
non-root users. Update evm_calc_hmac_or_hash() to refuse to
calculate new hmacs for mounts from non-init user namespaces.

Cc: linux-integrity@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: James Morris <james.l.morris@oracle.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
---
 security/integrity/evm/evm_crypto.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index bcd64baf..729f4545 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
 	int error;
 	int size;
 
-	if (!(inode->i_opflags & IOP_XATTR))
+	if (!(inode->i_opflags & IOP_XATTR) ||
+	    inode->i_sb->s_user_ns != &init_user_ns)
 		return -EOPNOTSUPP;
 
 	desc = init_desc(type);
-- 
2.13.6
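
The guard added by the hunk above can be modelled in userspace roughly as
follows. This is only a sketch: the *_model structs, init_user_ns_model and
the IOP_XATTR_MODEL value are stand-ins invented for illustration, not the
kernel's definitions, and evm_may_calc_hmac() is a hypothetical helper that
isolates the early-return condition of evm_calc_hmac_or_hash().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields the guard reads. */
struct user_ns_model { int level; };
struct sb_model { const struct user_ns_model *s_user_ns; };
struct inode_model {
	unsigned int i_opflags;
	const struct sb_model *i_sb;
};

#define IOP_XATTR_MODEL 0x0008u	/* illustrative value only */

/* Plays the role of init_user_ns: the namespace of "real" root. */
static const struct user_ns_model init_user_ns_model = { 0 };

/*
 * Returns 0 when a new hmac may be calculated, -EOPNOTSUPP when the inode
 * has no xattr support or its superblock belongs to a non-init user namespace.
 */
static int evm_may_calc_hmac(const struct inode_model *inode)
{
	assert(inode != NULL && inode->i_sb != NULL);

	if (!(inode->i_opflags & IOP_XATTR_MODEL) ||
	    inode->i_sb->s_user_ns != &init_user_ns_model)
		return -EOPNOTSUPP;
	return 0;
}
```

Note that the comparison is by pointer identity, as in the patch: any
superblock whose s_user_ns is not the initial namespace itself is refused,
regardless of who the mounter is inside that namespace.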

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
@ 2017-12-22 18:59   ` Coly Li
  2017-12-23 12:00     ` Dongsu Park
       [not found]     ` <17fbec10-68b1-2d2b-d417-2cdfee22b0fa-53JG2FQvpdo@public.gmane.org>
       [not found]   ` <ef5e609602df6d7e2b4aa07b92600f04b6851902.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  2017-12-23  3:03   ` Serge E. Hallyn
  2 siblings, 2 replies; 219+ messages in thread
From: Coly Li @ 2017-12-22 18:59 UTC (permalink / raw)
  To: Dongsu Park, linux-kernel
  Cc: containers, Alban Crequy, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, dm-devel, linux-bcache,
	linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara, Serge Hallyn

On 22/12/2017 10:32 PM, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
> 
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
> 
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
> 
> Cc: dm-devel@redhat.com
> Cc: linux-bcache@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mtd@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Jan Kara <jack@suse.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Hi Dongsu,

Could you please use a macro such as NO_PERMISSION_CHECK in place of the
hard-coded 0? That way, at least for me, there is no need to look up what
0 means in the new lookup_bdev().

Thanks.

Coly Li

> ---
>  drivers/md/bcache/super.c |  2 +-
>  drivers/md/dm-table.c     |  2 +-
>  drivers/mtd/mtdsuper.c    |  2 +-
>  fs/block_dev.c            | 13 ++++++++++---
>  fs/quota/quota.c          |  2 +-
>  include/linux/fs.h        |  2 +-
>  6 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  				  sb);
>  	if (IS_ERR(bdev)) {
>  		if (bdev == ERR_PTR(-EBUSY)) {
> -			bdev = lookup_bdev(strim(path));
> +			bdev = lookup_bdev(strim(path), 0);
>  			mutex_lock(&bch_register_lock);
>  			if (!IS_ERR(bdev) && bch_is_open(bdev))
>  				err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
[snip]


-- 
Coly Li
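
The suggestion above might look roughly like the following sketch. The macro
name comes from the review comment; everything else is assumed for
illustration: lookup_would_check() is a hypothetical stand-in for
lookup_bdev() that only reports whether the mask asks for a permission check,
and the MAY_* values are local copies of kernel-internal bits.

```c
#include <assert.h>
#include <stddef.h>

/* Name proposed in the review; the value matches the "mask of 0" in the patch. */
#define NO_PERMISSION_CHECK 0

/* Illustrative local copies of the kernel-internal access bits. */
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/*
 * Hypothetical stand-in for lookup_bdev(): returns 1 if the given mask would
 * trigger the inode permission check, 0 if the check would be skipped.
 */
static int lookup_would_check(const char *pathname, int mask)
{
	assert(pathname != NULL);
	return mask != NO_PERMISSION_CHECK;
}

/* An existing caller keeps the old behaviour, and the call site says so. */
static int demo_existing_caller(const char *path)
{
	return lookup_would_check(path, NO_PERMISSION_CHECK);
}

/* A later caller that wants the check passes the rights it needs. */
static int demo_checking_caller(const char *path)
{
	return lookup_would_check(path, MAY_READ | MAY_WRITE);
}
```

The behaviour is identical to passing a literal 0; the only gain is that the
intent is visible at each call site.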

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
  2017-12-22 14:32     ` Dongsu Park
  (?)
  (?)
@ 2017-12-22 21:06     ` Richard Weinberger
       [not found]       ` <CAFLxGvwzRBGJf0-jCAwGts1HwV_nT072+yhHLP079sxQezoTFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Richard Weinberger @ 2017-12-22 21:06 UTC (permalink / raw)
  To: Dongsu Park
  Cc: LKML, Miklos Szeredi, Linux Containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-mtd

Dongsu,

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> Unprivileged users should not be able to mount mtd block devices
> when they lack sufficient privileges towards the block device
> inode.  Update mount_mtd() to validate that the user has the
> required access to the inode at the specified path. The check
> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> continue working as before.

What is the big picture here?
Will an unprivileged user be able to simply mount UBIFS in the future?
Please note that UBIFS sits on top of a character device, not a block device.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
  2017-12-22 14:32 ` [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev() Dongsu Park
  2017-12-22 18:59   ` Coly Li
       [not found]   ` <ef5e609602df6d7e2b4aa07b92600f04b6851902.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:03   ` Serge E. Hallyn
  2 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:03 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, dm-devel,
	linux-bcache, linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara,
	Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:25PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
> 
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
> 
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
> 
> Cc: dm-devel@redhat.com
> Cc: linux-bcache@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mtd@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Jan Kara <jack@suse.com>
> Cc: Serge Hallyn <serge@hallyn.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  drivers/md/bcache/super.c |  2 +-
>  drivers/md/dm-table.c     |  2 +-
>  drivers/mtd/mtdsuper.c    |  2 +-
>  fs/block_dev.c            | 13 ++++++++++---
>  fs/quota/quota.c          |  2 +-
>  include/linux/fs.h        |  2 +-
>  6 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  				  sb);
>  	if (IS_ERR(bdev)) {
>  		if (bdev == ERR_PTR(-EBUSY)) {
> -			bdev = lookup_bdev(strim(path));
> +			bdev = lookup_bdev(strim(path), 0);
>  			mutex_lock(&bch_register_lock);
>  			if (!IS_ERR(bdev) && bch_is_open(bdev))
>  				err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
>  	dev_t dev;
>  	struct block_device *bdev;
>  
> -	bdev = lookup_bdev(path);
> +	bdev = lookup_bdev(path, 0);
>  	if (IS_ERR(bdev))
>  		dev = name_to_dev_t(path);
>  	else {
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index e43fea89..4a4d40c0 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
>  	/* try the old way - the hack where we allowed users to mount
>  	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>  	 */
> -	bdev = lookup_bdev(dev_name);
> +	bdev = lookup_bdev(dev_name, 0);
>  	if (IS_ERR(bdev)) {
>  		ret = PTR_ERR(bdev);
>  		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb..5ca06095 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
>  	struct block_device *bdev;
>  	int err;
>  
> -	bdev = lookup_bdev(path);
> +	bdev = lookup_bdev(path, 0);
>  	if (IS_ERR(bdev))
>  		return bdev;
>  
> @@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
>  /**
>   * lookup_bdev  - lookup a struct block_device by name
>   * @pathname:	special file representing the block device
> + * @mask:	rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
>   *
>   * Get a reference to the blockdevice at @pathname in the current
>   * namespace if possible and return it.  Return ERR_PTR(error)
> - * otherwise.
> + * otherwise.  If @mask is non-zero, check for access rights to the
> + * inode at @pathname.
>   */
> -struct block_device *lookup_bdev(const char *pathname)
> +struct block_device *lookup_bdev(const char *pathname, int mask)
>  {
>  	struct block_device *bdev;
>  	struct inode *inode;
> @@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
>  		return ERR_PTR(error);
>  
>  	inode = d_backing_inode(path.dentry);
> +	if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
> +		error = __inode_permission(inode, mask);
> +		if (error)
> +			goto fail;
> +	}
>  	error = -ENOTBLK;
>  	if (!S_ISBLK(inode->i_mode))
>  		goto fail;
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 43612e2a..e5d47955 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
>  
>  	if (IS_ERR(tmp))
>  		return ERR_CAST(tmp);
> -	bdev = lookup_bdev(tmp->name);
> +	bdev = lookup_bdev(tmp->name, 0);
>  	putname(tmp);
>  	if (IS_ERR(bdev))
>  		return ERR_CAST(bdev);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2995a271..fce19c49 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2551,7 +2551,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
>  #define BLKDEV_MAJOR_MAX	512
>  extern const char *__bdevname(dev_t, char *buffer);
>  extern const char *bdevname(struct block_device *bdev, char *buffer);
> -extern struct block_device *lookup_bdev(const char *);
> +extern struct block_device *lookup_bdev(const char *, int mask);
>  extern void blkdev_show(struct seq_file *,off_t);
>  
>  #else
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
  2017-12-22 14:32     ` Dongsu Park
                       ` (2 preceding siblings ...)
  (?)
@ 2017-12-23  3:05     ` Serge E. Hallyn
  -1 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:05 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, Miklos Szeredi, containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-mtd

On Fri, Dec 22, 2017 at 03:32:26PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Unprivileged users should not be able to mount mtd block devices
> when they lack sufficient privileges towards the block device
> inode.  Update mount_mtd() to validate that the user has the
> required access to the inode at the specified path. The check
> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> continue working as before.
> 
> Patch v3 is available: https://patchwork.kernel.org/patch/7640011/
> 
> Cc: linux-mtd@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
>  drivers/mtd/mtdsuper.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index 4a4d40c0..3c8734f3 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
>  #ifdef CONFIG_BLOCK
>  	struct block_device *bdev;
>  	int ret, major;
> +	int perm;
>  #endif
>  	int mtdnr;
>  
> @@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
>  	/* try the old way - the hack where we allowed users to mount
>  	 * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
>  	 */
> -	bdev = lookup_bdev(dev_name, 0);
> +	perm = MAY_READ;
> +	if (!(flags & MS_RDONLY))
> +		perm |= MAY_WRITE;
> +	bdev = lookup_bdev(dev_name, perm);
>  	if (IS_ERR(bdev)) {
>  		ret = PTR_ERR(bdev);
>  		pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> -- 
> 2.13.6
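
The mask computation in the quoted mount_mtd() hunk reduces to the small
helper below. It is a sketch only: mtd_lookup_mask() is a hypothetical name,
the MAY_* values are local illustrative copies of kernel-internal bits, and
MS_RDONLY is taken from the userspace <sys/mount.h> header rather than from
the kernel.

```c
#include <assert.h>
#include <sys/mount.h>	/* MS_RDONLY */

/* Illustrative local copies of the kernel-internal access bits. */
#define MAY_WRITE 0x2
#define MAY_READ  0x4

/*
 * Hypothetical helper: derive the access mask to pass to lookup_bdev()
 * from the mount flags, as the patched mount_mtd() does inline.
 */
static int mtd_lookup_mask(unsigned long flags)
{
	int perm = MAY_READ;		/* read access is always required */

	if (!(flags & MS_RDONLY))
		perm |= MAY_WRITE;	/* read-write mounts also need write access */

	assert((perm & MAY_READ) != 0);
	return perm;
}
```

A read-only mount therefore only needs read access to the block device
inode, while a read-write mount needs both read and write access.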

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
@ 2017-12-23  3:17       ` Serge E. Hallyn
  2018-01-05 19:24   ` Luis R. Rodriguez
  2018-02-13 13:18   ` Miklos Szeredi
  2 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:17 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi, Kees Cook,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee,
	Luis R. Rodriguez, Alban Crequy, Eric W . Biederman,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Alexander Viro

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> 
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to

Note: the capability is CAP_CHOWN; there is no CAP_SYS_CHOWN.

> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
> 
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
> 
> Once chown has been called the existing capable_wrt_inode_uidgid
> checks are sufficient, to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
> 
> For the proc filesystem this relaxation of permissions is not safe, as
> some files are owned by users (particularly GLOBAL_ROOT_UID) outside
> of the control of the mounter of the proc and that would be unsafe to
> grant chown access to.  So update setattr on proc to disallow changing
> files whose uids or gids are outside of proc's s_user_ns.
> 
> The original version of this patch was written by: Seth Forshee.  I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotcha's and putting up with my
> semi-paranoid feedback.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944611/
> 
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> Cc: "Luis R. Rodriguez" <mcgrof-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> Inspired-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> [saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>

Reviewed-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

> ---
>  fs/attr.c             | 34 ++++++++++++++++++++++++++--------
>  fs/proc/base.c        |  7 +++++++
>  fs/proc/generic.c     |  7 +++++++
>  fs/proc/proc_sysctl.c |  7 +++++++
>  4 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    uid_eq(uid, inode->i_uid))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:	dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>  		goto kill_priv;
>  
>  	/* Make sure a caller can chown. */
> -	if ((ia_valid & ATTR_UID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>  		return -EPERM;
>  
>  	/* Make sure caller can chgrp. */
> -	if ((ia_valid & ATTR_GID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>  		return -EPERM;
>  
>  	/* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	int error;
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  
>  	if (attr->ia_valid & ATTR_MODE)
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 793a6757..527d46c8 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
>  {
>  	struct inode *inode = d_inode(dentry);
>  	struct proc_dir_entry *de = PDE(inode);
> +	struct user_namespace *s_user_ns;
>  	int error;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, iattr);
>  	if (error)
>  		return error;
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index c5cbbdff..0f9562d1 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
>  static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  	int error;
>  
>  	if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;
> -- 
> 2.13.6
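
The guard added to the three proc setattr paths can be illustrated with a user-space model. This is a sketch under simplifying assumptions: each namespace has a single mapped id range, and the model_* names are hypothetical stand-ins for kuid_has_mapping(), kgid_has_mapping() and the superblock's s_user_ns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical one-extent id map: kernel ids [first, first + count) are mapped. */
struct model_idmap {
	unsigned int first;
	unsigned int count;
};

/* Simplified stand-in for the user namespace that owns the proc superblock. */
struct model_userns {
	struct model_idmap uid_map;
	struct model_idmap gid_map;
};

/* True when the kernel id falls inside the namespace's mapped range. */
static bool model_id_has_mapping(const struct model_idmap *map, unsigned int kid)
{
	return kid >= map->first && kid - map->first < map->count;
}

/* Mirrors the added guard: 0 means "continue to setattr_prepare()", -EPERM means refuse. */
static int model_proc_setattr_guard(const struct model_userns *s_user_ns,
				    unsigned int i_uid, unsigned int i_gid)
{
	if (!model_id_has_mapping(&s_user_ns->uid_map, i_uid) ||
	    !model_id_has_mapping(&s_user_ns->gid_map, i_gid))
		return -EPERM;
	return 0;
}

/* Small usage demonstration with a container mapping kernel ids 100000..165535. */
void demo_proc_setattr_guard(void)
{
	struct model_userns ns = { { 100000, 65536 }, { 100000, 65536 } };

	assert(model_proc_setattr_guard(&ns, 0, 0) == -EPERM);       /* owned by global root */
	assert(model_proc_setattr_guard(&ns, 100000, 100000) == 0);  /* owned inside the map */
}
```

Files owned by global root, which has no mapping in the container's namespace, are rejected before the relaxed chown checks are ever reached.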

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
@ 2017-12-23  3:17       ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:17 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Luis R. Rodriguez, Kees Cook

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
> 
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to

Note: the capability is CAP_CHOWN; there is no CAP_SYS_CHOWN.

> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
> 
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
> 
> Once chown has been called the existing capable_wrt_inode_uidgid
> checks are sufficient, to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
> 
> For the proc filesystem this relaxation of permissions is not safe, as
> some files are owned by users (particularly GLOBAL_ROOT_UID) outside
> of the control of the mounter of the proc and that would be unsafe to
> grant chown access to.  So update setattr on proc to disallow changing
> files whose uids or gids are outside of proc's s_user_ns.
> 
> The original version of this patch was written by: Seth Forshee.  I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotcha's and putting up with my
> semi-paranoid feedback.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944611/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Inspired-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> [saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/attr.c             | 34 ++++++++++++++++++++++++++--------
>  fs/proc/base.c        |  7 +++++++
>  fs/proc/generic.c     |  7 +++++++
>  fs/proc/proc_sysctl.c |  7 +++++++
>  4 files changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    uid_eq(uid, inode->i_uid))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:	dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>  		goto kill_priv;
>  
>  	/* Make sure a caller can chown. */
> -	if ((ia_valid & ATTR_UID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>  		return -EPERM;
>  
>  	/* Make sure caller can chgrp. */
> -	if ((ia_valid & ATTR_GID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>  		return -EPERM;
>  
>  	/* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	int error;
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  
>  	if (attr->ia_valid & ATTR_MODE)
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 793a6757..527d46c8 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
>  {
>  	struct inode *inode = d_inode(dentry);
>  	struct proc_dir_entry *de = PDE(inode);
> +	struct user_namespace *s_user_ns;
>  	int error;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, iattr);
>  	if (error)
>  		return error;
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index c5cbbdff..0f9562d1 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
>  static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  	int error;
>  
>  	if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;
> -- 
> 2.13.6
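
The decision order in the new chown_ok() helper can be restated as a small user-space model. This is a sketch, not kernel code: the two capability checks are reduced to precomputed booleans, and struct model_ctx and model_chown_ok() are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical caller context; the booleans stand in for the kernel's capability checks. */
struct model_ctx {
	unsigned int fsuid;     /* caller's filesystem uid */
	bool cap_wrt_inode;     /* models capable_wrt_inode_uidgid(inode, CAP_CHOWN) */
	bool cap_in_sb_userns;  /* models ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN) */
};

/* Same three-step order as chown_ok() in the patch. */
static bool model_chown_ok(const struct model_ctx *c,
			   unsigned int i_uid, unsigned int new_uid)
{
	if (c->fsuid == i_uid && new_uid == i_uid)
		return true;            /* owner performing a no-op chown */
	if (c->cap_wrt_inode)
		return true;            /* existing check: inode ids map into caller's userns */
	if (c->cap_in_sb_userns)
		return true;            /* new check: privileged over the superblock's userns */
	return false;
}

/* Small usage demonstration. */
void demo_chown_ok(void)
{
	struct model_ctx owner = { 1000, false, false };
	struct model_ctx sb_root = { 0, false, true };

	assert(model_chown_ok(&owner, 1000, 1000));   /* no-op chown by the owner */
	assert(!model_chown_ok(&owner, 1000, 1001));  /* owner cannot give the file away */
	assert(model_chown_ok(&sb_root, 4242, 0));    /* unmapped owner, superblock-level cap */
}
```

The third case is the one the patch adds: the inode's owner has no mapping for the caller, so the second check fails, yet the caller may still take ownership because it holds CAP_CHOWN over the superblock's namespace.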

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
       [not found]   ` <ddf1fb9b5001e633e0022dee7fecb0ef431e851f.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:26     ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:26 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro

On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Expand the check in should_remove_suid() to keep privileges for

I realize this description came from Seth, but reading it now,
'Expand' seems wrong.  Expanding a check brings to my mind making
it stricter, not looser.  How about 'Relax the check'?

> CAP_FSETID in s_user_ns rather than init_user_ns.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
> 
> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Why exactly?

This is wrong, because capable_wrt_inode_uidgid() checks against
current_user_ns(), not inode->i_sb->s_user_ns.

> 
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
> ---
>  fs/inode.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fd401028..6459a437 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>   */
>  int should_remove_suid(struct dentry *dentry)
>  {
> -	umode_t mode = d_inode(dentry)->i_mode;
> +	struct inode *inode = d_inode(dentry);
> +	umode_t mode = inode->i_mode;
>  	int kill = 0;
>  
>  	/* suid always must be killed */
> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>  	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>  		kill |= ATTR_KILL_SGID;
>  
> -	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
> +	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
> +		     S_ISREG(mode)))
>  		return kill;
>  
>  	return 0;
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
  2017-12-22 14:32 ` [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root Dongsu Park
       [not found]   ` <ddf1fb9b5001e633e0022dee7fecb0ef431e851f.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:26   ` Serge E. Hallyn
  2017-12-23 12:38     ` Dongsu Park
       [not found]     ` <20171223032606.GD6837-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:26 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Expand the check in should_remove_suid() to keep privileges for

I realize this description came from Seth, but reading it now,
'Expand' seems wrong.  Expanding a check brings to my mind making
it stricter, not looser.  How about 'Relax the check'?

> CAP_FSETID in s_user_ns rather than init_user_ns.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
> 
> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Why exactly?

This is wrong, because capable_wrt_inode_uidgid() checks against
current_user_ns(), not inode->i_sb->s_user_ns.

> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Serge Hallyn <serge@hallyn.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/inode.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fd401028..6459a437 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>   */
>  int should_remove_suid(struct dentry *dentry)
>  {
> -	umode_t mode = d_inode(dentry)->i_mode;
> +	struct inode *inode = d_inode(dentry);
> +	umode_t mode = inode->i_mode;
>  	int kill = 0;
>  
>  	/* suid always must be killed */
> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>  	if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>  		kill |= ATTR_KILL_SGID;
>  
> -	if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
> +	if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
> +		     S_ISREG(mode)))
>  		return kill;
>  
>  	return 0;
> -- 
> 2.13.6
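
For reference, the decision made by should_remove_suid() can be modelled in user space with the capability check reduced to a boolean. This is a sketch: the MODEL_* mode bits and ATTR_KILL values are local stand-ins assumed to match the usual Linux definitions, and has_fsetid represents whichever capability check the kernel ends up using, which is the point under discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the usual Linux mode bits and kernel ATTR_KILL_* flags. */
#define MODEL_S_IFMT   0170000
#define MODEL_S_IFREG  0100000
#define MODEL_S_ISUID  04000
#define MODEL_S_ISGID  02000
#define MODEL_S_IXGRP  00010
#define MODEL_ATTR_KILL_SUID (1 << 11)
#define MODEL_ATTR_KILL_SGID (1 << 12)

/* Returns the kill mask, or 0 when the suid/sgid bits may be kept. */
static int model_should_remove_suid(unsigned int mode, bool has_fsetid)
{
	int kill = 0;

	if (mode & MODEL_S_ISUID)                 /* suid is always a candidate */
		kill = MODEL_ATTR_KILL_SUID;
	if ((mode & MODEL_S_ISGID) && (mode & MODEL_S_IXGRP))
		kill |= MODEL_ATTR_KILL_SGID;     /* sgid only counts with group exec */

	/* Only regular files written by a caller without CAP_FSETID lose the bits. */
	if (kill && !has_fsetid && (mode & MODEL_S_IFMT) == MODEL_S_IFREG)
		return kill;
	return 0;
}

/* Small usage demonstration. */
void demo_should_remove_suid(void)
{
	assert(model_should_remove_suid(0104755, false) == MODEL_ATTR_KILL_SUID);
	assert(model_should_remove_suid(0104755, true) == 0);
	assert(model_should_remove_suid(0102644, false) == 0); /* sgid without group exec */
}
```

The patch changes only how has_fsetid is computed; the suid and sgid selection logic is untouched.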

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()
       [not found]   ` <8dd484dceb9e96e5b67f21b8a0cf333753985e89.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:30     ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:30 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro

On Fri, Dec 22, 2017 at 03:32:29PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944631/
> 
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
> Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

Acked-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
> ---
>  fs/namespace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e158ec6b..830040d7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
>  		 * Special case for "unmounting" root ...
>  		 * we just try to remount it readonly.
>  		 */
> -		if (!capable(CAP_SYS_ADMIN))
> +		if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  			return -EPERM;
>  		down_write(&sb->s_umount);
>  		if (!sb_rdonly(sb))
> @@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
>  	down_write(&sb->s_umount);
>  	if (ms_flags & MS_BIND)
>  		err = change_mount_flags(path->mnt, ms_flags);
> -	else if (!capable(CAP_SYS_ADMIN))
> +	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		err = -EPERM;
>  	else
>  		err = do_remount_sb(sb, sb_flags, data, 0);
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()
  2017-12-22 14:32 ` [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb() Dongsu Park
       [not found]   ` <8dd484dceb9e96e5b67f21b8a0cf333753985e89.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:30   ` Serge E. Hallyn
  1 sibling, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:30 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:29PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944631/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/namespace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e158ec6b..830040d7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
>  		 * Special case for "unmounting" root ...
>  		 * we just try to remount it readonly.
>  		 */
> -		if (!capable(CAP_SYS_ADMIN))
> +		if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  			return -EPERM;
>  		down_write(&sb->s_umount);
>  		if (!sb_rdonly(sb))
> @@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
>  	down_write(&sb->s_umount);
>  	if (ms_flags & MS_BIND)
>  		err = change_mount_flags(path->mnt, ms_flags);
> -	else if (!capable(CAP_SYS_ADMIN))
> +	else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		err = -EPERM;
>  	else
>  		err = do_remount_sb(sb, sb_flags, data, 0);
> -- 
> 2.13.6
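
What "privileged towards the userns which originally mounted the filesystem" means can be sketched with a simplified user-space model of the namespace walk behind ns_capable(). The model_* types and names are hypothetical, a single boolean stands in for CAP_SYS_ADMIN, and LSM hooks are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user namespace: a parent pointer and the euid that created it. */
struct model_userns {
	const struct model_userns *parent; /* NULL for the initial namespace */
	unsigned int owner;                /* creator's euid, as seen in the parent */
};

/* Simplified credentials of the calling task. */
struct model_cred {
	const struct model_userns *user_ns;
	unsigned int euid;
	bool cap_sys_admin;                /* capability held in cred->user_ns */
};

/* Walk from the target namespace towards the root, looking for the caller's namespace. */
static bool model_ns_capable(const struct model_cred *cred,
			     const struct model_userns *target)
{
	const struct model_userns *ns;

	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;   /* same ns: the capability decides */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;                  /* creator of a child ns owns it fully */
	}
	return false;                                 /* target is not a descendant */
}

/* Small usage demonstration: a container namespace created by uid 1000. */
void demo_ns_capable(void)
{
	struct model_userns init_ns = { NULL, 0 };
	struct model_userns child = { &init_ns, 1000 };
	struct model_cred host_root = { &init_ns, 0, true };
	struct model_cred container_root = { &child, 0, true };

	assert(model_ns_capable(&host_root, &child));        /* ancestor with the cap */
	assert(model_ns_capable(&container_root, &child));   /* its own superblock */
	assert(!model_ns_capable(&container_root, &init_ns)); /* not a host filesystem */
}
```

Under this model, root inside the container can remount a superblock whose s_user_ns is the container's namespace, but is still refused for superblocks owned by the initial namespace.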

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs
       [not found]     ` <5adc5e31c25beb987798ecc219df79671547a9ac.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:33       ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:33 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA, Alban Crequy,
	Eric W . Biederman, James Morris, Sargun Dhillon

On Fri, Dec 22, 2017 at 03:32:30PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> A privileged user in s_user_ns will generally have the ability to
> manipulate the backing store and insert security.* xattrs into
> the filesystem directly. Therefore the kernel must be prepared to
> handle these xattrs from unprivileged mounts, and it makes little
> sense for commoncap to prevent writing these xattrs to the
> filesystem. The capability and LSM code have already been updated
> to appropriately handle xattrs from unprivileged mounts, so it
> is safe to loosen this restriction on setting xattrs.
> 
> The exception to this logic is that writing xattrs to a mounted
> filesystem may also cause the LSM inode_post_setxattr or
> inode_setsecurity callbacks to be invoked. SELinux will deny the
> xattr update by virtue of applying mountpoint labeling to
> unprivileged userns mounts, and Smack will deny the writes for
> any user without global CAP_MAC_ADMIN, so loosening the
> capability check in commoncap is safe in this respect as well.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944641/
> 
> Cc: linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: James Morris <james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

Reviewed-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
> ---
>  security/commoncap.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4f8e0934..dd0afef9 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  		       const void *value, size_t size, int flags)
>  {
> +	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>  	/* Ignore non-security xattrs */
>  	if (strncmp(name, XATTR_SECURITY_PREFIX,
>  			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>  		return 0;
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  	return 0;
>  }
> @@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> +	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>  	/* Ignore non-security xattrs */
>  	if (strncmp(name, XATTR_SECURITY_PREFIX,
>  			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  		return 0;
>  	}
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  	return 0;
>  }
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs
  2017-12-22 14:32     ` Dongsu Park
@ 2017-12-23  3:33       ` Serge E. Hallyn
  -1 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:33 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon,
	linux-security-module, James Morris, Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:30PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> A privileged user in s_user_ns will generally have the ability to
> manipulate the backing store and insert security.* xattrs into
> the filesystem directly. Therefore the kernel must be prepared to
> handle these xattrs from unprivileged mounts, and it makes little
> sense for commoncap to prevent writing these xattrs to the
> filesystem. The capability and LSM code have already been updated
> to appropriately handle xattrs from unprivileged mounts, so it
> is safe to loosen this restriction on setting xattrs.
> 
> The exception to this logic is that writing xattrs to a mounted
> filesystem may also cause the LSM inode_post_setxattr or
> inode_setsecurity callbacks to be invoked. SELinux will deny the
> xattr update by virtue of applying mountpoint labeling to
> unprivileged userns mounts, and Smack will deny the writes for
> any user without global CAP_MAC_ADMIN, so loosening the
> capability check in commoncap is safe in this respect as well.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944641/
> 
> Cc: linux-security-module@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: James Morris <james.l.morris@oracle.com>
> Cc: Serge Hallyn <serge@hallyn.com>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  security/commoncap.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4f8e0934..dd0afef9 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
>  int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  		       const void *value, size_t size, int flags)
>  {
> +	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>  	/* Ignore non-security xattrs */
>  	if (strncmp(name, XATTR_SECURITY_PREFIX,
>  			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>  	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>  		return 0;
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  	return 0;
>  }
> @@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   */
>  int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  {
> +	struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
>  	/* Ignore non-security xattrs */
>  	if (strncmp(name, XATTR_SECURITY_PREFIX,
>  			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
>  		return 0;
>  	}
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  	return 0;
>  }
> -- 
> 2.13.6
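
For readers following the capable() -> ns_capable() change above, the
difference can be modelled in userspace roughly as below. This is only a
simplified sketch, not the kernel's code: struct toy_userns, struct toy_cred,
toy_ns_capable() and toy_capable() are hypothetical names, capabilities are
reduced to one boolean, and LSM hooks are ignored. It assumes the usual
ancestor-walk rule: a task holds the capability over a namespace if it holds
it in that namespace or an ancestor, or if it lives in the parent namespace
and created the child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct toy_userns {
	const struct toy_userns *parent;	/* NULL for the initial namespace */
	unsigned int owner;			/* uid that created this namespace */
};

/* Hypothetical stand-in for the parts of struct cred used here. */
struct toy_cred {
	const struct toy_userns *user_ns;	/* namespace the task lives in */
	unsigned int euid;
	bool cap_sys_admin;			/* CAP_SYS_ADMIN within user_ns */
};

/* Does the task hold CAP_SYS_ADMIN over target? Walk from target upwards. */
static bool toy_ns_capable(const struct toy_cred *cred,
			   const struct toy_userns *target)
{
	const struct toy_userns *ns;

	assert(cred && cred->user_ns && target);

	for (ns = target; ns; ns = ns->parent) {
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* The creator of a namespace has all capabilities over it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
	}
	/* target is neither the task's namespace nor a descendant of it */
	return false;
}

/* The old check: only CAP_SYS_ADMIN over the initial namespace counts. */
static bool toy_capable(const struct toy_cred *cred,
			const struct toy_userns *init_ns)
{
	return toy_ns_capable(cred, init_ns);
}
```

In this model, root inside a container passes toy_ns_capable() for a
superblock whose s_user_ns is the container's namespace, but still fails
toy_capable(), which is the behaviour the patch relies on.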

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
       [not found]     ` <61a37f0b159dd56825696d8d3beb8eaffdf1f72f.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:39       ` Serge E. Hallyn
  2018-02-14 12:28       ` Miklos Szeredi
  1 sibling, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:39 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi, containers, linux-kernel, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-fsdevel,
	Alexander Viro

On Fri, Dec 22, 2017 at 03:32:31PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb..8c628a8d 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
>  {
>  	struct super_block *sb = file_inode(filp)->i_sb;
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  
>  	/* If filesystem doesn't support freeze feature, return. */
> @@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
>  {
>  	struct super_block *sb = file_inode(filp)->i_sb;
>  
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  
>  	/* Thaw */
> -- 
> 2.13.6
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
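
For context, the two ioctls whose permission check is relaxed here are driven
from userspace roughly as in the sketch below. freeze_and_thaw() is a
hypothetical helper written for illustration; FIFREEZE and FITHAW come from
<linux/fs.h>. With this patch applied, the expectation is that the calls fail
with EPERM only when the caller lacks CAP_SYS_ADMIN in the superblock's
s_user_ns, rather than in the initial namespace.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Freeze and immediately thaw the filesystem that contains path.
 * Returns 0 on success or a negative errno value.
 */
static int freeze_and_thaw(const char *path)
{
	int fd, ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, FIFREEZE, 0) < 0)
		ret = -errno;		/* e.g. -EPERM or -EOPNOTSUPP */
	else if (ioctl(fd, FITHAW, 0) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Calling freeze_and_thaw() on a mount point inside the user namespace that owns
the mount is one way to exercise the new check; do not point it at a
filesystem that other workloads depend on.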

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2017-12-23  3:46       ` Serge E. Hallyn
       [not found]   ` <c85c293e19a478353aba8e6e3ee39e5914f798d5.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:46 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi, containers, linux-kernel, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 03:32:32PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
> 
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
> 
>  - The namespace must be the same as s_user_ns.
> 
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
> 
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>  
>  #include "fuse_i.h"
>  
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>  	if (!cc)
>  		return -ENOMEM;
>  
> -	fuse_conn_init(&cc->fc);
> +	fuse_conn_init(&cc->fc, current_user_ns());
>  
>  	fud = fuse_dev_alloc(&cc->fc);
>  	if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>  
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>  	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>  
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>  	__set_bit(FR_WAITING, &req->flags);
>  	if (for_background)
>  		__set_bit(FR_BACKGROUND, &req->flags);
> +	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +		fuse_put_request(fc, req);
> +		return ERR_PTR(-EOVERFLOW);
> +	}
>  
>  	return req;
>  
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>  	in = &req->in;
>  	reqsize = in->h.len;
>  
> -	if (task_active_pid_ns(current) != fc->pid_ns) {
> +	if (task_active_pid_ns(current) != fc->pid_ns ||
> +	    current_user_ns() != fc->user_ns) {
>  		rcu_read_lock();
>  		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>  		rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>  	stat->ino = attr->ino;
>  	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>  	stat->nlink = attr->nlink;
> -	stat->uid = make_kuid(&init_user_ns, attr->uid);
> -	stat->gid = make_kgid(&init_user_ns, attr->gid);
> +	stat->uid = make_kuid(fc->user_ns, attr->uid);
> +	stat->gid = make_kgid(fc->user_ns, attr->gid);
>  	stat->rdev = inode->i_rdev;
>  	stat->atime.tv_sec = attr->atime;
>  	stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>  	return true;
>  }
>  
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -			   bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>  	unsigned ivalid = iattr->ia_valid;
>  
>  	if (ivalid & ATTR_MODE)
>  		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>  	if (ivalid & ATTR_UID)
> -		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>  	if (ivalid & ATTR_GID)
> -		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>  	if (ivalid & ATTR_SIZE)
>  		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>  	if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>  
>  	memset(&inarg, 0, sizeof(inarg));
>  	memset(&outarg, 0, sizeof(outarg));
> -	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>  	if (file) {
>  		struct fuse_file *ff = file->private_data;
>  		inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>  
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>  	/** The pid namespace for this mount */
>  	struct pid_namespace *pid_ns;
>  
> +	/** The user namespace for this mount */
> +	struct user_namespace *user_ns;
> +
>  	/** Maximum read size */
>  	unsigned max_read;
>  
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>  
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>  	inode->i_ino     = fuse_squash_ino(attr->ino);
>  	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>  	set_nlink(inode, attr->nlink);
> -	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>  	inode->i_blocks  = attr->blocks;
>  	inode->i_atime.tv_sec   = attr->atime;
>  	inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>  	return err;
>  }
>  
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +			  struct user_namespace *user_ns)
>  {
>  	char *p;
>  	memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>  		case OPT_USER_ID:
>  			if (fuse_match_uint(&args[0], &uv))
>  				return 0;
> -			d->user_id = make_kuid(current_user_ns(), uv);
> +			d->user_id = make_kuid(user_ns, uv);
>  			if (!uid_valid(d->user_id))
>  				return 0;
>  			d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>  		case OPT_GROUP_ID:
>  			if (fuse_match_uint(&args[0], &uv))
>  				return 0;
> -			d->group_id = make_kgid(current_user_ns(), uv);
> +			d->group_id = make_kgid(user_ns, uv);
>  			if (!gid_valid(d->group_id))
>  				return 0;
>  			d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>  	struct super_block *sb = root->d_sb;
>  	struct fuse_conn *fc = get_fuse_conn_super(sb);
>  
> -	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>  	if (fc->default_permissions)
>  		seq_puts(m, ",default_permissions");
>  	if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>  	fpq->connected = 1;
>  }
>  
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>  	memset(fc, 0, sizeof(*fc));
>  	spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>  	fc->attr_version = 1;
>  	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>  	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +	fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>  
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>  		if (fc->destroy_req)
>  			fuse_request_free(fc->destroy_req);
>  		put_pid_ns(fc->pid_ns);
> +		put_user_ns(fc->user_ns);
>  		fc->release(fc);
>  	}
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>  
> -	if (!parse_fuse_opt(data, &d, is_bdev))
> +	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>  		goto err;
>  
>  	if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  	if (!file)
>  		goto err;
>  
> -	if ((file->f_op != &fuse_dev_operations) ||
> -	    (file->f_cred->user_ns != &init_user_ns))
> +	/*
> +	 * Require mount to happen from the same user namespace which
> +	 * opened /dev/fuse to prevent potential attacks.
> +	 */
> +	if (file->f_op != &fuse_dev_operations ||
> +	    file->f_cred->user_ns != sb->s_user_ns)
>  		goto err_fput;
>  
>  	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  	if (!fc)
>  		goto err_fput;
>  
> -	fuse_conn_init(fc);
> +	fuse_conn_init(fc, sb->s_user_ns);
>  	fc->release = fuse_free_conn;
>  
>  	fud = fuse_dev_alloc(fc);
> -- 
> 2.13.6
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
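
The id translation this patch introduces can be illustrated with a small
userspace model. It is a sketch under stated assumptions, not the kernel
implementation: struct toy_uid_map and the toy_*() functions are hypothetical,
and the map is a single extent in the style of one /proc/<pid>/uid_map line.
The model keeps one property of the real helpers: an id with no mapping
translates to (uid_t)-1, which is the value the new check in __fuse_get_req()
turns into -EOVERFLOW.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* One extent of an id map: ids [first, first + count) inside the namespace. */
struct toy_uid_map {
	uint32_t first;		/* first id as seen inside the namespace */
	uint32_t lower_first;	/* corresponding kernel-wide id */
	uint32_t count;		/* number of ids in the extent */
};

/* Namespace id -> kernel id (the make_kuid() direction). */
static uint32_t toy_make_kuid(const struct toy_uid_map *map, uint32_t uid)
{
	assert(map);
	if (uid < map->first || uid - map->first >= map->count)
		return TOY_INVALID_ID;
	return map->lower_first + (uid - map->first);
}

/* Kernel id -> namespace id (the from_kuid() direction). */
static uint32_t toy_from_kuid(const struct toy_uid_map *map, uint32_t kuid)
{
	assert(map);
	if (kuid < map->lower_first || kuid - map->lower_first >= map->count)
		return TOY_INVALID_ID;
	return map->first + (kuid - map->lower_first);
}

/* Mirrors the new request check: callers without a mapping are refused. */
static int toy_request_allowed(const struct toy_uid_map *map,
			       uint32_t caller_kuid)
{
	return toy_from_kuid(map, caller_kuid) != TOY_INVALID_ID;
}
```

With a map of { 0, 100000, 65536 }, uid 0 reported by the fuse server becomes
kernel id 100000, while a caller whose kernel id is 1000 has no representation
in the server's namespace and its request would be rejected.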

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
       [not found]   ` <d055925e5d5c0099e9e9c871004fb45fab67e4bc.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2017-12-23  3:50     ` Serge E. Hallyn
  2018-02-19 23:16       ` Eric W. Biederman
  1 sibling, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:50 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 22, 2017 at 03:32:33PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
> 
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
> Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>

Reviewed-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

> ---
>  fs/fuse/dir.c           | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>  	const struct cred *cred;
>  
>  	if (fc->allow_other)
> -		return 1;
> +		return current_in_userns(fc->user_ns);
>  
>  	cred = current_cred();
>  	if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
>  {
>  	return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);

I have to say I'm not happy with this name.  I wish it had been
called current_under_userns or something to indicate it may also
be in a child.
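
To make the naming concern concrete, here is a small userspace sketch of
the ancestor walk that in_userns() performs. The struct and function
names (user_ns_model, model_in_userns) are made up for illustration and
are not the kernel definitions; the sketch only assumes that each
namespace records its parent and its nesting level.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct user_namespace: only parent and depth. */
struct user_ns_model {
	const struct user_ns_model *parent;
	int level;	/* 0 for the initial namespace, parent->level + 1 otherwise */
};

/*
 * Returns true when child is ancestor itself or one of its descendants.
 * Walk up from child until we reach ancestor's depth, then compare.
 */
bool model_in_userns(const struct user_ns_model *ancestor,
		     const struct user_ns_model *child)
{
	const struct user_ns_model *ns;

	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}
```

With this model, a task in a namespace nested below the one used at mount
time passes the check, while a task in the parent or in a sibling
namespace does not, which is why a name along the lines of "under" would
describe the behaviour more precisely than "in".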

>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
@ 2017-12-23  3:50   ` Serge E. Hallyn
       [not found]   ` <d055925e5d5c0099e9e9c871004fb45fab67e4bc.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:50 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Serge Hallyn

On Fri, Dec 22, 2017 at 03:32:33PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/fuse/dir.c           | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>  	const struct cred *cred;
>  
>  	if (fc->allow_other)
> -		return 1;
> +		return current_in_userns(fc->user_ns);
>  
>  	cred = current_cred();
>  	if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
>  {
>  	return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);

I have to say I'm not happy with this name.  I wish it had been
called current_under_userns or something to indicate it may also
be in a child.

>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 10/11] fuse: Allow user namespace mounts
  2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
@ 2017-12-23  3:51       ` Serge E. Hallyn
  2018-02-14 13:44   ` Miklos Szeredi
  1 sibling, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:51 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 22, 2017 at 03:32:34PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
> 
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> [dongsu: add a simple commit message]
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>

Reviewed-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>

> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "fuse",
> -	.fs_flags	= FS_HAS_SUBTYPE,
> +	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  	.mount		= fuse_mount,
>  	.kill_sb	= fuse_kill_sb_anon,
>  };
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>  	.name		= "fuseblk",
>  	.mount		= fuse_mount_blk,
>  	.kill_sb	= fuse_kill_sb_blk,
> -	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> +	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");
>  
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 10/11] fuse: Allow user namespace mounts
@ 2017-12-23  3:51       ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  3:51 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, Miklos Szeredi, containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 03:32:34PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
> 
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
> 
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> [dongsu: add a simple commit message]
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "fuse",
> -	.fs_flags	= FS_HAS_SUBTYPE,
> +	.fs_flags	= FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  	.mount		= fuse_mount,
>  	.kill_sb	= fuse_kill_sb_anon,
>  };
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>  	.name		= "fuseblk",
>  	.mount		= fuse_mount_blk,
>  	.kill_sb	= fuse_kill_sb_blk,
> -	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> +	.fs_flags	= FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");
>  
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts
  2017-12-22 14:32   ` Dongsu Park
  (?)
@ 2017-12-23  4:03       ` Serge E. Hallyn
  -1 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  4:03 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA, Alban Crequy,
	Eric W . Biederman, James Morris, Sargun Dhillon,
	linux-integrity-u79uwXL29TY76Z2rM5mHXA, Mimi Zohar

On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> The kernel should not calculate new hmacs for mounts done by
> non-root users. Update evm_calc_hmac_or_hash() to refuse to
> calculate new hmacs for mounts for non-init user namespaces.
> 
> Cc: linux-integrity-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: James Morris <james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Cc: Mimi Zohar <zohar-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Hi Mimi,

does this change seem sufficient to you?

> Cc: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
> ---
>  security/integrity/evm/evm_crypto.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> index bcd64baf..729f4545 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
>  	int error;
>  	int size;
>  
> -	if (!(inode->i_opflags & IOP_XATTR))
> +	if (!(inode->i_opflags & IOP_XATTR) ||
> +	    inode->i_sb->s_user_ns != &init_user_ns)
>  		return -EOPNOTSUPP;
>  
>  	desc = init_desc(type);
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts
@ 2017-12-23  4:03       ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  4:03 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-integrity,
	linux-security-module, James Morris, Mimi Zohar, Serge E. Hallyn

On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> The kernel should not calculate new hmacs for mounts done by
> non-root users. Update evm_calc_hmac_or_hash() to refuse to
> calculate new hmacs for mounts for non-init user namespaces.
> 
> Cc: linux-integrity@vger.kernel.org
> Cc: linux-security-module@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: James Morris <james.l.morris@oracle.com>
> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>

Hi Mimi,

does this change seem sufficient to you?

> Cc: "Serge E. Hallyn" <serge@hallyn.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  security/integrity/evm/evm_crypto.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> index bcd64baf..729f4545 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
>  	int error;
>  	int size;
>  
> -	if (!(inode->i_opflags & IOP_XATTR))
> +	if (!(inode->i_opflags & IOP_XATTR) ||
> +	    inode->i_sb->s_user_ns != &init_user_ns)
>  		return -EOPNOTSUPP;
>  
>  	desc = init_desc(type);
> -- 
> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH 11/11] evm: Don't update hmacs in user ns mounts
@ 2017-12-23  4:03       ` Serge E. Hallyn
  0 siblings, 0 replies; 219+ messages in thread
From: Serge E. Hallyn @ 2017-12-23  4:03 UTC (permalink / raw)
  To: linux-security-module

On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
> 
> The kernel should not calculate new hmacs for mounts done by
> non-root users. Update evm_calc_hmac_or_hash() to refuse to
> calculate new hmacs for mounts for non-init user namespaces.
> 
> Cc: linux-integrity at vger.kernel.org
> Cc: linux-security-module at vger.kernel.org
> Cc: linux-kernel at vger.kernel.org
> Cc: James Morris <james.l.morris@oracle.com>
> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>

Hi Mimi,

does this change seem sufficient to you?

> Cc: "Serge E. Hallyn" <serge@hallyn.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  security/integrity/evm/evm_crypto.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> index bcd64baf..729f4545 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
>  	int error;
>  	int size;
>  
> -	if (!(inode->i_opflags & IOP_XATTR))
> +	if (!(inode->i_opflags & IOP_XATTR) ||
> +	    inode->i_sb->s_user_ns != &init_user_ns)
>  		return -EOPNOTSUPP;
>  
>  	desc = init_desc(type);
> -- 
> 2.13.6
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
       [not found]     ` <17fbec10-68b1-2d2b-d417-2cdfee22b0fa-53JG2FQvpdo@public.gmane.org>
@ 2017-12-23 12:00       ` Dongsu Park
  0 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-23 12:00 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-bcache-u79uwXL29TY76Z2rM5mHXA, Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Alban Crequy,
	Eric W . Biederman, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Jan Kara, Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Alexander Viro

Hi,

On Fri, Dec 22, 2017 at 7:59 PM, Coly Li <i@coly.li> wrote:
> On 22/12/2017 10:32 PM, Dongsu Park wrote:
> Hi Dongsu,
>
> Could you please use a macro like NO_PERMISSION_CHECK in place of the
> hard-coded 0? Then I would not need to look up what 0 means in the
> new lookup_bdev().

I see. I'll do that.
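
Roughly along these lines, as a userspace sketch only. Apart from the
NO_PERMISSION_CHECK name suggested above, everything here is hypothetical:
model_lookup_bdev() and the MODEL_MAY_* bits are stand-ins, and the
sketch assumes that a mask of 0 means "skip the inode permission check".

```c
#include <errno.h>

/* Named constant so callers do not pass a bare 0. */
#define NO_PERMISSION_CHECK	0

/* Illustrative permission bits, not the kernel's MAY_* definitions. */
#define MODEL_MAY_WRITE		0x2
#define MODEL_MAY_READ		0x4

/*
 * Toy model of a lookup that optionally checks permissions:
 * granted is what the caller holds on the device inode,
 * mask is what the lookup is asked to verify.
 */
int model_lookup_bdev(int granted, int mask)
{
	if (mask == NO_PERMISSION_CHECK)
		return 0;	/* caller asked for no check at all */
	return (granted & mask) == mask ? 0 : -EACCES;
}
```

A call site such as the bcache one would then read
lookup_bdev(strim(path), NO_PERMISSION_CHECK), which states the intent
without needing to look at the function body.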

Thanks,
Dongsu

> Thanks.
>
> Coly Li
>
>> ---
>>  drivers/md/bcache/super.c |  2 +-
>>  drivers/md/dm-table.c     |  2 +-
>>  drivers/mtd/mtdsuper.c    |  2 +-
>>  fs/block_dev.c            | 13 ++++++++++---
>>  fs/quota/quota.c          |  2 +-
>>  include/linux/fs.h        |  2 +-
>>  6 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index b4d28928..acc9d56c 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>>                                 sb);
>>       if (IS_ERR(bdev)) {
>>               if (bdev == ERR_PTR(-EBUSY)) {
>> -                     bdev = lookup_bdev(strim(path));
>> +                     bdev = lookup_bdev(strim(path), 0);
>>                       mutex_lock(&bch_register_lock);
>>                       if (!IS_ERR(bdev) && bch_is_open(bdev))
>>                               err = "device already registered";
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 88130b5d..bca5eaf4 100644
> [snip]
>
>
> --
> Coly Li

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()
  2017-12-22 18:59   ` Coly Li
@ 2017-12-23 12:00     ` Dongsu Park
       [not found]     ` <17fbec10-68b1-2d2b-d417-2cdfee22b0fa-53JG2FQvpdo@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-23 12:00 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, dm-devel,
	linux-bcache, linux-fsdevel, linux-mtd, Alexander Viro, Jan Kara,
	Serge Hallyn

Hi,

On Fri, Dec 22, 2017 at 7:59 PM, Coly Li <i@coly.li> wrote:
> On 22/12/2017 10:32 PM, Dongsu Park wrote:
> Hi Dongsu,
>
> Could you please use a macro like NO_PERMISSION_CHECK in place of the
> hard-coded 0? Then I would not need to look up what 0 means in the
> new lookup_bdev().

I see. I'll do that.

Thanks,
Dongsu

> Thanks.
>
> Coly Li
>
>> ---
>>  drivers/md/bcache/super.c |  2 +-
>>  drivers/md/dm-table.c     |  2 +-
>>  drivers/mtd/mtdsuper.c    |  2 +-
>>  fs/block_dev.c            | 13 ++++++++++---
>>  fs/quota/quota.c          |  2 +-
>>  include/linux/fs.h        |  2 +-
>>  6 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index b4d28928..acc9d56c 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>>                                 sb);
>>       if (IS_ERR(bdev)) {
>>               if (bdev == ERR_PTR(-EBUSY)) {
>> -                     bdev = lookup_bdev(strim(path));
>> +                     bdev = lookup_bdev(strim(path), 0);
>>                       mutex_lock(&bch_register_lock);
>>                       if (!IS_ERR(bdev) && bch_is_open(bdev))
>>                               err = "device already registered";
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 88130b5d..bca5eaf4 100644
> [snip]
>
>
> --
> Coly Li

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
  2017-12-22 21:06     ` Richard Weinberger
@ 2017-12-23 12:18           ` Dongsu Park
  0 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-23 12:18 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Miklos Szeredi, Linux Containers, LKML, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi,

On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
<richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Dongsu,
>
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
>> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>>
>> Unprivileged users should not be able to mount mtd block devices
>> when they lack sufficient privileges towards the block device
>> inode.  Update mount_mtd() to validate that the user has the
>> required access to the inode at the specified path. The check
>> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
>> continue working as before.
>
> What is the big picture of this?
> Can in future an unprivileged user just mount UBIFS?

I'm not sure I'm aware of all use cases w.r.t. MTD and UBIFS.
As I understand it, many container runtimes (docker, lxc, runc,
bubblewrap, etc.) now allow unprivileged users to run containers.
That's why the kernel needs additional permission checks that might
not have been necessary in the past.
This MTD patch is one of those cases.

> Please note that UBIFS sits on top of a character device and not a block device.

Aha, good to know.

Thanks,
Dongsu

> --
> Thanks,
> //richard

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
@ 2017-12-23 12:18           ` Dongsu Park
  0 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-23 12:18 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: LKML, Miklos Szeredi, Linux Containers, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon, linux-mtd

Hi,

On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> Dongsu,
>
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> Unprivileged users should not be able to mount mtd block devices
>> when they lack sufficient privileges towards the block device
>> inode.  Update mount_mtd() to validate that the user has the
>> required access to the inode at the specified path. The check
>> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
>> continue working as before.
>
> What is the big picture of this?
> Can in future an unprivileged user just mount UBIFS?

I'm not sure I'm aware of all use cases w.r.t. MTD and UBIFS.
As I understand it, many container runtimes (docker, lxc, runc,
bubblewrap, etc.) now allow unprivileged users to run containers.
That's why the kernel needs additional permission checks that might
not have been necessary in the past.
This MTD patch is one of those cases.

> Please note that UBIFS sits on top of a character device and not a block device.

Aha, good to know.

Thanks,
Dongsu

> --
> Thanks,
> //richard

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
       [not found]     ` <20171223032606.GD6837-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2017-12-23 12:38       ` Dongsu Park
  0 siblings, 0 replies; 219+ messages in thread
From: Dongsu Park @ 2017-12-23 12:38 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Miklos Szeredi, Linux Containers, LKML, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro

Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong.  Expanding a check brings to my mind making
> it stricter, not looser.  How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not the  inode->i_sb->s_user_ns

Ah. I see.
I suppose it was changed to pick up the privileged_wrt_inode_uidgid()
check that capable_wrt_inode_uidgid() performs. But as you pointed out,
that checks against current_user_ns, which is wrong here. I would just
create another wrapper like capable_userns_wrt_inode_uidgid(), which
takes an additional (struct user_namespace *) parameter, so that both
ns_capable() and privileged_wrt_inode_uidgid() are checked against the
given namespace.
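
As a rough userspace sketch of what I have in mind (not kernel code): the
ns_model and inode_model types and all model_* helpers are invented for
illustration, the capability test is reduced to a bitmask lookup, and the
id mapping is assumed to be a single contiguous range.

```c
#include <stdbool.h>

/* Toy stand-in for a user namespace as seen by the calling task. */
struct ns_model {
	unsigned int first_id;		/* first kernel-side id mapped into the ns */
	unsigned int count;		/* number of ids mapped */
	unsigned int caller_caps;	/* capabilities the caller holds in this ns */
};

/* Toy stand-in for the inode fields the check looks at. */
struct inode_model {
	unsigned int uid;
	unsigned int gid;
};

#define MODEL_CAP_FSETID	(1u << 4)	/* illustrative bit only */

/* Models ns_capable(): does the caller hold cap in this namespace? */
bool model_ns_capable(const struct ns_model *ns, unsigned int cap)
{
	return (ns->caller_caps & cap) != 0;
}

/* Models kuid/kgid_has_mapping(): is the id inside the mapped range? */
bool model_id_has_mapping(const struct ns_model *ns, unsigned int id)
{
	return id >= ns->first_id && id - ns->first_id < ns->count;
}

/* Models privileged_wrt_inode_uidgid(): owner uid and gid both map. */
bool model_privileged_wrt_inode_uidgid(const struct ns_model *ns,
				       const struct inode_model *inode)
{
	return model_id_has_mapping(ns, inode->uid) &&
	       model_id_has_mapping(ns, inode->gid);
}

/*
 * The proposed wrapper: both checks run against the namespace the
 * caller passes in (for should_remove_suid() that would be
 * inode->i_sb->s_user_ns) rather than the current task's namespace.
 */
bool model_capable_userns_wrt_inode_uidgid(const struct ns_model *ns,
					    const struct inode_model *inode,
					    unsigned int cap)
{
	return model_ns_capable(ns, cap) &&
	       model_privileged_wrt_inode_uidgid(ns, inode);
}
```

The point of the extra parameter is that the caller chooses the
namespace explicitly, so the decision no longer depends on which
namespace the current task happens to be in.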

Thanks,
Dongsu

>> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
>> Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
>> ---
>>  fs/inode.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>>   */
>>  int should_remove_suid(struct dentry *dentry)
>>  {
>> -     umode_t mode = d_inode(dentry)->i_mode;
>> +     struct inode *inode = d_inode(dentry);
>> +     umode_t mode = inode->i_mode;
>>       int kill = 0;
>>
>>       /* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>>       if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>>               kill |= ATTR_KILL_SGID;
>>
>> -     if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> +     if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> +                  S_ISREG(mode)))
>>               return kill;
>>
>>       return 0;
>> --
>> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
  2017-12-23  3:26   ` Serge E. Hallyn
@ 2017-12-23 12:38     ` Dongsu Park
       [not found]       ` <CANxcAMtpE05xpOPt3Ua+4DkiTzkW5hOo4BBpiNZh_5+RTCfThA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]     ` <20171223032606.GD6837-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  1 sibling, 1 reply; 219+ messages in thread
From: Dongsu Park @ 2017-12-23 12:38 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: LKML, Linux Containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro

Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong.  Expanding a check brings to my mind making
> it stricter, not looser.  How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not the  inode->i_sb->s_user_ns

Ah. I see.
I suppose it was changed to pick up the privileged_wrt_inode_uidgid()
check that capable_wrt_inode_uidgid() performs. But as you pointed out,
that checks against current_user_ns, which is wrong here. I would just
create another wrapper like capable_userns_wrt_inode_uidgid(), which
takes an additional (struct user_namespace *) parameter, so that both
ns_capable() and privileged_wrt_inode_uidgid() are checked against the
given namespace.

Thanks,
Dongsu

>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Serge Hallyn <serge@hallyn.com>
>> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
>> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
>> ---
>>  fs/inode.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>>   */
>>  int should_remove_suid(struct dentry *dentry)
>>  {
>> -     umode_t mode = d_inode(dentry)->i_mode;
>> +     struct inode *inode = d_inode(dentry);
>> +     umode_t mode = inode->i_mode;
>>       int kill = 0;
>>
>>       /* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>>       if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>>               kill |= ATTR_KILL_SGID;
>>
>> -     if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> +     if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> +                  S_ISREG(mode)))
>>               return kill;
>>
>>       return 0;
>> --
>> 2.13.6

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
       [not found]           ` <CANxcAMtVqgLmQaTtfJocGGgsn5dSX2CDwzh6bwv6OnjUUwsTrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-23 12:56             ` Richard Weinberger
  0 siblings, 0 replies; 219+ messages in thread
From: Richard Weinberger @ 2017-12-23 12:56 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi, Linux Containers, LKML, Seth Forshee,
	Alban Crequy, Eric W . Biederman, Sargun Dhillon,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Dongsu,

On Saturday, 23 December 2017, 13:18:30 CET, Dongsu Park wrote:
> Hi,
> 
> On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
> 
> <richard.weinberger-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Dongsu,
> > 
> > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
> >> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> >> 
> >> Unprivileged users should not be able to mount mtd block devices
> >> when they lack sufficient privileges towards the block device
> >> inode.  Update mount_mtd() to validate that the user has the
> >> required access to the inode at the specified path. The check
> >> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> >> continue working as before.
> > 
> > What is the big picture of this?
> > Can in future an unprivileged user just mount UBIFS?
> 
> I'm not sure I'm aware of all use cases w.r.t mtd & ubifs.
> To my understanding, in these days many container runtimes allow
> unprivileged users to run containers. (docker, lxc, runc, bubblewrap, etc)
> That's why the kernel should deal with additional permission checks
> that might have not been necessary in the past.
> This MTD patch is one of those special cases.

My fear is that a corner case is forgotten and all of a sudden someone can do 
funky things with MTD in a container...

Thanks,
//richard

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting
  2017-12-23 12:18           ` Dongsu Park
  (?)
  (?)
@ 2017-12-23 12:56           ` Richard Weinberger
  -1 siblings, 0 replies; 219+ messages in thread
From: Richard Weinberger @ 2017-12-23 12:56 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Richard Weinberger, LKML, Miklos Szeredi, Linux Containers,
	Seth Forshee, Alban Crequy, Eric W . Biederman, Sargun Dhillon,
	linux-mtd

Dongsu,

On Saturday, 23 December 2017, 13:18:30 CET, Dongsu Park wrote:
> Hi,
> 
> On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
> <richard.weinberger@gmail.com> wrote:
> > Dongsu,
> > 
> > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> >> From: Seth Forshee <seth.forshee@canonical.com>
> >> 
> >> Unprivileged users should not be able to mount mtd block devices
> >> when they lack sufficient privileges towards the block device
> >> inode.  Update mount_mtd() to validate that the user has the
> >> required access to the inode at the specified path. The check
> >> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> >> continue working as before.
> > 
> > What is the big picture here?
> > Will an unprivileged user be able to just mount UBIFS in the future?
> 
> I'm not sure I'm aware of all use cases w.r.t. mtd & ubifs.
> To my understanding, many container runtimes these days allow
> unprivileged users to run containers (docker, lxc, runc, bubblewrap, etc.).
> That's why the kernel should deal with additional permission checks
> that might not have been necessary in the past.
> This MTD patch is one of those special cases.

My fear is that a corner case is forgotten and all of a sudden someone can do 
funky things with MTD in a container...

Thanks,
//richard

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts
  2017-12-23  4:03       ` Serge E. Hallyn
  (?)
@ 2017-12-24  5:12         ` Mimi Zohar
  -1 siblings, 0 replies; 219+ messages in thread
From: Mimi Zohar @ 2017-12-24  5:12 UTC (permalink / raw)
  To: Serge E. Hallyn, Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-integrity,
	linux-security-module, James Morris

Hi Serge,

On Fri, 2017-12-22 at 22:03 -0600, Serge E. Hallyn wrote:
> On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> > From: Seth Forshee <seth.forshee@canonical.com>
> > 
> > The kernel should not calculate new hmacs for mounts done by
> > non-root users. Update evm_calc_hmac_or_hash() to refuse to
> > calculate new hmacs for mounts for non-init user namespaces.
> > 
> > Cc: linux-integrity@vger.kernel.org
> > Cc: linux-security-module@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: James Morris <james.l.morris@oracle.com>
> > Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
> 
> Hi Mimi,
> 
> does this change seem sufficient to you?

I think this is the correct behavior in the context of fuse file
systems.  This patch, the "ima: define a new policy option named
force" patch, and an updated IMA policy should be upstreamed together.
 The cover letter should provide the motivation for these patches.

Mimi

> 
> > Cc: "Serge E. Hallyn" <serge@hallyn.com>
> > Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> > Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> > ---
> >  security/integrity/evm/evm_crypto.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> > index bcd64baf..729f4545 100644
> > --- a/security/integrity/evm/evm_crypto.c
> > +++ b/security/integrity/evm/evm_crypto.c
> > @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
> >  	int error;
> >  	int size;
> >  
> > -	if (!(inode->i_opflags & IOP_XATTR))
> > +	if (!(inode->i_opflags & IOP_XATTR) ||
> > +	    inode->i_sb->s_user_ns != &init_user_ns)
> >  		return -EOPNOTSUPP;
> >  
> >  	desc = init_desc(type);
> > -- 
> > 2.13.6
> 

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts
  2017-12-24  5:12         ` Mimi Zohar
  (?)
@ 2017-12-24  5:56           ` Mimi Zohar
  -1 siblings, 0 replies; 219+ messages in thread
From: Mimi Zohar @ 2017-12-24  5:56 UTC (permalink / raw)
  To: Serge E. Hallyn, Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-integrity,
	linux-security-module, James Morris

On Sun, 2017-12-24 at 00:12 -0500, Mimi Zohar wrote:
> Hi Serge,
> 
> On Fri, 2017-12-22 at 22:03 -0600, Serge E. Hallyn wrote:
> > On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> > > From: Seth Forshee <seth.forshee@canonical.com>
> > > 
> > > The kernel should not calculate new hmacs for mounts done by
> > > non-root users. Update evm_calc_hmac_or_hash() to refuse to
> > > calculate new hmacs for mounts for non-init user namespaces.
> > > 
> > > Cc: linux-integrity@vger.kernel.org
> > > Cc: linux-security-module@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: James Morris <james.l.morris@oracle.com>
> > > Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
> > 
> > Hi Mimi,
> > 
> > does this change seem sufficient to you?
> 
> I think this is the correct behavior in the context of fuse file
> systems.  This patch, the "ima: define a new policy option named
> force" patch, and an updated IMA policy should be upstreamed together.
>  The cover letter should provide the motivation for these patches.

Ah, this patch is being upstreamed with the fuse mounts patches.  I
guess Seth is planning on posting the IMA policy changes for fuse
separately.

Mimi

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
  2017-12-22 14:32 [PATCH v5 00/11] FUSE mounts from non-init user namespaces Dongsu Park
                   ` (8 preceding siblings ...)
  2017-12-22 14:32   ` Dongsu Park
@ 2017-12-25  7:05 ` Eric W. Biederman
       [not found]   ` <877etbcmnd.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2018-01-09 15:05   ` Dongsu Park
  9 siblings, 2 replies; 219+ messages in thread
From: Eric W. Biederman @ 2017-12-25  7:05 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon

Dongsu Park <dongsu@kinvolk.io> writes:

> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
> The latest patchset was v4:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html
>
> At the moment, filesystems backed by physical medium can only be mounted
> by real root in the initial user namespace. This restriction exists
> because if it's allowed for root user in non-init user namespaces to
> mount the filesystem, then it effectively allows the user to control the
> underlying source of the filesystem. In case of FUSE, the source would
> mean any underlying device.
>
> However, in many use cases such as containers, it's necessary to allow
> filesystems to be mounted from non-init user namespaces. The goal of this
> patchset is to allow FUSE filesystems to be mounted from non-init user
> namespaces. Support for other filesystems like ext4 is not in the
> scope of this patchset.
>
> Let me describe how to test mounting from non-init user namespaces. It's
> assumed that tests are done via sshfs, a userspace filesystem based on
> FUSE with ssh as backend. Testing system is Fedora 27.

In general I am for this work, and more bodies and more eyes on it is
generally better.

I will review this after the New Year, I am out for the holidays right
now.

Eric


>
> ====
> $ sudo dnf install -y sshfs
> $ sudo mkdir -p /mnt/userns
>
> ### workaround to get the sshfs permission checks
> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>
> $ unshare -U -r -m
> # sshfs root@localhost: /mnt/userns
>
> ### You can see sshfs being mounted from a non-init user namespace
> # mount | grep sshfs
> root@localhost: on /mnt/userns type fuse.sshfs
> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>
> # touch /mnt/userns/test
> # ls -l /mnt/userns/test
> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
> ====
>
> Open another terminal, check the mountpoint from outside the namespace.
>
> ====
> $ grep userns /proc/$(pidof sshfs)/mountinfo
> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
> root@localhost: rw,user_id=0,group_id=0
> ====
>
> After all tests are done, you can unmount the filesystem
> inside the namespace.
>
> ====
> # fusermount -u /mnt/userns
> ====
>
> Changes since v4:
>  * Remove other parts like ext4 to keep the patchset minimal for FUSE
>  * Add and change commit messages
>  * Describe how to test non-init user namespaces
>
> TODO:
>  * Think through potential security implications. There are 2 patches
>    being prepared for security issues. One is "ima: define a new policy
>    option named force" by Mimi Zohar, which adds an option to specify
>    that the results should not be cached:
>    https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>    The other one is to basically prevent FUSE results from being cached,
>    which is still in progress.
>
>  * Test IMA/LSMs. Details are written in
>    https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>
> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
> additional inode permission.
>
> Patches 3-7 allow the superblock owner to change ownership of inodes, and
> deal with additional capability checks w.r.t user namespaces.
>
> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
> user namespace.
>
> Patch 11 handles a corner case of non-root users in EVM.
>
> The patchset is also available in our github repo:
>   https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>
>
> Eric W. Biederman (1):
>   fs: Allow superblock owner to change ownership of inodes
>
> Seth Forshee (10):
>   block_dev: Support checking inode permissions in lookup_bdev()
>   mtd: Check permissions towards mtd block device inode when mounting
>   fs: Don't remove suid for CAP_FSETID for userns root
>   fs: Allow superblock owner to access do_remount_sb()
>   capabilities: Allow privileged user in s_user_ns to set security.*
>     xattrs
>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>   fuse: Support fuse filesystems outside of init_user_ns
>   fuse: Restrict allow_other to the superblock's namespace or a
>     descendant
>   fuse: Allow user namespace mounts
>   evm: Don't update hmacs in user ns mounts
>
>  drivers/md/bcache/super.c           |  2 +-
>  drivers/md/dm-table.c               |  2 +-
>  drivers/mtd/mtdsuper.c              |  6 +++++-
>  fs/attr.c                           | 34 ++++++++++++++++++++++++++--------
>  fs/block_dev.c                      | 13 ++++++++++---
>  fs/fuse/cuse.c                      |  3 ++-
>  fs/fuse/dev.c                       | 11 ++++++++---
>  fs/fuse/dir.c                       | 16 ++++++++--------
>  fs/fuse/fuse_i.h                    |  6 +++++-
>  fs/fuse/inode.c                     | 35 +++++++++++++++++++++--------------
>  fs/inode.c                          |  6 ++++--
>  fs/ioctl.c                          |  4 ++--
>  fs/namespace.c                      |  4 ++--
>  fs/proc/base.c                      |  7 +++++++
>  fs/proc/generic.c                   |  7 +++++++
>  fs/proc/proc_sysctl.c               |  7 +++++++
>  fs/quota/quota.c                    |  2 +-
>  include/linux/fs.h                  |  2 +-
>  kernel/user_namespace.c             |  1 +
>  security/commoncap.c                |  8 ++++++--
>  security/integrity/evm/evm_crypto.c |  3 ++-
>  21 files changed, 127 insertions(+), 52 deletions(-)

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
       [not found]   ` <ac3d34002d7690f6ca5928b57b7fc4d707104b04.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2018-01-05 19:24   ` Luis R. Rodriguez
  2018-01-09 15:10     ` Dongsu Park
       [not found]     ` <20180105192407.GF22430-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>
  2018-02-13 13:18   ` Miklos Szeredi
  2 siblings, 2 replies; 219+ messages in thread
From: Luis R. Rodriguez @ 2018-01-05 19:24 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Luis R. Rodriguez, Kees Cook

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
>  #include <linux/evm.h>
>  #include <linux/ima.h>
>  
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    uid_eq(uid, inode->i_uid))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> +	if (uid_eq(current_fsuid(), inode->i_uid) &&
> +	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> +		return true;
> +	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +		return true;
> +	if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * setattr_prepare - check if attribute changes to a dentry are allowed
>   * @dentry:	dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>  		goto kill_priv;
>  
>  	/* Make sure a caller can chown. */
> -	if ((ia_valid & ATTR_UID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	     !uid_eq(attr->ia_uid, inode->i_uid)) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>  		return -EPERM;

I think this patch would read much better and be easier to review if it were
split up by first adding the helpers, and then extending them afterwards.
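
As a rough illustration of that split (a userspace sketch, not kernel code):
the first patch would only move the existing condition into a helper, and the
second would add the s_user_ns check. The model below replaces the kernel
predicates with plain booleans. The names old_chown_denied(),
chown_ok_step1(), chown_ok_step2() and step1_is_pure_refactor() are
hypothetical and exist only for this example.

```c
#include <stdbool.h>

/* Old inline test from setattr_prepare(): true means "return -EPERM".
 * is_owner       ~ uid_eq(current_fsuid(), inode->i_uid)
 * same_uid       ~ uid_eq(attr->ia_uid, inode->i_uid)
 * cap_wrt_inode  ~ capable_wrt_inode_uidgid(inode, CAP_CHOWN)
 */
static bool old_chown_denied(bool is_owner, bool same_uid, bool cap_wrt_inode)
{
	return (!is_owner || !same_uid) && !cap_wrt_inode;
}

/* Step 1: pure refactor, the helper only restates the old condition. */
static bool chown_ok_step1(bool is_owner, bool same_uid, bool cap_wrt_inode)
{
	if (is_owner && same_uid)
		return true;
	if (cap_wrt_inode)
		return true;
	return false;
}

/* Step 2: the behavioural change, modelling the added
 * ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN) line.
 */
static bool chown_ok_step2(bool is_owner, bool same_uid, bool cap_wrt_inode,
			   bool cap_in_s_user_ns)
{
	if (chown_ok_step1(is_owner, same_uid, cap_wrt_inode))
		return true;
	return cap_in_s_user_ns;
}

/* Exhaustively compare step 1 against the old condition (8 input combos). */
static bool step1_is_pure_refactor(void)
{
	for (int m = 0; m < 8; m++) {
		bool o = (m & 1) != 0, s = (m & 2) != 0, c = (m & 4) != 0;

		if (chown_ok_step1(o, s, c) != !old_chown_denied(o, s, c))
			return false;
	}
	return true;
}
```

If step1_is_pure_refactor() holds, the first patch changes no behaviour, and
reviewers only need to reason about the single added check in the second.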

>  
>  	/* Make sure caller can chgrp. */
> -	if ((ia_valid & ATTR_GID) &&
> -	    (!uid_eq(current_fsuid(), inode->i_uid) ||
> -	    (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> -	    !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> +	if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>  		return -EPERM;
>  
>  	/* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	int error;
>  	struct inode *inode = d_inode(dentry);
> +	struct user_namespace *s_user_ns;
>  
>  	if (attr->ia_valid & ATTR_MODE)
>  		return -EPERM;
>  
> +	/* Don't let anyone mess with weird proc files */
> +	s_user_ns = inode->i_sb->s_user_ns;
> +	if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> +	    !kgid_has_mapping(s_user_ns, inode->i_gid))
> +		return -EPERM;
> +
>  	error = setattr_prepare(dentry, attr);
>  	if (error)
>  		return error;

Are we sure proc is the only special one? How was it first observed that this was
required for proc? Has anyone tried fuzzing by trying this op with a slew of other
filesystems on all files?

  Luis

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
  2017-12-25  7:05 ` [PATCH v5 00/11] FUSE mounts from non-init user namespaces Eric W. Biederman
       [not found]   ` <877etbcmnd.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-01-09 15:05   ` Dongsu Park
  2018-01-18 14:58     ` Alban Crequy
       [not found]     ` <CANxcAMvwwiPXBTKmTM9sEo8Y1T--V7fNaFqzHfyEvwvaYQV60A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Dongsu Park @ 2018-01-09 15:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux Containers, Alban Crequy, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon

Hi,

On Mon, Dec 25, 2017 at 8:05 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Dongsu Park <dongsu@kinvolk.io> writes:
>
>> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
>> The latest patchset was v4:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html
>>
>> At the moment, filesystems backed by physical medium can only be mounted
>> by real root in the initial user namespace. This restriction exists
>> because if it's allowed for root user in non-init user namespaces to
>> mount the filesystem, then it effectively allows the user to control the
>> underlying source of the filesystem. In case of FUSE, the source would
>> mean any underlying device.
>>
>> However, in many use cases such as containers, it's necessary to allow
>> filesystems to be mounted from non-init user namespaces. Goal of this
>> patchset is to allow FUSE filesystems to be mounted from non-init user
>> namespaces. Support for other filesystems like ext4 are not in the
>> scope of this patchset.
>>
>> Let me describe how to test mounting from non-init user namespaces. It's
>> assumed that tests are done via sshfs, a userspace filesystem based on
>> FUSE with ssh as backend. Testing system is Fedora 27.
>
> In general I am for this work, and more bodies and more eyes on it is
> generally better.
>
> I will review this after the New Year, I am out for the holidays right
> now.

Thanks. I'll wait for your review.

Dongsu

> Eric
>
>
>>
>> ====
>> $ sudo dnf install -y sshfs
>> $ sudo mkdir -p /mnt/userns
>>
>> ### workaround to get the sshfs permission checks
>> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>>
>> $ unshare -U -r -m
>> # sshfs root@localhost: /mnt/userns
>>
>> ### You can see sshfs being mounted from a non-init user namespace
>> # mount | grep sshfs
>> root@localhost: on /mnt/userns type fuse.sshfs
>> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>
>> # touch /mnt/userns/test
>> # ls -l /mnt/userns/test
>> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
>> ====
>>
>> Open another terminal, check the mountpoint from outside the namespace.
>>
>> ====
>> $ grep userns /proc/$(pidof sshfs)/mountinfo
>> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
>> root@localhost: rw,user_id=0,group_id=0
>> ====
>>
>> After all tests are done, you can unmount the filesystem
>> inside the namespace.
>>
>> ====
>> # fusermount -u /mnt/userns
>> ====
>>
>> Changes since v4:
>>  * Remove other parts like ext4 to keep the patchset minimal for FUSE
>>  * Add and change commit messages
>>  * Describe how to test non-init user namespaces
>>
>> TODO:
>>  * Think through potential security implications. There are 2 patches
>>    being prepared for security issues. One is "ima: define a new policy
>>    option named force" by Mimi Zohar, which adds an option to specify
>>    that the results should not be cached:
>>    https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>>    The other one is to basically prevent FUSE results from being cached,
>>    which is still in progress.
>>
>>  * Test IMA/LSMs. Details are written in
>>    https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>>
>> Patches 3-7 allow the superblock owner to change ownership of inodes, and
>> deal with additional capability checks w.r.t user namespaces.
>>
>> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
>> user namespace.
>>
>> Patch 11 handles a corner case of non-root users in EVM.
>>
>> The patchset is also available in our github repo:
>>   https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>>
>>
>> Eric W. Biederman (1):
>>   fs: Allow superblock owner to change ownership of inodes
>>
>> Seth Forshee (10):
>>   block_dev: Support checking inode permissions in lookup_bdev()
>>   mtd: Check permissions towards mtd block device inode when mounting
>>   fs: Don't remove suid for CAP_FSETID for userns root
>>   fs: Allow superblock owner to access do_remount_sb()
>>   capabilities: Allow privileged user in s_user_ns to set security.*
>>     xattrs
>>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>>   fuse: Support fuse filesystems outside of init_user_ns
>>   fuse: Restrict allow_other to the superblock's namespace or a
>>     descendant
>>   fuse: Allow user namespace mounts
>>   evm: Don't update hmacs in user ns mounts
>>
>>  drivers/md/bcache/super.c           |  2 +-
>>  drivers/md/dm-table.c               |  2 +-
>>  drivers/mtd/mtdsuper.c              |  6 +++++-
>>  fs/attr.c                           | 34 ++++++++++++++++++++++++++--------
>>  fs/block_dev.c                      | 13 ++++++++++---
>>  fs/fuse/cuse.c                      |  3 ++-
>>  fs/fuse/dev.c                       | 11 ++++++++---
>>  fs/fuse/dir.c                       | 16 ++++++++--------
>>  fs/fuse/fuse_i.h                    |  6 +++++-
>>  fs/fuse/inode.c                     | 35 +++++++++++++++++++++--------------
>>  fs/inode.c                          |  6 ++++--
>>  fs/ioctl.c                          |  4 ++--
>>  fs/namespace.c                      |  4 ++--
>>  fs/proc/base.c                      |  7 +++++++
>>  fs/proc/generic.c                   |  7 +++++++
>>  fs/proc/proc_sysctl.c               |  7 +++++++
>>  fs/quota/quota.c                    |  2 +-
>>  include/linux/fs.h                  |  2 +-
>>  kernel/user_namespace.c             |  1 +
>>  security/commoncap.c                |  8 ++++++--
>>  security/integrity/evm/evm_crypto.c |  3 ++-
>>  21 files changed, 127 insertions(+), 52 deletions(-)

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2018-01-05 19:24   ` Luis R. Rodriguez
@ 2018-01-09 15:10     ` Dongsu Park
  2018-01-09 17:23       ` Luis R. Rodriguez
       [not found]       ` <CANxcAMvDQFH0g5PPnVZ3p2Tei04N+8fNf0pk02DrfTkBHjjrPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]     ` <20180105192407.GF22430-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Dongsu Park @ 2018-01-09 15:10 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: LKML, Linux Containers, Alban Crequy, Eric W . Biederman,
	Miklos Szeredi, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro, Kees Cook

Hi,

On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 12ffdb6f..bf8e94f3 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -18,6 +18,30 @@
>>  #include <linux/evm.h>
>>  #include <linux/ima.h>
>>
>> +static bool chown_ok(const struct inode *inode, kuid_t uid)
>> +{
>> +     if (uid_eq(current_fsuid(), inode->i_uid) &&
>> +         uid_eq(uid, inode->i_uid))
>> +             return true;
>> +     if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +             return true;
>> +     if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> +             return true;
>> +     return false;
>> +}
>> +
>> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>> +{
>> +     if (uid_eq(current_fsuid(), inode->i_uid) &&
>> +         (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
>> +             return true;
>> +     if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +             return true;
>> +     if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> +             return true;
>> +     return false;
>> +}
>> +
>>  /**
>>   * setattr_prepare - check if attribute changes to a dentry are allowed
>>   * @dentry:  dentry to check
>> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>>               goto kill_priv;
>>
>>       /* Make sure a caller can chown. */
>> -     if ((ia_valid & ATTR_UID) &&
>> -         (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -          !uid_eq(attr->ia_uid, inode->i_uid)) &&
>> -         !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +     if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>>               return -EPERM;
>
> I think this patch would read much better and be easier to review if it were
> split up by first adding the helpers, and then extending them afterwards.

I'm fine with splitting it up into multiple patches, if the original author
Eric agrees.

>>       /* Make sure caller can chgrp. */
>> -     if ((ia_valid & ATTR_GID) &&
>> -         (!uid_eq(current_fsuid(), inode->i_uid) ||
>> -         (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
>> -         !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> +     if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>>               return -EPERM;
>>
>>       /* Make sure a caller can chmod. */
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 31934cb9..9d50ec92 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>       int error;
>>       struct inode *inode = d_inode(dentry);
>> +     struct user_namespace *s_user_ns;
>>
>>       if (attr->ia_valid & ATTR_MODE)
>>               return -EPERM;
>>
>> +     /* Don't let anyone mess with weird proc files */
>> +     s_user_ns = inode->i_sb->s_user_ns;
>> +     if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
>> +         !kgid_has_mapping(s_user_ns, inode->i_gid))
>> +             return -EPERM;
>> +
>>       error = setattr_prepare(dentry, attr);
>>       if (error)
>>               return error;
>
> Are we sure proc is the only special one? How was it first observed that this was
> required for proc? Has anyone tried fuzzing by trying this op with a slew of other
> filesystems on all files?

From my limited knowledge about procfs, I suppose that procfs is a little
different from ordinary filesystems. Procfs is not exactly namespaced, and
it has many inconsistencies. Some files under /proc should be owned by the
global root, regardless of user namespaces. That's why we need to handle such
special cases for proc. It has been like that historically, since the
beginning, so it's hard to change it fundamentally.
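
For readers unfamiliar with the check added to proc_setattr(), here is a
minimal userspace sketch of what kuid_has_mapping()/kgid_has_mapping()
decide, assuming the usual uid_map extent semantics (ID inside the
namespace, ID outside, length). struct id_extent and the model_* functions
are hypothetical names for this example, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one line of /proc/<pid>/uid_map or gid_map. */
struct id_extent {
	uint32_t first;       /* first ID as seen inside the namespace */
	uint32_t lower_first; /* first ID on the kernel (outside) side */
	uint32_t count;       /* length of the range */
};

#define INVALID_ID ((uint32_t)-1)

/* Translate a kernel-side ID into the namespace, or INVALID_ID if unmapped. */
static uint32_t model_from_kid(const struct id_extent *map, size_t n,
			       uint32_t kid)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return map[i].first + (kid - map[i].lower_first);
	}
	return INVALID_ID;
}

/* Models kuid_has_mapping()/kgid_has_mapping(): is the ID visible at all? */
static bool model_kid_has_mapping(const struct id_extent *map, size_t n,
				  uint32_t kid)
{
	return model_from_kid(map, n, kid) != INVALID_ID;
}

/* Same shape as the hunk in proc_setattr(): refuse inodes whose owner or
 * group has no mapping in the superblock's user namespace.
 */
static int model_setattr_allowed(const struct id_extent *uid_map, size_t nu,
				 const struct id_extent *gid_map, size_t ng,
				 uint32_t i_uid, uint32_t i_gid)
{
	if (!model_kid_has_mapping(uid_map, nu, i_uid) ||
	    !model_kid_has_mapping(gid_map, ng, i_gid))
		return -EPERM;
	return 0;
}
```

With a single extent "0 1000 1" (roughly what unshare -U -r sets up for UID
1000), an inode owned by kernel uid 0 has no mapping, so the sketch returns
-EPERM. That is the kind of global-root-owned /proc file mentioned above.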

However, you have good points. Other than procfs, there could be other
filesystems that have potential issues when relaxing privileges. The question is
how we can be sure that there are no hidden issues. From my understanding,
we could usually run test suites like LTP
(https://github.com/linux-test-project/ltp.git) to catch such regressions.
Today I ran the LTP tests for fs & containers with the patchset applied.
They seemed to work fine, without failures. Obviously that doesn't mean that it's
completely bug-free when we are talking about unknown issues.
Please let me know if there are other good ways to figure out potential issues.

Thanks,
Dongsu

>   Luis

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2018-01-09 15:10     ` Dongsu Park
@ 2018-01-09 17:23       ` Luis R. Rodriguez
       [not found]       ` <CANxcAMvDQFH0g5PPnVZ3p2Tei04N+8fNf0pk02DrfTkBHjjrPQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Luis R. Rodriguez @ 2018-01-09 17:23 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Luis R. Rodriguez, LKML, Linux Containers, Alban Crequy,
	Eric W . Biederman, Miklos Szeredi, Seth Forshee, Sargun Dhillon,
	linux-fsdevel, Alexander Viro, Kees Cook

On Tue, Jan 09, 2018 at 04:10:54PM +0100, Dongsu Park wrote:
> On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> > On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> > I think this patch would read much better and be easier to review if it were
> > split up by first adding the helpers, and then extending them afterwards.
> 
> I'm fine with splitting it up into multiple patches, if the original author
> Eric agrees.

Great.

> > Are we sure proc is the only special one? How was it observed first that this was
> > required for proc? Has anyone tried fuzzing by trying this op with a slew of other
> > filesystems on all files?
>
> Please let me know if there are other good ways to figure out potential issues.

I think the trick would be to create a test which mimics the issue and then try to
mount and run the test against as many filesystems as we support. So would developing
a test be possible here?

  Luis

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
       [not found]   ` <c85c293e19a478353aba8e6e3ee39e5914f798d5.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  2017-12-23  3:46       ` Serge E. Hallyn
@ 2018-01-17 10:59     ` Alban Crequy
  2018-02-12 15:57       ` Miklos Szeredi
  2018-02-20  2:12       ` Eric W. Biederman
  3 siblings, 0 replies; 219+ messages in thread
From: Alban Crequy @ 2018-01-17 10:59 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tom Gundersen, Seth Forshee,
	Tejun Heo, Eric W . Biederman, David Herrmann, Sargun Dhillon,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

[Adding Tejun, David, Tom for question about cuse]

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.

Was a use case discussed for using cuse in a new unprivileged userns?

I ran some tests yesterday with cusexmp [1] and I could add a new char
device as an unprivileged user with:

$ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
--maj=99 --min=30 --name=foo'

where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
Then, I could see the new device:

$ cat /proc/devices | grep foo
 99 foo

On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
seems dangerous if the dev node can be provided otherwise and if we
don't have a use case for it.

Thoughts?

[1] https://github.com/fuse4x/fuse/blob/master/example/cusexmp.c#L9

Cheers,
Alban


> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
>  #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>         if (!cc)
>                 return -ENOMEM;
>
> -       fuse_conn_init(&cc->fc);
> +       fuse_conn_init(&cc->fc, current_user_ns());
>
>         fud = fuse_dev_alloc(&cc->fc);
>         if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>
>         return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> +       if (task_active_pid_ns(current) != fc->pid_ns ||
> +           current_user_ns() != fc->user_ns) {
>                 rcu_read_lock();
>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>                 rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>         stat->ino = attr->ino;
>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         stat->nlink = attr->nlink;
> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>         stat->rdev = inode->i_rdev;
>         stat->atime.tv_sec = attr->atime;
>         stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>         return true;
>  }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -                          bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>         unsigned ivalid = iattr->ia_valid;
>
>         if (ivalid & ATTR_MODE)
>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>         if (ivalid & ATTR_UID)
> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>         if (ivalid & ATTR_GID)
> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>         if (ivalid & ATTR_SIZE)
>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>         if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
>         memset(&inarg, 0, sizeof(inarg));
>         memset(&outarg, 0, sizeof(outarg));
> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>         if (file) {
>                 struct fuse_file *ff = file->private_data;
>                 inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>         /** The pid namespace for this mount */
>         struct pid_namespace *pid_ns;
>
> +       /** The user namespace for this mount */
> +       struct user_namespace *user_ns;
> +
>         /** Maximum read size */
>         unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>         inode->i_ino     = fuse_squash_ino(attr->ino);
>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         set_nlink(inode, attr->nlink);
> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>         inode->i_blocks  = attr->blocks;
>         inode->i_atime.tv_sec   = attr->atime;
>         inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>         return err;
>  }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +                         struct user_namespace *user_ns)
>  {
>         char *p;
>         memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_USER_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->user_id = make_kuid(current_user_ns(), uv);
> +                       d->user_id = make_kuid(user_ns, uv);
>                         if (!uid_valid(d->user_id))
>                                 return 0;
>                         d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_GROUP_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->group_id = make_kgid(current_user_ns(), uv);
> +                       d->group_id = make_kgid(user_ns, uv);
>                         if (!gid_valid(d->group_id))
>                                 return 0;
>                         d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         struct super_block *sb = root->d_sb;
>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>         if (fc->default_permissions)
>                 seq_puts(m, ",default_permissions");
>         if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>         fpq->connected = 1;
>  }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>         memset(fc, 0, sizeof(*fc));
>         spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>         fc->attr_version = 1;
>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +       fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (fc->destroy_req)
>                         fuse_request_free(fc->destroy_req);
>                 put_pid_ns(fc->pid_ns);
> +               put_user_ns(fc->user_ns);
>                 fc->release(fc);
>         }
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> -       if (!parse_fuse_opt(data, &d, is_bdev))
> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>                 goto err;
>
>         if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!file)
>                 goto err;
>
> -       if ((file->f_op != &fuse_dev_operations) ||
> -           (file->f_cred->user_ns != &init_user_ns))
> +       /*
> +        * Require mount to happen from the same user namespace which
> +        * opened /dev/fuse to prevent potential attacks.
> +        */
> +       if (file->f_op != &fuse_dev_operations ||
> +           file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!fc)
>                 goto err_fput;
>
> -       fuse_conn_init(fc);
> +       fuse_conn_init(fc, sb->s_user_ns);
>         fc->release = fuse_free_conn;
>
>         fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2018-01-17 10:59   ` Alban Crequy
  2018-01-17 14:29     ` Seth Forshee
       [not found]     ` <CADZs7q5NA7Kox62vnCOkL=TGgzTxX+oNYz6=oNXKWkQkQwSMrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]   ` <c85c293e19a478353aba8e6e3ee39e5914f798d5.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Alban Crequy @ 2018-01-17 10:59 UTC (permalink / raw)
  To: Dongsu Park
  Cc: linux-kernel, containers, Eric W . Biederman, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

[Adding Tejun, David, Tom for question about cuse]

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
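
To make the translation described above concrete, here is a small userspace
model of it. This is only a sketch: the kernel's make_kuid()/from_kuid()
work on struct user_namespace, whereas the model_* functions and struct
id_map below are hypothetical names that mimic the behaviour with one
extent table in the format of /proc/<pid>/uid_map.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One extent, laid out like a line of /proc/<pid>/uid_map. */
struct id_extent {
	uint32_t ns_first;    /* first id as seen inside the namespace */
	uint32_t outer_first; /* first id it corresponds to outside */
	uint32_t count;       /* number of ids in the range */
};

/* Hypothetical stand-in for the id mapping of one user namespace. */
struct id_map {
	const struct id_extent *extents;
	size_t nr;
};

/* Namespace id -> global id, in the spirit of make_kuid(). */
static uint32_t model_make_id(const struct id_map *map, uint32_t ns_id)
{
	for (size_t i = 0; i < map->nr; i++) {
		const struct id_extent *e = &map->extents[i];

		if (ns_id >= e->ns_first && ns_id - e->ns_first < e->count)
			return e->outer_first + (ns_id - e->ns_first);
	}
	return INVALID_ID;	/* no mapping */
}

/* Global id -> namespace id, in the spirit of from_kuid(). */
static uint32_t model_from_id(const struct id_map *map, uint32_t global_id)
{
	for (size_t i = 0; i < map->nr; i++) {
		const struct id_extent *e = &map->extents[i];

		if (global_id >= e->outer_first &&
		    global_id - e->outer_first < e->count)
			return e->ns_first + (global_id - e->outer_first);
	}
	return INVALID_ID;	/* no mapping */
}

/*
 * Mirrors the check the patch adds to __fuse_get_req(): a caller whose
 * fsuid or fsgid has no mapping in the connection's namespace is refused.
 */
static int model_check_request(const struct id_map *uid_map,
			       const struct id_map *gid_map,
			       uint32_t fsuid, uint32_t fsgid)
{
	if (model_from_id(uid_map, fsuid) == INVALID_ID ||
	    model_from_id(gid_map, fsgid) == INVALID_ID)
		return -EOVERFLOW;
	return 0;
}
```

With a single extent { 0, 1000, 1 }, which is the kind of map "unshare -U -r"
would set up for a caller with uid 1000 (the value 1000 is an assumption),
global uid 1000 appears as 0 inside the namespace, while global uid 0 has no
mapping and model_check_request() returns -EOVERFLOW for it.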

Was a use case discussed for using cuse in a new unprivileged userns?

I ran some tests yesterday with cusexmp [1] and I could add a new char
device as an unprivileged user with:

$ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
--maj=99 --min=30 --name=foo'

where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
Then, I could see the new device:

$ cat /proc/devices | grep foo
 99 foo
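
The same check can be done from a test program. The sketch below looks up a
name in the "Character devices:" section of /proc/devices and returns its
major number; find_char_major() and match_device_line() are hypothetical
helper names. Unlike the grep above, it matches the whole name rather than
a substring.

```c
#include <stdio.h>
#include <string.h>

/* Parse one entry such as " 99 foo"; return the major, or -1 on no match. */
static int match_device_line(const char *line, const char *name)
{
	int major;
	char dev[64];

	if (sscanf(line, "%d %63s", &major, dev) != 2)
		return -1;
	return strcmp(dev, name) == 0 ? major : -1;
}

/* Scan the character device section of path (normally /proc/devices). */
static int find_char_major(const char *path, const char *name)
{
	FILE *f = fopen(path, "r");
	char line[128];
	int in_char = 0, major = -1;

	if (!f)
		return -1;
	while (major < 0 && fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Character devices:", 18) == 0)
			in_char = 1;
		else if (strncmp(line, "Block devices:", 14) == 0)
			in_char = 0;
		else if (in_char)
			major = match_device_line(line, name);
	}
	fclose(f);
	return major;
}
```

Calling find_char_major("/proc/devices", "foo") before and after starting
cusexmp would show whether the device was registered.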

On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
seems dangerous if the dev node can be provided otherwise and if we
don't have a use case for it.

Thoughts?

[1] https://github.com/fuse4x/fuse/blob/master/example/cusexmp.c#L9

Cheers,
Alban


> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
>  #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>         if (!cc)
>                 return -ENOMEM;
>
> -       fuse_conn_init(&cc->fc);
> +       fuse_conn_init(&cc->fc, current_user_ns());
>
>         fud = fuse_dev_alloc(&cc->fc);
>         if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>
>         return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> +       if (task_active_pid_ns(current) != fc->pid_ns ||
> +           current_user_ns() != fc->user_ns) {
>                 rcu_read_lock();
>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>                 rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>         stat->ino = attr->ino;
>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         stat->nlink = attr->nlink;
> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>         stat->rdev = inode->i_rdev;
>         stat->atime.tv_sec = attr->atime;
>         stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>         return true;
>  }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -                          bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>         unsigned ivalid = iattr->ia_valid;
>
>         if (ivalid & ATTR_MODE)
>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>         if (ivalid & ATTR_UID)
> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>         if (ivalid & ATTR_GID)
> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>         if (ivalid & ATTR_SIZE)
>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>         if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
>         memset(&inarg, 0, sizeof(inarg));
>         memset(&outarg, 0, sizeof(outarg));
> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>         if (file) {
>                 struct fuse_file *ff = file->private_data;
>                 inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>         /** The pid namespace for this mount */
>         struct pid_namespace *pid_ns;
>
> +       /** The user namespace for this mount */
> +       struct user_namespace *user_ns;
> +
>         /** Maximum read size */
>         unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>         inode->i_ino     = fuse_squash_ino(attr->ino);
>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         set_nlink(inode, attr->nlink);
> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>         inode->i_blocks  = attr->blocks;
>         inode->i_atime.tv_sec   = attr->atime;
>         inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>         return err;
>  }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +                         struct user_namespace *user_ns)
>  {
>         char *p;
>         memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_USER_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->user_id = make_kuid(current_user_ns(), uv);
> +                       d->user_id = make_kuid(user_ns, uv);
>                         if (!uid_valid(d->user_id))
>                                 return 0;
>                         d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_GROUP_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->group_id = make_kgid(current_user_ns(), uv);
> +                       d->group_id = make_kgid(user_ns, uv);
>                         if (!gid_valid(d->group_id))
>                                 return 0;
>                         d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         struct super_block *sb = root->d_sb;
>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>         if (fc->default_permissions)
>                 seq_puts(m, ",default_permissions");
>         if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>         fpq->connected = 1;
>  }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>         memset(fc, 0, sizeof(*fc));
>         spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>         fc->attr_version = 1;
>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +       fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (fc->destroy_req)
>                         fuse_request_free(fc->destroy_req);
>                 put_pid_ns(fc->pid_ns);
> +               put_user_ns(fc->user_ns);
>                 fc->release(fc);
>         }
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> -       if (!parse_fuse_opt(data, &d, is_bdev))
> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>                 goto err;
>
>         if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!file)
>                 goto err;
>
> -       if ((file->f_op != &fuse_dev_operations) ||
> -           (file->f_cred->user_ns != &init_user_ns))
> +       /*
> +        * Require mount to happen from the same user namespace which
> +        * opened /dev/fuse to prevent potential attacks.
> +        */
> +       if (file->f_op != &fuse_dev_operations ||
> +           file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!fc)
>                 goto err_fput;
>
> -       fuse_conn_init(fc);
> +       fuse_conn_init(fc, sb->s_user_ns);
>         fc->release = fuse_free_conn;
>
>         fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
       [not found]     ` <CADZs7q5NA7Kox62vnCOkL=TGgzTxX+oNYz6=oNXKWkQkQwSMrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-17 14:29       ` Seth Forshee
  0 siblings, 0 replies; 219+ messages in thread
From: Seth Forshee @ 2018-01-17 14:29 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tom Gundersen,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eric W . Biederman,
	David Herrmann, Sargun Dhillon, Tejun Heo

On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> [Adding Tejun, David, Tom for question about cuse]
> 
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
> > From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> >
> > In order to support mounts from namespaces other than
> > init_user_ns, fuse must translate uids and gids to/from the
> > userns of the process servicing requests on /dev/fuse. This
> > patch does that, with a couple of restrictions on the namespace:
> >
> >  - The userns for the fuse connection is fixed to the namespace
> >    from which /dev/fuse is opened.
> >
> >  - The namespace must be the same as s_user_ns.
> >
> > These restrictions simplify the implementation by avoiding the
> > need to pass around userns references and by allowing fuse to
> > rely on the checks in inode_change_ok for ownership changes.
> > Either restriction could be relaxed in the future if needed.
> >
> > For cuse the namespace used for the connection is also simply
> > current_user_ns() at the time /dev/cuse is opened.
> 
> Was a use case discussed for using cuse in a new unprivileged userns?
> 
> I ran some tests yesterday with cusexmp [1] and I could add a new char
> device as an unprivileged user with:
> 
> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> --maj=99 --min=30 --name=foo'
> 
> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> Then, I could see the new device:
> 
> $ cat /proc/devices | grep foo
>  99 foo
> 
> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> seems dangerous if the dev node can be provided otherwise and if we
> don't have a use case for it.
> 
> Thoughts?

I can't remember the specific reasons, but I had concluded that letting
unprivileged users use cuse within a user namespace isn't safe. But
having a cuse device node usable by regular users at all is equally
unsafe I suspect, so I don't think your example demonstrates any problem
specific to user namespaces. There shouldn't be any way to use a user
namespace to gain access permissions towards /dev/cuse, otherwise we
have bigger problems than cuse to worry about.

Seth

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 10:59   ` Alban Crequy
@ 2018-01-17 14:29     ` Seth Forshee
  2018-01-17 18:56       ` Alban Crequy
  2018-01-17 18:56       ` Alban Crequy
       [not found]     ` <CADZs7q5NA7Kox62vnCOkL=TGgzTxX+oNYz6=oNXKWkQkQwSMrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Seth Forshee @ 2018-01-17 14:29 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> [Adding Tejun, David, Tom for question about cuse]
> 
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> > From: Seth Forshee <seth.forshee@canonical.com>
> >
> > In order to support mounts from namespaces other than
> > init_user_ns, fuse must translate uids and gids to/from the
> > userns of the process servicing requests on /dev/fuse. This
> > patch does that, with a couple of restrictions on the namespace:
> >
> >  - The userns for the fuse connection is fixed to the namespace
> >    from which /dev/fuse is opened.
> >
> >  - The namespace must be the same as s_user_ns.
> >
> > These restrictions simplify the implementation by avoiding the
> > need to pass around userns references and by allowing fuse to
> > rely on the checks in inode_change_ok for ownership changes.
> > Either restriction could be relaxed in the future if needed.
> >
> > For cuse the namespace used for the connection is also simply
> > current_user_ns() at the time /dev/cuse is opened.
> 
> Was a use case discussed for using cuse in a new unprivileged userns?
> 
> I ran some tests yesterday with cusexmp [1] and I could add a new char
> device as an unprivileged user with:
> 
> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> --maj=99 --min=30 --name=foo'
> 
> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> Then, I could see the new device:
> 
> $ cat /proc/devices | grep foo
>  99 foo
> 
> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> seems dangerous if the dev node can be provided otherwise and if we
> don't have a use case for it.
> 
> Thoughts?

I can't remember the specific reasons, but I had concluded that letting
unprivileged users use cuse within a user namespace isn't safe. But
having a cuse device node usable by regular users at all is equally
unsafe I suspect, so I don't think your example demonstrates any problem
specific to user namespaces. There shouldn't be any way to use a user
namespace to gain access permissions towards /dev/cuse, otherwise we
have bigger problems than cuse to worry about.

Seth

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 14:29     ` Seth Forshee
  2018-01-17 18:56       ` Alban Crequy
@ 2018-01-17 18:56       ` Alban Crequy
  2018-01-17 19:31         ` Seth Forshee
       [not found]         ` <CADZs7q6ZHGHbrdL96Bmy148Zc6TxruiJrEeDjaDYEX8U-5QV1A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Alban Crequy @ 2018-01-17 18:56 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
> On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
>> [Adding Tejun, David, Tom for question about cuse]
>>
>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> > From: Seth Forshee <seth.forshee@canonical.com>
>> >
>> > In order to support mounts from namespaces other than
>> > init_user_ns, fuse must translate uids and gids to/from the
>> > userns of the process servicing requests on /dev/fuse. This
>> > patch does that, with a couple of restrictions on the namespace:
>> >
>> >  - The userns for the fuse connection is fixed to the namespace
>> >    from which /dev/fuse is opened.
>> >
>> >  - The namespace must be the same as s_user_ns.
>> >
>> > These restrictions simplify the implementation by avoiding the
>> > need to pass around userns references and by allowing fuse to
>> > rely on the checks in inode_change_ok for ownership changes.
>> > Either restriction could be relaxed in the future if needed.
>> >
>> > For cuse the namespace used for the connection is also simply
>> > current_user_ns() at the time /dev/cuse is opened.
>>
>> Was a use case discussed for using cuse in a new unprivileged userns?
>>
>> I ran some tests yesterday with cusexmp [1] and I could add a new char
>> device as an unprivileged user with:
>>
>> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
>> --maj=99 --min=30 --name=foo'
>>
>> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
>> Then, I could see the new device:
>>
>> $ cat /proc/devices | grep foo
>>  99 foo
>>
>> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
>> seems dangerous if the dev node can be provided otherwise and if we
>> don't have a use case for it.
>>
>> Thoughts?
>
> I can't remember the specific reasons, but I had concluded that letting
> unprivileged users use cuse within a user namespace isn't safe. But
> having a cuse device node usable by regular users at all is equally
> unsafe I suspect,

This makes sense.

> so I don't think your example demonstrates any problem
> specific to user namespaces. There shouldn't be any way to use a user
> namespace to gain access permissions towards /dev/cuse, otherwise we
> have bigger problems than cuse to worry about.

From my tests, the patch seems safe, but I don't fully understand why that is.

I am not trying to gain more permissions towards /dev/cuse but to
create another cuse char file from within the unprivileged userns. I
tested the scenario by patching the memfs userspace FUSE driver to
generate the char device whenever the file is named "cuse" (turning
the regular file into a char device with the cuse major/minor behind
the scenes):

$ unshare -U -r -m
# memfs /mnt/memfs &
# ls -l /mnt/memfs
# echo -n > /mnt/memfs/cuse
-bash: /mnt/memfs/cuse: Input/output error
# ls -l /mnt/memfs/cuse
crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
# cat /mnt/memfs/cuse
cat: /mnt/memfs/cuse: Permission denied

But then, I could not use that char device, even though it seems to
have the correct major/minor and permissions. The kernel FUSE code
seems to call init_special_inode() to handle character devices. I
don't understand why it seems to be safe.
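
For anyone who wants to repeat this check without ls and cat, a small
helper along the following lines could be used. It is only a sketch:
probe_path() and struct probe_result are names made up for this example,
and it assumes Linux with glibc (for major()/minor() from
<sys/sysmacros.h>).

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result record for this example; not a kernel or libc type. */
struct probe_result {
	int is_chr;        /* 1 if the inode is a character device, else 0 */
	unsigned int maj;  /* device major number (only set when is_chr) */
	unsigned int min;  /* device minor number (only set when is_chr) */
	int open_errno;    /* 0 if open() succeeded, otherwise the errno */
};

/*
 * Stat a path, record whether it is a character device and which
 * major/minor it carries, then try to open it read-only.
 * Returns 0 on success, or -1 (errno set by stat()) if the path
 * cannot be examined at all.
 */
int probe_path(const char *path, struct probe_result *res)
{
	struct stat st;
	int fd;

	memset(res, 0, sizeof(*res));

	if (stat(path, &st) < 0)
		return -1;

	res->is_chr = S_ISCHR(st.st_mode) ? 1 : 0;
	if (res->is_chr) {
		res->maj = major(st.st_rdev);
		res->min = minor(st.st_rdev);
	}

	/* O_NONBLOCK and O_NOCTTY keep the open from blocking or
	 * acquiring a controlling terminal on unusual device nodes. */
	fd = open(path, O_RDONLY | O_NONBLOCK | O_NOCTTY);
	if (fd < 0) {
		res->open_errno = errno;
	} else {
		res->open_errno = 0;
		close(fd);
	}

	return 0;
}
```

Called on /mnt/memfs/cuse inside the namespace, I would expect this to
report a character device together with a non-zero open_errno, matching
the ls and cat output above.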

Thanks!
Alban

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 18:56       ` Alban Crequy
@ 2018-01-17 19:31         ` Seth Forshee
  2018-01-18 10:29           ` Alban Crequy
  2018-01-18 10:29           ` Alban Crequy
       [not found]         ` <CADZs7q6ZHGHbrdL96Bmy148Zc6TxruiJrEeDjaDYEX8U-5QV1A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 219+ messages in thread
From: Seth Forshee @ 2018-01-17 19:31 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote:
> On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
> <seth.forshee@canonical.com> wrote:
> > On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> >> [Adding Tejun, David, Tom for question about cuse]
> >>
> >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> >> > From: Seth Forshee <seth.forshee@canonical.com>
> >> >
> >> > In order to support mounts from namespaces other than
> >> > init_user_ns, fuse must translate uids and gids to/from the
> >> > userns of the process servicing requests on /dev/fuse. This
> >> > patch does that, with a couple of restrictions on the namespace:
> >> >
> >> >  - The userns for the fuse connection is fixed to the namespace
> >> >    from which /dev/fuse is opened.
> >> >
> >> >  - The namespace must be the same as s_user_ns.
> >> >
> >> > These restrictions simplify the implementation by avoiding the
> >> > need to pass around userns references and by allowing fuse to
> >> > rely on the checks in inode_change_ok for ownership changes.
> >> > Either restriction could be relaxed in the future if needed.
> >> >
> >> > For cuse the namespace used for the connection is also simply
> >> > current_user_ns() at the time /dev/cuse is opened.
> >>
> >> Was a use case discussed for using cuse in a new unprivileged userns?
> >>
> >> I ran some tests yesterday with cusexmp [1] and I could add a new char
> >> device as an unprivileged user with:
> >>
> >> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> >> --maj=99 --min=30 --name=foo'
> >>
> >> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> >> Then, I could see the new device:
> >>
> >> $ cat /proc/devices | grep foo
> >>  99 foo
> >>
> >> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> >> seems dangerous if the dev node can be provided otherwise and if we
> >> don't have a use case for it.
> >>
> >> Thoughts?
> >
> > I can't remember the specific reasons, but I had concluded that letting
> > unprivileged users use cuse within a user namespace isn't safe. But
> > having a cuse device node usable by regular users at all is equally
> > unsafe I suspect,
> 
> This makes sense.
> 
> > so I don't think your example demonstrates any problem
> > specific to user namespaces. There shouldn't be any way to use a user
> > namespace to gain access permissions towards /dev/cuse, otherwise we
> > have bigger problems than cuse to worry about.
> 
> From my tests, the patch seems safe, but I don't fully understand why that is.
> 
> I am not trying to gain more permissions towards /dev/cuse but to
> create another cuse char file from within the unprivileged userns. I
> tested the scenario by patching the memfs userspace FUSE driver to
> generate the char device whenever the file is named "cuse" (turning
> the regular file into a char device with the cuse major/minor behind
> the scenes):
> 
> $ unshare -U -r -m
> # memfs /mnt/memfs &
> # ls -l /mnt/memfs
> # echo -n > /mnt/memfs/cuse
> -bash: /mnt/memfs/cuse: Input/output error
> # ls -l /mnt/memfs/cuse
> crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
> # cat /mnt/memfs/cuse
> cat: /mnt/memfs/cuse: Permission denied
> 
> But then, I could not use that char device, even though it seems to
> have the correct major/minor and permissions. The kernel FUSE code
> seems to call init_special_inode() to handle character devices. I
> don't understand why it seems to be safe.

Because for new mounts in non-init user namespaces, alloc_super() sets
the SB_I_NODEV flag in s_iflags, which disallows opening device nodes in
that filesystem.

Seth

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-01-17 19:31         ` Seth Forshee
  2018-01-18 10:29           ` Alban Crequy
@ 2018-01-18 10:29           ` Alban Crequy
  1 sibling, 0 replies; 219+ messages in thread
From: Alban Crequy @ 2018-01-18 10:29 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Dongsu Park, linux-kernel, containers, Eric W . Biederman,
	Miklos Szeredi, Sargun Dhillon, linux-fsdevel, Tejun Heo,
	David Herrmann, Tom Gundersen

On Wed, Jan 17, 2018 at 8:31 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
> On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote:
>> On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
>> <seth.forshee@canonical.com> wrote:
>> > On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
>> >> [Adding Tejun, David, Tom for question about cuse]
>> >>
>> >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> >> > From: Seth Forshee <seth.forshee@canonical.com>
>> >> >
>> >> > In order to support mounts from namespaces other than
>> >> > init_user_ns, fuse must translate uids and gids to/from the
>> >> > userns of the process servicing requests on /dev/fuse. This
>> >> > patch does that, with a couple of restrictions on the namespace:
>> >> >
>> >> >  - The userns for the fuse connection is fixed to the namespace
>> >> >    from which /dev/fuse is opened.
>> >> >
>> >> >  - The namespace must be the same as s_user_ns.
>> >> >
>> >> > These restrictions simplify the implementation by avoiding the
>> >> > need to pass around userns references and by allowing fuse to
>> >> > rely on the checks in inode_change_ok for ownership changes.
>> >> > Either restriction could be relaxed in the future if needed.
>> >> >
>> >> > For cuse the namespace used for the connection is also simply
>> >> > current_user_ns() at the time /dev/cuse is opened.
>> >>
>> >> Was a use case discussed for using cuse in a new unprivileged userns?
>> >>
>> >> I ran some tests yesterday with cusexmp [1] and I could add a new char
>> >> device as an unprivileged user with:
>> >>
>> >> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
>> >> --maj=99 --min=30 --name=foo'
>> >>
>> >> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
>> >> Then, I could see the new device:
>> >>
>> >> $ cat /proc/devices | grep foo
>> >>  99 foo
>> >>
>> >> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
>> >> seems dangerous if the dev node can be provided otherwise and if we
>> >> don't have a use case for it.
>> >>
>> >> Thoughts?
>> >
>> > I can't remember the specific reasons, but I had concluded that letting
>> > unprivileged users use cuse within a user namespace isn't safe. But
>> > having a cuse device node usable by regular users at all is equally
>> > unsafe I suspect,
>>
>> This makes sense.
>>
>> > so I don't think your example demonstrates any problem
>> > specific to user namespaces. There shouldn't be any way to use a user
>> > namespace to gain access permissions towards /dev/cuse, otherwise we
>> > have bigger problems than cuse to worry about.
>>
>> From my tests, the patch seems safe, but I don't fully understand why that is.
>>
>> I am not trying to gain more permissions towards /dev/cuse but to
>> create another cuse char file from within the unprivileged userns. I
>> tested the scenario by patching the memfs userspace FUSE driver to
>> generate the char device whenever the file is named "cuse" (turning
>> the regular file into a char device with the cuse major/minor behind
>> the scenes):
>>
>> $ unshare -U -r -m
>> # memfs /mnt/memfs &
>> # ls -l /mnt/memfs
>> # echo -n > /mnt/memfs/cuse
>> -bash: /mnt/memfs/cuse: Input/output error
>> # ls -l /mnt/memfs/cuse
>> crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
>> # cat /mnt/memfs/cuse
>> cat: /mnt/memfs/cuse: Permission denied
>>
>> But then, I could not use that char device, even though it seems to
>> have the correct major/minor and permissions. The kernel FUSE code
>> seems to call init_special_inode() to handle character devices. I
>> don't understand why it seems to be safe.
>
> Because for new mounts in non-init user namespaces, alloc_super() sets
> the SB_I_NODEV flag in s_iflags, which disallows opening device nodes in
> that filesystem.

I see. Thanks for the explanation!

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
       [not found]     ` <CANxcAMvwwiPXBTKmTM9sEo8Y1T--V7fNaFqzHfyEvwvaYQV60A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-18 14:58       ` Alban Crequy
  0 siblings, 0 replies; 219+ messages in thread
From: Alban Crequy @ 2018-01-18 14:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Containers, LKML, Seth Forshee, Sargun Dhillon

On Tue, Jan 9, 2018 at 4:05 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> Hi,
>
> On Mon, Dec 25, 2017 at 8:05 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Dongsu Park <dongsu@kinvolk.io> writes:
>>
>>> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
>>> The latest patchset was v4:
>>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1132206.html
>>>
>>> At the moment, filesystems backed by a physical medium can only be
>>> mounted by real root in the initial user namespace. This restriction
>>> exists because allowing root in a non-init user namespace to mount such
>>> a filesystem would effectively let that user control the underlying
>>> source of the filesystem. In the case of FUSE, the source would mean
>>> any underlying device.
>>>
>>> However, in many use cases such as containers, it's necessary to allow
>>> filesystems to be mounted from non-init user namespaces. The goal of
>>> this patchset is to allow FUSE filesystems to be mounted from non-init
>>> user namespaces. Support for other filesystems like ext4 is not in the
>>> scope of this patchset.
>>>
>>> Let me describe how to test mounting from non-init user namespaces. It's
>>> assumed that tests are done via sshfs, a userspace filesystem based on
>>> FUSE with ssh as the backend. The test system is Fedora 27.
>>
>> In general I am for this work, and more bodies and more eyes on it is
>> generally better.
>>
>> I will review this after the New Year, I am out for the holidays right
>> now.
>
> Thanks. I'll wait for your review.

Hi Eric,

Do you have some cycles for this now that it is the new year?

A review on the associated ima issue would also be appreciated:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587678.html

Cheers,
Alban

>>> ====
>>> $ sudo dnf install -y sshfs
>>> $ sudo mkdir -p /mnt/userns
>>>
>>> ### workaround to get the sshfs permission checks
>>> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>>>
>>> $ unshare -U -r -m
>>> # sshfs root@localhost: /mnt/userns
>>>
>>> ### You can see sshfs being mounted from a non-init user namespace
>>> # mount | grep sshfs
>>> root@localhost: on /mnt/userns type fuse.sshfs
>>> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>>
>>> # touch /mnt/userns/test
>>> # ls -l /mnt/userns/test
>>> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
>>> ====
>>>
>>> Open another terminal, check the mountpoint from outside the namespace.
>>>
>>> ====
>>> $ grep userns /proc/$(pidof sshfs)/mountinfo
>>> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
>>> root@localhost: rw,user_id=0,group_id=0
>>> ====
>>>
>>> After all tests are done, you can unmount the filesystem
>>> inside the namespace.
>>>
>>> ====
>>> # fusermount -u /mnt/userns
>>> ====
>>>
>>> Changes since v4:
>>>  * Remove other parts like ext4 to keep the patchset minimal for FUSE
>>>  * Add and change commit messages
>>>  * Describe how to test non-init user namespaces
>>>
>>> TODO:
>>>  * Think through potential security implications. There are 2 patches
>>>    being prepared for security issues. One is "ima: define a new policy
>>>    option named force" by Mimi Zohar, which adds an option to specify
>>>    that the results should not be cached:
>>>    https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>>>    The other one is to basically prevent FUSE results from being cached,
>>>    which is still in progress.
>>>
>>>  * Test IMA/LSMs. Details are written in
>>>    https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>>>
>>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>>> additional inode permission.
>>>
>>> Patches 3-7 allow the superblock owner to change ownership of inodes, and
>>> deal with additional capability checks w.r.t user namespaces.
>>>
>>> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
>>> user namespace.
>>>
>>> Patch 11 handles a corner case of non-root users in EVM.
>>>
>>> The patchset is also available in our github repo:
>>>   https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>>>
>>>
>>> Eric W. Biederman (1):
>>>   fs: Allow superblock owner to change ownership of inodes
>>>
>>> Seth Forshee (10):
>>>   block_dev: Support checking inode permissions in lookup_bdev()
>>>   mtd: Check permissions towards mtd block device inode when mounting
>>>   fs: Don't remove suid for CAP_FSETID for userns root
>>>   fs: Allow superblock owner to access do_remount_sb()
>>>   capabilities: Allow privileged user in s_user_ns to set security.*
>>>     xattrs
>>>   fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>>>   fuse: Support fuse filesystems outside of init_user_ns
>>>   fuse: Restrict allow_other to the superblock's namespace or a
>>>     descendant
>>>   fuse: Allow user namespace mounts
>>>   evm: Don't update hmacs in user ns mounts
>>>
>>>  drivers/md/bcache/super.c           |  2 +-
>>>  drivers/md/dm-table.c               |  2 +-
>>>  drivers/mtd/mtdsuper.c              |  6 +++++-
>>>  fs/attr.c                           | 34 ++++++++++++++++++++++++++--------
>>>  fs/block_dev.c                      | 13 ++++++++++---
>>>  fs/fuse/cuse.c                      |  3 ++-
>>>  fs/fuse/dev.c                       | 11 ++++++++---
>>>  fs/fuse/dir.c                       | 16 ++++++++--------
>>>  fs/fuse/fuse_i.h                    |  6 +++++-
>>>  fs/fuse/inode.c                     | 35 +++++++++++++++++++++--------------
>>>  fs/inode.c                          |  6 ++++--
>>>  fs/ioctl.c                          |  4 ++--
>>>  fs/namespace.c                      |  4 ++--
>>>  fs/proc/base.c                      |  7 +++++++
>>>  fs/proc/generic.c                   |  7 +++++++
>>>  fs/proc/proc_sysctl.c               |  7 +++++++
>>>  fs/quota/quota.c                    |  2 +-
>>>  include/linux/fs.h                  |  2 +-
>>>  kernel/user_namespace.c             |  1 +
>>>  security/commoncap.c                |  8 ++++++--
>>>  security/integrity/evm/evm_crypto.c |  3 ++-
>>>  21 files changed, 127 insertions(+), 52 deletions(-)
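
The quoted test procedure relies on "unshare -U -r", which maps the
calling user to root inside a new user namespace. A quick way to confirm
which kind of namespace a process is in is to look at its
/proc/<pid>/uid_map. The following is only a minimal userspace sketch
under that assumption: the struct and both helper names are hypothetical
and are not part of the patchset.

```c
#include <assert.h>
#include <stdio.h>

/* One line of /proc/<pid>/uid_map: first ID inside the namespace,
 * first ID outside it, and the length of the mapped range. */
struct id_extent {
	unsigned int inside;
	unsigned int outside;
	unsigned int count;
};

/* Parse a single uid_map line. Returns 1 on success, 0 if malformed. */
static int parse_id_extent(const char *line, struct id_extent *out)
{
	assert(line != NULL && out != NULL);
	return sscanf(line, "%u %u %u",
		      &out->inside, &out->outside, &out->count) == 3;
}

/* Hypothetical helper: the initial user namespace exposes the identity
 * mapping "0 0 4294967295"; any other extent suggests a non-init
 * user namespace. */
static int looks_like_init_userns(const struct id_extent *e)
{
	return e->inside == 0 && e->outside == 0 &&
	       e->count == 4294967295u;
}
```

For a shell started with "unshare -U -r" by UID 1000, the map would
typically read "0 1000 1", which the second helper reports as non-init.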

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2018-02-12 15:57       ` Miklos Szeredi
       [not found]   ` <c85c293e19a478353aba8e6e3ee39e5914f798d5.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-12 15:57 UTC (permalink / raw)
  To: Dongsu Park
  Cc: containers, lkml,
	Seth Forshee, Alban Crequy, Eric W . Biederman, Sargun Dhillon,
	linux-fsdevel

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.

Can we not introduce potential userspace interface regressions?

The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
allow server to run in different pid_ns") will probably bite us here
as well.

We basically need two modes of operation:

a) old, backward compatible (not introducing any new failure modes),
created with privileged mount
b) new, non-backward compatible, created with unprivileged mount

Technically there would still be a risk from breaking userspace, since
we are using the same entry point for both, but let's hope that no
practical problems come from that.

> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
>  #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>         if (!cc)
>                 return -ENOMEM;
>
> -       fuse_conn_init(&cc->fc);
> +       fuse_conn_init(&cc->fc, current_user_ns());
>
>         fud = fuse_dev_alloc(&cc->fc);
>         if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>
>         return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> +       if (task_active_pid_ns(current) != fc->pid_ns ||
> +           current_user_ns() != fc->user_ns) {

I don't get it.  Why recalculate the pid if the user_ns does not match?

>                 rcu_read_lock();
>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>                 rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>         stat->ino = attr->ino;
>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         stat->nlink = attr->nlink;
> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>         stat->rdev = inode->i_rdev;
>         stat->atime.tv_sec = attr->atime;
>         stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>         return true;
>  }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -                          bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>         unsigned ivalid = iattr->ia_valid;
>
>         if (ivalid & ATTR_MODE)
>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>         if (ivalid & ATTR_UID)
> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>         if (ivalid & ATTR_GID)
> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>         if (ivalid & ATTR_SIZE)
>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>         if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
>         memset(&inarg, 0, sizeof(inarg));
>         memset(&outarg, 0, sizeof(outarg));
> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>         if (file) {
>                 struct fuse_file *ff = file->private_data;
>                 inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>         /** The pid namespace for this mount */
>         struct pid_namespace *pid_ns;
>
> +       /** The user namespace for this mount */
> +       struct user_namespace *user_ns;
> +
>         /** Maximum read size */
>         unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>         inode->i_ino     = fuse_squash_ino(attr->ino);
>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         set_nlink(inode, attr->nlink);
> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>         inode->i_blocks  = attr->blocks;
>         inode->i_atime.tv_sec   = attr->atime;
>         inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>         return err;
>  }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +                         struct user_namespace *user_ns)
>  {
>         char *p;
>         memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_USER_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->user_id = make_kuid(current_user_ns(), uv);
> +                       d->user_id = make_kuid(user_ns, uv);
>                         if (!uid_valid(d->user_id))
>                                 return 0;
>                         d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_GROUP_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->group_id = make_kgid(current_user_ns(), uv);
> +                       d->group_id = make_kgid(user_ns, uv);
>                         if (!gid_valid(d->group_id))
>                                 return 0;
>                         d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         struct super_block *sb = root->d_sb;
>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>         if (fc->default_permissions)
>                 seq_puts(m, ",default_permissions");
>         if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>         fpq->connected = 1;
>  }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>         memset(fc, 0, sizeof(*fc));
>         spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>         fc->attr_version = 1;
>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +       fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (fc->destroy_req)
>                         fuse_request_free(fc->destroy_req);
>                 put_pid_ns(fc->pid_ns);
> +               put_user_ns(fc->user_ns);
>                 fc->release(fc);
>         }
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> -       if (!parse_fuse_opt(data, &d, is_bdev))
> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>                 goto err;
>
>         if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!file)
>                 goto err;
>
> -       if ((file->f_op != &fuse_dev_operations) ||
> -           (file->f_cred->user_ns != &init_user_ns))
> +       /*
> +        * Require mount to happen from the same user namespace which
> +        * opened /dev/fuse to prevent potential attacks.
> +        */
> +       if (file->f_op != &fuse_dev_operations ||
> +           file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!fc)
>                 goto err_fput;
>
> -       fuse_conn_init(fc);
> +       fuse_conn_init(fc, sb->s_user_ns);
>         fc->release = fuse_free_conn;
>
>         fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>
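
The quoted patch pairs get_user_ns() in fuse_conn_init() with
put_user_ns() in fuse_conn_put(), so the namespace stays alive for as
long as the connection does. The lifetime rule can be modelled in plain
userspace C as below. This is only a sketch with hypothetical types and
plain integer counters; the kernel uses its own reference-counting
primitives and frees the objects when the count reaches zero.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for struct user_namespace and struct fuse_conn. */
struct ns {
	int refs;
};

struct conn {
	int refs;
	struct ns *user_ns;
};

/* Take an extra reference on a namespace that is already alive. */
static struct ns *ns_get(struct ns *n)
{
	assert(n != NULL && n->refs > 0);
	n->refs++;
	return n;
}

/* Drop one namespace reference. */
static void ns_put(struct ns *n)
{
	assert(n != NULL && n->refs > 0);
	n->refs--;
}

/* The connection pins the namespace it was created for. */
static void conn_init(struct conn *c, struct ns *n)
{
	c->refs = 1;
	c->user_ns = ns_get(n);
}

static struct conn *conn_get(struct conn *c)
{
	c->refs++;
	return c;
}

/* Only the final put releases the namespace reference. */
static void conn_put(struct conn *c)
{
	assert(c != NULL && c->refs > 0);
	if (--c->refs == 0) {
		ns_put(c->user_ns);
		c->user_ns = NULL;
	}
}
```

The point of the model is that intermediate conn_put() calls leave the
namespace count untouched; it drops only when the last connection
reference goes away.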

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
@ 2018-02-12 15:57       ` Miklos Szeredi
  0 siblings, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-12 15:57 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, containers, Alban Crequy, Eric W . Biederman, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.

Can we not introduce potential userspace interface regressions?

The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
allow server to run in different pid_ns") will probably bite us here
as well.

We basically need two modes of operation:

a) old, backward compatible (not introducing any new failure modes),
created with privileged mount
b) new, non-backward compatible, created with unprivileged mount

Technically there would still be a risk from breaking userspace, since
we are using the same entry point for both, but let's hope that no
practical problems come from that.

> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
>  #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>         if (!cc)
>                 return -ENOMEM;
>
> -       fuse_conn_init(&cc->fc);
> +       fuse_conn_init(&cc->fc, current_user_ns());
>
>         fud = fuse_dev_alloc(&cc->fc);
>         if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>
>         return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> +       if (task_active_pid_ns(current) != fc->pid_ns ||
> +           current_user_ns() != fc->user_ns) {

I don't get it.  Why recalculate the pid if the user_ns does not match?

>                 rcu_read_lock();
>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>                 rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>         stat->ino = attr->ino;
>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         stat->nlink = attr->nlink;
> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>         stat->rdev = inode->i_rdev;
>         stat->atime.tv_sec = attr->atime;
>         stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>         return true;
>  }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> -                          bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>  {
>         unsigned ivalid = iattr->ia_valid;
>
>         if (ivalid & ATTR_MODE)
>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>         if (ivalid & ATTR_UID)
> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>         if (ivalid & ATTR_GID)
> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>         if (ivalid & ATTR_SIZE)
>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>         if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
>         memset(&inarg, 0, sizeof(inarg));
>         memset(&outarg, 0, sizeof(outarg));
> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>         if (file) {
>                 struct fuse_file *ff = file->private_data;
>                 inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
>  #include <linux/xattr.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
>  /** Max number of pages that can be used in a single read request */
>  #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
>         /** The pid namespace for this mount */
>         struct pid_namespace *pid_ns;
>
> +       /** The user namespace for this mount */
> +       struct user_namespace *user_ns;
> +
>         /** Maximum read size */
>         unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>  /**
>   * Initialize fuse_conn
>   */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
>  /**
>   * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>         inode->i_ino     = fuse_squash_ino(attr->ino);
>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>         set_nlink(inode, attr->nlink);
> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>         inode->i_blocks  = attr->blocks;
>         inode->i_atime.tv_sec   = attr->atime;
>         inode->i_atime.tv_nsec  = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>         return err;
>  }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> +                         struct user_namespace *user_ns)
>  {
>         char *p;
>         memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_USER_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->user_id = make_kuid(current_user_ns(), uv);
> +                       d->user_id = make_kuid(user_ns, uv);
>                         if (!uid_valid(d->user_id))
>                                 return 0;
>                         d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>                 case OPT_GROUP_ID:
>                         if (fuse_match_uint(&args[0], &uv))
>                                 return 0;
> -                       d->group_id = make_kgid(current_user_ns(), uv);
> +                       d->group_id = make_kgid(user_ns, uv);
>                         if (!gid_valid(d->group_id))
>                                 return 0;
>                         d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>         struct super_block *sb = root->d_sb;
>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>         if (fc->default_permissions)
>                 seq_puts(m, ",default_permissions");
>         if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>         fpq->connected = 1;
>  }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>  {
>         memset(fc, 0, sizeof(*fc));
>         spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>         fc->attr_version = 1;
>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +       fc->user_ns = get_user_ns(user_ns);
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (fc->destroy_req)
>                         fuse_request_free(fc->destroy_req);
>                 put_pid_ns(fc->pid_ns);
> +               put_user_ns(fc->user_ns);
>                 fc->release(fc);
>         }
>  }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> -       if (!parse_fuse_opt(data, &d, is_bdev))
> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>                 goto err;
>
>         if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!file)
>                 goto err;
>
> -       if ((file->f_op != &fuse_dev_operations) ||
> -           (file->f_cred->user_ns != &init_user_ns))
> +       /*
> +        * Require mount to happen from the same user namespace which
> +        * opened /dev/fuse to prevent potential attacks.
> +        */
> +       if (file->f_op != &fuse_dev_operations ||
> +           file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>         if (!fc)
>                 goto err_fput;
>
> -       fuse_conn_init(fc);
> +       fuse_conn_init(fc, sb->s_user_ns);
>         fc->release = fuse_free_conn;
>
>         fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>


* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-12 15:57       ` Miklos Szeredi
  (?)
@ 2018-02-12 16:35       ` Eric W. Biederman
       [not found]         ` <87lgfy5fpd.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2018-02-13 10:20         ` Miklos Szeredi
  -1 siblings, 2 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-12 16:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> In order to support mounts from namespaces other than
>> init_user_ns, fuse must translate uids and gids to/from the
>> userns of the process servicing requests on /dev/fuse. This
>> patch does that, with a couple of restrictions on the namespace:
>>
>>  - The userns for the fuse connection is fixed to the namespace
>>    from which /dev/fuse is opened.
>>
>>  - The namespace must be the same as s_user_ns.
>>
>> These restrictions simplify the implementation by avoiding the
>> need to pass around userns references and by allowing fuse to
>> rely on the checks in inode_change_ok for ownership changes.
>> Either restriction could be relaxed in the future if needed.
>
> Can we not introduce potential userspace interface regressions?
>
> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
> allow server to run in different pid_ns") will probably bite us here
> as well.

Maybe, but unlike the pid namespace no one has been able to mount
fuse outside of init_user_ns so we are much less exposed.  I agree we
should be careful.

> We basically need two modes of operation:
>
> a) old, backward compatible (not introducing any new failure modes),
> created with privileged mount
> b) new, non-backward compatible, created with unprivileged mount
>
> Technically there would still be a risk of breaking userspace, since
> we are using the same entry point for both, but let's hope that no
> practical problems come from that.

Answering from a 10,000 foot perspective, there are two cases:

 - Requests to read/write the filesystem from outside of s_user_ns.
   These run no risk of breaking userspace, as this mode has not been
   implemented before.

 - Restrictions at mount time to ensure we are not dealing with a crazy
   mix of namespaces.  This has a small chance of breaking someone's
   crazy setup.


Dropping requests to read/write the filesystem when the requester does
not map into s_user_ns should not be a problem to enable universally.  If
s_user_ns is init_user_ns everything maps so there is no restriction.
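
A minimal userspace sketch of that rule, under stated assumptions: the
struct toy_idmap type and the toy_* helpers are hypothetical stand-ins
for a namespace's id map and for from_kuid()/from_kgid(), and a single
table is used for both uids and gids to keep it short.  Only the
(uid_t)-1 sentinel and the -EOVERFLOW return are taken from the quoted
patch.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* Hypothetical stand-in for one extent of a user namespace's id map. */
struct toy_idmap {
	uint32_t ns_first;     /* first id as seen inside the namespace */
	uint32_t kernel_first; /* first id as seen by the kernel */
	uint32_t count;        /* number of ids in the extent */
};

/* Like from_kuid(): the id inside the namespace, or (uint32_t)-1 if unmapped. */
static uint32_t toy_from_kid(const struct toy_idmap *map, uint32_t kid)
{
	if (kid >= map->kernel_first && kid - map->kernel_first < map->count)
		return map->ns_first + (kid - map->kernel_first);
	return TOY_INVALID_ID;
}

/* Mirrors the check added to __fuse_get_req(): drop unmapped requesters. */
static int toy_get_req(const struct toy_idmap *map, uint32_t fsuid,
		       uint32_t fsgid)
{
	if (toy_from_kid(map, fsuid) == TOY_INVALID_ID ||
	    toy_from_kid(map, fsgid) == TOY_INVALID_ID)
		return -EOVERFLOW;
	return 0;
}
```

With an identity map covering every id except (uint32_t)-1, which is
how init_user_ns behaves, toy_get_req() accepts every valid requester.
With a map of kernel ids 100000-165535 onto 0-65535, a requester whose
fsuid is 1000 is refused with -EOVERFLOW.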



If we want to ensure maximum backwards compatibility, then when the
fuse filesystem is mounted in init_user_ns but the device for the
communication channel is opened in some other user namespace, we can
just force the communication channel to operate in init_user_ns.

That will be 100% backwards compatible in all cases and, as far as I
can see, removes the need for having different ``modes'' of operation.
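
A rough userspace sketch of that selection, not the patch itself:
struct toy_ns, toy_init_user_ns and toy_pick_conn_ns() are hypothetical
names, namespaces are compared by address as the kernel does, and the
-EINVAL returned for a mismatched pair is an assumption made for this
example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace; identity is by address. */
struct toy_ns {
	const char *name;
};

static struct toy_ns toy_init_user_ns = { "init_user_ns" };

/*
 * Pick the namespace the connection translates ids in.
 *   sb_ns:  namespace owning the superblock (s_user_ns)
 *   dev_ns: namespace that opened /dev/fuse
 * Returns 0 and stores the choice in *out, or -EINVAL for a mixed setup.
 */
static int toy_pick_conn_ns(struct toy_ns *sb_ns, struct toy_ns *dev_ns,
			    struct toy_ns **out)
{
	if (sb_ns == &toy_init_user_ns) {
		/* Privileged mount: keep the old behaviour, whoever opened the device. */
		*out = &toy_init_user_ns;
		return 0;
	}
	if (dev_ns != sb_ns)
		return -EINVAL; /* unprivileged mount with mismatched namespaces */
	*out = sb_ns;
	return 0;
}
```

The point of the first branch is that a mount in init_user_ns never
fails because of where the device was opened, so existing setups keep
working; only mounts from other namespaces see the stricter check.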



This does look like the time to give all of this a hard look and see if
we can get these patches in shape to be merged.

Eric



>> For cuse the namespace used for the connection is also simply
>> current_user_ns() at the time /dev/cuse is opened.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Miklos Szeredi <mszeredi@redhat.com>
>> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
>> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
>> ---
>>  fs/fuse/cuse.c   |  3 ++-
>>  fs/fuse/dev.c    | 11 ++++++++---
>>  fs/fuse/dir.c    | 14 +++++++-------
>>  fs/fuse/fuse_i.h |  6 +++++-
>>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>>  5 files changed, 41 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
>> index e9e97803..b1b83259 100644
>> --- a/fs/fuse/cuse.c
>> +++ b/fs/fuse/cuse.c
>> @@ -48,6 +48,7 @@
>>  #include <linux/stat.h>
>>  #include <linux/module.h>
>>  #include <linux/uio.h>
>> +#include <linux/user_namespace.h>
>>
>>  #include "fuse_i.h"
>>
>> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>>         if (!cc)
>>                 return -ENOMEM;
>>
>> -       fuse_conn_init(&cc->fc);
>> +       fuse_conn_init(&cc->fc, current_user_ns());
>>
>>         fud = fuse_dev_alloc(&cc->fc);
>>         if (!fud) {
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 17f0d05b..0f780e16 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>>
>>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>>  {
>> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> +       req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
>> +       req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>>  }
>>
>> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>>         __set_bit(FR_WAITING, &req->flags);
>>         if (for_background)
>>                 __set_bit(FR_BACKGROUND, &req->flags);
>> +       if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
>> +               fuse_put_request(fc, req);
>> +               return ERR_PTR(-EOVERFLOW);
>> +       }
>>
>>         return req;
>>
>> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>>         in = &req->in;
>>         reqsize = in->h.len;
>>
>> -       if (task_active_pid_ns(current) != fc->pid_ns) {
>> +       if (task_active_pid_ns(current) != fc->pid_ns ||
>> +           current_user_ns() != fc->user_ns) {
>
> I don't get it.  Why recalculate the pid if the user_ns does not match?
>
>>                 rcu_read_lock();
>>                 in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>>                 rcu_read_unlock();
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 24967382..ad1cfac1 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>>         stat->ino = attr->ino;
>>         stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>>         stat->nlink = attr->nlink;
>> -       stat->uid = make_kuid(&init_user_ns, attr->uid);
>> -       stat->gid = make_kgid(&init_user_ns, attr->gid);
>> +       stat->uid = make_kuid(fc->user_ns, attr->uid);
>> +       stat->gid = make_kgid(fc->user_ns, attr->gid);
>>         stat->rdev = inode->i_rdev;
>>         stat->atime.tv_sec = attr->atime;
>>         stat->atime.tv_nsec = attr->atimensec;
>> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>>         return true;
>>  }
>>
>> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
>> -                          bool trust_local_cmtime)
>> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
>> +                          struct fuse_setattr_in *arg, bool trust_local_cmtime)
>>  {
>>         unsigned ivalid = iattr->ia_valid;
>>
>>         if (ivalid & ATTR_MODE)
>>                 arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
>>         if (ivalid & ATTR_UID)
>> -               arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
>> +               arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>>         if (ivalid & ATTR_GID)
>> -               arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
>> +               arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>>         if (ivalid & ATTR_SIZE)
>>                 arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
>>         if (ivalid & ATTR_ATIME) {
>> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>>
>>         memset(&inarg, 0, sizeof(inarg));
>>         memset(&outarg, 0, sizeof(outarg));
>> -       iattr_to_fattr(attr, &inarg, trust_local_cmtime);
>> +       iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>>         if (file) {
>>                 struct fuse_file *ff = file->private_data;
>>                 inarg.valid |= FATTR_FH;
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index d5773ca6..364e65c8 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -26,6 +26,7 @@
>>  #include <linux/xattr.h>
>>  #include <linux/pid_namespace.h>
>>  #include <linux/refcount.h>
>> +#include <linux/user_namespace.h>
>>
>>  /** Max number of pages that can be used in a single read request */
>>  #define FUSE_MAX_PAGES_PER_REQ 32
>> @@ -466,6 +467,9 @@ struct fuse_conn {
>>         /** The pid namespace for this mount */
>>         struct pid_namespace *pid_ns;
>>
>> +       /** The user namespace for this mount */
>> +       struct user_namespace *user_ns;
>> +
>>         /** Maximum read size */
>>         unsigned max_read;
>>
>> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>>  /**
>>   * Initialize fuse_conn
>>   */
>> -void fuse_conn_init(struct fuse_conn *fc);
>> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>>
>>  /**
>>   * Release reference to fuse_conn
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 2f504d61..7f6b2e55 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>>         inode->i_ino     = fuse_squash_ino(attr->ino);
>>         inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>>         set_nlink(inode, attr->nlink);
>> -       inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
>> -       inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
>> +       inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
>> +       inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
>>         inode->i_blocks  = attr->blocks;
>>         inode->i_atime.tv_sec   = attr->atime;
>>         inode->i_atime.tv_nsec  = attr->atimensec;
>> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>>         return err;
>>  }
>>
>> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
>> +                         struct user_namespace *user_ns)
>>  {
>>         char *p;
>>         memset(d, 0, sizeof(struct fuse_mount_data));
>> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>>                 case OPT_USER_ID:
>>                         if (fuse_match_uint(&args[0], &uv))
>>                                 return 0;
>> -                       d->user_id = make_kuid(current_user_ns(), uv);
>> +                       d->user_id = make_kuid(user_ns, uv);
>>                         if (!uid_valid(d->user_id))
>>                                 return 0;
>>                         d->user_id_present = 1;
>> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>>                 case OPT_GROUP_ID:
>>                         if (fuse_match_uint(&args[0], &uv))
>>                                 return 0;
>> -                       d->group_id = make_kgid(current_user_ns(), uv);
>> +                       d->group_id = make_kgid(user_ns, uv);
>>                         if (!gid_valid(d->group_id))
>>                                 return 0;
>>                         d->group_id_present = 1;
>> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>>         struct super_block *sb = root->d_sb;
>>         struct fuse_conn *fc = get_fuse_conn_super(sb);
>>
>> -       seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
>> -       seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
>> +       seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
>> +       seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>>         if (fc->default_permissions)
>>                 seq_puts(m, ",default_permissions");
>>         if (fc->allow_other)
>> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>>         fpq->connected = 1;
>>  }
>>
>> -void fuse_conn_init(struct fuse_conn *fc)
>> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>>  {
>>         memset(fc, 0, sizeof(*fc));
>>         spin_lock_init(&fc->lock);
>> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>>         fc->attr_version = 1;
>>         get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>>         fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
>> +       fc->user_ns = get_user_ns(user_ns);
>>  }
>>  EXPORT_SYMBOL_GPL(fuse_conn_init);
>>
>> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>>                 if (fc->destroy_req)
>>                         fuse_request_free(fc->destroy_req);
>>                 put_pid_ns(fc->pid_ns);
>> +               put_user_ns(fc->user_ns);
>>                 fc->release(fc);
>>         }
>>  }
>> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>
>>         sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>>
>> -       if (!parse_fuse_opt(data, &d, is_bdev))
>> +       if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>>                 goto err;
>>
>>         if (is_bdev) {
>> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>         if (!file)
>>                 goto err;
>>
>> -       if ((file->f_op != &fuse_dev_operations) ||
>> -           (file->f_cred->user_ns != &init_user_ns))
>> +       /*
>> +        * Require mount to happen from the same user namespace which
>> +        * opened /dev/fuse to prevent potential attacks.
>> +        */
>> +       if (file->f_op != &fuse_dev_operations ||
>> +           file->f_cred->user_ns != sb->s_user_ns)
>>                 goto err_fput;
>>
>>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
>> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>         if (!fc)
>>                 goto err_fput;
>>
>> -       fuse_conn_init(fc);
>> +       fuse_conn_init(fc, sb->s_user_ns);
>>         fc->release = fuse_free_conn;
>>
>>         fud = fuse_dev_alloc(fc);
>> --
>> 2.13.6
>>

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-12 16:35       ` Eric W. Biederman
       [not found]         ` <87lgfy5fpd.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-02-13 10:20         ` Miklos Szeredi
       [not found]           ` <CAOssrKcKz8p9YQJLf2W_NCBo+12auxir5jFwXGbANdWdgavpsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-16 21:52           ` Eric W. Biederman
  1 sibling, 2 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-13 10:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>>> From: Seth Forshee <seth.forshee@canonical.com>
>>>
>>> In order to support mounts from namespaces other than
>>> init_user_ns, fuse must translate uids and gids to/from the
>>> userns of the process servicing requests on /dev/fuse. This
>>> patch does that, with a couple of restrictions on the namespace:
>>>
>>>  - The userns for the fuse connection is fixed to the namespace
>>>    from which /dev/fuse is opened.
>>>
>>>  - The namespace must be the same as s_user_ns.
>>>
>>> These restrictions simplify the implementation by avoiding the
>>> need to pass around userns references and by allowing fuse to
>>> rely on the checks in inode_change_ok for ownership changes.
>>> Either restriction could be relaxed in the future if needed.
>>
>> Can we not introduce potential userspace interface regressions?
>>
>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>> allow server to run in different pid_ns") will probably bite us here
>> as well.
>
> Maybe, but unlike the pid namespace no one has been able to mount
> fuse outside of init_user_ns so we are much less exposed.  I agree we
> should be careful.

Have to wrap my head around all the rules here.

There's the may_mount() one:

    ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)

Um, first of all, why isn't it checking current->cred->user_ns?

Ah, there it is in sget():

    ns_capable(user_ns, CAP_SYS_ADMIN)

I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
doesn't have FS_USERNS_MOUNT.  This is the one that prevents fuse
mounts from being created when (current->cred->user_ns !=
&init_user_ns).

Maybe there's a logic to this web of namespaces, but I don't yet see
it.  Is it documented somewhere?
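
For reference, the checks walked through above can be modeled in a few lines of userspace C.  This is only a toy sketch under stated assumptions: every toy_* name is a hypothetical stand-in rather than a kernel definition, and the capability walk is a simplification of what ns_capable() does (the owner of a namespace, living in its parent, is treated as fully capable in it).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; none of these are real definitions. */
struct toy_userns {
	const struct toy_userns *parent;	/* NULL for the initial namespace */
	unsigned int owner_uid;			/* uid, in the parent, that created it */
};

struct toy_cred {
	const struct toy_userns *user_ns;	/* namespace the credentials live in */
	unsigned int euid;
	bool cap_sys_admin;			/* CAP_SYS_ADMIN in the effective set */
};

#define TOY_FS_USERNS_MOUNT 0x1u		/* models FS_USERNS_MOUNT */

static const struct toy_userns toy_init_user_ns = { NULL, 0 };

/* Simplified model of ns_capable(target, CAP_SYS_ADMIN). */
static bool toy_ns_capable(const struct toy_cred *cred,
			   const struct toy_userns *target)
{
	const struct toy_userns *ns;

	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* The creator of a namespace has all capabilities inside it. */
		if (ns->parent == cred->user_ns && ns->owner_uid == cred->euid)
			return true;
	}
	return false;
}

/* Models may_mount(): capability over the mount namespace's owning userns. */
static bool toy_may_mount(const struct toy_cred *cred,
			  const struct toy_userns *mnt_userns)
{
	return toy_ns_capable(cred, mnt_userns);
}

/* Models the sget() and sget_userns() checks taken together. */
static bool toy_sget_allowed(const struct toy_cred *cred, unsigned int fs_flags,
			     const struct toy_userns *sb_userns)
{
	if (!toy_ns_capable(cred, sb_userns))
		return false;
	/* Without FS_USERNS_MOUNT, plain capable() against the initial ns. */
	if (!(fs_flags & TOY_FS_USERNS_MOUNT))
		return toy_ns_capable(cred, &toy_init_user_ns);
	return true;
}
```

In this model, root inside a child namespace passes the may_mount()-style check but is stopped by the sget_userns()-style check unless the filesystem type sets the flag.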

>> We basically need two modes of operation:
>>
>> a) old, backward compatible (not introducing any new failure modes),
>> created with privileged mount
>> b) new, non-backward compatible, created with unprivileged mount
>>
>> Technically there would still be a risk from breaking userspace, since
>> we are using the same entry point for both, but let's hope that no
>> practical problems come from that.
>
> Answering from a 10,000 foot perspective:
>
> There are two cases.  Requests to read/write the filesystem from outside
> of s_user_ns.  These run no risk of breaking userspace as this mode has
> not been implemented before.

This comes from the fact that (s_user_ns == &init_user_ns) and all
user namespaces are "inside" init_user_ns, right?

One question: why does the current code use the from_[ug]id_munged()
variant when the conversion can never fail?  Or can it?
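
To make the question concrete, here is a toy model of the difference between the plain and the munged conversion.  It is a sketch, not the kernel implementation: the toy_* names and the single-extent map are assumptions for illustration, with the invalid id modeled as (uid_t)-1 and the overflow id as the default 65534.

```c
#include <stdint.h>

#define TOY_INVALID_ID  UINT32_MAX	/* models (uid_t)-1 */
#define TOY_OVERFLOW_ID 65534u		/* models the default overflow uid */

/* One mapping extent: ids [first, first+count) map to [lower_first, ...). */
struct toy_extent {
	uint32_t first;		/* first id as seen inside the namespace */
	uint32_t lower_first;	/* first kernel-side id it maps to */
	uint32_t count;		/* number of ids covered */
};

/* Namespace id -> kernel id; models make_kuid()/make_kgid(). */
static uint32_t toy_make_kid(const struct toy_extent *map, uint32_t id)
{
	if (id >= map->first && id - map->first < map->count)
		return map->lower_first + (id - map->first);
	return TOY_INVALID_ID;
}

/* Kernel id -> namespace id; models from_kuid()/from_kgid(). */
static uint32_t toy_from_kid(const struct toy_extent *map, uint32_t kid)
{
	if (kid >= map->lower_first && kid - map->lower_first < map->count)
		return map->first + (kid - map->lower_first);
	return TOY_INVALID_ID;
}

/* Same, but never reports failure; models the _munged() variants. */
static uint32_t toy_from_kid_munged(const struct toy_extent *map, uint32_t kid)
{
	uint32_t id = toy_from_kid(map, kid);

	return id == TOY_INVALID_ID ? TOY_OVERFLOW_ID : id;
}
```

With an identity map covering everything except the invalid id, the two conversions differ only for the invalid id itself; with a narrower container-style map, any kernel id outside the extent comes back as the overflow id from the munged variant.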

> Restrictions at mount time to ensure we are not dealing with a crazy mix
> of namespaces.  This has a small chance of breaking someone's crazy
> setup.
>
>
> Dropping requests to read/write the filesystem when the requester does
> not map into s_user_ns should not be a problem to enable universally.  If
> s_user_ns is init_user_ns everything maps so there is no restriction.
>
>
>
> What we can do if we want to ensure maximum backwards compatibility
> is this: if the fuse filesystem is mounted in init_user_ns but the
> device for the communication channel is opened in some other user
> namespace, we can just force the communication channel to operate in
> init_user_ns.
>
> That will be 100% backwards compatible in all cases and as far as I can
> see remove the need for having different ``modes'' of operation.

Okay.
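
As I read the suggestion above, the namespace selection would look roughly like the following.  This is a minimal sketch under that reading only; toy_userns and toy_pick_conn_ns() are hypothetical names, and returning NULL stands for rejecting the mount.

```c
#include <stddef.h>

/* Toy stand-in for a user namespace; only identity matters here. */
struct toy_userns {
	int id;
};

/*
 * Pick the namespace the fuse connection should operate in.
 *  - Superblock owned by the initial namespace: always use the initial
 *    namespace, regardless of where /dev/fuse was opened.
 *  - Otherwise: the opener's namespace must match the superblock's.
 */
static const struct toy_userns *
toy_pick_conn_ns(const struct toy_userns *sb_ns,
		 const struct toy_userns *opener_ns,
		 const struct toy_userns *init_ns)
{
	if (sb_ns == init_ns)
		return init_ns;
	return opener_ns == sb_ns ? sb_ns : NULL;
}
```

The first branch is what keeps existing privileged setups working unchanged; only mounts owned by a non-initial namespace see the new restriction.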

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
@ 2018-02-13 11:32     ` Miklos Szeredi
  0 siblings, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-13 11:32 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, containers, Alban Crequy, Eric W . Biederman, Seth Forshee,
	Sargun Dhillon

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:

> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
> additional inode permission.

fuse_blk is less suitable for unprivileged mounting than plain fuse.
fusermount doesn't allow mounting fuse_blk unprivileged, so there's
little data about that usecase (IIRC ntfs3g guys did that, or at least
tried to do it, but I don't remember the details).

As such, I think we should leave it out of the initial version.  Which
means you can drop patches 1-2 from this series.  Unless there's a
strong use case for this.  In which case we should look hard at the
differences between fuse_blk and fuse and how that affects
unprivileged operation.   There are a few assumptions about fuse_blk
filesystem being more "well behaved", I think.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
       [not found]   ` <ac3d34002d7690f6ca5928b57b7fc4d707104b04.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  2018-01-05 19:24   ` Luis R. Rodriguez
@ 2018-02-13 13:18   ` Miklos Szeredi
  2018-02-16 22:00     ` Eric W. Biederman
       [not found]     ` <CAOssrKcZeAHsRz7P_dxh==QAKnp7HeSTh4vWY2tgbWa1ZD918g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 2 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-13 13:18 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, containers, Alban Crequy, Eric W . Biederman, Seth Forshee,
	Sargun Dhillon, linux-fsdevel, Alexander Viro, Luis R. Rodriguez,
	Kees Cook

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
>
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
>
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.

How can the filesystem be corrupted if chown is denied?

It is not clear to me what the purpose of this patch is, or exactly
which use case it is fixing.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root
@ 2018-02-13 13:37           ` Miklos Szeredi
  0 siblings, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Serge E. Hallyn, LKML, Linux Containers, Alban Crequy,
	Eric W . Biederman, Seth Forshee, Sargun Dhillon, linux-fsdevel,
	Alexander Viro

On Sat, Dec 23, 2017 at 1:38 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> Hi,
>
> On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>>> From: Seth Forshee <seth.forshee@canonical.com>
>>>
>>> Expand the check in should_remove_suid() to keep privileges for
>>
>> I realize this description came from Seth, but reading it now,
>> 'Expand' seems wrong.  Expanding a check brings to my mind making
>> it stricter, not looser.  How about 'Relax the check' ?
>
> Makes sense. Will do.
>
>>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>>
>>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>>
>>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>>
>> Why exactly?
>>
>> This is wrong, because capable_wrt_inode_uidgid() does a check
>> against current_user_ns, not the  inode->i_sb->s_user_ns

I'm thoroughly confused.  s_user_ns is supposed to be about the user
namespace the filesystem perceives itself to be in, right?  How does
that come into play when checking permissions to do something?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  2017-12-22 14:32     ` Dongsu Park
  (?)
  (?)
@ 2018-02-14 12:28     ` Miklos Szeredi
       [not found]       ` <CAOssrKeSTY1pAhpmegFWdGh7irNbT4veG5JaYFj8Q1JjMynadw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-14 12:28 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, Linux Containers, Alban Crequy, Eric W . Biederman,
	Seth Forshee, Sargun Dhillon, linux-fsdevel, Alexander Viro

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.

Why is this required for unprivileged fuse?

Fuse doesn't support freeze, so this seems to make no sense in the
context of this patchset.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 10/11] fuse: Allow user namespace mounts
  2017-12-22 14:32 ` [PATCH 10/11] fuse: Allow user namespace mounts Dongsu Park
       [not found]   ` <a26103156b3f6ba73b1e46c6f577f1bee74872d9.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
@ 2018-02-14 13:44   ` Miklos Szeredi
       [not found]     ` <CAOssrKcHOp9OaCWRALsxe5MTk+tv7Gi5rPsHz2VLguzK-P+LMw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-15  8:46     ` Miklos Szeredi
  1 sibling, 2 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-14 13:44 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, Linux Containers, Alban Crequy, Eric W . Biederman,
	Seth Forshee, Sargun Dhillon, linux-fsdevel

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
> From: Seth Forshee <seth.forshee@canonical.com>
>
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> [dongsu: add a simple commit messasge]
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/inode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>  static struct file_system_type fuse_fs_type = {
>         .owner          = THIS_MODULE,
>         .name           = "fuse",
> -       .fs_flags       = FS_HAS_SUBTYPE,
> +       .fs_flags       = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>         .mount          = fuse_mount,
>         .kill_sb        = fuse_kill_sb_anon,
>  };

I think enabling FS_USERNS_MOUNT should be pretty safe.

I was thinking opting out should be as simple as "chmod o-rw
/dev/fuse".  But that breaks libfuse, even though fusermount opens
/dev/fuse in privileged mode, so it shouldn't.  That can be fixed in
libfuse, but it's an unfortunate bug and it also means /dev/fuse is
configured with "crw-rw-rw-" in most cases.  Which means it will be
opting out, not opting in, which is the less safe version.

> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
>         .name           = "fuseblk",
>         .mount          = fuse_mount_blk,
>         .kill_sb        = fuse_kill_sb_blk,
> -       .fs_flags       = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> +       .fs_flags       = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>  };
>  MODULE_ALIAS_FS("fuseblk");

As I said, this hunk should be dropped from the first version, because
it's possibly unsafe.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 10/11] fuse: Allow user namespace mounts
  2018-02-14 13:44   ` Miklos Szeredi
       [not found]     ` <CAOssrKcHOp9OaCWRALsxe5MTk+tv7Gi5rPsHz2VLguzK-P+LMw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-15  8:46     ` Miklos Szeredi
  1 sibling, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-15  8:46 UTC (permalink / raw)
  To: Dongsu Park
  Cc: lkml, Linux Containers, Alban Crequy, Eric W . Biederman,
	Seth Forshee, Sargun Dhillon, linux-fsdevel

On Wed, Feb 14, 2018 at 2:44 PM, Miklos Szeredi <mszeredi@redhat.com> wrote:
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> To be able to mount fuse from non-init user namespaces, it's necessary
>> to set FS_USERNS_MOUNT flag to fs_flags.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Miklos Szeredi <mszeredi@redhat.com>
>> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
>> [dongsu: add a simple commit message]
>> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
>> ---
>>  fs/fuse/inode.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 7f6b2e55..8c98edee 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>>  static struct file_system_type fuse_fs_type = {
>>         .owner          = THIS_MODULE,
>>         .name           = "fuse",
>> -       .fs_flags       = FS_HAS_SUBTYPE,
>> +       .fs_flags       = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>>         .mount          = fuse_mount,
>>         .kill_sb        = fuse_kill_sb_anon,
>>  };
>
> I think enabling FS_USERNS_MOUNT should be pretty safe.
>
> I was thinking opting out should be as simple as "chmod o-rw
> /dev/fuse".  But that breaks libfuse, even though fusermount opens
> /dev/fuse in privileged mode, so it shouldn't.

I'm talking rubbish, /dev/fuse is opened without privs in fusermount as well.

So there's no way to differentiate user_ns unpriv mounts from suid
fusermount unpriv mounts.

Maybe that's just as well...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
       [not found]           ` <CAOssrKcKz8p9YQJLf2W_NCBo+12auxir5jFwXGbANdWdgavpsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-16 21:52             ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-16 21:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, lkml,
	Seth Forshee, Alban Crequy, Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
>>>> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>>>>
>>>> In order to support mounts from namespaces other than
>>>> init_user_ns, fuse must translate uids and gids to/from the
>>>> userns of the process servicing requests on /dev/fuse. This
>>>> patch does that, with a couple of restrictions on the namespace:
>>>>
>>>>  - The userns for the fuse connection is fixed to the namespace
>>>>    from which /dev/fuse is opened.
>>>>
>>>>  - The namespace must be the same as s_user_ns.
>>>>
>>>> These restrictions simplify the implementation by avoiding the
>>>> need to pass around userns references and by allowing fuse to
>>>> rely on the checks in inode_change_ok for ownership changes.
>>>> Either restriction could be relaxed in the future if needed.
>>>
>>> Can we not introduce potential userspace interface regressions?
>>>
>>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>>> allow server to run in different pid_ns") will probably bite us here
>>> as well.
>>
>> Maybe, but unlike the pid namespace no one has been able to mount
>> fuse outside of init_user_ns so we are much less exposed.  I agree we
>> should be careful.
>
> Have to wrap my head around all the rules here.
>
> There's the may_mount() one:
>
>     ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)
>
> Um, first of all, why isn't it checking current->cred->user_ns?
>
> Ah, there it is in sget():
>
>     ns_capable(user_ns, CAP_SYS_ADMIN)
>
> I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
> doesn't have FS_USERNS_MOUNT.  This is the one that prevents fuse
> mounts from being created when (current->cred->user_ns !=
> &init_user_ns).
>
> Maybe there's a logic to this web of namespaces, but I don't yet see
> it.  Is it documented somewhere?

I think this is a bit simpler than the fiddly details in the
implementation might make it look.

The fundamental idea is that permission to have full control over
a mount namespace is different from permission to have full control
over an instance of a filesystem.

Implementing that separation of permission checks gets a little bit
fiddly.  The first challenge is that there are several filesystems like
sysfs and proc whose internal mount is created outside of a process.
Then there are the file systems like nfs and afs that have ``referral
points'' that transition you to other instances of those filesystems
when you transition over them.  That is the reason why there are
exceptions for SB_KERNMOUNT and SB_SUBMOUNT.

may_mount is just the permission check for the mount namespace.  It
checks that the current process has CAP_SYS_ADMIN in the user namespace
that owns the current mount namespace.  AKA is the process allowed to
change the mount namespace.

sget is just the permission check for mounting a filesystem.  It checks
that the mounter has CAP_SYS_ADMIN over the user namespace that will own
the newly mounted filesystem.

By the time execution gets to sget_userns, in general all of the
permission checks have been made.  But if the filesystem is not one
that supports mounting within a user namespace the code checks
capable(CAP_SYS_ADMIN).

That is more convoluted than I would like but the checks derive from the
definition of what we are doing.
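
The three checks described above can be sketched as a small userspace
model.  This is only an illustration, not kernel code: struct user_ns,
struct task, ns_capable_admin(), may_create_sb() and can_mount() are
hypothetical stand-ins, the SB_KERNMOUNT and SB_SUBMOUNT exceptions are
left out, and the rule that the owner of a child namespace holds all
capabilities in it is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace: only the parent link. */
struct user_ns {
	const struct user_ns *parent;	/* NULL for the initial namespace */
};

/* Hypothetical stand-in for the mounting task. */
struct task {
	const struct user_ns *cred_ns;		/* models current->cred->user_ns */
	bool has_sys_admin;			/* CAP_SYS_ADMIN held in cred_ns */
	const struct user_ns *mnt_ns_owner;	/* models nsproxy->mnt_ns->user_ns */
};

/*
 * Model of ns_capable(target, CAP_SYS_ADMIN): a capability held in the
 * task's own namespace also counts in every namespace nested below it.
 */
static bool ns_capable_admin(const struct task *t, const struct user_ns *target)
{
	const struct user_ns *ns;

	if (!t->has_sys_admin)
		return false;
	for (ns = target; ns; ns = ns->parent)
		if (ns == t->cred_ns)
			return true;
	return false;
}

/* may_mount(): is the task allowed to change its mount namespace? */
static bool may_mount(const struct task *t)
{
	return ns_capable_admin(t, t->mnt_ns_owner);
}

/*
 * sget() and sget_userns() combined: the mounter needs CAP_SYS_ADMIN over
 * the namespace that will own the superblock, and additionally plain
 * capable(CAP_SYS_ADMIN), i.e. over init_ns, when the filesystem type
 * does not set FS_USERNS_MOUNT.
 */
static bool may_create_sb(const struct task *t, const struct user_ns *sb_owner,
			  bool fs_userns_mount, const struct user_ns *init_ns)
{
	if (!ns_capable_admin(t, sb_owner))
		return false;
	if (!fs_userns_mount && !ns_capable_admin(t, init_ns))
		return false;
	return true;
}

/* Both checks together; the new superblock is owned by the task's userns. */
static bool can_mount(const struct task *t, bool fs_userns_mount,
		      const struct user_ns *init_ns)
{
	assert(t != NULL && init_ns != NULL);
	return may_mount(t) &&
	       may_create_sb(t, t->cred_ns, fs_userns_mount, init_ns);
}
```

In this model a task that is root only inside a child user namespace
passes may_mount() for its own mount namespace but is stopped by the
capable(CAP_SYS_ADMIN) step unless the filesystem type opts in with
FS_USERNS_MOUNT.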

>
>>> We basically need two modes of operation:
>>>
>>> a) old, backward compatible (not introducing any new failure modes),
>>> created with privileged mount
>>> b) new, non-backward compatible, created with unprivileged mount
>>>
>>> Technically there would still be a risk from breaking userspace, since
>>> we are using the same entry point for both, but let's hope that no
>>> practical problems come from that.
>>
>> Answering from a 10,000 foot perspective:
>>
>> There are two cases.  Requests to read/write the filesystem from outside
>> of s_user_ns.  These run no risk of breaking userspace as this mode has
>> not been implemented before.
>
> This comes from the fact that (s_user_ns == &init_user_ns) and all
> user namespaces are "inside" init_user_ns, right?

Yes.

> One question: why does current code use the from_[ug]id_munged()
> variant, when the conversion can never fail.  Or can it?

There is always at least (uid_t)-1 that can fail if it shows up on a
filesystem.  As far as I can tell no one was using it for a uid, there
were already uses of (uid_t)-1 as a special case, and I just grabbed it
to become INVALID_UID.

In practice the mapping can't fail unless someone malicious starts using
that id.

I believe I picked the _munged variant so that if that value does show
up we are guaranteed to return the 16bit nobody user.
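
The difference between the plain and the _munged conversion can be
sketched in userspace as follows.  This is a simplified model under
stated assumptions: struct uid_extent, from_kuid_model() and
from_kuid_munged_model() are hypothetical names, the map is a flat
array of extents, and 65534 is assumed as the overflow ("nobody") id,
which is the usual default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID	((uint32_t)-1)	/* the reserved (uid_t)-1 value */
#define OVERFLOW_ID	65534u		/* assumed default "nobody" id */

/*
 * One extent of a uid map: ids [ns_first, ns_first + count) inside the
 * namespace correspond to kernel ids [kernel_first, kernel_first + count).
 */
struct uid_extent {
	uint32_t ns_first;
	uint32_t kernel_first;
	uint32_t count;
};

/* Kernel id to namespace id; yields INVALID_ID when nothing maps. */
static uint32_t from_kuid_model(const struct uid_extent *map, size_t n,
				uint32_t kuid)
{
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (kuid >= map[i].kernel_first &&
		    kuid - map[i].kernel_first < map[i].count)
			return map[i].ns_first + (kuid - map[i].kernel_first);
	}
	return INVALID_ID;
}

/* Same conversion, but an unmapped id is reported as the overflow id. */
static uint32_t from_kuid_munged_model(const struct uid_extent *map, size_t n,
				       uint32_t kuid)
{
	uint32_t uid = from_kuid_model(map, n, kuid);

	return uid == INVALID_ID ? OVERFLOW_ID : uid;
}
```

With an identity map covering 0 through 4294967294, as in the initial
namespace, the only id that fails to map is (uid_t)-1 itself, which is
why the failure is not expected to show up in practice.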

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-13 10:20         ` Miklos Szeredi
       [not found]           ` <CAOssrKcKz8p9YQJLf2W_NCBo+12auxir5jFwXGbANdWdgavpsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-16 21:52           ` Eric W. Biederman
  1 sibling, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-16 21:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Miklos Szeredi <mszeredi@redhat.com> writes:
>>
>>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>>>> From: Seth Forshee <seth.forshee@canonical.com>
>>>>
>>>> In order to support mounts from namespaces other than
>>>> init_user_ns, fuse must translate uids and gids to/from the
>>>> userns of the process servicing requests on /dev/fuse. This
>>>> patch does that, with a couple of restrictions on the namespace:
>>>>
>>>>  - The userns for the fuse connection is fixed to the namespace
>>>>    from which /dev/fuse is opened.
>>>>
>>>>  - The namespace must be the same as s_user_ns.
>>>>
>>>> These restrictions simplify the implementation by avoiding the
>>>> need to pass around userns references and by allowing fuse to
>>>> rely on the checks in inode_change_ok for ownership changes.
>>>> Either restriction could be relaxed in the future if needed.
>>>
>>> Can we not introduce potential userspace interface regressions?
>>>
>>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>>> allow server to run in different pid_ns") will probably bite us here
>>> as well.
>>
>> Maybe, but unlike the pid namespace no one has been able to mount
>> fuse outside of init_user_ns so we are much less exposed.  I agree we
>> should be careful.
>
> Have to wrap my head around all the rules here.
>
> There's the may_mount() one:
>
>     ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)
>
> Um, first of all, why isn't it checking current->cred->user_ns?
>
> Ah, there it is in sget():
>
>     ns_capable(user_ns, CAP_SYS_ADMIN)
>
> I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
> doesn't have FS_USERNS_MOUNT.  This is the one that prevents fuse
> mounts from being created when (current->cred->user_ns !=
> &init_user_ns).
>
> Maybe there's a logic to this web of namespaces, but I don't yet see
> it.  Is it documented somewhere?

I think this is a bit simpler than the fiddly details in the
implementation might make it look.

The fundamental idea is that permission to have full control over
a mount namespace is different from permission to have full control
over an instance of a filesystem.

Implementing that separation of permission checks gets a little bit
fiddly.  The first challenge is that there are several filesystems like
sysfs and proc whose internal mount is created outside of a process.
Then there are the file systems like nfs and afs that have ``referral
points'' that transition you to other instances of those filesystems
when you transition over them.  That is the reason why there are
exceptions for SB_KERNMOUNT and SB_SUBMOUNT.

may_mount is just the permission check for the mount namespace.  It
checks that the current process has CAP_SYS_ADMIN in the user namespace
that owns the current mount namespace.  AKA is the process allowed to
change the mount namespace.

sget is just the permission check for mounting a filesystem.  It checks
that the mounter has CAP_SYS_ADMIN over the user namespace that will own
the newly mounted filesystem.

By the time execution gets to sget_userns, in general all of the
permission checks have been made.  But if the filesystem is not one
that supports mounting within a user namespace the code checks
capable(CAP_SYS_ADMIN).

That is more convoluted than I would like but the checks derive from the
definition of what we are doing.

>
>>> We basically need two modes of operation:
>>>
>>> a) old, backward compatible (not introducing any new failure modes),
>>> created with privileged mount
>>> b) new, non-backward compatible, created with unprivileged mount
>>>
>>> Technically there would still be a risk from breaking userspace, since
>>> we are using the same entry point for both, but let's hope that no
>>> practical problems come from that.
>>
>> Answering from a 10,000 foot perspective:
>>
>> There are two cases.  Requests to read/write the filesystem from outside
>> of s_user_ns.  These run no risk of breaking userspace as this mode has
>> not been implemented before.
>
> This comes from the fact that (s_user_ns == &init_user_ns) and all
> user namespaces are "inside" init_user_ns, right?

Yes.

> One question: why does current code use the from_[ug]id_munged()
> variant, when the conversion can never fail.  Or can it?

There is always at least (uid_t)-1 that can fail if it shows up on a
filesystem.  As far as I can tell no one was using it for a uid, there
were already uses of (uid_t)-1 as a special case, and I just grabbed it
to become INVALID_UID.

In practice the mapping can't fail unless someone malicious starts using
that id.

I believe I picked the _munged variant so that if that value does show
up we are guaranteed to return the 16bit nobody user.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
       [not found]     ` <CAOssrKey+oxahrXHO5d6Lu1ZD=r1t-b0i4iZM_Ke9ToqTckjkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-16 21:53       ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-16 21:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, lkml,
	Seth Forshee, Alban Crequy, Sargun Dhillon

Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>
> fuse_blk is less suitable for unprivileged mounting than plain fuse.
> fusermount doesn't allow mounting fuse_blk unprivileged, so there's
> little data about that usecase (IIRC ntfs3g guys did that, or at least
> tried to do it, but I don't remember the details).
>
> As such, I think we should leave it out of the initial version.  Which
> means you can drop patches 1-2 from this series.  Unless there's a
> strong use case for this.  In which case we should look hard at the
> differences between fuse_blk and fuse and how that affects
> unprivileged operation.   There are a few assumptions about fuse_blk
> filesystem being more "well behaved", I think.

Especially to start with I am fine with that.

It makes a lot of sense to get the obvious cases first.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
  2018-02-13 11:32     ` Miklos Szeredi
  (?)
@ 2018-02-16 21:53     ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-16 21:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>
> fuse_blk is less suitable for unprivileged mounting than plain fuse.
> fusermount doesn't allow mounting fuse_blk unprivileged, so there's
> little data about that usecase (IIRC ntfs3g guys did that, or at least
> tried to do it, but I don't remember the details).
>
> As such, I think we should leave it out of the initial version.  Which
> means you can drop patches 1-2 from this series.  Unless there's a
> strong use case for this.  In which case we should look hard at the
> differences between fuse_blk and fuse and how that affects
> unprivileged operation.   There are a few assumptions about fuse_blk
> filesystem being more "well behaved", I think.

Especially to start with I am fine with that.

It makes a lot of sense to get the obvious cases first.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
       [not found]     ` <CAOssrKcZeAHsRz7P_dxh==QAKnp7HeSTh4vWY2tgbWa1ZD918g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-16 22:00       ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-16 22:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, lkml,
	Seth Forshee, Luis R. Rodriguez, Alban Crequy, Alexander Viro,
	Sargun Dhillon, linux-fsdevel, Kees Cook

Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
>> From: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>
>> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
>> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
>> sufficient to allow access to files but when the underlying filesystem
>> has uids or gids that don't map to the current user namespace it is
>> not enough, so the chown permission checks need to be extended to
>> allow this case.
>>
>> Calling chown on filesystem nodes whose uid or gid don't map is
>> necessary if those nodes are going to be modified as writing back
>> inodes which contain uids or gids that don't map is likely to cause
>> filesystem corruption of the uid or gid fields.
>
> How can the filesystem be corrupted if chown is denied?
>
> It is not clear to me what the purpose of this patch is or what the
> exact usecase this is fixing.

It isn't a fix and we can delay this one and similar patches
that enable things until we are certain all of the necessary
restrictions are in place.  This is not essential for safely getting
fully unprivileged mounting of fuse to work.

The overall strategy has been to handle as many of the generic concerns
at the vfs level as possible to separate filesystem concerns and generic
concerns.

In this case the generic concern is what happens when the uid is read
from the filesystem and it gets mapped to INVALID_UID and then the inode
for that file is written back.

That is a trap for the unwary filesystem implementation and not a case
that I think anyone will actually care about.  It is just not useful
to mount a filesystem and to not map some of its ids.   So the generic
vfs code just denies writes to files that show up with a uid of
INVALID_UID or a gid of INVALID_GID.  Just to ensure that problems don't
show up.

This patch gets through those defenses.
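
A rough userspace sketch of that defense and of the extra chown rule,
under stated assumptions: struct inode_model, has_unmapped_id(),
may_write() and chown_ok_model() are hypothetical names, the two
capability checks are passed in as booleans rather than computed, and
only the uid half (chown, not chgrp) is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)	/* models INVALID_UID / INVALID_GID */

/* Hypothetical stand-in for the ownership fields of an inode. */
struct inode_model {
	uint32_t uid;
	uint32_t gid;
};

/* True when either id on the inode failed to map into s_user_ns. */
static bool has_unmapped_id(const struct inode_model *inode)
{
	assert(inode != NULL);
	return inode->uid == INVALID_ID || inode->gid == INVALID_ID;
}

/*
 * Generic vfs defense: a write that the ordinary permission checks would
 * allow is still refused when the inode carries an unmapped id, so that
 * an unmapped id is never written back.
 */
static bool may_write(const struct inode_model *inode, bool otherwise_allowed)
{
	if (has_unmapped_id(inode))
		return false;
	return otherwise_allowed;
}

/*
 * chown check with the extension from the patch: besides the ordinary
 * capability check relative to the inode, a task with the chown
 * capability over the superblock's user namespace may chown an inode
 * whose uid did not map.
 */
static bool chown_ok_model(const struct inode_model *inode,
			   bool capable_wrt_inode, bool capable_in_s_user_ns)
{
	assert(inode != NULL);
	if (capable_wrt_inode)
		return true;
	return inode->uid == INVALID_ID && capable_in_s_user_ns;
}
```

Once such an inode has been given a uid that maps, has_unmapped_id() no
longer trips and writes go back to the ordinary permission checks.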

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes
  2018-02-13 13:18   ` Miklos Szeredi
@ 2018-02-16 22:00     ` Eric W. Biederman
       [not found]     ` <CAOssrKcZeAHsRz7P_dxh==QAKnp7HeSTh4vWY2tgbWa1ZD918g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-16 22:00 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel, Alexander Viro, Luis R. Rodriguez,
	Kees Cook

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Eric W. Biederman <ebiederm@xmission.com>
>>
>> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
>> chown files.  Ordinarily the capable_wrt_inode_uidgid check is
>> sufficient to allow access to files but when the underlying filesystem
>> has uids or gids that don't map to the current user namespace it is
>> not enough, so the chown permission checks need to be extended to
>> allow this case.
>>
>> Calling chown on filesystem nodes whose uid or gid don't map is
>> necessary if those nodes are going to be modified as writing back
>> inodes which contain uids or gids that don't map is likely to cause
>> filesystem corruption of the uid or gid fields.
>
> How can the filesystem be corrupted if chown is denied?
>
> It is not clear to me what the purpose of this patch is or what the
> exact usecase this is fixing.

It isn't a fix and we can delay this one and similar patches
that enable things until we are certain all of the necessary
restrictions are in place.  This is not essential for safely getting
fully unprivileged mounting of fuse to work.

The overall strategy has been to handle as many of the generic concerns
at the vfs level as possible to separate filesystem concerns and generic
concerns.

In this case the generic concern is what happens when the uid is read
from the filesystem and it gets mapped to INVALID_UID and then the inode
for that file is written back.

That is a trap for the unwary filesystem implementation and not a case
that I think anyone will actually care about.  It is just not useful
to mount a filesystem and to not map some of its ids.   So the generic
vfs code just denies writes to files that show up with a uid of
INVALID_UID or a gid of INVALID_GID.  Just to ensure that problems don't
show up.

This patch gets through those defenses.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  2018-02-14 12:28     ` Miklos Szeredi
@ 2018-02-19 22:56           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-19 22:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Alexander Viro, Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> wrote:
>> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>>
>> The user in control of a super block should be allowed to freeze
>> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
>> ioctls to require CAP_SYS_ADMIN in s_user_ns.
>
> Why is this required for unprivileged fuse?
>
> Fuse doesn't support freeze, so this seems to make no sense in the
> context of this patchset.

This isn't required to support fuse.  It is a relaxation in permissions
so it isn't strictly necessary for anything.

Until just recently Seth and I were working through the vfs looking at
what we need in general for unprivileged mounts, with fuse as our focus
but without limiting ourselves to fuse.

I have been putting off relaxation of permissions like this because they
are not necessary for safety.  But in general they do make sense.

In practice I think all we need to worry about for fuse is the last 4 patches.


Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
@ 2018-02-19 22:56           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-19 22:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dongsu Park, lkml, Linux Containers, Alban Crequy, Seth Forshee,
	Sargun Dhillon, linux-fsdevel, Alexander Viro

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <dongsu@kinvolk.io> wrote:
>> From: Seth Forshee <seth.forshee@canonical.com>
>>
>> The user in control of a super block should be allowed to freeze
>> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
>> ioctls to require CAP_SYS_ADMIN in s_user_ns.
>
> Why is this required for unprivileged fuse?
>
> Fuse doesn't support freeze, so this seems to make no sense in the
> context of this patchset.

This isn't required to support fuse.  It is a relaxation in permissions
so it isn't strictly necessary for anything.

Until just recently Seth and I were working through the vfs looking at
what we need in general for unprivileged mounts, with fuse as our focus
but without limiting ourselves to fuse.

I have been putting off relaxation of permissions like this because they
are not necessary for safety.  But in general they do make sense.

In practice I think all we need to worry about for fuse is the last 4 patches.


Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
  2018-01-18 14:58     ` Alban Crequy
@ 2018-02-19 23:09           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-19 23:09 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Miklos Szeredi, Linux Containers, LKML, Seth Forshee, Sargun Dhillon

Alban Crequy <alban-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> writes:

> Hi Eric,
>
> Do you have some cycles for this now that it is the new year?
>
> A review on the associated ima issue would also be appreciated:
> https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1587678.html

It has taken me longer than I expected but I do have time now.  I am
moving through these patches and issues a little slowly, but I do intend
to get us through the fuse issues this development cycle if at all
possible.

I think for starters we should restrict ourselves to the last 4 patches
aka (8, 9, 10, 11).

In particular we should concentrate on
[8/11] fuse: Support fuse filesystems outside of init_user_ns
[9/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

The tricky issues are handled in the vfs, and I think the remaining
tricky issues are evm and ima.  Which are close enough to be resolved
that we can count them as resolved.

Once we have 8 & 9 reviewed and merged we can double check there isn't
some silly reason not to set FS_USERNS_MOUNT on fuse and then enable it.

I would like to double check and ensure there are not silly issues with
posix acls or anything else in the vfs.  But I think except for a silly
oversight we are good.

I should probably also add a patch to Documentation/filesystems that
explains what the vfs does for unprivileged mounts, so that I can point
people who are working on filesystems and thinking about enabling user
namespace mounts at the documentation for what the vfs does.  That
would also provide a good checklist to ensure the way the vfs handles
things is sufficient for fuse.

As for the earlier patches that enable things.  Overall they are
good.  They are slightly dangerous as they enable more code paths
to unprivileged users.  But mostly I think they are not immediately
necessary and as such a distraction to getting this code in.

That said once we get the fuse bits reviewed and merged I will be more
than happy to merge the relaxation of permission checks that we can
perform now that s_user_ns exists.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces
@ 2018-02-19 23:09           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-19 23:09 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Dongsu Park, LKML, Linux Containers, Miklos Szeredi,
	Seth Forshee, Sargun Dhillon

Alban Crequy <alban@kinvolk.io> writes:

> Hi Eric,
>
> Do you have some cycles for this now that it is the new year?
>
> A review on the associated ima issue would also be appreciated:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1587678.html

It has taken me longer than I expected but I do have time now.  I am
moving through these patches and issues a little slowly, but I do intend
to get us through the fuse issues this development cycle if at all
possible.

I think for starters we should restrict ourselves to the last 4 patches
aka (8, 9, 10, 11).

In particular we should concentrate on
[8/11] fuse: Support fuse filesystems outside of init_user_ns
[9/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

The tricky issues are handled in the vfs, and I think the remaining
tricky issues are evm and ima.  Which are close enough to be resolved
that we can count them as resolved.

Once we have 8 & 9 reviewed and merged we can double check there isn't
some silly reason not to set FS_USERNS_MOUNT on fuse and then enable it.

I would like to double check and ensure there are not silly issues with
posix acls or anything else in the vfs.  But I think except for a silly
oversight we are good.

I should probably also add a patch to Documentation/filesystems that
explains what the vfs does for unprivileged mounts, so that I can point
people who are working on filesystems and thinking about enabling user
namespace mounts at the documentation for what the vfs does.  That
would also provide a good checklist to ensure the way the vfs handles
things is sufficient for fuse.

As for the earlier patches that enable things.  Overall they are
good.  They are slightly dangerous as they enable more code paths
to unprivileged users.  But mostly I think they are not immediately
necessary and as such a distraction to getting this code in.

That said once we get the fuse bits reviewed and merged I will be more
than happy to merge the relaxation of permission checks that we can
perform now that s_user_ns exists.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2017-12-22 14:32 ` [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant Dongsu Park
@ 2018-02-19 23:16       ` Eric W. Biederman
       [not found]   ` <d055925e5d5c0099e9e9c871004fb45fab67e4bc.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-19 23:16 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org> writes:

> From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
>
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
> Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>

Reviewed-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

> ---
>  fs/fuse/dir.c           | 2 +-
>  kernel/user_namespace.c | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>  	const struct cred *cred;
>  
>  	if (fc->allow_other)
> -		return 1;
> +		return current_in_userns(fc->user_ns);
>  
>  	cred = current_cred();
>  	if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
>  {
>  	return in_userns(target_ns, current_user_ns());
>  }
> +EXPORT_SYMBOL(current_in_userns);
>  
>  static inline struct user_namespace *to_user_ns(struct ns_common *ns)
>  {
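
For readers following along, the effect of the dir.c hunk above can be
modelled in user space.  This is only a sketch under stated assumptions:
struct model_userns and the model_* functions are hypothetical stand-ins,
not kernel API, and the walk simply mirrors the idea of "same namespace or
a descendant" that current_in_userns() is used for in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace: only what the walk needs. */
struct model_userns {
	const struct model_userns *parent;	/* NULL for the initial namespace */
	int level;				/* 0 for the initial namespace */
};

/* True if child is ancestor itself or one of its descendants. */
static bool model_in_userns(const struct model_userns *ancestor,
			    const struct model_userns *child)
{
	const struct model_userns *ns = child;

	assert(ancestor != NULL && child != NULL);
	/* Climb until we are at the ancestor's depth, then compare. */
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Model of the allow_other branch after the patch (assumption: the
 * mount's namespace plays the role of fc->user_ns). */
static bool model_allow_other(const struct model_userns *mount_ns,
			      const struct model_userns *caller_ns)
{
	return model_in_userns(mount_ns, caller_ns);
}
```

With a mount made from a container namespace, a caller in that namespace
or in one nested below it is admitted, while a caller in the initial
namespace or in a sibling namespace is not.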

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns
  2017-12-22 14:32 ` [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns Dongsu Park
@ 2018-02-20  2:12       ` Eric W. Biederman
       [not found]   ` <c85c293e19a478353aba8e6e3ee39e5914f798d5.1512041070.git.dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
  1 sibling, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-20  2:12 UTC (permalink / raw)
  To: Dongsu Park
  Cc: Miklos Szeredi, containers, linux-kernel, Seth Forshee,
	Alban Crequy, Sargun Dhillon, linux-fsdevel

Dongsu Park <dongsu@kinvolk.io> writes:

> From: Seth Forshee <seth.forshee@canonical.com>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
>  - The userns for the fuse connection is fixed to the namespace
>    from which /dev/fuse is opened.
>
>  - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
> Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
> ---
>  fs/fuse/cuse.c   |  3 ++-
>  fs/fuse/dev.c    | 11 ++++++++---
>  fs/fuse/dir.c    | 14 +++++++-------
>  fs/fuse/fuse_i.h |  6 +++++-
>  fs/fuse/inode.c  | 31 +++++++++++++++++++------------
>  5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
>  #include <linux/stat.h>
>  #include <linux/module.h>
>  #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>  
>  #include "fuse_i.h"
>  
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>  	if (!cc)
>  		return -ENOMEM;
>  
As noticed in the review this should probably say:
	if (current_user_ns() != &init_user_ns)
		return -EINVAL;

Just so we don't need to think about cuse being opened in a user
namespace at this point.  It is probably harmless.  But it isn't
what we are focusing on.

> -	fuse_conn_init(&cc->fc);
> +	fuse_conn_init(&cc->fc, current_user_ns());
>  
>  	fud = fuse_dev_alloc(&cc->fc);
>  	if (!fud) {


> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>  
>  static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> +	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>  	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>  }
>  
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>  	__set_bit(FR_WAITING, &req->flags);
>  	if (for_background)
>  		__set_bit(FR_BACKGROUND, &req->flags);
> +	if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> +		fuse_put_request(fc, req);
> +		return ERR_PTR(-EOVERFLOW);
> +	}
>  
>  	return req;
>  
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>  	in = &req->in;
>  	reqsize = in->h.len;
>  
> -	if (task_active_pid_ns(current) != fc->pid_ns) {
> +	if (task_active_pid_ns(current) != fc->pid_ns ||
> +	    current_user_ns() != fc->user_ns) {
>  		rcu_read_lock();
>  		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>  		rcu_read_unlock();

The hunk above is a rebase error.  I believe it originally returned an
error in the same case where the pid namespace check did.  Miklos has a
good point that we need to handle servers running in jails of one sort
or another, because at least Sandstorm runs applications in that
fashion, and we have previously had error reports about that
configuration breaking.

I think we can easily fix that, either by adding extra translation as
we did for the pid namespace or by changing the user namespace used on
the connection.  I believe extra translation like we did with the pid
namespace will be more consistent.  And again it won't be a special
case except possibly during mount.  Of course there is weirdness there.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v6 0/6] fuse: mounts from non-init user namespaces
  2017-12-22 14:32 [PATCH v5 00/11] FUSE mounts from non-init user namespaces Dongsu Park
@ 2018-02-21 20:24     ` Eric W. Biederman
  2017-12-22 14:32 ` [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes Dongsu Park
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: containers, linux-kernel, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
patches are far enough along that we can ignore them, except possibly
for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

I had to change the core of this patchset around some as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Miklos, can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

My apologies if I have lost someone's ack or review somewhere.  Let me
know and I will fix it.

These changes are also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v6
  
Eric W. Biederman (4):
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fuse: Support fuse filesystems outside of init_user_ns
      fuse: Ensure posix acls are translated outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant

 fs/fuse/acl.c           |  4 ++--
 fs/fuse/cuse.c          |  7 ++++++-
 fs/fuse/dev.c           | 26 +++++++++++++-------------
 fs/fuse/dir.c           | 16 ++++++++--------
 fs/fuse/fuse_i.h        |  7 ++++++-
 fs/fuse/inode.c         | 38 ++++++++++++++++++++++++++------------
 fs/fuse/xattr.c         | 43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/user_namespace.c |  1 +
 8 files changed, 105 insertions(+), 37 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-21 20:24     ` Eric W. Biederman
@ 2018-02-21 20:29         ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman, containers, linux-kernel, Seth Forshee,
	Alban Crequy, Sargun Dhillon, linux-fsdevel

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that is
not desirable.

The only known user that runs the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages,
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1
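
The failure mode described in the commit message above can be illustrated
with a small user-space model.  This is a sketch, not kernel code: struct
model_task and model_translate_late() are hypothetical, and the table
lookup only stands in for the idea of resolving a saved pid number long
after it was recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record: one pid number per namespace of interest. */
struct model_task {
	int nr_fs_ns;		/* number in the filesystem's pid namespace */
	int nr_server_ns;	/* number as seen by the fuse server */
	bool alive;
};

/*
 * Late translation by number, the model analogue of looking the saved
 * number up again at read time.  Returns 0 when no live task currently
 * holds the number.
 */
static int model_translate_late(const struct model_task *tasks, size_t n,
				int nr_fs_ns)
{
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (tasks[i].alive && tasks[i].nr_fs_ns == nr_fs_ns)
			return tasks[i].nr_server_ns;
	return 0;
}
```

If the requester exits between saving the number and the late lookup, the
lookup yields 0; if the number has since been handed to an unrelated
task, the lookup yields that task's number in the server's namespace,
which has nothing to do with the original requester.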

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids
  2018-02-21 20:24     ` Eric W. Biederman
@ 2018-02-21 20:29         ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman, containers, linux-kernel, Seth Forshee,
	Alban Crequy, Sargun Dhillon, linux-fsdevel

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..216db3f51a31 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 }
 
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-21 20:24     ` Eric W. Biederman
@ 2018-02-21 20:29         ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman, containers, linux-kernel, Seth Forshee,
	Alban Crequy, Sargun Dhillon, linux-fsdevel

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.

For cuse the userns used is the opener of /dev/cuse.  Semantically the
cuse support does not appear safe for unprivileged users.  Practically
the permissions on /dev/cuse only make it accessible to the global root
user.  If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..5a48cee6d7d3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 216db3f51a31..338cfda3eb8f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..ad1cfac1942f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..7772e2b4057e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..e018dc3999f4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1
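
The uid/gid translation this patch switches from init_user_ns to
fc->user_ns can be pictured with a small userspace model. This is only a
sketch under stated assumptions: the toy_* names, the single shared map for
uids and gids, and the extent layout are hypothetical and are not the
kernel's data structures. It shows the two directions (make_kuid-style and
from_kuid-style) and why fuse_req_init_context() refuses a request when the
caller's ids have no representation in the connection's namespace.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's "no mapping" value, (uid_t)-1. */
#define TOY_INVALID_ID ((uint32_t)-1)

/*
 * One extent of an id map, in the spirit of a /proc/<pid>/uid_map line:
 * ids [first, first + count) inside the namespace correspond to
 * [lower_first, lower_first + count) in the kernel-global id space.
 */
struct toy_id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Assumption: one map serves both uids and gids to keep the model short. */
struct toy_userns {
	const struct toy_id_extent *extents;
	size_t nr_extents;
};

/* Namespace-local id to global id, the direction make_kuid() works in. */
uint32_t toy_make_kid(const struct toy_userns *ns, uint32_t id)
{
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct toy_id_extent *e = &ns->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return TOY_INVALID_ID;
}

/* Global id to namespace-local id, the direction from_kuid() works in. */
uint32_t toy_from_kid(const struct toy_userns *ns, uint32_t kid)
{
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct toy_id_extent *e = &ns->extents[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return TOY_INVALID_ID;
}

/*
 * Same shape as the check at the end of fuse_req_init_context():
 * returns 1 only if both caller ids can be expressed in ns.
 */
int toy_req_init_context(const struct toy_userns *ns,
			 uint32_t caller_kuid, uint32_t caller_kgid,
			 uint32_t *uid, uint32_t *gid)
{
	*uid = toy_from_kid(ns, caller_kuid);
	*gid = toy_from_kid(ns, caller_kgid);
	return *uid != TOY_INVALID_ID && *gid != TOY_INVALID_ID;
}
```

With a map such as "0 100000 65536", a caller with global uid 100001 is
presented to the fuse server as uid 1, while a caller with global uid 0 has
no mapping and the request is rejected rather than sent with a bogus id.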


* [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns
@ 2018-02-21 20:29         ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.

For cuse the userns used is the opener of /dev/cuse.  Semantically the
cuse support does not appear safe for unprivileged users.  Practically
the permissions on /dev/cuse only make it accessible to the global root
user.  If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..5a48cee6d7d3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 216db3f51a31..338cfda3eb8f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..ad1cfac1942f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..7772e2b4057e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..e018dc3999f4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1


* [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-21 20:24     ` Eric W. Biederman
@ 2018-02-21 20:29         ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.

This ensures that modern cached posix acl support is available
and used when dealing with posix acls.  This is important
because only that path has the code to convert the uids and
gids in posix acls into the user namespace of a fuse filesystem.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/fuse_i.h |  1 +
 fs/fuse/inode.c  |  7 +++++++
 fs/fuse/xattr.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7772e2b4057e..986fa2b043ab 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
 extern const struct xattr_handler *fuse_acl_xattr_handlers[];
+extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e018dc3999f4..a52cf2019a58 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
+	/*
+	 * If we are not in the initial user namespace posix
+	 * acls must be translated.
+	 */
+	if (sb->s_user_ns != &init_user_ns)
+		sb->s_xattr = fuse_no_acl_xattr_handlers;
+
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
 	err = -ENOMEM;
 	if (!fc)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..433717640f78 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
 	return fuse_setxattr(inode, name, value, size, flags);
 }
 
+static bool no_xattr_list(struct dentry *dentry)
+{
+	return false;
+}
+
+static int no_xattr_get(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *inode,
+			const char *name, void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int no_xattr_set(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *inode,
+			const char *name, const void *value,
+			size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
 static const struct xattr_handler fuse_xattr_handler = {
 	.prefix = "",
 	.get    = fuse_xattr_get,
@@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&fuse_xattr_handler,
 	NULL
 };
+
+static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
+	.name  = XATTR_NAME_POSIX_ACL_ACCESS,
+	.flags = ACL_TYPE_ACCESS,
+	.list  = no_xattr_list,
+	.get   = no_xattr_get,
+	.set   = no_xattr_set,
+};
+
+static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
+	.name  = XATTR_NAME_POSIX_ACL_DEFAULT,
+	.flags = ACL_TYPE_DEFAULT,
+	.list  = no_xattr_list,
+	.get   = no_xattr_get,
+	.set   = no_xattr_set,
+};
+
+const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
+	&fuse_no_acl_access_xattr_handler,
+	&fuse_no_acl_default_xattr_handler,
+	&fuse_xattr_handler,
+	NULL
+};
-- 
2.14.1
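
The effect of placing the two no-op ACL handlers ahead of the catch-all
handler can be sketched in userspace. This is a simplified model, not the
kernel's xattr dispatch code: the toy_* names are hypothetical, and the
lookup rule (first handler whose exact name or prefix matches wins) is an
assumption made for illustration. The two attribute names are the string
values of XATTR_NAME_POSIX_ACL_ACCESS and XATTR_NAME_POSIX_ACL_DEFAULT.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced handler: exact name or prefix, plus a get hook. */
struct toy_xattr_handler {
	const char *name;	/* exact match if non-NULL */
	const char *prefix;	/* prefix match otherwise */
	int (*get)(const char *name);
};

/* Plays the role of no_xattr_get(): always refuses. */
int toy_no_get(const char *name)
{
	(void)name;
	return -EOPNOTSUPP;
}

/* Plays the role of fuse_xattr_get(): pretend the server answered. */
int toy_passthrough_get(const char *name)
{
	(void)name;
	return 0;
}

const struct toy_xattr_handler toy_no_acl_access = {
	"system.posix_acl_access", NULL, toy_no_get
};
const struct toy_xattr_handler toy_no_acl_default = {
	"system.posix_acl_default", NULL, toy_no_get
};
const struct toy_xattr_handler toy_catch_all = {
	NULL, "", toy_passthrough_get
};

/* Same ordering as fuse_no_acl_xattr_handlers[]: ACL names come first. */
const struct toy_xattr_handler *toy_no_acl_handlers[] = {
	&toy_no_acl_access,
	&toy_no_acl_default,
	&toy_catch_all,
	NULL
};

/* First matching handler wins, so ACL names never reach the catch-all. */
int toy_getxattr(const struct toy_xattr_handler *const *handlers,
		 const char *name)
{
	for (; *handlers; handlers++) {
		const struct toy_xattr_handler *h = *handlers;
		int match = h->name ? strcmp(name, h->name) == 0
				    : strncmp(name, h->prefix,
					      strlen(h->prefix)) == 0;

		if (match)
			return h->get(name);
	}
	return -EOPNOTSUPP;
}
```

In this model a lookup of "system.posix_acl_access" fails with -EOPNOTSUPP
while "user.comment" still reaches the passthrough handler, which is the
behaviour the commit message describes for filesystems that have not
indicated posix acl support.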


* [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
@ 2018-02-21 20:29         ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.

This ensures that modern cached posix acl support is available
and used when dealing with posix acls.  This is important
because only that path has the code to convert the uids and
gids in posix acls into the user namespace of a fuse filesystem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/fuse_i.h |  1 +
 fs/fuse/inode.c  |  7 +++++++
 fs/fuse/xattr.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7772e2b4057e..986fa2b043ab 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
 extern const struct xattr_handler *fuse_acl_xattr_handlers[];
+extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e018dc3999f4..a52cf2019a58 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
+	/*
+	 * If we are not in the initial user namespace posix
+	 * acls must be translated.
+	 */
+	if (sb->s_user_ns != &init_user_ns)
+		sb->s_xattr = fuse_no_acl_xattr_handlers;
+
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
 	err = -ENOMEM;
 	if (!fc)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..433717640f78 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
 	return fuse_setxattr(inode, name, value, size, flags);
 }
 
+static bool no_xattr_list(struct dentry *dentry)
+{
+	return false;
+}
+
+static int no_xattr_get(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *inode,
+			const char *name, void *value, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
+static int no_xattr_set(const struct xattr_handler *handler,
+			struct dentry *dentry, struct inode *inode,
+			const char *name, const void *value,
+			size_t size, int flags)
+{
+	return -EOPNOTSUPP;
+}
+
 static const struct xattr_handler fuse_xattr_handler = {
 	.prefix = "",
 	.get    = fuse_xattr_get,
@@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&fuse_xattr_handler,
 	NULL
 };
+
+static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
+	.name  = XATTR_NAME_POSIX_ACL_ACCESS,
+	.flags = ACL_TYPE_ACCESS,
+	.list  = no_xattr_list,
+	.get   = no_xattr_get,
+	.set   = no_xattr_set,
+};
+
+static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
+	.name  = XATTR_NAME_POSIX_ACL_DEFAULT,
+	.flags = ACL_TYPE_DEFAULT,
+	.list  = no_xattr_list,
+	.get   = no_xattr_get,
+	.set   = no_xattr_set,
+};
+
+const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
+	&fuse_no_acl_access_xattr_handler,
+	&fuse_no_acl_default_xattr_handler,
+	&fuse_xattr_handler,
+	NULL
+};
-- 
2.14.1


* [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2018-02-21 20:24     ` Eric W. Biederman
@ 2018-02-21 20:29         ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

From: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
Reviewed-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Seth Forshee <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Signed-off-by: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1942f..d41559a0aa6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1
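
The "same userns or a descendant" rule that allow_other now follows can be
sketched as a walk up the namespace parent chain. This is a userspace model
under stated assumptions: the toy_* names are hypothetical, the kernel's
in_userns() is implemented differently, and the non-allow_other branch,
which in fuse compares the caller's credentials against the mount owner, is
reduced here to a single boolean.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace node: only the parent link matters here. */
struct toy_ns_node {
	const struct toy_ns_node *parent;	/* NULL for the initial ns */
};

/* True if ns is target itself or one of target's descendants. */
bool toy_in_userns(const struct toy_ns_node *target,
		   const struct toy_ns_node *ns)
{
	for (; ns; ns = ns->parent)
		if (ns == target)
			return true;
	return false;
}

/*
 * Shape of fuse_allow_current_process() after this patch: with
 * allow_other set, access depends on where the caller's namespace sits
 * relative to the mount's namespace instead of being granted outright.
 */
int toy_allow_current(bool allow_other,
		      const struct toy_ns_node *mount_ns,
		      const struct toy_ns_node *caller_ns,
		      bool caller_is_mount_owner)
{
	if (allow_other)
		return toy_in_userns(mount_ns, caller_ns);
	return caller_is_mount_owner;
}
```

For a mount made in a container namespace, a process in a nested child
namespace passes the check, while a process in the initial namespace or in
a sibling namespace does not, even with allow_other set.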


* [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant
@ 2018-02-21 20:29         ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-21 20:29 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1942f..d41559a0aa6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-21 20:29         ` Eric W. Biederman
  (?)
@ 2018-02-22 10:13         ` Miklos Szeredi
       [not found]           ` <CAOssrKch20vj8phkjfjMe=07-8uQiuXfOuCTDjrMzPbkg6DoxA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-22 10:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> At the point of fuse_dev_do_read the user space process that initiated the
> action on the fuse filesystem may no longer exist.  The process may have been
> killed or may have fired an asynchronous request and exited.
>
> If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
> fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
> the pid has been reallocated it can return practically any pid.  Any pid is
> possible as the pid allocator allocates pid numbers in different pid
> namespaces independently.
>
> The only way to make translation in fuse_dev_do_read reliable is to call
> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
> fuse_dev_do_read.  That reference counting in other contexts has been shown
> to bounce cache lines between processors and in general be slow.  So that is
> not desirable.
>
> The only known user of running the fuse server in a different pid namespace
> from the filesystem does not care what the pids are in the fuse messages
> so removing this code should not matter.

Shouldn't we at least zero out the pid in that case?

Thanks,
Miklos


>
> Getting the translation to a server running outside of the pid namespace
> of a container can still be achieved by playing setns games at mount time.
> It is also possible to add an option to pass a pid namespace into the fuse
> filesystem at mount time.
>
> Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/fuse/dev.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 5d06384c2cae..0fb58f364fa6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>         in = &req->in;
>         reqsize = in->h.len;
>
> -       if (task_active_pid_ns(current) != fc->pid_ns) {
> -               rcu_read_lock();
> -               in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> -               rcu_read_unlock();
> -       }
> -
>         /* If request is too large, reply with an error and restart the read */
>         if (nbytes < reqsize) {
>                 req->out.h.error = -EIO;
> --
> 2.14.1
>

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids
  2018-02-21 20:29         ` Eric W. Biederman
  (?)
@ 2018-02-22 10:26         ` Miklos Szeredi
       [not found]           ` <CAOssrKeYuVj6ZWUrXp7R_d+wdoArnJ=mhRp22qE9JBW3x-7tfw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-22 10:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Upon a cursory examination the uid and gid of a fuse request are
> necessary for correct operation.  Failing a fuse request where those
> values are not reliable seems a straightforward and reliable means of
> ensuring that fuse requests with bad data are not sent or processed.
>
> In most cases the vfs will avoid actions it suspects will cause
> an inode write back of an inode with an invalid uid or gid.  But that does
> not map precisely to what fuse is doing, so test for this and solve
> this at the fuse level as well.
>
> Performing this work in fuse_req_init_context is cheap as the code is
> already performing the translation here and only needs to check the
> result of the translation to see if things are not representable in
> a form the fuse server can handle.
>
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  fs/fuse/dev.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0fb58f364fa6..216db3f51a31 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
>         refcount_dec(&req->count);
>  }
>
> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> +       req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
> +       req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> +
> +       return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
>  }
>
>  void fuse_set_initialized(struct fuse_conn *fc)
> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>                         wake_up(&fc->blocked_waitq);
>                 goto out;
>         }
> -
> -       fuse_req_init_context(fc, req);
>         __set_bit(FR_WAITING, &req->flags);
>         if (for_background)
>                 __set_bit(FR_BACKGROUND, &req->flags);
> -
> +       if (unlikely(!fuse_req_init_context(fc, req))) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }
>         return req;
>
>   out:
> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>         if (!req)
>                 req = get_reserved_req(fc, file);
>
> -       fuse_req_init_context(fc, req);
>         __set_bit(FR_WAITING, &req->flags);
>         __clear_bit(FR_BACKGROUND, &req->flags);
> +       if (unlikely(!fuse_req_init_context(fc, req))) {
> +               fuse_put_request(fc, req);
> +               return ERR_PTR(-EOVERFLOW);
> +       }

I think failing the "_nofail" variant is the wrong thing to do.  This
is called to allocate a FLUSH request on close() and in readdirplus to
allocate a FORGET request.  Failing the latter results in a refcount
leak in userspace.  Failing the former results in a missing unlock of
posix locks on close().

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-21 20:29         ` Eric W. Biederman
  (?)
@ 2018-02-22 11:40         ` Miklos Szeredi
       [not found]           ` <CAOssrKeNLBeMkMrrCeRBO9Z80zFxCCEygKL3DErnQ9xBoLkH0g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-22 11:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Ensure the translation happens by failing to read or write
> posix acls when the filesystem has not indicated it supports
> posix acls.

For the first iteration this is fine, but we could convert the raw
xattrs as well, if we later want to, right?

Thanks,
Miklos

>
> This ensures that modern cached posix acl support is available
> and used when dealing with posix acls.  This is important
> because only that path has the code to convert the uids and
> gids in posix acls into the user namespace of a fuse filesystem.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/fuse/fuse_i.h |  1 +
>  fs/fuse/inode.c  |  7 +++++++
>  fs/fuse/xattr.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 51 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 7772e2b4057e..986fa2b043ab 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
>  int fuse_removexattr(struct inode *inode, const char *name);
>  extern const struct xattr_handler *fuse_xattr_handlers[];
>  extern const struct xattr_handler *fuse_acl_xattr_handlers[];
> +extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];
>
>  struct posix_acl;
>  struct posix_acl *fuse_get_acl(struct inode *inode, int type);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e018dc3999f4..a52cf2019a58 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>             file->f_cred->user_ns != sb->s_user_ns)
>                 goto err_fput;
>
> +       /*
> +        * If we are not in the initial user namespace posix
> +        * acls must be translated.
> +        */
> +       if (sb->s_user_ns != &init_user_ns)
> +               sb->s_xattr = fuse_no_acl_xattr_handlers;
> +
>         fc = kmalloc(sizeof(*fc), GFP_KERNEL);
>         err = -ENOMEM;
>         if (!fc)
> diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
> index 3caac46b08b0..433717640f78 100644
> --- a/fs/fuse/xattr.c
> +++ b/fs/fuse/xattr.c
> @@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
>         return fuse_setxattr(inode, name, value, size, flags);
>  }
>
> +static bool no_xattr_list(struct dentry *dentry)
> +{
> +       return false;
> +}
> +
> +static int no_xattr_get(const struct xattr_handler *handler,
> +                       struct dentry *dentry, struct inode *inode,
> +                       const char *name, void *value, size_t size)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int no_xattr_set(const struct xattr_handler *handler,
> +                       struct dentry *dentry, struct inode *inode,
> +                       const char *name, const void *value,
> +                       size_t size, int flags)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
>  static const struct xattr_handler fuse_xattr_handler = {
>         .prefix = "",
>         .get    = fuse_xattr_get,
> @@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
>         &fuse_xattr_handler,
>         NULL
>  };
> +
> +static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
> +       .name  = XATTR_NAME_POSIX_ACL_ACCESS,
> +       .flags = ACL_TYPE_ACCESS,
> +       .list  = no_xattr_list,
> +       .get   = no_xattr_get,
> +       .set   = no_xattr_set,
> +};
> +
> +static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
> +       .name  = XATTR_NAME_POSIX_ACL_DEFAULT,
> +       .flags = ACL_TYPE_DEFAULT,
> +       .list  = no_xattr_list,
> +       .get   = no_xattr_get,
> +       .set   = no_xattr_set,
> +};
> +
> +const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
> +       &fuse_no_acl_access_xattr_handler,
> +       &fuse_no_acl_default_xattr_handler,
> +       &fuse_xattr_handler,
> +       NULL
> +};
> --
> 2.14.1
>

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids
  2018-02-22 10:26         ` Miklos Szeredi
@ 2018-02-22 18:15               ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-22 18:15 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Upon a cursory examination the uid and gid of a fuse request are
>> necessary for correct operation.  Failing a fuse request where those
>> values are not reliable seems a straightforward and reliable means of
>> ensuring that fuse requests with bad data are not sent or processed.
>>
>> In most cases the vfs will avoid actions it suspects will cause
>> an inode write back of an inode with an invalid uid or gid.  But that does
>> not map precisely to what fuse is doing, so test for this and solve
>> this at the fuse level as well.
>>
>> Performing this work in fuse_req_init_context is cheap as the code is
>> already performing the translation here and only needs to check the
>> result of the translation to see if things are not representable in
>> a form the fuse server can handle.
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>> ---
>>  fs/fuse/dev.c | 20 +++++++++++++-------
>>  1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 0fb58f364fa6..216db3f51a31 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
>>         refcount_dec(&req->count);
>>  }
>>
>> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>>  {
>> -       req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> -       req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> +       req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
>> +       req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
>>         req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>> +
>> +       return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
>>  }
>>
>>  void fuse_set_initialized(struct fuse_conn *fc)
>> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>>                         wake_up(&fc->blocked_waitq);
>>                 goto out;
>>         }
>> -
>> -       fuse_req_init_context(fc, req);
>>         __set_bit(FR_WAITING, &req->flags);
>>         if (for_background)
>>                 __set_bit(FR_BACKGROUND, &req->flags);
>> -
>> +       if (unlikely(!fuse_req_init_context(fc, req))) {
>> +               fuse_put_request(fc, req);
>> +               return ERR_PTR(-EOVERFLOW);
>> +       }
>>         return req;
>>
>>   out:
>> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>>         if (!req)
>>                 req = get_reserved_req(fc, file);
>>
>> -       fuse_req_init_context(fc, req);
>>         __set_bit(FR_WAITING, &req->flags);
>>         __clear_bit(FR_BACKGROUND, &req->flags);
>> +       if (unlikely(!fuse_req_init_context(fc, req))) {
>> +               fuse_put_request(fc, req);
>> +               return ERR_PTR(-EOVERFLOW);
>> +       }
>
> I think failing the "_nofail" variant is the wrong thing to do.  This
> is called to allocate a FLUSH request on close() and in readdirplus to
> allocate a FORGET request.  Failing the latter results in refcount
> leak in userspace.   Failing the former results in missing unlock on
> close() of posix locks.

Doh!  You are quite correct.

Modifying fuse_get_req_nofail_nopages to fail is a bug.

I am thinking the proper solution is to write:

    static void fuse_req_init_context_nofail(struct fuse_req *req)
    {
            req->in.h.uid = 0;
            req->in.h.gid = 0;
            req->in.h.pid = 0;
    }

And use that in the nofail case.  It appears that neither flush nor
the eviction of inodes is an action triggered by user space, so user
space identifiers are nonsense in those cases.
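
A rough userspace model of the two initialisation paths discussed here
may help.  It assumes the caller has already translated the ids, with
(uint32_t)-1 standing in for an unmappable uid or gid.  The names
struct req_header, req_init_context() and req_init_context_nofail() are
illustrative only and are not the eventual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models (uid_t)-1 / (gid_t)-1, the result of a failed id translation. */
#define INVALID_ID ((uint32_t)-1)

/* Hypothetical mirror of the uid/gid/pid fields of a fuse request header. */
struct req_header {
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
};

/*
 * Normal path: record the already translated ids and report whether both
 * uid and gid are representable.  The caller fails the request otherwise.
 */
bool req_init_context(struct req_header *h, uint32_t uid, uint32_t gid,
		      uint32_t pid)
{
	assert(h);
	h->uid = uid;
	h->gid = gid;
	h->pid = pid;
	return h->uid != INVALID_ID && h->gid != INVALID_ID;
}

/* "nofail" path: no meaningful requester, so send zeros and never fail. */
void req_init_context_nofail(struct req_header *h)
{
	assert(h);
	h->uid = 0;
	h->gid = 0;
	h->pid = 0;
}
```

The point of the split is that only the first function has a failure
result to propagate; the second cannot fail by construction.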

I will respin this patch shortly.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-22 10:13         ` Miklos Szeredi
@ 2018-02-22 19:04               ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-22 19:04 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> At the point of fuse_dev_do_read the user space process that initiated the
>> action on the fuse filesystem may no longer exist.  The process may have been
>> killed or may have fired an asynchronous request and exited.
>>
>> If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
>> fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
>> the pid has been reallocated it can return practically any pid.  Any pid is
>> possible as the pid allocator allocates pid numbers in different pid
>> namespaces independently.
>>
>> The only way to make translation in fuse_dev_do_read reliable is to call
>> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
>> fuse_dev_do_read.  That reference counting in other contexts has been shown
>> to bounce cache lines between processors and in general be slow.  So that is
>> not desirable.
>>
>> The only known user of running the fuse server in a different pid namespace
>> from the filesystem does not care what the pids are in the fuse messages
>> so removing this code should not matter.
>
> Shouldn't we at least zero out the pid in that case?

This is an explicit case of passing a file descriptor between pid
namespaces.  So I think there are plenty of buyer-beware signs out,
and I don't know if there are any real world advantages to zeroing the
pid.

I can see a case for using the pid namespace of the opener of /dev/fuse
instead of the pid namespace of the mounter of the fuse filesystem.
Although in practice I would be surprised if they were different.

I am very leery about caring during a read operation.  Caring about the
current process during read/write tends to break caching, is error prone
as the need for this patch demonstrates, and is generally likely to be
slower than not caring.
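
To illustrate why re-resolving a pid number at read time is error
prone, here is a toy userspace model.  It is only a sketch: struct task
and retranslate() are made-up names, and the linear table merely stands
in for the lookup done by pid_vnr(find_pid_ns(...)).

```c
#include <assert.h>
#include <stddef.h>

/* One task as seen from two pid namespaces (toy model). */
struct task {
	int pid_in_mount_ns;	/* number recorded when the request was made */
	int pid_in_server_ns;	/* number the fuse server would see */
	int alive;
};

/*
 * Resolve a saved pid number to whichever live task owns that number now.
 * Returns 0 when no live task has it.  Nothing ties the result to the
 * task that originally issued the request.
 */
int retranslate(const struct task *tasks, size_t n, int nr)
{
	assert(tasks);
	for (size_t i = 0; i < n; i++)
		if (tasks[i].alive && tasks[i].pid_in_mount_ns == nr)
			return tasks[i].pid_in_server_ns;
	return 0;
}
```

In this model the lookup is right only while the requester is alive.
Once it exits the result is 0, and if an unrelated task later gets the
same number the result is that task's pid in the server's namespace.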

So yes, we can zero the pid.  I don't think it is wise to zero the pid
unless we zero the pid in fuse_req_init_context.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-22 11:40         ` Miklos Szeredi
@ 2018-02-22 19:18               ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-22 19:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Ensure the translation happens by failing to read or write
>> posix acls when the filesystem has not indicated it supports
>> posix acls.
>
> For the first iteration this is fine, but we could convert the raw
> xattrs as well, if we later want to, right?

I will say maybe.  This is tricky.  The code would not be too hard,
and the function to do the work, posix_acl_fix_xattr_userns, already
exists in fs/posix_acl.c.

I don't actually expect that to work long term.  I expect the direction
the kernel internals are moving is that all filesystems that implement
posix acls will be expected to implement .get_acl and .set_acl.

I would have to reread the old thread that got us to this point with
posix acls before I could really understand the backwards compatible
fuse use case, and I would have to reread the rest of the acl processing
in the kernel before I could recall exactly what makes sense.

If there was an obvious way to whitelist xattrs that fuse can support
for user namespaces I think I would go for that, just to avoid
problems with future xattrs.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-22 19:18               ` Eric W. Biederman
@ 2018-02-22 22:50                   ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-22 22:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

ebiederm@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Ensure the translation happens by failing to read or write
>>> posix acls when the filesystem has not indicated it supports
>>> posix acls.
>>
>> For the first iteration this is fine, but we could convert the raw
>> xattrs as well, if we later want to, right?
>
> I will say maybe.  This is tricky.  The code would not be too hard,
> and the function to do the work, posix_acl_fix_xattr_userns, already
> exists in fs/posix_acl.c.
>
> I don't actually expect that to work long term.  I expect the direction
> the kernel internals are moving is that all filesystems that implement
> posix acls will be expected to implement .get_acl and .set_acl.
>
> I would have to reread the old thread that got us to this point with
> posix acls before I could really understand the backwards compatible
> fuse use case, and I would have to reread the rest of the acl processing
> in the kernel before I could recall exactly what makes sense.
>
> If there was an obvious way to whitelist xattrs that fuse can support
> for user namespaces I think I would go for that, just to avoid
> problems with future xattrs.

I am remembering why this is such a sticky issue.

Today when a posix acl is read from user space the code does:
      posix_acl_to_xattr(&init_user_ns, ...) in posix_acl_xattr_get
      posix_acl_fix_xattr_to_user() in getxattr

Similarly when a posix acl is written from user space the code does:
      posix_acl_fix_xattr_from_user() in setxattr
      posix_acl_from_xattr(&init_user_ns, ...) in posix_acl_xattr_set

If every posix acl supporting filesystem in the kernel used
posix_acl_access_xattr_handler and posix_acl_default_xattr_handler, the
functions posix_acl_fix_xattr_to_user, posix_acl_fix_xattr_from_user,
and posix_acl_fix_xattr_userns could all be removed and the posix acl
handling could be that little bit simpler and faster.
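
As a rough illustration of the id rewriting that the fix_xattr helpers
perform, here is a small user-space model.  It is a sketch and not kernel
code: the model_* names, the single-extent map, and the example numbers
are all assumptions made for the illustration.  Each id is taken from the
source namespace to a kernel-wide id and from there into the destination
namespace.

```c
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Hypothetical single-extent id map: ids [ns_first, ns_first + count)
 * inside the namespace correspond to kernel-wide ids starting at
 * kernel_first. */
struct model_userns {
	uint32_t ns_first;
	uint32_t kernel_first;
	uint32_t count;
};

/* Namespace-local id -> kernel-wide id (in the spirit of make_kuid). */
static uint32_t model_make_kid(const struct model_userns *ns, uint32_t id)
{
	if (id >= ns->ns_first && id - ns->ns_first < ns->count)
		return ns->kernel_first + (id - ns->ns_first);
	return MODEL_INVALID_ID;
}

/* Kernel-wide id -> namespace-local id (in the spirit of from_kuid). */
static uint32_t model_from_kid(const struct model_userns *ns, uint32_t kid)
{
	if (kid >= ns->kernel_first && kid - ns->kernel_first < ns->count)
		return ns->ns_first + (kid - ns->kernel_first);
	return MODEL_INVALID_ID;
}

/* Rewrite one ACL entry id from namespace "from" into namespace "to". */
static uint32_t model_fix_id(const struct model_userns *to,
			     const struct model_userns *from, uint32_t id)
{
	uint32_t kid = model_make_kid(from, id);

	if (kid == MODEL_INVALID_ID)
		return MODEL_INVALID_ID;
	return model_from_kid(to, kid);
}
```

With an identity map standing in for init_user_ns and a container map
that places ids 0..65535 at kernel ids 100000 and up, the read direction
turns 100005 into 5 and the write direction turns 5 back into 100005.  An
id with no mapping in the destination comes out as the invalid id.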

So if we could figure out how to use the generic acl support for the old
brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
easier to support them long term.

Eric

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
@ 2018-02-22 22:50                   ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-22 22:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

ebiederm@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Ensure the translation happens by failing to read or write
>>> posix acls when the filesystem has not indicated it supports
>>> posix acls.
>>
>> For the first iteration this is fine, but  we could convert the raw
>> xattrs as well, if we later want to, right?
>
> I will say maybe.  This is tricky.   The code would not be too hard,
> and the function to do the work posix_acl_fix_xattr_userns already
> exists in fs/posix_acl.c
>
> I don't actually expect that to work longterm.  I expect the direction
> the kernel internals are moving is that all filesystems that implement
> posix acls will be expected to implement .get_acl and .set_acl.
>
> I would have to reread the old thread that got us to this point with
> posix acls before I could really understand the backwards compatible
> fuse use case, and I would have to reread the rest of the acl processing
> in the kernel before I could recall exactly what makes sense.
>
> If there was an obvious way to whitelist xattrs that fuse can support
> for user namespaces I think I would go for that.  Just to avoid future
> problems with future xattrs.

I am remembering why this is such a sticky issue.

Today when a posix acl is read from user space the code does:
      posix_acl_to_xattr(&init_user_ns, ...) in posix_acl_xattr_get
      posix_acl_fix_xattr_to_user() in getxattr

Similarly when a posix acl is written from user space the code does:
      posix_acl_fix_xattr_from_user() in setxattr
      posix_acl_from_xattr(&init_user_ns, ...) in posix_acl_xattr_set

If every posix acl supporting filesystem in the kernel used
posix_acl_access_xattr_handler and posix_acl_default_xattr_handler, the
functions posix_acl_fix_xattr_to_user, posix_acl_fix_xattr_from_user,
and posix_acl_fix_xattr_userns could all be removed and the posix acl
handling could be that little bit simpler and faster.

So if we could figure out how to use the generic acl support for the old
brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
easier to support them long term.

Eric

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
       [not found]                   ` <87mv004p0t.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-02-26  7:47                     ` Miklos Szeredi
  0 siblings, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-26  7:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:

> So if we could figure out how to use the generic acl support for the old
> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
> easier to support them long term.

Simplest and most robust way seems to be to do everything the same (as
with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Thanks,
Miklos

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-22 22:50                   ` Eric W. Biederman
  (?)
@ 2018-02-26  7:47                   ` Miklos Szeredi
       [not found]                     ` <CAOssrKd+c0Mx+=S-+zr1QS8a37Pm=VGki=FVR+LXQZBsk3byqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-26  7:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:

> So if we could figure out how to use the generic acl support for the old
> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
> easier to support them long term.

Simplest and most robust way seems to be to do everything the same (as
with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Thanks,
Miklos

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-26  7:47                   ` Miklos Szeredi
@ 2018-02-26 16:35                         ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 16:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
>> So if we could figure out how to use the generic acl support for the old
>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>> easier to support them long term.
>
> Simplest and most robust way seems to be to do everything the same (as
> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Good point.  That sounds like for the !fc->posix_acl case we just
need a careful use of "forget_all_cached_acls(inode)".

I will take a quick look at that, and see if that is easy/sufficient to
cover the legacy fuse case.  Otherwise I will go with what I already
have here.

That feels like a better path.  And internally I would call what is
today fc->posix_acl fc->cached_posix_acl.  To better convey the intent.
Fingers crossed.

Eric

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
@ 2018-02-26 16:35                         ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 16:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>
>> So if we could figure out how to use the generic acl support for the old
>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>> easier to support them long term.
>
> Simplest and most robust way seems to be to do everything the same (as
> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Good point.  That sounds like for the !fc->posix_acl case we just
need a careful use of "forget_all_cached_acls(inode)".

I will take a quick look at that, and see if that is easy/sufficient to
cover the legacy fuse case.  Otherwise I will go with what I already
have here.

That feels like a better path.  And internally I would call what is
today fc->posix_acl fc->cached_posix_acl.  To better convey the intent.
Fingers crossed.

Eric

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
  2018-02-26 16:35                         ` Eric W. Biederman
@ 2018-02-26 21:51                             ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 21:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
>> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>>> So if we could figure out how to use the generic acl support for the old
>>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>>> easier to support them long term.
>>
>> Simplest and most robust way seems to be to do everything the same (as
>> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.
>
> Good point.  That sounds like for the !fc->posix_acl case we just
> need a careful use of "forget_all_cached_acls(inode)".
>
> I will take a quick look at that, and see if that is easy/sufficient to
> cover the legacy fuse case.  Otherwise I will go with what I already
> have here.
>
> That feels like a better path.  And internally I would call what is
> today fc->posix_acl fc->cached_posix_acl.  To better convey the intent.
> Fingers crossed.

It looks like simply setting
"inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;" is the secret
sauce needed to disable caching in the legacy case and make everything
work.

I had to tweak the calls to forget_all_cached_acls so that they won't clear
the ACL_DONT_CACHE status, but otherwise that was an absolutely trivial
change to combine those two code paths.

I will post my updated patches shortly.

Eric

* Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns
@ 2018-02-26 21:51                             ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 21:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

ebiederm@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <mszeredi@redhat.com> writes:
>
>> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>
>>> So if we could figure out how to use the generic acl support for the old
>>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>>> easier to support them long term.
>>
>> Simplest and most robust way seems to be to do everything the same (as
>> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.
>
> Good point.  That sounds like for the !fc->posix_acl case we just
> need a careful use of "forget_all_cached_acls(inode)".
>
> I will take a quick look at that, and see if that is easy/sufficient to
> cover the legacy fuse case.  Otherwise I will go with what I already
> have here.
>
> That feels like a better path.  And internally I would call what is
> today fc->posix_acl fc->cached_posix_acl.  To better convey the intent.
> Fingers crossed.

It looks like simply setting
"inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;" is the secret
sauce needed to disable caching in the legacy case and make everything
work.

I had to tweak the calls to forget_all_cached_acls so that they won't clear
the ACL_DONT_CACHE status, but otherwise that was an absolutely trivial
change to combine those two code paths.

I will post my updated patches shortly.

Eric

* [PATCH v7 0/7] fuse: mounts from non-init user namespaces
       [not found]     ` <878tbmf5vl.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                         ` (4 preceding siblings ...)
  2018-02-21 20:29         ` Eric W. Biederman
@ 2018-02-26 23:52       ` Eric W. Biederman
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
patches are far enough along that we can ignore them, except possibly for
the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
  added.
- Updated fuse to always use posix_acl_access_xattr_handler and
  posix_acl_default_xattr_handler, by teaching fuse to set
  ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Miklos, can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

These changes are also available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v7

Eric W. Biederman (6):
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
      fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
      fuse: Simplfiy the posix acl handling logic.
      fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant

 fs/fuse/acl.c           | 10 +++++-----
 fs/fuse/cuse.c          |  7 ++++++-
 fs/fuse/dev.c           | 30 +++++++++++++++++-------------
 fs/fuse/dir.c           | 27 +++++++++++++--------------
 fs/fuse/fuse_i.h        | 11 ++++++++---
 fs/fuse/inode.c         | 44 +++++++++++++++++++++++++++++---------------
 fs/fuse/xattr.c         |  6 +-----
 fs/posix_acl.c          |  7 +++++--
 kernel/user_namespace.c |  1 +
 9 files changed, 85 insertions(+), 58 deletions(-)

Eric

* [PATCH v7 0/7] fuse: mounts from non-init user namespaces
  2018-02-21 20:24     ` Eric W. Biederman
  (?)
  (?)
@ 2018-02-26 23:52     ` Eric W. Biederman
  2018-02-26 23:52       ` [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
                         ` (5 more replies)
  -1 siblings, 6 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
patches are far enough along that we can ignore them, except possibly for
the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
  added.
- Updated fuse to always use posix_acl_access_xattr_handler and
  posix_acl_default_xattr_handler, by teaching fuse to set
  ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Miklos, can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

These changes are also available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v7

Eric W. Biederman (6):
      fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
      fuse: Fail all requests with invalid uids or gids
      fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
      fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
      fuse: Simplfiy the posix acl handling logic.
      fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
      fuse: Restrict allow_other to the superblock's namespace or a descendant

 fs/fuse/acl.c           | 10 +++++-----
 fs/fuse/cuse.c          |  7 ++++++-
 fs/fuse/dev.c           | 30 +++++++++++++++++-------------
 fs/fuse/dir.c           | 27 +++++++++++++--------------
 fs/fuse/fuse_i.h        | 11 ++++++++---
 fs/fuse/inode.c         | 44 +++++++++++++++++++++++++++++---------------
 fs/fuse/xattr.c         |  6 +-----
 fs/posix_acl.c          |  7 +++++--
 kernel/user_namespace.c |  1 +
 9 files changed, 85 insertions(+), 58 deletions(-)

Eric

* [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-02-26 23:52         ` Eric W. Biederman
  2018-02-26 23:52           ` Eric W. Biederman
                           ` (6 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1

* [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  2018-02-26 23:52     ` Eric W. Biederman
@ 2018-02-26 23:52       ` Eric W. Biederman
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	in = &req->in;
 	reqsize = in->h.len;
 
-	if (task_active_pid_ns(current) != fc->pid_ns) {
-		rcu_read_lock();
-		in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
-		rcu_read_unlock();
-	}
-
 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
 		req->out.h.error = -EIO;
-- 
2.14.1

* [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids
  2018-02-26 23:52     ` Eric W. Biederman
@ 2018-02-26 23:52           ` Eric W. Biederman
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.
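
The difference between the munged and the plain translation, and the
check on the result, can be sketched in user space as follows.  This is a
simplified model and not the kernel implementation: the model_* names,
the single-extent map, and the overflow id of 65534 are assumptions made
for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_INVALID_ID  ((uint32_t)-1)
#define MODEL_OVERFLOW_ID ((uint32_t)65534)

/* Hypothetical single-extent id map for the namespace the server sees. */
struct model_id_map {
	uint32_t ns_first;      /* first id as seen inside the namespace */
	uint32_t kernel_first;  /* first kernel-wide id covered by the map */
	uint32_t count;         /* number of ids covered */
};

/* Plain translation: unmapped ids are reported as the invalid id. */
static uint32_t model_from_kid(const struct model_id_map *m, uint32_t kid)
{
	if (kid >= m->kernel_first && kid - m->kernel_first < m->count)
		return m->ns_first + (kid - m->kernel_first);
	return MODEL_INVALID_ID;
}

/* Munged translation: unmapped ids are silently replaced, so the caller
 * cannot tell that the translation failed. */
static uint32_t model_from_kid_munged(const struct model_id_map *m,
				      uint32_t kid)
{
	uint32_t id = model_from_kid(m, kid);

	return id == MODEL_INVALID_ID ? MODEL_OVERFLOW_ID : id;
}

struct model_req_hdr {
	uint32_t uid;
	uint32_t gid;
};

/* Fill in the request header and report whether both ids are usable. */
static bool model_req_init_context(const struct model_id_map *m,
				   struct model_req_hdr *h,
				   uint32_t fsuid, uint32_t fsgid)
{
	h->uid = model_from_kid(m, fsuid);
	h->gid = model_from_kid(m, fsgid);
	return h->uid != MODEL_INVALID_ID && h->gid != MODEL_INVALID_ID;
}
```

With a map covering kernel ids 100000..165535, a caller with fsuid 100005
gets uid 5 in the header and the request proceeds, while a caller with
fsuid 1000 makes model_req_init_context return false.  The munged variant
would instead have put 65534 in the header for that caller.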

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/dev.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+	req->in.h.uid = 0;
+	req->in.h.gid = 0;
+	req->in.h.pid = 0;
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
+	fuse_req_init_context_nofail(req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
 	return req;
-- 
2.14.1

* [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids
@ 2018-02-26 23:52           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dev.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
 	refcount_dec(&req->count);
 }
 
-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+	req->in.h.uid = 0;
+	req->in.h.gid = 0;
+	req->in.h.pid = 0;
 }
 
 void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
 			wake_up(&fc->blocked_waitq);
 		goto out;
 	}
-
-	fuse_req_init_context(fc, req);
 	__set_bit(FR_WAITING, &req->flags);
 	if (for_background)
 		__set_bit(FR_BACKGROUND, &req->flags);
-
+	if (unlikely(!fuse_req_init_context(fc, req))) {
+		fuse_put_request(fc, req);
+		return ERR_PTR(-EOVERFLOW);
+	}
 	return req;
 
  out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
 	if (!req)
 		req = get_reserved_req(fc, file);
 
-	fuse_req_init_context(fc, req);
+	fuse_req_init_context_nofail(req);
 	__set_bit(FR_WAITING, &req->flags);
 	__clear_bit(FR_BACKGROUND, &req->flags);
 	return req;
-- 
2.14.1

* [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2018-02-26 23:52         ` Eric W. Biederman
  2018-02-26 23:52           ` Eric W. Biederman
@ 2018-02-26 23:52         ` Eric W. Biederman
  2018-02-26 23:52         ` [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS Eric W. Biederman
                           ` (4 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Fuse is about to join overlayfs in relying on get_acl respecting
ACL_DONT_CACHE so update the documentation in get_acl to reflect that
fact.  The comment and this change description should give people a
clue that respecting ACL_DONT_CACHE in get_acl is important, and they
should audit the filesystems before removing that support.

Additionally update the comment above the call to get_acl itself and
remove the wrong information that an implementation of get_acl can
prevent caching by calling forget_cached_acl.  Replace that with the
correct information that to prevent caching all that is necessary is
to set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE when the
inode is initialized.
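
The effect of the two sentinel values on the cache slot can be modelled in
user space roughly as below.  This is a single-threaded sketch and not the
code in fs/posix_acl.c: it leaves out the cmpxchg and in-progress sentinel
handling entirely, and the model_* names and the sentinel values are
assumptions of the model.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct posix_acl; only pointer identity matters here. */
struct model_acl {
	int count;
};

/* Sentinel slot values, local to this model. */
#define MODEL_ACL_NOT_CACHED ((struct model_acl *)(intptr_t)-1)
#define MODEL_ACL_DONT_CACHE ((struct model_acl *)(intptr_t)-3)

struct model_inode {
	struct model_acl *i_acl;    /* cache slot */
	struct model_acl *backing;  /* what the filesystem would return */
	int fetches;                /* how often the filesystem was asked */
};

/* Stand-in for the filesystem's ->get_acl method. */
static struct model_acl *model_fs_get_acl(struct model_inode *inode)
{
	inode->fetches++;
	return inode->backing;
}

static struct model_acl *model_get_acl(struct model_inode *inode)
{
	struct model_acl *slot = inode->i_acl;
	struct model_acl *acl;

	/* Anything other than a sentinel (including NULL) is a cache hit. */
	if (slot != MODEL_ACL_NOT_CACHED && slot != MODEL_ACL_DONT_CACHE)
		return slot;

	acl = model_fs_get_acl(inode);

	/* Fill the slot only from the "not cached" state, so a slot that
	 * holds the "don't cache" sentinel keeps it. */
	if (slot == MODEL_ACL_NOT_CACHED)
		inode->i_acl = acl;
	return acl;
}
```

Calling model_get_acl twice on an inode whose slot starts as
MODEL_ACL_NOT_CACHED asks the filesystem once.  Starting from
MODEL_ACL_DONT_CACHE, every call goes to the filesystem and the slot is
left unchanged.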

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/posix_acl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..3c24fc263401 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -121,14 +121,17 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	 * could wait for that other task to complete its job, but it's easier
 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
 	 * be an unlikely race.)
+	 *
+	 * ACL_DONT_CACHE is treated as another task updating the acl and
+	 * remains set.
 	 */
 	if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
 		/* fall through */ ;
 
 	/*
 	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * A filesystem can prevent that by setting
+	 * inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE.
 	 *
 	 * If the filesystem doesn't have a get_acl() function at all, we'll
 	 * just create the negative cache entry.
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-26 23:52     ` Eric W. Biederman
  2018-02-26 23:52       ` [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read Eric W. Biederman
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-02-26 23:52       ` Eric W. Biederman
       [not found]         ` <20180226235302.12708-3-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2018-02-26 23:52       ` [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS Eric W. Biederman
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Fuse is about to join overlayfs in relying on get_acl respecting
ACL_DONT_CACHE so update the documentation in get_acl to reflect that
fact.  The comment and this change description should give people a
clue that respecting ACL_DONT_CACHE in get_acl is important, and they
should audit the filesystems before removing that support.

Additionally update the comment above the call to get_acl itself and
remove the wrong information that an implementation of get_acl can
prevent caching by calling forget_cached_acl.  Replace that with the
correct information that to prevent caching all that is necessary is
to set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE when the
inode is initialized.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/posix_acl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..3c24fc263401 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -121,14 +121,17 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	 * could wait for that other task to complete its job, but it's easier
 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
 	 * be an unlikely race.)
+	 *
+	 * ACL_DONT_CACHE is treated as another task updating the acl and
+	 * remains set.
 	 */
 	if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
 		/* fall through */ ;
 
 	/*
 	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * A filesystem can prevent that by setting
+	 * inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE.
 	 *
 	 * If the filesystem doesn't have a get_acl() function at all, we'll
 	 * just create the negative cache entry.
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                           ` (2 preceding siblings ...)
  2018-02-26 23:52         ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Eric W. Biederman
@ 2018-02-26 23:52         ` Eric W. Biederman
  2018-02-26 23:53           ` Eric W. Biederman
                           ` (3 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

When FUSE_GETXATTR will never return anything, call cache_no_acl to
cache that state in the vfs as well as in fuse with fc->no_getxattr.

The only code paths this affects are those that call fuse_get_acl,
and caching a NULL or returning it immediately has exactly the same
effect, so this should not affect anything.

This keeps the vfs from wasting its time calling down into fuse
when fuse isn't going to do anything, and it makes it clear
when a NULL should be cached for optimal performance.
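
The effect can be sketched with a small userspace model.  This is an
illustration under stated assumptions, not the fuse code: model_conn,
model_inode, model_cache_no_acl and model_getxattr are hypothetical
names, and "server_ret" stands in for the reply the userspace server
would have sent for FUSE_GETXATTR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct acl;

/* Hypothetical stand-in for ACL_NOT_CACHED. */
#define SLOT_NOT_CACHED ((struct acl *)(uintptr_t)-1)

struct model_conn {
	unsigned no_getxattr:1;	/* the server does not implement getxattr */
};

struct model_inode {
	struct model_conn *conn;
	struct acl *i_acl;
	struct acl *i_default_acl;
};

/* Record "this inode has no ACL" by caching NULL in both slots. */
static void model_cache_no_acl(struct model_inode *inode)
{
	inode->i_acl = NULL;
	inode->i_default_acl = NULL;
}

/* server_ret is the (hypothetical) reply from the userspace server. */
static long model_getxattr(struct model_inode *inode, long server_ret)
{
	long ret = server_ret;

	assert(inode && inode->conn);

	/* Once -ENOSYS has been seen, the server is not asked again. */
	if (inode->conn->no_getxattr)
		return -EOPNOTSUPP;

	if (ret == -ENOSYS) {
		inode->conn->no_getxattr = 1;
		model_cache_no_acl(inode);
		ret = -EOPNOTSUPP;
	}
	return ret;
}
```

After the first -ENOSYS reply both slots of that inode hold a cached
NULL, so in this model a later ACL lookup on it is answered from the
cache instead of going back down into the filesystem.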

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/xattr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..0520a4f47226 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -82,6 +82,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 		ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
 	if (ret == -ENOSYS) {
 		fc->no_getxattr = 1;
+		cache_no_acl(inode);
 		ret = -EOPNOTSUPP;
 	}
 	return ret;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
  2018-02-26 23:52     ` Eric W. Biederman
                         ` (2 preceding siblings ...)
  2018-02-26 23:52       ` [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE Eric W. Biederman
@ 2018-02-26 23:52       ` Eric W. Biederman
  2018-02-26 23:53       ` [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
  2018-03-02 21:58       ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

When FUSE_GETXATTR will never return anything, call cache_no_acl to
cache that state in the vfs as well as in fuse with fc->no_getxattr.

The only code paths this affects are those that call fuse_get_acl,
and caching a NULL or returning it immediately has exactly the same
effect, so this should not affect anything.

This keeps the vfs from wasting its time calling down into fuse
when fuse isn't going to do anything, and it makes it clear
when a NULL should be cached for optimal performance.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/xattr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..0520a4f47226 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -82,6 +82,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 		ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
 	if (ret == -ENOSYS) {
 		fc->no_getxattr = 1;
+		cache_no_acl(inode);
 		ret = -EOPNOTSUPP;
 	}
 	return ret;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 5/7] fuse: Simplify the posix acl handling logic.
  2018-02-26 23:52     ` Eric W. Biederman
@ 2018-02-26 23:53           ` Eric W. Biederman
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Rename the fuse connection flag posix_acl to cached_posix_acl as that
is what it actually means: fuse will cache and operate on the cached
value of the posix acl.

When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
so that get_acl and friends won't cache the acl values even if they
are called.

Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
wrapper only takes effect when cached_posix_acl is true to prevent
losing the nocache or noxattr status when posix acls are not
cached.

Always use posix_acl_access_xattr_handler so the fuse code
benefits from the generic posix acl handlers as much as possible.
This will become important as the code works on translation
of uid and gid in the posix acls when fuse is not mounted in
the initial user namespace.
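
To make the reason for the wrapper concrete, here is a userspace sketch
of the slot states involved.  It is a simplified model, not the fuse
code: the model_* names and the SLOT_* values are hypothetical, and
model_forget_all assumes that an unconditional forget resets both slots
to the "not cached" marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct acl;

/* Hypothetical stand-ins for ACL_NOT_CACHED and ACL_DONT_CACHE. */
#define SLOT_NOT_CACHED ((struct acl *)(uintptr_t)-1)
#define SLOT_DONT_CACHE ((struct acl *)(uintptr_t)-3)

struct model_conn {
	bool cached_posix_acl;
};

struct model_inode {
	struct model_conn *conn;
	struct acl *i_acl;
	struct acl *i_default_acl;
};

/* Inode setup: mark the slots "never cache" unless acls are cached. */
static void model_init_inode(struct model_inode *inode, struct model_conn *fc)
{
	assert(fc);
	inode->conn = fc;
	inode->i_acl = inode->i_default_acl =
		fc->cached_posix_acl ? SLOT_NOT_CACHED : SLOT_DONT_CACHE;
}

/* Unconditional forget: both slots go back to "not cached". */
static void model_forget_all(struct model_inode *inode)
{
	inode->i_acl = SLOT_NOT_CACHED;
	inode->i_default_acl = SLOT_NOT_CACHED;
}

/* The wrapper: only forget when there is a cache to invalidate. */
static void model_fuse_forget(struct model_inode *inode)
{
	if (inode->conn->cached_posix_acl)
		model_forget_all(inode);
}
```

With cached_posix_acl clear, model_fuse_forget leaves the slots at
SLOT_DONT_CACHE, whereas calling model_forget_all directly would turn
them back into SLOT_NOT_CACHED and let a later lookup cache a value.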

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/acl.c    |  6 +++---
 fs/fuse/dir.c    | 11 +++++------
 fs/fuse/fuse_i.h |  5 +++--
 fs/fuse/inode.c  | 13 ++++++++++---
 fs/fuse/xattr.c  |  5 -----
 5 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..8fb2153dbf50 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 	void *value = NULL;
 	struct posix_acl *acl;
 
-	if (!fc->posix_acl || fc->no_getxattr)
+	if (fc->no_getxattr)
 		return NULL;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -53,7 +53,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	const char *name;
 	int ret;
 
-	if (!fc->posix_acl || fc->no_setxattr)
+	if (fc->no_setxattr)
 		return -EOPNOTSUPP;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -92,7 +92,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	} else {
 		ret = fuse_removexattr(inode, name);
 	}
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	fuse_invalidate_attr(inode);
 
 	return ret;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..a44ca509db4f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -237,7 +237,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 		if (ret || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
 			goto invalid;
 
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		fuse_change_attributes(inode, &outarg.attr,
 				       entry_attr_timeout(&outarg),
 				       attr_version);
@@ -930,7 +930,7 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file,
 	int err = 0;
 
 	if (time_before64(fi->i_time, get_jiffies_64())) {
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		err = fuse_do_getattr(inode, stat, file);
 	} else if (stat) {
 		generic_fillattr(inode, stat);
@@ -1076,7 +1076,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	return fuse_do_getattr(inode, NULL, NULL);
 }
 
@@ -1246,7 +1246,7 @@ static int fuse_direntplus_link(struct file *file,
 		fi->nlookup++;
 		spin_unlock(&fc->lock);
 
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		fuse_change_attributes(inode, &o->attr,
 				       entry_attr_timeout(o),
 				       attr_version);
@@ -1764,8 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr)
 		 * If filesystem supports acls it may have updated acl xattrs in
 		 * the filesystem, so forget cached acls for the inode.
 		 */
-		if (fc->posix_acl)
-			forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 
 		/* Directory mode changed, may need to revalidate access */
 		if (d_is_dir(entry) && (attr->ia_valid & ATTR_MODE))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..3cf296d60bc0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,7 +619,7 @@ struct fuse_conn {
 	unsigned no_lseek:1;
 
 	/** Does the filesystem support posix acls? */
-	unsigned posix_acl:1;
+	unsigned cached_posix_acl:1;
 
 	/** Check permissions based on the file mode or not? */
 	unsigned default_permissions:1;
@@ -913,6 +913,8 @@ void fuse_release_nowrite(struct inode *inode);
 
 u64 fuse_get_attr_version(struct fuse_conn *fc);
 
+void fuse_forget_cached_acls(struct inode *inode);
+
 /**
  * File-system tells the kernel to invalidate cache for the given node id.
  */
@@ -974,7 +976,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
-extern const struct xattr_handler *fuse_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..0c3ccca7c554 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -313,6 +313,8 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		if (!fc->writeback_cache || !S_ISREG(attr->mode))
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
+		if (!fc->cached_posix_acl)
+			inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
@@ -331,6 +333,12 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	return inode;
 }
 
+void fuse_forget_cached_acls(struct inode *inode)
+{
+	if (get_fuse_conn(inode)->cached_posix_acl)
+		forget_all_cached_acls(inode);
+}
+
 int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 			     loff_t offset, loff_t len)
 {
@@ -343,7 +351,7 @@ int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 		return -ENOENT;
 
 	fuse_invalidate_attr(inode);
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	if (offset >= 0) {
 		pg_start = offset >> PAGE_SHIFT;
 		if (len <= 0)
@@ -915,8 +923,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->sb->s_time_gran = arg->time_gran;
 			if ((arg->flags & FUSE_POSIX_ACL)) {
 				fc->default_permissions = 1;
-				fc->posix_acl = 1;
-				fc->sb->s_xattr = fuse_acl_xattr_handlers;
+				fc->cached_posix_acl = 1;
 			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 0520a4f47226..48a95e1bb020 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -200,11 +200,6 @@ static const struct xattr_handler fuse_xattr_handler = {
 };
 
 const struct xattr_handler *fuse_xattr_handlers[] = {
-	&fuse_xattr_handler,
-	NULL
-};
-
-const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&posix_acl_access_xattr_handler,
 	&posix_acl_default_xattr_handler,
 	&fuse_xattr_handler,
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 5/7] fuse: Simplify the posix acl handling logic.
@ 2018-02-26 23:53           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

Rename the fuse connection flag posix_acl to cached_posix_acl as that
is what it actually means: fuse will cache and operate on the cached
value of the posix acl.

When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
so that get_acl and friends won't cache the acl values even if they
are called.

Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
wrapper only takes effect when cached_posix_acl is true to prevent
losing the nocache or noxattr status when posix acls are not
cached.

Always use posix_acl_access_xattr_handler so the fuse code
benefits from the generic posix acl handlers as much as possible.
This will become important as the code works on translation
of uid and gid in the posix acls when fuse is not mounted in
the initial user namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  6 +++---
 fs/fuse/dir.c    | 11 +++++------
 fs/fuse/fuse_i.h |  5 +++--
 fs/fuse/inode.c  | 13 ++++++++++---
 fs/fuse/xattr.c  |  5 -----
 5 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..8fb2153dbf50 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 	void *value = NULL;
 	struct posix_acl *acl;
 
-	if (!fc->posix_acl || fc->no_getxattr)
+	if (fc->no_getxattr)
 		return NULL;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -53,7 +53,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	const char *name;
 	int ret;
 
-	if (!fc->posix_acl || fc->no_setxattr)
+	if (fc->no_setxattr)
 		return -EOPNOTSUPP;
 
 	if (type == ACL_TYPE_ACCESS)
@@ -92,7 +92,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	} else {
 		ret = fuse_removexattr(inode, name);
 	}
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	fuse_invalidate_attr(inode);
 
 	return ret;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..a44ca509db4f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -237,7 +237,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 		if (ret || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
 			goto invalid;
 
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		fuse_change_attributes(inode, &outarg.attr,
 				       entry_attr_timeout(&outarg),
 				       attr_version);
@@ -930,7 +930,7 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file,
 	int err = 0;
 
 	if (time_before64(fi->i_time, get_jiffies_64())) {
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		err = fuse_do_getattr(inode, stat, file);
 	} else if (stat) {
 		generic_fillattr(inode, stat);
@@ -1076,7 +1076,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	return fuse_do_getattr(inode, NULL, NULL);
 }
 
@@ -1246,7 +1246,7 @@ static int fuse_direntplus_link(struct file *file,
 		fi->nlookup++;
 		spin_unlock(&fc->lock);
 
-		forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 		fuse_change_attributes(inode, &o->attr,
 				       entry_attr_timeout(o),
 				       attr_version);
@@ -1764,8 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr)
 		 * If filesystem supports acls it may have updated acl xattrs in
 		 * the filesystem, so forget cached acls for the inode.
 		 */
-		if (fc->posix_acl)
-			forget_all_cached_acls(inode);
+		fuse_forget_cached_acls(inode);
 
 		/* Directory mode changed, may need to revalidate access */
 		if (d_is_dir(entry) && (attr->ia_valid & ATTR_MODE))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..3cf296d60bc0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,7 +619,7 @@ struct fuse_conn {
 	unsigned no_lseek:1;
 
 	/** Does the filesystem support posix acls? */
-	unsigned posix_acl:1;
+	unsigned cached_posix_acl:1;
 
 	/** Check permissions based on the file mode or not? */
 	unsigned default_permissions:1;
@@ -913,6 +913,8 @@ void fuse_release_nowrite(struct inode *inode);
 
 u64 fuse_get_attr_version(struct fuse_conn *fc);
 
+void fuse_forget_cached_acls(struct inode *inode);
+
 /**
  * File-system tells the kernel to invalidate cache for the given node id.
  */
@@ -974,7 +976,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
 ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
 int fuse_removexattr(struct inode *inode, const char *name);
 extern const struct xattr_handler *fuse_xattr_handlers[];
-extern const struct xattr_handler *fuse_acl_xattr_handlers[];
 
 struct posix_acl;
 struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..0c3ccca7c554 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -313,6 +313,8 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		if (!fc->writeback_cache || !S_ISREG(attr->mode))
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
+		if (!fc->cached_posix_acl)
+			inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
@@ -331,6 +333,12 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	return inode;
 }
 
+void fuse_forget_cached_acls(struct inode *inode)
+{
+	if (get_fuse_conn(inode)->cached_posix_acl)
+		forget_all_cached_acls(inode);
+}
+
 int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 			     loff_t offset, loff_t len)
 {
@@ -343,7 +351,7 @@ int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 		return -ENOENT;
 
 	fuse_invalidate_attr(inode);
-	forget_all_cached_acls(inode);
+	fuse_forget_cached_acls(inode);
 	if (offset >= 0) {
 		pg_start = offset >> PAGE_SHIFT;
 		if (len <= 0)
@@ -915,8 +923,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->sb->s_time_gran = arg->time_gran;
 			if ((arg->flags & FUSE_POSIX_ACL)) {
 				fc->default_permissions = 1;
-				fc->posix_acl = 1;
-				fc->sb->s_xattr = fuse_acl_xattr_handlers;
+				fc->cached_posix_acl = 1;
 			}
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 0520a4f47226..48a95e1bb020 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -200,11 +200,6 @@ static const struct xattr_handler fuse_xattr_handler = {
 };
 
 const struct xattr_handler *fuse_xattr_handlers[] = {
-	&fuse_xattr_handler,
-	NULL
-};
-
-const struct xattr_handler *fuse_acl_xattr_handlers[] = {
 	&posix_acl_access_xattr_handler,
 	&posix_acl_default_xattr_handler,
 	&fuse_xattr_handler,
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns
  2018-02-26 23:52     ` Eric W. Biederman
@ 2018-02-26 23:53           ` Eric W. Biederman
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.
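
The translation and the two restrictions can be pictured with a small
userspace model.  This is a sketch under simplifying assumptions, not
the kernel implementation: a namespace is reduced to a single mapping
extent, and model_userns, model_make_kuid, model_from_kuid,
model_req_init_context and model_mount_allowed are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/*
 * One mapping extent: ids [first, first + count) inside the namespace
 * correspond to kernel ids [lower_first, lower_first + count).
 */
struct model_userns {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace-relative id -> kernel id, INVALID_ID if unmapped. */
static uint32_t model_make_kuid(const struct model_userns *ns, uint32_t uid)
{
	assert(ns);
	if (uid >= ns->first && uid - ns->first < ns->count)
		return ns->lower_first + (uid - ns->first);
	return INVALID_ID;
}

/* Kernel id -> namespace-relative id, INVALID_ID if unmapped. */
static uint32_t model_from_kuid(const struct model_userns *ns, uint32_t kuid)
{
	assert(ns);
	if (kuid >= ns->lower_first && kuid - ns->lower_first < ns->count)
		return ns->first + (kuid - ns->lower_first);
	return INVALID_ID;
}

struct model_req {
	uint32_t uid;	/* the id the userspace server gets to see */
};

/* A request is refused when the caller has no id in the namespace. */
static bool model_req_init_context(const struct model_userns *conn_ns,
				   uint32_t caller_kuid,
				   struct model_req *req)
{
	req->uid = model_from_kuid(conn_ns, caller_kuid);
	return req->uid != INVALID_ID;
}

/* The opener of /dev/fuse must be in the superblock's namespace. */
static bool model_mount_allowed(const struct model_userns *opener_ns,
				const struct model_userns *sb_ns)
{
	return opener_ns == sb_ns;
}
```

For a namespace mapping ids 0..65535 onto kernel ids starting at 100000,
a caller with kernel id 100000 is presented to the server as uid 0,
while a caller with kernel id 0 has no representation there and its
request is rejected.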

For cuse the userns used is that of the opener of /dev/cuse.  Semantically the
cuse support does not appear safe for unprivileged users.  Practically
the permissions on /dev/cuse only make it accessible to the global root
user.  If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Miklos Szeredi <mszeredi-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: <seth.forshee-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Dongsu Park <dongsu-lYLaGTFnO9sWenYVfaLwtA@public.gmane.org>
Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8fb2153dbf50..5a67c80e21d6 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index a44ca509db4f..79cca1687457 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3cf296d60bc0..eba0beea8634 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0c3ccca7c554..cd3d29610688 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -485,7 +485,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -521,7 +522,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -530,7 +531,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -573,8 +574,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -605,7 +606,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -629,6 +630,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -638,6 +640,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1068,7 +1071,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1093,8 +1096,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1102,7 +1109,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns
@ 2018-02-26 23:53           ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes.  Either restriction
could be relaxed in the future if needed.
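
As an illustration of the translation described above, here is a
simplified user-space model.  It is only a sketch and not the kernel
implementation: struct id_extent, struct id_map, model_make_kuid() and
model_from_kuid() are hypothetical names that only mimic the behaviour
of make_kuid()/from_kuid() for a namespace whose id map is a list of
extents.  An id without a mapping translates to (uint32_t)-1, which is
the condition fuse_req_init_context() checks for in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical extent: ids [ns_first, ns_first + count) as seen inside
 * the namespace correspond to [kernel_first, kernel_first + count).
 */
struct id_extent {
	uint32_t ns_first;
	uint32_t kernel_first;
	uint32_t count;
};

struct id_map {
	const struct id_extent *extents;
	size_t nr_extents;
};

/* Namespace-relative id -> kernel-internal id (models make_kuid()). */
static uint32_t model_make_kuid(const struct id_map *map, uint32_t id)
{
	assert(map);
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		if (id >= e->ns_first && id - e->ns_first < e->count)
			return e->kernel_first + (id - e->ns_first);
	}
	return MODEL_INVALID_ID;	/* no mapping in this namespace */
}

/* Kernel-internal id -> namespace-relative id (models from_kuid()). */
static uint32_t model_from_kuid(const struct id_map *map, uint32_t kid)
{
	assert(map);
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		if (kid >= e->kernel_first && kid - e->kernel_first < e->count)
			return e->ns_first + (kid - e->kernel_first);
	}
	return MODEL_INVALID_ID;	/* not representable in this namespace */
}
```

With a single extent { 0, 100000, 65536 }, id 1000 inside the namespace
corresponds to kernel id 101000, while kernel id 1000 has no
representation inside the namespace and maps to the invalid id.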

For cuse the userns used is that of the opener of /dev/cuse.
Semantically the cuse support does not appear safe for unprivileged
users.  Practically the permissions on /dev/cuse only make it accessible
to the global root user.  If something slips through the cracks in a
user namespace, the only users who will be able to use the cuse device
are those users mapped into the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem.  Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/acl.c    |  4 ++--
 fs/fuse/cuse.c   |  7 ++++++-
 fs/fuse/dev.c    |  4 ++--
 fs/fuse/dir.c    | 14 +++++++-------
 fs/fuse/fuse_i.h |  6 +++++-
 fs/fuse/inode.c  | 31 +++++++++++++++++++------------
 6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8fb2153dbf50..5a67c80e21d6 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
 		return ERR_PTR(-ENOMEM);
 	size = fuse_getxattr(inode, name, value, PAGE_SIZE);
 	if (size > 0)
-		acl = posix_acl_from_xattr(&init_user_ns, value, size);
+		acl = posix_acl_from_xattr(fc->user_ns, value, size);
 	else if ((size == 0) || (size == -ENODATA) ||
 		 (size == -EOPNOTSUPP && fc->no_getxattr))
 		acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		if (!value)
 			return -ENOMEM;
 
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
 		if (ret < 0) {
 			kfree(value);
 			return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
 #include <linux/stat.h>
 #include <linux/module.h>
 #include <linux/uio.h>
+#include <linux/user_namespace.h>
 
 #include "fuse_i.h"
 
@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
 	if (!cc)
 		return -ENOMEM;
 
-	fuse_conn_init(&cc->fc);
+	/*
+	 * Limit the cuse channel to requests that can
+	 * be represented in file->f_cred->user_ns.
+	 */
+	fuse_conn_init(&cc->fc, file->f_cred->user_ns);
 
 	fud = fuse_dev_alloc(&cc->fc);
 	if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
 
 static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
 {
-	req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
-	req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+	req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+	req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
 	req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
 
 	return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index a44ca509db4f..79cca1687457 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->ino = attr->ino;
 	stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	stat->nlink = attr->nlink;
-	stat->uid = make_kuid(&init_user_ns, attr->uid);
-	stat->gid = make_kgid(&init_user_ns, attr->gid);
+	stat->uid = make_kuid(fc->user_ns, attr->uid);
+	stat->gid = make_kgid(fc->user_ns, attr->gid);
 	stat->rdev = inode->i_rdev;
 	stat->atime.tv_sec = attr->atime;
 	stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
-			   bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+			   struct fuse_setattr_in *arg, bool trust_local_cmtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
 	if (ivalid & ATTR_MODE)
 		arg->valid |= FATTR_MODE,   arg->mode = iattr->ia_mode;
 	if (ivalid & ATTR_UID)
-		arg->valid |= FATTR_UID,    arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+		arg->valid |= FATTR_UID,    arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
 	if (ivalid & ATTR_GID)
-		arg->valid |= FATTR_GID,    arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+		arg->valid |= FATTR_GID,    arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
 	if (ivalid & ATTR_SIZE)
 		arg->valid |= FATTR_SIZE,   arg->size = iattr->ia_size;
 	if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+	iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3cf296d60bc0..eba0beea8634 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
 #include <linux/xattr.h>
 #include <linux/pid_namespace.h>
 #include <linux/refcount.h>
+#include <linux/user_namespace.h>
 
 /** Max number of pages that can be used in a single read request */
 #define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;
 
+	/** The user namespace for this mount */
+	struct user_namespace *user_ns;
+
 	/** Maximum read size */
 	unsigned max_read;
 
@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
 /**
  * Initialize fuse_conn
  */
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0c3ccca7c554..cd3d29610688 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_ino     = fuse_squash_ino(attr->ino);
 	inode->i_mode    = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
 	set_nlink(inode, attr->nlink);
-	inode->i_uid     = make_kuid(&init_user_ns, attr->uid);
-	inode->i_gid     = make_kgid(&init_user_ns, attr->gid);
+	inode->i_uid     = make_kuid(fc->user_ns, attr->uid);
+	inode->i_gid     = make_kgid(fc->user_ns, attr->gid);
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
@@ -485,7 +485,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
 	return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+			  struct user_namespace *user_ns)
 {
 	char *p;
 	memset(d, 0, sizeof(struct fuse_mount_data));
@@ -521,7 +522,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_USER_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->user_id = make_kuid(current_user_ns(), uv);
+			d->user_id = make_kuid(user_ns, uv);
 			if (!uid_valid(d->user_id))
 				return 0;
 			d->user_id_present = 1;
@@ -530,7 +531,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 		case OPT_GROUP_ID:
 			if (fuse_match_uint(&args[0], &uv))
 				return 0;
-			d->group_id = make_kgid(current_user_ns(), uv);
+			d->group_id = make_kgid(user_ns, uv);
 			if (!gid_valid(d->group_id))
 				return 0;
 			d->group_id_present = 1;
@@ -573,8 +574,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 	struct super_block *sb = root->d_sb;
 	struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-	seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
-	seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+	seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+	seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
 	if (fc->default_permissions)
 		seq_puts(m, ",default_permissions");
 	if (fc->allow_other)
@@ -605,7 +606,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 	fpq->connected = 1;
 }
 
-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
 {
 	memset(fc, 0, sizeof(*fc));
 	spin_lock_init(&fc->lock);
@@ -629,6 +630,7 @@ void fuse_conn_init(struct fuse_conn *fc)
 	fc->attr_version = 1;
 	get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	fc->user_ns = get_user_ns(user_ns);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
@@ -638,6 +640,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		put_pid_ns(fc->pid_ns);
+		put_user_ns(fc->user_ns);
 		fc->release(fc);
 	}
 }
@@ -1068,7 +1071,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 
 	sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-	if (!parse_fuse_opt(data, &d, is_bdev))
+	if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
 		goto err;
 
 	if (is_bdev) {
@@ -1093,8 +1096,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!file)
 		goto err;
 
-	if ((file->f_op != &fuse_dev_operations) ||
-	    (file->f_cred->user_ns != &init_user_ns))
+	/*
+	 * Require mount to happen from the same user namespace which
+	 * opened /dev/fuse to prevent potential attacks.
+	 */
+	if (file->f_op != &fuse_dev_operations ||
+	    file->f_cred->user_ns != sb->s_user_ns)
 		goto err_fput;
 
 	fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1102,7 +1109,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (!fc)
 		goto err_fput;
 
-	fuse_conn_init(fc);
+	fuse_conn_init(fc, sb->s_user_ns);
 	fc->release = fuse_free_conn;
 
 	fud = fuse_dev_alloc(fc);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant
  2018-02-26 23:52     ` Eric W. Biederman
                         ` (3 preceding siblings ...)
  2018-02-26 23:52       ` [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS Eric W. Biederman
@ 2018-02-26 23:53       ` Eric W. Biederman
  2018-03-02 21:58       ` [PATCH v8 0/6] fuse: mounts from non-init user namespaces Eric W. Biederman
  5 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-26 23:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-kernel, containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn,
	Eric W. Biederman

From: Seth Forshee <seth.forshee@canonical.com>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
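
The "same userns or a descendant" rule can be sketched in user space as
follows.  This is only a model under stated assumptions: struct
model_userns, model_in_userns() and model_allow_other() are hypothetical
names, and the parent/level fields merely imitate how a namespace
hierarchy can be walked.  The walk climbs from the caller's namespace
towards the root until it reaches the level of the mount's namespace and
then compares the two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace in a parent/child tree. */
struct model_userns {
	const struct model_userns *parent;	/* NULL for the initial namespace */
	int level;				/* 0 for the initial namespace */
};

/* True if child is ancestor itself or one of its descendants. */
static bool model_in_userns(const struct model_userns *ancestor,
			    const struct model_userns *child)
{
	assert(ancestor && child);
	while (child->level > ancestor->level)
		child = child->parent;
	return child == ancestor;
}

/*
 * Models the allow_other case after this patch: access is no longer
 * unconditional but depends on where the caller's namespace sits
 * relative to the namespace used at mount time.
 */
static bool model_allow_other(const struct model_userns *mount_ns,
			      const struct model_userns *caller_ns)
{
	return model_in_userns(mount_ns, caller_ns);
}
```

For a mount made in a container namespace, a caller in that namespace or
in one nested below it passes the check, while a caller in the initial
namespace or in a sibling namespace does not.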

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/fuse/dir.c           | 2 +-
 kernel/user_namespace.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 79cca1687457..0cbd1ff3dd48 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
 	const struct cred *cred;
 
 	if (fc->allow_other)
-		return 1;
+		return current_in_userns(fc->user_ns);
 
 	cred = current_cred();
 	if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
 {
 	return in_userns(target_ns, current_user_ns());
 }
+EXPORT_SYMBOL(current_in_userns);
 
 static inline struct user_namespace *to_user_ns(struct ns_common *ns)
 {
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
@ 2018-02-27  1:13             ` Linus Torvalds
  0 siblings, 0 replies; 219+ messages in thread
From: Linus Torvalds @ 2018-02-27  1:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

On Mon, Feb 26, 2018 at 3:52 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Additionally update the comment above the call to get_acl itself and
> remove the wrong information that an implementation of get_acl can
> prevent caching by calling forget_cached_acl.

This part is just confusing.

First off, that comment is correct: a filesystem _can_ prevent the
returning of cached data by just calling forget_cached_acl().

Note that there are two different cases: saying that you _never_ want
to cache things (ACL_DONT_CACHE) and saying that there _currently_ is
no cached data (ACL_NOT_CACHED).

forget_cached_acl() just removes the current cache.

You're just replacing one case of "no cached" information with the other.

Just explain the two cases, don't try to muddy the waters even more.

PLUS you are just confusing things entirely. That whole new comment of yours:

+        * ACL_DONT_CACHE is treated as another task updating the acl and
+        * remains set.

is just garbage.

The code is very clear - it will only replace an ACL_NOT_CACHED entry.
The code is clear:

        if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
                /* fall through */ ;

this is basically just an atomic "if *p == ACL_NOT_CACHED then replace
it with 'sentinel'".

Your comment does not add any clarity at all, and only confuses
things. It has nothing to do with "treated as another task updating
the acl".

The fact is, ACL_DONT_CACHE is treated as if the cache is simply
already filled - it's just filled with "no cache".

So the only thing special is ACL_NOT_CACHED, which is the only thing
we will try to _replace_.
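
A minimal user-space sketch of that compare-and-exchange step, using
C11 atomics instead of the kernel's cmpxchg().  The MODEL_* marker
values and model_claim_slot() are hypothetical and chosen only for
illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary, distinct marker values for this model only. */
#define MODEL_NOT_CACHED ((uintptr_t)1)	/* nothing cached right now */
#define MODEL_DONT_CACHE ((uintptr_t)2)	/* never cache for this inode */
#define MODEL_SENTINEL   ((uintptr_t)3)	/* "a lookup is in progress" */

/*
 * Atomically: if (*slot == MODEL_NOT_CACHED) *slot = sentinel.
 * Returns true only when the slot was replaced.  Any other value,
 * including MODEL_DONT_CACHE, is left untouched.
 */
static bool model_claim_slot(_Atomic uintptr_t *slot, uintptr_t sentinel)
{
	uintptr_t expected = MODEL_NOT_CACHED;

	assert(slot);
	return atomic_compare_exchange_strong(slot, &expected, sentinel);
}
```

A slot holding MODEL_NOT_CACHED ends up holding the sentinel; a slot
holding MODEL_DONT_CACHE is treated like any already-filled slot and
keeps its value.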

So NAK on this patch entirely. It's just adding confusion, not adding
clarifications.

                Linus

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
@ 2018-02-27  2:53                 ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-27  2:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn


So the purpose for having a patch in the first place is that
2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
which added ACL_DONT_CACHE did not result in any comment updates
to get_acl.

Which means that if you read the comments in get_acl() you
don't even think of ACL_DONT_CACHE.

Which means that this comment:
	/*
	 * If the ACL isn't being read yet, set our sentinel.  Otherwise, the
	 * current value of the ACL will not be ACL_NOT_CACHED and so our own
	 * sentinel will not be set; another task will update the cache.  We
	 * could wait for that other task to complete its job, but it's easier
	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
	 * be an unlikely race.)
	 */

presumes the only reason the acl could be anything other than
ACL_NOT_CACHED is because get_acl() is already being called upon it in
another task.

I wanted something to mention ACL_DONT_CACHE so someone would at least
think about that case if they ever step up to modify the code.

The code is perfectly clear, the comment is not.   That scares me.

And I had to read the code about a dozen times before I realized the
ACL_DONT_CACHE case even exists.   Not useful when I need to use
that to preserve historical fuse semantics.

So something is missing here even if my wording does not improve things.



Then we get this comment:
	/*
	 * Normally, the ACL returned by ->get_acl will be cached.
	 * A filesystem can prevent that by calling
	 * forget_cached_acl(inode, type) in ->get_acl.
	 */

Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes")
That comment is and always has been rubbish.

I don't have a clue what it is trying to say but it is not something
a person can use to write filesystem code with.


Truths:
- forget_cached_acl(inode, type) can be used to invalidate the acl
  cache.

- Calling forget_cached_acl from within the filesystem's ->get_acl
  method won't prevent a cached value from being returned because
  ->get_acl will be set.

- Calling forget_cached_acl from within the filesystem's ->get_acl
  method won't prevent a returned value from being cached
  because the caching happens after ->get_acl returns.

- Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
  a value from ->get_acl from being cached.
  

In summary I only care about two things.
1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking
   at the code, and people updating the code will have a hint that they
   need to consider that case.

2) That misleading, completely bogus comment being removed/fixed.


And yes I agree the code is clear.  The comments are not.


Does this look better as a comment updating patch?

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..5453094b8828 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
        struct posix_acl **p;
        struct posix_acl *acl;
 
+       /*
+        * To avoid caching the result of ->get_acl
+        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
+        */
+
        /*
         * The sentinel is used to detect when another operation like
         * set_cached_acl() or forget_cached_acl() races with get_acl().
@@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
                /* fall through */ ;
 
        /*
-        * Normally, the ACL returned by ->get_acl will be cached.
-        * A filesystem can prevent that by calling
-        * forget_cached_acl(inode, type) in ->get_acl.
+        * The ACL returned by ->get_acl will be cached.
         *
         * If the filesystem doesn't have a get_acl() function at all, we'll
         * just create the negative cache entry.

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  2:53                 ` Eric W. Biederman
@ 2018-02-27  3:14                     ` Eric W. Biederman
  -1 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-27  3:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Containers, Linux Kernel Mailing List,
	Seth Forshee, Alban Crequy, Sargun Dhillon, linux-fsdevel

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> So the purpose for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
> which added ACL_DONT_CACHE did not result in any comment updates
> to get_acl.
>
> Which means that if you read the comments in get_acl() that you
> don't even think of ACL_DONT_CACHE.
>
> Which means that this comment:
> 	/*
> 	 * If the ACL isn't being read yet, set our sentinel.  Otherwise, the
> 	 * current value of the ACL will not be ACL_NOT_CACHED and so our own
> 	 * sentinel will not be set; another task will update the cache.  We
> 	 * could wait for that other task to complete its job, but it's easier
> 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
> 	 * be an unlikely race.)
> 	 */
>
> Which presumes the only reason the acl could be anything other than
> ACL_NOT_CACHED is because get_acl() is already being called upon it in
> another task.
>
> I wanted something to mention ACL_DONT_CACHE so someone would at least
> think about that case if they ever step up to modify the code.
>
> The code is perfectly clear, the comment is not.   That scares me.
>
> And I had to read the code about a dozen times before I realized the
> ACL_DONT_CACHE case even exists.   Not useful when I need to use
> that to preserve historical fuse semantics.
>
> So something is missing here even if my wording does not improve things.
>
>
>
> Then we get this comment:
> 	/*
> 	 * Normally, the ACL returned by ->get_acl will be cached.
> 	 * A filesystem can prevent that by calling
> 	 * forget_cached_acl(inode, type) in ->get_acl.
> 	 */
>
> Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes")
> That comment is and always has been rubbish.
>
> I don't have a clue what it is trying to say but it is not something
> a person can use to write filesystem code with.
>
>
> Truths:
> - forget_cached_acl(inode, type) can be used to invalidate the acl
>   cache.
>
> - Calling forget_cached_acl from within the filesystem's ->get_acl
>   method won't prevent a cached value from being returned because
>   ->get_acl will be set.
>
> - Calling forget_cached_acl from within the filesystem's ->get_acl
>   method won't prevent a returned value from being cached
>   because the caching happens after ->get_acl returns.

Sigh.  Yes it will because we set the special sentinel value,
and forget_cached_acl will replace the sentinel value with
ACL_NOT_CACHED.

It is a terribly brittle and racy thing to do, and it probably won't
work to say cache this acl but not this one on a case by case basis
in ->get_acl.

As such I believe that usage of forget_cached_acl should be subsumed by
using ACL_DONT_CACHE.  If not we should really come up with a different
helper function name to call from ->get_acl.  Preferably one that does
"cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.
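
To make the sentinel handshake concrete, here is a minimal userspace
sketch. It is not the kernel code: it assumes C11 atomics as a stand-in
for cmpxchg(), and struct slot, SLOT_NOT_CACHED and every slot_*() name
are hypothetical and exist only for illustration. It shows why an
unconditional forget makes the later "publish" step fail, and what a
conditional clear of the sentinel would look like.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: an odd address can never alias a real object. */
#define SLOT_NOT_CACHED ((void *)(uintptr_t)1)

/* Hypothetical model of one cached-ACL pointer in an inode. */
struct slot {
	_Atomic(void *) p;
};

static void slot_init(struct slot *s)
{
	atomic_init(&s->p, SLOT_NOT_CACHED);
}

/* Build a per-caller sentinel by setting the low bit of an aligned address. */
static void *make_sentinel(void *owner)
{
	assert(((uintptr_t)owner & 1u) == 0);
	return (void *)((uintptr_t)owner | 1u);
}

/* Step 1: claim the slot, NOT_CACHED -> sentinel. */
static bool slot_claim(struct slot *s, void *sentinel)
{
	void *expected = SLOT_NOT_CACHED;

	return atomic_compare_exchange_strong(&s->p, &expected, sentinel);
}

/* Step 2: publish the value, but only if our sentinel is still there. */
static bool slot_publish(struct slot *s, void *sentinel, void *value)
{
	void *expected = sentinel;

	return atomic_compare_exchange_strong(&s->p, &expected, value);
}

/* Model of an unconditional forget: overwrites whatever is in the slot. */
static void slot_forget(struct slot *s)
{
	atomic_store(&s->p, SLOT_NOT_CACHED);
}

/* Model of the conditional clear: only removes our own sentinel. */
static bool slot_release_claim(struct slot *s, void *sentinel)
{
	void *expected = sentinel;

	return atomic_compare_exchange_strong(&s->p, &expected, SLOT_NOT_CACHED);
}
```

In this model, calling slot_forget() between slot_claim() and
slot_publish() makes the publish fail, so nothing gets cached.
slot_release_claim() gives the same "don't cache" result but leaves the
slot alone if some other path stored a value in the meantime.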


> - Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
>   a value from ->get_acl from being cached.
>   
>
> In summary I only care about two things.
> 1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking
>    at the code, and people updating the code will have a hint that they
>    need to consider that case.
>
> 2) That misleading completely bogus comment being removed/fixed.
>
>
> And yes I agree the code is clear.  The comments are not.
>
>
> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>         struct posix_acl **p;
>         struct posix_acl *acl;
>  
> +       /*
> +        * To avoid caching the result of ->get_acl
> +        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> +        */
> +
>         /*
>          * The sentinel is used to detect when another operation like
>          * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>                 /* fall through */ ;
>  
>         /*
> -        * Normally, the ACL returned by ->get_acl will be cached.
> -        * A filesystem can prevent that by calling
> -        * forget_cached_acl(inode, type) in ->get_acl.
> +        * The ACL returned by ->get_acl will be cached.
>          *
>          * If the filesystem doesn't have a get_acl() function at all, we'll
>          * just create the negative cache entry.
>
> Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
@ 2018-02-27  3:14                     ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-02-27  3:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

ebiederm@xmission.com (Eric W. Biederman) writes:

> So the purpose for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
> which added ACL_DONT_CACHE did not result in any comment updates
> to get_acl.
>
> Which means that if you read the comments in get_acl() that you
> don't even think of ACL_DONT_CACHE.
>
> Which means that this comment:
> 	/*
> 	 * If the ACL isn't being read yet, set our sentinel.  Otherwise, the
> 	 * current value of the ACL will not be ACL_NOT_CACHED and so our own
> 	 * sentinel will not be set; another task will update the cache.  We
> 	 * could wait for that other task to complete its job, but it's easier
> 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
> 	 * be an unlikely race.)
> 	 */
>
> Which presumes the only reason the acl could be anything other than
> ACL_NOT_CACHED is because get_acl() is already being called upon it in
> another task.
>
> I wanted something to mention ACL_DONT_CACHE so someone would at least
> think about that case if they ever step up to modify the code.
>
> The code is perfectly clear, the comment is not.   That scares me.
>
> And I had to read the code about a dozen times before I realized the
> ACL_DONT_CACHE case even exists.   Not useful when I need to use
> that to preserve historical fuse semantics.
>
> So something is missing here even if my wording does not improve things.
>
>
>
> Then we get this comment:
> 	/*
> 	 * Normally, the ACL returned by ->get_acl will be cached.
> 	 * A filesystem can prevent that by calling
> 	 * forget_cached_acl(inode, type) in ->get_acl.
> 	 */
>
> Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes")
> That comment is and always has been rubbish.
>
> I don't have a clue what it is trying to say but it is not something
> a person can use to write filesystem code with.
>
>
> Truths:
> - forget_cached_acl(inode, type) can be used to invalidate the acl
>   cache.
>
> - Calling forget_cached_acl from within the filesystem's ->get_acl
>   method won't prevent a cached value from being returned because
>   ->get_acl will be set.
>
> - Calling forget_cached_acl from within the filesystem's ->get_acl
>   method won't prevent a returned value from being cached
>   because the caching happens after ->get_acl returns.

Sigh.  Yes it will because we set the special sentinel value,
and forget_cached_acl will replace the sentinel value with
ACL_NOT_CACHED.

It is a terribly brittle and racy thing to do, and it probably won't
work to say cache this acl but not this one on a case by case basis
in ->get_acl.

As such I believe that usage of forget_cached_acl should be subsumed by
using ACL_DONT_CACHE.  If not we should really come up with a different
helper function name to call from ->get_acl.  Preferably one that does
"cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.


> - Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
>   a value from ->get_acl from being cached.
>   
>
> In summary I only care about two things.
> 1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking
>    at the code, and people updating the code will have a hint that they
>    need to consider that case.
>
> 2) That misleading completely bogus comment being removed/fixed.
>
>
> And yes I agree the code is clear.  The comments are not.
>
>
> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>         struct posix_acl **p;
>         struct posix_acl *acl;
>  
> +       /*
> +        * To avoid caching the result of ->get_acl
> +        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> +        */
> +
>         /*
>          * The sentinel is used to detect when another operation like
>          * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>                 /* fall through */ ;
>  
>         /*
> -        * Normally, the ACL returned by ->get_acl will be cached.
> -        * A filesystem can prevent that by calling
> -        * forget_cached_acl(inode, type) in ->get_acl.
> +        * The ACL returned by ->get_acl will be cached.
>          *
>          * If the filesystem doesn't have a get_acl() function at all, we'll
>          * just create the negative cache entry.
>
> Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  2:53                 ` Eric W. Biederman
@ 2018-02-27  3:36                     ` Linus Torvalds
  -1 siblings, 0 replies; 219+ messages in thread
From: Linus Torvalds @ 2018-02-27  3:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Containers, Linux Kernel Mailing List,
	Seth Forshee, Alban Crequy, Sargun Dhillon, linux-fsdevel

On Mon, Feb 26, 2018 at 6:53 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> So the purpose for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
> which added ACL_DONT_CACHE did not result in any comment updates
> to get_acl.

I'm not opposed to just updating the comments.

I just think your updates were somewhat misleading.

> Which means that if you read the comments in get_acl() that you
> don't even think of ACL_DONT_CACHE.

Right. By all means add a comment about ACL_DONT_CACHE disabling the
cache entirely.

But don't _remove_ the other valid way to flush the cache, and don't
make that comment above cmpxchg() be even more confusing than the code
is.

> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>         struct posix_acl **p;
>         struct posix_acl *acl;
>
> +       /*
> +        * To avoid caching the result of ->get_acl
> +        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> +        */
> +
>         /*
>          * The sentinel is used to detect when another operation like
>          * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>                 /* fall through */ ;
>
>         /*
> -        * Normally, the ACL returned by ->get_acl will be cached.
> -        * A filesystem can prevent that by calling
> -        * forget_cached_acl(inode, type) in ->get_acl.
> +        * The ACL returned by ->get_acl will be cached.

Why do you hate forget_cached_acl()?

It's perfectly valid too. Don't remove that comment. Maybe reword it
to talk not about "preventing", but about "invalidating the cache".

But the old comment that you remove isn't _wrong_, it's just that the
"preventing" from returning the cached state with forget_cached_acl()
is just a one-time thing.

So forget_cached_acl() exists, and it works, and it does exactly what
its name says. It is a perfectly valid way to prevent the current
entry from being used in the future.

See? I object to you removing that, and trying to make it be like
ACL_DONT_CACHE is the *only* way to not cache something.

Because honestly, that's what your comment updates do. They take the
comments about _one_ case, and switch it over to be about the _other_
case.

But dammit, there are _two_ ways to not cache things.

"Fixing" the comment to talk about one and removing the other isn't a
fix. It's just a stupid change that now has the problem the other way
around!

So fix the comment to really just talk about both things.

First: talk about how to avoid caching entirely (ACL_DONT_CACHE).
Then, talk about how to invalidate the cache once it has been
instantiated (forget_cached_acl()).
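
As a rough illustration of those two cases, here is a single-threaded
userspace sketch. It is only a model, not fs/posix_acl.c: struct
model_inode, the MODEL_* markers and the model_*() functions are all
hypothetical names, the marker values are arbitrary odd addresses, and
the locking and sentinel logic of the real get_acl() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical markers: odd addresses that can never be a real object. */
#define MODEL_NOT_CACHED ((void *)(uintptr_t)1)
#define MODEL_DONT_CACHE ((void *)(uintptr_t)3)

/* Hypothetical inode: one cache slot plus the filesystem's real value. */
struct model_inode {
	void *slot;     /* cached value or one of the markers above */
	void *backing;  /* what the filesystem would return */
	int fetches;    /* how often the filesystem was asked */
};

static bool model_is_marker(const void *p)
{
	return ((uintptr_t)p & 1u) != 0;
}

/* Stand-in for the filesystem's ->get_acl method. */
static void *model_fetch(struct model_inode *ino)
{
	assert(!model_is_marker(ino->backing));
	ino->fetches++;
	return ino->backing;
}

/* Stand-in for the generic lookup path. */
static void *model_get(struct model_inode *ino)
{
	void *cur = ino->slot;
	void *val;

	if (!model_is_marker(cur))
		return cur;             /* cache hit */

	val = model_fetch(ino);
	if (cur == MODEL_NOT_CACHED)
		ino->slot = val;        /* instantiate the cache */
	/* MODEL_DONT_CACHE: the slot is left alone, nothing is ever cached */
	return val;
}

/* Stand-in for invalidation: drop the entry so the next lookup refetches. */
static void model_forget(struct model_inode *ino)
{
	ino->slot = MODEL_NOT_CACHED;
}
```

With a slot that starts as MODEL_NOT_CACHED, two lookups cost one fetch,
and model_forget() forces exactly one more. With MODEL_DONT_CACHE every
lookup goes to the filesystem. That is the difference between
invalidating an instantiated entry and never caching at all.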

Don't do this idiotic "remove the valid comment just because you
happened to care about the _other_ case"


              Linus

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
@ 2018-02-27  3:36                     ` Linus Torvalds
  0 siblings, 0 replies; 219+ messages in thread
From: Linus Torvalds @ 2018-02-27  3:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

On Mon, Feb 26, 2018 at 6:53 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> So the purpose for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
> which added ACL_DONT_CACHE did not result in any comment updates
> to get_acl.

I'm not opposed to just updating the comments.

I just think your updates were somewhat misleading.

> Which means that if you read the comments in get_acl() that you
> don't even think of ACL_DONT_CACHE.

Right. By all means add a comment about ACL_DONT_CACHE disabling the
cache entirely.

But don't _remove_ the other valid way to flush the cache, and don't
make that comment above cmpxchg() be even more confusing than the code
is.

> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>         struct posix_acl **p;
>         struct posix_acl *acl;
>
> +       /*
> +        * To avoid caching the result of ->get_acl
> +        * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> +        */
> +
>         /*
>          * The sentinel is used to detect when another operation like
>          * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
>                 /* fall through */ ;
>
>         /*
> -        * Normally, the ACL returned by ->get_acl will be cached.
> -        * A filesystem can prevent that by calling
> -        * forget_cached_acl(inode, type) in ->get_acl.
> +        * The ACL returned by ->get_acl will be cached.

Why do you hate forget_cached_acl()?

It's perfectly valid too. Don't remove that comment. Maybe reword it
to talk not about "preventing", but about "invalidating the cache".

But the old comment that you remove isn't _wrong_, it's just that the
"preventing" from returning the cached state with forget_cached_acl()
is just a one-time thing.

So forget_cached_acl() exists, and it works, and it does exactly what
its name says. It is a perfectly valid way to prevent the current
entry from being used in the future.

See? I object to you removing that, and trying to make it be like
ACL_DONT_CACHE is the *only* way to not cache something.

Because honestly, that's what your comment updates do. They take the
comments about _one_ case, and switch it over to be about the _other_
case.

But dammit, there are _two_ ways to not cache things.

"Fixing" the comment to talk about one and removing the other isn't a
fix. It's just a stupid change that now has the problem the other way
around!

So fix the comment to really just talk about both things.

First: talk about how to avoid caching entirely (ACL_DONT_CACHE).
Then, talk about how to invalidate the cache once it has been
instantiated (forget_cached_acl()).

Don't do this idiotic "remove the valid comment just because you
happened to care about the _other_ case"


              Linus

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
       [not found]                     ` <87tvu3qg2b.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-02-27  3:41                       ` Linus Torvalds
  0 siblings, 0 replies; 219+ messages in thread
From: Linus Torvalds @ 2018-02-27  3:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Containers, Linux Kernel Mailing List,
	Seth Forshee, Alban Crequy, Sargun Dhillon, linux-fsdevel

On Mon, Feb 26, 2018 at 7:14 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> As such I believe that usage of forget_cached_acl should be subsumed by
> using ACL_DONT_CACHE.  If not we should really come up with a different
> helper function name to call from ->get_acl.  Preferably one that does
> "cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.

You make your bias very clear, by simply trying to hide the other case.

But for chrissake, that's not the state right now. That other case
exists. You can't - and shouldn't - try to just hide it.

Besides, that "forget_cached_acl()" approach actually has a valid use
case. Maybe you _do_ want to cache ACL's, but with a timeout or
revalidation.

ACL_DONT_CACHE really is a big hammer that makes caching not work at
all. It's not necessarily the right thing to do at all.

                Linus

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
  2018-02-27  3:14                     ` Eric W. Biederman
  (?)
  (?)
@ 2018-02-27  3:41                     ` Linus Torvalds
       [not found]                       ` <CA+55aFwPo7Pbq+3Oup-oo8MUFHeEpFXp7qr6z2PrzKp7S0ON+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-03-02 19:53                       ` Eric W. Biederman
  -1 siblings, 2 replies; 219+ messages in thread
From: Linus Torvalds @ 2018-02-27  3:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn

On Mon, Feb 26, 2018 at 7:14 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> As such I believe that usage of forget_cached_acl should be subsumed by
> using ACL_DONT_CACHE.  If not we should really come up with a different
> helper function name to call from ->get_acl.  Preferably one that does
> "cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.

You make your bias very clear, by simply trying to hide the other case.

But for chrissake, that's not the state right now. That other case
exists. You can't - and shouldn't - try to just hide it.

Besides, that "forget_cached_acl()" approach actually has a valid use
case. Maybe you _do_ want to cache ACL's, but with a timeout or
revalidation.

ACL_DONT_CACHE really is a big hammer that makes caching not work at
all. It's not necessarily the right thing to do at all.

                Linus

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.
       [not found]           ` <20180226235302.12708-5-ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-02-27  9:00             ` Miklos Szeredi
  0 siblings, 0 replies; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-27  9:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Rename the fuse connection flag posix_acl to cached_posix_acl as that
> is what it actually means.  That fuse will cache and operate on the
> cached value of the posix acl.
>
> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
> so that get_acl and friends won't cache the acl values even if they
> are called.
>
> Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
> wrapper only takes effect when cached_posix_acl is true to prevent
> losing the nocache or noxattr status when posix acls are not
> cached.

Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE?  I think
it makes sense to generally not clear ACL_DONT_CACHE, since it's not
an actual acl value that needs forgetting.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.
  2018-02-26 23:53           ` Eric W. Biederman
  (?)
  (?)
@ 2018-02-27  9:00           ` Miklos Szeredi
       [not found]             ` <CAOssrKeWvYpgj4_cgsRBL_kTOHyRS-9_mfO9JHP-JahgqFnfHQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 219+ messages in thread
From: Miklos Szeredi @ 2018-02-27  9:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: lkml, Linux Containers, linux-fsdevel, Alban Crequy,
	Seth Forshee, Sargun Dhillon, Dongsu Park, Serge E. Hallyn

On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Rename the fuse connection flag posix_acl to cached_posix_acl as that
> is what it actually means.  That fuse will cache and operate on the
> cached value of the posix acl.
>
> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
> so that get_acl and friends won't cache the acl values even if they
> are called.
>
> Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
> wrapper only takes effect when cached_posix_acl is true to prevent
> losing the nocache or noxattr status when posix acls are not
> cached.

Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE?  I think
it makes sense to generally not clear ACL_DONT_CACHE, since it's not
an actual acl value that needs forgetting.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [RFC][PATCH] fs/posix_acl: Update the comments and support lightweight cache skipping
       [not found]                       ` <CA+55aFwPo7Pbq+3Oup-oo8MUFHeEpFXp7qr6z2PrzKp7S0ON+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-03-02 19:53                         ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-03-02 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Containers, Linux Kernel Mailing List,
	Seth Forshee, Alban Crequy, Sargun Dhillon, linux-fsdevel


The code has been missing a way for a ->get_acl method to not cache
a return value without risking invalidating a cached value
that was set while get_acl() was returning.

Add that support by implementing to_uncacheable_acl, to_cacheable_acl,
is_uncacheable_acl, and dealing with uncacheable acls in get_acl().

Update the comments so that they are a little clearer about
what is going on in get_acl().

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

Linus, my issue with the forget_cached_acl case was really that it was
too big of a hammer.  If you care about caching acls only sometimes,
forget_cached_acl called from ->get_acl can stomp the acl you
explicitly cached with set_cached_acl.

With this change I can unify the legacy horrible fuse posix acl case
that requires not caching acls with a single if statement in the get_acl
method. AKA:

+	if (!IS_ERR(acl) && !fc->posix_acl)
+		acl = to_uncacheable_acl(acl);
 	return acl;

That code I know is locally correct even if later fuse decides to cache
negative acls when the underlying filesystem does not support xattrs.
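
For readers unfamiliar with the low-bit tagging trick the helpers rely
on, here is a small userspace sketch. It assumes the tagged object is
at least 2-byte aligned so bit 0 of its address is free. struct
acl_model, tag_*(), model_fetch() and model_receive() are hypothetical
names for illustration only. Note that error pointers are typically
odd-valued too, which is why the patch tests IS_ERR() before
is_uncacheable_acl().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct posix_acl; the int gives it alignment > 1. */
struct acl_model {
	int refcount;
};

/* True when the low bit carries the "do not cache" tag. */
static bool tag_is_set(const void *p)
{
	return ((uintptr_t)p & (uintptr_t)1) != 0;
}

/* Set the tag; the pointer must not be dereferenced while tagged. */
static void *tag_set(void *p)
{
	return (void *)((uintptr_t)p | (uintptr_t)1);
}

/* Strip the tag and recover the original pointer. */
static void *tag_clear(void *p)
{
	return (void *)((uintptr_t)p & ~(uintptr_t)1);
}

/* Hypothetical callee: decides per call whether the result may be cached. */
static void *model_fetch(struct acl_model *acl, bool cacheable)
{
	return cacheable ? (void *)acl : tag_set(acl);
}

/* Hypothetical caller: records the decision and always hands back a usable pointer. */
static struct acl_model *model_receive(void *ret, bool *should_cache)
{
	*should_cache = !tag_is_set(ret);
	return tag_clear(ret);
}
```

The point of the encoding is that the "don't cache this one" decision
travels with the return value itself, so the callee never has to touch
the cache slot and cannot stomp on a value someone else stored there.
A tagged NULL also round-trips back to NULL in this model.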

 fs/posix_acl.c            | 56 ++++++++++++++++++++++++++++++++++-------------
 include/linux/posix_acl.h | 17 ++++++++++++++
 2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..e58a68e18603 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -96,12 +96,16 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 {
 	void *sentinel;
 	struct posix_acl **p;
-	struct posix_acl *acl;
+	struct posix_acl *acl, *to_cache;
 
 	/*
 	 * The sentinel is used to detect when another operation like
 	 * set_cached_acl() or forget_cached_acl() races with get_acl().
 	 * It is guaranteed that is_uncached_acl(sentinel) is true.
+	 *
+	 * This is sufficient to prevent races between ->set_acl
+	 * calling set_cached_acl (outside of filesystem specific
+	 * locking) and get_acl() caching the returned acl.
 	 */
 
 	acl = get_cached_acl(inode, type);
@@ -126,12 +130,18 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 		/* fall through */ ;
 
 	/*
-	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * Normally, the ACL returned by ->get_acl() will be cached.
+	 *
+	 * A filesystem can prevent the acl returned by ->get_acl()
+	 * from being cached by wrapping it with to_uncacheable_acl().
+	 *
+	 * A filesystem can at any time affect the cache directly and
+	 * cause in-progress calls of get_acl() not to update the cache
+	 * by calling forget_cached_acl(inode, type) or
+	 * set_cached_acl(inode, type, acl).
 	 *
-	 * If the filesystem doesn't have a get_acl() function at all, we'll
-	 * just create the negative cache entry.
+	 * If the filesystem doesn't have a ->get_acl() function at
+	 * all, we'll just create the negative cache entry.
 	 */
 	if (!inode->i_op->get_acl) {
 		set_cached_acl(inode, type, NULL);
@@ -139,21 +149,37 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	}
 	acl = inode->i_op->get_acl(inode, type);
 
+
+	/* To keep the logic simple default to not caching an acl when
+	 * the sentinel is cleared.
+	 */
+	to_cache = ACL_NOT_CACHED;
 	if (IS_ERR(acl)) {
-		/*
-		 * Remove our sentinel so that we don't block future attempts
-		 * to cache the ACL.
+		/* Clears the sentinel so that we don't block future
+		 * attempts to cache the ACL, and return an error.
 		 */
-		cmpxchg(p, sentinel, ACL_NOT_CACHED);
-		return acl;
+	}
+	else if (is_uncacheable_acl(acl)) {
+		/* Clears the sentinel so that we don't block future
+		 * attempts to cache the ACL, and return a valid ACL.
+		 */
+		acl = to_cacheable_acl(acl);
+	}
+	else {
+		to_cache = acl;
+		posix_acl_dup(to_cache);
 	}
 
 	/*
-	 * Cache the result, but only if our sentinel is still in place.
+	 * Remove the sentinel and replace it with the value that
+	 * needs to be cached, but only if the sentinel is still in
+	 * place.
 	 */
-	posix_acl_dup(acl);
-	if (unlikely(cmpxchg(p, sentinel, acl) != sentinel))
-		posix_acl_release(acl);
+	if (unlikely(cmpxchg(p, sentinel, to_cache) != sentinel)) {
+		if (!is_uncached_acl(to_cache))
+			posix_acl_release(to_cache);
+	}
+
 	return acl;
 }
 EXPORT_SYMBOL(get_acl);
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 540595a321a7..3be8929b9f48 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -56,6 +56,23 @@ posix_acl_release(struct posix_acl *acl)
 		kfree_rcu(acl, a_rcu);
 }
 
+/*
+ * Allow for acls returned from ->get_acl() to not be cached.
+ */
+static inline bool is_uncacheable_acl(struct posix_acl *acl)
+{
+	return ((unsigned long)acl) & 1UL;
+}
+
+static inline struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) | 1UL);
+}
+
+static inline struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) & ~1UL);
+}
 
 /* posix_acl.c */
 
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [RFC][PATCH] fs/posix_acl: Update the comments and support lightweight cache skipping
  2018-02-27  3:41                     ` Linus Torvalds
       [not found]                       ` <CA+55aFwPo7Pbq+3Oup-oo8MUFHeEpFXp7qr6z2PrzKp7S0ON+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-03-02 19:53                       ` Eric W. Biederman
  1 sibling, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-03-02 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Linux Containers,
	linux-fsdevel, Alban Crequy, Seth Forshee, Sargun Dhillon,
	Dongsu Park, Serge E. Hallyn


The code has been missing a way for a ->get_acl method to not cache
a return value without risking invalidating a cached value
that was set while get_acl() was returning.

Add that support by implementing to_uncacheable_acl, to_cacheable_acl,
is_uncacheable_acl, and dealing with uncacheable acls in get_acl().

Update the comments so that they are a little clearer about
what is going on in get_acl().

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Linus, my issue with the forget_cached_acl case was really that it was
too big of a hammer.  If you care about caching acls only sometimes,
forget_cached_acl called from ->get_acl can stomp the acl you
explicitly cached with set_cached_acl.

With this change I can unify the legacy horrible fuse posix acl case
that requires not caching acls with a single if statement in the get_acl
method. AKA:

+	if (!IS_ERR(acl) && !fc->posix_acl)
+		acl = to_uncacheable_acl(acl);
 	return acl;

That code I know is locally correct even if later fuse decides to cache
negative acls when the underlying filesystem does not support xattrs.

 fs/posix_acl.c            | 56 ++++++++++++++++++++++++++++++++++-------------
 include/linux/posix_acl.h | 17 ++++++++++++++
 2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..e58a68e18603 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -96,12 +96,16 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 {
 	void *sentinel;
 	struct posix_acl **p;
-	struct posix_acl *acl;
+	struct posix_acl *acl, *to_cache;
 
 	/*
 	 * The sentinel is used to detect when another operation like
 	 * set_cached_acl() or forget_cached_acl() races with get_acl().
 	 * It is guaranteed that is_uncached_acl(sentinel) is true.
+	 *
+	 * This is sufficient to prevent races between ->set_acl
+	 * calling set_cached_acl (outside of filesystem specific
+	 * locking) and get_acl() caching the returned acl.
 	 */
 
 	acl = get_cached_acl(inode, type);
@@ -126,12 +130,18 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 		/* fall through */ ;
 
 	/*
-	 * Normally, the ACL returned by ->get_acl will be cached.
-	 * A filesystem can prevent that by calling
-	 * forget_cached_acl(inode, type) in ->get_acl.
+	 * Normally, the ACL returned by ->get_acl() will be cached.
+	 *
+	 * A filesystem can prevent the acl returned by ->get_acl()
+	 * from being cached by wrapping it with to_uncacheable_acl().
+	 *
+	 * A filesystem can at any time affect the cache directly and
+	 * cause in-progress calls of get_acl() not to update the cache
+	 * by calling forget_cached_acl(inode, type) or
+	 * set_cached_acl(inode, type, acl).
 	 *
-	 * If the filesystem doesn't have a get_acl() function at all, we'll
-	 * just create the negative cache entry.
+	 * If the filesystem doesn't have a ->get_acl() function at
+	 * all, we'll just create the negative cache entry.
 	 */
 	if (!inode->i_op->get_acl) {
 		set_cached_acl(inode, type, NULL);
@@ -139,21 +149,37 @@ struct posix_acl *get_acl(struct inode *inode, int type)
 	}
 	acl = inode->i_op->get_acl(inode, type);
 
+
+	/* To keep the logic simple default to not caching an acl when
+	 * the sentinel is cleared.
+	 */
+	to_cache = ACL_NOT_CACHED;
 	if (IS_ERR(acl)) {
-		/*
-		 * Remove our sentinel so that we don't block future attempts
-		 * to cache the ACL.
+		/* Clears the sentinel so that we don't block future
+		 * attempts to cache the ACL, and return an error.
 		 */
-		cmpxchg(p, sentinel, ACL_NOT_CACHED);
-		return acl;
+	}
+	else if (is_uncacheable_acl(acl)) {
+		/* Clears the sentinel so that we don't block future
+		 * attempts to cache the ACL, and return a valid ACL.
+		 */
+		acl = to_cacheable_acl(acl);
+	}
+	else {
+		to_cache = acl;
+		posix_acl_dup(to_cache);
 	}
 
 	/*
-	 * Cache the result, but only if our sentinel is still in place.
+	 * Remove the sentinel and replace it with the value that
+	 * needs to be cached, but only if the sentinel is still in
+	 * place.
 	 */
-	posix_acl_dup(acl);
-	if (unlikely(cmpxchg(p, sentinel, acl) != sentinel))
-		posix_acl_release(acl);
+	if (unlikely(cmpxchg(p, sentinel, to_cache) != sentinel)) {
+		if (!is_uncached_acl(to_cache))
+			posix_acl_release(to_cache);
+	}
+
 	return acl;
 }
 EXPORT_SYMBOL(get_acl);
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 540595a321a7..3be8929b9f48 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -56,6 +56,23 @@ posix_acl_release(struct posix_acl *acl)
 		kfree_rcu(acl, a_rcu);
 }
 
+/*
+ * Allow for acls returned from ->get_acl() to not be cached.
+ */
+static inline bool is_uncacheable_acl(struct posix_acl *acl)
+{
+	return ((unsigned long)acl) & 1UL;
+}
+
+static inline struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) | 1UL);
+}
+
+static inline struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
+{
+	return (struct posix_acl *)(((unsigned long)acl) & ~1UL);
+}
 
 /* posix_acl.c */
 
-- 
2.14.1

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.
  2018-02-27  9:00           ` Miklos Szeredi
@ 2018-03-02 21:49                 ` Eric W. Biederman
  0 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux Containers, lkml, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel

Miklos Szeredi <mszeredi@redhat.com> writes:

> On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Rename the fuse connection flag posix_acl to cached_posix_acl as that
>> is what it actually means.  That fuse will cache and operate on the
>> cached value of the posix acl.
>>
>> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
>> so that get_acl and friends won't cache the acl values even if they
>> are called.
>>
>> Replace forget_all_cached_acls with fuse_forget_cached_acls.  This
>> wrapper only takes effect when cached_posix_acl is true to prevent
>> losing the nocache or noxattr status when posix acls are not
>> cached.
>
> Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE?  I think
> it makes sense to generally not clear ACL_DONT_CACHE, since it's not
> an actual acl value that needs forgetting.

After stopping to make certain I understand the issues, I don't think
it makes sense to teach forget_cached_acl about ACL_DONT_CACHE.

If you are forgetting a cached attribute, ACL_DONT_CACHE simply doesn't
make sense.

Further, it makes sense to cache a negative result for fuse when
!fc->no_getxattr, even if you would ordinarily not cache posix acls.

So I think the better plan is to teach the posix acl code how to not
cache results on a case-by-case basis, as I did in the RFC patch I
posted a little earlier today.  That works with forget_cached_acl and it
supports local reasoning.  Further, while the performance might not be
as good as ACL_DONT_CACHE, I don't think that matters, as always going
to the fuse server to get acls is almost certainly going to dominate the
acl costs.
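
To illustrate why the case-by-case approach composes with
forget_cached_acl and set_cached_acl, here is a rough userspace sketch of
the sentinel-plus-cmpxchg flow, written with C11 atomics.  It is only a
model under stated assumptions: struct demo_inode, demo_get() and the
fetch callbacks are hypothetical names, reference counting and error
pointers are left out, and a racing set_cached_acl() is simulated by a
callback that stores into the slot during the fetch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Plays the role of ACL_NOT_CACHED: an odd value that is never a real pointer. */
#define DEMO_NOT_CACHED ((void *)(uintptr_t)-1)

struct demo_inode {
	_Atomic(void *) slot;	/* the per-inode cache slot */
};

/* Hypothetical backend callback, standing in for ->get_acl(). */
typedef void *(*demo_fetch_fn)(struct demo_inode *inode);

static bool demo_is_uncacheable(void *v) { return ((uintptr_t)v & 1) != 0; }
static void *demo_to_uncacheable(void *v) { return (void *)((uintptr_t)v | 1); }
static void *demo_to_cacheable(void *v) { return (void *)((uintptr_t)v & ~(uintptr_t)1); }

void *demo_get(struct demo_inode *inode, demo_fetch_fn fetch)
{
	int marker;
	/* Odd, per-call value: cannot collide with a real (even) pointer. */
	void *sentinel = (void *)((uintptr_t)&marker | 1);
	void *expected, *val, *to_cache;

	val = atomic_load(&inode->slot);
	if (!((uintptr_t)val & 1))
		return val;			/* cache hit */

	/* Try to claim the slot; on failure we still fetch, but won't cache. */
	expected = DEMO_NOT_CACHED;
	atomic_compare_exchange_strong(&inode->slot, &expected, sentinel);

	val = fetch(inode);

	to_cache = DEMO_NOT_CACHED;		/* default: clear the sentinel */
	if (demo_is_uncacheable(val))
		val = demo_to_cacheable(val);	/* strip the tag, do not cache */
	else
		to_cache = val;

	/* Install only if our sentinel survived; a racing set or forget wins. */
	expected = sentinel;
	atomic_compare_exchange_strong(&inode->slot, &expected, to_cache);
	return val;
}

static int demo_acl_a, demo_acl_b;

static void *fetch_plain(struct demo_inode *inode)
{
	(void)inode;
	return &demo_acl_a;
}

static void *fetch_uncacheable(struct demo_inode *inode)
{
	(void)inode;
	return demo_to_uncacheable(&demo_acl_a);
}

/* Simulates set_cached_acl() racing with the lookup. */
static void *fetch_with_racing_set(struct demo_inode *inode)
{
	atomic_store(&inode->slot, (void *)&demo_acl_b);
	return &demo_acl_a;
}

/* Returns 0 if all three scenarios behave as described. */
int demo_cache_check(void)
{
	struct demo_inode inode;
	int ok = 1;

	/* 1. A plain result is returned and cached. */
	atomic_init(&inode.slot, DEMO_NOT_CACHED);
	ok &= demo_get(&inode, fetch_plain) == (void *)&demo_acl_a;
	ok &= atomic_load(&inode.slot) == (void *)&demo_acl_a;

	/* 2. A tagged result is returned untagged and the slot stays empty. */
	atomic_store(&inode.slot, DEMO_NOT_CACHED);
	ok &= demo_get(&inode, fetch_uncacheable) == (void *)&demo_acl_a;
	ok &= atomic_load(&inode.slot) == DEMO_NOT_CACHED;

	/* 3. A value set during the fetch is not overwritten. */
	atomic_store(&inode.slot, DEMO_NOT_CACHED);
	ok &= demo_get(&inode, fetch_with_racing_set) == (void *)&demo_acl_a;
	ok &= atomic_load(&inode.slot) == (void *)&demo_acl_b;

	return ok ? 0 : 1;
}
```

In this model the uncacheable case only changes which value replaces the
sentinel, so an explicit set or forget that lands during the fetch is
left alone in every scenario.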

Eric

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [PATCH v8 0/6] fuse: mounts from non-init user namespaces
       [not found]       ` <87po4rz4ui.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                           ` (6 preceding siblings ...)
  2018-02-26 23:53         ` [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant Eric W. Biederman
@ 2018-03-02 21:58         ` Eric W. Biederman
  7 siblings, 0 replies; 219+ messages in thread
From: Eric W. Biederman @ 2018-03-02 21:58 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: containers, linux-kernel, Seth Forshee, Alban Crequy,
	Sargun Dhillon, linux-fsdevel, Linus Torvalds


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse.  The non-fuse
vfs patches are far enough along that we can ignore them, except possibly
for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot.  Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
  added.
- Updated fuse to always use posix_acl_access_xattr_handler and
  posix_acl_default_xattr_handler, by teaching fuse to set
  ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Since v7 I have:
- Rethought and reworked how I am unifying the cached and the non-cached
  posix acl case so the code is cleaner and simpler.
  - I have dropped enhancements to caching negative acls when
    fc->no_getxattr is set.
  - Removed the need to wrap forget_all_cached_acls in fuse.
- Reordered the patches so the posix acl work comes first.

Miklos, can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

These ch