All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] fix depvpts in user namespaces
@ 2013-03-15  9:13 ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

Hi,

devpts mounts in user namespaces is queued for 3.9. However, while playing
with it I found it to be less than ideal. Although it could possibly work
with custom software that can be made to point to /dev/pts/ptmx, a few things
prevent it from working correctly for people that, like us, are booting full
distributions.

In those scenarios, things like udev will kick in, maybe remount /dev undoing
any setup we might have done, and then software like sshd or anything else
calling openpty will search for /dev/ptmx, not /dev/pts/ptmx.

One of the problems that I am addressing in here is that we are disallowing
mknod in usernamespaces. Although I understand the motivation for that, I
believe that to be too restrictive, specially because we already control access
to the files separately. There should be no harm in mknod'ing something per se,
if manipulating it is forbidden.

That too, however, is too restrictive. Following the precedence that we set by
letting memcg manage the memory for tmpfs mounts, I am doing the same here with
the device cgroup. With the exception that instead of suggesting, here we have
a way to actually enforce it. Unless the mount was specifically marked as
nodev, reads and writes will be allowed to proceed if a device cgroup is
containing the process. The device cgroup will then be the one responsible for
setting fine grained access about which devices can and cannot be manipulated.

Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
link it to our namespaces'. My proposal is to multiplex it and return the
correct "root ptmx" depending on which userns is reading that device.

Glauber Costa (4):
  dev_cgroup: keep track of which cgroup is the root cgroup
  fs: allow dev accesses in userns in controlled situations
  fs: allow mknod in user namespaces
  devpts: fix usage in user namespaces

 fs/devpts/inode.c        | 157 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/namei.c               |   6 +-
 fs/namespace.c           |   2 +-
 include/linux/mount.h    |   2 +
 include/linux/security.h |   1 +
 security/device_cgroup.c |  15 ++++-
 6 files changed, 173 insertions(+), 10 deletions(-)

-- 
1.8.1.2

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 0/4] fix depvpts in user namespaces
@ 2013-03-15  9:13 ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

Hi,

devpts mounts in user namespaces is queued for 3.9. However, while playing
with it I found it to be less than ideal. Although it could possibly work
with custom software that can be made to point to /dev/pts/ptmx, a few things
prevent it from working correctly for people that, like us, are booting full
distributions.

In those scenarios, things like udev will kick in, maybe remount /dev undoing
any setup we might have done, and then software like sshd or anything else
calling openpty will search for /dev/ptmx, not /dev/pts/ptmx.

One of the problems that I am addressing in here is that we are disallowing
mknod in usernamespaces. Although I understand the motivation for that, I
believe that to be too restrictive, specially because we already control access
to the files separately. There should be no harm in mknod'ing something per se,
if manipulating it is forbidden.

That too, however, is too restrictive. Following the precedence that we set by
letting memcg manage the memory for tmpfs mounts, I am doing the same here with
the device cgroup. With the exception that instead of suggesting, here we have
a way to actually enforce it. Unless the mount was specifically marked as
nodev, reads and writes will be allowed to proceed if a device cgroup is
containing the process. The device cgroup will then be the one responsible for
setting fine grained access about which devices can and cannot be manipulated.

Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
link it to our namespaces'. My proposal is to multiplex it and return the
correct "root ptmx" depending on which userns is reading that device.

Glauber Costa (4):
  dev_cgroup: keep track of which cgroup is the root cgroup
  fs: allow dev accesses in userns in controlled situations
  fs: allow mknod in user namespaces
  devpts: fix usage in user namespaces

 fs/devpts/inode.c        | 157 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/namei.c               |   6 +-
 fs/namespace.c           |   2 +-
 include/linux/mount.h    |   2 +
 include/linux/security.h |   1 +
 security/device_cgroup.c |  15 ++++-
 6 files changed, 173 insertions(+), 10 deletions(-)

-- 
1.8.1.2

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15  9:13     ` Glauber Costa
  2013-03-15  9:13     ` Glauber Costa
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

Most of the other subsystems already keep track of that in some way.  We
will do that internally and provide a test to determine whether or not
our task is in a device cgroup that is not the root one. We can relax
some of our checks in that case, trusting that whoever set device cgroup
rules will be responsible to control access to their devices.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 include/linux/security.h |  1 +
 security/device_cgroup.c | 15 +++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index eee7478..fe58f71 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p);
 extern int cap_task_setioprio(struct task_struct *p, int ioprio);
 extern int cap_task_setnice(struct task_struct *p, int nice);
 extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
+bool *task_in_child_devcgroup(struct task_struct *task);
 
 struct msghdr;
 struct sk_buff;
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 1c69e38..03df5b2 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 	return css_to_devcgroup(task_subsys_state(task, devices_subsys_id));
 }
 
+static struct dev_cgroup *root_devcgroup;
+bool task_in_child_devcgroup(struct task_struct *task)
+{
+	bool ret;
+	rcu_read_lock();
+	ret = task_devcgroup(task) != root_devcgroup;
+	rcu_read_unlock();
+	return ret;
+}
+
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup *new_cgrp,
@@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
 	parent_cgroup = cgroup->parent;
 
-	if (parent_cgroup == NULL)
+	if (parent_cgroup == NULL) {
 		dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW;
-	else {
+		root_devcgroup = dev_cgroup;
+	} else {
 		parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup);
 		mutex_lock(&devcgroup_mutex);
 		ret = dev_exceptions_copy(&dev_cgroup->exceptions,
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
@ 2013-03-15  9:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

Most of the other subsystems already keep track of that in some way.  We
will do that internally and provide a test to determine whether or not
our task is in a device cgroup that is not the root one. We can relax
some of our checks in that case, trusting that whoever set device cgroup
rules will be responsible to control access to their devices.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 include/linux/security.h |  1 +
 security/device_cgroup.c | 15 +++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index eee7478..fe58f71 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p);
 extern int cap_task_setioprio(struct task_struct *p, int ioprio);
 extern int cap_task_setnice(struct task_struct *p, int nice);
 extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
+bool *task_in_child_devcgroup(struct task_struct *task);
 
 struct msghdr;
 struct sk_buff;
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 1c69e38..03df5b2 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 	return css_to_devcgroup(task_subsys_state(task, devices_subsys_id));
 }
 
+static struct dev_cgroup *root_devcgroup;
+bool task_in_child_devcgroup(struct task_struct *task)
+{
+	bool ret;
+	rcu_read_lock();
+	ret = task_devcgroup(task) != root_devcgroup;
+	rcu_read_unlock();
+	return ret;
+}
+
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup *new_cgrp,
@@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
 	parent_cgroup = cgroup->parent;
 
-	if (parent_cgroup == NULL)
+	if (parent_cgroup == NULL) {
 		dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW;
-	else {
+		root_devcgroup = dev_cgroup;
+	} else {
 		parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup);
 		mutex_lock(&devcgroup_mutex);
 		ret = dev_exceptions_copy(&dev_cgroup->exceptions,
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15  9:13     ` Glauber Costa
  2013-03-15  9:13     ` Glauber Costa
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

So far, unless the filesystem explicitly marks it (and most don't),
processes running in user namespaces won't be allowed to access any
devices. Although this makes sense, this is a quite restrictive rule,
since a lot of those accesses would be perfectly safe: aside from the
simple char devices in /dev/ like null, zero, etc, it is perfectly
possible to assign a device for usage inside a namespace if we can
establish trust in that operation.

We will do that by marking the mount as MNT_NODEV_NS instead of
MNT_NODEV. This is because if the mount operation explicitly asked for
nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the
task is not on a device cgroup. If it is, we will rely on the control
rules in devcg to intermediate the access an tell us what those tasks
can or cannot do.

There is precedence for that with memcg: although we don't explicitly
test it like I am doing it here, we are allowing tmpfs mounts to happen
in user namespaces because memcg will contain them.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/namei.c            | 4 ++++
 fs/namespace.c        | 2 +-
 include/linux/mount.h | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8..8a34d79 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2356,6 +2356,10 @@ static int may_open(struct path *path, int acc_mode, int flag)
 	case S_IFCHR:
 		if (path->mnt->mnt_flags & MNT_NODEV)
 			return -EACCES;
+
+		if ((path->mnt->mnt_flags & MNT_NODEV_NS) &&
+			!task_in_child_devcgroup(current))
+			return -EACCES;
 		/*FALLTHRU*/
 	case S_IFIFO:
 	case S_IFSOCK:
diff --git a/fs/namespace.c b/fs/namespace.c
index 50ca17d..fe8127e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1935,7 +1935,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 		 */
 		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
 			flags |= MS_NODEV;
-			mnt_flags |= MNT_NODEV;
+			mnt_flags |= MNT_NODEV_NS;
 		}
 	}
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..8d190e4 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -32,6 +32,8 @@ struct mnt_namespace;
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
 
+#define MNT_NODEV_NS	0x400   /* userns mount, and nodev not explicit */
+
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
 /*
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
@ 2013-03-15  9:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

So far, unless the filesystem explicitly marks it (and most don't),
processes running in user namespaces won't be allowed to access any
devices. Although this makes sense, this is a quite restrictive rule,
since a lot of those accesses would be perfectly safe: aside from the
simple char devices in /dev/ like null, zero, etc, it is perfectly
possible to assign a device for usage inside a namespace if we can
establish trust in that operation.

We will do that by marking the mount as MNT_NODEV_NS instead of
MNT_NODEV. This is because if the mount operation explicitly asked for
nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the
task is not on a device cgroup. If it is, we will rely on the control
rules in devcg to intermediate the access an tell us what those tasks
can or cannot do.

There is precedence for that with memcg: although we don't explicitly
test it like I am doing it here, we are allowing tmpfs mounts to happen
in user namespaces because memcg will contain them.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/namei.c            | 4 ++++
 fs/namespace.c        | 2 +-
 include/linux/mount.h | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8..8a34d79 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2356,6 +2356,10 @@ static int may_open(struct path *path, int acc_mode, int flag)
 	case S_IFCHR:
 		if (path->mnt->mnt_flags & MNT_NODEV)
 			return -EACCES;
+
+		if ((path->mnt->mnt_flags & MNT_NODEV_NS) &&
+			!task_in_child_devcgroup(current))
+			return -EACCES;
 		/*FALLTHRU*/
 	case S_IFIFO:
 	case S_IFSOCK:
diff --git a/fs/namespace.c b/fs/namespace.c
index 50ca17d..fe8127e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1935,7 +1935,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 		 */
 		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
 			flags |= MS_NODEV;
-			mnt_flags |= MNT_NODEV;
+			mnt_flags |= MNT_NODEV_NS;
 		}
 	}
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..8d190e4 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -32,6 +32,8 @@ struct mnt_namespace;
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
 
+#define MNT_NODEV_NS	0x400   /* userns mount, and nodev not explicit */
+
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
 /*
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 3/4] fs: allow mknod in user namespaces
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-03-15  9:13     ` Glauber Costa
@ 2013-03-15  9:13   ` Glauber Costa
  2013-03-15  9:13   ` [PATCH 4/4] devpts: fix usage " Glauber Costa
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

Since we have strict control on who access the devices, it should be
no problem to allow the device to appear.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8a34d79..d0b4549 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
 		return -EPERM;
 
 	if (!dir->i_op->mknod)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 3/4] fs: allow mknod in user namespaces
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15  9:13     ` Glauber Costa
  2013-03-15  9:13     ` Glauber Costa
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	Eric W. Biederman, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Glauber Costa, Aristeu Rozanski

Since we have strict control on who access the devices, it should be
no problem to allow the device to appear.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8a34d79..d0b4549 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
 		return -EPERM;
 
 	if (!dir->i_op->mknod)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 3/4] fs: allow mknod in user namespaces
@ 2013-03-15  9:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	Eric W. Biederman, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Glauber Costa, Aristeu Rozanski

Since we have strict control on who access the devices, it should be
no problem to allow the device to appear.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8a34d79..d0b4549 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
 		return -EPERM;
 
 	if (!dir->i_op->mknod)
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 4/4] devpts: fix usage in user namespaces
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-03-15  9:13   ` Glauber Costa
@ 2013-03-15  9:13   ` Glauber Costa
  2013-03-15  9:13     ` Glauber Costa
  2013-03-15 10:26   ` [PATCH 0/4] fix depvpts " Eric W. Biederman
  6 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	Eric W. Biederman

The current usage of devpts in user namespaces is quite problematic
for a couple of reasons:

* It requires the newinstance option, which is a sensible thing.
  However, if we are running full distributions inside containers
  or any other form of unmodified apps, that option wouldn't be passed
  for what they believe to be the initial mount. We could of course do
  the mount in the container initialization itself, but many
  distributions will somehow remount /dev in temporary storage for udev,
  and then mount devpts ontop of that. That gets rid of our setup, and
  overcoming that would require a complex synchronization between the
  control tool and the distribution booting. A much better behavior
  is to imply that option for the first mount in the user namespace.

* Again, unless we are talking about a custom application, standard
  procedures like ssh, openpty, and friends, will try to open /dev/ptmx
  instead of /dev/pts/ptmx. We can link that in userspace but that is
  quite cumbersome, for the same reasons listed above. A better behavior
  is to respect newinstance mounts regardless of whether they come from,
  but keep track of who is the first mount inside the usernamespace.
  That becomes the "root" of that namespace, and inodes without ties
  to a specific superblock will point to that instead of devpts_mount.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/devpts/inode.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 151 insertions(+), 6 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 073d30b..58fbff4 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -123,12 +123,129 @@ static const match_table_t tokens = {
 	{Opt_err, NULL}
 };
 
+struct userns_ptmx {
+	struct list_head list;
+	struct user_namespace *owner;
+	struct super_block *sb;
+};
+
 struct pts_fs_info {
 	struct ida allocated_ptys;
 	struct pts_mount_opts mount_opts;
 	struct dentry *ptmx_dentry;
+#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS)
+	struct userns_ptmx *ns_ptmx;
+#endif
 };
 
+#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS)
+/*
+ * When using user namespaces, we will require devpts to be created as a new
+ * instance and thus will span a new superblock. However, applications running
+ * inside a container may still try to open /dev/ptmx, instead of
+ * /dev/pts/ptmx. That will be the rule for all non custom applications. When
+ * that happen, they will end up opening the root namespace's /dev/ptmx. That
+ * will in turn create a new pts in the main /dev, outside the reach of our
+ * user namespace.
+ *
+ * The rule to deal with it is very simple: The first /dev/pts mount inside the
+ * namespace will become the owner of that namespace. Anyone reading /dev/ptmx
+ * will be redirected to /dev/pts/ptmx of that mount. Subsequent mounts will
+ * have to create a new instance to have a separated ptmx, just the way it it
+ * for non user namespace.
+ */
+static LIST_HEAD(userns_ptmx_list);
+static DEFINE_MUTEX(userns_ptmx_mutex);
+struct pts_fs_info;
+
+/*
+ * Having ns_ptmx stored somewhere in the superblock (and fsi is the obvious
+ * choice) is para-mount (pun intended) destroying it while umounting. We
+ * need to signal that there is something to destroy.
+ */
+static void define_ns_root_ptmx(struct super_block *sb,
+				struct userns_ptmx *ns_ptmx)
+{
+	struct pts_fs_info *fsi = sb->s_fs_info;
+	if (!ns_ptmx)
+		return;
+
+	fsi->ns_ptmx = ns_ptmx;
+	ns_ptmx->sb = sb;
+}
+
+static void remove_ns_root_ptmx(struct pts_fs_info *fsi)
+{
+	if (!fsi->ns_ptmx)
+		return;
+
+	if (WARN_ON(fsi->ns_ptmx->owner == &init_user_ns))
+		return;
+
+	mutex_lock(&userns_ptmx_mutex);
+	list_del(&fsi->ns_ptmx->list);
+	mutex_unlock(&userns_ptmx_mutex);
+	kfree(fsi->ns_ptmx);
+}
+
+static struct super_block *pts_sb_of_namespace(struct user_namespace *userns)
+{
+	struct userns_ptmx *ns_ptmx  = NULL;
+	struct super_block *sb = NULL;
+
+	mutex_lock(&userns_ptmx_mutex);
+	list_for_each_entry(ns_ptmx, &userns_ptmx_list, list) {
+		if (ns_ptmx->owner != userns)
+			continue;
+		sb = ns_ptmx->sb;
+		goto out;
+	}
+out:
+	mutex_unlock(&userns_ptmx_mutex);
+	return sb;
+}
+
+static struct userns_ptmx *pts_new_namespace_root(struct user_namespace *userns)
+{
+	struct userns_ptmx *ns_ptmx;
+	if (userns == &init_user_ns)
+		return NULL;
+
+	ns_ptmx = kzalloc(sizeof(*ns_ptmx), GFP_KERNEL);
+	if (!ns_ptmx)
+		return ns_ptmx;
+
+	ns_ptmx->owner = userns;
+	mutex_lock(&userns_ptmx_mutex);
+	list_add(&ns_ptmx->list, &userns_ptmx_list);
+	mutex_unlock(&userns_ptmx_mutex);
+
+	return ns_ptmx;
+}
+#else
+static inline void define_ns_root_ptmx(struct super_block *sb,
+				struct userns_ptmx *ns_ptmx)
+{
+}
+
+static inline struct super_block *
+pts_sb_of_namespace(struct user_namespace *userns)
+{
+	return NULL;
+}
+
+static inline struct userns_ptmx *
+pts_new_namespace_root(struct user_namespace *userns)
+{
+	return NULL;
+}
+
+static inline void remove_ns_root_ptmx(struct pts_fs_info *fsi)
+{
+}
+#endif
+
+
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -139,6 +256,13 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
 	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
 		return inode->i_sb;
+	else if (current_user_ns() != &init_user_ns)
+		/*
+		 * If there is valid devpts superblock associated with the
+		 * inode, then we respect it no matter what. Otherwise, which
+		 * superblock we return depends on which namespace we are at.
+		 */
+		return pts_sb_of_namespace(current_user_ns());
 #endif
 	return devpts_mnt->mnt_sb;
 }
@@ -442,16 +566,31 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
+	struct user_namespace *userns = current_user_ns();
+	struct userns_ptmx *ns_ptmx = NULL;
 
 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
 		return ERR_PTR(error);
 
-	/* Require newinstance for all user namespace mounts to ensure
-	 * the mount options are not changed.
+	/*
+	 * Require newinstance for all user namespace mounts to ensure
+	 * the mount options are not changed. If this is the first mount
+	 * of the namespace, that option is implied.
 	 */
-	if ((current_user_ns() != &init_user_ns) && !opts.newinstance)
-		return ERR_PTR(-EINVAL);
+	if ((userns != &init_user_ns) && !opts.newinstance) {
+		/*
+		 * And this is how we know we're talking about the first
+		 * mount. If we don't have an assigned ptmx of userns,
+		 * then we have to create it.
+		 */
+		if (pts_sb_of_namespace(userns))
+			return ERR_PTR(-EINVAL);
+		ns_ptmx = pts_new_namespace_root(userns);
+		if (!ns_ptmx)
+			return ERR_PTR(-ENOMEM);
+		opts.newinstance = 1;
+	}
 
 	if (opts.newinstance)
 		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
@@ -459,8 +598,10 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		s = sget(fs_type, compare_init_pts_sb, set_anon_super, flags,
 			 NULL);
 
-	if (IS_ERR(s))
-		return ERR_CAST(s);
+	if (IS_ERR(s)) {
+		error = PTR_ERR(s);
+		goto out_no_s;
+	}
 
 	if (!s->s_root) {
 		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
@@ -470,6 +611,7 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	}
 
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));
+	define_ns_root_ptmx(s, ns_ptmx);
 
 	error = mknod_ptmx(s);
 	if (error)
@@ -479,6 +621,8 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 
 out_undo_sget:
 	deactivate_locked_super(s);
+out_no_s:
+	kfree(ns_ptmx);
 	return ERR_PTR(error);
 }
 
@@ -498,6 +642,7 @@ static void devpts_kill_sb(struct super_block *sb)
 {
 	struct pts_fs_info *fsi = DEVPTS_SB(sb);
 
+	remove_ns_root_ptmx(fsi);
 	kfree(fsi);
 	kill_litter_super(sb);
 }
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 4/4] devpts: fix usage in user namespaces
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15  9:13     ` Glauber Costa
  2013-03-15  9:13     ` Glauber Costa
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	Eric W. Biederman, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Glauber Costa

The current usage of devpts in user namespaces is quite problematic
for a couple of reasons:

* It requires the newinstance option, which is a sensible thing.
  However, if we are running full distributions inside containers
  or any other form of unmodified apps, that option wouldn't be passed
  for what they believe to be the initial mount. We could of course do
  the mount in the container initialization itself, but many
  distributions will somehow remount /dev in temporary storage for udev,
  and then mount devpts ontop of that. That gets rid of our setup, and
  overcoming that would require a complex synchronization between the
  control tool and the distribution booting. A much better behavior
  is to imply that option for the first mount in the user namespace.

* Again, unless we are talking about a custom application, standard
  procedures like ssh, openpty, and friends, will try to open /dev/ptmx
  instead of /dev/pts/ptmx. We can link that in userspace but that is
  quite cumbersome, for the same reasons listed above. A better behavior
  is to respect newinstance mounts regardless of whether they come from,
  but keep track of who is the first mount inside the usernamespace.
  That becomes the "root" of that namespace, and inodes without ties
  to a specific superblock will point to that instead of devpts_mount.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/devpts/inode.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 151 insertions(+), 6 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 073d30b..58fbff4 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -123,12 +123,129 @@ static const match_table_t tokens = {
 	{Opt_err, NULL}
 };
 
+struct userns_ptmx {
+	struct list_head list;
+	struct user_namespace *owner;
+	struct super_block *sb;
+};
+
 struct pts_fs_info {
 	struct ida allocated_ptys;
 	struct pts_mount_opts mount_opts;
 	struct dentry *ptmx_dentry;
+#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS)
+	struct userns_ptmx *ns_ptmx;
+#endif
 };
 
+#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS)
+/*
+ * When using user namespaces, we will require devpts to be created as a new
+ * instance and thus will span a new superblock. However, applications running
+ * inside a container may still try to open /dev/ptmx, instead of
+ * /dev/pts/ptmx. That will be the rule for all non custom applications. When
+ * that happen, they will end up opening the root namespace's /dev/ptmx. That
+ * will in turn create a new pts in the main /dev, outside the reach of our
+ * user namespace.
+ *
+ * The rule to deal with it is very simple: The first /dev/pts mount inside the
+ * namespace will become the owner of that namespace. Anyone reading /dev/ptmx
+ * will be redirected to /dev/pts/ptmx of that mount. Subsequent mounts will
+ * have to create a new instance to have a separated ptmx, just the way it it
+ * for non user namespace.
+ */
+static LIST_HEAD(userns_ptmx_list);
+static DEFINE_MUTEX(userns_ptmx_mutex);
+struct pts_fs_info;
+
+/*
+ * Having ns_ptmx stored somewhere in the superblock (and fsi is the obvious
+ * choice) is para-mount (pun intended) destroying it while umounting. We
+ * need to signal that there is something to destroy.
+ */
+static void define_ns_root_ptmx(struct super_block *sb,
+				struct userns_ptmx *ns_ptmx)
+{
+	struct pts_fs_info *fsi = sb->s_fs_info;
+	if (!ns_ptmx)
+		return;
+
+	fsi->ns_ptmx = ns_ptmx;
+	ns_ptmx->sb = sb;
+}
+
+static void remove_ns_root_ptmx(struct pts_fs_info *fsi)
+{
+	if (!fsi->ns_ptmx)
+		return;
+
+	if (WARN_ON(fsi->ns_ptmx->owner == &init_user_ns))
+		return;
+
+	mutex_lock(&userns_ptmx_mutex);
+	list_del(&fsi->ns_ptmx->list);
+	mutex_unlock(&userns_ptmx_mutex);
+	kfree(fsi->ns_ptmx);
+}
+
+static struct super_block *pts_sb_of_namespace(struct user_namespace *userns)
+{
+	struct userns_ptmx *ns_ptmx  = NULL;
+	struct super_block *sb = NULL;
+
+	mutex_lock(&userns_ptmx_mutex);
+	list_for_each_entry(ns_ptmx, &userns_ptmx_list, list) {
+		if (ns_ptmx->owner != userns)
+			continue;
+		sb = ns_ptmx->sb;
+		goto out;
+	}
+out:
+	mutex_unlock(&userns_ptmx_mutex);
+	return sb;
+}
+
+static struct userns_ptmx *pts_new_namespace_root(struct user_namespace *userns)
+{
+	struct userns_ptmx *ns_ptmx;
+	if (userns == &init_user_ns)
+		return NULL;
+
+	ns_ptmx = kzalloc(sizeof(*ns_ptmx), GFP_KERNEL);
+	if (!ns_ptmx)
+		return ns_ptmx;
+
+	ns_ptmx->owner = userns;
+	mutex_lock(&userns_ptmx_mutex);
+	list_add(&ns_ptmx->list, &userns_ptmx_list);
+	mutex_unlock(&userns_ptmx_mutex);
+
+	return ns_ptmx;
+}
+#else
+static inline void define_ns_root_ptmx(struct super_block *sb,
+				struct userns_ptmx *ns_ptmx)
+{
+}
+
+static inline struct super_block *
+pts_sb_of_namespace(struct user_namespace *userns)
+{
+	return NULL;
+}
+
+static inline struct userns_ptmx *
+pts_new_namespace_root(struct user_namespace *userns)
+{
+	return NULL;
+}
+
+static inline void remove_ns_root_ptmx(struct pts_fs_info *fsi)
+{
+}
+#endif
+
+
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -139,6 +256,13 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
 	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
 		return inode->i_sb;
+	else if (current_user_ns() != &init_user_ns)
+		/*
+		 * If there is valid devpts superblock associated with the
+		 * inode, then we respect it no matter what. Otherwise, which
+		 * superblock we return depends on which namespace we are at.
+		 */
+		return pts_sb_of_namespace(current_user_ns());
 #endif
 	return devpts_mnt->mnt_sb;
 }
@@ -442,16 +566,31 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
+	struct user_namespace *userns = current_user_ns();
+	struct userns_ptmx *ns_ptmx = NULL;
 
 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
 		return ERR_PTR(error);
 
-	/* Require newinstance for all user namespace mounts to ensure
-	 * the mount options are not changed.
+	/*
+	 * Require newinstance for all user namespace mounts to ensure
+	 * the mount options are not changed. If this is the first mount
+	 * of the namespace, that option is implied.
 	 */
-	if ((current_user_ns() != &init_user_ns) && !opts.newinstance)
-		return ERR_PTR(-EINVAL);
+	if ((userns != &init_user_ns) && !opts.newinstance) {
+		/*
+		 * And this is how we know we're talking about the first
+		 * mount. If we don't have an assigned ptmx of userns,
+		 * then we have to create it.
+		 */
+		if (pts_sb_of_namespace(userns))
+			return ERR_PTR(-EINVAL);
+		ns_ptmx = pts_new_namespace_root(userns);
+		if (!ns_ptmx)
+			return ERR_PTR(-ENOMEM);
+		opts.newinstance = 1;
+	}
 
 	if (opts.newinstance)
 		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
@@ -459,8 +598,10 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		s = sget(fs_type, compare_init_pts_sb, set_anon_super, flags,
 			 NULL);
 
-	if (IS_ERR(s))
-		return ERR_CAST(s);
+	if (IS_ERR(s)) {
+		error = PTR_ERR(s);
+		goto out_no_s;
+	}
 
 	if (!s->s_root) {
 		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
@@ -470,6 +611,7 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	}
 
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));
+	define_ns_root_ptmx(s, ns_ptmx);
 
 	error = mknod_ptmx(s);
 	if (error)
@@ -479,6 +621,8 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 
 out_undo_sget:
 	deactivate_locked_super(s);
+out_no_s:
+	kfree(ns_ptmx);
 	return ERR_PTR(error);
 }
 
@@ -498,6 +642,7 @@ static void devpts_kill_sb(struct super_block *sb)
 {
 	struct pts_fs_info *fsi = DEVPTS_SB(sb);
 
+	remove_ns_root_ptmx(fsi);
 	kfree(fsi);
 	kill_litter_super(sb);
 }
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 4/4] devpts: fix usage in user namespaces
@ 2013-03-15  9:13     ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15  9:13 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrew Morton, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	Eric W. Biederman, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Glauber Costa

The current usage of devpts in user namespaces is quite problematic
for a couple of reasons:

* It requires the newinstance option, which is a sensible thing.
  However, if we are running full distributions inside containers
  or any other form of unmodified apps, that option wouldn't be passed
  for what they believe to be the initial mount. We could of course do
  the mount in the container initialization itself, but many
  distributions will somehow remount /dev in temporary storage for udev,
  and then mount devpts ontop of that. That gets rid of our setup, and
  overcoming that would require a complex synchronization between the
  control tool and the distribution booting. A much better behavior
  is to imply that option for the first mount in the user namespace.

* Again, unless we are talking about a custom application, standard
  procedures like ssh, openpty, and friends, will try to open /dev/ptmx
  instead of /dev/pts/ptmx. We can link that in userspace but that is
  quite cumbersome, for the same reasons listed above. A better behavior
  is to respect newinstance mounts regardless of whether they come from,
  but keep track of who is the first mount inside the usernamespace.
  That becomes the "root" of that namespace, and inodes without ties
  to a specific superblock will point to that instead of devpts_mount.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
---
 fs/devpts/inode.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 151 insertions(+), 6 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 073d30b..58fbff4 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -123,12 +123,129 @@ static const match_table_t tokens = {
 	{Opt_err, NULL}
 };
 
+struct userns_ptmx {
+	struct list_head list;
+	struct user_namespace *owner;
+	struct super_block *sb;
+};
+
 struct pts_fs_info {
 	struct ida allocated_ptys;
 	struct pts_mount_opts mount_opts;
 	struct dentry *ptmx_dentry;
+#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS)
+	struct userns_ptmx *ns_ptmx;
+#endif
 };
 
+#if defined(CONFIG_DEVPTS_MULTIPLE_INSTANCES) && defined(CONFIG_USER_NS)
+/*
+ * When using user namespaces, we will require devpts to be created as a new
+ * instance and thus will span a new superblock. However, applications running
+ * inside a container may still try to open /dev/ptmx, instead of
+ * /dev/pts/ptmx. That will be the rule for all non custom applications. When
+ * that happen, they will end up opening the root namespace's /dev/ptmx. That
+ * will in turn create a new pts in the main /dev, outside the reach of our
+ * user namespace.
+ *
+ * The rule to deal with it is very simple: The first /dev/pts mount inside the
+ * namespace will become the owner of that namespace. Anyone reading /dev/ptmx
+ * will be redirected to /dev/pts/ptmx of that mount. Subsequent mounts will
+ * have to create a new instance to have a separated ptmx, just the way it it
+ * for non user namespace.
+ */
+static LIST_HEAD(userns_ptmx_list);
+static DEFINE_MUTEX(userns_ptmx_mutex);
+struct pts_fs_info;
+
+/*
+ * Having ns_ptmx stored somewhere in the superblock (and fsi is the obvious
+ * choice) is para-mount (pun intended) destroying it while umounting. We
+ * need to signal that there is something to destroy.
+ */
+static void define_ns_root_ptmx(struct super_block *sb,
+				struct userns_ptmx *ns_ptmx)
+{
+	struct pts_fs_info *fsi = sb->s_fs_info;
+	if (!ns_ptmx)
+		return;
+
+	fsi->ns_ptmx = ns_ptmx;
+	ns_ptmx->sb = sb;
+}
+
+static void remove_ns_root_ptmx(struct pts_fs_info *fsi)
+{
+	if (!fsi->ns_ptmx)
+		return;
+
+	if (WARN_ON(fsi->ns_ptmx->owner == &init_user_ns))
+		return;
+
+	mutex_lock(&userns_ptmx_mutex);
+	list_del(&fsi->ns_ptmx->list);
+	mutex_unlock(&userns_ptmx_mutex);
+	kfree(fsi->ns_ptmx);
+}
+
+static struct super_block *pts_sb_of_namespace(struct user_namespace *userns)
+{
+	struct userns_ptmx *ns_ptmx  = NULL;
+	struct super_block *sb = NULL;
+
+	mutex_lock(&userns_ptmx_mutex);
+	list_for_each_entry(ns_ptmx, &userns_ptmx_list, list) {
+		if (ns_ptmx->owner != userns)
+			continue;
+		sb = ns_ptmx->sb;
+		goto out;
+	}
+out:
+	mutex_unlock(&userns_ptmx_mutex);
+	return sb;
+}
+
+static struct userns_ptmx *pts_new_namespace_root(struct user_namespace *userns)
+{
+	struct userns_ptmx *ns_ptmx;
+	if (userns == &init_user_ns)
+		return NULL;
+
+	ns_ptmx = kzalloc(sizeof(*ns_ptmx), GFP_KERNEL);
+	if (!ns_ptmx)
+		return ns_ptmx;
+
+	ns_ptmx->owner = userns;
+	mutex_lock(&userns_ptmx_mutex);
+	list_add(&ns_ptmx->list, &userns_ptmx_list);
+	mutex_unlock(&userns_ptmx_mutex);
+
+	return ns_ptmx;
+}
+#else
+static inline void define_ns_root_ptmx(struct super_block *sb,
+				struct userns_ptmx *ns_ptmx)
+{
+}
+
+static inline struct super_block *
+pts_sb_of_namespace(struct user_namespace *userns)
+{
+	return NULL;
+}
+
+static inline struct userns_ptmx *
+pts_new_namespace_root(struct user_namespace *userns)
+{
+	return NULL;
+}
+
+static inline void remove_ns_root_ptmx(struct pts_fs_info *fsi)
+{
+}
+#endif
+
+
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -139,6 +256,13 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
 	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
 		return inode->i_sb;
+	else if (current_user_ns() != &init_user_ns)
+		/*
+		 * If there is valid devpts superblock associated with the
+		 * inode, then we respect it no matter what. Otherwise, which
+		 * superblock we return depends on which namespace we are at.
+		 */
+		return pts_sb_of_namespace(current_user_ns());
 #endif
 	return devpts_mnt->mnt_sb;
 }
@@ -442,16 +566,31 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
+	struct user_namespace *userns = current_user_ns();
+	struct userns_ptmx *ns_ptmx = NULL;
 
 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
 		return ERR_PTR(error);
 
-	/* Require newinstance for all user namespace mounts to ensure
-	 * the mount options are not changed.
+	/*
+	 * Require newinstance for all user namespace mounts to ensure
+	 * the mount options are not changed. If this is the first mount
+	 * of the namespace, that option is implied.
 	 */
-	if ((current_user_ns() != &init_user_ns) && !opts.newinstance)
-		return ERR_PTR(-EINVAL);
+	if ((userns != &init_user_ns) && !opts.newinstance) {
+		/*
+		 * And this is how we know we're talking about the first
+		 * mount. If we don't have an assigned ptmx of userns,
+		 * then we have to create it.
+		 */
+		if (pts_sb_of_namespace(userns))
+			return ERR_PTR(-EINVAL);
+		ns_ptmx = pts_new_namespace_root(userns);
+		if (!ns_ptmx)
+			return ERR_PTR(-ENOMEM);
+		opts.newinstance = 1;
+	}
 
 	if (opts.newinstance)
 		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
@@ -459,8 +598,10 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		s = sget(fs_type, compare_init_pts_sb, set_anon_super, flags,
 			 NULL);
 
-	if (IS_ERR(s))
-		return ERR_CAST(s);
+	if (IS_ERR(s)) {
+		error = PTR_ERR(s);
+		goto out_no_s;
+	}
 
 	if (!s->s_root) {
 		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
@@ -470,6 +611,7 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	}
 
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));
+	define_ns_root_ptmx(s, ns_ptmx);
 
 	error = mknod_ptmx(s);
 	if (error)
@@ -479,6 +621,8 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 
 out_undo_sget:
 	deactivate_locked_super(s);
+out_no_s:
+	kfree(ns_ptmx);
 	return ERR_PTR(error);
 }
 
@@ -498,6 +642,7 @@ static void devpts_kill_sb(struct super_block *sb)
 {
 	struct pts_fs_info *fsi = DEVPTS_SB(sb);
 
+	remove_ns_root_ptmx(fsi);
 	kfree(fsi);
 	kill_litter_super(sb);
 }
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (5 preceding siblings ...)
  2013-03-15  9:13     ` Glauber Costa
@ 2013-03-15 10:26   ` Eric W. Biederman
       [not found]     ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-03-15 14:00     ` Serge Hallyn
  6 siblings, 2 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 10:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Hi,
>
> devpts mounts in user namespaces is queued for 3.9. However, while playing
> with it I found it to be less than ideal. Although it could possibly work
> with custom software that can be made to point to /dev/pts/ptmx, a few things
> prevent it from working correctly for people that, like us, are booting full
> distributions.

Full distributions that have not been modified to be minimally container
aware.

> In those scenarios, things like udev will kick in, maybe remount /dev undoing
> any setup we might have done, and then software like sshd or anything else
> calling openpty will search for /dev/ptmx, not /dev/pts/ptmx.

I believe udev stopped running in containers a year or so ago.

> One of the problems that I am addressing in here is that we are disallowing
> mknod in usernamespaces. Although I understand the motivation for that, I
> believe that to be too restrictive, specially because we already control access
> to the files separately. There should be no harm in mknod'ing something per se,
> if manipulating it is forbidden.

mknod in userspace needs to be a separate patchset.  There is no need to
solve mknod in userspace to solve devpts.


> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
> link it to our namespaces'. My proposal is to multiplex it and return the
> correct "root ptmx" depending on which userns is reading that device.

Doable.  I still strongly prefer my version of having /dev/ptmx act like
a link to /dev/pts/ptmx.  Letting the mount namespace control it.

In testing that works, and it allows a lot of devpts complexity to just
go away.  For older versions of udev you can even configure them with a
rule to make /dev/ptmx a symlink to /dev/pts/ptmx.  Newer versions of
udev completely gave up on creating devices and can longer be configured
to do anything useful in this regard.

So we might even be able to just get away with a bit of udev and
devtmpfs configuration.  And treat devpts as if newinstance is always
specified.  Certainly that has worked in my testing so far.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]     ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-03-15 12:01       ` Glauber Costa
  2013-03-15 14:00       ` Serge Hallyn
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 12:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On 03/15/2013 02:26 PM, Eric W. Biederman wrote:
> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
>> Hi,
>>
>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>> with it I found it to be less than ideal. Although it could possibly work
>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>> prevent it from working correctly for people that, like us, are booting full
>> distributions.
> 
> Full distributions that have not been modified to be minimally container
> aware.
> 
Yes, which is true for every single distribution that was released
before containers became so pervasive. I believe we should be able to
make *better* decisions when we know we are in a container, but that
should still work.

>> In those scenarios, things like udev will kick in, maybe remount /dev undoing
>> any setup we might have done, and then software like sshd or anything else
>> calling openpty will search for /dev/ptmx, not /dev/pts/ptmx.
> 
> I believe udev stopped running in containers a year or so ago.

A year is not that big of a timeframe. I am running centos6 for
instance, and it runs udev. That is not even that ancient for enterprise
standards.

> 
>> One of the problems that I am addressing in here is that we are disallowing
>> mknod in usernamespaces. Although I understand the motivation for that, I
>> believe that to be too restrictive, specially because we already control access
>> to the files separately. There should be no harm in mknod'ing something per se,
>> if manipulating it is forbidden.
> 
> mknod in userspace needs to be a separate patchset.  There is no need to
> solve mknod in userspace to solve devpts.
> 
Well, yes. Patches 1 - 3 are technically independent of patch 4. If you
would review them and let me know what you think I would be much
appreciated. Reiterating, the proposal is akin to memcg+tmpfs, but I am
relaying control of devices to device cgroups, + requiring them to be
present.

> 
>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>> link it to our namespaces'. My proposal is to multiplex it and return the
>> correct "root ptmx" depending on which userns is reading that device.
> 
> Doable.  I still strongly prefer my version of having /dev/ptmx act like
> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> 
You mean an explicit link, or something else?

> In testing that works, and it allows a lot of devpts complexity to just
> go away.  For older versions of udev you can even configure them with a
> rule to make /dev/ptmx a symlink to /dev/pts/ptmx.  

At this point you are not getting rid of complexity, you are just moving
it to a different location. Instead of handling it in the kernel, we
know need to go and provide fixed configuration files for every single
distribution one may want to run in a container.

> 
> So we might even be able to just get away with a bit of udev and
> devtmpfs configuration.  And treat devpts as if newinstance is always
> specified.  Certainly that has worked in my testing so far.
> 

I can confirm that linking /dev/pts/ptmx to /dev/ptmx works. And also
that it needs configuration, and that this configuration will be
different for different distributions, possibly including distributions
releases. Handling it in the kernel is not *that* complicated and it
passed my tests with no hassles.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]     ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-03-15 12:01       ` Glauber Costa
@ 2013-03-15 14:00       ` Serge Hallyn
  1 sibling, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
> > Hi,
> >
> > devpts mounts in user namespaces is queued for 3.9. However, while playing
> > with it I found it to be less than ideal. Although it could possibly work
> > with custom software that can be made to point to /dev/pts/ptmx, a few things
> > prevent it from working correctly for people that, like us, are booting full
> > distributions.
> 
> Full distributions that have not been modified to be minimally container
> aware.

Right, in fact in this case it doesn't need to be minimally container
aware, you just create the bind mount yourself and init just needs to
accept that it shouldn't touch it.

> > In those scenarios, things like udev will kick in, maybe remount /dev undoing
> > any setup we might have done, and then software like sshd or anything else
> > calling openpty will search for /dev/ptmx, not /dev/pts/ptmx.
> 
> I believe udev stopped running in containers a year or so ago.

No, udev runs fine in containers, we just don't allow udevadm trigger.

> > One of the problems that I am addressing in here is that we are disallowing
> > mknod in usernamespaces. Although I understand the motivation for that, I
> > believe that to be too restrictive, specially because we already control access
> > to the files separately. There should be no harm in mknod'ing something per se,
> > if manipulating it is forbidden.
> 
> mknod in userspace needs to be a separate patchset.  There is no need to
> solve mknod in userspace to solve devpts.
> 
> 
> > Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
> > link it to our namespaces'. My proposal is to multiplex it and return the
> > correct "root ptmx" depending on which userns is reading that device.
> 
> Doable.  I still strongly prefer my version of having /dev/ptmx act like
> a link to /dev/pts/ptmx.  Letting the mount namespace control it.

Right, Glauber have you seen this patch?  Eric did already solve this.
(And again that's a nice safeguard, but it shouldn't be necessary)

> In testing that works, and it allows a lot of devpts complexity to just
> go away.  For older versions of udev you can even configure them with a
> rule to make /dev/ptmx a symlink to /dev/pts/ptmx.  Newer versions of
> udev completely gave up on creating devices and can longer be configured
> to do anything useful in this regard.
> 
> So we might even be able to just get away with a bit of udev and
> devtmpfs configuration.

devtmpfs?  Until we get multiple separate mounts of devtmpfs, don't use
it in a container :)

> And treat devpts as if newinstance is always
> specified.  Certainly that has worked in my testing so far.
> 
> Eric

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 10:26   ` [PATCH 0/4] fix depvpts " Eric W. Biederman
       [not found]     ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-03-15 14:00     ` Serge Hallyn
  2013-03-15 14:42         ` Glauber Costa
  2013-03-15 14:42       ` Glauber Costa
  1 sibling, 2 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Glauber Costa, cgroups, Andrew Morton, mtk.manpages,
	Serge Hallyn, linux-fsdevel, containers

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Glauber Costa <glommer@parallels.com> writes:
> 
> > Hi,
> >
> > devpts mounts in user namespaces is queued for 3.9. However, while playing
> > with it I found it to be less than ideal. Although it could possibly work
> > with custom software that can be made to point to /dev/pts/ptmx, a few things
> > prevent it from working correctly for people that, like us, are booting full
> > distributions.
> 
> Full distributions that have not been modified to be minimally container
> aware.

Right, in fact in this case it doesn't need to be minimally container
aware, you just create the bind mount yourself and init just needs to
accept that it shouldn't touch it.

> > In those scenarios, things like udev will kick in, maybe remount /dev undoing
> > any setup we might have done, and then software like sshd or anything else
> > calling openpty will search for /dev/ptmx, not /dev/pts/ptmx.
> 
> I believe udev stopped running in containers a year or so ago.

No, udev runs fine in containers, we just don't allow udevadm trigger.

> > One of the problems that I am addressing in here is that we are disallowing
> > mknod in usernamespaces. Although I understand the motivation for that, I
> > believe that to be too restrictive, specially because we already control access
> > to the files separately. There should be no harm in mknod'ing something per se,
> > if manipulating it is forbidden.
> 
> mknod in userspace needs to be a separate patchset.  There is no need to
> solve mknod in userspace to solve devpts.
> 
> 
> > Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
> > link it to our namespaces'. My proposal is to multiplex it and return the
> > correct "root ptmx" depending on which userns is reading that device.
> 
> Doable.  I still strongly prefer my version of having /dev/ptmx act like
> a link to /dev/pts/ptmx.  Letting the mount namespace control it.

Right, Glauber have you seen this patch?  Eric did already solve this.
(And again that's a nice safeguard, but it shouldn't be necessary)

> In testing that works, and it allows a lot of devpts complexity to just
> go away.  For older versions of udev you can even configure them with a
> rule to make /dev/ptmx a symlink to /dev/pts/ptmx.  Newer versions of
> udev completely gave up on creating devices and can longer be configured
> to do anything useful in this regard.
> 
> So we might even be able to just get away with a bit of udev and
> devtmpfs configuration.

devtmpfs?  Until we get multiple separate mounts of devtmpfs, don't use
it in a container :)

> And treat devpts as if newinstance is always
> specified.  Certainly that has worked in my testing so far.
> 
> Eric

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
       [not found]     ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 14:07       ` Serge Hallyn
@ 2013-03-15 14:07       ` Serge Hallyn
  2013-03-15 19:27       ` Aristeu Rozanski
  2 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> Most of the other subsystems already keep track of that in some way.  We
> will do that internally and provide a test to determine whether or not
> our task is in a device cgroup that is not the root one. We can relax
> some of our checks in that case, trusting that whoever set device cgroup
> rules will be responsible to control access to their devices.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
hit upstream.  As your patches are simpler I'd prefer, if there is
churn, for yours to be refactored than his.

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
>  include/linux/security.h |  1 +
>  security/device_cgroup.c | 15 +++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/security.h b/include/linux/security.h
> index eee7478..fe58f71 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p);
>  extern int cap_task_setioprio(struct task_struct *p, int ioprio);
>  extern int cap_task_setnice(struct task_struct *p, int nice);
>  extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
> +bool *task_in_child_devcgroup(struct task_struct *task);
>  
>  struct msghdr;
>  struct sk_buff;
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 1c69e38..03df5b2 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
>  	return css_to_devcgroup(task_subsys_state(task, devices_subsys_id));
>  }
>  
> +static struct dev_cgroup *root_devcgroup;
> +bool task_in_child_devcgroup(struct task_struct *task)
> +{
> +	bool ret;
> +	rcu_read_lock();
> +	ret = task_devcgroup(task) != root_devcgroup;
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  struct cgroup_subsys devices_subsys;
>  
>  static int devcgroup_can_attach(struct cgroup *new_cgrp,
> @@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
>  	INIT_LIST_HEAD(&dev_cgroup->exceptions);
>  	parent_cgroup = cgroup->parent;
>  
> -	if (parent_cgroup == NULL)
> +	if (parent_cgroup == NULL) {
>  		dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW;
> -	else {
> +		root_devcgroup = dev_cgroup;
> +	} else {
>  		parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup);
>  		mutex_lock(&devcgroup_mutex);
>  		ret = dev_exceptions_copy(&dev_cgroup->exceptions,
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
       [not found]     ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 14:07       ` Serge Hallyn
  2013-03-15 14:43         ` Glauber Costa
  2013-03-15 14:43           ` Glauber Costa
  2013-03-15 14:07       ` Serge Hallyn
  2013-03-15 19:27       ` Aristeu Rozanski
  2 siblings, 2 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, Li Zefan

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> Most of the other subsystems already keep track of that in some way.  We
> will do that internally and provide a test to determine whether or not
> our task is in a device cgroup that is not the root one. We can relax
> some of our checks in that case, trusting that whoever set device cgroup
> rules will be responsible to control access to their devices.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
hit upstream.  As your patches are simpler I'd prefer, if there is
churn, for yours to be refactored than his.

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
>  include/linux/security.h |  1 +
>  security/device_cgroup.c | 15 +++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/security.h b/include/linux/security.h
> index eee7478..fe58f71 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p);
>  extern int cap_task_setioprio(struct task_struct *p, int ioprio);
>  extern int cap_task_setnice(struct task_struct *p, int nice);
>  extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
> +bool *task_in_child_devcgroup(struct task_struct *task);
>  
>  struct msghdr;
>  struct sk_buff;
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 1c69e38..03df5b2 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
>  	return css_to_devcgroup(task_subsys_state(task, devices_subsys_id));
>  }
>  
> +static struct dev_cgroup *root_devcgroup;
> +bool task_in_child_devcgroup(struct task_struct *task)
> +{
> +	bool ret;
> +	rcu_read_lock();
> +	ret = task_devcgroup(task) != root_devcgroup;
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  struct cgroup_subsys devices_subsys;
>  
>  static int devcgroup_can_attach(struct cgroup *new_cgrp,
> @@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
>  	INIT_LIST_HEAD(&dev_cgroup->exceptions);
>  	parent_cgroup = cgroup->parent;
>  
> -	if (parent_cgroup == NULL)
> +	if (parent_cgroup == NULL) {
>  		dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW;
> -	else {
> +		root_devcgroup = dev_cgroup;
> +	} else {
>  		parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup);
>  		mutex_lock(&devcgroup_mutex);
>  		ret = dev_exceptions_copy(&dev_cgroup->exceptions,
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
       [not found]     ` <1363338823-25292-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 14:20       ` Serge Hallyn
  0 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:20 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> So far, unless the filesystem explicitly marks it (and most don't),
> processes running in user namespaces won't be allowed to access any
> devices. Although this makes sense, this is a quite restrictive rule,
> since a lot of those accesses would be perfectly safe: aside from the
> simple char devices in /dev/ like null, zero, etc, it is perfectly
> possible to assign a device for usage inside a namespace if we can
> establish trust in that operation.

Yeah we've talked about that too - as Eric pointed out it's not
strictly necessary as we can set that up with bind mounts from the
host's /dev.  So if he wants to nack - especially temporarily while
other things fall into place - I won't argue, but I'm happy with this.

> We will do that by marking the mount as MNT_NODEV_NS instead of
> MNT_NODEV. This is because if the mount operation explicitly asked for
> nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the
> task is not on a device cgroup. If it is, we will rely on the control
> rules in devcg to intermediate the access an tell us what those tasks
> can or cannot do.

Well the devcg was meant to be a temporary stopgap solution until we
have device namespaces, and this seems to entrench them further, but
it does make sense.

> There is precedence for that with memcg: although we don't explicitly
> test it like I am doing it here, we are allowing tmpfs mounts to happen
> in user namespaces because memcg will contain them.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

> ---
>  fs/namei.c            | 4 ++++
>  fs/namespace.c        | 2 +-
>  include/linux/mount.h | 2 ++
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 57ae9c8..8a34d79 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2356,6 +2356,10 @@ static int may_open(struct path *path, int acc_mode, int flag)
>  	case S_IFCHR:
>  		if (path->mnt->mnt_flags & MNT_NODEV)
>  			return -EACCES;
> +
> +		if ((path->mnt->mnt_flags & MNT_NODEV_NS) &&
> +			!task_in_child_devcgroup(current))
> +			return -EACCES;
>  		/*FALLTHRU*/
>  	case S_IFIFO:
>  	case S_IFSOCK:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 50ca17d..fe8127e 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1935,7 +1935,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
>  		 */
>  		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
>  			flags |= MS_NODEV;
> -			mnt_flags |= MNT_NODEV;
> +			mnt_flags |= MNT_NODEV_NS;
>  		}
>  	}
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index d7029f4..8d190e4 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -32,6 +32,8 @@ struct mnt_namespace;
>  #define MNT_SHRINKABLE	0x100
>  #define MNT_WRITE_HOLD	0x200
>  
> +#define MNT_NODEV_NS	0x400   /* userns mount, and nodev not explicit */
> +
>  #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
>  #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
>  /*
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
  2013-03-15  9:13     ` Glauber Costa
  (?)
@ 2013-03-15 14:20     ` Serge Hallyn
  2013-03-19 15:32       ` Janne Karhunen
  -1 siblings, 1 reply; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:20 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski

Quoting Glauber Costa (glommer@parallels.com):
> So far, unless the filesystem explicitly marks it (and most don't),
> processes running in user namespaces won't be allowed to access any
> devices. Although this makes sense, this is a quite restrictive rule,
> since a lot of those accesses would be perfectly safe: aside from the
> simple char devices in /dev/ like null, zero, etc, it is perfectly
> possible to assign a device for usage inside a namespace if we can
> establish trust in that operation.

Yeah we've talked about that too - as Eric pointed out it's not
strictly necessary as we can set that up with bind mounts from the
host's /dev.  So if he wants to nack - especially temporarily while
other things fall into place - I won't argue, but I'm happy with this.

> We will do that by marking the mount as MNT_NODEV_NS instead of
> MNT_NODEV. This is because if the mount operation explicitly asked for
> nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the
> task is not on a device cgroup. If it is, we will rely on the control
> rules in devcg to intermediate the access an tell us what those tasks
> can or cannot do.

Well the devcg was meant to be a temporary stopgap solution until we
have device namespaces, and this seems to entrench them further, but
it does make sense.

> There is precedence for that with memcg: although we don't explicitly
> test it like I am doing it here, we are allowing tmpfs mounts to happen
> in user namespaces because memcg will contain them.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Aristeu Rozanski <aris@redhat.com>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>

> ---
>  fs/namei.c            | 4 ++++
>  fs/namespace.c        | 2 +-
>  include/linux/mount.h | 2 ++
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 57ae9c8..8a34d79 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2356,6 +2356,10 @@ static int may_open(struct path *path, int acc_mode, int flag)
>  	case S_IFCHR:
>  		if (path->mnt->mnt_flags & MNT_NODEV)
>  			return -EACCES;
> +
> +		if ((path->mnt->mnt_flags & MNT_NODEV_NS) &&
> +			!task_in_child_devcgroup(current))
> +			return -EACCES;
>  		/*FALLTHRU*/
>  	case S_IFIFO:
>  	case S_IFSOCK:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 50ca17d..fe8127e 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1935,7 +1935,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
>  		 */
>  		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
>  			flags |= MS_NODEV;
> -			mnt_flags |= MNT_NODEV;
> +			mnt_flags |= MNT_NODEV_NS;
>  		}
>  	}
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index d7029f4..8d190e4 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -32,6 +32,8 @@ struct mnt_namespace;
>  #define MNT_SHRINKABLE	0x100
>  #define MNT_WRITE_HOLD	0x200
>  
> +#define MNT_NODEV_NS	0x400   /* userns mount, and nodev not explicit */
> +
>  #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
>  #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
>  /*
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]     ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 14:37       ` Serge Hallyn
@ 2013-03-15 14:37       ` Serge Hallyn
  2013-03-15 18:03       ` Vasily Kulikov
  2013-03-15 20:43       ` Eric W. Biederman
  3 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:37 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> ---
>  fs/namei.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 8a34d79..d0b4549 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
>  	if (error)
>  		return error;
>  
> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))

I realize you're arguing that devicens is enough, but how about
doing inode_capable(dir, CAP_MKNOD) instead?

>  		return -EPERM;
>  
>  	if (!dir->i_op->mknod)
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]     ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 14:37       ` Serge Hallyn
  2013-03-15 14:49           ` Glauber Costa
  2013-03-15 14:49         ` Glauber Costa
  2013-03-15 14:37       ` Serge Hallyn
                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:37 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> ---
>  fs/namei.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 8a34d79..d0b4549 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
>  	if (error)
>  		return error;
>  
> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))

I realize you're arguing that devicens is enough, but how about
doing inode_capable(dir, CAP_MKNOD) instead?

>  		return -EPERM;
>  
>  	if (!dir->i_op->mknod)
> -- 
> 1.8.1.2
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 14:00     ` Serge Hallyn
  2013-03-15 14:42         ` Glauber Costa
@ 2013-03-15 14:42       ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:42 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

On 03/15/2013 06:00 PM, Serge Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>
>>> Hi,
>>>
>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>> with it I found it to be less than ideal. Although it could possibly work
>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>> prevent it from working correctly for people that, like us, are booting full
>>> distributions.
>>
>> Full distributions that have not been modified to be minimally container
>> aware.
> 
> Right, in fact in this case it doesn't need to be minimally container
> aware, you just create the bind mount yourself and init just needs to
> accept that it shouldn't touch it.
> 

Well, what if it doesn't?

At least in the system I am using, centos6, udev mounts a tmpfs in a
temporary location, and then mount --move this to /dev. This is now
empty, and devpts will be mounted ontop of that.

Although it can be changed, of course, it is very likely to be due to
its age. And that is not even the oldest distribution around.

Now, both operations are totally valid inside namespaces. Both mount
--move and mounting tmpfs. If there were any way to identify those
specific mounts and block them, I would be fine.

But so far, given my understanding, you guys are asking me to either go
convince people to change their very old stable distributions, or
complicate deployment with all sorts of special cases for them.

I fully agree that the behavior you describe is the best behavior if it
can be done, but I am not satisfied with the answer that legacy
distributions should somehow be adapted.

Let me reverse the question: If you bind mount /dev/pts and then udev
never touches it, etc, does my solution affects that in anyway? The way
I see it, we just become more capable of running legacy system without
giving nothing in return aside from code. And it is not even an
extremely complex code.

>>> One of the problems that I am addressing in here is that we are disallowing
>>> mknod in usernamespaces. Although I understand the motivation for that, I
>>> believe that to be too restrictive, specially because we already control access
>>> to the files separately. There should be no harm in mknod'ing something per se,
>>> if manipulating it is forbidden.
>>
>> mknod in userspace needs to be a separate patchset.  There is no need to
>> solve mknod in userspace to solve devpts.
>>
>>
>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>>> link it to our namespaces'. My proposal is to multiplex it and return the
>>> correct "root ptmx" depending on which userns is reading that device.
>>
>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> 
> Right, Glauber have you seen this patch?  Eric did already solve this.
> (And again that's a nice safeguard, but it shouldn't be necessary)
> 
No. Where was that sent to?

If you can point me to it, I am of course willing to test it. If it
solves my problem (the description suggests that there is high
probability), then I have no particular attachments to my specific solution.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 14:00     ` Serge Hallyn
@ 2013-03-15 14:42         ` Glauber Costa
  2013-03-15 14:42       ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:42 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 03/15/2013 06:00 PM, Serge Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>
>>> Hi,
>>>
>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>> with it I found it to be less than ideal. Although it could possibly work
>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>> prevent it from working correctly for people that, like us, are booting full
>>> distributions.
>>
>> Full distributions that have not been modified to be minimally container
>> aware.
> 
> Right, in fact in this case it doesn't need to be minimally container
> aware, you just create the bind mount yourself and init just needs to
> accept that it shouldn't touch it.
> 

Well, what if it doesn't?

At least in the system I am using, centos6, udev mounts a tmpfs in a
temporary location, and then mount --move this to /dev. This is now
empty, and devpts will be mounted ontop of that.

Although it can be changed, of course, it is very likely to be due to
its age. And that is not even the oldest distribution around.

Now, both operations are totally valid inside namespaces. Both mount
--move and mounting tmpfs. If there were any way to identify those
specific mounts and block them, I would be fine.

But so far, given my understanding, you guys are asking me to either go
convince people to change their very old stable distributions, or
complicate deployment with all sorts of special cases for them.

I fully agree that the behavior you describe is the best behavior if it
can be done, but I am not satisfied with the answer that legacy
distributions should somehow be adapted.

Let me reverse the question: If you bind mount /dev/pts and then udev
never touches it, etc, does my solution affects that in anyway? The way
I see it, we just become more capable of running legacy system without
giving nothing in return aside from code. And it is not even an
extremely complex code.

>>> One of the problems that I am addressing in here is that we are disallowing
>>> mknod in usernamespaces. Although I understand the motivation for that, I
>>> believe that to be too restrictive, specially because we already control access
>>> to the files separately. There should be no harm in mknod'ing something per se,
>>> if manipulating it is forbidden.
>>
>> mknod in userspace needs to be a separate patchset.  There is no need to
>> solve mknod in userspace to solve devpts.
>>
>>
>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>>> link it to our namespaces'. My proposal is to multiplex it and return the
>>> correct "root ptmx" depending on which userns is reading that device.
>>
>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> 
> Right, Glauber have you seen this patch?  Eric did already solve this.
> (And again that's a nice safeguard, but it shouldn't be necessary)
> 
No. Where was that sent to?

If you can point me to it, I am of course willing to test it. If it
solves my problem (the description suggests that there is high
probability), then I have no particular attachments to my specific solution.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
@ 2013-03-15 14:42         ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:42 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 03/15/2013 06:00 PM, Serge Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>
>>> Hi,
>>>
>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>> with it I found it to be less than ideal. Although it could possibly work
>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>> prevent it from working correctly for people that, like us, are booting full
>>> distributions.
>>
>> Full distributions that have not been modified to be minimally container
>> aware.
> 
> Right, in fact in this case it doesn't need to be minimally container
> aware, you just create the bind mount yourself and init just needs to
> accept that it shouldn't touch it.
> 

Well, what if it doesn't?

At least in the system I am using, centos6, udev mounts a tmpfs in a
temporary location, and then mount --move this to /dev. This is now
empty, and devpts will be mounted ontop of that.

Although it can be changed, of course, it is very likely to be due to
its age. And that is not even the oldest distribution around.

Now, both operations are totally valid inside namespaces. Both mount
--move and mounting tmpfs. If there were any way to identify those
specific mounts and block them, I would be fine.

But so far, given my understanding, you guys are asking me to either go
convince people to change their very old stable distributions, or
complicate deployment with all sorts of special cases for them.

I fully agree that the behavior you describe is the best behavior if it
can be done, but I am not satisfied with the answer that legacy
distributions should somehow be adapted.

Let me reverse the question: If you bind mount /dev/pts and then udev
never touches it, etc, does my solution affects that in anyway? The way
I see it, we just become more capable of running legacy system without
giving nothing in return aside from code. And it is not even an
extremely complex code.

>>> One of the problems that I am addressing in here is that we are disallowing
>>> mknod in usernamespaces. Although I understand the motivation for that, I
>>> believe that to be too restrictive, specially because we already control access
>>> to the files separately. There should be no harm in mknod'ing something per se,
>>> if manipulating it is forbidden.
>>
>> mknod in userspace needs to be a separate patchset.  There is no need to
>> solve mknod in userspace to solve devpts.
>>
>>
>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>>> link it to our namespaces'. My proposal is to multiplex it and return the
>>> correct "root ptmx" depending on which userns is reading that device.
>>
>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> 
> Right, Glauber have you seen this patch?  Eric did already solve this.
> (And again that's a nice safeguard, but it shouldn't be necessary)
> 
No. Where was that sent to?

If you can point me to it, I am of course willing to test it. If it
solves my problem (the description suggests that there is high
probability), then I have no particular attachments to my specific solution.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
  2013-03-15 14:07       ` Serge Hallyn
@ 2013-03-15 14:43         ` Glauber Costa
  2013-03-15 14:43           ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:43 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

On 03/15/2013 06:07 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>> Most of the other subsystems already keep track of that in some way.  We
>> will do that internally and provide a test to determine whether or not
>> our task is in a device cgroup that is not the root one. We can relax
>> some of our checks in that case, trusting that whoever set device cgroup
>> rules will be responsible to control access to their devices.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
> hit upstream.  As your patches are simpler I'd prefer, if there is
> churn, for yours to be refactored than his.
> 
I have no problem with that. There is also a small build issue here that
needs to be fixed. If you allow me, with my guarantees that the patch
will be preserved in spirit I will keep your ack after any refactoring =)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
  2013-03-15 14:07       ` Serge Hallyn
@ 2013-03-15 14:43           ` Glauber Costa
  2013-03-15 14:43           ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:43 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, Li Zefan

On 03/15/2013 06:07 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>> Most of the other subsystems already keep track of that in some way.  We
>> will do that internally and provide a test to determine whether or not
>> our task is in a device cgroup that is not the root one. We can relax
>> some of our checks in that case, trusting that whoever set device cgroup
>> rules will be responsible to control access to their devices.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
> hit upstream.  As your patches are simpler I'd prefer, if there is
> churn, for yours to be refactored than his.
> 
I have no problem with that. There is also a small build issue here that
needs to be fixed. If you allow me, with my guarantees that the patch
will be preserved in spirit I will keep your ack after any refactoring =)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
@ 2013-03-15 14:43           ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:43 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, Li Zefan

On 03/15/2013 06:07 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>> Most of the other subsystems already keep track of that in some way.  We
>> will do that internally and provide a test to determine whether or not
>> our task is in a device cgroup that is not the root one. We can relax
>> some of our checks in that case, trusting that whoever set device cgroup
>> rules will be responsible to control access to their devices.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> 
> Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
> hit upstream.  As your patches are simpler I'd prefer, if there is
> churn, for yours to be refactored than his.
> 
I have no problem with that. There is also a small build issue here that
needs to be fixed. If you allow me, with my guarantees that the patch
will be preserved in spirit I will keep your ack after any refactoring =)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/4] devpts: fix usage in user namespaces
       [not found]     ` <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 14:45       ` Serge Hallyn
@ 2013-03-15 14:45       ` Serge Hallyn
  1 sibling, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):

Interesting approach, neat.  But I'll let you and Eric fight it
out :)

thanks,
-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/4] devpts: fix usage in user namespaces
       [not found]     ` <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 14:45       ` Serge Hallyn
  2013-03-15 14:45       ` Serge Hallyn
  1 sibling, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):

Interesting approach, neat.  But I'll let you and Eric fight it
out :)

thanks,
-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
  2013-03-15 14:37       ` Serge Hallyn
  2013-03-15 14:49           ` Glauber Costa
@ 2013-03-15 14:49         ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:49 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

On 03/15/2013 06:37 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>> Since we have strict control on who access the devices, it should be
>> no problem to allow the device to appear.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
>> ---
>>  fs/namei.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 8a34d79..d0b4549 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
>>  	if (error)
>>  		return error;
>>  
>> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
>> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
> 
> I realize you're arguing that devicens is enough, but how about
> doing inode_capable(dir, CAP_MKNOD) instead?
> 
I see no reason not to do it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
  2013-03-15 14:37       ` Serge Hallyn
@ 2013-03-15 14:49           ` Glauber Costa
  2013-03-15 14:49         ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:49 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski

On 03/15/2013 06:37 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> Since we have strict control on who access the devices, it should be
>> no problem to allow the device to appear.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Cc: Aristeu Rozanski <aris@redhat.com>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Serge Hallyn <serge.hallyn@canonical.com>
>> ---
>>  fs/namei.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 8a34d79..d0b4549 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
>>  	if (error)
>>  		return error;
>>  
>> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
>> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
> 
> I realize you're arguing that devicens is enough, but how about
> doing inode_capable(dir, CAP_MKNOD) instead?
> 
I see no reason not to do it.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
@ 2013-03-15 14:49           ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 14:49 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski

On 03/15/2013 06:37 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> Since we have strict control on who access the devices, it should be
>> no problem to allow the device to appear.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Cc: Aristeu Rozanski <aris@redhat.com>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Serge Hallyn <serge.hallyn@canonical.com>
>> ---
>>  fs/namei.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 8a34d79..d0b4549 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
>>  	if (error)
>>  		return error;
>>  
>> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
>> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
> 
> I realize you're arguing that devicens is enough, but how about
> doing inode_capable(dir, CAP_MKNOD) instead?
> 
I see no reason not to do it.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
       [not found]           ` <514333A2.5060408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 14:55             ` Serge Hallyn
  0 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 03/15/2013 06:07 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> >> Most of the other subsystems already keep track of that in some way.  We
> >> will do that internally and provide a test to determine whether or not
> >> our task is in a device cgroup that is not the root one. We can relax
> >> some of our checks in that case, trusting that whoever set device cgroup
> >> rules will be responsible to control access to their devices.
> >>
> >> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> >> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> >> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> > 
> > Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
> > hit upstream.  As your patches are simpler I'd prefer, if there is
> > churn, for yours to be refactored than his.
> > 
> I have no problem with that. There is also a small build issue here that
> needs to be fixed. If you allow me, with my guarantees that the patch
> will be preserved in spirit I will keep your ack after any refactoring =)
> 

Of course :)  Please keep my ack on it so long as you don't feel any
substantive changes were needed :)

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
  2013-03-15 14:43           ` Glauber Costa
  (?)
  (?)
@ 2013-03-15 14:55           ` Serge Hallyn
  -1 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 14:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, Andrew Morton, mtk.manpages, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski,
	Li Zefan

Quoting Glauber Costa (glommer@parallels.com):
> On 03/15/2013 06:07 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer@parallels.com):
> >> Most of the other subsystems already keep track of that in some way.  We
> >> will do that internally and provide a test to determine whether or not
> >> our task is in a device cgroup that is not the root one. We can relax
> >> some of our checks in that case, trusting that whoever set device cgroup
> >> rules will be responsible to control access to their devices.
> >>
> >> Signed-off-by: Glauber Costa <glommer@parallels.com>
> >> Cc: Aristeu Rozanski <aris@redhat.com>
> >> Cc: Eric Biederman <ebiederm@xmission.com>
> >> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> > 
> > Patch looks fine.  AFAIK we're still waiting on Aristeu's patchset to
> > hit upstream.  As your patches are simpler I'd prefer, if there is
> > churn, for yours to be refactored than his.
> > 
> I have no problem with that. There is also a small build issue here that
> needs to be fixed. If you allow me, with my guarantees that the patch
> will be preserved in spirit I will keep your ack after any refactoring =)
> 

Of course :)  Please keep my ack on it so long as you don't feel any
substantive changes were needed :)

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]           ` <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 15:14             ` Serge Hallyn
  2013-03-15 15:14             ` Serge Hallyn
  1 sibling, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 15:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 03/15/2013 06:37 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> >> Since we have strict control on who access the devices, it should be
> >> no problem to allow the device to appear.
> >>
> >> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> >> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> >> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> >> ---
> >>  fs/namei.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/fs/namei.c b/fs/namei.c
> >> index 8a34d79..d0b4549 100644
> >> --- a/fs/namei.c
> >> +++ b/fs/namei.c
> >> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
> >>  	if (error)
> >>  		return error;
> >>  
> >> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> >> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
> > 
> > I realize you're arguing that devicens is enough, but how about
> > doing inode_capable(dir, CAP_MKNOD) instead?
> > 
> I see no reason not to do it.

Cool, with that

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

thanks.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]           ` <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 15:14             ` Serge Hallyn
@ 2013-03-15 15:14             ` Serge Hallyn
  1 sibling, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 15:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Eric W. Biederman,
	Serge Hallyn, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 03/15/2013 06:37 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> >> Since we have strict control on who access the devices, it should be
> >> no problem to allow the device to appear.
> >>
> >> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> >> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> >> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> >> ---
> >>  fs/namei.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/fs/namei.c b/fs/namei.c
> >> index 8a34d79..d0b4549 100644
> >> --- a/fs/namei.c
> >> +++ b/fs/namei.c
> >> @@ -3126,7 +3126,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
> >>  	if (error)
> >>  		return error;
> >>  
> >> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> >> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
> > 
> > I realize you're arguing that devicens is enough, but how about
> > doing inode_capable(dir, CAP_MKNOD) instead?
> > 
> I see no reason not to do it.

Cool, with that

Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

thanks.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]         ` <5143333E.1040100-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 15:21           ` Serge Hallyn
  2013-03-15 15:26             ` Glauber Costa
  2013-03-15 15:26               ` Glauber Costa
  0 siblings, 2 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 15:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
> > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> >>
> >>> Hi,
> >>>
> >>> devpts mounts in user namespaces is queued for 3.9. However, while playing
> >>> with it I found it to be less than ideal. Although it could possibly work
> >>> with custom software that can be made to point to /dev/pts/ptmx, a few things
> >>> prevent it from working correctly for people that, like us, are booting full
> >>> distributions.
> >>
> >> Full distributions that have not been modified to be minimally container
> >> aware.
> > 
> > Right, in fact in this case it doesn't need to be minimally container
> > aware, you just create the bind mount yourself and init just needs to
> > accept that it shouldn't touch it.
> > 
> 
> Well, what if it doesn't?
> 
> At least in the system I am using, centos6, udev mounts a tmpfs in a
> temporary location, and then mount --move this to /dev. This is now
> empty, and devpts will be mounted ontop of that.

This also messes up your /dev/ttyN setup right?  How are you handling
that?

Usually in cases like this I've found that either (1) pre-mounting the
fs caused the container-unaware service to not mount it (which in this
case isn't happening) or (2) the service might tolerate my preventing
the offensive mount (using apparmor by pathname, you could also do
it with a special selinux label on /dev).

But like you say in this specific case I see no reason not to fix it
in the kernel.

> Although it can be changed, of course, it is very likely to be due to
> its age. And that is not even the oldest distribution around.
> 
> Now, both operations are totally valid inside namespaces. Both mount
> --move and mounting tmpfs. If there were any way to identify those
> specific mounts and block them, I would be fine.
> 
> But so far, given my understanding, you guys are asking me to either go
> convince people to change their very old stable distributions, or
> complicate deployment with all sorts of special cases for them.
> 
> I fully agree that the behavior you describe is the best behavior if it
> can be done, but I am not satisfied with the answer that legacy
> distributions should somehow be adapted.
> 
> Let me reverse the question: If you bind mount /dev/pts and then udev
> never touches it, etc, does my solution affects that in anyway? The way
> I see it, we just become more capable of running legacy system without
> giving nothing in return aside from code. And it is not even an
> extremely complex code.

Right - I don't object to your patch, I just wanted to see if you and
Eric could agree on which one to use.  After that I'll do a closer
review - but on first look it looked good to me.

> >>> One of the problems that I am addressing in here is that we are disallowing
> >>> mknod in usernamespaces. Although I understand the motivation for that, I
> >>> believe that to be too restrictive, specially because we already control access
> >>> to the files separately. There should be no harm in mknod'ing something per se,
> >>> if manipulating it is forbidden.
> >>
> >> mknod in userspace needs to be a separate patchset.  There is no need to
> >> solve mknod in userspace to solve devpts.
> >>
> >>
> >>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
> >>> link it to our namespaces'. My proposal is to multiplex it and return the
> >>> correct "root ptmx" depending on which userns is reading that device.
> >>
> >> Doable.  I still strongly prefer my version of having /dev/ptmx act like
> >> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> > 
> > Right, Glauber have you seen this patch?  Eric did already solve this.
> > (And again that's a nice safeguard, but it shouldn't be necessary)
> > 
> No. Where was that sent to?
> 
> If you can point me to it, I am of course willing to test it. If it
> solves my problem (the description suggests that there is high
> probability), then I have no particular attachments to my specific solution.

Well shoot, I can't find it right now.  Not even in Eric's git tree.
IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
and open that instead.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 15:21           ` Serge Hallyn
@ 2013-03-15 15:26             ` Glauber Costa
  2013-03-15 15:26               ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 15:26 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

On 03/15/2013 07:21 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>>>> with it I found it to be less than ideal. Although it could possibly work
>>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>>>> prevent it from working correctly for people that, like us, are booting full
>>>>> distributions.
>>>>
>>>> Full distributions that have not been modified to be minimally container
>>>> aware.
>>>
>>> Right, in fact in this case it doesn't need to be minimally container
>>> aware, you just create the bind mount yourself and init just needs to
>>> accept that it shouldn't touch it.
>>>
>>
>> Well, what if it doesn't?
>>
>> At least in the system I am using, centos6, udev mounts a tmpfs in a
>> temporary location, and then mount --move this to /dev. This is now
>> empty, and devpts will be mounted ontop of that.
> 
> This also messes up your /dev/ttyN setup right?  How are you handling
> that?
> 
very simple: udev will just mknod everything back, so let him!

>> Let me reverse the question: If you bind mount /dev/pts and then udev
>> never touches it, etc, does my solution affects that in anyway? The way
>> I see it, we just become more capable of running legacy system without
>> giving nothing in return aside from code. And it is not even an
>> extremely complex code.
> 
> Right - I don't object to your patch, I just wanted to see if you and
> Eric could agree on which one to use.  After that I'll do a closer
> review - but on first look it looked good to me.
> 

As I said, as far as the specific part of the puzzle concerning the
/dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works.

>>>>> One of the problems that I am addressing in here is that we are disallowing
>>>>> mknod in usernamespaces. Although I understand the motivation for that, I
>>>>> believe that to be too restrictive, specially because we already control access
>>>>> to the files separately. There should be no harm in mknod'ing something per se,
>>>>> if manipulating it is forbidden.
>>>>
>>>> mknod in userspace needs to be a separate patchset.  There is no need to
>>>> solve mknod in userspace to solve devpts.
>>>>
>>>>
>>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>>>>> link it to our namespaces'. My proposal is to multiplex it and return the
>>>>> correct "root ptmx" depending on which userns is reading that device.
>>>>
>>>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
>>>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
>>>
>>> Right, Glauber have you seen this patch?  Eric did already solve this.
>>> (And again that's a nice safeguard, but it shouldn't be necessary)
>>>
>> No. Where was that sent to?
>>
>> If you can point me to it, I am of course willing to test it. If it
>> solves my problem (the description suggests that there is high
>> probability), then I have no particular attachments to my specific solution.
> 
> Well shoot, I can't find it right now.  Not even in Eric's git tree.
> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
> and open that instead.
>
Which gives a very good explanation about why haven't I seen it =)
Eric ?

What it a /dev/ptmx already exist? will it use it? That would be bad,
since that /dev/ptmx could be a host-side one. I actually believe
linking to $rootfs/dev/pts/ptmx is more robust than my solution against
remounts. So provided it can guarantee that the ptmx is not ever the
root ptmx, I would ack that.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 15:21           ` Serge Hallyn
@ 2013-03-15 15:26               ` Glauber Costa
  2013-03-15 15:26               ` Glauber Costa
  1 sibling, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 15:26 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, cgroups, Andrew Morton, mtk.manpages,
	Serge Hallyn, linux-fsdevel, containers

On 03/15/2013 07:21 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>> Glauber Costa <glommer@parallels.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>>>> with it I found it to be less than ideal. Although it could possibly work
>>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>>>> prevent it from working correctly for people that, like us, are booting full
>>>>> distributions.
>>>>
>>>> Full distributions that have not been modified to be minimally container
>>>> aware.
>>>
>>> Right, in fact in this case it doesn't need to be minimally container
>>> aware, you just create the bind mount yourself and init just needs to
>>> accept that it shouldn't touch it.
>>>
>>
>> Well, what if it doesn't?
>>
>> At least in the system I am using, centos6, udev mounts a tmpfs in a
>> temporary location, and then mount --move this to /dev. This is now
>> empty, and devpts will be mounted ontop of that.
> 
> This also messes up your /dev/ttyN setup right?  How are you handling
> that?
> 
very simple: udev will just mknod everything back, so let him!

>> Let me reverse the question: If you bind mount /dev/pts and then udev
>> never touches it, etc, does my solution affects that in anyway? The way
>> I see it, we just become more capable of running legacy system without
>> giving nothing in return aside from code. And it is not even an
>> extremely complex code.
> 
> Right - I don't object to your patch, I just wanted to see if you and
> Eric could agree on which one to use.  After that I'll do a closer
> review - but on first look it looked good to me.
> 

As I said, as far as the specific part of the puzzle concerning the
/dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works.

>>>>> One of the problems that I am addressing in here is that we are disallowing
>>>>> mknod in usernamespaces. Although I understand the motivation for that, I
>>>>> believe that to be too restrictive, specially because we already control access
>>>>> to the files separately. There should be no harm in mknod'ing something per se,
>>>>> if manipulating it is forbidden.
>>>>
>>>> mknod in userspace needs to be a separate patchset.  There is no need to
>>>> solve mknod in userspace to solve devpts.
>>>>
>>>>
>>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>>>>> link it to our namespaces'. My proposal is to multiplex it and return the
>>>>> correct "root ptmx" depending on which userns is reading that device.
>>>>
>>>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
>>>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
>>>
>>> Right, Glauber have you seen this patch?  Eric did already solve this.
>>> (And again that's a nice safeguard, but it shouldn't be necessary)
>>>
>> No. Where was that sent to?
>>
>> If you can point me to it, I am of course willing to test it. If it
>> solves my problem (the description suggests that there is high
>> probability), then I have no particular attachments to my specific solution.
> 
> Well shoot, I can't find it right now.  Not even in Eric's git tree.
> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
> and open that instead.
>
Which gives a very good explanation about why haven't I seen it =)
Eric ?

What it a /dev/ptmx already exist? will it use it? That would be bad,
since that /dev/ptmx could be a host-side one. I actually believe
linking to $rootfs/dev/pts/ptmx is more robust than my solution against
remounts. So provided it can guarantee that the ptmx is not ever the
root ptmx, I would ack that.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
@ 2013-03-15 15:26               ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 15:26 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Eric W. Biederman, cgroups, Andrew Morton, mtk.manpages,
	Serge Hallyn, linux-fsdevel, containers

On 03/15/2013 07:21 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>> Glauber Costa <glommer@parallels.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>>>> with it I found it to be less than ideal. Although it could possibly work
>>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>>>> prevent it from working correctly for people that, like us, are booting full
>>>>> distributions.
>>>>
>>>> Full distributions that have not been modified to be minimally container
>>>> aware.
>>>
>>> Right, in fact in this case it doesn't need to be minimally container
>>> aware, you just create the bind mount yourself and init just needs to
>>> accept that it shouldn't touch it.
>>>
>>
>> Well, what if it doesn't?
>>
>> At least in the system I am using, centos6, udev mounts a tmpfs in a
>> temporary location, and then mount --move this to /dev. This is now
>> empty, and devpts will be mounted ontop of that.
> 
> This also messes up your /dev/ttyN setup right?  How are you handling
> that?
> 
very simple: udev will just mknod everything back, so let him!

>> Let me reverse the question: If you bind mount /dev/pts and then udev
>> never touches it, etc, does my solution affects that in anyway? The way
>> I see it, we just become more capable of running legacy system without
>> giving nothing in return aside from code. And it is not even an
>> extremely complex code.
> 
> Right - I don't object to your patch, I just wanted to see if you and
> Eric could agree on which one to use.  After that I'll do a closer
> review - but on first look it looked good to me.
> 

As I said, as far as the specific part of the puzzle concerning the
/dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works.

>>>>> One of the problems that I am addressing in here is that we are disallowing
>>>>> mknod in usernamespaces. Although I understand the motivation for that, I
>>>>> believe that to be too restrictive, specially because we already control access
>>>>> to the files separately. There should be no harm in mknod'ing something per se,
>>>>> if manipulating it is forbidden.
>>>>
>>>> mknod in userspace needs to be a separate patchset.  There is no need to
>>>> solve mknod in userspace to solve devpts.
>>>>
>>>>
>>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
>>>>> link it to our namespaces'. My proposal is to multiplex it and return the
>>>>> correct "root ptmx" depending on which userns is reading that device.
>>>>
>>>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
>>>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
>>>
>>> Right, Glauber have you seen this patch?  Eric did already solve this.
>>> (And again that's a nice safeguard, but it shouldn't be necessary)
>>>
>> No. Where was that sent to?
>>
>> If you can point me to it, I am of course willing to test it. If it
>> solves my problem (the description suggests that there is high
>> probability), then I have no particular attachments to my specific solution.
> 
> Well shoot, I can't find it right now.  Not even in Eric's git tree.
> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
> and open that instead.
>
Which gives a very good explanation about why haven't I seen it =)
Eric ?

What it a /dev/ptmx already exist? will it use it? That would be bad,
since that /dev/ptmx could be a host-side one. I actually believe
linking to $rootfs/dev/pts/ptmx is more robust than my solution against
remounts. So provided it can guarantee that the ptmx is not ever the
root ptmx, I would ack that.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]               ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 15:58                 ` Serge Hallyn
@ 2013-03-15 15:58                 ` Serge Hallyn
  2013-03-15 21:02                 ` Eric W. Biederman
  2 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 15:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 03/15/2013 07:21 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> >> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
> >>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
> >>>>> with it I found it to be less than ideal. Although it could possibly work
> >>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
> >>>>> prevent it from working correctly for people that, like us, are booting full
> >>>>> distributions.
> >>>>
> >>>> Full distributions that have not been modified to be minimally container
> >>>> aware.
> >>>
> >>> Right, in fact in this case it doesn't need to be minimally container
> >>> aware, you just create the bind mount yourself and init just needs to
> >>> accept that it shouldn't touch it.
> >>>
> >>
> >> Well, what if it doesn't?
> >>
> >> At least in the system I am using, centos6, udev mounts a tmpfs in a
> >> temporary location, and then mount --move this to /dev. This is now
> >> empty, and devpts will be mounted ontop of that.
> > 
> > This also messes up your /dev/ttyN setup right?  How are you handling
> > that?
> > 
> very simple: udev will just mknod everything back, so let him!

So you're not using bind-mounted ptys over /dev/ttyN?

> >> Let me reverse the question: If you bind mount /dev/pts and then udev
> >> never touches it, etc, does my solution affects that in anyway? The way
> >> I see it, we just become more capable of running legacy system without
> >> giving nothing in return aside from code. And it is not even an
> >> extremely complex code.
> > 
> > Right - I don't object to your patch, I just wanted to see if you and
> > Eric could agree on which one to use.  After that I'll do a closer
> > review - but on first look it looked good to me.
> > 
> 
> As I said, as far as the specific part of the puzzle concerning the
> /dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works.

ditto.

> >>>>> One of the problems that I am addressing in here is that we are disallowing
> >>>>> mknod in usernamespaces. Although I understand the motivation for that, I
> >>>>> believe that to be too restrictive, specially because we already control access
> >>>>> to the files separately. There should be no harm in mknod'ing something per se,
> >>>>> if manipulating it is forbidden.
> >>>>
> >>>> mknod in userspace needs to be a separate patchset.  There is no need to
> >>>> solve mknod in userspace to solve devpts.
> >>>>
> >>>>
> >>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
> >>>>> link it to our namespaces'. My proposal is to multiplex it and return the
> >>>>> correct "root ptmx" depending on which userns is reading that device.
> >>>>
> >>>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
> >>>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> >>>
> >>> Right, Glauber have you seen this patch?  Eric did already solve this.
> >>> (And again that's a nice safeguard, but it shouldn't be necessary)
> >>>
> >> No. Where was that sent to?
> >>
> >> If you can point me to it, I am of course willing to test it. If it
> >> solves my problem (the description suggests that there is high
> >> probability), then I have no particular attachments to my specific solution.
> > 
> > Well shoot, I can't find it right now.  Not even in Eric's git tree.
> > IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
> > and open that instead.
> >
> Which gives a very good explanation about why haven't I seen it =)
> Eric ?
> 
> What it a /dev/ptmx already exist? will it use it? That would be bad,

No, I think it was the open handler on the ptmx node that did it.  If
the fs underlying the ptmx file wasn't devpts then it would do the
path lookup.  So then if /dev/pts/ptmx was actually the host one, then
you get what you asked for, but it's not like a symlink (where the
link could be to host /dev/pts/ptmx whereas current task has a different
/dev/pts/ptmx)

> since that /dev/ptmx could be a host-side one. I actually believe
> linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> remounts. So provided it can guarantee that the ptmx is not ever the
> root ptmx, I would ack that.
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]               ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2013-03-15 15:58                 ` Serge Hallyn
  2013-03-15 16:01                   ` Glauber Costa
  2013-03-15 15:58                 ` Serge Hallyn
  2013-03-15 21:02                 ` Eric W. Biederman
  2 siblings, 1 reply; 65+ messages in thread
From: Serge Hallyn @ 2013-03-15 15:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Eric W. Biederman, cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Serge Hallyn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 03/15/2013 07:21 PM, Serge Hallyn wrote:
> > Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> >> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
> >>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> >>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
> >>>>> with it I found it to be less than ideal. Although it could possibly work
> >>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
> >>>>> prevent it from working correctly for people that, like us, are booting full
> >>>>> distributions.
> >>>>
> >>>> Full distributions that have not been modified to be minimally container
> >>>> aware.
> >>>
> >>> Right, in fact in this case it doesn't need to be minimally container
> >>> aware, you just create the bind mount yourself and init just needs to
> >>> accept that it shouldn't touch it.
> >>>
> >>
> >> Well, what if it doesn't?
> >>
> >> At least in the system I am using, centos6, udev mounts a tmpfs in a
> >> temporary location, and then mount --move this to /dev. This is now
> >> empty, and devpts will be mounted ontop of that.
> > 
> > This also messes up your /dev/ttyN setup right?  How are you handling
> > that?
> > 
> very simple: udev will just mknod everything back, so let him!

So you're not using bind-mounted ptys over /dev/ttyN?

> >> Let me reverse the question: If you bind mount /dev/pts and then udev
> >> never touches it, etc, does my solution affects that in anyway? The way
> >> I see it, we just become more capable of running legacy system without
> >> giving nothing in return aside from code. And it is not even an
> >> extremely complex code.
> > 
> > Right - I don't object to your patch, I just wanted to see if you and
> > Eric could agree on which one to use.  After that I'll do a closer
> > review - but on first look it looked good to me.
> > 
> 
> As I said, as far as the specific part of the puzzle concerning the
> /dev/ptmx to /dev/pts/ptmx mapping, I am fine with whatever works.

ditto.

> >>>>> One of the problems that I am addressing in here is that we are disallowing
> >>>>> mknod in usernamespaces. Although I understand the motivation for that, I
> >>>>> believe that to be too restrictive, specially because we already control access
> >>>>> to the files separately. There should be no harm in mknod'ing something per se,
> >>>>> if manipulating it is forbidden.
> >>>>
> >>>> mknod in userspace needs to be a separate patchset.  There is no need to
> >>>> solve mknod in userspace to solve devpts.
> >>>>
> >>>>
> >>>>> Last, /dev/ptmx will still always be the global ptmx device. We need to somehow
> >>>>> link it to our namespaces'. My proposal is to multiplex it and return the
> >>>>> correct "root ptmx" depending on which userns is reading that device.
> >>>>
> >>>> Doable.  I still strongly prefer my version of having /dev/ptmx act like
> >>>> a link to /dev/pts/ptmx.  Letting the mount namespace control it.
> >>>
> >>> Right, Glauber have you seen this patch?  Eric did already solve this.
> >>> (And again that's a nice safeguard, but it shouldn't be necessary)
> >>>
> >> No. Where was that sent to?
> >>
> >> If you can point me to it, I am of course willing to test it. If it
> >> solves my problem (the description suggests that there is high
> >> probability), then I have no particular attachments to my specific solution.
> > 
> > Well shoot, I can't find it right now.  Not even in Eric's git tree.
> > IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
> > and open that instead.
> >
> Which gives a very good explanation about why haven't I seen it =)
> Eric ?
> 
> What it a /dev/ptmx already exist? will it use it? That would be bad,

No, I think it was the open handler on the ptmx node that did it.  If
the fs underlying the ptmx file wasn't devpts then it would do the
path lookup.  So then if /dev/pts/ptmx was actually the host one, then
you get what you asked for, but it's not like a symlink (where the
link could be to host /dev/pts/ptmx whereas current task has a different
/dev/pts/ptmx)

> since that /dev/ptmx could be a host-side one. I actually believe
> linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> remounts. So provided it can guarantee that the ptmx is not ever the
> root ptmx, I would ack that.
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 15:58                 ` Serge Hallyn
@ 2013-03-15 16:01                   ` Glauber Costa
  0 siblings, 0 replies; 65+ messages in thread
From: Glauber Costa @ 2013-03-15 16:01 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

On 03/15/2013 07:58 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>> On 03/15/2013 07:21 PM, Serge Hallyn wrote:
>>> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>>>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>>>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>>>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> devpts mounts in user namespaces is queued for 3.9. However, while playing
>>>>>>> with it I found it to be less than ideal. Although it could possibly work
>>>>>>> with custom software that can be made to point to /dev/pts/ptmx, a few things
>>>>>>> prevent it from working correctly for people that, like us, are booting full
>>>>>>> distributions.
>>>>>>
>>>>>> Full distributions that have not been modified to be minimally container
>>>>>> aware.
>>>>>
>>>>> Right, in fact in this case it doesn't need to be minimally container
>>>>> aware, you just create the bind mount yourself and init just needs to
>>>>> accept that it shouldn't touch it.
>>>>>
>>>>
>>>> Well, what if it doesn't?
>>>>
>>>> At least in the system I am using, centos6, udev mounts a tmpfs in a
>>>> temporary location, and then mount --move this to /dev. This is now
>>>> empty, and devpts will be mounted ontop of that.
>>>
>>> This also messes up your /dev/ttyN setup right?  How are you handling
>>> that?
>>>
>> very simple: udev will just mknod everything back, so let him!
> 
> So you're not using bind-mounted ptys over /dev/ttyN?
> 
Not in particular, and I haven't felt the need yet.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]     ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 14:37       ` Serge Hallyn
  2013-03-15 14:37       ` Serge Hallyn
@ 2013-03-15 18:03       ` Vasily Kulikov
  2013-03-15 20:43       ` Eric W. Biederman
  3 siblings, 0 replies; 65+ messages in thread
From: Vasily Kulikov @ 2013-03-15 18:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Eric W. Biederman,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Aristeu Rozanski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w

On Fri, Mar 15, 2013 at 13:13 +0400, Glauber Costa wrote:
> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.
...
> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))

As now we have several mechanisms for dev nodes usage in containers
(cgroup, per-fs flags, CAP_MKNOD), probably it's better to document
all this stuff in a single document?  Enumerate all possible limits
of device files creation and usage, and describe unobvious ways to
abuse the API to escape from a container (allow loopback device?
allow ext4?  just guessing, didn't investigate myself).  It should
document several safe patterns for common cases and beware of
known-to-be-vulnerable ones.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
  2013-03-15  9:13     ` Glauber Costa
  (?)
  (?)
@ 2013-03-15 18:03     ` Vasily Kulikov
  -1 siblings, 0 replies; 65+ messages in thread
From: Vasily Kulikov @ 2013-03-15 18:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, containers, Aristeu Rozanski, mtk.manpages,
	linux-fsdevel, Andrew Morton, Eric W. Biederman

On Fri, Mar 15, 2013 at 13:13 +0400, Glauber Costa wrote:
> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.
...
> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))

As now we have several mechanisms for dev nodes usage in containers
(cgroup, per-fs flags, CAP_MKNOD), probably it's better to document
all this stuff in a single document?  Enumerate all possible limits
of device files creation and usage, and describe unobvious ways to
abuse the API to escape from a container (allow loopback device?
allow ext4?  just guessing, didn't investigate myself).  It should
document several safe patterns for common cases and beware of
known-to-be-vulnerable ones.

Thanks,

-- 
Vasily Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup
       [not found]     ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 14:07       ` Serge Hallyn
  2013-03-15 14:07       ` Serge Hallyn
@ 2013-03-15 19:27       ` Aristeu Rozanski
  2 siblings, 0 replies; 65+ messages in thread
From: Aristeu Rozanski @ 2013-03-15 19:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Eric W. Biederman

On Fri, Mar 15, 2013 at 01:13:40PM +0400, Glauber Costa wrote:
> Most of the other subsystems already keep track of that in some way.  We
> will do that internally and provide a test to determine whether or not
> our task is in a device cgroup that is not the root one. We can relax
> some of our checks in that case, trusting that whoever set device cgroup
> rules will be responsible to control access to their devices.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Cc: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
>  include/linux/security.h |  1 +
>  security/device_cgroup.c | 15 +++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/security.h b/include/linux/security.h
> index eee7478..fe58f71 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -96,6 +96,7 @@ extern int cap_task_setscheduler(struct task_struct *p);
>  extern int cap_task_setioprio(struct task_struct *p, int ioprio);
>  extern int cap_task_setnice(struct task_struct *p, int nice);
>  extern int cap_vm_enough_memory(struct mm_struct *mm, long pages);
> +bool *task_in_child_devcgroup(struct task_struct *task);
>  
>  struct msghdr;
>  struct sk_buff;
> diff --git a/security/device_cgroup.c b/security/device_cgroup.c
> index 1c69e38..03df5b2 100644
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -63,6 +63,16 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
>  	return css_to_devcgroup(task_subsys_state(task, devices_subsys_id));
>  }
>  
> +static struct dev_cgroup *root_devcgroup;
> +bool task_in_child_devcgroup(struct task_struct *task)
> +{
> +	bool ret;
> +	rcu_read_lock();
> +	ret = task_devcgroup(task) != root_devcgroup;
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  struct cgroup_subsys devices_subsys;
>  
>  static int devcgroup_can_attach(struct cgroup *new_cgrp,
> @@ -197,9 +207,10 @@ static struct cgroup_subsys_state *devcgroup_css_alloc(struct cgroup *cgroup)
>  	INIT_LIST_HEAD(&dev_cgroup->exceptions);
>  	parent_cgroup = cgroup->parent;
>  
> -	if (parent_cgroup == NULL)
> +	if (parent_cgroup == NULL) {
>  		dev_cgroup->behavior = DEVCG_DEFAULT_ALLOW;
> -	else {
> +		root_devcgroup = dev_cgroup;
> +	} else {
>  		parent_dev_cgroup = cgroup_to_devcgroup(parent_cgroup);
>  		mutex_lock(&devcgroup_mutex);
>  		ret = dev_exceptions_copy(&dev_cgroup->exceptions,

patch looks good
Acked-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

-- 
Aristeu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]     ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                         ` (2 preceding siblings ...)
  2013-03-15 18:03       ` Vasily Kulikov
@ 2013-03-15 20:43       ` Eric W. Biederman
  3 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 20:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Aristeu Rozanski

Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.

Having cgroups or user namespaces grant privileges makes me uneasy.

With these patches it looks like I can do something evil like.

1. Create a devcgroup.
2. Put a process in it.
3. Create a usernamespace.
4. Run a container in that user namespace.
5. As an unprivileged user in that user namespace create another user namespace.
6. Call mknod and have it succeed.

Or in short I don't think this handles nested user namespaces at all.
With or without Serge's suggested change.

At a practical level now is not the right time to be granting more
permissions to user namespaces.  Lately too many silly bugs have been
found in what is already there.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
  2013-03-15  9:13     ` Glauber Costa
@ 2013-03-15 20:43       ` Eric W. Biederman
  -1 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 20:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, Andrew Morton, mtk.manpages, Serge Hallyn,
	linux-fsdevel, containers, Aristeu Rozanski

Glauber Costa <glommer@parallels.com> writes:

> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.

Having cgroups or user namespaces grant privileges makes me uneasy.

With these patches it looks like I can do something evil like.

1. Create a devcgroup.
2. Put a process in it.
3. Create a usernamespace.
4. Run a container in that user namespace.
5. As an unprivileged user in that user namespace create another user namespace.
6. Call mknod and have it succeed.

Or in short I don't think this handles nested user namespaces at all.
With or without Serge's suggested change.

At a practical level now is not the right time to be granting more
permissions to user namespaces.  Lately too many silly bugs have been
found in what is already there.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
@ 2013-03-15 20:43       ` Eric W. Biederman
  0 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 20:43 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, Andrew Morton, mtk.manpages, Serge Hallyn,
	linux-fsdevel, containers, Aristeu Rozanski

Glauber Costa <glommer@parallels.com> writes:

> Since we have strict control on who access the devices, it should be
> no problem to allow the device to appear.

Having cgroups or user namespaces grant privileges makes me uneasy.

With these patches it looks like I can do something evil like.

1. Create a devcgroup.
2. Put a process in it.
3. Create a usernamespace.
4. Run a container in that user namespace.
5. As an unprivileged user in that user namespace create another user namespace.
6. Call mknod and have it succeed.

Or in short I don't think this handles nested user namespaces at all.
With or without Serge's suggested change.

At a practical level now is not the right time to be granting more
permissions to user namespaces.  Lately too many silly bugs have been
found in what is already there.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]               ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2013-03-15 15:58                 ` Serge Hallyn
  2013-03-15 15:58                 ` Serge Hallyn
@ 2013-03-15 21:02                 ` Eric W. Biederman
  2 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 21:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Serge Hallyn, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> On 03/15/2013 07:21 PM, Serge Hallyn wrote:
>> Quoting Glauber Costa (glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
>>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>>>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>>

>> Well shoot, I can't find it right now.  Not even in Eric's git tree.
>> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
>> and open that instead.
>>
> Which gives a very good explanation about why haven't I seen it =)
> Eric ?

It is definitely in the development branches of my git tree.  I have
half a dozen patches that touch devpts.  I believe the latest dev branch
I have published is userns-always-map-user-v110.  And the code is in
there.

They have not reached the top of my queue in importance at this time,
and last time I was testing them there was a subtle race in something.
I expect it was just a change in pty layer that I haven't followed
closely enough.

> What it a /dev/ptmx already exist? will it use it? That would be bad,
> since that /dev/ptmx could be a host-side one. I actually believe
> linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> remounts. So provided it can guarantee that the ptmx is not ever the
> root ptmx, I would ack that.

For those playing with udev, especially older udev where udev is still
udev and creates devices you can use the following udev rule to create
the pts/ptmx symlink.

KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx"

Before we do anything clever in the kernel it is definitely worth seeing
how far we can take that little udev rule.

I have heard of no container that runs a distro's initrd, or uses the
distro's code to mount root.  So containers even for old distro's can
and do tweak the distro's a little.   It is worth keeping the tweaks
small but udev and their kin are an important bit of that.

As much as I hate the notion I suspect for most of device management
what we want is to act like devtmpfs, and run all of the device node
creation etc outside of the container (possibly even with bind mounts).

Acting like devtmpfs should be something that is possible with no kernel
changes.   Whereas allowing unprivileged processes to create device
nodes probably has issues I haven't thought of yet.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 15:26               ` Glauber Costa
@ 2013-03-15 21:02                 ` Eric W. Biederman
  -1 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 21:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Serge Hallyn, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn,
	linux-fsdevel, containers

Glauber Costa <glommer@parallels.com> writes:

> On 03/15/2013 07:21 PM, Serge Hallyn wrote:
>> Quoting Glauber Costa (glommer@parallels.com):
>>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>>> Glauber Costa <glommer@parallels.com> writes:
>>>>>

>> Well shoot, I can't find it right now.  Not even in Eric's git tree.
>> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
>> and open that instead.
>>
> Which gives a very good explanation about why haven't I seen it =)
> Eric ?

It is definitely in the development branches of my git tree.  I have
half a dozen patches that touch devpts.  I believe the latest dev branch
I have published is userns-always-map-user-v110.  And the code is in
there.

They have not reached the top of my queue in importance at this time,
and last time I was testing them there was a subtle race in something.
I expect it was just a change in pty layer that I haven't followed
closely enough.

> What it a /dev/ptmx already exist? will it use it? That would be bad,
> since that /dev/ptmx could be a host-side one. I actually believe
> linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> remounts. So provided it can guarantee that the ptmx is not ever the
> root ptmx, I would ack that.

For those playing with udev, especially older udev where udev is still
udev and creates devices you can use the following udev rule to create
the pts/ptmx symlink.

KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx"

Before we do anything clever in the kernel it is definitely worth seeing
how far we can take that little udev rule.

I have heard of no container that runs a distro's initrd, or uses the
distro's code to mount root.  So containers even for old distro's can
and do tweak the distro's a little.   It is worth keeping the tweaks
small but udev and their kin are an important bit of that.

As much as I hate the notion I suspect for most of device management
what we want is to act like devtmpfs, and run all of the device node
creation etc outside of the container (possibly even with bind mounts).

Acting like devtmpfs should be something that is possible with no kernel
changes.   Whereas allowing unprivileged processes to create device
nodes probably has issues I haven't thought of yet.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
@ 2013-03-15 21:02                 ` Eric W. Biederman
  0 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-15 21:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Serge Hallyn, cgroups, Andrew Morton, mtk.manpages, Serge Hallyn,
	linux-fsdevel, containers

Glauber Costa <glommer@parallels.com> writes:

> On 03/15/2013 07:21 PM, Serge Hallyn wrote:
>> Quoting Glauber Costa (glommer@parallels.com):
>>> On 03/15/2013 06:00 PM, Serge Hallyn wrote:
>>>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>>>> Glauber Costa <glommer@parallels.com> writes:
>>>>>

>> Well shoot, I can't find it right now.  Not even in Eric's git tree.
>> IIRC upon lookup of /dev/pts it tried to find $rootfs/dev/pts/ptmx
>> and open that instead.
>>
> Which gives a very good explanation about why haven't I seen it =)
> Eric ?

It is definitely in the development branches of my git tree.  I have
half a dozen patches that touch devpts.  I believe the latest dev branch
I have published is userns-always-map-user-v110.  And the code is in
there.

They have not reached the top of my queue in importance at this time,
and last time I was testing them there was a subtle race in something.
I expect it was just a change in pty layer that I haven't followed
closely enough.

> What it a /dev/ptmx already exist? will it use it? That would be bad,
> since that /dev/ptmx could be a host-side one. I actually believe
> linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> remounts. So provided it can guarantee that the ptmx is not ever the
> root ptmx, I would ack that.

For those playing with udev, especially older udev where udev is still
udev and creates devices you can use the following udev rule to create
the pts/ptmx symlink.

KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx"

Before we do anything clever in the kernel it is definitely worth seeing
how far we can take that little udev rule.

I have heard of no container that runs a distro's initrd, or uses the
distro's code to mount root.  So containers even for old distro's can
and do tweak the distro's a little.   It is worth keeping the tweaks
small but udev and their kin are an important bit of that.

As much as I hate the notion I suspect for most of device management
what we want is to act like devtmpfs, and run all of the device node
creation etc outside of the container (possibly even with bind mounts).

Acting like devtmpfs should be something that is possible with no kernel
changes.   Whereas allowing unprivileged processes to create device
nodes probably has issues I haven't thought of yet.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
       [not found]       ` <87a9q4gzs1.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-03-16  0:23         ` Serge Hallyn
  0 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-16  0:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Aristeu Rozanski

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> 
> > Since we have strict control on who access the devices, it should be
> > no problem to allow the device to appear.
> 
> Having cgroups or user namespaces grant privileges makes me uneasy.
> 
> With these patches it looks like I can do something evil like.
> 
> 1. Create a devcgroup.
> 2. Put a process in it.
> 3. Create a usernamespace.
> 4. Run a container in that user namespace.
> 5. As an unprivileged user in that user namespace create another user namespace.
> 6. Call mknod and have it succeed.

not if the devcgroup forbids it.

> Or in short I don't think this handles nested user namespaces at all.
> With or without Serge's suggested change.

Yeah my change doesn't help, other than to stop the unpriv user from
creating the device in an fs he doesn't own...

> At a practical level now is not the right time to be granting more
> permissions to user namespaces.  Lately too many silly bugs have been
> found in what is already there.

I agree.

I realize this doesn't help the centos old-udev situation, but otherwise
bind mounting device files works fine, so I agree we should wait.
Sorry.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/4] fs: allow mknod in user namespaces
  2013-03-15 20:43       ` Eric W. Biederman
  (?)
@ 2013-03-16  0:23       ` Serge Hallyn
  -1 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-16  0:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Glauber Costa, cgroups, Andrew Morton, mtk.manpages,
	Serge Hallyn, linux-fsdevel, containers, Aristeu Rozanski

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Glauber Costa <glommer@parallels.com> writes:
> 
> > Since we have strict control on who access the devices, it should be
> > no problem to allow the device to appear.
> 
> Having cgroups or user namespaces grant privileges makes me uneasy.
> 
> With these patches it looks like I can do something evil like.
> 
> 1. Create a devcgroup.
> 2. Put a process in it.
> 3. Create a usernamespace.
> 4. Run a container in that user namespace.
> 5. As an unprivileged user in that user namespace create another user namespace.
> 6. Call mknod and have it succeed.

not if the devcgroup forbids it.

> Or in short I don't think this handles nested user namespaces at all.
> With or without Serge's suggested change.

Yeah my change doesn't help, other than to stop the unpriv user from
creating the device in an fs he doesn't own...

> At a practical level now is not the right time to be granting more
> permissions to user namespaces.  Lately too many silly bugs have been
> found in what is already there.

I agree.

I realize this doesn't help the centos old-udev situation, but otherwise
bind mounting device files works fine, so I agree we should wait.
Sorry.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
       [not found]                 ` <87txoce5qy.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-03-18  3:20                   ` Serge Hallyn
  0 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-18  3:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
...
> > What it a /dev/ptmx already exist? will it use it? That would be bad,
> > since that /dev/ptmx could be a host-side one. I actually believe
> > linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> > remounts. So provided it can guarantee that the ptmx is not ever the
> > root ptmx, I would ack that.
> 
> For those playing with udev, especially older udev where udev is still
> udev and creates devices you can use the following udev rule to create
> the pts/ptmx symlink.
> 
> KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx"
> 
> Before we do anything clever in the kernel it is definitely worth seeing
> how far we can take that little udev rule.

Before it was decided that it was ok to modify core packages to
accomodate containers, we had to install (non-standard) init jobs to
detect it was in a container and if so modify some behavior - for
instance to bind-mount a smaller /lib/init/fstab so that mountall
wouldn't try to mount some things.  That way the rootfs had to be
updated to run in a container, but could then still be used as a
rootfs for non-containers.

...

> As much as I hate the notion I suspect for most of device management
> what we want is to act like devtmpfs, and run all of the device node
> creation etc outside of the container (possibly even with bind mounts).

So you mean a task which is unprivileged on the host, privileged wrt the
container, and on host fs namespace, which bind mounts the host /dev
files into the container?

> Acting like devtmpfs should be something that is possible with no kernel
> changes.   Whereas allowing unprivileged processes to create device
> nodes probably has issues I haven't thought of yet.

Not sure what 'acting like devtmpfs' means (especially in contrast to
acting like udev) - maybe i need to go look at the code.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-15 21:02                 ` Eric W. Biederman
  (?)
@ 2013-03-18  3:20                 ` Serge Hallyn
  2013-03-18 21:23                   ` Eric W. Biederman
  -1 siblings, 1 reply; 65+ messages in thread
From: Serge Hallyn @ 2013-03-18  3:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Glauber Costa, cgroups, Andrew Morton, mtk.manpages,
	Serge Hallyn, linux-fsdevel, containers

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Glauber Costa <glommer@parallels.com> writes:
...
> > What it a /dev/ptmx already exist? will it use it? That would be bad,
> > since that /dev/ptmx could be a host-side one. I actually believe
> > linking to $rootfs/dev/pts/ptmx is more robust than my solution against
> > remounts. So provided it can guarantee that the ptmx is not ever the
> > root ptmx, I would ack that.
> 
> For those playing with udev, especially older udev where udev is still
> udev and creates devices you can use the following udev rule to create
> the pts/ptmx symlink.
> 
> KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx"
> 
> Before we do anything clever in the kernel it is definitely worth seeing
> how far we can take that little udev rule.

Before it was decided that it was ok to modify core packages to
accomodate containers, we had to install (non-standard) init jobs to
detect it was in a container and if so modify some behavior - for
instance to bind-mount a smaller /lib/init/fstab so that mountall
wouldn't try to mount some things.  That way the rootfs had to be
updated to run in a container, but could then still be used as a
rootfs for non-containers.

...

> As much as I hate the notion I suspect for most of device management
> what we want is to act like devtmpfs, and run all of the device node
> creation etc outside of the container (possibly even with bind mounts).

So you mean a task which is unprivileged on the host, privileged wrt the
container, and on host fs namespace, which bind mounts the host /dev
files into the container?

> Acting like devtmpfs should be something that is possible with no kernel
> changes.   Whereas allowing unprivileged processes to create device
> nodes probably has issues I haven't thought of yet.

Not sure what 'acting like devtmpfs' means (especially in contrast to
acting like udev) - maybe i need to go look at the code.

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/4] fix depvpts in user namespaces
  2013-03-18  3:20                 ` Serge Hallyn
@ 2013-03-18 21:23                   ` Eric W. Biederman
  0 siblings, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-18 21:23 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
> ...
>> > What it a /dev/ptmx already exist? will it use it? That would be bad,
>> > since that /dev/ptmx could be a host-side one. I actually believe
>> > linking to $rootfs/dev/pts/ptmx is more robust than my solution against
>> > remounts. So provided it can guarantee that the ptmx is not ever the
>> > root ptmx, I would ack that.
>> 
>> For those playing with udev, especially older udev where udev is still
>> udev and creates devices you can use the following udev rule to create
>> the pts/ptmx symlink.
>> 
>> KERNEL=="ptmx" NAME:="pts/ptmx" SYMLINK="ptmx"
>> 
>> Before we do anything clever in the kernel it is definitely worth seeing
>> how far we can take that little udev rule.
>
> Before it was decided that it was ok to modify core packages to
> accomodate containers, we had to install (non-standard) init jobs to
> detect it was in a container and if so modify some behavior - for
> instance to bind-mount a smaller /lib/init/fstab so that mountall
> wouldn't try to mount some things.  That way the rootfs had to be
> updated to run in a container, but could then still be used as a
> rootfs for non-containers.

That little udev rule could be one of those trivial modifications.

> ...
>
>> As much as I hate the notion I suspect for most of device management
>> what we want is to act like devtmpfs, and run all of the device node
>> creation etc outside of the container (possibly even with bind mounts).
>
> So you mean a task which is unprivileged on the host, privileged wrt the
> container, and on host fs namespace, which bind mounts the host /dev
> files into the container?

That would be the implementation.  Although you could potentialy have
something privileged on the host doing the work from outside of the
container.

>> Acting like devtmpfs should be something that is possible with no kernel
>> changes.   Whereas allowing unprivileged processes to create device
>> nodes probably has issues I haven't thought of yet.
>
> Not sure what 'acting like devtmpfs' means (especially in contrast to
> acting like udev) - maybe i need to go look at the code.

devtmpfs is a magic kernel thread and instance of tmpfs that calls mknod
based on the default device name and permission policy that is new in
the kernel.

Which means that any distro that can use devmpts has no problems with
something else calling mknod and creating the devices.  And the initrd
is expected to have mounted devtmpfs last I looked.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
  2013-03-15 14:20     ` Serge Hallyn
@ 2013-03-19 15:32       ` Janne Karhunen
       [not found]         ` <CAE=Ncradvs_twWT8b6NQz85dm-Y8ayTH7NFv=i0vjYXpRBW9sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Janne Karhunen @ 2013-03-19 15:32 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

On Fri, Mar 15, 2013 at 4:20 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:

>> We will do that by marking the mount as MNT_NODEV_NS instead of
>> MNT_NODEV. This is because if the mount operation explicitly asked for
>> nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the
>> task is not on a device cgroup. If it is, we will rely on the control
>> rules in devcg to intermediate the access an tell us what those tasks
>> can or cannot do.
>
> Well the devcg was meant to be a temporary stopgap solution until we
> have device namespaces, and this seems to entrench them further, but
> it does make sense.

Just out of interest, what would such device namespace actually
do other than switch the device access on/off according to callers
namespace?

'Device namespace' Cells whitepaper [1]  talks about seems to be
saying they also implemented a callback for some drivers (only
init_ns accessible ioctl?) to support their 'foreground/background'
use case. While this is certainly one use case, it's certainly nothing
generic.

1. http://www.cs.columbia.edu/~nieh/pubs/sosp2011_cells.pdf


--
Janne

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
       [not found]         ` <CAE=Ncradvs_twWT8b6NQz85dm-Y8ayTH7NFv=i0vjYXpRBW9sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-03-19 15:37           ` Serge Hallyn
  2013-03-19 16:52             ` Janne Karhunen
  0 siblings, 1 reply; 65+ messages in thread
From: Serge Hallyn @ 2013-03-19 15:37 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Quoting Janne Karhunen (janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> On Fri, Mar 15, 2013 at 4:20 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> >> We will do that by marking the mount as MNT_NODEV_NS instead of
> >> MNT_NODEV. This is because if the mount operation explicitly asked for
> >> nodev, we ought to respect it. MNT_NODEV_NS will forbid accesses if the
> >> task is not on a device cgroup. If it is, we will rely on the control
> >> rules in devcg to intermediate the access an tell us what those tasks
> >> can or cannot do.
> >
> > Well the devcg was meant to be a temporary stopgap solution until we
> > have device namespaces, and this seems to entrench them further, but
> > it does make sense.
> 
> Just out of interest, what would such device namespace actually
> do other than switch the device access on/off according to callers
> namespace?

It could also support mapping of <type>:maj:min inside namespace to
a different device on host.  In most cases we probably don't actually
want that, but it's an interesting enough thing to be worth thinking
through.

> 'Device namespace' Cells whitepaper [1]  talks about seems to be
> saying they also implemented a callback for some drivers (only

oh, thanks, haven't read that one, will have to.

> init_ns accessible ioctl?) to support their 'foreground/background'
> use case. While this is certainly one use case, it's certainly nothing
> generic.
> 
> 1. http://www.cs.columbia.edu/~nieh/pubs/sosp2011_cells.pdf
> 
> 
> --
> Janne

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
  2013-03-19 15:37           ` Serge Hallyn
@ 2013-03-19 16:52             ` Janne Karhunen
       [not found]               ` <CAE=NcrYeKQYqkPsB9FG5PpYd2VTqmTszfpY39aRJqR=vsXfa7w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Janne Karhunen @ 2013-03-19 16:52 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

On Tue, Mar 19, 2013 at 5:37 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:

>> > Well the devcg was meant to be a temporary stopgap solution until we
>> > have device namespaces, and this seems to entrench them further, but
>> > it does make sense.
>>
>> Just out of interest, what would such device namespace actually
>> do other than switch the device access on/off according to callers
>> namespace?
>
> It could also support mapping of <type>:maj:min inside namespace to
> a different device on host.  In most cases we probably don't actually
> want that, but it's an interesting enough thing to be worth thinking
> through.

It sounds to me that what you really want to do is likely use case and
device specific. Hence the idea about namespace specific ioctl device
action(s) might not be so bad. It would certainly be less intrusive than
tampering with device registrations or rerouting nod file_operations for
instance.

Classic on/off toggle case is easy though, but are there enough
reasons for merging such 'noop' namespace?


--
Janne

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
       [not found]               ` <CAE=NcrYeKQYqkPsB9FG5PpYd2VTqmTszfpY39aRJqR=vsXfa7w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-03-19 17:27                 ` Serge Hallyn
  2013-03-19 18:08                   ` Janne Karhunen
  2013-03-19 23:29                 ` Eric W. Biederman
  1 sibling, 1 reply; 65+ messages in thread
From: Serge Hallyn @ 2013-03-19 17:27 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Quoting Janne Karhunen (janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> On Tue, Mar 19, 2013 at 5:37 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> >> > Well the devcg was meant to be a temporary stopgap solution until we
> >> > have device namespaces, and this seems to entrench them further, but
> >> > it does make sense.
> >>
> >> Just out of interest, what would such device namespace actually
> >> do other than switch the device access on/off according to callers
> >> namespace?
> >
> > It could also support mapping of <type>:maj:min inside namespace to
> > a different device on host.  In most cases we probably don't actually
> > want that, but it's an interesting enough thing to be worth thinking
> > through.
> 
> It sounds to me that what you really want to do is likely use case and
> device specific. Hence the idea about namespace specific ioctl device
> action(s) might not be so bad. It would certainly be less intrusive than
> tampering with device registrations or rerouting nod file_operations for
> instance.
> 
> Classic on/off toggle case is easy though, but are there enough
> reasons for merging such 'noop' namespace?

Eric's point is precisely that we may be able to work aruond it
sufficiently that we don't need to.  Noone has proposed a specific
design let alone code, so it's premature to talk about merging
anything :)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
  2013-03-19 17:27                 ` Serge Hallyn
@ 2013-03-19 18:08                   ` Janne Karhunen
       [not found]                     ` <CAE=NcraBvk_hwCd9BgASpDBkmEB+fg-kKwAPbT7bQeFRbq5DSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 65+ messages in thread
From: Janne Karhunen @ 2013-03-19 18:08 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

On Tue, Mar 19, 2013 at 7:27 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:

>> Classic on/off toggle case is easy though, but are there enough
>> reasons for merging such 'noop' namespace?
>
> Eric's point is precisely that we may be able to work aruond it
> sufficiently that we don't need to.  Noone has proposed a specific
> design let alone code, so it's premature to talk about merging
> anything :)

Well let's see that after I get this booting :P


--
Janne

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
       [not found]                     ` <CAE=NcraBvk_hwCd9BgASpDBkmEB+fg-kKwAPbT7bQeFRbq5DSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-03-19 19:15                       ` Serge Hallyn
  0 siblings, 0 replies; 65+ messages in thread
From: Serge Hallyn @ 2013-03-19 19:15 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman

Quoting Janne Karhunen (janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> On Tue, Mar 19, 2013 at 7:27 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> >> Classic on/off toggle case is easy though, but are there enough
> >> reasons for merging such 'noop' namespace?
> >
> > Eric's point is precisely that we may be able to work aruond it
> > sufficiently that we don't need to.  Noone has proposed a specific
> > design let alone code, so it's premature to talk about merging
> > anything :)
> 
> Well let's see that after I get this booting :P

Oh - you have code?  Fun :)

-serge

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/4] fs: allow dev accesses in userns in controlled situations
       [not found]               ` <CAE=NcrYeKQYqkPsB9FG5PpYd2VTqmTszfpY39aRJqR=vsXfa7w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-03-19 17:27                 ` Serge Hallyn
@ 2013-03-19 23:29                 ` Eric W. Biederman
  1 sibling, 0 replies; 65+ messages in thread
From: Eric W. Biederman @ 2013-03-19 23:29 UTC (permalink / raw)
  To: Janne Karhunen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn

Janne Karhunen <janne.karhunen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> On Tue, Mar 19, 2013 at 5:37 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
>
>>> > Well the devcg was meant to be a temporary stopgap solution until we
>>> > have device namespaces, and this seems to entrench them further, but
>>> > it does make sense.
>>>
>>> Just out of interest, what would such device namespace actually
>>> do other than switch the device access on/off according to callers
>>> namespace?
>>
>> It could also support mapping of <type>:maj:min inside namespace to
>> a different device on host.  In most cases we probably don't actually
>> want that, but it's an interesting enough thing to be worth thinking
>> through.
>
> It sounds to me that what you really want to do is likely use case and
> device specific. Hence the idea about namespace specific ioctl device
> action(s) might not be so bad. It would certainly be less intrusive than
> tampering with device registrations or rerouting nod file_operations for
> instance.
>
> Classic on/off toggle case is easy though, but are there enough
> reasons for merging such 'noop' namespace?

The two most compelling cases are:
- Container migration.
- Safe creation of virtual devices.

Now I think we can solve container migration by effectively hotuplugging
and hotplugging all of our devices.

The safe creation of virtual devices that I am thinking of are
essentially ptys (already covered with devpts) and loopback devices.
There are probably a couple more that I don't know off the top of my
head.

So far I haven't found any case that is sufficiently compelling, and
can't be replaced by acting like devtmpfs and controlling all device
creation.  Although I have heard of cases where applications still call
mknod even with devtmpfs and udev doing most of the work.

I am of the general thought that we should explore what we can do with
what we have now, at least until we find a compelling case for doing
more in the kernel.

Eric

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2013-03-19 23:29 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-15  9:13 [PATCH 0/4] fix depvpts in user namespaces Glauber Costa
2013-03-15  9:13 ` Glauber Costa
     [not found] ` <1363338823-25292-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15  9:13   ` [PATCH 1/4] dev_cgroup: keep track of which cgroup is the root cgroup Glauber Costa
2013-03-15  9:13     ` Glauber Costa
     [not found]     ` <1363338823-25292-2-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 14:07       ` Serge Hallyn
2013-03-15 14:43         ` Glauber Costa
2013-03-15 14:43         ` Glauber Costa
2013-03-15 14:43           ` Glauber Costa
     [not found]           ` <514333A2.5060408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 14:55             ` Serge Hallyn
2013-03-15 14:55           ` Serge Hallyn
2013-03-15 14:07       ` Serge Hallyn
2013-03-15 19:27       ` Aristeu Rozanski
2013-03-15  9:13   ` [PATCH 2/4] fs: allow dev accesses in userns in controlled situations Glauber Costa
2013-03-15  9:13     ` Glauber Costa
2013-03-15 14:20     ` Serge Hallyn
2013-03-19 15:32       ` Janne Karhunen
     [not found]         ` <CAE=Ncradvs_twWT8b6NQz85dm-Y8ayTH7NFv=i0vjYXpRBW9sg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-03-19 15:37           ` Serge Hallyn
2013-03-19 16:52             ` Janne Karhunen
     [not found]               ` <CAE=NcrYeKQYqkPsB9FG5PpYd2VTqmTszfpY39aRJqR=vsXfa7w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-03-19 17:27                 ` Serge Hallyn
2013-03-19 18:08                   ` Janne Karhunen
     [not found]                     ` <CAE=NcraBvk_hwCd9BgASpDBkmEB+fg-kKwAPbT7bQeFRbq5DSg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-03-19 19:15                       ` Serge Hallyn
2013-03-19 23:29                 ` Eric W. Biederman
     [not found]     ` <1363338823-25292-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 14:20       ` Serge Hallyn
2013-03-15  9:13   ` [PATCH 3/4] fs: allow mknod in user namespaces Glauber Costa
2013-03-15  9:13     ` Glauber Costa
     [not found]     ` <1363338823-25292-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 14:37       ` Serge Hallyn
2013-03-15 14:49         ` Glauber Costa
2013-03-15 14:49           ` Glauber Costa
     [not found]           ` <51433511.1020808-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 15:14             ` Serge Hallyn
2013-03-15 15:14             ` Serge Hallyn
2013-03-15 14:49         ` Glauber Costa
2013-03-15 14:37       ` Serge Hallyn
2013-03-15 18:03       ` Vasily Kulikov
2013-03-15 20:43       ` Eric W. Biederman
2013-03-15 18:03     ` Vasily Kulikov
2013-03-15 20:43     ` Eric W. Biederman
2013-03-15 20:43       ` Eric W. Biederman
2013-03-16  0:23       ` Serge Hallyn
     [not found]       ` <87a9q4gzs1.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-03-16  0:23         ` Serge Hallyn
2013-03-15  9:13   ` Glauber Costa
2013-03-15  9:13   ` [PATCH 4/4] devpts: fix usage " Glauber Costa
2013-03-15  9:13   ` Glauber Costa
2013-03-15  9:13     ` Glauber Costa
     [not found]     ` <1363338823-25292-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 14:45       ` Serge Hallyn
2013-03-15 14:45       ` Serge Hallyn
2013-03-15 10:26   ` [PATCH 0/4] fix depvpts " Eric W. Biederman
     [not found]     ` <87boalt0vi.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-03-15 12:01       ` Glauber Costa
2013-03-15 14:00       ` Serge Hallyn
2013-03-15 14:00     ` Serge Hallyn
2013-03-15 14:42       ` Glauber Costa
2013-03-15 14:42         ` Glauber Costa
     [not found]         ` <5143333E.1040100-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 15:21           ` Serge Hallyn
2013-03-15 15:26             ` Glauber Costa
2013-03-15 15:26             ` Glauber Costa
2013-03-15 15:26               ` Glauber Costa
2013-03-15 21:02               ` Eric W. Biederman
2013-03-15 21:02                 ` Eric W. Biederman
2013-03-18  3:20                 ` Serge Hallyn
2013-03-18 21:23                   ` Eric W. Biederman
     [not found]                 ` <87txoce5qy.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-03-18  3:20                   ` Serge Hallyn
     [not found]               ` <51433DBE.9020109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-03-15 15:58                 ` Serge Hallyn
2013-03-15 16:01                   ` Glauber Costa
2013-03-15 15:58                 ` Serge Hallyn
2013-03-15 21:02                 ` Eric W. Biederman
2013-03-15 14:42       ` Glauber Costa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.