All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:39 ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

There continues to be unexpected side-effects and security exposures
via CLONE_NEWUSER. For many end-users running distro kernels with
CONFIG_USER_NS enabled, there is no way to disable this feature when
desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
admins not running containers or Chrome can avoid the risks of this
feature.

-Kees

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:39 ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

There continues to be unexpected side-effects and security exposures
via CLONE_NEWUSER. For many end-users running distro kernels with
CONFIG_USER_NS enabled, there is no way to disable this feature when
desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
admins not running containers or Chrome can avoid the risks of this
feature.

-Kees

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-22 22:39 ` [kernel-hardening] " Kees Cook
@ 2016-01-22 22:39   ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

Several sysctls expect a state where the highest value (in extra2) is
locked once set for that boot. Yama does this, and kptr_restrict should
be doing it. This extracts Yama's logic and adds it to the existing
proc_dointvec_minmax_sysadmin, taking care to avoid the simple boolean
states (which do not get locked). Since Yama wants to be checking a
different capability, we build wrappers for both cases (CAP_SYS_ADMIN
and CAP_SYS_PTRACE).

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/sysctl/kernel.txt |  4 +++-
 include/linux/sysctl.h          | 18 ++++++++++++++++++
 kernel/sysctl.c                 | 34 +++++++++++++++++++++-------------
 security/yama/yama_lsm.c        | 18 +-----------------
 4 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 73c6b1ef0e84..bbfc5e339a3d 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -385,7 +385,9 @@ to protect against uses of %pK in dmesg(8) if leaking kernel pointer
 values to unprivileged users is a concern.
 
 When kptr_restrict is set to (2), kernel pointers printed using
-%pK will be replaced with 0's regardless of privileges.
+%pK will be replaced with 0's regardless of privileges, and the value
+will be locked at "2", so that the root user cannot remove this
+restriction.
 
 ==============================================================
 
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index fa7bc29925c9..f8f0b991fe3e 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -23,6 +23,7 @@
 
 #include <linux/list.h>
 #include <linux/rcupdate.h>
+#include <linux/capability.h>
 #include <linux/wait.h>
 #include <linux/rbtree.h>
 #include <uapi/linux/sysctl.h>
@@ -55,6 +56,23 @@ extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
 				      void __user *, size_t *, loff_t *);
 extern int proc_do_large_bitmap(struct ctl_table *, int,
 				void __user *, size_t *, loff_t *);
+extern int proc_dointvec_minmax_cap(int cap, struct ctl_table *table,
+				    int write, void __user *buffer,
+				    size_t *lenp, loff_t *ppos);
+static inline int proc_dointvec_minmax_cap_sysadmin(struct ctl_table *table,
+				    int write, void __user *buffer,
+				    size_t *lenp, loff_t *ppos)
+{
+	return proc_dointvec_minmax_cap(CAP_SYS_ADMIN, table, write, buffer,
+					lenp, ppos);
+}
+static inline int proc_dointvec_minmax_cap_ptrace(struct ctl_table *table,
+				    int write, void __user *buffer,
+				    size_t *lenp, loff_t *ppos)
+{
+	return proc_dointvec_minmax_cap(CAP_SYS_PTRACE, table, write, buffer,
+					lenp, ppos);
+}
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c810f8afdb7f..fc8899dd636d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -181,11 +181,6 @@ static int proc_taint(struct ctl_table *table, int write,
 			       void __user *buffer, size_t *lenp, loff_t *ppos);
 #endif
 
-#ifdef CONFIG_PRINTK
-static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
-				void __user *buffer, size_t *lenp, loff_t *ppos);
-#endif
-
 static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 #ifdef CONFIG_COREDUMP
@@ -803,7 +798,7 @@ static struct ctl_table kern_table[] = {
 		.data		= &dmesg_restrict,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax_sysadmin,
+		.proc_handler	= proc_dointvec_minmax_cap_sysadmin,
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
@@ -812,7 +807,7 @@ static struct ctl_table kern_table[] = {
 		.data		= &kptr_restrict,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax_sysadmin,
+		.proc_handler	= proc_dointvec_minmax_cap_sysadmin,
 		.extra1		= &zero,
 		.extra2		= &two,
 	},
@@ -2217,16 +2212,29 @@ static int proc_taint(struct ctl_table *table, int write,
 	return err;
 }
 
-#ifdef CONFIG_PRINTK
-static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
-				void __user *buffer, size_t *lenp, loff_t *ppos)
+int proc_dointvec_minmax_cap(int cap, struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	if (write && !capable(CAP_SYS_ADMIN))
+	struct ctl_table table_copy;
+	int value;
+
+	/* Require init capabilities to make changes. */
+	if (write && !capable(cap))
 		return -EPERM;
 
-	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	/*
+	 * To deal with const sysctl tables, we make a copy to perform
+	 * the locking. When data is >1 and ==extra2, lock extra1 to
+	 * extra2 to stop the value from being changed any further at
+	 * runtime.
+	 */
+	table_copy = *table;
+	value = *(int *)table_copy.data;
+	if (value > 1 && value == *(int *)table_copy.extra2)
+		table_copy.extra1 = table_copy.extra2;
+
+	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
 }
-#endif
 
 struct do_proc_dointvec_minmax_conv_param {
 	int *min;
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index d3c19c970a06..3215afd08fbd 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -354,22 +354,6 @@ static struct security_hook_list yama_hooks[] = {
 };
 
 #ifdef CONFIG_SYSCTL
-static int yama_dointvec_minmax(struct ctl_table *table, int write,
-				void __user *buffer, size_t *lenp, loff_t *ppos)
-{
-	struct ctl_table table_copy;
-
-	if (write && !capable(CAP_SYS_PTRACE))
-		return -EPERM;
-
-	/* Lock the max value if it ever gets set. */
-	table_copy = *table;
-	if (*(int *)table_copy.data == *(int *)table_copy.extra2)
-		table_copy.extra1 = table_copy.extra2;
-
-	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
-}
-
 static int zero;
 static int max_scope = YAMA_SCOPE_NO_ATTACH;
 
@@ -385,7 +369,7 @@ static struct ctl_table yama_sysctl_table[] = {
 		.data           = &ptrace_scope,
 		.maxlen         = sizeof(int),
 		.mode           = 0644,
-		.proc_handler   = yama_dointvec_minmax,
+		.proc_handler	= proc_dointvec_minmax_cap_ptrace,
 		.extra1         = &zero,
 		.extra2         = &max_scope,
 	},
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [kernel-hardening] [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
@ 2016-01-22 22:39   ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

Several sysctls expect a state where the highest value (in extra2) is
locked once set for that boot. Yama does this, and kptr_restrict should
be doing it. This extracts Yama's logic and adds it to the existing
proc_dointvec_minmax_sysadmin, taking care to avoid the simple boolean
states (which do not get locked). Since Yama wants to be checking a
different capability, we build wrappers for both cases (CAP_SYS_ADMIN
and CAP_SYS_PTRACE).

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/sysctl/kernel.txt |  4 +++-
 include/linux/sysctl.h          | 18 ++++++++++++++++++
 kernel/sysctl.c                 | 34 +++++++++++++++++++++-------------
 security/yama/yama_lsm.c        | 18 +-----------------
 4 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 73c6b1ef0e84..bbfc5e339a3d 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -385,7 +385,9 @@ to protect against uses of %pK in dmesg(8) if leaking kernel pointer
 values to unprivileged users is a concern.
 
 When kptr_restrict is set to (2), kernel pointers printed using
-%pK will be replaced with 0's regardless of privileges.
+%pK will be replaced with 0's regardless of privileges, and the value
+will be locked at "2", so that the root user cannot remove this
+restriction.
 
 ==============================================================
 
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index fa7bc29925c9..f8f0b991fe3e 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -23,6 +23,7 @@
 
 #include <linux/list.h>
 #include <linux/rcupdate.h>
+#include <linux/capability.h>
 #include <linux/wait.h>
 #include <linux/rbtree.h>
 #include <uapi/linux/sysctl.h>
@@ -55,6 +56,23 @@ extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
 				      void __user *, size_t *, loff_t *);
 extern int proc_do_large_bitmap(struct ctl_table *, int,
 				void __user *, size_t *, loff_t *);
+extern int proc_dointvec_minmax_cap(int cap, struct ctl_table *table,
+				    int write, void __user *buffer,
+				    size_t *lenp, loff_t *ppos);
+static inline int proc_dointvec_minmax_cap_sysadmin(struct ctl_table *table,
+				    int write, void __user *buffer,
+				    size_t *lenp, loff_t *ppos)
+{
+	return proc_dointvec_minmax_cap(CAP_SYS_ADMIN, table, write, buffer,
+					lenp, ppos);
+}
+static inline int proc_dointvec_minmax_cap_ptrace(struct ctl_table *table,
+				    int write, void __user *buffer,
+				    size_t *lenp, loff_t *ppos)
+{
+	return proc_dointvec_minmax_cap(CAP_SYS_PTRACE, table, write, buffer,
+					lenp, ppos);
+}
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c810f8afdb7f..fc8899dd636d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -181,11 +181,6 @@ static int proc_taint(struct ctl_table *table, int write,
 			       void __user *buffer, size_t *lenp, loff_t *ppos);
 #endif
 
-#ifdef CONFIG_PRINTK
-static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
-				void __user *buffer, size_t *lenp, loff_t *ppos);
-#endif
-
 static int proc_dointvec_minmax_coredump(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 #ifdef CONFIG_COREDUMP
@@ -803,7 +798,7 @@ static struct ctl_table kern_table[] = {
 		.data		= &dmesg_restrict,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax_sysadmin,
+		.proc_handler	= proc_dointvec_minmax_cap_sysadmin,
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
@@ -812,7 +807,7 @@ static struct ctl_table kern_table[] = {
 		.data		= &kptr_restrict,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax_sysadmin,
+		.proc_handler	= proc_dointvec_minmax_cap_sysadmin,
 		.extra1		= &zero,
 		.extra2		= &two,
 	},
@@ -2217,16 +2212,29 @@ static int proc_taint(struct ctl_table *table, int write,
 	return err;
 }
 
-#ifdef CONFIG_PRINTK
-static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
-				void __user *buffer, size_t *lenp, loff_t *ppos)
+int proc_dointvec_minmax_cap(int cap, struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	if (write && !capable(CAP_SYS_ADMIN))
+	struct ctl_table table_copy;
+	int value;
+
+	/* Require init capabilities to make changes. */
+	if (write && !capable(cap))
 		return -EPERM;
 
-	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	/*
+	 * To deal with const sysctl tables, we make a copy to perform
+	 * the locking. When data is >1 and ==extra2, lock extra1 to
+	 * extra2 to stop the value from being changed any further at
+	 * runtime.
+	 */
+	table_copy = *table;
+	value = *(int *)table_copy.data;
+	if (value > 1 && value == *(int *)table_copy.extra2)
+		table_copy.extra1 = table_copy.extra2;
+
+	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
 }
-#endif
 
 struct do_proc_dointvec_minmax_conv_param {
 	int *min;
diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index d3c19c970a06..3215afd08fbd 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -354,22 +354,6 @@ static struct security_hook_list yama_hooks[] = {
 };
 
 #ifdef CONFIG_SYSCTL
-static int yama_dointvec_minmax(struct ctl_table *table, int write,
-				void __user *buffer, size_t *lenp, loff_t *ppos)
-{
-	struct ctl_table table_copy;
-
-	if (write && !capable(CAP_SYS_PTRACE))
-		return -EPERM;
-
-	/* Lock the max value if it ever gets set. */
-	table_copy = *table;
-	if (*(int *)table_copy.data == *(int *)table_copy.extra2)
-		table_copy.extra1 = table_copy.extra2;
-
-	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
-}
-
 static int zero;
 static int max_scope = YAMA_SCOPE_NO_ATTACH;
 
@@ -385,7 +369,7 @@ static struct ctl_table yama_sysctl_table[] = {
 		.data           = &ptrace_scope,
 		.maxlen         = sizeof(int),
 		.mode           = 0644,
-		.proc_handler   = yama_dointvec_minmax,
+		.proc_handler	= proc_dointvec_minmax_cap_ptrace,
 		.extra1         = &zero,
 		.extra2         = &max_scope,
 	},
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:39 ` [kernel-hardening] " Kees Cook
@ 2016-01-22 22:39   ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

There continues to be many CONFIG_USER_NS related security exposures.
For admins running distro kernels with CONFIG_USER_NS, there is no way
to disable CLONE_NEWUSER. As many systems do not need CLONE_NEWUSER,
this provides a way for sysadmins to disable the feature.

This is inspired by a similar restriction in Grsecurity, but adds
a sysctl.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/sysctl/kernel.txt | 17 +++++++++++++++++
 kernel/sysctl.c                 | 14 ++++++++++++++
 kernel/user_namespace.c         |  7 +++++++
 3 files changed, 38 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bbfc5e339a3d..e9e8a4f949f5 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -85,6 +85,7 @@ show up in /proc/sys/kernel:
 - tainted
 - threads-max
 - unknown_nmi_panic
+- userns_restrict
 - watchdog
 - watchdog_thresh
 - version
@@ -933,6 +934,22 @@ example.  If a system hangs up, try pressing the NMI switch.
 
 ==============================================================
 
+userns_restrict:
+
+This toggle indicates whether CLONE_NEWUSER is available. As CLONE_NEWUSER
+has many unexpected side-effects and security exposures, this allows the
+sysadmin to disable the feature without needing to rebuild the kernel.
+
+When userns_restrict is set to (0), the default, there are no restrictions.
+
+When userns_restrict is set to (1), CLONE_NEWUSER is only available to
+processes that have CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.
+
+When userns_restrict is set to (2), CLONE_NEWUSER is not available at all,
+and the value is locked to "2" for the duration of the boot.
+
+==============================================================
+
 watchdog:
 
 This parameter can be used to disable or enable the soft lockup detector
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fc8899dd636d..ceb8b107fe28 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_USER_NS
+extern int sysctl_userns_restrict;
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -812,6 +815,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &two,
 	},
 #endif
+#ifdef CONFIG_USER_NS
+	{
+		.procname	= "userns_restrict",
+		.data		= &sysctl_userns_restrict,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax_cap_sysadmin,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
+#endif
 	{
 		.procname	= "ngroups_max",
 		.data		= &ngroups_max,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9bafc211930c..38395f9625ff 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -25,6 +25,7 @@
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
 static DEFINE_MUTEX(userns_state_mutex);
+int sysctl_userns_restrict __read_mostly;
 
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
@@ -84,6 +85,12 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		return -EPERM;
 
+	if (sysctl_userns_restrict == 2 ||
+	    (sysctl_userns_restrict == 1 && (!capable(CAP_SYS_ADMIN) ||
+					     !capable(CAP_SETUID) ||
+					     !capable(CAP_SETGID))))
+		return -EPERM;
+
 	ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
 	if (!ns)
 		return -ENOMEM;
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [kernel-hardening] [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:39   ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

There continues to be many CONFIG_USER_NS related security exposures.
For admins running distro kernels with CONFIG_USER_NS, there is no way
to disable CLONE_NEWUSER. As many systems do not need CLONE_NEWUSER,
this provides a way for sysadmins to disable the feature.

This is inspired by a similar restriction in Grsecurity, but adds
a sysctl.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 Documentation/sysctl/kernel.txt | 17 +++++++++++++++++
 kernel/sysctl.c                 | 14 ++++++++++++++
 kernel/user_namespace.c         |  7 +++++++
 3 files changed, 38 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bbfc5e339a3d..e9e8a4f949f5 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -85,6 +85,7 @@ show up in /proc/sys/kernel:
 - tainted
 - threads-max
 - unknown_nmi_panic
+- userns_restrict
 - watchdog
 - watchdog_thresh
 - version
@@ -933,6 +934,22 @@ example.  If a system hangs up, try pressing the NMI switch.
 
 ==============================================================
 
+userns_restrict:
+
+This toggle indicates whether CLONE_NEWUSER is available. As CLONE_NEWUSER
+has many unexpected side-effects and security exposures, this allows the
+sysadmin to disable the feature without needing to rebuild the kernel.
+
+When userns_restrict is set to (0), the default, there are no restrictions.
+
+When userns_restrict is set to (1), CLONE_NEWUSER is only available to
+processes that have CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.
+
+When userns_restrict is set to (2), CLONE_NEWUSER is not available at all,
+and the value is locked to "2" for the duration of the boot.
+
+==============================================================
+
 watchdog:
 
 This parameter can be used to disable or enable the soft lockup detector
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index fc8899dd636d..ceb8b107fe28 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -112,6 +112,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_USER_NS
+extern int sysctl_userns_restrict;
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -812,6 +815,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &two,
 	},
 #endif
+#ifdef CONFIG_USER_NS
+	{
+		.procname	= "userns_restrict",
+		.data		= &sysctl_userns_restrict,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax_cap_sysadmin,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
+#endif
 	{
 		.procname	= "ngroups_max",
 		.data		= &ngroups_max,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9bafc211930c..38395f9625ff 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -25,6 +25,7 @@
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
 static DEFINE_MUTEX(userns_state_mutex);
+int sysctl_userns_restrict __read_mostly;
 
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
@@ -84,6 +85,12 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		return -EPERM;
 
+	if (sysctl_userns_restrict == 2 ||
+	    (sysctl_userns_restrict == 1 && (!capable(CAP_SYS_ADMIN) ||
+					     !capable(CAP_SETUID) ||
+					     !capable(CAP_SETGID))))
+		return -EPERM;
+
 	ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
 	if (!ns)
 		return -ENOMEM;
-- 
2.6.3

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:39   ` [kernel-hardening] " Kees Cook
@ 2016-01-22 22:47     ` Robert Święcki
  -1 siblings, 0 replies; 80+ messages in thread
From: Robert Święcki @ 2016-01-22 22:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Miklos Szeredi,
	Kostya Serebryany, Alexander Potapenko, Eric Dumazet,
	Sasha Levin, linux-doc, LKML, kernel-hardening

Seems that Debian and some older Ubuntu versions are already using

$ sysctl -a | grep usern
kernel.unprivileged_userns_clone = 0

Shall we be consistent wit it?

2016-01-22 23:39 GMT+01:00 Kees Cook <keescook@chromium.org>:
> There continues to be many CONFIG_USER_NS related security exposures.
> For admins running distro kernels with CONFIG_USER_NS, there is no way
> to disable CLONE_NEWUSER. As many systems do not need CLONE_NEWUSER,
> this provides a way for sysadmins to disable the feature.
>
> This is inspired by a similar restriction in Grsecurity, but adds
> a sysctl.
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  Documentation/sysctl/kernel.txt | 17 +++++++++++++++++
>  kernel/sysctl.c                 | 14 ++++++++++++++
>  kernel/user_namespace.c         |  7 +++++++
>  3 files changed, 38 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bbfc5e339a3d..e9e8a4f949f5 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -85,6 +85,7 @@ show up in /proc/sys/kernel:
>  - tainted
>  - threads-max
>  - unknown_nmi_panic
> +- userns_restrict
>  - watchdog
>  - watchdog_thresh
>  - version
> @@ -933,6 +934,22 @@ example.  If a system hangs up, try pressing the NMI switch.
>
>  ==============================================================
>
> +userns_restrict:
> +
> +This toggle indicates whether CLONE_NEWUSER is available. As CLONE_NEWUSER
> +has many unexpected side-effects and security exposures, this allows the
> +sysadmin to disable the feature without needing to rebuild the kernel.
> +
> +When userns_restrict is set to (0), the default, there are no restrictions.
> +
> +When userns_restrict is set to (1), CLONE_NEWUSER is only available to
> +processes that have CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.
> +
> +When userns_restrict is set to (2), CLONE_NEWUSER is not available at all,
> +and the value is locked to "2" for the duration of the boot.
> +
> +==============================================================
> +
>  watchdog:
>
>  This parameter can be used to disable or enable the soft lockup detector
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index fc8899dd636d..ceb8b107fe28 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -112,6 +112,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
>  #ifndef CONFIG_MMU
>  extern int sysctl_nr_trim_pages;
>  #endif
> +#ifdef CONFIG_USER_NS
> +extern int sysctl_userns_restrict;
> +#endif
>
>  /* Constants used for minimum and  maximum */
>  #ifdef CONFIG_LOCKUP_DETECTOR
> @@ -812,6 +815,17 @@ static struct ctl_table kern_table[] = {
>                 .extra2         = &two,
>         },
>  #endif
> +#ifdef CONFIG_USER_NS
> +       {
> +               .procname       = "userns_restrict",
> +               .data           = &sysctl_userns_restrict,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax_cap_sysadmin,
> +               .extra1         = &zero,
> +               .extra2         = &two,
> +       },
> +#endif
>         {
>                 .procname       = "ngroups_max",
>                 .data           = &ngroups_max,
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 9bafc211930c..38395f9625ff 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -25,6 +25,7 @@
>
>  static struct kmem_cache *user_ns_cachep __read_mostly;
>  static DEFINE_MUTEX(userns_state_mutex);
> +int sysctl_userns_restrict __read_mostly;
>
>  static bool new_idmap_permitted(const struct file *file,
>                                 struct user_namespace *ns, int cap_setid,
> @@ -84,6 +85,12 @@ int create_user_ns(struct cred *new)
>             !kgid_has_mapping(parent_ns, group))
>                 return -EPERM;
>
> +       if (sysctl_userns_restrict == 2 ||
> +           (sysctl_userns_restrict == 1 && (!capable(CAP_SYS_ADMIN) ||
> +                                            !capable(CAP_SETUID) ||
> +                                            !capable(CAP_SETGID))))
> +               return -EPERM;
> +
>         ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
>         if (!ns)
>                 return -ENOMEM;
> --
> 2.6.3
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:47     ` Robert Święcki
  0 siblings, 0 replies; 80+ messages in thread
From: Robert Święcki @ 2016-01-22 22:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Miklos Szeredi,
	Kostya Serebryany, Alexander Potapenko, Eric Dumazet,
	Sasha Levin, linux-doc, LKML, kernel-hardening

Seems that Debian and some older Ubuntu versions are already using

$ sysctl -a | grep usern
kernel.unprivileged_userns_clone = 0

Shall we be consistent wit it?

2016-01-22 23:39 GMT+01:00 Kees Cook <keescook@chromium.org>:
> There continues to be many CONFIG_USER_NS related security exposures.
> For admins running distro kernels with CONFIG_USER_NS, there is no way
> to disable CLONE_NEWUSER. As many systems do not need CLONE_NEWUSER,
> this provides a way for sysadmins to disable the feature.
>
> This is inspired by a similar restriction in Grsecurity, but adds
> a sysctl.
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  Documentation/sysctl/kernel.txt | 17 +++++++++++++++++
>  kernel/sysctl.c                 | 14 ++++++++++++++
>  kernel/user_namespace.c         |  7 +++++++
>  3 files changed, 38 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bbfc5e339a3d..e9e8a4f949f5 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -85,6 +85,7 @@ show up in /proc/sys/kernel:
>  - tainted
>  - threads-max
>  - unknown_nmi_panic
> +- userns_restrict
>  - watchdog
>  - watchdog_thresh
>  - version
> @@ -933,6 +934,22 @@ example.  If a system hangs up, try pressing the NMI switch.
>
>  ==============================================================
>
> +userns_restrict:
> +
> +This toggle indicates whether CLONE_NEWUSER is available. As CLONE_NEWUSER
> +has many unexpected side-effects and security exposures, this allows the
> +sysadmin to disable the feature without needing to rebuild the kernel.
> +
> +When userns_restrict is set to (0), the default, there are no restrictions.
> +
> +When userns_restrict is set to (1), CLONE_NEWUSER is only available to
> +processes that have CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.
> +
> +When userns_restrict is set to (2), CLONE_NEWUSER is not available at all,
> +and the value is locked to "2" for the duration of the boot.
> +
> +==============================================================
> +
>  watchdog:
>
>  This parameter can be used to disable or enable the soft lockup detector
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index fc8899dd636d..ceb8b107fe28 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -112,6 +112,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
>  #ifndef CONFIG_MMU
>  extern int sysctl_nr_trim_pages;
>  #endif
> +#ifdef CONFIG_USER_NS
> +extern int sysctl_userns_restrict;
> +#endif
>
>  /* Constants used for minimum and  maximum */
>  #ifdef CONFIG_LOCKUP_DETECTOR
> @@ -812,6 +815,17 @@ static struct ctl_table kern_table[] = {
>                 .extra2         = &two,
>         },
>  #endif
> +#ifdef CONFIG_USER_NS
> +       {
> +               .procname       = "userns_restrict",
> +               .data           = &sysctl_userns_restrict,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax_cap_sysadmin,
> +               .extra1         = &zero,
> +               .extra2         = &two,
> +       },
> +#endif
>         {
>                 .procname       = "ngroups_max",
>                 .data           = &ngroups_max,
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 9bafc211930c..38395f9625ff 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -25,6 +25,7 @@
>
>  static struct kmem_cache *user_ns_cachep __read_mostly;
>  static DEFINE_MUTEX(userns_state_mutex);
> +int sysctl_userns_restrict __read_mostly;
>
>  static bool new_idmap_permitted(const struct file *file,
>                                 struct user_namespace *ns, int cap_setid,
> @@ -84,6 +85,12 @@ int create_user_ns(struct cred *new)
>             !kgid_has_mapping(parent_ns, group))
>                 return -EPERM;
>
> +       if (sysctl_userns_restrict == 2 ||
> +           (sysctl_userns_restrict == 1 && (!capable(CAP_SYS_ADMIN) ||
> +                                            !capable(CAP_SETUID) ||
> +                                            !capable(CAP_SETGID))))
> +               return -EPERM;
> +
>         ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
>         if (!ns)
>                 return -ENOMEM;
> --
> 2.6.3
>



-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:39 ` [kernel-hardening] " Kees Cook
@ 2016-01-22 22:49   ` Richard Weinberger
  -1 siblings, 0 replies; 80+ messages in thread
From: Richard Weinberger @ 2016-01-22 22:49 UTC (permalink / raw)
  To: Kees Cook, Andrew Morton
  Cc: Al Viro, Eric W. Biederman, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Am 22.01.2016 um 23:39 schrieb Kees Cook:
> There continues to be unexpected side-effects and security exposures
> via CLONE_NEWUSER. For many end-users running distro kernels with
> CONFIG_USER_NS enabled, there is no way to disable this feature when
> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
> admins not running containers or Chrome can avoid the risks of this
> feature.

Last time such a patch came up I was not thrilled because hiding
a scary feature behind a knob IMHO doesn't make it any better nor helps
finding issues.
But as userns is still a source of a lot of issues and distros enable
it by default a knob for the admin seems to be a good idea by now. ;-\

Thanks,
//richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:49   ` Richard Weinberger
  0 siblings, 0 replies; 80+ messages in thread
From: Richard Weinberger @ 2016-01-22 22:49 UTC (permalink / raw)
  To: Kees Cook, Andrew Morton
  Cc: Al Viro, Eric W. Biederman, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Am 22.01.2016 um 23:39 schrieb Kees Cook:
> There continues to be unexpected side-effects and security exposures
> via CLONE_NEWUSER. For many end-users running distro kernels with
> CONFIG_USER_NS enabled, there is no way to disable this feature when
> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
> admins not running containers or Chrome can avoid the risks of this
> feature.

Last time such a patch came up I was not thrilled because hiding
a scary feature behind a knob IMHO doesn't make it any better nor helps
finding issues.
But as userns is still a source of a lot of issues and distros enable
it by default a knob for the admin seems to be a good idea by now. ;-\

Thanks,
//richard

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:47     ` [kernel-hardening] " Robert Święcki
@ 2016-01-22 22:50       ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:50 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

On Fri, Jan 22, 2016 at 2:47 PM, Robert Święcki <robert@swiecki.net> wrote:
> Seems that Debian and some older Ubuntu versions are already using
>
> $ sysctl -a | grep usern
> kernel.unprivileged_userns_clone = 0
>
> Shall we be consistent wit it?

Oh! I didn't see that on systems I checked. On which version did you find that?

I'd kind of like to keep the _restrict name, as that follows kptr_ and dmesg_...

-Kees

>
> 2016-01-22 23:39 GMT+01:00 Kees Cook <keescook@chromium.org>:
>> There continues to be many CONFIG_USER_NS related security exposures.
>> For admins running distro kernels with CONFIG_USER_NS, there is no way
>> to disable CLONE_NEWUSER. As many systems do not need CLONE_NEWUSER,
>> this provides a way for sysadmins to disable the feature.
>>
>> This is inspired by a similar restriction in Grsecurity, but adds
>> a sysctl.
>>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> ---
>>  Documentation/sysctl/kernel.txt | 17 +++++++++++++++++
>>  kernel/sysctl.c                 | 14 ++++++++++++++
>>  kernel/user_namespace.c         |  7 +++++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index bbfc5e339a3d..e9e8a4f949f5 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -85,6 +85,7 @@ show up in /proc/sys/kernel:
>>  - tainted
>>  - threads-max
>>  - unknown_nmi_panic
>> +- userns_restrict
>>  - watchdog
>>  - watchdog_thresh
>>  - version
>> @@ -933,6 +934,22 @@ example.  If a system hangs up, try pressing the NMI switch.
>>
>>  ==============================================================
>>
>> +userns_restrict:
>> +
>> +This toggle indicates whether CLONE_NEWUSER is available. As CLONE_NEWUSER
>> +has many unexpected side-effects and security exposures, this allows the
>> +sysadmin to disable the feature without needing to rebuild the kernel.
>> +
>> +When userns_restrict is set to (0), the default, there are no restrictions.
>> +
>> +When userns_restrict is set to (1), CLONE_NEWUSER is only available to
>> +processes that have CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.
>> +
>> +When userns_restrict is set to (2), CLONE_NEWUSER is not available at all,
>> +and the value is locked to "2" for the duration of the boot.
>> +
>> +==============================================================
>> +
>>  watchdog:
>>
>>  This parameter can be used to disable or enable the soft lockup detector
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index fc8899dd636d..ceb8b107fe28 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -112,6 +112,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
>>  #ifndef CONFIG_MMU
>>  extern int sysctl_nr_trim_pages;
>>  #endif
>> +#ifdef CONFIG_USER_NS
>> +extern int sysctl_userns_restrict;
>> +#endif
>>
>>  /* Constants used for minimum and  maximum */
>>  #ifdef CONFIG_LOCKUP_DETECTOR
>> @@ -812,6 +815,17 @@ static struct ctl_table kern_table[] = {
>>                 .extra2         = &two,
>>         },
>>  #endif
>> +#ifdef CONFIG_USER_NS
>> +       {
>> +               .procname       = "userns_restrict",
>> +               .data           = &sysctl_userns_restrict,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax_cap_sysadmin,
>> +               .extra1         = &zero,
>> +               .extra2         = &two,
>> +       },
>> +#endif
>>         {
>>                 .procname       = "ngroups_max",
>>                 .data           = &ngroups_max,
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 9bafc211930c..38395f9625ff 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -25,6 +25,7 @@
>>
>>  static struct kmem_cache *user_ns_cachep __read_mostly;
>>  static DEFINE_MUTEX(userns_state_mutex);
>> +int sysctl_userns_restrict __read_mostly;
>>
>>  static bool new_idmap_permitted(const struct file *file,
>>                                 struct user_namespace *ns, int cap_setid,
>> @@ -84,6 +85,12 @@ int create_user_ns(struct cred *new)
>>             !kgid_has_mapping(parent_ns, group))
>>                 return -EPERM;
>>
>> +       if (sysctl_userns_restrict == 2 ||
>> +           (sysctl_userns_restrict == 1 && (!capable(CAP_SYS_ADMIN) ||
>> +                                            !capable(CAP_SETUID) ||
>> +                                            !capable(CAP_SETGID))))
>> +               return -EPERM;
>> +
>>         ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
>>         if (!ns)
>>                 return -ENOMEM;
>> --
>> 2.6.3
>>
>
>
>
> --
> Robert Święcki



-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:50       ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 22:50 UTC (permalink / raw)
  To: Robert Święcki
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

On Fri, Jan 22, 2016 at 2:47 PM, Robert Święcki <robert@swiecki.net> wrote:
> Seems that Debian and some older Ubuntu versions are already using
>
> $ sysctl -a | grep usern
> kernel.unprivileged_userns_clone = 0
>
> Shall we be consistent wit it?

Oh! I didn't see that on systems I checked. On which version did you find that?

I'd kind of like to keep the _restrict name, as that follows kptr_ and dmesg_...

-Kees

>
> 2016-01-22 23:39 GMT+01:00 Kees Cook <keescook@chromium.org>:
>> There continues to be many CONFIG_USER_NS related security exposures.
>> For admins running distro kernels with CONFIG_USER_NS, there is no way
>> to disable CLONE_NEWUSER. As many systems do not need CLONE_NEWUSER,
>> this provides a way for sysadmins to disable the feature.
>>
>> This is inspired by a similar restriction in Grsecurity, but adds
>> a sysctl.
>>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> ---
>>  Documentation/sysctl/kernel.txt | 17 +++++++++++++++++
>>  kernel/sysctl.c                 | 14 ++++++++++++++
>>  kernel/user_namespace.c         |  7 +++++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index bbfc5e339a3d..e9e8a4f949f5 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -85,6 +85,7 @@ show up in /proc/sys/kernel:
>>  - tainted
>>  - threads-max
>>  - unknown_nmi_panic
>> +- userns_restrict
>>  - watchdog
>>  - watchdog_thresh
>>  - version
>> @@ -933,6 +934,22 @@ example.  If a system hangs up, try pressing the NMI switch.
>>
>>  ==============================================================
>>
>> +userns_restrict:
>> +
>> +This toggle indicates whether CLONE_NEWUSER is available. As CLONE_NEWUSER
>> +has many unexpected side-effects and security exposures, this allows the
>> +sysadmin to disable the feature without needing to rebuild the kernel.
>> +
>> +When userns_restrict is set to (0), the default, there are no restrictions.
>> +
>> +When userns_restrict is set to (1), CLONE_NEWUSER is only available to
>> +processes that have CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID.
>> +
>> +When userns_restrict is set to (2), CLONE_NEWUSER is not available at all,
>> +and the value is locked to "2" for the duration of the boot.
>> +
>> +==============================================================
>> +
>>  watchdog:
>>
>>  This parameter can be used to disable or enable the soft lockup detector
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index fc8899dd636d..ceb8b107fe28 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -112,6 +112,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
>>  #ifndef CONFIG_MMU
>>  extern int sysctl_nr_trim_pages;
>>  #endif
>> +#ifdef CONFIG_USER_NS
>> +extern int sysctl_userns_restrict;
>> +#endif
>>
>>  /* Constants used for minimum and  maximum */
>>  #ifdef CONFIG_LOCKUP_DETECTOR
>> @@ -812,6 +815,17 @@ static struct ctl_table kern_table[] = {
>>                 .extra2         = &two,
>>         },
>>  #endif
>> +#ifdef CONFIG_USER_NS
>> +       {
>> +               .procname       = "userns_restrict",
>> +               .data           = &sysctl_userns_restrict,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax_cap_sysadmin,
>> +               .extra1         = &zero,
>> +               .extra2         = &two,
>> +       },
>> +#endif
>>         {
>>                 .procname       = "ngroups_max",
>>                 .data           = &ngroups_max,
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 9bafc211930c..38395f9625ff 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -25,6 +25,7 @@
>>
>>  static struct kmem_cache *user_ns_cachep __read_mostly;
>>  static DEFINE_MUTEX(userns_state_mutex);
>> +int sysctl_userns_restrict __read_mostly;
>>
>>  static bool new_idmap_permitted(const struct file *file,
>>                                 struct user_namespace *ns, int cap_setid,
>> @@ -84,6 +85,12 @@ int create_user_ns(struct cred *new)
>>             !kgid_has_mapping(parent_ns, group))
>>                 return -EPERM;
>>
>> +       if (sysctl_userns_restrict == 2 ||
>> +           (sysctl_userns_restrict == 1 && (!capable(CAP_SYS_ADMIN) ||
>> +                                            !capable(CAP_SETUID) ||
>> +                                            !capable(CAP_SETGID))))
>> +               return -EPERM;
>> +
>>         ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
>>         if (!ns)
>>                 return -ENOMEM;
>> --
>> 2.6.3
>>
>
>
>
> --
> Robert Święcki



-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:50       ` [kernel-hardening] " Kees Cook
@ 2016-01-22 22:55         ` Robert Święcki
  -1 siblings, 0 replies; 80+ messages in thread
From: Robert Święcki @ 2016-01-22 22:55 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:

>> Seems that Debian and some older Ubuntu versions are already using
>>
>> $ sysctl -a | grep usern
>> kernel.unprivileged_userns_clone = 0
>>
>> Shall we be consistent wit it?
>
> Oh! I didn't see that on systems I checked. On which version did you find that?

$ uname -a
Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
(2016-01-07) x86_64 GNU/Linux
$ cat /etc/debian_version
8.2

IIRC some older kernels delivered with Ubuntu Precise were also using
it (but maybe I'm mistaken)

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 22:55         ` Robert Święcki
  0 siblings, 0 replies; 80+ messages in thread
From: Robert Święcki @ 2016-01-22 22:55 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:

>> Seems that Debian and some older Ubuntu versions are already using
>>
>> $ sysctl -a | grep usern
>> kernel.unprivileged_userns_clone = 0
>>
>> Shall we be consistent wit it?
>
> Oh! I didn't see that on systems I checked. On which version did you find that?

$ uname -a
Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
(2016-01-07) x86_64 GNU/Linux
$ cat /etc/debian_version
8.2

IIRC some older kernels delivered with Ubuntu Precise were also using
it (but maybe I'm mistaken)

-- 
Robert Święcki

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:55         ` [kernel-hardening] " Robert Święcki
@ 2016-01-22 23:00           ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 23:00 UTC (permalink / raw)
  To: Robert Święcki, Serge E. Hallyn, Ben Hutchings
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
>
>>> Seems that Debian and some older Ubuntu versions are already using
>>>
>>> $ sysctl -a | grep usern
>>> kernel.unprivileged_userns_clone = 0
>>>
>>> Shall we be consistent wit it?
>>
>> Oh! I didn't see that on systems I checked. On which version did you find that?
>
> $ uname -a
> Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> (2016-01-07) x86_64 GNU/Linux
> $ cat /etc/debian_version
> 8.2

Ah-ha, Debian only, though it looks like this was just committed to
the Ubuntu kernel tree too:


> IIRC some older kernels delivered with Ubuntu Precise were also using
> it (but maybe I'm mistaken)

I don't see it there.

I think my patch is more complete, but I'm happy to change the name if
this sysctl has already started to enter the global consciousness. ;)

Serge, Ben, what do you think?

-Kees


-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-22 23:00           ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-22 23:00 UTC (permalink / raw)
  To: Robert Święcki, Serge E. Hallyn, Ben Hutchings
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
>
>>> Seems that Debian and some older Ubuntu versions are already using
>>>
>>> $ sysctl -a | grep usern
>>> kernel.unprivileged_userns_clone = 0
>>>
>>> Shall we be consistent wit it?
>>
>> Oh! I didn't see that on systems I checked. On which version did you find that?
>
> $ uname -a
> Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> (2016-01-07) x86_64 GNU/Linux
> $ cat /etc/debian_version
> 8.2

Ah-ha, Debian only, though it looks like this was just committed to
the Ubuntu kernel tree too:


> IIRC some older kernels delivered with Ubuntu Precise were also using
> it (but maybe I'm mistaken)

I don't see it there.

I think my patch is more complete, but I'm happy to change the name if
this sysctl has already started to enter the global consciousness. ;)

Serge, Ben, what do you think?

-Kees


-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 23:00           ` [kernel-hardening] " Kees Cook
@ 2016-01-23  0:44             ` Serge Hallyn
  -1 siblings, 0 replies; 80+ messages in thread
From: Serge Hallyn @ 2016-01-23  0:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Robert Święcki, Ben Hutchings, Andrew Morton, Al Viro,
	Richard Weinberger, Eric W. Biederman, Andy Lutomirski,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

Quoting Kees Cook (keescook@chromium.org):
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
> >
> >>> Seems that Debian and some older Ubuntu versions are already using
> >>>
> >>> $ sysctl -a | grep usern
> >>> kernel.unprivileged_userns_clone = 0
> >>>
> >>> Shall we be consistent wit it?
> >>
> >> Oh! I didn't see that on systems I checked. On which version did you find that?
> >
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?
> 
> -Kees

Hey,

I had originally written this for Ubuntu when userns was still new
and not upstream.  Then we dropped it when it got upstream.

The reason we are re-adding it is because we're going to be pushing the
envelop again wrt unprivileged userns usage.  Seth has been working on
supporting mounts of fuse, for instance.  When everything is upstream,
(or we drop it :) we'll drop the patch again.

-serge

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-23  0:44             ` Serge Hallyn
  0 siblings, 0 replies; 80+ messages in thread
From: Serge Hallyn @ 2016-01-23  0:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Robert Święcki, Ben Hutchings, Andrew Morton, Al Viro,
	Richard Weinberger, Eric W. Biederman, Andy Lutomirski,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

Quoting Kees Cook (keescook@chromium.org):
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
> >
> >>> Seems that Debian and some older Ubuntu versions are already using
> >>>
> >>> $ sysctl -a | grep usern
> >>> kernel.unprivileged_userns_clone = 0
> >>>
> >>> Shall we be consistent wit it?
> >>
> >> Oh! I didn't see that on systems I checked. On which version did you find that?
> >
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?
> 
> -Kees

Hey,

I had originally written this for Ubuntu when userns was still new
and not upstream.  Then we dropped it when it got upstream.

The reason we are re-adding it is because we're going to be pushing the
envelop again wrt unprivileged userns usage.  Seth has been working on
supporting mounts of fuse, for instance.  When everything is upstream,
(or we drop it :) we'll drop the patch again.

-serge

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 23:00           ` [kernel-hardening] " Kees Cook
@ 2016-01-23  0:44             ` Serge Hallyn
  -1 siblings, 0 replies; 80+ messages in thread
From: Serge Hallyn @ 2016-01-23  0:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Robert Święcki, Ben Hutchings, Andrew Morton, Al Viro,
	Richard Weinberger, Eric W. Biederman, Andy Lutomirski,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

Quoting Kees Cook (keescook@chromium.org):
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
> >
> >>> Seems that Debian and some older Ubuntu versions are already using
> >>>
> >>> $ sysctl -a | grep usern
> >>> kernel.unprivileged_userns_clone = 0
> >>>
> >>> Shall we be consistent wit it?
> >>
> >> Oh! I didn't see that on systems I checked. On which version did you find that?
> >
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?

Oh, sorry - as for the name of it, what is the alternative you are proposing?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-23  0:44             ` Serge Hallyn
  0 siblings, 0 replies; 80+ messages in thread
From: Serge Hallyn @ 2016-01-23  0:44 UTC (permalink / raw)
  To: Kees Cook
  Cc: Robert Święcki, Ben Hutchings, Andrew Morton, Al Viro,
	Richard Weinberger, Eric W. Biederman, Andy Lutomirski,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML,
	kernel-hardening

Quoting Kees Cook (keescook@chromium.org):
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
> >
> >>> Seems that Debian and some older Ubuntu versions are already using
> >>>
> >>> $ sysctl -a | grep usern
> >>> kernel.unprivileged_userns_clone = 0
> >>>
> >>> Shall we be consistent wit it?
> >>
> >> Oh! I didn't see that on systems I checked. On which version did you find that?
> >
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?

Oh, sorry - as for the name of it, what is the alternative you are proposing?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 23:00           ` [kernel-hardening] " Kees Cook
                             ` (2 preceding siblings ...)
  (?)
@ 2016-01-23  0:59           ` Ben Hutchings
  2016-01-24 20:59               ` Kees Cook
  -1 siblings, 1 reply; 80+ messages in thread
From: Ben Hutchings @ 2016-01-23  0:59 UTC (permalink / raw)
  To: kernel-hardening, Robert Święcki, Serge E. Hallyn
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Andy Lutomirski, Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML

[-- Attachment #1: Type: text/plain, Size: 1639 bytes --]

On Fri, 2016-01-22 at 15:00 -0800, Kees Cook wrote:
> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
> > 
> > > > Seems that Debian and some older Ubuntu versions are already using
> > > > 
> > > > $ sysctl -a | grep usern
> > > > kernel.unprivileged_userns_clone = 0
> > > > 
> > > > Shall we be consistent wit it?
> > > 
> > > Oh! I didn't see that on systems I checked. On which version did you find that?
> > 
> > $ uname -a
> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
> > (2016-01-07) x86_64 GNU/Linux
> > $ cat /etc/debian_version
> > 8.2
> 
> Ah-ha, Debian only, though it looks like this was just committed to
> the Ubuntu kernel tree too:
> 
> 
> > IIRC some older kernels delivered with Ubuntu Precise were also using
> > it (but maybe I'm mistaken)
> 
> I don't see it there.
> 
> I think my patch is more complete, but I'm happy to change the name if
> this sysctl has already started to enter the global consciousness. ;)
> 
> Serge, Ben, what do you think?

I agree that using the '_restrict' suffix for new restrictions makes
sense.  I also don't think that a third possible value for
kernel.unprivileged_userns_clone would would be understandable.

I would probably make kernel.unprivileged_userns_clone a wrapper for
kernel.userns_restrict in Debian, then deprecate and eventually remove
it.

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans.
                                                               - John Lennon

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-22 22:39 ` [kernel-hardening] " Kees Cook
@ 2016-01-23  3:02   ` Eric W. Biederman
  -1 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-23  3:02 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:

> There continues to be unexpected side-effects and security exposures
> via CLONE_NEWUSER. For many end-users running distro kernels with
> CONFIG_USER_NS enabled, there is no way to disable this feature when
> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
> admins not running containers or Chrome can avoid the risks of this
> feature.

I don't actually think there do continue to be unexpected side-effects
and security exposures with CLONE_NEWUSER.  It takes a while for all of
the fixes to trickle out to distros.  At most what I have seen recently
are problems with other kernel interfaces being amplified with user
namespaces.  AKA the current mess with devpts, and the unexpected
issues with bind mounts in mount namespaces.

I have a couple of concerns with a sysctl.

1) As user namespaces settle out this sysctl has the potential to
   decrease the security of the system overall as sandboxing
   features of the kernel will not be available to unprivileged
   applications.

   Web browsing with chrome will be less safe for example.

2) I strongly suspect the granularity of a sysctl is wrong for access to
   user namespaces on a production system.

   In general I suspect what we want is something like seccomp.  I
   believe all of the relevant bits are in registers.  I actually
   thought that was enough for seccomp.  Does seccomp not work for
   some reason?

3) A sysctl breeds a false sense of security in thinking that if a
   security issue is discovered you can just flip a switch, disable
   all new user namespaces and you won't be vulnerable.

   In fact most of the issues in the past have only required being in
   a user namespace to trigger.  Which means any containers or user
   namespaces that already exist could be used to exploit any new
   found issue.  Which means that a I don't think a sysctl will give
   the desired level of protection.

   In my analysis of the issues to date I don't know of anything
   short of a reboot that would meaninfully remove the threat.

4) With applications like docker coming on-line I don't think a
   restriction to processes with capabilities is actually meaninful
   for restricting access to user namespaces.

So I have concerns about both efficacy and usability with the proposed
sysctl.

So to keep this productive.  Please tell me about the threat model
you envision, and how you envision knobs in the kernel being used to
counter those threats.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-23  3:02   ` Eric W. Biederman
  0 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-23  3:02 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:

> There continues to be unexpected side-effects and security exposures
> via CLONE_NEWUSER. For many end-users running distro kernels with
> CONFIG_USER_NS enabled, there is no way to disable this feature when
> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
> admins not running containers or Chrome can avoid the risks of this
> feature.

I don't actually think there do continue to be unexpected side-effects
and security exposures with CLONE_NEWUSER.  It takes a while for all of
the fixes to trickle out to distros.  At most what I have seen recently
are problems with other kernel interfaces being amplified with user
namespaces.  AKA the current mess with devpts, and the unexpected
issues with bind mounts in mount namespaces.

I have a couple of concerns with a sysctl.

1) As user namespaces settle out this sysctl has the potential to
   decrease the security of the system overall as sandboxing
   features of the kernel will not be available to unprivileged
   applications.

   Web browsing with chrome will be less safe for example.

2) I strongly suspect the granularity of a sysctl is wrong for access to
   user namespaces on a production system.

   In general I suspect what we want is something like seccomp.  I
   believe all of the relevant bits are in registers.  I actually
   thought that was enough for seccomp.  Does seccomp not work for
   some reason?

3) A sysctl breeds a false sense of security in thinking that if a
   security issue is discovered you can just flip a switch, disable
   all new user namespaces and you won't be vulnerable.

   In fact most of the issues in the past have only required being in
   a user namespace to trigger.  Which means any containers or user
   namespaces that already exist could be used to exploit any new
   found issue.  Which means that a I don't think a sysctl will give
   the desired level of protection.

   In my analysis of the issues to date I don't know of anything
   short of a reboot that would meaninfully remove the threat.

4) With applications like docker coming on-line I don't think a
   restriction to processes with capabilities is actually meaninful
   for restricting access to user namespaces.

So I have concerns about both efficacy and usability with the proposed
sysctl.

So to keep this productive.  Please tell me about the threat model
you envision, and how you envision knobs in the kernel being used to
counter those threats.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-22 22:39   ` [kernel-hardening] " Kees Cook
@ 2016-01-23  3:10     ` Eric W. Biederman
  -1 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-23  3:10 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:

> Several sysctls expect a state where the highest value (in extra2) is
> locked once set for that boot. Yama does this, and kptr_restrict should
> be doing it. This extracts Yama's logic and adds it to the existing
> proc_dointvec_minmax_sysadmin, taking care to avoid the simple boolean
> states (which do not get locked). Since Yama wants to be checking a
> different capability, we build wrappers for both cases (CAP_SYS_ADMIN
> and CAP_SYS_PTRACE).

Sigh this sysctl appears susceptible to known attacks.

In my quick skim I believe this sysctl implementation that checks
capabilities is susceptible to attacks where the already open file
descriptor is set as stdout on a setuid root application.

Can we come up with an interface that isn't exploitable by an
application that will act as a setuid cat?

Eric

> -#ifdef CONFIG_PRINTK
> -static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
> -				void __user *buffer, size_t *lenp, loff_t *ppos)
> +int proc_dointvec_minmax_cap(int cap, struct ctl_table *table, int write,
> +			     void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
> -	if (write && !capable(CAP_SYS_ADMIN))
> +	struct ctl_table table_copy;
> +	int value;
> +
> +	/* Require init capabilities to make changes. */
> +	if (write && !capable(cap))
>  		return -EPERM;
>  
> -	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	/*
> +	 * To deal with const sysctl tables, we make a copy to perform
> +	 * the locking. When data is >1 and ==extra2, lock extra1 to
> +	 * extra2 to stop the value from being changed any further at
> +	 * runtime.
> +	 */
> +	table_copy = *table;
> +	value = *(int *)table_copy.data;
> +	if (value > 1 && value == *(int *)table_copy.extra2)
> +		table_copy.extra1 = table_copy.extra2;
> +
> +	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
>  }
> -#endif

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
@ 2016-01-23  3:10     ` Eric W. Biederman
  0 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-23  3:10 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:

> Several sysctls expect a state where the highest value (in extra2) is
> locked once set for that boot. Yama does this, and kptr_restrict should
> be doing it. This extracts Yama's logic and adds it to the existing
> proc_dointvec_minmax_sysadmin, taking care to avoid the simple boolean
> states (which do not get locked). Since Yama wants to be checking a
> different capability, we build wrappers for both cases (CAP_SYS_ADMIN
> and CAP_SYS_PTRACE).

Sigh this sysctl appears susceptible to known attacks.

In my quick skim I believe this sysctl implementation that checks
capabilities is susceptible to attacks where the already open file
descriptor is set as stdout on a setuid root application.

Can we come up with an interface that isn't exploitable by an
application that will act as a setuid cat?

Eric

> -#ifdef CONFIG_PRINTK
> -static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
> -				void __user *buffer, size_t *lenp, loff_t *ppos)
> +int proc_dointvec_minmax_cap(int cap, struct ctl_table *table, int write,
> +			     void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
> -	if (write && !capable(CAP_SYS_ADMIN))
> +	struct ctl_table table_copy;
> +	int value;
> +
> +	/* Require init capabilities to make changes. */
> +	if (write && !capable(cap))
>  		return -EPERM;
>  
> -	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	/*
> +	 * To deal with const sysctl tables, we make a copy to perform
> +	 * the locking. When data is >1 and ==extra2, lock extra1 to
> +	 * extra2 to stop the value from being changed any further at
> +	 * runtime.
> +	 */
> +	table_copy = *table;
> +	value = *(int *)table_copy.data;
> +	if (value > 1 && value == *(int *)table_copy.extra2)
> +		table_copy.extra1 = table_copy.extra2;
> +
> +	return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
>  }
> -#endif

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-23  3:10     ` [kernel-hardening] " Eric W. Biederman
  (?)
@ 2016-01-23 22:25     ` Jann Horn
  2016-01-24  1:20       ` Eric W. Biederman
  -1 siblings, 1 reply; 80+ messages in thread
From: Jann Horn @ 2016-01-23 22:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Andrew Morton, Al Viro, Richard Weinberger,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

[-- Attachment #1: Type: text/plain, Size: 1207 bytes --]

On Fri, Jan 22, 2016 at 09:10:07PM -0600, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
> 
> > Several sysctls expect a state where the highest value (in extra2) is
> > locked once set for that boot. Yama does this, and kptr_restrict should
> > be doing it. This extracts Yama's logic and adds it to the existing
> > proc_dointvec_minmax_sysadmin, taking care to avoid the simple boolean
> > states (which do not get locked). Since Yama wants to be checking a
> > different capability, we build wrappers for both cases (CAP_SYS_ADMIN
> > and CAP_SYS_PTRACE).
> 
> Sigh this sysctl appears susceptible to known attacks.
> 
> In my quick skim I believe this sysctl implementation that checks
> capabilities is susceptible to attacks where the already open file
> descriptor is set as stdout on a setuid root application.
> 
> Can we come up with an interface that isn't exploitable by an
> application that will act as a setuid cat?

Adding the struct file * to the parameters of all proc_handler
functions would work, right? (Or just filp->f_cred? That would be
less generic.)

A quick grep says that's just about 160 functions that'll need to
be changed. :/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-23 22:25     ` Jann Horn
@ 2016-01-24  1:20       ` Eric W. Biederman
  2016-01-24  1:43         ` Al Viro
  0 siblings, 1 reply; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-24  1:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel-hardening, Kees Cook, Andrew Morton, Al Viro,
	Richard Weinberger, Andy Lutomirski, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

Jann Horn <jann@thejh.net> writes:

> On Fri, Jan 22, 2016 at 09:10:07PM -0600, Eric W. Biederman wrote:
>> Kees Cook <keescook@chromium.org> writes:
>> 
>> > Several sysctls expect a state where the highest value (in extra2) is
>> > locked once set for that boot. Yama does this, and kptr_restrict should
>> > be doing it. This extracts Yama's logic and adds it to the existing
>> > proc_dointvec_minmax_sysadmin, taking care to avoid the simple boolean
>> > states (which do not get locked). Since Yama wants to be checking a
>> > different capability, we build wrappers for both cases (CAP_SYS_ADMIN
>> > and CAP_SYS_PTRACE).
>> 
>> Sigh this sysctl appears susceptible to known attacks.
>> 
>> In my quick skim I believe this sysctl implementation that checks
>> capabilities is susceptible to attacks where the already open file
>> descriptor is set as stdout on a setuid root application.
>> 
>> Can we come up with an interface that isn't exploitable by an
>> application that will act as a setuid cat?
>
> Adding the struct file * to the parameters of all proc_handler
> functions would work, right? (Or just filp->f_cred? That would be
> less generic.)
>
> A quick grep says that's just about 160 functions that'll need to
> be changed. :/

Yep.  That is about the size of it.  file * used to be passed to the
sysctl methods but it was removed several years ago because no one was
using it.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-24  1:20       ` Eric W. Biederman
@ 2016-01-24  1:43         ` Al Viro
  2016-01-24  1:56           ` Jann Horn
  0 siblings, 1 reply; 80+ messages in thread
From: Al Viro @ 2016-01-24  1:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jann Horn, kernel-hardening, Kees Cook, Andrew Morton,
	Richard Weinberger, Andy Lutomirski, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

On Sat, Jan 23, 2016 at 07:20:17PM -0600, Eric W. Biederman wrote:

> Yep.  That is about the size of it.  file * used to be passed to the
> sysctl methods but it was removed several years ago because no one was
> using it.

Generally cred would be better...  Alternatively we could eat one more
pointer in task_struct and stash a reference to that sucker there, rather
than adding an explicit argument (again, with cred instead of file).
Not sure...

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-24  1:43         ` Al Viro
@ 2016-01-24  1:56           ` Jann Horn
  2016-01-24  6:02             ` Eric W. Biederman
  0 siblings, 1 reply; 80+ messages in thread
From: Jann Horn @ 2016-01-24  1:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, kernel-hardening, Kees Cook, Andrew Morton,
	Richard Weinberger, Andy Lutomirski, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 866 bytes --]

On Sun, Jan 24, 2016 at 01:43:42AM +0000, Al Viro wrote:
> On Sat, Jan 23, 2016 at 07:20:17PM -0600, Eric W. Biederman wrote:
> 
> > Yep.  That is about the size of it.  file * used to be passed to the
> > sysctl methods but it was removed several years ago because no one was
> > using it.
> 
> Generally cred would be better...

> Alternatively we could eat one more
> pointer in task_struct and stash a reference to that sucker there, rather
> than adding an explicit argument (again, with cred instead of file).
> Not sure...

I think it makes sense to do this the same way as the rest of the VFS code
here (which passes the creds down through an argument).

And adding the arguments everywhere doesn't really mean more work - either
way, someone should probably go through all of those sysctl handlers and
fix them up to use the file creds.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-24  1:56           ` Jann Horn
@ 2016-01-24  6:02             ` Eric W. Biederman
  2016-01-24  6:32               ` Jann Horn
  0 siblings, 1 reply; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-24  6:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, kernel-hardening, Kees Cook, Andrew Morton,
	Richard Weinberger, Andy Lutomirski, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

Jann Horn <jann@thejh.net> writes:

> On Sun, Jan 24, 2016 at 01:43:42AM +0000, Al Viro wrote:
>> On Sat, Jan 23, 2016 at 07:20:17PM -0600, Eric W. Biederman wrote:
>> 
>> > Yep.  That is about the size of it.  file * used to be passed to the
>> > sysctl methods but it was removed several years ago because no one was
>> > using it.
>> 
>> Generally cred would be better...
>
>> Alternatively we could eat one more
>> pointer in task_struct and stash a reference to that sucker there, rather
>> than adding an explicit argument (again, with cred instead of file).
>> Not sure...
>
> I think it makes sense to do this the same way as the rest of the VFS code
> here (which passes the creds down through an argument).
>
> And adding the arguments everywhere doesn't really mean more work - either
> way, someone should probably go through all of those sysctl handlers and
> fix them up to use the file creds.

Not all of them need it.  It might be worth figuring out the necessary
rigamarole to hook into sysctl_perm the way the networking code does and
have that require the capability at open time.

The advantage is that open time is when it is actually appropraite to
check permissions.  I could be wrong but I doubt there is enough madness
with the handful of sysctl users that call capable to require the checks
to happen on write and not on open.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-24  6:02             ` Eric W. Biederman
@ 2016-01-24  6:32               ` Jann Horn
  2016-01-24  6:44                 ` Eric W. Biederman
  0 siblings, 1 reply; 80+ messages in thread
From: Jann Horn @ 2016-01-24  6:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, kernel-hardening, Kees Cook, Andrew Morton,
	Richard Weinberger, Andy Lutomirski, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1993 bytes --]

On Sun, Jan 24, 2016 at 12:02:41AM -0600, Eric W. Biederman wrote:
> Jann Horn <jann@thejh.net> writes:
> 
> > On Sun, Jan 24, 2016 at 01:43:42AM +0000, Al Viro wrote:
> >> On Sat, Jan 23, 2016 at 07:20:17PM -0600, Eric W. Biederman wrote:
> >> 
> >> > Yep.  That is about the size of it.  file * used to be passed to the
> >> > sysctl methods but it was removed several years ago because no one was
> >> > using it.
> >> 
> >> Generally cred would be better...
> >
> >> Alternatively we could eat one more
> >> pointer in task_struct and stash a reference to that sucker there, rather
> >> than adding an explicit argument (again, with cred instead of file).
> >> Not sure...
> >
> > I think it makes sense to do this the same way as the rest of the VFS code
> > here (which passes the creds down through an argument).
> >
> > And adding the arguments everywhere doesn't really mean more work - either
> > way, someone should probably go through all of those sysctl handlers and
> > fix them up to use the file creds.
> 
> Not all of them need it.  It might be worth figuring out the necessary
> rigamarole to hook into sysctl_perm the way the networking code does and
> have that require the capability at open time.
> 
> The advantage is that open time is when it is actually appropraite to
> check permissions.  I could be wrong but I doubt there is enough madness
> with the handful of sysctl users that call capable to require the checks
> to happen on write and not on open.

That would work - if all sysctls know whether a capability will be needed
for writing later on and don't decide it based on the written data. Is that
always true?

Looking through some of the sysctl handlers, I found proc_do_uts_string and
pid_ns_ctl_handler, which operate on a namespace looked up through current
at write time. I think that's buggy and ought to be done using the file
opener creds and on the file opener's namespaces, but where can those be
stored?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin
  2016-01-24  6:32               ` Jann Horn
@ 2016-01-24  6:44                 ` Eric W. Biederman
  0 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-24  6:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, kernel-hardening, Kees Cook, Andrew Morton,
	Richard Weinberger, Andy Lutomirski, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

Jann Horn <jann@thejh.net> writes:

> On Sun, Jan 24, 2016 at 12:02:41AM -0600, Eric W. Biederman wrote:
>> Jann Horn <jann@thejh.net> writes:
>> 
>> > On Sun, Jan 24, 2016 at 01:43:42AM +0000, Al Viro wrote:
>> >> On Sat, Jan 23, 2016 at 07:20:17PM -0600, Eric W. Biederman wrote:
>> >> 
>> >> > Yep.  That is about the size of it.  file * used to be passed to the
>> >> > sysctl methods but it was removed several years ago because no one was
>> >> > using it.
>> >> 
>> >> Generally cred would be better...
>> >
>> >> Alternatively we could eat one more
>> >> pointer in task_struct and stash a reference to that sucker there, rather
>> >> than adding an explicit argument (again, with cred instead of file).
>> >> Not sure...
>> >
>> > I think it makes sense to do this the same way as the rest of the VFS code
>> > here (which passes the creds down through an argument).
>> >
>> > And adding the arguments everywhere doesn't really mean more work - either
>> > way, someone should probably go through all of those sysctl handlers and
>> > fix them up to use the file creds.
>> 
>> Not all of them need it.  It might be worth figuring out the necessary
>> rigamarole to hook into sysctl_perm the way the networking code does and
>> have that require the capability at open time.
>> 
>> The advantage is that open time is when it is actually appropraite to
>> check permissions.  I could be wrong but I doubt there is enough madness
>> with the handful of sysctl users that call capable to require the checks
>> to happen on write and not on open.
>
> That would work - if all sysctls know whether a capability will be needed
> for writing later on and don't decide it based on the written data. Is that
> always true?

That is certainly the common case to  stick a pointer to static data in
struct sysctl_table.

> Looking through some of the sysctl handlers, I found proc_do_uts_string and
> pid_ns_ctl_handler, which operate on a namespace looked up through current
> at write time. I think that's buggy and ought to be done using the file
> opener creds and on the file opener's namespaces, but where can those be
> stored?

Interesting point.  I had not though about it from that angle.  From all
other angles it has just been something that would be nice to fix but
I hadn't seen the need.   We have all of the infrastructure needed to
register sysctls per namespace and the networking stack uses it.  That
would not be hard to use for uts, pid and ipc namespaces as well.  That
would remove any race.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-23  3:02   ` [kernel-hardening] " Eric W. Biederman
@ 2016-01-24 20:57     ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-24 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, LKML, kernel-hardening

On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> There continues to be unexpected side-effects and security exposures
>> via CLONE_NEWUSER. For many end-users running distro kernels with
>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>> admins not running containers or Chrome can avoid the risks of this
>> feature.
>
> I don't actually think there do continue to be unexpected side-effects
> and security exposures with CLONE_NEWUSER.  It takes a while for all of
> the fixes to trickle out to distros.  At most what I have seen recently
> are problems with other kernel interfaces being amplified with user
> namespaces.  AKA the current mess with devpts, and the unexpected
> issues with bind mounts in mount namespaces.

Access to CLONE_NEWUSER has lead to a lot of security issues over the
last 3 years. There has to be a way to avoid this for people that have
no interest in containers.

For admins running servers where there are no containers (which is
still a giant number of systems -- containers are popular but not
ubiquitous), the sysctl makes perfect sense.

> I have a couple of concerns with a sysctl.
>
> 1) As user namespaces settle out this sysctl has the potential to
>    decrease the security of the system overall as sandboxing
>    features of the kernel will not be available to unprivileged
>    applications.
>
>    Web browsing with chrome will be less safe for example.

I don't propose this for Desktops.

> 2) I strongly suspect the granularity of a sysctl is wrong for access to
>    user namespaces on a production system.
>
>    In general I suspect what we want is something like seccomp.  I
>    believe all of the relevant bits are in registers.  I actually
>    thought that was enough for seccomp.  Does seccomp not work for
>    some reason?

Setting a global seccomp filter on init is not possible with any inits
yet, and for some architectures it would push all processes onto the
slow path. It's an extraordinarily big hammer for wanting to turn off
a single area of the kernel with a long history of problems.

Also, seccomp is arguably a program author's policy tool, not a system
policy tool. We could offer this sysctl as an LSM too, but that's even
messier. This is a trivial change to user namespaces and provides a
large protection to people that aren't interested in the risks of
running containers.

> 3) A sysctl breeds a false sense of security in thinking that if a
>    security issue is discovered you can just flip a switch, disable
>    all new user namespaces and you won't be vulnerable.
>
>    In fact most of the issues in the past have only required being in
>    a user namespace to trigger.  Which means any containers or user
>    namespaces that already exist could be used to exploit any new
>    found issue.  Which means that a I don't think a sysctl will give
>    the desired level of protection.
>
>    In my analysis of the issues to date I don't know of anything
>    short of a reboot that would meaninfully remove the threat.

Any admin that decides to just turn off CLONE_NEWUSER in the middle of
still using it is insane. I don't think this breeds any false sense of
security as most sysctls are set at boot time.

> 4) With applications like docker coming on-line I don't think a
>    restriction to processes with capabilities is actually meaninful
>    for restricting access to user namespaces.

Admins who are currently using containers are already exposed to so
much attack surface. This is not for them, it's for people that don't
use containers.

> So I have concerns about both efficacy and usability with the proposed
> sysctl.

Two distros already have this sysctl because it was so strongly
requested by their users. This needs to be upstream so we can manage
the effects correctly.

> So to keep this productive.  Please tell me about the threat model
> you envision, and how you envision knobs in the kernel being used to
> counter those threats.

The threat model I envision is post-intrusion escalation of privileges
on systems that run distro kernels and do not use containers. I
envision the sysctl being used at boot time to kill the entire class
of current and future vulnerabilities exposed by CLONE_NEWUSER. Just
like the sysctls used to turn off modules at boot or turn off kexec at
boot.

As Linux developers I feel we have an obligation to provide our end
users with run-time choices (not just compile-time choices), since
most of our users are using kernels built by someone else. Given the
repeated problems with module auto-loading, we provided a way to
disable module loading. Given the physical-memory-rewriting exposure
of kexec, we provides a way to disable kexec. Given the conflict
between hibernation and kASLR, we provided a way to choose one at
runtime. Here, we're looking back on three years of vulnerabilities
around CLONE_NEWUSER with no end in sight, and we have an obligation
to help the end users that don't want to be exposed to this any more.
Note I'm not suggesting we stop trying to fix the problems we find
with user namespaces, but we need to provide a way to disable them.

Having this sysctl is vastly superior to telling people how to rewrite
their kernel memory at boot time to disable syscalls:

https://outflux.net/blog/archives/2013/12/10/live-patching-the-kernel/

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-24 20:57     ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-24 20:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Al Viro, Richard Weinberger, Andy Lutomirski,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, LKML, kernel-hardening

On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> There continues to be unexpected side-effects and security exposures
>> via CLONE_NEWUSER. For many end-users running distro kernels with
>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>> admins not running containers or Chrome can avoid the risks of this
>> feature.
>
> I don't actually think there do continue to be unexpected side-effects
> and security exposures with CLONE_NEWUSER.  It takes a while for all of
> the fixes to trickle out to distros.  At most what I have seen recently
> are problems with other kernel interfaces being amplified with user
> namespaces.  AKA the current mess with devpts, and the unexpected
> issues with bind mounts in mount namespaces.

Access to CLONE_NEWUSER has lead to a lot of security issues over the
last 3 years. There has to be a way to avoid this for people that have
no interest in containers.

For admins running servers where there are no containers (which is
still a giant number of systems -- containers are popular but not
ubiquitous), the sysctl makes perfect sense.

> I have a couple of concerns with a sysctl.
>
> 1) As user namespaces settle out this sysctl has the potential to
>    decrease the security of the system overall as sandboxing
>    features of the kernel will not be available to unprivileged
>    applications.
>
>    Web browsing with chrome will be less safe for example.

I don't propose this for Desktops.

> 2) I strongly suspect the granularity of a sysctl is wrong for access to
>    user namespaces on a production system.
>
>    In general I suspect what we want is something like seccomp.  I
>    believe all of the relevant bits are in registers.  I actually
>    thought that was enough for seccomp.  Does seccomp not work for
>    some reason?

Setting a global seccomp filter on init is not possible with any inits
yet, and for some architectures it would push all processes onto the
slow path. It's an extraordinarily big hammer for wanting to turn off
a single area of the kernel with a long history of problems.

Also, seccomp is arguably a program author's policy tool, not a system
policy tool. We could offer this sysctl as an LSM too, but that's even
messier. This is a trivial change to user namespaces and provides a
large protection to people that aren't interested in the risks of
running containers.

> 3) A sysctl breeds a false sense of security in thinking that if a
>    security issue is discovered you can just flip a switch, disable
>    all new user namespaces and you won't be vulnerable.
>
>    In fact most of the issues in the past have only required being in
>    a user namespace to trigger.  Which means any containers or user
>    namespaces that already exist could be used to exploit any new
>    found issue.  Which means that a I don't think a sysctl will give
>    the desired level of protection.
>
>    In my analysis of the issues to date I don't know of anything
>    short of a reboot that would meaninfully remove the threat.

Any admin that decides to just turn off CLONE_NEWUSER in the middle of
still using it is insane. I don't think this breeds any false sense of
security as most sysctls are set at boot time.

> 4) With applications like docker coming on-line I don't think a
>    restriction to processes with capabilities is actually meaninful
>    for restricting access to user namespaces.

Admins who are currently using containers are already exposed to so
much attack surface. This is not for them, it's for people that don't
use containers.

> So I have concerns about both efficacy and usability with the proposed
> sysctl.

Two distros already have this sysctl because it was so strongly
requested by their users. This needs to be upstream so we can manage
the effects correctly.

> So to keep this productive.  Please tell me about the threat model
> you envision, and how you envision knobs in the kernel being used to
> counter those threats.

The threat model I envision is post-intrusion escalation of privileges
on systems that run distro kernels and do not use containers. I
envision the sysctl being used at boot time to kill the entire class
of current and future vulnerabilities exposed by CLONE_NEWUSER. Just
like the sysctls used to turn off modules at boot or turn off kexec at
boot.

As Linux developers I feel we have an obligation to provide our end
users with run-time choices (not just compile-time choices), since
most of our users are using kernels built by someone else. Given the
repeated problems with module auto-loading, we provided a way to
disable module loading. Given the physical-memory-rewriting exposure
of kexec, we provides a way to disable kexec. Given the conflict
between hibernation and kASLR, we provided a way to choose one at
runtime. Here, we're looking back on three years of vulnerabilities
around CLONE_NEWUSER with no end in sight, and we have an obligation
to help the end users that don't want to be exposed to this any more.
Note I'm not suggesting we stop trying to fix the problems we find
with user namespaces, but we need to provide a way to disable them.

Having this sysctl is vastly superior to telling people how to rewrite
their kernel memory at boot time to disable syscalls:

https://outflux.net/blog/archives/2013/12/10/live-patching-the-kernel/

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-23  0:59           ` Ben Hutchings
@ 2016-01-24 20:59               ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-24 20:59 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Robert Święcki, Serge E. Hallyn, Andrew Morton,
	Al Viro, Richard Weinberger, Eric W. Biederman, Andy Lutomirski,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML

On Fri, Jan 22, 2016 at 4:59 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Fri, 2016-01-22 at 15:00 -0800, Kees Cook wrote:
>> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
>> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
>> >
>> > > > Seems that Debian and some older Ubuntu versions are already using
>> > > >
>> > > > $ sysctl -a | grep usern
>> > > > kernel.unprivileged_userns_clone = 0
>> > > >
>> > > > Shall we be consistent wit it?
>> > >
>> > > Oh! I didn't see that on systems I checked. On which version did you find that?
>> >
>> > $ uname -a
>> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
>> > (2016-01-07) x86_64 GNU/Linux
>> > $ cat /etc/debian_version
>> > 8.2
>>
>> Ah-ha, Debian only, though it looks like this was just committed to
>> the Ubuntu kernel tree too:
>>
>>
>> > IIRC some older kernels delivered with Ubuntu Precise were also using
>> > it (but maybe I'm mistaken)
>>
>> I don't see it there.
>>
>> I think my patch is more complete, but I'm happy to change the name if
>> this sysctl has already started to enter the global consciousness. ;)
>>
>> Serge, Ben, what do you think?
>
> I agree that using the '_restrict' suffix for new restrictions makes
> sense.  I also don't think that a third possible value for
> kernel.unprivileged_userns_clone would would be understandable.
>
> I would probably make kernel.unprivileged_userns_clone a wrapper for
> kernel.userns_restrict in Debian, then deprecate and eventually remove
> it.

Okay, cool. We'll keep my patch as-is then. Thanks!

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-24 20:59               ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-24 20:59 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Robert Święcki, Serge E. Hallyn, Andrew Morton,
	Al Viro, Richard Weinberger, Eric W. Biederman, Andy Lutomirski,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML

On Fri, Jan 22, 2016 at 4:59 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Fri, 2016-01-22 at 15:00 -0800, Kees Cook wrote:
>> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
>> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
>> >
>> > > > Seems that Debian and some older Ubuntu versions are already using
>> > > >
>> > > > $ sysctl -a | grep usern
>> > > > kernel.unprivileged_userns_clone = 0
>> > > >
>> > > > Shall we be consistent wit it?
>> > >
>> > > Oh! I didn't see that on systems I checked. On which version did you find that?
>> >
>> > $ uname -a
>> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
>> > (2016-01-07) x86_64 GNU/Linux
>> > $ cat /etc/debian_version
>> > 8.2
>>
>> Ah-ha, Debian only, though it looks like this was just committed to
>> the Ubuntu kernel tree too:
>>
>>
>> > IIRC some older kernels delivered with Ubuntu Precise were also using
>> > it (but maybe I'm mistaken)
>>
>> I don't see it there.
>>
>> I think my patch is more complete, but I'm happy to change the name if
>> this sysctl has already started to enter the global consciousness. ;)
>>
>> Serge, Ben, what do you think?
>
> I agree that using the '_restrict' suffix for new restrictions makes
> sense.  I also don't think that a third possible value for
> kernel.unprivileged_userns_clone would would be understandable.
>
> I would probably make kernel.unprivileged_userns_clone a wrapper for
> kernel.userns_restrict in Debian, then deprecate and eventually remove
> it.

Okay, cool. We'll keep my patch as-is then. Thanks!

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-24 20:59               ` Kees Cook
  (?)
@ 2016-01-24 22:20               ` Andy Lutomirski
  2016-01-25 18:51                 ` Kees Cook
  -1 siblings, 1 reply; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-24 22:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel-hardening, Robert Święcki, Serge E. Hallyn,
	Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML

On Sun, Jan 24, 2016 at 12:59 PM, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Jan 22, 2016 at 4:59 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
>> On Fri, 2016-01-22 at 15:00 -0800, Kees Cook wrote:
>>> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
>>> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
>>> >
>>> > > > Seems that Debian and some older Ubuntu versions are already using
>>> > > >
>>> > > > $ sysctl -a | grep usern
>>> > > > kernel.unprivileged_userns_clone = 0
>>> > > >
>>> > > > Shall we be consistent wit it?
>>> > >
>>> > > Oh! I didn't see that on systems I checked. On which version did you find that?
>>> >
>>> > $ uname -a
>>> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
>>> > (2016-01-07) x86_64 GNU/Linux
>>> > $ cat /etc/debian_version
>>> > 8.2
>>>
>>> Ah-ha, Debian only, though it looks like this was just committed to
>>> the Ubuntu kernel tree too:
>>>
>>>
>>> > IIRC some older kernels delivered with Ubuntu Precise were also using
>>> > it (but maybe I'm mistaken)
>>>
>>> I don't see it there.
>>>
>>> I think my patch is more complete, but I'm happy to change the name if
>>> this sysctl has already started to enter the global consciousness. ;)
>>>
>>> Serge, Ben, what do you think?
>>
>> I agree that using the '_restrict' suffix for new restrictions makes
>> sense.  I also don't think that a third possible value for
>> kernel.unprivileged_userns_clone would would be understandable.
>>
>> I would probably make kernel.unprivileged_userns_clone a wrapper for
>> kernel.userns_restrict in Debian, then deprecate and eventually remove
>> it.
>
> Okay, cool. We'll keep my patch as-is then. Thanks!

We still need to deal with the capable check in the write handler though, right?

But I must be missing something: why is mode 0644 insufficient?

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-23  3:02   ` [kernel-hardening] " Eric W. Biederman
@ 2016-01-24 22:22     ` Andy Lutomirski
  -1 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-24 22:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> There continues to be unexpected side-effects and security exposures
>> via CLONE_NEWUSER. For many end-users running distro kernels with
>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>> admins not running containers or Chrome can avoid the risks of this
>> feature.
>
> I don't actually think there do continue to be unexpected side-effects
> and security exposures with CLONE_NEWUSER.  It takes a while for all of
> the fixes to trickle out to distros.  At most what I have seen recently
> are problems with other kernel interfaces being amplified with user
> namespaces.  AKA the current mess with devpts, and the unexpected
> issues with bind mounts in mount namespaces.
>

>
> So to keep this productive.  Please tell me about the threat model
> you envision, and how you envision knobs in the kernel being used to
> counter those threats.

I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
over /any/ network namespace and to thus access the network
configuration API to be a huge risk.  For example, unprivileged users
can program iptables.  I'll eat my hat if there are no privilege
escalations in there.  (They can't request module loading, but still.)

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-24 22:22     ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-24 22:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> There continues to be unexpected side-effects and security exposures
>> via CLONE_NEWUSER. For many end-users running distro kernels with
>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>> admins not running containers or Chrome can avoid the risks of this
>> feature.
>
> I don't actually think there do continue to be unexpected side-effects
> and security exposures with CLONE_NEWUSER.  It takes a while for all of
> the fixes to trickle out to distros.  At most what I have seen recently
> are problems with other kernel interfaces being amplified with user
> namespaces.  AKA the current mess with devpts, and the unexpected
> issues with bind mounts in mount namespaces.
>

>
> So to keep this productive.  Please tell me about the threat model
> you envision, and how you envision knobs in the kernel being used to
> counter those threats.

I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
over /any/ network namespace and to thus access the network
configuration API to be a huge risk.  For example, unprivileged users
can program iptables.  I'll eat my hat if there are no privilege
escalations in there.  (They can't request module loading, but still.)

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-24 22:20               ` Andy Lutomirski
@ 2016-01-25 18:51                 ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kernel-hardening, Robert Święcki, Serge E. Hallyn,
	Andrew Morton, Al Viro, Richard Weinberger, Eric W. Biederman,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML

On Sun, Jan 24, 2016 at 2:20 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sun, Jan 24, 2016 at 12:59 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Fri, Jan 22, 2016 at 4:59 PM, Ben Hutchings <ben@decadent.org.uk> wrote:
>>> On Fri, 2016-01-22 at 15:00 -0800, Kees Cook wrote:
>>>> On Fri, Jan 22, 2016 at 2:55 PM, Robert Święcki <robert@swiecki.net> wrote:
>>>> > 2016-01-22 23:50 GMT+01:00 Kees Cook <keescook@chromium.org>:
>>>> >
>>>> > > > Seems that Debian and some older Ubuntu versions are already using
>>>> > > >
>>>> > > > $ sysctl -a | grep usern
>>>> > > > kernel.unprivileged_userns_clone = 0
>>>> > > >
>>>> > > > Shall we be consistent wit it?
>>>> > >
>>>> > > Oh! I didn't see that on systems I checked. On which version did you find that?
>>>> >
>>>> > $ uname -a
>>>> > Linux bc1 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-5~bpo8+1
>>>> > (2016-01-07) x86_64 GNU/Linux
>>>> > $ cat /etc/debian_version
>>>> > 8.2
>>>>
>>>> Ah-ha, Debian only, though it looks like this was just committed to
>>>> the Ubuntu kernel tree too:
>>>>
>>>>
>>>> > IIRC some older kernels delivered with Ubuntu Precise were also using
>>>> > it (but maybe I'm mistaken)
>>>>
>>>> I don't see it there.
>>>>
>>>> I think my patch is more complete, but I'm happy to change the name if
>>>> this sysctl has already started to enter the global consciousness. ;)
>>>>
>>>> Serge, Ben, what do you think?
>>>
>>> I agree that using the '_restrict' suffix for new restrictions makes
>>> sense.  I also don't think that a third possible value for
>>> kernel.unprivileged_userns_clone would would be understandable.
>>>
>>> I would probably make kernel.unprivileged_userns_clone a wrapper for
>>> kernel.userns_restrict in Debian, then deprecate and eventually remove
>>> it.
>>
>> Okay, cool. We'll keep my patch as-is then. Thanks!
>
> We still need to deal with the capable check in the write handler though, right?
>
> But I must be missing something: why is mode 0644 insufficient?

Yeah, separate issue. I think it's a corner case: a non-cap root user
using a setuid tool to write to sysctls. It's worth solving, but I'd
like to land the CLONE_NEWUSER sysctl first; it's much more urgent.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-24 22:22     ` [kernel-hardening] " Andy Lutomirski
@ 2016-01-25 18:51       ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Sun, Jan 24, 2016 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>
>>> There continues to be unexpected side-effects and security exposures
>>> via CLONE_NEWUSER. For many end-users running distro kernels with
>>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>>> admins not running containers or Chrome can avoid the risks of this
>>> feature.
>>
>> I don't actually think there do continue to be unexpected side-effects
>> and security exposures with CLONE_NEWUSER.  It takes a while for all of
>> the fixes to trickle out to distros.  At most what I have seen recently
>> are problems with other kernel interfaces being amplified with user
>> namespaces.  AKA the current mess with devpts, and the unexpected
>> issues with bind mounts in mount namespaces.
>>
>
>>
>> So to keep this productive.  Please tell me about the threat model
>> you envision, and how you envision knobs in the kernel being used to
>> counter those threats.
>
> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
> over /any/ network namespace and to thus access the network
> configuration API to be a huge risk.  For example, unprivileged users
> can program iptables.  I'll eat my hat if there are no privilege
> escalations in there.  (They can't request module loading, but still.)

Should I consider this an Ack for the patch? :)

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-25 18:51       ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Sun, Jan 24, 2016 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>
>>> There continues to be unexpected side-effects and security exposures
>>> via CLONE_NEWUSER. For many end-users running distro kernels with
>>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>>> admins not running containers or Chrome can avoid the risks of this
>>> feature.
>>
>> I don't actually think there do continue to be unexpected side-effects
>> and security exposures with CLONE_NEWUSER.  It takes a while for all of
>> the fixes to trickle out to distros.  At most what I have seen recently
>> are problems with other kernel interfaces being amplified with user
>> namespaces.  AKA the current mess with devpts, and the unexpected
>> issues with bind mounts in mount namespaces.
>>
>
>>
>> So to keep this productive.  Please tell me about the threat model
>> you envision, and how you envision knobs in the kernel being used to
>> counter those threats.
>
> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
> over /any/ network namespace and to thus access the network
> configuration API to be a huge risk.  For example, unprivileged users
> can program iptables.  I'll eat my hat if there are no privilege
> escalations in there.  (They can't request module loading, but still.)

Should I consider this an Ack for the patch? :)

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 18:51       ` [kernel-hardening] " Kees Cook
@ 2016-01-25 18:53         ` Andy Lutomirski
  -1 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-25 18:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 10:51 AM, Kees Cook <keescook@chromium.org> wrote:
> On Sun, Jan 24, 2016 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>
>>>> There continues to be unexpected side-effects and security exposures
>>>> via CLONE_NEWUSER. For many end-users running distro kernels with
>>>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>>>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>>>> admins not running containers or Chrome can avoid the risks of this
>>>> feature.
>>>
>>> I don't actually think there do continue to be unexpected side-effects
>>> and security exposures with CLONE_NEWUSER.  It takes a while for all of
>>> the fixes to trickle out to distros.  At most what I have seen recently
>>> are problems with other kernel interfaces being amplified with user
>>> namespaces.  AKA the current mess with devpts, and the unexpected
>>> issues with bind mounts in mount namespaces.
>>>
>>
>>>
>>> So to keep this productive.  Please tell me about the threat model
>>> you envision, and how you envision knobs in the kernel being used to
>>> counter those threats.
>>
>> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
>> over /any/ network namespace and to thus access the network
>> configuration API to be a huge risk.  For example, unprivileged users
>> can program iptables.  I'll eat my hat if there are no privilege
>> escalations in there.  (They can't request module loading, but still.)
>
> Should I consider this an Ack for the patch? :)

Only if you explain why you need the CAP_SYS_ADMIN check.  :)

IOW, I think you could change that one line of code and have a less
weird version of the patch that would work just fine.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-25 18:53         ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-25 18:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 10:51 AM, Kees Cook <keescook@chromium.org> wrote:
> On Sun, Jan 24, 2016 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>
>>>> There continues to be unexpected side-effects and security exposures
>>>> via CLONE_NEWUSER. For many end-users running distro kernels with
>>>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>>>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>>>> admins not running containers or Chrome can avoid the risks of this
>>>> feature.
>>>
>>> I don't actually think there do continue to be unexpected side-effects
>>> and security exposures with CLONE_NEWUSER.  It takes a while for all of
>>> the fixes to trickle out to distros.  At most what I have seen recently
>>> are problems with other kernel interfaces being amplified with user
>>> namespaces.  AKA the current mess with devpts, and the unexpected
>>> issues with bind mounts in mount namespaces.
>>>
>>
>>>
>>> So to keep this productive.  Please tell me about the threat model
>>> you envision, and how you envision knobs in the kernel being used to
>>> counter those threats.
>>
>> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
>> over /any/ network namespace and to thus access the network
>> configuration API to be a huge risk.  For example, unprivileged users
>> can program iptables.  I'll eat my hat if there are no privilege
>> escalations in there.  (They can't request module loading, but still.)
>
> Should I consider this an Ack for the patch? :)

Only if you explain why you need the CAP_SYS_ADMIN check.  :)

IOW, I think you could change that one line of code and have a less
weird version of the patch that would work just fine.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 18:53         ` [kernel-hardening] " Andy Lutomirski
@ 2016-01-25 18:56           ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 18:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 10:53 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jan 25, 2016 at 10:51 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Sun, Jan 24, 2016 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>>> There continues to be unexpected side-effects and security exposures
>>>>> via CLONE_NEWUSER. For many end-users running distro kernels with
>>>>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>>>>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>>>>> admins not running containers or Chrome can avoid the risks of this
>>>>> feature.
>>>>
>>>> I don't actually think there do continue to be unexpected side-effects
>>>> and security exposures with CLONE_NEWUSER.  It takes a while for all of
>>>> the fixes to trickle out to distros.  At most what I have seen recently
>>>> are problems with other kernel interfaces being amplified with user
>>>> namespaces.  AKA the current mess with devpts, and the unexpected
>>>> issues with bind mounts in mount namespaces.
>>>>
>>>
>>>>
>>>> So to keep this productive.  Please tell me about the threat model
>>>> you envision, and how you envision knobs in the kernel being used to
>>>> counter those threats.
>>>
>>> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
>>> over /any/ network namespace and to thus access the network
>>> configuration API to be a huge risk.  For example, unprivileged users
>>> can program iptables.  I'll eat my hat if there are no privilege
>>> escalations in there.  (They can't request module loading, but still.)
>>
>> Should I consider this an Ack for the patch? :)
>
> Only if you explain why you need the CAP_SYS_ADMIN check.  :)

Hm? In the sysctl write? Because otherwise a non-cap root user could
turn "1" to "0". The restriction on CLONE_NEWUSER checks caps, not
uid, so the uid must be protected by cap checks. The DAC permissions
on sysctls for cap-based restrictions make no sense -- they need to be
doing cap checks not DAC checks. It's the same logic for why
dmesg_restrict and kptr_restrict use the same cap check.

> IOW, I think you could change that one line of code and have a less
> weird version of the patch that would work just fine.

Well, I don't know about less weird, but it would leave a unneeded
hole in the permission checks.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-25 18:56           ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 18:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 10:53 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Jan 25, 2016 at 10:51 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Sun, Jan 24, 2016 at 2:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>>> There continues to be unexpected side-effects and security exposures
>>>>> via CLONE_NEWUSER. For many end-users running distro kernels with
>>>>> CONFIG_USER_NS enabled, there is no way to disable this feature when
>>>>> desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
>>>>> admins not running containers or Chrome can avoid the risks of this
>>>>> feature.
>>>>
>>>> I don't actually think there do continue to be unexpected side-effects
>>>> and security exposures with CLONE_NEWUSER.  It takes a while for all of
>>>> the fixes to trickle out to distros.  At most what I have seen recently
>>>> are problems with other kernel interfaces being amplified with user
>>>> namespaces.  AKA the current mess with devpts, and the unexpected
>>>> issues with bind mounts in mount namespaces.
>>>>
>>>
>>>>
>>>> So to keep this productive.  Please tell me about the threat model
>>>> you envision, and how you envision knobs in the kernel being used to
>>>> counter those threats.
>>>
>>> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN
>>> over /any/ network namespace and to thus access the network
>>> configuration API to be a huge risk.  For example, unprivileged users
>>> can program iptables.  I'll eat my hat if there are no privilege
>>> escalations in there.  (They can't request module loading, but still.)
>>
>> Should I consider this an Ack for the patch? :)
>
> Only if you explain why you need the CAP_SYS_ADMIN check.  :)

Hm? In the sysctl write? Because otherwise a non-cap root user could
turn "1" to "0". The restriction on CLONE_NEWUSER checks caps, not
uid, so the uid must be protected by cap checks. The DAC permissions
on sysctls for cap-based restrictions make no sense -- they need to be
doing cap checks not DAC checks. It's the same logic for why
dmesg_restrict and kptr_restrict use the same cap check.

> IOW, I think you could change that one line of code and have a less
> weird version of the patch that would work just fine.

Well, I don't know about less weird, but it would leave a unneeded
hole in the permission checks.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 18:56           ` [kernel-hardening] " Kees Cook
@ 2016-01-25 19:33             ` Eric W. Biederman
  -1 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-25 19:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:
>
> Well, I don't know about less weird, but it would leave a unneeded
> hole in the permission checks.

To be clear the current patch has my:

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

The code is buggy, and poorly thought through.  Your lack of interest in
fixing the bugs in your patch is distressing.

So broken code, not willing to fix.  No. We are not merging this sysctl.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-25 19:33             ` Eric W. Biederman
  0 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-25 19:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:
>
> Well, I don't know about less weird, but it would leave a unneeded
> hole in the permission checks.

To be clear the current patch has my:

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

The code is buggy, and poorly thought through.  Your lack of interest in
fixing the bugs in your patch is distressing.

So broken code, not willing to fix.  No. We are not merging this sysctl.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 19:33             ` [kernel-hardening] " Eric W. Biederman
@ 2016-01-25 22:34               ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 22:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>>
>> Well, I don't know about less weird, but it would leave a unneeded
>> hole in the permission checks.
>
> To be clear the current patch has my:
>
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> The code is buggy, and poorly thought through.  Your lack of interest in
> fixing the bugs in your patch is distressing.

I'm not sure where you see me having a "lack of interest". The
existing cap-checking sysctls have a corner-case bug, which is
orthogonal to this change.

> So broken code, not willing to fix.  No. We are not merging this sysctl.

I think you're jumping to conclusions. :)

This feature is already implemented by two distros, and likely wanted
by others. We cannot ignore that. The sysctl default doesn't change
the existing behavior, so this doesn't get in your way at all. Can you
please respond to my earlier email where I rebutted each of your
arguments against it? Just saying "no" and putting words in my mouth
isn't very productive.

Andy, given your interest in this feature, and my explanation of the
CAP_SYSADMIN check, what are your thoughts?

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-25 22:34               ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-25 22:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>>
>> Well, I don't know about less weird, but it would leave a unneeded
>> hole in the permission checks.
>
> To be clear the current patch has my:
>
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> The code is buggy, and poorly thought through.  Your lack of interest in
> fixing the bugs in your patch is distressing.

I'm not sure where you see me having a "lack of interest". The
existing cap-checking sysctls have a corner-case bug, which is
orthogonal to this change.

> So broken code, not willing to fix.  No. We are not merging this sysctl.

I think you're jumping to conclusions. :)

This feature is already implemented by two distros, and likely wanted
by others. We cannot ignore that. The sysctl default doesn't change
the existing behavior, so this doesn't get in your way at all. Can you
please respond to my earlier email where I rebutted each of your
arguments against it? Just saying "no" and putting words in my mouth
isn't very productive.

Andy, given your interest in this feature, and my explanation of the
CAP_SYSADMIN check, what are your thoughts?

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 22:34               ` [kernel-hardening] " Kees Cook
@ 2016-01-25 23:33                 ` Andy Lutomirski
  -1 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-25 23:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 2:34 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>>
>>> Well, I don't know about less weird, but it would leave a unneeded
>>> hole in the permission checks.
>>
>> To be clear the current patch has my:
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> The code is buggy, and poorly thought through.  Your lack of interest in
>> fixing the bugs in your patch is distressing.
>
> I'm not sure where you see me having a "lack of interest". The
> existing cap-checking sysctls have a corner-case bug, which is
> orthogonal to this change.
>
>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>
> I think you're jumping to conclusions. :)
>
> This feature is already implemented by two distros, and likely wanted
> by others. We cannot ignore that. The sysctl default doesn't change
> the existing behavior, so this doesn't get in your way at all. Can you
> please respond to my earlier email where I rebutted each of your
> arguments against it? Just saying "no" and putting words in my mouth
> isn't very productive.
>
> Andy, given your interest in this feature, and my explanation of the
> CAP_SYSADMIN check, what are your thoughts?
>

I think the sysctl sucks, but I don't have a better idea, so I think
it should go in.  There's clearly demand.

A better change would be to have an option to tighten up what
namespaces can be manipulated in which manner.  If anyone wants to
step up and do that, it sounds useful.  Denying CAP_NET_ADMIN in an
unprivileged user ns would address one piece of attack surface.  There
are probably others.

*However*, I think that trying to protect against a hypothetical
attacker with uid == global root who has procfs access but doesn't
have CAP_SYS_ADMIN isn't important enough to be worth introducing yet
another bad capable() call.

Whoever wants to tilt at the windmill of fixng sysctl permissions can
address all of them, and then maybe this sysctl would be worth
tightening up.

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-25 23:33                 ` Andy Lutomirski
  0 siblings, 0 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-25 23:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

On Mon, Jan 25, 2016 at 2:34 PM, Kees Cook <keescook@chromium.org> wrote:
> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>>
>>> Well, I don't know about less weird, but it would leave a unneeded
>>> hole in the permission checks.
>>
>> To be clear the current patch has my:
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> The code is buggy, and poorly thought through.  Your lack of interest in
>> fixing the bugs in your patch is distressing.
>
> I'm not sure where you see me having a "lack of interest". The
> existing cap-checking sysctls have a corner-case bug, which is
> orthogonal to this change.
>
>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>
> I think you're jumping to conclusions. :)
>
> This feature is already implemented by two distros, and likely wanted
> by others. We cannot ignore that. The sysctl default doesn't change
> the existing behavior, so this doesn't get in your way at all. Can you
> please respond to my earlier email where I rebutted each of your
> arguments against it? Just saying "no" and putting words in my mouth
> isn't very productive.
>
> Andy, given your interest in this feature, and my explanation of the
> CAP_SYSADMIN check, what are your thoughts?
>

I think the sysctl sucks, but I don't have a better idea, so I think
it should go in.  There's clearly demand.

A better change would be to have an option to tighten up what
namespaces can be manipulated in which manner.  If anyone wants to
step up and do that, it sounds useful.  Denying CAP_NET_ADMIN in an
unprivileged user ns would address one piece of attack surface.  There
are probably others.

*However*, I think that trying to protect against a hypothetical
attacker with uid == global root who has procfs access but doesn't
have CAP_SYS_ADMIN isn't important enough to be worth introducing yet
another bad capable() call.

Whoever wants to tilt at the windmill of fixng sysctl permissions can
address all of them, and then maybe this sysctl would be worth
tightening up.

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 22:34               ` [kernel-hardening] " Kees Cook
  (?)
  (?)
@ 2016-01-26  2:27               ` Daniel Micay
  -1 siblings, 0 replies; 80+ messages in thread
From: Daniel Micay @ 2016-01-26  2:27 UTC (permalink / raw)
  To: kernel-hardening, Eric W. Biederman
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 418 bytes --]

> This feature is already implemented by two distros, and likely wanted
> by others. We cannot ignore that.

Date point: Arch Linux won't be enabling CONFIG_USERNS until there's a
way to disable unprivileged user namespaces. The kernel maintainers are
unwilling to carry long-term out-of-tree patches.

https://github.com/sandstorm-io/sandstorm/blob/d270755b1b55e5be6c96df2cce7c914f35f0d2a2/install.sh#L464-L474

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-25 22:34               ` [kernel-hardening] " Kees Cook
@ 2016-01-26  4:57                 ` Eric W. Biederman
  -1 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-26  4:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:

> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>>
>>> Well, I don't know about less weird, but it would leave a unneeded
>>> hole in the permission checks.
>>
>> To be clear the current patch has my:
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> The code is buggy, and poorly thought through.  Your lack of interest in
>> fixing the bugs in your patch is distressing.
>
> I'm not sure where you see me having a "lack of interest". The
> existing cap-checking sysctls have a corner-case bug, which is
> orthogonal to this change.

That certainly doesn't sound like you have any plans to change anything
there.

>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>
> I think you're jumping to conclusions. :)

I think I am the maintainer.

What you are proposing is very much something that is only of interst to
people who are not using user namespaces.  It is fatally flawed as
a way to avoid new attack surfaces for people who don't care as the
sysctl leaves user namespaces enabled by default.  It is fatally flawed
as remediation to recommend to people to change if a new user namespace
related but is discovered.  Any running process that happens to be
created while user namespace creation was enabled will continue to
exist.  Effectively a reboot will be required as part of a mitigation.
Many sysadmins will get that wrong.

I can't possibly see your sysctl as proposed achieving it's goals.  A
person has to be entirely too aware of subtlety and nuance to use it
effectively.

> This feature is already implemented by two distros, and likely wanted
> by others. We cannot ignore that. The sysctl default doesn't change
> the existing behavior, so this doesn't get in your way at all. Can you
> please respond to my earlier email where I rebutted each of your
> arguments against it? Just saying "no" and putting words in my mouth
> isn't very productive.

Calling people who make mistakes insane is not a rebuttal.  In security
usability matters, and your sysctl has low usability.

Further you seem to have missed something crucial in your understanding.
As was explained earlier the sysctl was added to ubuntu to allow early
adopters to experiment not as a long term way of managing user
namespaces.


What sounds like a generally useful feature that would cover your use
case and many others is a per user limit on the number of user
namespaces users may create.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-26  4:57                 ` Eric W. Biederman
  0 siblings, 0 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-26  4:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel,
	kernel-hardening

Kees Cook <keescook@chromium.org> writes:

> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>>
>>> Well, I don't know about less weird, but it would leave a unneeded
>>> hole in the permission checks.
>>
>> To be clear the current patch has my:
>>
>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>
>> The code is buggy, and poorly thought through.  Your lack of interest in
>> fixing the bugs in your patch is distressing.
>
> I'm not sure where you see me having a "lack of interest". The
> existing cap-checking sysctls have a corner-case bug, which is
> orthogonal to this change.

That certainly doesn't sound like you have any plans to change anything
there.

>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>
> I think you're jumping to conclusions. :)

I think I am the maintainer.

What you are proposing is very much something that is only of interst to
people who are not using user namespaces.  It is fatally flawed as
a way to avoid new attack surfaces for people who don't care as the
sysctl leaves user namespaces enabled by default.  It is fatally flawed
as remediation to recommend to people to change if a new user namespace
related but is discovered.  Any running process that happens to be
created while user namespace creation was enabled will continue to
exist.  Effectively a reboot will be required as part of a mitigation.
Many sysadmins will get that wrong.

I can't possibly see your sysctl as proposed achieving it's goals.  A
person has to be entirely too aware of subtlety and nuance to use it
effectively.

> This feature is already implemented by two distros, and likely wanted
> by others. We cannot ignore that. The sysctl default doesn't change
> the existing behavior, so this doesn't get in your way at all. Can you
> please respond to my earlier email where I rebutted each of your
> arguments against it? Just saying "no" and putting words in my mouth
> isn't very productive.

Calling people who make mistakes insane is not a rebuttal.  In security
usability matters, and your sysctl has low usability.

Further you seem to have missed something crucial in your understanding.
As was explained earlier the sysctl was added to ubuntu to allow early
adopters to experiment not as a long term way of managing user
namespaces.


What sounds like a generally useful feature that would cover your use
case and many others is a per user limit on the number of user
namespaces users may create.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-24 20:57     ` [kernel-hardening] " Kees Cook
  (?)
@ 2016-01-26  7:38     ` Serge Hallyn
  -1 siblings, 0 replies; 80+ messages in thread
From: Serge Hallyn @ 2016-01-26  7:38 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Andy Lutomirski, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc, LKML

Quoting Kees Cook (keescook@chromium.org):
> On Fri, Jan 22, 2016 at 7:02 PM, Eric W. Biederman
> > So I have concerns about both efficacy and usability with the proposed
> > sysctl.
> 
> Two distros already have this sysctl because it was so strongly
> requested by their users. This needs to be upstream so we can manage
> the effects correctly.

Which two distros?  Was it in fact requested by their users?

My opinion remains that long-term this is a bad thing.  If we're going to
have this upstream, it should be clearly marked so as to be easily
removable at some point down the road.  Userspace that cannot count on a
feature (in the best case) won't use it or (much worse) will fall back
to broken behavior in one case or the other.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26  4:57                 ` [kernel-hardening] " Eric W. Biederman
@ 2016-01-26 14:38                   ` Josh Boyer
  -1 siblings, 0 replies; 80+ messages in thread
From: Josh Boyer @ 2016-01-26 14:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>> hole in the permission checks.
>>>
>>> To be clear the current patch has my:
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>> fixing the bugs in your patch is distressing.
>>
>> I'm not sure where you see me having a "lack of interest". The
>> existing cap-checking sysctls have a corner-case bug, which is
>> orthogonal to this change.
>
> That certainly doesn't sound like you have any plans to change anything
> there.
>
>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>
>> I think you're jumping to conclusions. :)
>
> I think I am the maintainer.
>
> What you are proposing is very much something that is only of interst to
> people who are not using user namespaces.  It is fatally flawed as
> a way to avoid new attack surfaces for people who don't care as the
> sysctl leaves user namespaces enabled by default.  It is fatally flawed
> as remediation to recommend to people to change if a new user namespace
> related but is discovered.  Any running process that happens to be
> created while user namespace creation was enabled will continue to
> exist.  Effectively a reboot will be required as part of a mitigation.
> Many sysadmins will get that wrong.
>
> I can't possibly see your sysctl as proposed achieving it's goals.  A
> person has to be entirely too aware of subtlety and nuance to use it
> effectively.

What you're saying is true for the "oh crap" case of a new userns
related CVE being found.  However, there is the case where sysadmins
know for a fact that a set of machines should not allow user
namespaces to be enabled.  Currently they have 2 choices, 1) use their
distro kernel as-is, which may not meet their goal of having userns
disabled, or 2) rebuild their kernel to disable it, which may
invalidate any support contracts they have.

I tend to agree with you on the lack of value around runtime
mitigation, but allowing an admin to toggle this as a blatant on/off
switch on reboot does have value.

>> This feature is already implemented by two distros, and likely wanted
>> by others. We cannot ignore that. The sysctl default doesn't change
>> the existing behavior, so this doesn't get in your way at all. Can you
>> please respond to my earlier email where I rebutted each of your
>> arguments against it? Just saying "no" and putting words in my mouth
>> isn't very productive.
>
> Calling people who make mistakes insane is not a rebuttal.  In security
> usability matters, and your sysctl has low usability.
>
> Further you seem to have missed something crucial in your understanding.
> As was explained earlier the sysctl was added to ubuntu to allow early
> adopters to experiment not as a long term way of managing user
> namespaces.
>
>
> What sounds like a generally useful feature that would cover your use
> case and many others is a per user limit on the number of user
> namespaces users may create.

Where that number may be zero?  I don't see how that is really any
better than a sysctl.  Could you elaborate?

josh

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-26 14:38                   ` Josh Boyer
  0 siblings, 0 replies; 80+ messages in thread
From: Josh Boyer @ 2016-01-26 14:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>> hole in the permission checks.
>>>
>>> To be clear the current patch has my:
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>> fixing the bugs in your patch is distressing.
>>
>> I'm not sure where you see me having a "lack of interest". The
>> existing cap-checking sysctls have a corner-case bug, which is
>> orthogonal to this change.
>
> That certainly doesn't sound like you have any plans to change anything
> there.
>
>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>
>> I think you're jumping to conclusions. :)
>
> I think I am the maintainer.
>
> What you are proposing is very much something that is only of interst to
> people who are not using user namespaces.  It is fatally flawed as
> a way to avoid new attack surfaces for people who don't care as the
> sysctl leaves user namespaces enabled by default.  It is fatally flawed
> as remediation to recommend to people to change if a new user namespace
> related but is discovered.  Any running process that happens to be
> created while user namespace creation was enabled will continue to
> exist.  Effectively a reboot will be required as part of a mitigation.
> Many sysadmins will get that wrong.
>
> I can't possibly see your sysctl as proposed achieving it's goals.  A
> person has to be entirely too aware of subtlety and nuance to use it
> effectively.

What you're saying is true for the "oh crap" case of a new userns
related CVE being found.  However, there is the case where sysadmins
know for a fact that a set of machines should not allow user
namespaces to be enabled.  Currently they have 2 choices, 1) use their
distro kernel as-is, which may not meet their goal of having userns
disabled, or 2) rebuild their kernel to disable it, which may
invalidate any support contracts they have.

I tend to agree with you on the lack of value around runtime
mitigation, but allowing an admin to toggle this as a blatant on/off
switch on reboot does have value.

>> This feature is already implemented by two distros, and likely wanted
>> by others. We cannot ignore that. The sysctl default doesn't change
>> the existing behavior, so this doesn't get in your way at all. Can you
>> please respond to my earlier email where I rebutted each of your
>> arguments against it? Just saying "no" and putting words in my mouth
>> isn't very productive.
>
> Calling people who make mistakes insane is not a rebuttal.  In security
> usability matters, and your sysctl has low usability.
>
> Further you seem to have missed something crucial in your understanding.
> As was explained earlier the sysctl was added to ubuntu to allow early
> adopters to experiment not as a long term way of managing user
> namespaces.
>
>
> What sounds like a generally useful feature that would cover your use
> case and many others is a per user limit on the number of user
> namespaces users may create.

Where that number may be zero?  I don't see how that is really any
better than a sysctl.  Could you elaborate?

josh

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 14:38                   ` [kernel-hardening] " Josh Boyer
@ 2016-01-26 14:46                     ` Austin S. Hemmelgarn
  -1 siblings, 0 replies; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-26 14:46 UTC (permalink / raw)
  To: Josh Boyer, Eric W. Biederman
  Cc: Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

On 2016-01-26 09:38, Josh Boyer wrote:
> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>
>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>
>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>> hole in the permission checks.
>>>>
>>>> To be clear the current patch has my:
>>>>
>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>
>>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>>> fixing the bugs in your patch is distressing.
>>>
>>> I'm not sure where you see me having a "lack of interest". The
>>> existing cap-checking sysctls have a corner-case bug, which is
>>> orthogonal to this change.
>>
>> That certainly doesn't sound like you have any plans to change anything
>> there.
>>
>>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>>
>>> I think you're jumping to conclusions. :)
>>
>> I think I am the maintainer.
>>
>> What you are proposing is very much something that is only of interst to
>> people who are not using user namespaces.  It is fatally flawed as
>> a way to avoid new attack surfaces for people who don't care as the
>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>> as remediation to recommend to people to change if a new user namespace
>> related but is discovered.  Any running process that happens to be
>> created while user namespace creation was enabled will continue to
>> exist.  Effectively a reboot will be required as part of a mitigation.
>> Many sysadmins will get that wrong.
>>
>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>> person has to be entirely too aware of subtlety and nuance to use it
>> effectively.
>
> What you're saying is true for the "oh crap" case of a new userns
> related CVE being found.  However, there is the case where sysadmins
> know for a fact that a set of machines should not allow user
> namespaces to be enabled.  Currently they have 2 choices, 1) use their
> distro kernel as-is, which may not meet their goal of having userns
> disabled, or 2) rebuild their kernel to disable it, which may
> invalidate any support contracts they have.
>
> I tend to agree with you on the lack of value around runtime
> mitigation, but allowing an admin to toggle this as a blatant on/off
> switch on reboot does have value.
>
>>> This feature is already implemented by two distros, and likely wanted
>>> by others. We cannot ignore that. The sysctl default doesn't change
>>> the existing behavior, so this doesn't get in your way at all. Can you
>>> please respond to my earlier email where I rebutted each of your
>>> arguments against it? Just saying "no" and putting words in my mouth
>>> isn't very productive.
>>
>> Calling people who make mistakes insane is not a rebuttal.  In security
>> usability matters, and your sysctl has low usability.
>>
>> Further you seem to have missed something crucial in your understanding.
>> As was explained earlier the sysctl was added to ubuntu to allow early
>> adopters to experiment not as a long term way of managing user
>> namespaces.
>>
>>
>> What sounds like a generally useful feature that would cover your use
>> case and many others is a per user limit on the number of user
>> namespaces users may create.
>
> Where that number may be zero?  I don't see how that is really any
> better than a sysctl.  Could you elaborate?
It's a better option because it would allow better configurability. 
Take for example a single user desktop system with some network daemons. 
  On such a system, the actual login used for the graphical environment 
by the user should be allowed at least a few user namespaces, because 
some software depends on them for security (Chrome for example, as well 
as some distro's build systems), but system users should be limited to 
at most one if they need it, and ideally zero, so that remote exploits 
couldn't give access to a user namespace.

Conversely, on a server system, it's not unreasonable to completely 
disable user namespaces for almost everything, except for giving one to 
services that use them properly for sand-boxing.

I will state though that I only feel this is a better solution given 
that two criteria are met:
1. You can set 0 as the limit.
2. You can configure this without needing some special software (this in 
particular means that seccomp is not an option).

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-26 14:46                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-26 14:46 UTC (permalink / raw)
  To: Josh Boyer, Eric W. Biederman
  Cc: Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

On 2016-01-26 09:38, Josh Boyer wrote:
> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>>
>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>
>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>> hole in the permission checks.
>>>>
>>>> To be clear the current patch has my:
>>>>
>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>
>>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>>> fixing the bugs in your patch is distressing.
>>>
>>> I'm not sure where you see me having a "lack of interest". The
>>> existing cap-checking sysctls have a corner-case bug, which is
>>> orthogonal to this change.
>>
>> That certainly doesn't sound like you have any plans to change anything
>> there.
>>
>>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>>
>>> I think you're jumping to conclusions. :)
>>
>> I think I am the maintainer.
>>
>> What you are proposing is very much something that is only of interst to
>> people who are not using user namespaces.  It is fatally flawed as
>> a way to avoid new attack surfaces for people who don't care as the
>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>> as remediation to recommend to people to change if a new user namespace
>> related but is discovered.  Any running process that happens to be
>> created while user namespace creation was enabled will continue to
>> exist.  Effectively a reboot will be required as part of a mitigation.
>> Many sysadmins will get that wrong.
>>
>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>> person has to be entirely too aware of subtlety and nuance to use it
>> effectively.
>
> What you're saying is true for the "oh crap" case of a new userns
> related CVE being found.  However, there is the case where sysadmins
> know for a fact that a set of machines should not allow user
> namespaces to be enabled.  Currently they have 2 choices, 1) use their
> distro kernel as-is, which may not meet their goal of having userns
> disabled, or 2) rebuild their kernel to disable it, which may
> invalidate any support contracts they have.
>
> I tend to agree with you on the lack of value around runtime
> mitigation, but allowing an admin to toggle this as a blatant on/off
> switch on reboot does have value.
>
>>> This feature is already implemented by two distros, and likely wanted
>>> by others. We cannot ignore that. The sysctl default doesn't change
>>> the existing behavior, so this doesn't get in your way at all. Can you
>>> please respond to my earlier email where I rebutted each of your
>>> arguments against it? Just saying "no" and putting words in my mouth
>>> isn't very productive.
>>
>> Calling people who make mistakes insane is not a rebuttal.  In security
>> usability matters, and your sysctl has low usability.
>>
>> Further you seem to have missed something crucial in your understanding.
>> As was explained earlier the sysctl was added to ubuntu to allow early
>> adopters to experiment not as a long term way of managing user
>> namespaces.
>>
>>
>> What sounds like a generally useful feature that would cover your use
>> case and many others is a per user limit on the number of user
>> namespaces users may create.
>
> Where that number may be zero?  I don't see how that is really any
> better than a sysctl.  Could you elaborate?
It's a better option because it would allow better configurability. 
Take for example a single user desktop system with some network daemons. 
  On such a system, the actual login used for the graphical environment 
by the user should be allowed at least a few user namespaces, because 
some software depends on them for security (Chrome for example, as well 
as some distro's build systems), but system users should be limited to 
at most one if they need it, and ideally zero, so that remote exploits 
couldn't give access to a user namespace.

Conversely, on a server system, it's not unreasonable to completely 
disable user namespaces for almost everything, except for giving one to 
services that use them properly for sand-boxing.

I will state though that I only feel this is a better solution given 
that two criteria are met:
1. You can set 0 as the limit.
2. You can configure this without needing some special software (this in 
particular means that seccomp is not an option).

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 14:46                     ` [kernel-hardening] " Austin S. Hemmelgarn
@ 2016-01-26 14:56                       ` Josh Boyer
  -1 siblings, 0 replies; 80+ messages in thread
From: Josh Boyer @ 2016-01-26 14:56 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Eric W. Biederman, Kees Cook, Andy Lutomirski, Andrew Morton,
	Al Viro, Richard Weinberger, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

On Tue, Jan 26, 2016 at 9:46 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-26 09:38, Josh Boyer wrote:
>>
>> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>>
>>> Kees Cook <keescook@chromium.org> writes:
>>>
>>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>>
>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>
>>>>>>
>>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>>> hole in the permission checks.
>>>>>
>>>>>
>>>>> To be clear the current patch has my:
>>>>>
>>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>
>>>>> The code is buggy, and poorly thought through.  Your lack of interest
>>>>> in
>>>>> fixing the bugs in your patch is distressing.
>>>>
>>>>
>>>> I'm not sure where you see me having a "lack of interest". The
>>>> existing cap-checking sysctls have a corner-case bug, which is
>>>> orthogonal to this change.
>>>
>>>
>>> That certainly doesn't sound like you have any plans to change anything
>>> there.
>>>
>>>>> So broken code, not willing to fix.  No. We are not merging this
>>>>> sysctl.
>>>>
>>>>
>>>> I think you're jumping to conclusions. :)
>>>
>>>
>>> I think I am the maintainer.
>>>
>>> What you are proposing is very much something that is only of interst to
>>> people who are not using user namespaces.  It is fatally flawed as
>>> a way to avoid new attack surfaces for people who don't care as the
>>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>>> as remediation to recommend to people to change if a new user namespace
>>> related but is discovered.  Any running process that happens to be
>>> created while user namespace creation was enabled will continue to
>>> exist.  Effectively a reboot will be required as part of a mitigation.
>>> Many sysadmins will get that wrong.
>>>
>>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>>> person has to be entirely too aware of subtlety and nuance to use it
>>> effectively.
>>
>>
>> What you're saying is true for the "oh crap" case of a new userns
>> related CVE being found.  However, there is the case where sysadmins
>> know for a fact that a set of machines should not allow user
>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>> distro kernel as-is, which may not meet their goal of having userns
>> disabled, or 2) rebuild their kernel to disable it, which may
>> invalidate any support contracts they have.
>>
>> I tend to agree with you on the lack of value around runtime
>> mitigation, but allowing an admin to toggle this as a blatant on/off
>> switch on reboot does have value.
>>
>>>> This feature is already implemented by two distros, and likely wanted
>>>> by others. We cannot ignore that. The sysctl default doesn't change
>>>> the existing behavior, so this doesn't get in your way at all. Can you
>>>> please respond to my earlier email where I rebutted each of your
>>>> arguments against it? Just saying "no" and putting words in my mouth
>>>> isn't very productive.
>>>
>>>
>>> Calling people who make mistakes insane is not a rebuttal.  In security
>>> usability matters, and your sysctl has low usability.
>>>
>>> Further you seem to have missed something crucial in your understanding.
>>> As was explained earlier the sysctl was added to ubuntu to allow early
>>> adopters to experiment not as a long term way of managing user
>>> namespaces.
>>>
>>>
>>> What sounds like a generally useful feature that would cover your use
>>> case and many others is a per user limit on the number of user
>>> namespaces users may create.
>>
>>
>> Where that number may be zero?  I don't see how that is really any
>> better than a sysctl.  Could you elaborate?
>
> It's a better option because it would allow better configurability. Take for
> example a single user desktop system with some network daemons.  On such a
> system, the actual login used for the graphical environment by the user
> should be allowed at least a few user namespaces, because some software
> depends on them for security (Chrome for example, as well as some distro's
> build systems), but system users should be limited to at most one if they
> need it, and ideally zero, so that remote exploits couldn't give access to a
> user namespace.
>
> Conversely, on a server system, it's not unreasonable to completely disable
> user namespaces for almost everything, except for giving one to services
> that use them properly for sand-boxing.

OK, so better granularity.  Fine.

> I will state though that I only feel this is a better solution given that
> two criteria are met:
> 1. You can set 0 as the limit.
> 2. You can configure this without needing some special software (this in
> particular means that seccomp is not an option).

I'd have to add 3. You can set a global default for all users that can
be overridden on a per user basis.

Otherwise you play whack-a-mole with every new user or daemon that
adds its own uid.

josh

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-26 14:56                       ` Josh Boyer
  0 siblings, 0 replies; 80+ messages in thread
From: Josh Boyer @ 2016-01-26 14:56 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Eric W. Biederman, Kees Cook, Andy Lutomirski, Andrew Morton,
	Al Viro, Richard Weinberger, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel, kernel-hardening

On Tue, Jan 26, 2016 at 9:46 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-26 09:38, Josh Boyer wrote:
>>
>> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>>
>>> Kees Cook <keescook@chromium.org> writes:
>>>
>>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>>
>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>
>>>>>>
>>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>>> hole in the permission checks.
>>>>>
>>>>>
>>>>> To be clear the current patch has my:
>>>>>
>>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>
>>>>> The code is buggy, and poorly thought through.  Your lack of interest
>>>>> in
>>>>> fixing the bugs in your patch is distressing.
>>>>
>>>>
>>>> I'm not sure where you see me having a "lack of interest". The
>>>> existing cap-checking sysctls have a corner-case bug, which is
>>>> orthogonal to this change.
>>>
>>>
>>> That certainly doesn't sound like you have any plans to change anything
>>> there.
>>>
>>>>> So broken code, not willing to fix.  No. We are not merging this
>>>>> sysctl.
>>>>
>>>>
>>>> I think you're jumping to conclusions. :)
>>>
>>>
>>> I think I am the maintainer.
>>>
>>> What you are proposing is very much something that is only of interst to
>>> people who are not using user namespaces.  It is fatally flawed as
>>> a way to avoid new attack surfaces for people who don't care as the
>>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>>> as remediation to recommend to people to change if a new user namespace
>>> related but is discovered.  Any running process that happens to be
>>> created while user namespace creation was enabled will continue to
>>> exist.  Effectively a reboot will be required as part of a mitigation.
>>> Many sysadmins will get that wrong.
>>>
>>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>>> person has to be entirely too aware of subtlety and nuance to use it
>>> effectively.
>>
>>
>> What you're saying is true for the "oh crap" case of a new userns
>> related CVE being found.  However, there is the case where sysadmins
>> know for a fact that a set of machines should not allow user
>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>> distro kernel as-is, which may not meet their goal of having userns
>> disabled, or 2) rebuild their kernel to disable it, which may
>> invalidate any support contracts they have.
>>
>> I tend to agree with you on the lack of value around runtime
>> mitigation, but allowing an admin to toggle this as a blatant on/off
>> switch on reboot does have value.
>>
>>>> This feature is already implemented by two distros, and likely wanted
>>>> by others. We cannot ignore that. The sysctl default doesn't change
>>>> the existing behavior, so this doesn't get in your way at all. Can you
>>>> please respond to my earlier email where I rebutted each of your
>>>> arguments against it? Just saying "no" and putting words in my mouth
>>>> isn't very productive.
>>>
>>>
>>> Calling people who make mistakes insane is not a rebuttal.  In security
>>> usability matters, and your sysctl has low usability.
>>>
>>> Further you seem to have missed something crucial in your understanding.
>>> As was explained earlier the sysctl was added to ubuntu to allow early
>>> adopters to experiment not as a long term way of managing user
>>> namespaces.
>>>
>>>
>>> What sounds like a generally useful feature that would cover your use
>>> case and many others is a per user limit on the number of user
>>> namespaces users may create.
>>
>>
>> Where that number may be zero?  I don't see how that is really any
>> better than a sysctl.  Could you elaborate?
>
> It's a better option because it would allow better configurability. Take for
> example a single user desktop system with some network daemons.  On such a
> system, the actual login used for the graphical environment by the user
> should be allowed at least a few user namespaces, because some software
> depends on them for security (Chrome for example, as well as some distro's
> build systems), but system users should be limited to at most one if they
> need it, and ideally zero, so that remote exploits couldn't give access to a
> user namespace.
>
> Conversely, on a server system, it's not unreasonable to completely disable
> user namespaces for almost everything, except for giving one to services
> that use them properly for sand-boxing.

OK, so better granularity.  Fine.

> I will state though that I only feel this is a better solution given that
> two criteria are met:
> 1. You can set 0 as the limit.
> 2. You can configure this without needing some special software (this in
> particular means that seccomp is not an option).

I'd have to add 3. You can set a global default for all users that can
be overridden on a per user basis.

Otherwise you play whack-a-mole with every new user or daemon that
adds its own uid.

josh

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26  4:57                 ` [kernel-hardening] " Eric W. Biederman
@ 2016-01-26 16:37                   ` Kees Cook
  -1 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-26 16:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Kostya Serebryany, Alexander Potapenko, Eric Dumazet,
	Sasha Levin, linux-doc, linux-kernel, kernel-hardening

On Mon, Jan 25, 2016 at 8:57 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>> hole in the permission checks.
>>>
>>> To be clear the current patch has my:
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>> fixing the bugs in your patch is distressing.
>>
>> I'm not sure where you see me having a "lack of interest". The
>> existing cap-checking sysctls have a corner-case bug, which is
>> orthogonal to this change.
>
> That certainly doesn't sound like you have any plans to change anything
> there.

Again, not sure why you think that. My primary role in kernel
development is fixing or helping coordinate fixing of security issues
and features. I already acknowledged the issue (it is a corner case,
and no one seems to debate that). I'm working based on priorities; I
have a long list of things to do. :)

>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>
>> I think you're jumping to conclusions. :)
>
> I think I am the maintainer.

Sure, no debate there. In fact, I'm certain you're the maintainer. :)

> What you are proposing is very much something that is only of interst to
> people who are not using user namespaces.  It is fatally flawed as
> a way to avoid new attack surfaces for people who don't care as the
> sysctl leaves user namespaces enabled by default.  It is fatally flawed
> as remediation to recommend to people to change if a new user namespace
> related but is discovered.  Any running process that happens to be
> created while user namespace creation was enabled will continue to
> exist.  Effectively a reboot will be required as part of a mitigation.
> Many sysadmins will get that wrong.

I disagree. The same kinds of issues exist with any of the *_restrict
sysctls: if you turn them on later, things that happened before are
still going to be a problem. You'll have already leaked a kernel base
address, etc. This would be no different.

I'm open to having this sysctl kill all CLONE_NEWUSERed process trees,
if you think that'll be more useful?

> I can't possibly see your sysctl as proposed achieving it's goals.  A
> person has to be entirely too aware of subtlety and nuance to use it
> effectively.

Again, I disagree. There are plenty of people who want to have user ns
disabled. This gives them the knob to do so.

>> This feature is already implemented by two distros, and likely wanted
>> by others. We cannot ignore that. The sysctl default doesn't change
>> the existing behavior, so this doesn't get in your way at all. Can you
>> please respond to my earlier email where I rebutted each of your
>> arguments against it? Just saying "no" and putting words in my mouth
>> isn't very productive.
>
> Calling people who make mistakes insane is not a rebuttal.  In security

I said this:

>> Any admin that decides to just turn off CLONE_NEWUSER in the middle of
>> still using it is insane. I don't think this breeds any false sense of
>> security as most sysctls are set at boot time.

I was arguing that admins that use the sysctl are not going to be the
admins that are using containers already. I didn't mean it as "making
a mistake is insane" but rather "it would appear that a person using
both would be seeking opposing goals".

> usability matters, and your sysctl has low usability.

Unsurprisingly, we disagree here too. This sysctl serves as an attack
surface reduction tool. I never saw it as a way to evict existing
containers.

> Further you seem to have missed something crucial in your understanding.
> As was explained earlier the sysctl was added to ubuntu to allow early
> adopters to experiment not as a long term way of managing user
> namespaces.

It's not about management: the audience of the sysctl is only those
that are not using user namespaces. Providing attack surface reduction
tools to admins is a net win for Linux security as a whole. We both
want the same thing: a safer Linux environment. There's no debate that
having user ns exposes a larger attack surface than not having it.
Being able to disable it for people not interested in using user ns
means a reduction in their attack surface.

> What sounds like a generally useful feature that would cover your use
> case and many others is a per user limit on the number of user
> namespaces users may create.

That sounds fine to me. Are you thinking of a new RLIMIT, or something
else? I don't need a sysctl, I just want a way to effectively disable
user ns.

-Kees


-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-26 16:37                   ` Kees Cook
  0 siblings, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-26 16:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Kostya Serebryany, Alexander Potapenko, Eric Dumazet,
	Sasha Levin, linux-doc, linux-kernel, kernel-hardening

On Mon, Jan 25, 2016 at 8:57 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>> hole in the permission checks.
>>>
>>> To be clear the current patch has my:
>>>
>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>
>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>> fixing the bugs in your patch is distressing.
>>
>> I'm not sure where you see me having a "lack of interest". The
>> existing cap-checking sysctls have a corner-case bug, which is
>> orthogonal to this change.
>
> That certainly doesn't sound like you have any plans to change anything
> there.

Again, not sure why you think that. My primary role in kernel
development is fixing or helping coordinate fixing of security issues
and features. I already acknowledged the issue (it is a corner case,
and no one seems to debate that). I'm working based on priorities; I
have a long list of things to do. :)

>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>
>> I think you're jumping to conclusions. :)
>
> I think I am the maintainer.

Sure, no debate there. In fact, I'm certain you're the maintainer. :)

> What you are proposing is very much something that is only of interst to
> people who are not using user namespaces.  It is fatally flawed as
> a way to avoid new attack surfaces for people who don't care as the
> sysctl leaves user namespaces enabled by default.  It is fatally flawed
> as remediation to recommend to people to change if a new user namespace
> related but is discovered.  Any running process that happens to be
> created while user namespace creation was enabled will continue to
> exist.  Effectively a reboot will be required as part of a mitigation.
> Many sysadmins will get that wrong.

I disagree. The same kinds of issues exist with any of the *_restrict
sysctls: if you turn them on later, things that happened before are
still going to be a problem. You'll have already leaked a kernel base
address, etc. This would be no different.

I'm open to having this sysctl kill all CLONE_NEWUSERed process trees,
if you think that'll be more useful?

> I can't possibly see your sysctl as proposed achieving it's goals.  A
> person has to be entirely too aware of subtlety and nuance to use it
> effectively.

Again, I disagree. There are plenty of people who want to have user ns
disabled. This gives them the knob to do so.

>> This feature is already implemented by two distros, and likely wanted
>> by others. We cannot ignore that. The sysctl default doesn't change
>> the existing behavior, so this doesn't get in your way at all. Can you
>> please respond to my earlier email where I rebutted each of your
>> arguments against it? Just saying "no" and putting words in my mouth
>> isn't very productive.
>
> Calling people who make mistakes insane is not a rebuttal.  In security

I said this:

>> Any admin that decides to just turn off CLONE_NEWUSER in the middle of
>> still using it is insane. I don't think this breeds any false sense of
>> security as most sysctls are set at boot time.

I was arguing that admins that use the sysctl are not going to be the
admins that are using containers already. I didn't mean it as "making
a mistake is insane" but rather "it would appear that a person using
both would be seeking opposing goals".

> usability matters, and your sysctl has low usability.

Unsurprisingly, we disagree here too. This sysctl serves as an attack
surface reduction tool. I never saw it as a way to evict existing
containers.

> Further you seem to have missed something crucial in your understanding.
> As was explained earlier the sysctl was added to ubuntu to allow early
> adopters to experiment not as a long term way of managing user
> namespaces.

It's not about management: the audience of the sysctl is only those
that are not using user namespaces. Providing attack surface reduction
tools to admins is a net win for Linux security as a whole. We both
want the same thing: a safer Linux environment. There's no debate that
having user ns exposes a larger attack surface than not having it.
Being able to disable it for people not interested in using user ns
means a reduction in their attack surface.

> What sounds like a generally useful feature that would cover your use
> case and many others is a per user limit on the number of user
> namespaces users may create.

That sounds fine to me. Are you thinking of a new RLIMIT, or something
else? I don't need a sysctl, I just want a way to effectively disable
user ns.

-Kees


-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 14:38                   ` [kernel-hardening] " Josh Boyer
  (?)
  (?)
@ 2016-01-26 17:15                   ` Serge Hallyn
  2016-01-26 18:09                     ` Austin S. Hemmelgarn
                                       ` (2 more replies)
  -1 siblings, 3 replies; 80+ messages in thread
From: Serge Hallyn @ 2016-01-26 17:15 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Eric W. Biederman, Kees Cook, Andy Lutomirski, Andrew Morton,
	Al Viro, Richard Weinberger, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

Quoting Josh Boyer (jwboyer@fedoraproject.org):
> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > Kees Cook <keescook@chromium.org> writes:
> >
> >> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >>> Kees Cook <keescook@chromium.org> writes:
> >>>>
> >>>> Well, I don't know about less weird, but it would leave a unneeded
> >>>> hole in the permission checks.
> >>>
> >>> To be clear the current patch has my:
> >>>
> >>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >>>
> >>> The code is buggy, and poorly thought through.  Your lack of interest in
> >>> fixing the bugs in your patch is distressing.
> >>
> >> I'm not sure where you see me having a "lack of interest". The
> >> existing cap-checking sysctls have a corner-case bug, which is
> >> orthogonal to this change.
> >
> > That certainly doesn't sound like you have any plans to change anything
> > there.
> >
> >>> So broken code, not willing to fix.  No. We are not merging this sysctl.
> >>
> >> I think you're jumping to conclusions. :)
> >
> > I think I am the maintainer.
> >
> > What you are proposing is very much something that is only of interst to
> > people who are not using user namespaces.  It is fatally flawed as
> > a way to avoid new attack surfaces for people who don't care as the
> > sysctl leaves user namespaces enabled by default.  It is fatally flawed
> > as remediation to recommend to people to change if a new user namespace
> > related but is discovered.  Any running process that happens to be
> > created while user namespace creation was enabled will continue to
> > exist.  Effectively a reboot will be required as part of a mitigation.
> > Many sysadmins will get that wrong.
> >
> > I can't possibly see your sysctl as proposed achieving it's goals.  A
> > person has to be entirely too aware of subtlety and nuance to use it
> > effectively.
> 
> What you're saying is true for the "oh crap" case of a new userns
> related CVE being found.  However, there is the case where sysadmins
> know for a fact that a set of machines should not allow user
> namespaces to be enabled.  Currently they have 2 choices, 1) use their

Hi - can you give a specific example of this?  (Where users really should
not be able to use them - not where they might not need them)  I think
it'll help the discussion tremendously.  Because so far the only good
arguments I've seen have been about actual bugs in the user namespaces,
which would not warrant a designed-in permanent disable switch.  If
there are good use cases where such a disable switch will always be
needed (and compiling out can't satisfy) that'd be helpful.

thanks,
-serge

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 14:56                       ` [kernel-hardening] " Josh Boyer
  (?)
@ 2016-01-26 17:20                       ` Serge Hallyn
  2016-01-26 19:56                         ` Josh Boyer
  -1 siblings, 1 reply; 80+ messages in thread
From: Serge Hallyn @ 2016-01-26 17:20 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Austin S. Hemmelgarn, Eric W. Biederman, Kees Cook,
	Andy Lutomirski, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

Quoting Josh Boyer (jwboyer@fedoraproject.org):
> On Tue, Jan 26, 2016 at 9:46 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > On 2016-01-26 09:38, Josh Boyer wrote:
> >>
> >> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
> >> <ebiederm@xmission.com> wrote:
> >>>
> >>> Kees Cook <keescook@chromium.org> writes:
> >>>
> >>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> >>>> <ebiederm@xmission.com> wrote:
> >>>>>
> >>>>> Kees Cook <keescook@chromium.org> writes:
> >>>>>>
> >>>>>>
> >>>>>> Well, I don't know about less weird, but it would leave a unneeded
> >>>>>> hole in the permission checks.
> >>>>>
> >>>>>
> >>>>> To be clear the current patch has my:
> >>>>>
> >>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >>>>>
> >>>>> The code is buggy, and poorly thought through.  Your lack of interest
> >>>>> in
> >>>>> fixing the bugs in your patch is distressing.
> >>>>
> >>>>
> >>>> I'm not sure where you see me having a "lack of interest". The
> >>>> existing cap-checking sysctls have a corner-case bug, which is
> >>>> orthogonal to this change.
> >>>
> >>>
> >>> That certainly doesn't sound like you have any plans to change anything
> >>> there.
> >>>
> >>>>> So broken code, not willing to fix.  No. We are not merging this
> >>>>> sysctl.
> >>>>
> >>>>
> >>>> I think you're jumping to conclusions. :)
> >>>
> >>>
> >>> I think I am the maintainer.
> >>>
> >>> What you are proposing is very much something that is only of interst to
> >>> people who are not using user namespaces.  It is fatally flawed as
> >>> a way to avoid new attack surfaces for people who don't care as the
> >>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
> >>> as remediation to recommend to people to change if a new user namespace
> >>> related but is discovered.  Any running process that happens to be
> >>> created while user namespace creation was enabled will continue to
> >>> exist.  Effectively a reboot will be required as part of a mitigation.
> >>> Many sysadmins will get that wrong.
> >>>
> >>> I can't possibly see your sysctl as proposed achieving it's goals.  A
> >>> person has to be entirely too aware of subtlety and nuance to use it
> >>> effectively.
> >>
> >>
> >> What you're saying is true for the "oh crap" case of a new userns
> >> related CVE being found.  However, there is the case where sysadmins
> >> know for a fact that a set of machines should not allow user
> >> namespaces to be enabled.  Currently they have 2 choices, 1) use their
> >> distro kernel as-is, which may not meet their goal of having userns
> >> disabled, or 2) rebuild their kernel to disable it, which may
> >> invalidate any support contracts they have.
> >>
> >> I tend to agree with you on the lack of value around runtime
> >> mitigation, but allowing an admin to toggle this as a blatant on/off
> >> switch on reboot does have value.
> >>
> >>>> This feature is already implemented by two distros, and likely wanted
> >>>> by others. We cannot ignore that. The sysctl default doesn't change
> >>>> the existing behavior, so this doesn't get in your way at all. Can you
> >>>> please respond to my earlier email where I rebutted each of your
> >>>> arguments against it? Just saying "no" and putting words in my mouth
> >>>> isn't very productive.
> >>>
> >>>
> >>> Calling people who make mistakes insane is not a rebuttal.  In security
> >>> usability matters, and your sysctl has low usability.
> >>>
> >>> Further you seem to have missed something crucial in your understanding.
> >>> As was explained earlier the sysctl was added to ubuntu to allow early
> >>> adopters to experiment not as a long term way of managing user
> >>> namespaces.
> >>>
> >>>
> >>> What sounds like a generally useful feature that would cover your use
> >>> case and many others is a per user limit on the number of user
> >>> namespaces users may create.
> >>
> >>
> >> Where that number may be zero?  I don't see how that is really any
> >> better than a sysctl.  Could you elaborate?
> >
> > It's a better option because it would allow better configurability. Take for
> > example a single user desktop system with some network daemons.  On such a
> > system, the actual login used for the graphical environment by the user
> > should be allowed at least a few user namespaces, because some software
> > depends on them for security (Chrome for example, as well as some distro's
> > build systems), but system users should be limited to at most one if they
> > need it, and ideally zero, so that remote exploits couldn't give access to a
> > user namespace.
> >
> > Conversely, on a server system, it's not unreasonable to completely disable
> > user namespaces for almost everything, except for giving one to services
> > that use them properly for sand-boxing.
> 
> OK, so better granularity.  Fine.
> 
> > I will state though that I only feel this is a better solution given that
> > two criteria are met:
> > 1. You can set 0 as the limit.
> > 2. You can configure this without needing some special software (this in
> > particular means that seccomp is not an option).
> 
> I'd have to add 3. You can set a global default for all users that can
> be overridden on a per user basis.
> 
> Otherwise you play whack-a-mole with every new user or daemon that
> adds its own uid.

Given that you want per-user, does a per-uid rlimit, which could be -1
(unlimited) by default, inherited for all uids mapped into a namespace
owned by the uid, and which can be set (only reduced) by pam on login,
make sense?

I'm still not actually seeing the value of this apart from another knob
to prevent kernel memory abuse.  But at least it does kill two birds
with one stone (also satisfying people who want it turned off altogether).
Is there a third use case for limiting number of user namespaces per uid?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 17:15                   ` Serge Hallyn
@ 2016-01-26 18:09                     ` Austin S. Hemmelgarn
  2016-01-26 18:27                       ` Andy Lutomirski
  2016-01-26 23:13                     ` Kees Cook
  2016-01-26 23:47                     ` Josh Boyer
  2 siblings, 1 reply; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-26 18:09 UTC (permalink / raw)
  To: Serge Hallyn, kernel-hardening
  Cc: Eric W. Biederman, Kees Cook, Andy Lutomirski, Andrew Morton,
	Al Viro, Richard Weinberger, Robert Święcki,
	Dmitry Vyukov, David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

On 2016-01-26 12:15, Serge Hallyn wrote:
> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Kees Cook <keescook@chromium.org> writes:
>>>
>>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>
>>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>>> hole in the permission checks.
>>>>>
>>>>> To be clear the current patch has my:
>>>>>
>>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>
>>>>> The code is buggy, and poorly thought through.  Your lack of interest in
>>>>> fixing the bugs in your patch is distressing.
>>>>
>>>> I'm not sure where you see me having a "lack of interest". The
>>>> existing cap-checking sysctls have a corner-case bug, which is
>>>> orthogonal to this change.
>>>
>>> That certainly doesn't sound like you have any plans to change anything
>>> there.
>>>
>>>>> So broken code, not willing to fix.  No. We are not merging this sysctl.
>>>>
>>>> I think you're jumping to conclusions. :)
>>>
>>> I think I am the maintainer.
>>>
>>> What you are proposing is very much something that is only of interst to
>>> people who are not using user namespaces.  It is fatally flawed as
>>> a way to avoid new attack surfaces for people who don't care as the
>>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>>> as remediation to recommend to people to change if a new user namespace
>>> related but is discovered.  Any running process that happens to be
>>> created while user namespace creation was enabled will continue to
>>> exist.  Effectively a reboot will be required as part of a mitigation.
>>> Many sysadmins will get that wrong.
>>>
>>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>>> person has to be entirely too aware of subtlety and nuance to use it
>>> effectively.
>>
>> What you're saying is true for the "oh crap" case of a new userns
>> related CVE being found.  However, there is the case where sysadmins
>> know for a fact that a set of machines should not allow user
>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>
> Hi - can you give a specific example of this?  (Where users really should
> not be able to use them - not where they might not need them)  I think
> it'll help the discussion tremendously.  Because so far the only good
> arguments I've seen have been about actual bugs in the user namespaces,
> which would not warrant a designed-in permanent disable switch.  If
> there are good use cases where such a disable switch will always be
> needed (and compiling out can't satisfy) that'd be helpful.
In general, if a particular daemon provides a network service and does 
not use user namespaces for sand-boxing, it should not be allowed to use 
user namespaces, because those then become something else to potentially 
land an exploit through.  ntpd, postfix, and most other regularly used 
network servers fall into this category.

If you're hosting a shared system providing terminal server like usage 
where the users actually have shell access, then they probably should 
not be able to use user namespaces on the server.

In essence, if there are cases where you know for certain that users do 
not need user namespaces, they should not be allowed to use them.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 18:09                     ` Austin S. Hemmelgarn
@ 2016-01-26 18:27                       ` Andy Lutomirski
  2016-01-26 18:45                         ` Austin S. Hemmelgarn
  2016-01-26 23:15                         ` Kees Cook
  0 siblings, 2 replies; 80+ messages in thread
From: Andy Lutomirski @ 2016-01-26 18:27 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Serge Hallyn, kernel-hardening, Eric W. Biederman, Kees Cook,
	Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

On Tue, Jan 26, 2016 at 10:09 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-26 12:15, Serge Hallyn wrote:
>>
>> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>>>
>>> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>>
>>>> Kees Cook <keescook@chromium.org> writes:
>>>>
>>>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>>>> <ebiederm@xmission.com> wrote:
>>>>>>
>>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>>
>>>>>>>
>>>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>>>> hole in the permission checks.
>>>>>>
>>>>>>
>>>>>> To be clear the current patch has my:
>>>>>>
>>>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>>
>>>>>> The code is buggy, and poorly thought through.  Your lack of interest
>>>>>> in
>>>>>> fixing the bugs in your patch is distressing.
>>>>>
>>>>>
>>>>> I'm not sure where you see me having a "lack of interest". The
>>>>> existing cap-checking sysctls have a corner-case bug, which is
>>>>> orthogonal to this change.
>>>>
>>>>
>>>> That certainly doesn't sound like you have any plans to change anything
>>>> there.
>>>>
>>>>>> So broken code, not willing to fix.  No. We are not merging this
>>>>>> sysctl.
>>>>>
>>>>>
>>>>> I think you're jumping to conclusions. :)
>>>>
>>>>
>>>> I think I am the maintainer.
>>>>
>>>> What you are proposing is very much something that is only of interst to
>>>> people who are not using user namespaces.  It is fatally flawed as
>>>> a way to avoid new attack surfaces for people who don't care as the
>>>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>>>> as remediation to recommend to people to change if a new user namespace
>>>> related but is discovered.  Any running process that happens to be
>>>> created while user namespace creation was enabled will continue to
>>>> exist.  Effectively a reboot will be required as part of a mitigation.
>>>> Many sysadmins will get that wrong.
>>>>
>>>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>>>> person has to be entirely too aware of subtlety and nuance to use it
>>>> effectively.
>>>
>>>
>>> What you're saying is true for the "oh crap" case of a new userns
>>> related CVE being found.  However, there is the case where sysadmins
>>> know for a fact that a set of machines should not allow user
>>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>>
>>
>> Hi - can you give a specific example of this?  (Where users really should
>> not be able to use them - not where they might not need them)  I think
>> it'll help the discussion tremendously.  Because so far the only good
>> arguments I've seen have been about actual bugs in the user namespaces,
>> which would not warrant a designed-in permanent disable switch.  If
>> there are good use cases where such a disable switch will always be
>> needed (and compiling out can't satisfy) that'd be helpful.
>
> In general, if a particular daemon provides a network service and does not
> use user namespaces for sand-boxing, it should not be allowed to use user
> namespaces, because those then become something else to potentially land an
> exploit through.  ntpd, postfix, and most other regularly used network
> servers fall into this category.

seccomp handles this issue quite nicely.

>
> If you're hosting a shared system providing terminal server like usage where
> the users actually have shell access, then they probably should not be able
> to use user namespaces on the server.
>

Au contraire.  If they have user ns access, then can sandbox their own programs.

--Andy

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 18:27                       ` Andy Lutomirski
@ 2016-01-26 18:45                         ` Austin S. Hemmelgarn
  2016-01-26 23:15                         ` Kees Cook
  1 sibling, 0 replies; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-26 18:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Serge Hallyn, kernel-hardening, Eric W. Biederman, Kees Cook,
	Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

On 2016-01-26 13:27, Andy Lutomirski wrote:
> On Tue, Jan 26, 2016 at 10:09 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-01-26 12:15, Serge Hallyn wrote:
>>>
>>> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>>>>
>>>> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>>>> <ebiederm@xmission.com> wrote:
>>>>>
>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>
>>>>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>
>>>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>>>
>>>>>>>>
>>>>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>>>>> hole in the permission checks.
>>>>>>>
>>>>>>>
>>>>>>> To be clear the current patch has my:
>>>>>>>
>>>>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>>>
>>>>>>> The code is buggy, and poorly thought through.  Your lack of interest
>>>>>>> in
>>>>>>> fixing the bugs in your patch is distressing.
>>>>>>
>>>>>>
>>>>>> I'm not sure where you see me having a "lack of interest". The
>>>>>> existing cap-checking sysctls have a corner-case bug, which is
>>>>>> orthogonal to this change.
>>>>>
>>>>>
>>>>> That certainly doesn't sound like you have any plans to change anything
>>>>> there.
>>>>>
>>>>>>> So broken code, not willing to fix.  No. We are not merging this
>>>>>>> sysctl.
>>>>>>
>>>>>>
>>>>>> I think you're jumping to conclusions. :)
>>>>>
>>>>>
>>>>> I think I am the maintainer.
>>>>>
>>>>> What you are proposing is very much something that is only of interst to
>>>>> people who are not using user namespaces.  It is fatally flawed as
>>>>> a way to avoid new attack surfaces for people who don't care as the
>>>>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>>>>> as remediation to recommend to people to change if a new user namespace
>>>>> related but is discovered.  Any running process that happens to be
>>>>> created while user namespace creation was enabled will continue to
>>>>> exist.  Effectively a reboot will be required as part of a mitigation.
>>>>> Many sysadmins will get that wrong.
>>>>>
>>>>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>>>>> person has to be entirely too aware of subtlety and nuance to use it
>>>>> effectively.
>>>>
>>>>
>>>> What you're saying is true for the "oh crap" case of a new userns
>>>> related CVE being found.  However, there is the case where sysadmins
>>>> know for a fact that a set of machines should not allow user
>>>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>>>
>>>
>>> Hi - can you give a specific example of this?  (Where users really should
>>> not be able to use them - not where they might not need them)  I think
>>> it'll help the discussion tremendously.  Because so far the only good
>>> arguments I've seen have been about actual bugs in the user namespaces,
>>> which would not warrant a designed-in permanent disable switch.  If
>>> there are good use cases where such a disable switch will always be
>>> needed (and compiling out can't satisfy) that'd be helpful.
>>
>> In general, if a particular daemon provides a network service and does not
>> use user namespaces for sand-boxing, it should not be allowed to use user
>> namespaces, because those then become something else to potentially land an
>> exploit through.  ntpd, postfix, and most other regularly used network
>> servers fall into this category.
>
> seccomp handles this issue quite nicely.
>
seccomp is a pain to set up given current tooling, and isn't supported 
by most server software.  Unless there's some tool out there to hook 
arbitrary seccomp filters into an arbitrary program, then this isn't an 
option for most people.
>>
>> If you're hosting a shared system providing terminal server like usage where
>> the users actually have shell access, then they probably should not be able
>> to use user namespaces on the server.
>>
>
> Au contraire.  If they have user ns access, then can sandbox their own programs.
I should clarify, by 'terminal server like usage' I meant thin client 
setups, not Sun Ray or Windows style terminal servers.  IOW, a file 
server that provides a few extra services (DHCP, TFTP and similar) and 
only allows shell access so users can move around their own files 
directly on the server.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 17:20                       ` Serge Hallyn
@ 2016-01-26 19:56                         ` Josh Boyer
  2016-01-26 20:11                           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 80+ messages in thread
From: Josh Boyer @ 2016-01-26 19:56 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: kernel-hardening, Austin S. Hemmelgarn, Eric W. Biederman,
	Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

On Tue, Jan 26, 2016 at 12:20 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>> On Tue, Jan 26, 2016 at 9:46 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>> > On 2016-01-26 09:38, Josh Boyer wrote:
>> >>
>> >> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>> >> <ebiederm@xmission.com> wrote:
>> >>>
>> >>> Kees Cook <keescook@chromium.org> writes:
>> >>>
>> >>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>> >>>> <ebiederm@xmission.com> wrote:
>> >>>>>
>> >>>>> Kees Cook <keescook@chromium.org> writes:
>> >>>>>>
>> >>>>>>
>> >>>>>> Well, I don't know about less weird, but it would leave a unneeded
>> >>>>>> hole in the permission checks.
>> >>>>>
>> >>>>>
>> >>>>> To be clear the current patch has my:
>> >>>>>
>> >>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> >>>>>
>> >>>>> The code is buggy, and poorly thought through.  Your lack of interest
>> >>>>> in
>> >>>>> fixing the bugs in your patch is distressing.
>> >>>>
>> >>>>
>> >>>> I'm not sure where you see me having a "lack of interest". The
>> >>>> existing cap-checking sysctls have a corner-case bug, which is
>> >>>> orthogonal to this change.
>> >>>
>> >>>
>> >>> That certainly doesn't sound like you have any plans to change anything
>> >>> there.
>> >>>
>> >>>>> So broken code, not willing to fix.  No. We are not merging this
>> >>>>> sysctl.
>> >>>>
>> >>>>
>> >>>> I think you're jumping to conclusions. :)
>> >>>
>> >>>
>> >>> I think I am the maintainer.
>> >>>
>> >>> What you are proposing is very much something that is only of interst to
>> >>> people who are not using user namespaces.  It is fatally flawed as
>> >>> a way to avoid new attack surfaces for people who don't care as the
>> >>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>> >>> as remediation to recommend to people to change if a new user namespace
>> >>> related but is discovered.  Any running process that happens to be
>> >>> created while user namespace creation was enabled will continue to
>> >>> exist.  Effectively a reboot will be required as part of a mitigation.
>> >>> Many sysadmins will get that wrong.
>> >>>
>> >>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>> >>> person has to be entirely too aware of subtlety and nuance to use it
>> >>> effectively.
>> >>
>> >>
>> >> What you're saying is true for the "oh crap" case of a new userns
>> >> related CVE being found.  However, there is the case where sysadmins
>> >> know for a fact that a set of machines should not allow user
>> >> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>> >> distro kernel as-is, which may not meet their goal of having userns
>> >> disabled, or 2) rebuild their kernel to disable it, which may
>> >> invalidate any support contracts they have.
>> >>
>> >> I tend to agree with you on the lack of value around runtime
>> >> mitigation, but allowing an admin to toggle this as a blatant on/off
>> >> switch on reboot does have value.
>> >>
>> >>>> This feature is already implemented by two distros, and likely wanted
>> >>>> by others. We cannot ignore that. The sysctl default doesn't change
>> >>>> the existing behavior, so this doesn't get in your way at all. Can you
>> >>>> please respond to my earlier email where I rebutted each of your
>> >>>> arguments against it? Just saying "no" and putting words in my mouth
>> >>>> isn't very productive.
>> >>>
>> >>>
>> >>> Calling people who make mistakes insane is not a rebuttal.  In security
>> >>> usability matters, and your sysctl has low usability.
>> >>>
>> >>> Further you seem to have missed something crucial in your understanding.
>> >>> As was explained earlier the sysctl was added to ubuntu to allow early
>> >>> adopters to experiment not as a long term way of managing user
>> >>> namespaces.
>> >>>
>> >>>
>> >>> What sounds like a generally useful feature that would cover your use
>> >>> case and many others is a per user limit on the number of user
>> >>> namespaces users may create.
>> >>
>> >>
>> >> Where that number may be zero?  I don't see how that is really any
>> >> better than a sysctl.  Could you elaborate?
>> >
>> > It's a better option because it would allow better configurability. Take for
>> > example a single user desktop system with some network daemons.  On such a
>> > system, the actual login used for the graphical environment by the user
>> > should be allowed at least a few user namespaces, because some software
>> > depends on them for security (Chrome for example, as well as some distro's
>> > build systems), but system users should be limited to at most one if they
>> > need it, and ideally zero, so that remote exploits couldn't give access to a
>> > user namespace.
>> >
>> > Conversely, on a server system, it's not unreasonable to completely disable
>> > user namespaces for almost everything, except for giving one to services
>> > that use them properly for sand-boxing.
>>
>> OK, so better granularity.  Fine.
>>
>> > I will state though that I only feel this is a better solution given that
>> > two criteria are met:
>> > 1. You can set 0 as the limit.
>> > 2. You can configure this without needing some special software (this in
>> > particular means that seccomp is not an option).
>>
>> I'd have to add 3. You can set a global default for all users that can
>> be overridden on a per user basis.
>>
>> Otherwise you play whack-a-mole with every new user or daemon that
>> adds its own uid.
>
> Given that you want per-user, does a per-uid rlimit, which could be -1
> (unlimited) by default, inherited for all uids mapped into a namespace
> owned by the uid, and which can be set (only reduced) by pam on login,
> make sense?

To clarify, I don't actively want per-user.  Eric suggested it and I'm
thinking through it from a theoretical perspective.  I'd likely be
fine with a big ban-hammer that sysadmins can set, but that doesn't
mean I'm opposed to something more flexible if it makes sense.

I'm not sure if rlimit makes sense in the way you describe it.  I
don't care about uids within an existing user namespace really.  That
is icing on the cake.  I was looking for something that would disallow
uids to create user namespaces to begin with (so inheritance wouldn't
matter).  If rlimit is that mechanism, then I guess.  Seems like an
odd fit though, particularly if you tie it to pam.

> I'm still not actually seeing the value of this apart from another knob
> to prevent kernel memory abuse.  But at least it does kill two birds
> with one stone (also satisfying people who want it turned off altogether).
> Is there a third use case for limiting number of user namespaces per uid?

Not that I can think of.

josh

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 19:56                         ` Josh Boyer
@ 2016-01-26 20:11                           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-26 20:11 UTC (permalink / raw)
  To: Josh Boyer, Serge Hallyn
  Cc: kernel-hardening, Eric W. Biederman, Kees Cook, Andy Lutomirski,
	Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

On 2016-01-26 14:56, Josh Boyer wrote:
> On Tue, Jan 26, 2016 at 12:20 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
>> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>>> On Tue, Jan 26, 2016 at 9:46 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>> On 2016-01-26 09:38, Josh Boyer wrote:
>>>>>
>>>>> On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
>>>>> <ebiederm@xmission.com> wrote:
>>>>>>
>>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>
>>>>>>> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
>>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>>
>>>>>>>> Kees Cook <keescook@chromium.org> writes:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well, I don't know about less weird, but it would leave a unneeded
>>>>>>>>> hole in the permission checks.
>>>>>>>>
>>>>>>>>
>>>>>>>> To be clear the current patch has my:
>>>>>>>>
>>>>>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>>>>>>
>>>>>>>> The code is buggy, and poorly thought through.  Your lack of interest
>>>>>>>> in
>>>>>>>> fixing the bugs in your patch is distressing.
>>>>>>>
>>>>>>>
>>>>>>> I'm not sure where you see me having a "lack of interest". The
>>>>>>> existing cap-checking sysctls have a corner-case bug, which is
>>>>>>> orthogonal to this change.
>>>>>>
>>>>>>
>>>>>> That certainly doesn't sound like you have any plans to change anything
>>>>>> there.
>>>>>>
>>>>>>>> So broken code, not willing to fix.  No. We are not merging this
>>>>>>>> sysctl.
>>>>>>>
>>>>>>>
>>>>>>> I think you're jumping to conclusions. :)
>>>>>>
>>>>>>
>>>>>> I think I am the maintainer.
>>>>>>
>>>>>> What you are proposing is very much something that is only of interst to
>>>>>> people who are not using user namespaces.  It is fatally flawed as
>>>>>> a way to avoid new attack surfaces for people who don't care as the
>>>>>> sysctl leaves user namespaces enabled by default.  It is fatally flawed
>>>>>> as remediation to recommend to people to change if a new user namespace
>>>>>> related but is discovered.  Any running process that happens to be
>>>>>> created while user namespace creation was enabled will continue to
>>>>>> exist.  Effectively a reboot will be required as part of a mitigation.
>>>>>> Many sysadmins will get that wrong.
>>>>>>
>>>>>> I can't possibly see your sysctl as proposed achieving it's goals.  A
>>>>>> person has to be entirely too aware of subtlety and nuance to use it
>>>>>> effectively.
>>>>>
>>>>>
>>>>> What you're saying is true for the "oh crap" case of a new userns
>>>>> related CVE being found.  However, there is the case where sysadmins
>>>>> know for a fact that a set of machines should not allow user
>>>>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>>>>> distro kernel as-is, which may not meet their goal of having userns
>>>>> disabled, or 2) rebuild their kernel to disable it, which may
>>>>> invalidate any support contracts they have.
>>>>>
>>>>> I tend to agree with you on the lack of value around runtime
>>>>> mitigation, but allowing an admin to toggle this as a blatant on/off
>>>>> switch on reboot does have value.
>>>>>
>>>>>>> This feature is already implemented by two distros, and likely wanted
>>>>>>> by others. We cannot ignore that. The sysctl default doesn't change
>>>>>>> the existing behavior, so this doesn't get in your way at all. Can you
>>>>>>> please respond to my earlier email where I rebutted each of your
>>>>>>> arguments against it? Just saying "no" and putting words in my mouth
>>>>>>> isn't very productive.
>>>>>>
>>>>>>
>>>>>> Calling people who make mistakes insane is not a rebuttal.  In security
>>>>>> usability matters, and your sysctl has low usability.
>>>>>>
>>>>>> Further you seem to have missed something crucial in your understanding.
>>>>>> As was explained earlier the sysctl was added to ubuntu to allow early
>>>>>> adopters to experiment not as a long term way of managing user
>>>>>> namespaces.
>>>>>>
>>>>>>
>>>>>> What sounds like a generally useful feature that would cover your use
>>>>>> case and many others is a per user limit on the number of user
>>>>>> namespaces users may create.
>>>>>
>>>>>
>>>>> Where that number may be zero?  I don't see how that is really any
>>>>> better than a sysctl.  Could you elaborate?
>>>>
>>>> It's a better option because it would allow better configurability. Take for
>>>> example a single user desktop system with some network daemons.  On such a
>>>> system, the actual login used for the graphical environment by the user
>>>> should be allowed at least a few user namespaces, because some software
>>>> depends on them for security (Chrome for example, as well as some distro's
>>>> build systems), but system users should be limited to at most one if they
>>>> need it, and ideally zero, so that remote exploits couldn't give access to a
>>>> user namespace.
>>>>
>>>> Conversely, on a server system, it's not unreasonable to completely disable
>>>> user namespaces for almost everything, except for giving one to services
>>>> that use them properly for sand-boxing.
>>>
>>> OK, so better granularity.  Fine.
>>>
>>>> I will state though that I only feel this is a better solution given that
>>>> two criteria are met:
>>>> 1. You can set 0 as the limit.
>>>> 2. You can configure this without needing some special software (this in
>>>> particular means that seccomp is not an option).
>>>
>>> I'd have to add 3. You can set a global default for all users that can
>>> be overridden on a per user basis.
>>>
>>> Otherwise you play whack-a-mole with every new user or daemon that
>>> adds its own uid.
>>
>> Given that you want per-user, does a per-uid rlimit, which could be -1
>> (unlimited) by default, inherited for all uids mapped into a namespace
>> owned by the uid, and which can be set (only reduced) by pam on login,
>> make sense?
>
> To clarify, I don't actively want per-user.  Eric suggested it and I'm
> thinking through it from a theoretical perspective.  I'd likely be
> fine with a big ban-hammer that sysadmins can set, but that doesn't
> mean I'm opposed to something more flexible if it makes sense.
>
> I'm not sure if rlimit makes sense in the way you describe it.  I
> don't care about uids within an existing user namespace really.  That
> is icing on the cake.  I was looking for something that would disallow
> uids to create user namespaces to begin with (so inheritance wouldn't
> matter).  If rlimit is that mechanism, then I guess.  Seems like an
> odd fit though, particularly if you tie it to pam.
The PAM connection would be a side effect of it being an rlimit, not 
something by itself.  It's not used on a lot of smaller systems because 
Linux is not used as much as a time sharing system, but part of the 
purpose of PAM was to be able to set rlimits and similar things on new 
sessions before the user could do anything.  This is still used today, 
and /etc/security/limits.conf can be found on any modern Linux system 
which uses PAM.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 17:15                   ` Serge Hallyn
  2016-01-26 18:09                     ` Austin S. Hemmelgarn
@ 2016-01-26 23:13                     ` Kees Cook
  2016-01-27 10:27                       ` Eric W. Biederman
  2016-01-26 23:47                     ` Josh Boyer
  2 siblings, 1 reply; 80+ messages in thread
From: Kees Cook @ 2016-01-26 23:13 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: kernel-hardening, Eric W. Biederman, Andy Lutomirski,
	Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Kostya Serebryany, Alexander Potapenko, Eric Dumazet,
	Sasha Levin, linux-doc, linux-kernel

On Tue, Jan 26, 2016 at 9:15 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>> What you're saying is true for the "oh crap" case of a new userns
>> related CVE being found.  However, there is the case where sysadmins
>> know for a fact that a set of machines should not allow user
>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>
> Hi - can you give a specific example of this?  (Where users really should
> not be able to use them - not where they might not need them)  I think
> it'll help the discussion tremendously.  Because so far the only good
> arguments I've seen have been about actual bugs in the user namespaces,
> which would not warrant a designed-in permanent disable switch.  If
> there are good use cases where such a disable switch will always be
> needed (and compiling out can't satisfy) that'd be helpful.

My example is a machine in a colo rack serving web pages. A site gets
attacked, and www-data uses user namespaces to continue their attack
to gain root privileges.

The admin of such a machine could have disabled userns months earlier
and limited the scope of the attack.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 18:27                       ` Andy Lutomirski
  2016-01-26 18:45                         ` Austin S. Hemmelgarn
@ 2016-01-26 23:15                         ` Kees Cook
  1 sibling, 0 replies; 80+ messages in thread
From: Kees Cook @ 2016-01-26 23:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Austin S. Hemmelgarn, Serge Hallyn, kernel-hardening,
	Eric W. Biederman, Andrew Morton, Al Viro, Richard Weinberger,
	Robert Święcki, Dmitry Vyukov, David Howells,
	Miklos Szeredi, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

On Tue, Jan 26, 2016 at 10:27 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jan 26, 2016 at 10:09 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-01-26 12:15, Serge Hallyn wrote:
>>>
>>> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>>>> What you're saying is true for the "oh crap" case of a new userns
>>>> related CVE being found.  However, there is the case where sysadmins
>>>> know for a fact that a set of machines should not allow user
>>>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>>>
>>>
>>> Hi - can you give a specific example of this?  (Where users really should
>>> not be able to use them - not where they might not need them)  I think
>>> it'll help the discussion tremendously.  Because so far the only good
>>> arguments I've seen have been about actual bugs in the user namespaces,
>>> which would not warrant a designed-in permanent disable switch.  If
>>> there are good use cases where such a disable switch will always be
>>> needed (and compiling out can't satisfy) that'd be helpful.
>>
>> In general, if a particular daemon provides a network service and does not
>> use user namespaces for sand-boxing, it should not be allowed to use user
>> namespaces, because those then become something else to potentially land an
>> exploit through.  ntpd, postfix, and most other regularly used network
>> servers fall into this category.
>
> seccomp handles this issue quite nicely.
>
>>
>> If you're hosting a shared system providing terminal server like usage where
>> the users actually have shell access, then they probably should not be able
>> to use user namespaces on the server.
>>
>
> Au contraire.  If they have user ns access, then can sandbox their own programs.

The open-ended cases of web servers and shell access aren't cleanly
handled by seccomp. And we're talking about protecting them as soon as
this knob exists, not after each program or service grows its own
sandboxing solution.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 17:15                   ` Serge Hallyn
  2016-01-26 18:09                     ` Austin S. Hemmelgarn
  2016-01-26 23:13                     ` Kees Cook
@ 2016-01-26 23:47                     ` Josh Boyer
  2 siblings, 0 replies; 80+ messages in thread
From: Josh Boyer @ 2016-01-26 23:47 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kostya Serebryany, linux-doc, Al Viro, Andrew Morton,
	David Howells, kernel-hardening, Andy Lutomirski,
	Eric W. Biederman, Alexander Potapenko, Kees Cook, Eric Dumazet,
	Robert Święcki, Miklos Szeredi, Dmitry Vyukov,
	Richard Weinberger, linux-kernel, Sasha Levin

[-- Attachment #1: Type: text/plain, Size: 3131 bytes --]

On Jan 26, 2016 3:58 PM, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:
>
> Quoting Josh Boyer (jwboyer@fedoraproject.org):
> > On Mon, Jan 25, 2016 at 11:57 PM, Eric W. Biederman
> > <ebiederm@xmission.com> wrote:
> > > Kees Cook <keescook@chromium.org> writes:
> > >
> > >> On Mon, Jan 25, 2016 at 11:33 AM, Eric W. Biederman
> > >> <ebiederm@xmission.com> wrote:
> > >>> Kees Cook <keescook@chromium.org> writes:
> > >>>>
> > >>>> Well, I don't know about less weird, but it would leave a unneeded
> > >>>> hole in the permission checks.
> > >>>
> > >>> To be clear the current patch has my:
> > >>>
> > >>> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > >>>
> > >>> The code is buggy, and poorly thought through.  Your lack of
interest in
> > >>> fixing the bugs in your patch is distressing.
> > >>
> > >> I'm not sure where you see me having a "lack of interest". The
> > >> existing cap-checking sysctls have a corner-case bug, which is
> > >> orthogonal to this change.
> > >
> > > That certainly doesn't sound like you have any plans to change
anything
> > > there.
> > >
> > >>> So broken code, not willing to fix.  No. We are not merging this
sysctl.
> > >>
> > >> I think you're jumping to conclusions. :)
> > >
> > > I think I am the maintainer.
> > >
> > > What you are proposing is very much something that is only of interst
to
> > > people who are not using user namespaces.  It is fatally flawed as
> > > a way to avoid new attack surfaces for people who don't care as the
> > > sysctl leaves user namespaces enabled by default.  It is fatally
flawed
> > > as remediation to recommend to people to change if a new user
namespace
> > > related but is discovered.  Any running process that happens to be
> > > created while user namespace creation was enabled will continue to
> > > exist.  Effectively a reboot will be required as part of a mitigation.
> > > Many sysadmins will get that wrong.
> > >
> > > I can't possibly see your sysctl as proposed achieving it's goals.  A
> > > person has to be entirely too aware of subtlety and nuance to use it
> > > effectively.
> >
> > What you're saying is true for the "oh crap" case of a new userns
> > related CVE being found.  However, there is the case where sysadmins
> > know for a fact that a set of machines should not allow user
> > namespaces to be enabled.  Currently they have 2 choices, 1) use their
>
> Hi - can you give a specific example of this?  (Where users really should
> not be able to use them - not where they might not need them)  I think
> it'll help the discussion tremendously.  Because so far the only good
> arguments I've seen have been about actual bugs in the user namespaces,
> which would not warrant a designed-in permanent disable switch.  If
> there are good use cases where such a disable switch will always be
> needed (and compiling out can't satisfy) that'd be helpful.

I already did in the email you quoted.  Limiting the surface where the
admin knows that shouldn't be used without having to rebuild a distribution
kernel.  Think of it like blacklisting a module, except user namespaces
can't be modules.

josh

[-- Attachment #2: Type: text/html, Size: 4499 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26 23:13                     ` Kees Cook
@ 2016-01-27 10:27                       ` Eric W. Biederman
  2016-01-27 12:32                         ` Austin S. Hemmelgarn
  2016-01-28 14:41                           ` Robert Święcki
  0 siblings, 2 replies; 80+ messages in thread
From: Eric W. Biederman @ 2016-01-27 10:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: Serge Hallyn, kernel-hardening, Andy Lutomirski, Andrew Morton,
	Al Viro, Richard Weinberger, Robert Święcki,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

Kees Cook <keescook@chromium.org> writes:

> On Tue, Jan 26, 2016 at 9:15 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
>> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>>> What you're saying is true for the "oh crap" case of a new userns
>>> related CVE being found.  However, there is the case where sysadmins
>>> know for a fact that a set of machines should not allow user
>>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>>
>> Hi - can you give a specific example of this?  (Where users really should
>> not be able to use them - not where they might not need them)  I think
>> it'll help the discussion tremendously.  Because so far the only good
>> arguments I've seen have been about actual bugs in the user namespaces,
>> which would not warrant a designed-in permanent disable switch.  If
>> there are good use cases where such a disable switch will always be
>> needed (and compiling out can't satisfy) that'd be helpful.
>
> My example is a machine in a colo rack serving web pages. A site gets
> attacked, and www-data uses user namespaces to continue their attack
> to gain root privileges.
>
> The admin of such a machine could have disabled userns months earlier
> and limited the scope of the attack.

Of course for the paranoid there is already a mechanism to do this.
/sbin/chroot.

No new user namespaces are allowed to be created inside of a chroot.

Eric

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-27 10:27                       ` Eric W. Biederman
@ 2016-01-27 12:32                         ` Austin S. Hemmelgarn
  2016-01-28 14:41                           ` Robert Święcki
  1 sibling, 0 replies; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-27 12:32 UTC (permalink / raw)
  To: Eric W. Biederman, Kees Cook
  Cc: Serge Hallyn, kernel-hardening, Andy Lutomirski, Andrew Morton,
	Al Viro, Richard Weinberger, Robert Święcki,
	Dmitry Vyukov, David Howells, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

On 2016-01-27 05:27, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Tue, Jan 26, 2016 at 9:15 AM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
>>> Quoting Josh Boyer (jwboyer@fedoraproject.org):
>>>> What you're saying is true for the "oh crap" case of a new userns
>>>> related CVE being found.  However, there is the case where sysadmins
>>>> know for a fact that a set of machines should not allow user
>>>> namespaces to be enabled.  Currently they have 2 choices, 1) use their
>>>
>>> Hi - can you give a specific example of this?  (Where users really should
>>> not be able to use them - not where they might not need them)  I think
>>> it'll help the discussion tremendously.  Because so far the only good
>>> arguments I've seen have been about actual bugs in the user namespaces,
>>> which would not warrant a designed-in permanent disable switch.  If
>>> there are good use cases where such a disable switch will always be
>>> needed (and compiling out can't satisfy) that'd be helpful.
>>
>> My example is a machine in a colo rack serving web pages. A site gets
>> attacked, and www-data uses user namespaces to continue their attack
>> to gain root privileges.
>>
>> The admin of such a machine could have disabled userns months earlier
>> and limited the scope of the attack.
>
> Of course for the paranoid there is already a mechanism to do this.
> /sbin/chroot.
>
> No new user namespaces are allowed to be created inside of a chroot.
I would like to point out that this is undocumented outside of the 
kernel source code on every Linux system I've seen.  And, it's not 
hugely obvious from looking at the source code unless you have some 
experience with kernel programming.

Also, being able to limit to exactly one (or possibly two depending on 
the application's usage of them) user namespace would be useful, as that 
would allow sand-boxing without the ability to create any more through 
some exploit.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-26  4:57                 ` [kernel-hardening] " Eric W. Biederman
                                   ` (2 preceding siblings ...)
  (?)
@ 2016-01-28  8:56                 ` Serge E. Hallyn
  2016-01-28 12:53                   ` Austin S. Hemmelgarn
  -1 siblings, 1 reply; 80+ messages in thread
From: Serge E. Hallyn @ 2016-01-28  8:56 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

On Mon, Jan 25, 2016 at 10:57:32PM -0600, Eric W. Biederman wrote:
> What sounds like a generally useful feature that would cover your use
> case and many others is a per user limit on the number of user
> namespaces users may create.

Ok, I'm sorry, but after thinking about this quite awhile, I think this
is a bad idea.  If I'm allowed to create exactly one, then (a) I won't
be able to run two instances of chrome (does chrome use one userns per
tab or per application?), yet (b) i can easily just not use chrome and
use my allocation to run a vulnerability.

IMO, having a (hopefully temporary, so cleanly separated out) sysctl,
which perhaps goes so far as to kill all non-init user namespaces when
set to -1, makes the most sense.  I still think the harm due to having
userspace not being able to rely on user namespaces will, long term, be
worse than the security implications of having user namespaces always
enabled.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-28  8:56                 ` Serge E. Hallyn
@ 2016-01-28 12:53                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 80+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-28 12:53 UTC (permalink / raw)
  To: Serge E. Hallyn, kernel-hardening
  Cc: Kees Cook, Andy Lutomirski, Andrew Morton, Al Viro,
	Richard Weinberger, Robert Święcki, Dmitry Vyukov,
	David Howells, Miklos Szeredi, Kostya Serebryany,
	Alexander Potapenko, Eric Dumazet, Sasha Levin, linux-doc,
	linux-kernel

On 2016-01-28 03:56, Serge E. Hallyn wrote:
> On Mon, Jan 25, 2016 at 10:57:32PM -0600, Eric W. Biederman wrote:
>> What sounds like a generally useful feature that would cover your use
>> case and many others is a per user limit on the number of user
>> namespaces users may create.
>
> Ok, I'm sorry, but after thinking about this quite awhile, I think this
> is a bad idea.  If I'm allowed to create exactly one, then (a) I won't
> be able to run two instances of chrome (does chrome use one userns per
> tab or per application?), yet (b) i can easily just not use chrome and
> use my allocation to run a vulnerability.
Ignore regular users WRT the ability to limit to one or two namespace 
instances, the advantage of being able to limit to one or two user 
namespaces for a given user is that it prevents daemons that use it for 
sand-boxing from being exploited to create more.  If we go with Eric's 
suggestion, there is absolutely no reason on a regular desktop system 
that the users used for a graphical login would have to be limited.
>
> IMO, having a (hopefully temporary, so cleanly separated out) sysctl,
> which perhaps goes so far as to kill all non-init user namespaces when
> set to -1, makes the most sense.  I still think the harm due to having
> userspace not being able to rely on user namespaces will, long term, be
> worse than the security implications of having user namespaces always
> enabled.
Userspace can't rely on them as it is right now.  The default kernel 
config has them disabled, and a number of distros are refusing to ship 
kernels with them enabled at least in the short term, wheras a number 
are only shipping kernels with them enabled (I'd be willing to bet that 
part of what prompted this originally was Chrome).  Just by virtue of 
this, you can't rely on them being there and need to gracefully handle 
the situation if they aren't.  Adding a sysctl allowing them to be 
disabled completely on a kernel that has support built in would not make 
them all that much more unreliable than they already are.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
  2016-01-27 10:27                       ` Eric W. Biederman
@ 2016-01-28 14:41                           ` Robert Święcki
  2016-01-28 14:41                           ` Robert Święcki
  1 sibling, 0 replies; 80+ messages in thread
From: Robert Święcki @ 2016-01-28 14:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Serge Hallyn, kernel-hardening, Andy Lutomirski,
	Andrew Morton, Al Viro, Richard Weinberger, Dmitry Vyukov,
	David Howells, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

>> The admin of such a machine could have disabled userns months earlier
>> and limited the scope of the attack.
>
> Of course for the paranoid there is already a mechanism to do this.
> /sbin/chroot.
>
> No new user namespaces are allowed to be created inside of a chroot.

Another alternative is to create a custom kernel module which will
disable the user namespace (by limiting to privileged users only, or
disabling it altogether).

IMO people tend to use distro kernels for convenience, and a
suggestion of creating a chroot dir for every service exposed to
users, or building a custom kernel module is an advice that not many
sysadmins using distro kernels would take, even if they have concerns
about the the increased attack surface enabled by CLONE_NEWUSER.

Also, I don't think the willingness to disable the feature or limit it
to the already privileged users will be something that only truly
paranoid sysadmins/users would have. We've seen a fair amount of
privilege escalation / DoS bugs that this kernel feature enabled in
the recent 18 months or so, and they still seem to be found on a
somewhat regular basis. Therefore this discussion and its outcome
might be of interest to less paranoid folk as well.

I agree with Kees that the "knob" will be mostly used by admins of
web-servers (and similar services) as a protection against privilege
escalation after getting low-priv code execution on the system.
Therefore a sysctl enabled from /etc/sysctl.conf should work well
here, even if it's set to the permissive mode by default at boot.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [kernel-hardening] Re: [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled
@ 2016-01-28 14:41                           ` Robert Święcki
  0 siblings, 0 replies; 80+ messages in thread
From: Robert Święcki @ 2016-01-28 14:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, Serge Hallyn, kernel-hardening, Andy Lutomirski,
	Andrew Morton, Al Viro, Richard Weinberger, Dmitry Vyukov,
	David Howells, Kostya Serebryany, Alexander Potapenko,
	Eric Dumazet, Sasha Levin, linux-doc, linux-kernel

>> The admin of such a machine could have disabled userns months earlier
>> and limited the scope of the attack.
>
> Of course for the paranoid there is already a mechanism to do this.
> /sbin/chroot.
>
> No new user namespaces are allowed to be created inside of a chroot.

Another alternative is to create a custom kernel module which will
disable the user namespace (by limiting to privileged users only, or
disabling it altogether).

IMO people tend to use distro kernels for convenience, and a
suggestion of creating a chroot dir for every service exposed to
users, or building a custom kernel module is an advice that not many
sysadmins using distro kernels would take, even if they have concerns
about the the increased attack surface enabled by CLONE_NEWUSER.

Also, I don't think the willingness to disable the feature or limit it
to the already privileged users will be something that only truly
paranoid sysadmins/users would have. We've seen a fair amount of
privilege escalation / DoS bugs that this kernel feature enabled in
the recent 18 months or so, and they still seem to be found on a
somewhat regular basis. Therefore this discussion and its outcome
might be of interest to less paranoid folk as well.

I agree with Kees that the "knob" will be mostly used by admins of
web-servers (and similar services) as a protection against privilege
escalation after getting low-priv code execution on the system.
Therefore a sysctl enabled from /etc/sysctl.conf should work well
here, even if it's set to the permissive mode by default at boot.

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2016-01-28 14:41 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-22 22:39 [PATCH 0/2] sysctl: allow CLONE_NEWUSER to be disabled Kees Cook
2016-01-22 22:39 ` [kernel-hardening] " Kees Cook
2016-01-22 22:39 ` [PATCH 1/2] sysctl: expand use of proc_dointvec_minmax_sysadmin Kees Cook
2016-01-22 22:39   ` [kernel-hardening] " Kees Cook
2016-01-23  3:10   ` Eric W. Biederman
2016-01-23  3:10     ` [kernel-hardening] " Eric W. Biederman
2016-01-23 22:25     ` Jann Horn
2016-01-24  1:20       ` Eric W. Biederman
2016-01-24  1:43         ` Al Viro
2016-01-24  1:56           ` Jann Horn
2016-01-24  6:02             ` Eric W. Biederman
2016-01-24  6:32               ` Jann Horn
2016-01-24  6:44                 ` Eric W. Biederman
2016-01-22 22:39 ` [PATCH 2/2] sysctl: allow CLONE_NEWUSER to be disabled Kees Cook
2016-01-22 22:39   ` [kernel-hardening] " Kees Cook
2016-01-22 22:47   ` Robert Święcki
2016-01-22 22:47     ` [kernel-hardening] " Robert Święcki
2016-01-22 22:50     ` Kees Cook
2016-01-22 22:50       ` [kernel-hardening] " Kees Cook
2016-01-22 22:55       ` Robert Święcki
2016-01-22 22:55         ` [kernel-hardening] " Robert Święcki
2016-01-22 23:00         ` Kees Cook
2016-01-22 23:00           ` [kernel-hardening] " Kees Cook
2016-01-23  0:44           ` Serge Hallyn
2016-01-23  0:44             ` [kernel-hardening] " Serge Hallyn
2016-01-23  0:44           ` Serge Hallyn
2016-01-23  0:44             ` [kernel-hardening] " Serge Hallyn
2016-01-23  0:59           ` Ben Hutchings
2016-01-24 20:59             ` Kees Cook
2016-01-24 20:59               ` Kees Cook
2016-01-24 22:20               ` Andy Lutomirski
2016-01-25 18:51                 ` Kees Cook
2016-01-22 22:49 ` [PATCH 0/2] " Richard Weinberger
2016-01-22 22:49   ` [kernel-hardening] " Richard Weinberger
2016-01-23  3:02 ` Eric W. Biederman
2016-01-23  3:02   ` [kernel-hardening] " Eric W. Biederman
2016-01-24 20:57   ` Kees Cook
2016-01-24 20:57     ` [kernel-hardening] " Kees Cook
2016-01-26  7:38     ` Serge Hallyn
2016-01-24 22:22   ` Andy Lutomirski
2016-01-24 22:22     ` [kernel-hardening] " Andy Lutomirski
2016-01-25 18:51     ` Kees Cook
2016-01-25 18:51       ` [kernel-hardening] " Kees Cook
2016-01-25 18:53       ` Andy Lutomirski
2016-01-25 18:53         ` [kernel-hardening] " Andy Lutomirski
2016-01-25 18:56         ` Kees Cook
2016-01-25 18:56           ` [kernel-hardening] " Kees Cook
2016-01-25 19:33           ` Eric W. Biederman
2016-01-25 19:33             ` [kernel-hardening] " Eric W. Biederman
2016-01-25 22:34             ` Kees Cook
2016-01-25 22:34               ` [kernel-hardening] " Kees Cook
2016-01-25 23:33               ` Andy Lutomirski
2016-01-25 23:33                 ` [kernel-hardening] " Andy Lutomirski
2016-01-26  2:27               ` Daniel Micay
2016-01-26  4:57               ` Eric W. Biederman
2016-01-26  4:57                 ` [kernel-hardening] " Eric W. Biederman
2016-01-26 14:38                 ` Josh Boyer
2016-01-26 14:38                   ` [kernel-hardening] " Josh Boyer
2016-01-26 14:46                   ` Austin S. Hemmelgarn
2016-01-26 14:46                     ` [kernel-hardening] " Austin S. Hemmelgarn
2016-01-26 14:56                     ` Josh Boyer
2016-01-26 14:56                       ` [kernel-hardening] " Josh Boyer
2016-01-26 17:20                       ` Serge Hallyn
2016-01-26 19:56                         ` Josh Boyer
2016-01-26 20:11                           ` Austin S. Hemmelgarn
2016-01-26 17:15                   ` Serge Hallyn
2016-01-26 18:09                     ` Austin S. Hemmelgarn
2016-01-26 18:27                       ` Andy Lutomirski
2016-01-26 18:45                         ` Austin S. Hemmelgarn
2016-01-26 23:15                         ` Kees Cook
2016-01-26 23:13                     ` Kees Cook
2016-01-27 10:27                       ` Eric W. Biederman
2016-01-27 12:32                         ` Austin S. Hemmelgarn
2016-01-28 14:41                         ` Robert Święcki
2016-01-28 14:41                           ` Robert Święcki
2016-01-26 23:47                     ` Josh Boyer
2016-01-26 16:37                 ` Kees Cook
2016-01-26 16:37                   ` [kernel-hardening] " Kees Cook
2016-01-28  8:56                 ` Serge E. Hallyn
2016-01-28 12:53                   ` Austin S. Hemmelgarn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.