All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] capabilities: add capability cgroup controller
@ 2016-06-23 15:07 ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-06-23 15:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: luto, serge, keescook, Topi Miettinen, Jonathan Corbet,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find
out useful values for the limits, except blind trial and error.

Currently, there is no way to know which capabilities are actually used.
Even the source code is only implicit, in-depth knowledge of each
capability must be used when analyzing a program to judge which
capabilities the program will exercise.

Add a new cgroup controller for monitoring of capabilities
in the cgroup.

Test case demonstrating basic capability monitoring and how the
capabilities are combined at next level (boot to rdshell):

(initramfs) cd /sys/fs
(initramfs) mount -t cgroup2 cgroup cgroup
(initramfs) cd cgroup
(initramfs) echo +capability > cgroup.subtree_control
(initramfs) mkdir test; cd test
(initramfs) echo +capability > cgroup.subtree_control
(initramfs) ls
capability.used         cgroup.events           cgroup.subtree_control
cgroup.controllers      cgroup.procs
(initramfs) mkdir first second
(initramfs) sh

BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) cd first
(initramfs) echo $$ >cgroup.procs
(initramfs) cat capability.used
0000000000000000 # nothing so far
(initramfs) mknod /dev/z_$$ c 1 2
(initramfs) cat capability.used
0000000008000000 # CAP_MKNOD
(initramfs) cat ../capability.used
0000000008000000 # also seen at next higher level
(initramfs) exit
(initramfs) sh

BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) cd second
(initramfs) echo $$ >cgroup.procs
(initramfs) cat capability.used
0000000000000000 # nothing so far
(initramfs) chown 1234 /dev/z_*
(initramfs) cat capability.used
0000000000000001 # CAP_CHROOT
(initramfs) cat ../capability.used
0000000008000001 # combined at next higher level
(initramfs) exit

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 Documentation/cgroup-v2.txt       | 17 +++++++
 include/linux/capability_cgroup.h |  7 +++
 include/linux/cgroup_subsys.h     |  4 ++
 init/Kconfig                      |  6 +++
 kernel/capability.c               |  2 +
 security/Makefile                 |  1 +
 security/capability_cgroup.c      | 99 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 136 insertions(+)
 create mode 100644 include/linux/capability_cgroup.h
 create mode 100644 security/capability_cgroup.c

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4cc07ce..2b3d277 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1118,6 +1118,23 @@ writeback as follows.
 	total available memory and applied the same way as
 	vm.dirty[_background]_ratio.
 
+5-4. Capabilities
+
+The "capability" controller is used to monitor capability use in the
+cgroup. This can be used to discover a starting point for capability
+bounding sets, even when running a shell script under ambient
+capabilities, with only short-lived helper processes exercising the
+capabilities.
+
+
+5-4-1. Capability Interface Files
+
+  capability.used
+
+	A read-only file which exists on all cgroups.
+
+	This reports the combined value of capability use in the
+	current cgroup and all its children.
 
 6. Namespace
 
diff --git a/include/linux/capability_cgroup.h b/include/linux/capability_cgroup.h
new file mode 100644
index 0000000..c03b58d
--- /dev/null
+++ b/include/linux/capability_cgroup.h
@@ -0,0 +1,7 @@
+#ifdef CONFIG_CGROUP_CAPABILITY
+void capability_cgroup_update_used(int cap);
+#else
+static inline void capability_cgroup_update_used(int cap)
+{
+}
+#endif
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..a5161d0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_CAPABILITY)
+SUBSYS(capability)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f755a60..25d17ef 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1141,6 +1141,12 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config CGROUP_CAPABILITY
+	bool "Capability controller"
+	help
+	  Provides a simple controller for monitoring of capabilities in the
+	  cgroup.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n
diff --git a/kernel/capability.c b/kernel/capability.c
index 45432b5..b57d7f9 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -17,6 +17,7 @@
 #include <linux/syscalls.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/capability_cgroup.h>
 #include <asm/uaccess.h>
 
 /*
@@ -380,6 +381,7 @@ bool ns_capable(struct user_namespace *ns, int cap)
 	}
 
 	if (security_capable(current_cred(), ns, cap) == 0) {
+		capability_cgroup_update_used(cap);
 		current->flags |= PF_SUPERPRIV;
 		return true;
 	}
diff --git a/security/Makefile b/security/Makefile
index f2d71cd..2bb04f1 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_SECURITY_LOADPIN)		+= loadpin/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
+obj-$(CONFIG_CGROUP_CAPABILITY)		+= capability_cgroup.o
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity
diff --git a/security/capability_cgroup.c b/security/capability_cgroup.c
new file mode 100644
index 0000000..f002477
--- /dev/null
+++ b/security/capability_cgroup.c
@@ -0,0 +1,99 @@
+/*
+ * Capability cgroup
+ *
+ * Copyright 2016 Topi Miettinen
+ *
+ * This file is subject to the terms and conditions of the GNU General
+ * Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/capability.h>
+#include <linux/capability_cgroup.h>
+#include <linux/cgroup.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(capcg_mutex);
+
+struct capcg_cgroup {
+	struct cgroup_subsys_state css;
+	kernel_cap_t cap_used; /* Capabilities actually used */
+};
+
+static inline struct capcg_cgroup *css_to_capcg(struct cgroup_subsys_state *s)
+{
+	return s ? container_of(s, struct capcg_cgroup, css) : NULL;
+}
+
+static inline struct capcg_cgroup *task_to_capcg(struct task_struct *task)
+{
+	return css_to_capcg(task_css(task, capability_cgrp_id));
+}
+
+static struct cgroup_subsys_state *capcg_css_alloc(struct cgroup_subsys_state
+						   *parent)
+{
+	struct capcg_cgroup *caps;
+
+	caps = kzalloc(sizeof(*caps), GFP_KERNEL);
+	if (!caps)
+		return ERR_PTR(-ENOMEM);
+
+	cap_clear(caps->cap_used);
+	return &caps->css;
+}
+
+static void capcg_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_to_capcg(css));
+}
+
+static int capcg_seq_show_used(struct seq_file *m, void *v)
+{
+	struct capcg_cgroup *capcg = css_to_capcg(seq_css(m));
+	struct cgroup_subsys_state *pos;
+	u32 capi;
+	kernel_cap_t subsys_caps = capcg->cap_used;
+
+	rcu_read_lock();
+
+	css_for_each_child(pos, &capcg->css) {
+		struct capcg_cgroup *pos_capcg = css_to_capcg(pos);
+
+		subsys_caps = cap_combine(subsys_caps, pos_capcg->cap_used);
+	}
+
+	rcu_read_unlock();
+
+	CAP_FOR_EACH_U32(capi) {
+		seq_printf(m, "%08x",
+			   subsys_caps.cap[CAP_LAST_U32 - capi]);
+	}
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static struct cftype capcg_files[] = {
+	{
+		.name = "used",
+		.seq_show = capcg_seq_show_used,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys capability_cgrp_subsys = {
+	.css_alloc = capcg_css_alloc,
+	.css_free = capcg_css_free,
+	.dfl_cftypes = capcg_files,
+};
+
+void capability_cgroup_update_used(int cap)
+{
+	struct capcg_cgroup *caps = task_to_capcg(current);
+
+	mutex_lock(&capcg_mutex);
+	cap_raise(caps->cap_used, cap);
+	mutex_unlock(&capcg_mutex);
+}
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH] capabilities: add capability cgroup controller
@ 2016-06-23 15:07 ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-06-23 15:07 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: luto-DgEjT+Ai2ygdnm+yROfE0A, serge-A9i7LUbDfNHQT0dZR+AlfA,
	keescook-F7+t8E8rja9g9hUCZPvPmw, Topi Miettinen, Jonathan Corbet,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP CGROUP, open list:CAPABILITIES

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find
out useful values for the limits, except blind trial and error.

Currently, there is no way to know which capabilities are actually used.
Even the source code is only implicit, in-depth knowledge of each
capability must be used when analyzing a program to judge which
capabilities the program will exercise.

Add a new cgroup controller for monitoring of capabilities
in the cgroup.

Test case demonstrating basic capability monitoring and how the
capabilities are combined at next level (boot to rdshell):

(initramfs) cd /sys/fs
(initramfs) mount -t cgroup2 cgroup cgroup
(initramfs) cd cgroup
(initramfs) echo +capability > cgroup.subtree_control
(initramfs) mkdir test; cd test
(initramfs) echo +capability > cgroup.subtree_control
(initramfs) ls
capability.used         cgroup.events           cgroup.subtree_control
cgroup.controllers      cgroup.procs
(initramfs) mkdir first second
(initramfs) sh

BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) cd first
(initramfs) echo $$ >cgroup.procs
(initramfs) cat capability.used
0000000000000000 # nothing so far
(initramfs) mknod /dev/z_$$ c 1 2
(initramfs) cat capability.used
0000000008000000 # CAP_MKNOD
(initramfs) cat ../capability.used
0000000008000000 # also seen at next higher level
(initramfs) exit
(initramfs) sh

BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) cd second
(initramfs) echo $$ >cgroup.procs
(initramfs) cat capability.used
0000000000000000 # nothing so far
(initramfs) chown 1234 /dev/z_*
(initramfs) cat capability.used
0000000000000001 # CAP_CHROOT
(initramfs) cat ../capability.used
0000000008000001 # combined at next higher level
(initramfs) exit

Signed-off-by: Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 Documentation/cgroup-v2.txt       | 17 +++++++
 include/linux/capability_cgroup.h |  7 +++
 include/linux/cgroup_subsys.h     |  4 ++
 init/Kconfig                      |  6 +++
 kernel/capability.c               |  2 +
 security/Makefile                 |  1 +
 security/capability_cgroup.c      | 99 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 136 insertions(+)
 create mode 100644 include/linux/capability_cgroup.h
 create mode 100644 security/capability_cgroup.c

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4cc07ce..2b3d277 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1118,6 +1118,23 @@ writeback as follows.
 	total available memory and applied the same way as
 	vm.dirty[_background]_ratio.
 
+5-4. Capabilities
+
+The "capability" controller is used to monitor capability use in the
+cgroup. This can be used to discover a starting point for capability
+bounding sets, even when running a shell script under ambient
+capabilities, with only short-lived helper processes exercising the
+capabilities.
+
+
+5-4-1. Capability Interface Files
+
+  capability.used
+
+	A read-only file which exists on all cgroups.
+
+	This reports the combined value of capability use in the
+	current cgroup and all its children.
 
 6. Namespace
 
diff --git a/include/linux/capability_cgroup.h b/include/linux/capability_cgroup.h
new file mode 100644
index 0000000..c03b58d
--- /dev/null
+++ b/include/linux/capability_cgroup.h
@@ -0,0 +1,7 @@
+#ifdef CONFIG_CGROUP_CAPABILITY
+void capability_cgroup_update_used(int cap);
+#else
+static inline void capability_cgroup_update_used(int cap)
+{
+}
+#endif
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..a5161d0 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_CAPABILITY)
+SUBSYS(capability)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f755a60..25d17ef 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1141,6 +1141,12 @@ config CGROUP_PERF
 
 	  Say N if unsure.
 
+config CGROUP_CAPABILITY
+	bool "Capability controller"
+	help
+	  Provides a simple controller for monitoring of capabilities in the
+	  cgroup.
+
 config CGROUP_DEBUG
 	bool "Example controller"
 	default n
diff --git a/kernel/capability.c b/kernel/capability.c
index 45432b5..b57d7f9 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -17,6 +17,7 @@
 #include <linux/syscalls.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/capability_cgroup.h>
 #include <asm/uaccess.h>
 
 /*
@@ -380,6 +381,7 @@ bool ns_capable(struct user_namespace *ns, int cap)
 	}
 
 	if (security_capable(current_cred(), ns, cap) == 0) {
+		capability_cgroup_update_used(cap);
 		current->flags |= PF_SUPERPRIV;
 		return true;
 	}
diff --git a/security/Makefile b/security/Makefile
index f2d71cd..2bb04f1 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_SECURITY_LOADPIN)		+= loadpin/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
+obj-$(CONFIG_CGROUP_CAPABILITY)		+= capability_cgroup.o
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity
diff --git a/security/capability_cgroup.c b/security/capability_cgroup.c
new file mode 100644
index 0000000..f002477
--- /dev/null
+++ b/security/capability_cgroup.c
@@ -0,0 +1,99 @@
+/*
+ * Capability cgroup
+ *
+ * Copyright 2016 Topi Miettinen
+ *
+ * This file is subject to the terms and conditions of the GNU General
+ * Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/capability.h>
+#include <linux/capability_cgroup.h>
+#include <linux/cgroup.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(capcg_mutex);
+
+struct capcg_cgroup {
+	struct cgroup_subsys_state css;
+	kernel_cap_t cap_used; /* Capabilities actually used */
+};
+
+static inline struct capcg_cgroup *css_to_capcg(struct cgroup_subsys_state *s)
+{
+	return s ? container_of(s, struct capcg_cgroup, css) : NULL;
+}
+
+static inline struct capcg_cgroup *task_to_capcg(struct task_struct *task)
+{
+	return css_to_capcg(task_css(task, capability_cgrp_id));
+}
+
+static struct cgroup_subsys_state *capcg_css_alloc(struct cgroup_subsys_state
+						   *parent)
+{
+	struct capcg_cgroup *caps;
+
+	caps = kzalloc(sizeof(*caps), GFP_KERNEL);
+	if (!caps)
+		return ERR_PTR(-ENOMEM);
+
+	cap_clear(caps->cap_used);
+	return &caps->css;
+}
+
+static void capcg_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_to_capcg(css));
+}
+
+static int capcg_seq_show_used(struct seq_file *m, void *v)
+{
+	struct capcg_cgroup *capcg = css_to_capcg(seq_css(m));
+	struct cgroup_subsys_state *pos;
+	u32 capi;
+	kernel_cap_t subsys_caps = capcg->cap_used;
+
+	rcu_read_lock();
+
+	css_for_each_child(pos, &capcg->css) {
+		struct capcg_cgroup *pos_capcg = css_to_capcg(pos);
+
+		subsys_caps = cap_combine(subsys_caps, pos_capcg->cap_used);
+	}
+
+	rcu_read_unlock();
+
+	CAP_FOR_EACH_U32(capi) {
+		seq_printf(m, "%08x",
+			   subsys_caps.cap[CAP_LAST_U32 - capi]);
+	}
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static struct cftype capcg_files[] = {
+	{
+		.name = "used",
+		.seq_show = capcg_seq_show_used,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys capability_cgrp_subsys = {
+	.css_alloc = capcg_css_alloc,
+	.css_free = capcg_css_free,
+	.dfl_cftypes = capcg_files,
+};
+
+void capability_cgroup_update_used(int cap)
+{
+	struct capcg_cgroup *caps = task_to_capcg(current);
+
+	mutex_lock(&capcg_mutex);
+	cap_raise(caps->cap_used, cap);
+	mutex_unlock(&capcg_mutex);
+}
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-23 21:03   ` Kees Cook
  0 siblings, 0 replies; 46+ messages in thread
From: Kees Cook @ 2016-06-23 21:03 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: LKML, Andy Lutomirski, Serge E. Hallyn, Jonathan Corbet,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Thu, Jun 23, 2016 at 8:07 AM, Topi Miettinen <toiwoton@gmail.com> wrote:
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
>
> Currently, there is no way to know which capabilities are actually used.
> Even the source code is only implicit, in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
>
> Add a new cgroup controller for monitoring of capabilities
> in the cgroup.
>
> Test case demonstrating basic capability monitoring and how the
> capabilities are combined at next level (boot to rdshell):
>
> (initramfs) cd /sys/fs
> (initramfs) mount -t cgroup2 cgroup cgroup
> (initramfs) cd cgroup
> (initramfs) echo +capability > cgroup.subtree_control
> (initramfs) mkdir test; cd test
> (initramfs) echo +capability > cgroup.subtree_control
> (initramfs) ls
> capability.used         cgroup.events           cgroup.subtree_control
> cgroup.controllers      cgroup.procs
> (initramfs) mkdir first second
> (initramfs) sh
>
> BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> (initramfs) cd first
> (initramfs) echo $$ >cgroup.procs
> (initramfs) cat capability.used
> 0000000000000000 # nothing so far
> (initramfs) mknod /dev/z_$$ c 1 2
> (initramfs) cat capability.used
> 0000000008000000 # CAP_MKNOD
> (initramfs) cat ../capability.used
> 0000000008000000 # also seen at next higher level
> (initramfs) exit
> (initramfs) sh
>
> BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> (initramfs) cd second
> (initramfs) echo $$ >cgroup.procs
> (initramfs) cat capability.used
> 0000000000000000 # nothing so far
> (initramfs) chown 1234 /dev/z_*
> (initramfs) cat capability.used
> 0000000000000001 # CAP_CHROOT

nitpick: this is CAP_CHOWN, not CAP_CHROOT

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-23 21:03   ` Kees Cook
  0 siblings, 0 replies; 46+ messages in thread
From: Kees Cook @ 2016-06-23 21:03 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: LKML, Andy Lutomirski, Serge E. Hallyn, Jonathan Corbet,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Thu, Jun 23, 2016 at 8:07 AM, Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
>
> Currently, there is no way to know which capabilities are actually used.
> Even the source code is only implicit, in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
>
> Add a new cgroup controller for monitoring of capabilities
> in the cgroup.
>
> Test case demonstrating basic capability monitoring and how the
> capabilities are combined at next level (boot to rdshell):
>
> (initramfs) cd /sys/fs
> (initramfs) mount -t cgroup2 cgroup cgroup
> (initramfs) cd cgroup
> (initramfs) echo +capability > cgroup.subtree_control
> (initramfs) mkdir test; cd test
> (initramfs) echo +capability > cgroup.subtree_control
> (initramfs) ls
> capability.used         cgroup.events           cgroup.subtree_control
> cgroup.controllers      cgroup.procs
> (initramfs) mkdir first second
> (initramfs) sh
>
> BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> (initramfs) cd first
> (initramfs) echo $$ >cgroup.procs
> (initramfs) cat capability.used
> 0000000000000000 # nothing so far
> (initramfs) mknod /dev/z_$$ c 1 2
> (initramfs) cat capability.used
> 0000000008000000 # CAP_MKNOD
> (initramfs) cat ../capability.used
> 0000000008000000 # also seen at next higher level
> (initramfs) exit
> (initramfs) sh
>
> BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
> Enter 'help' for a list of built-in commands.
>
> (initramfs) cd second
> (initramfs) echo $$ >cgroup.procs
> (initramfs) cat capability.used
> 0000000000000000 # nothing so far
> (initramfs) chown 1234 /dev/z_*
> (initramfs) cat capability.used
> 0000000000000001 # CAP_CHROOT

nitpick: this is CAP_CHOWN, not CAP_CHROOT

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-23 15:07 ` Topi Miettinen
  (?)
  (?)
@ 2016-06-23 21:38 ` Tejun Heo
  2016-06-24  0:22   ` Topi Miettinen
  -1 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2016-06-23 21:38 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, luto, serge, keescook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello,

On Thu, Jun 23, 2016 at 06:07:10PM +0300, Topi Miettinen wrote:
> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
> 
> Currently, there is no way to know which capabilities are actually used.
> Even the source code is only implicit, in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
> 
> Add a new cgroup controller for monitoring of capabilities
> in the cgroup.
> 
> Test case demonstrating basic capability monitoring and how the
> capabilities are combined at next level (boot to rdshell):

This doesn't have anything to do with resource control and I don't
think it's a good idea to add arbitrary monitoring mechanisms to
cgroup just because it's easy to add interface there.  Given that
capabilities are inherited and modified through the process hierarchy,
shouldn't this be part of that?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-23 15:07 ` Topi Miettinen
@ 2016-06-23 23:46   ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2016-06-23 23:46 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, luto, serge, keescook, Jonathan Corbet, Tejun Heo,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton@gmail.com> wrote:

> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
> 
> Currently, there is no way to know which capabilities are actually used.
> Even the source code is only implicit, in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
> 
> Add a new cgroup controller for monitoring of capabilities
> in the cgroup.

I'm having trouble understanding how valuable this feature is to our
users, and that's a rather important thing!

Perhaps it would help if you were to explain your motivation:
particular use cases which benefited from this, for example.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-23 23:46   ` Andrew Morton
  0 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2016-06-23 23:46 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, luto, serge, keescook, Jonathan Corbet, Tejun Heo,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP CGROUP, open list:CAPABILITIES

On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton@gmail.com> wrote:

> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
> 
> Currently, there is no way to know which capabilities are actually used.
> Even the source code is only implicit, in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
> 
> Add a new cgroup controller for monitoring of capabilities
> in the cgroup.

I'm having trouble understanding how valuable this feature is to our
users, and that's a rather important thing!

Perhaps it would help if you were to explain your motivation:
particular use cases which benefited from this, for example.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-23 21:38 ` Tejun Heo
@ 2016-06-24  0:22   ` Topi Miettinen
  2016-06-24 15:48     ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Topi Miettinen @ 2016-06-24  0:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, luto, serge, keescook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/23/16 21:38, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jun 23, 2016 at 06:07:10PM +0300, Topi Miettinen wrote:
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find
>> out useful values for the limits, except blind trial and error.
>>
>> Currently, there is no way to know which capabilities are actually used.
>> Even the source code is only implicit, in-depth knowledge of each
>> capability must be used when analyzing a program to judge which
>> capabilities the program will exercise.
>>
>> Add a new cgroup controller for monitoring of capabilities
>> in the cgroup.
>>
>> Test case demonstrating basic capability monitoring and how the
>> capabilities are combined at next level (boot to rdshell):
> 
> This doesn't have anything to do with resource control and I don't
> think it's a good idea to add arbitrary monitoring mechanisms to
> cgroup just because it's easy to add interface there.  Given that
> capabilities are inherited and modified through the process hierarchy,
> shouldn't this be part of that?

With per process tracking, it's easy to miss if a short-lived process
exercised capabilities. Especially with ambient capabilities, the parent
process could be a shell script which might not use capabilities at all,
but its children do the heavy lifting.

Per process tracking (like in the version I sent earlier) could still be
added on top of this to complement cgroup level tracking, but I think
cgroup approach is more flexible as it can cover anything from a single
task to a collection of processes.

-Topi

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24  1:14     ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-06-24  1:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, luto, serge, keescook, Jonathan Corbet, Tejun Heo,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/23/16 23:46, Andrew Morton wrote:
> On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton@gmail.com> wrote:
> 
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find
>> out useful values for the limits, except blind trial and error.
>>
>> Currently, there is no way to know which capabilities are actually used.
>> Even the source code is only implicit, in-depth knowledge of each
>> capability must be used when analyzing a program to judge which
>> capabilities the program will exercise.
>>
>> Add a new cgroup controller for monitoring of capabilities
>> in the cgroup.
> 
> I'm having trouble understanding how valuable this feature is to our
> users, and that's a rather important thing!
> 
> Perhaps it would help if you were to explain your motivation:
> particular use cases which benefited from this, for example.
> 

It's easy to control with for example systemd or many other tools, which
capabilities a service should have at the start. But how should a system
administrator, application developer or distro maintaner ever determine
a suitable value for this? Currently the only way seems to be to become
an expert on capabilities, make an educated guess how the set of
programs in question happen to work in this context and especially how
they could exercise the capabilites in all possible use cases. Even
then, the outcome is to just try something to see if that happens to
work. Reading the source code (if available) does not help very much,
because the use of capabilities is anything but explicit there.

This is way too difficult, there must be some easier way. The
information which capabilities actually were used in a trial run gives a
much better starting point. The users can just use the list of used
capabilities with configuring the service or when developing or
maintaining the application. Of course, even that could still fail
eventually, but then you simply copy the new value of used capabilities
to the configuration, whereas currently you have to reconsider your
understanding of the capabilities and the programs in light of the
failure, which by itself might give no new useful information.

One way to solve this for good would be to make the use of capabilities
explicit in the ABI. For example, there could be a system call
dac_override() which would be the only possible way ever to use the
capability CAP_DAC_OVERRIDE and so forth. Then reading source code,
tracing and many other approaches would be useful. But the OS with that
kind of ABI (not Linux) would not be Unix-like at all for any
(potentially) capability using programs, like find(1) or cat(1).

-Topi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24  1:14     ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-06-24  1:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-DgEjT+Ai2ygdnm+yROfE0A,
	serge-A9i7LUbDfNHQT0dZR+AlfA, keescook-F7+t8E8rja9g9hUCZPvPmw,
	Jonathan Corbet, Tejun Heo, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/23/16 23:46, Andrew Morton wrote:
> On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
>> There are many basic ways to control processes, including capabilities,
>> cgroups and resource limits. However, there are far fewer ways to find
>> out useful values for the limits, except blind trial and error.
>>
>> Currently, there is no way to know which capabilities are actually used.
>> Even the source code is only implicit, in-depth knowledge of each
>> capability must be used when analyzing a program to judge which
>> capabilities the program will exercise.
>>
>> Add a new cgroup controller for monitoring of capabilities
>> in the cgroup.
> 
> I'm having trouble understanding how valuable this feature is to our
> users, and that's a rather important thing!
> 
> Perhaps it would help if you were to explain your motivation:
> particular use cases which benefited from this, for example.
> 

It's easy to control with for example systemd or many other tools, which
capabilities a service should have at the start. But how should a system
administrator, application developer or distro maintaner ever determine
a suitable value for this? Currently the only way seems to be to become
an expert on capabilities, make an educated guess how the set of
programs in question happen to work in this context and especially how
they could exercise the capabilites in all possible use cases. Even
then, the outcome is to just try something to see if that happens to
work. Reading the source code (if available) does not help very much,
because the use of capabilities is anything but explicit there.

This is way too difficult, there must be some easier way. The
information which capabilities actually were used in a trial run gives a
much better starting point. The users can just use the list of used
capabilities with configuring the service or when developing or
maintaining the application. Of course, even that could still fail
eventually, but then you simply copy the new value of used capabilities
to the configuration, whereas currently you have to reconsider your
understanding of the capabilities and the programs in light of the
failure, which by itself might give no new useful information.

One way to solve this for good would be to make the use of capabilities
explicit in the ABI. For example, there could be a system call
dac_override() which would be the only possible way ever to use the
capability CAP_DAC_OVERRIDE and so forth. Then reading source code,
tracing and many other approaches would be useful. But the OS with that
kind of ABI (not Linux) would not be Unix-like at all for any
(potentially) capability using programs, like find(1) or cat(1).

-Topi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24  4:15       ` Andy Lutomirski
  0 siblings, 0 replies; 46+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:15 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Andrew Morton, linux-kernel, Andrew Lutomirski, Serge E. Hallyn,
	Kees Cook, Jonathan Corbet, Tejun Heo, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Thu, Jun 23, 2016 at 6:14 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> On 06/23/16 23:46, Andrew Morton wrote:
>> On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton@gmail.com> wrote:
>>
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find
>>> out useful values for the limits, except blind trial and error.
>>>
>>> Currently, there is no way to know which capabilities are actually used.
>>> Even the source code is only implicit, in-depth knowledge of each
>>> capability must be used when analyzing a program to judge which
>>> capabilities the program will exercise.
>>>
>>> Add a new cgroup controller for monitoring of capabilities
>>> in the cgroup.
>>
>> I'm having trouble understanding how valuable this feature is to our
>> users, and that's a rather important thing!
>>
>> Perhaps it would help if you were to explain your motivation:
>> particular use cases which benefited from this, for example.
>>
>
> It's easy to control with for example systemd or many other tools, which
> capabilities a service should have at the start. But how should a system
> administrator, application developer or distro maintaner ever determine
> a suitable value for this? Currently the only way seems to be to become
> an expert on capabilities, make an educated guess how the set of
> programs in question happen to work in this context and especially how
> they could exercise the capabilites in all possible use cases. Even
> then, the outcome is to just try something to see if that happens to
> work. Reading the source code (if available) does not help very much,
> because the use of capabilities is anything but explicit there.
>
> This is way too difficult, there must be some easier way. The
> information which capabilities actually were used in a trial run gives a
> much better starting point. The users can just use the list of used
> capabilities with configuring the service or when developing or
> maintaining the application. Of course, even that could still fail
> eventually, but then you simply copy the new value of used capabilities
> to the configuration, whereas currently you have to reconsider your
> understanding of the capabilities and the programs in light of the
> failure, which by itself might give no new useful information.
>
> One way to solve this for good would be to make the use of capabilities
> explicit in the ABI. For example, there could be a system call
> dac_override() which would be the only possible way ever to use the
> capability CAP_DAC_OVERRIDE and so forth. Then reading source code,
> tracing and many other approaches would be useful. But the OS with that
> kind of ABI (not Linux) would not be Unix-like at all for any
> (potentially) capability using programs, like find(1) or cat(1).

The problem is that most of the capabilities are so powerful on their
own that limiting services to just a few may be all but useless.

--Andy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24  4:15       ` Andy Lutomirski
  0 siblings, 0 replies; 46+ messages in thread
From: Andy Lutomirski @ 2016-06-24  4:15 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Andrew Lutomirski, Serge E. Hallyn, Kees Cook, Jonathan Corbet,
	Tejun Heo, Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Thu, Jun 23, 2016 at 6:14 PM, Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 06/23/16 23:46, Andrew Morton wrote:
>> On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find
>>> out useful values for the limits, except blind trial and error.
>>>
>>> Currently, there is no way to know which capabilities are actually used.
>>> Even the source code is only implicit, in-depth knowledge of each
>>> capability must be used when analyzing a program to judge which
>>> capabilities the program will exercise.
>>>
>>> Add a new cgroup controller for monitoring of capabilities
>>> in the cgroup.
>>
>> I'm having trouble understanding how valuable this feature is to our
>> users, and that's a rather important thing!
>>
>> Perhaps it would help if you were to explain your motivation:
>> particular use cases which benefited from this, for example.
>>
>
> It's easy to control with for example systemd or many other tools, which
> capabilities a service should have at the start. But how should a system
> administrator, application developer or distro maintaner ever determine
> a suitable value for this? Currently the only way seems to be to become
> an expert on capabilities, make an educated guess how the set of
> programs in question happen to work in this context and especially how
> they could exercise the capabilites in all possible use cases. Even
> then, the outcome is to just try something to see if that happens to
> work. Reading the source code (if available) does not help very much,
> because the use of capabilities is anything but explicit there.
>
> This is way too difficult, there must be some easier way. The
> information which capabilities actually were used in a trial run gives a
> much better starting point. The users can just use the list of used
> capabilities with configuring the service or when developing or
> maintaining the application. Of course, even that could still fail
> eventually, but then you simply copy the new value of used capabilities
> to the configuration, whereas currently you have to reconsider your
> understanding of the capabilities and the programs in light of the
> failure, which by itself might give no new useful information.
>
> One way to solve this for good would be to make the use of capabilities
> explicit in the ABI. For example, there could be a system call
> dac_override() which would be the only possible way ever to use the
> capability CAP_DAC_OVERRIDE and so forth. Then reading source code,
> tracing and many other approaches would be useful. But the OS with that
> kind of ABI (not Linux) would not be Unix-like at all for any
> (potentially) capability using programs, like find(1) or cat(1).

The problem is that most of the capabilities are so powerful on their
own that limiting services to just a few may be all but useless.

--Andy

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24  0:22   ` Topi Miettinen
@ 2016-06-24 15:48     ` Tejun Heo
  2016-06-24 15:59       ` Serge E. Hallyn
  0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2016-06-24 15:48 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: linux-kernel, luto, serge, keescook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello,

On Fri, Jun 24, 2016 at 12:22:54AM +0000, Topi Miettinen wrote:
> > This doesn't have anything to do with resource control and I don't
> > think it's a good idea to add arbitrary monitoring mechanisms to
> > cgroup just because it's easy to add interface there.  Given that
> > capabilities are inherited and modified through the process hierarchy,
> > shouldn't this be part of that?
> 
> With per process tracking, it's easy to miss if a short-lived process
> exercised capabilities. Especially with ambient capabilities, the parent
> process could be a shell script which might not use capabilities at all,
> but its children do the heavy lifting.

But isn't being recursive orthogonal to using cgroup?  Why not account
usages recursively along the process hierarchy?  Capabilities don't
have much to do with cgroup but everything with process hierarchy.
That's how they're distributed and modified.  If monitoring their
usages is necessary, it makes sense to do it in the same structure.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24 15:48     ` Tejun Heo
@ 2016-06-24 15:59       ` Serge E. Hallyn
  2016-06-24 16:35           ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Serge E. Hallyn @ 2016-06-24 15:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Topi Miettinen, linux-kernel, luto, serge, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Quoting Tejun Heo (tj@kernel.org):
> Hello,
> 
> On Fri, Jun 24, 2016 at 12:22:54AM +0000, Topi Miettinen wrote:
> > > This doesn't have anything to do with resource control and I don't
> > > think it's a good idea to add arbitrary monitoring mechanisms to
> > > cgroup just because it's easy to add interface there.  Given that
> > > capabilities are inherited and modified through the process hierarchy,
> > > shouldn't this be part of that?
> > 
> > With per process tracking, it's easy to miss if a short-lived process
> > exercised capabilities. Especially with ambient capabilities, the parent
> > process could be a shell script which might not use capabilities at all,
> > but its children do the heavy lifting.
> 
> But isn't being recursive orthogonal to using cgroup?  Why not account
> usages recursively along the process hierarchy?  Capabilities don't
> have much to do with cgroup but everything with process hierarchy.
> That's how they're distributed and modified.  If monitoring their
> usages is necessary, it makes sense to do it in the same structure.

That was my argument against using cgroups to enforce a new bounding
set.  For tracking though, the cgroup process tracking seems as applicable
to this as it does to systemd tracking of services.  It tracks a task and
the children it forks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24 16:35           ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2016-06-24 16:35 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Topi Miettinen, linux-kernel, luto, keescook, Jonathan Corbet,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello,

On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj@kernel.org):
> > But isn't being recursive orthogonal to using cgroup?  Why not account
> > usages recursively along the process hierarchy?  Capabilities don't
> > have much to do with cgroup but everything with process hierarchy.
> > That's how they're distributed and modified.  If monitoring their
> > usages is necessary, it makes sense to do it in the same structure.
> 
> That was my argument against using cgroups to enforce a new bounding
> set.  For tracking though, the cgroup process tracking seems as applicable
> to this as it does to systemd tracking of services.  It tracks a task and
> the children it forks.

Just monitoring is less jarring than implementing security enforcement
via cgroup, but it is still jarring.  What's wrong with recursive
process hierarchy monitoring which is in line with the whole facility
is implemented anyway?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24 16:35           ` Tejun Heo
  0 siblings, 0 replies; 46+ messages in thread
From: Tejun Heo @ 2016-06-24 16:35 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Topi Miettinen, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	luto-DgEjT+Ai2ygdnm+yROfE0A, keescook-F7+t8E8rja9g9hUCZPvPmw,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello,

On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> > But isn't being recursive orthogonal to using cgroup?  Why not account
> > usages recursively along the process hierarchy?  Capabilities don't
> > have much to do with cgroup but everything with process hierarchy.
> > That's how they're distributed and modified.  If monitoring their
> > usages is necessary, it makes sense to do it in the same structure.
> 
> That was my argument against using cgroups to enforce a new bounding
> set.  For tracking though, the cgroup process tracking seems as applicable
> to this as it does to systemd tracking of services.  It tracks a task and
> the children it forks.

Just monitoring is less jarring than implementing security enforcement
via cgroup, but it is still jarring.  What's wrong with recursive
process hierarchy monitoring which is in line with the whole facility
is implemented anyway?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24 16:35           ` Tejun Heo
  (?)
@ 2016-06-24 16:59           ` Serge E. Hallyn
  2016-06-24 17:21               ` Eric W. Biederman
  2016-06-24 17:24             ` Tejun Heo
  -1 siblings, 2 replies; 46+ messages in thread
From: Serge E. Hallyn @ 2016-06-24 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, Topi Miettinen, linux-kernel, luto, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Quoting Tejun Heo (tj@kernel.org):
> Hello,
> 
> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
> > Quoting Tejun Heo (tj@kernel.org):
> > > But isn't being recursive orthogonal to using cgroup?  Why not account
> > > usages recursively along the process hierarchy?  Capabilities don't
> > > have much to do with cgroup but everything with process hierarchy.
> > > That's how they're distributed and modified.  If monitoring their
> > > usages is necessary, it makes sense to do it in the same structure.
> > 
> > That was my argument against using cgroups to enforce a new bounding
> > set.  For tracking though, the cgroup process tracking seems as applicable
> > to this as it does to systemd tracking of services.  It tracks a task and
> > the children it forks.
> 
> Just monitoring is less jarring than implementing security enforcement
> via cgroup, but it is still jarring.  What's wrong with recursive
> process hierarchy monitoring which is in line with the whole facility
> is implemented anyway?

As I think Topi pointed out, one shortcoming is that if there is a short-lived
child task, using its /proc/self/status is racy.  You might just miss that it
ever even existed, let alone that the "application" needed it.

Another alternative we've both mentioned is to use systemtap.  That's not
as nice a solution as a cgroup, but then again this isn't really a common
case, so maybe it is precisely what a tracing infrastructure is meant for.

-serge

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24 16:59           ` Serge E. Hallyn
@ 2016-06-24 17:21               ` Eric W. Biederman
  2016-06-24 17:24             ` Tejun Heo
  1 sibling, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2016-06-24 17:21 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Topi Miettinen, linux-kernel, luto, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Tejun Heo (tj@kernel.org):
>> Hello,
>> 
>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>> > Quoting Tejun Heo (tj@kernel.org):
>> > > But isn't being recursive orthogonal to using cgroup?  Why not account
>> > > usages recursively along the process hierarchy?  Capabilities don't
>> > > have much to do with cgroup but everything with process hierarchy.
>> > > That's how they're distributed and modified.  If monitoring their
>> > > usages is necessary, it makes sense to do it in the same structure.
>> > 
>> > That was my argument against using cgroups to enforce a new bounding
>> > set.  For tracking though, the cgroup process tracking seems as applicable
>> > to this as it does to systemd tracking of services.  It tracks a task and
>> > the children it forks.
>> 
>> Just monitoring is less jarring than implementing security enforcement
>> via cgroup, but it is still jarring.  What's wrong with recursive
>> process hierarchy monitoring which is in line with the whole facility
>> is implemented anyway?
>
> As I think Topi pointed out, one shortcoming is that if there is a short-lived
> child task, using its /proc/self/status is racy.  You might just miss that it
> ever even existed, let alone that the "application" needed it.
>
> Another alternative we've both mentioned is to use systemtap.  That's not
> as nice a solution as a cgroup, but then again this isn't really a common
> case, so maybe it is precisely what a tracing infrastructure is meant for.

Hmm.

We have capability use wired up into auditing.  So we might be able to
get away with just adding an appropriate audit message in
commoncap.c:cap_capable that honors the audit flag and logs an audit
message.  The hook in selinux already appears to do that.

Certainly audit sounds like the subsystem for this kind of work, as it's
whole point in life is logging things, then something in userspace can
just run over the audit longs and build a nice summary.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-24 17:21               ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2016-06-24 17:21 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Tejun Heo, Topi Miettinen, linux-kernel, luto, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Tejun Heo (tj@kernel.org):
>> Hello,
>> 
>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>> > Quoting Tejun Heo (tj@kernel.org):
>> > > But isn't being recursive orthogonal to using cgroup?  Why not account
>> > > usages recursively along the process hierarchy?  Capabilities don't
>> > > have much to do with cgroup but everything with process hierarchy.
>> > > That's how they're distributed and modified.  If monitoring their
>> > > usages is necessary, it makes sense to do it in the same structure.
>> > 
>> > That was my argument against using cgroups to enforce a new bounding
>> > set.  For tracking though, the cgroup process tracking seems as applicable
>> > to this as it does to systemd tracking of services.  It tracks a task and
>> > the children it forks.
>> 
>> Just monitoring is less jarring than implementing security enforcement
>> via cgroup, but it is still jarring.  What's wrong with recursive
>> process hierarchy monitoring which is in line with the whole facility
>> is implemented anyway?
>
> As I think Topi pointed out, one shortcoming is that if there is a short-lived
> child task, using its /proc/self/status is racy.  You might just miss that it
> ever even existed, let alone that the "application" needed it.
>
> Another alternative we've both mentioned is to use systemtap.  That's not
> as nice a solution as a cgroup, but then again this isn't really a common
> case, so maybe it is precisely what a tracing infrastructure is meant for.

Hmm.

We have capability use wired up into auditing.  So we might be able to
get away with just adding an appropriate audit message in
commoncap.c:cap_capable that honors the audit flag and logs an audit
message.  The hook in selinux already appears to do that.

Certainly audit sounds like the subsystem for this kind of work, as it's
whole point in life is logging things, then something in userspace can
just run over the audit longs and build a nice summary.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24 16:59           ` Serge E. Hallyn
  2016-06-24 17:21               ` Eric W. Biederman
@ 2016-06-24 17:24             ` Tejun Heo
  2016-06-26 19:14               ` Topi Miettinen
  1 sibling, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2016-06-24 17:24 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Topi Miettinen, linux-kernel, luto, keescook, Jonathan Corbet,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello, Serge.

On Fri, Jun 24, 2016 at 11:59:10AM -0500, Serge E. Hallyn wrote:
> > Just monitoring is less jarring than implementing security enforcement
> > via cgroup, but it is still jarring.  What's wrong with recursive
> > process hierarchy monitoring which is in line with the whole facility
> > is implemented anyway?
> 
> As I think Topi pointed out, one shortcoming is that if there is a short-lived
> child task, using its /proc/self/status is racy.  You might just miss that it
> ever even existed, let alone that the "application" needed it.

But the parent can collect whatever its children used.  We already do
that with other stats.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24 17:21               ` Eric W. Biederman
  (?)
@ 2016-06-24 17:39               ` Serge E. Hallyn
  -1 siblings, 0 replies; 46+ messages in thread
From: Serge E. Hallyn @ 2016-06-24 17:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Tejun Heo, Topi Miettinen, linux-kernel, luto,
	keescook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, Andrew Morton, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
> > Quoting Tejun Heo (tj@kernel.org):
> >> Hello,
> >> 
> >> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
> >> > Quoting Tejun Heo (tj@kernel.org):
> >> > > But isn't being recursive orthogonal to using cgroup?  Why not account
> >> > > usages recursively along the process hierarchy?  Capabilities don't
> >> > > have much to do with cgroup but everything with process hierarchy.
> >> > > That's how they're distributed and modified.  If monitoring their
> >> > > usages is necessary, it makes sense to do it in the same structure.
> >> > 
> >> > That was my argument against using cgroups to enforce a new bounding
> >> > set.  For tracking though, the cgroup process tracking seems as applicable
> >> > to this as it does to systemd tracking of services.  It tracks a task and
> >> > the children it forks.
> >> 
> >> Just monitoring is less jarring than implementing security enforcement
> >> via cgroup, but it is still jarring.  What's wrong with recursive
> >> process hierarchy monitoring which is in line with the whole facility
> >> is implemented anyway?
> >
> > As I think Topi pointed out, one shortcoming is that if there is a short-lived
> > child task, using its /proc/self/status is racy.  You might just miss that it
> > ever even existed, let alone that the "application" needed it.
> >
> > Another alternative we've both mentioned is to use systemtap.  That's not
> > as nice a solution as a cgroup, but then again this isn't really a common
> > case, so maybe it is precisely what a tracing infrastructure is meant for.
> 
> Hmm.
> 
> We have capability use wired up into auditing.  So we might be able to
> get away with just adding an appropriate audit message in
> commoncap.c:cap_capable that honors the audit flag and logs an audit
> message.  The hook in selinux already appears to do that.
> 
> Certainly audit sounds like the subsystem for this kind of work, as it's
> whole point in life is logging things, then something in userspace can
> just run over the audit longs and build a nice summary.

Good point, so long as we can also track ppid or fork info (using
taskstats?) that would seem the best way.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-25 18:00         ` Djalal Harouni
  0 siblings, 0 replies; 46+ messages in thread
From: Djalal Harouni @ 2016-06-25 18:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Topi Miettinen, Andrew Morton, linux-kernel, Andrew Lutomirski,
	Serge E. Hallyn, Kees Cook, Jonathan Corbet, Tejun Heo, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Fri, Jun 24, 2016 at 6:15 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jun 23, 2016 at 6:14 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>> On 06/23/16 23:46, Andrew Morton wrote:
>>> On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton@gmail.com> wrote:
>>>
>>>> There are many basic ways to control processes, including capabilities,
>>>> cgroups and resource limits. However, there are far fewer ways to find
>>>> out useful values for the limits, except blind trial and error.
>>>>
>>>> Currently, there is no way to know which capabilities are actually used.
>>>> Even the source code is only implicit, in-depth knowledge of each
>>>> capability must be used when analyzing a program to judge which
>>>> capabilities the program will exercise.
>>>>
>>>> Add a new cgroup controller for monitoring of capabilities
>>>> in the cgroup.
>>>
>>> I'm having trouble understanding how valuable this feature is to our
>>> users, and that's a rather important thing!
>>>
>>> Perhaps it would help if you were to explain your motivation:
>>> particular use cases which benefited from this, for example.
>>>
>>
>> It's easy to control with for example systemd or many other tools, which
>> capabilities a service should have at the start. But how should a system
>> administrator, application developer or distro maintaner ever determine
>> a suitable value for this? Currently the only way seems to be to become
>> an expert on capabilities, make an educated guess how the set of
>> programs in question happen to work in this context and especially how
>> they could exercise the capabilites in all possible use cases. Even
>> then, the outcome is to just try something to see if that happens to
>> work. Reading the source code (if available) does not help very much,
>> because the use of capabilities is anything but explicit there.
>>
>> This is way too difficult, there must be some easier way. The
>> information which capabilities actually were used in a trial run gives a
>> much better starting point. The users can just use the list of used
>> capabilities with configuring the service or when developing or
>> maintaining the application. Of course, even that could still fail
>> eventually, but then you simply copy the new value of used capabilities
>> to the configuration, whereas currently you have to reconsider your
>> understanding of the capabilities and the programs in light of the
>> failure, which by itself might give no new useful information.
>>
>> One way to solve this for good would be to make the use of capabilities
>> explicit in the ABI. For example, there could be a system call
>> dac_override() which would be the only possible way ever to use the
>> capability CAP_DAC_OVERRIDE and so forth. Then reading source code,
>> tracing and many other approaches would be useful. But the OS with that
>> kind of ABI (not Linux) would not be Unix-like at all for any
>> (potentially) capability using programs, like find(1) or cat(1).
>
> The problem is that most of the capabilities are so powerful on their
> own that limiting services to just a few may be all but useless.

May be there is some gain _if_ the resources that a process interact
with _can_ also be made invisible with namespaces.



> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
tixxdz
http://opendz.org

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-25 18:00         ` Djalal Harouni
  0 siblings, 0 replies; 46+ messages in thread
From: Djalal Harouni @ 2016-06-25 18:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Topi Miettinen, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Lutomirski,
	Serge E. Hallyn, Kees Cook, Jonathan Corbet, Tejun Heo, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Fri, Jun 24, 2016 at 6:15 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Thu, Jun 23, 2016 at 6:14 PM, Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On 06/23/16 23:46, Andrew Morton wrote:
>>> On Thu, 23 Jun 2016 18:07:10 +0300 Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>
>>>> There are many basic ways to control processes, including capabilities,
>>>> cgroups and resource limits. However, there are far fewer ways to find
>>>> out useful values for the limits, except blind trial and error.
>>>>
>>>> Currently, there is no way to know which capabilities are actually used.
>>>> Even the source code is only implicit, in-depth knowledge of each
>>>> capability must be used when analyzing a program to judge which
>>>> capabilities the program will exercise.
>>>>
>>>> Add a new cgroup controller for monitoring of capabilities
>>>> in the cgroup.
>>>
>>> I'm having trouble understanding how valuable this feature is to our
>>> users, and that's a rather important thing!
>>>
>>> Perhaps it would help if you were to explain your motivation:
>>> particular use cases which benefited from this, for example.
>>>
>>
>> It's easy to control with for example systemd or many other tools, which
>> capabilities a service should have at the start. But how should a system
>> administrator, application developer or distro maintaner ever determine
>> a suitable value for this? Currently the only way seems to be to become
>> an expert on capabilities, make an educated guess how the set of
>> programs in question happen to work in this context and especially how
>> they could exercise the capabilites in all possible use cases. Even
>> then, the outcome is to just try something to see if that happens to
>> work. Reading the source code (if available) does not help very much,
>> because the use of capabilities is anything but explicit there.
>>
>> This is way too difficult, there must be some easier way. The
>> information which capabilities actually were used in a trial run gives a
>> much better starting point. The users can just use the list of used
>> capabilities with configuring the service or when developing or
>> maintaining the application. Of course, even that could still fail
>> eventually, but then you simply copy the new value of used capabilities
>> to the configuration, whereas currently you have to reconsider your
>> understanding of the capabilities and the programs in light of the
>> failure, which by itself might give no new useful information.
>>
>> One way to solve this for good would be to make the use of capabilities
>> explicit in the ABI. For example, there could be a system call
>> dac_override() which would be the only possible way ever to use the
>> capability CAP_DAC_OVERRIDE and so forth. Then reading source code,
>> tracing and many other approaches would be useful. But the OS with that
>> kind of ABI (not Linux) would not be Unix-like at all for any
>> (potentially) capability using programs, like find(1) or cat(1).
>
> The problem is that most of the capabilities are so powerful on their
> own that limiting services to just a few may be all but useless.

May be there is some gain _if_ the resources that a process interact
with _can_ also be made invisible with namespaces.



> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
tixxdz
http://opendz.org

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-26 19:03                 ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-06-26 19:03 UTC (permalink / raw)
  To: Eric W. Biederman, Serge E. Hallyn
  Cc: Tejun Heo, linux-kernel, luto, keescook, Jonathan Corbet,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/24/16 17:21, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> 
>> Quoting Tejun Heo (tj@kernel.org):
>>> Hello,
>>>
>>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>>>> Quoting Tejun Heo (tj@kernel.org):
>>>>> But isn't being recursive orthogonal to using cgroup?  Why not account
>>>>> usages recursively along the process hierarchy?  Capabilities don't
>>>>> have much to do with cgroup but everything with process hierarchy.
>>>>> That's how they're distributed and modified.  If monitoring their
>>>>> usages is necessary, it makes sense to do it in the same structure.
>>>>
>>>> That was my argument against using cgroups to enforce a new bounding
>>>> set.  For tracking though, the cgroup process tracking seems as applicable
>>>> to this as it does to systemd tracking of services.  It tracks a task and
>>>> the children it forks.
>>>
>>> Just monitoring is less jarring than implementing security enforcement
>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>> process hierarchy monitoring which is in line with the whole facility
>>> is implemented anyway?
>>
>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>> child task, using its /proc/self/status is racy.  You might just miss that it
>> ever even existed, let alone that the "application" needed it.
>>
>> Another alternative we've both mentioned is to use systemtap.  That's not
>> as nice a solution as a cgroup, but then again this isn't really a common
>> case, so maybe it is precisely what a tracing infrastructure is meant for.
> 
> Hmm.
> 
> We have capability use wired up into auditing.  So we might be able to
> get away with just adding an appropriate audit message in
> commoncap.c:cap_capable that honors the audit flag and logs an audit
> message.  The hook in selinux already appears to do that.
> 
> Certainly audit sounds like the subsystem for this kind of work, as it's
> whole point in life is logging things, then something in userspace can
> just run over the audit longs and build a nice summary.

Even simpler would be to avoid the complexity of audit subsystem and
just printk() when a task starts using a capability first time (not on
further uses by same task). There are not that many capability bits nor
privileged processes, meaning not too many log entries. I know as this
was actually my first approach. But it's also far less user friendly
than just reading a summarized value which could be directly fed back to
configuration.

Logging/auditing approach also doesn't work well for other things I'd
like to present meaningful values for the user. For example, consider
RLIMIT_AS, where my goal is also to enable the users to be able to
configure this limit for a service. Should there be an audit message
whenever the address space limit grows (i.e. each mmap())? What about
when it shrinks? For RLIMIT_NOFILE we'd have to report each
open()/close()/dup()/socket()/etc. and track how many are opened at the
same time. I think it's better to store the fully cooked (meaningful to
user) value in kernel and present it only when asked.

-Topi

> 
> Eric
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-26 19:03                 ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-06-26 19:03 UTC (permalink / raw)
  To: Eric W. Biederman, Serge E. Hallyn
  Cc: Tejun Heo, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	luto-DgEjT+Ai2ygdnm+yROfE0A, keescook-F7+t8E8rja9g9hUCZPvPmw,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/24/16 17:21, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
>> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>>> Hello,
>>>
>>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>>>> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>>>>> But isn't being recursive orthogonal to using cgroup?  Why not account
>>>>> usages recursively along the process hierarchy?  Capabilities don't
>>>>> have much to do with cgroup but everything with process hierarchy.
>>>>> That's how they're distributed and modified.  If monitoring their
>>>>> usages is necessary, it makes sense to do it in the same structure.
>>>>
>>>> That was my argument against using cgroups to enforce a new bounding
>>>> set.  For tracking though, the cgroup process tracking seems as applicable
>>>> to this as it does to systemd tracking of services.  It tracks a task and
>>>> the children it forks.
>>>
>>> Just monitoring is less jarring than implementing security enforcement
>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>> process hierarchy monitoring which is in line with the whole facility
>>> is implemented anyway?
>>
>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>> child task, using its /proc/self/status is racy.  You might just miss that it
>> ever even existed, let alone that the "application" needed it.
>>
>> Another alternative we've both mentioned is to use systemtap.  That's not
>> as nice a solution as a cgroup, but then again this isn't really a common
>> case, so maybe it is precisely what a tracing infrastructure is meant for.
> 
> Hmm.
> 
> We have capability use wired up into auditing.  So we might be able to
> get away with just adding an appropriate audit message in
> commoncap.c:cap_capable that honors the audit flag and logs an audit
> message.  The hook in selinux already appears to do that.
> 
> Certainly audit sounds like the subsystem for this kind of work, as it's
> whole point in life is logging things, then something in userspace can
> just run over the audit longs and build a nice summary.

Even simpler would be to avoid the complexity of audit subsystem and
just printk() when a task starts using a capability first time (not on
further uses by same task). There are not that many capability bits nor
privileged processes, meaning not too many log entries. I know as this
was actually my first approach. But it's also far less user friendly
than just reading a summarized value which could be directly fed back to
configuration.

Logging/auditing approach also doesn't work well for other things I'd
like to present meaningful values for the user. For example, consider
RLIMIT_AS, where my goal is also to enable the users to be able to
configure this limit for a service. Should there be an audit message
whenever the address space limit grows (i.e. each mmap())? What about
when it shrinks? For RLIMIT_NOFILE we'd have to report each
open()/close()/dup()/socket()/etc. and track how many are opened at the
same time. I think it's better to store the fully cooked (meaningful to
user) value in kernel and present it only when asked.

-Topi

> 
> Eric
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-24 17:24             ` Tejun Heo
@ 2016-06-26 19:14               ` Topi Miettinen
  2016-06-26 22:26                 ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Topi Miettinen @ 2016-06-26 19:14 UTC (permalink / raw)
  To: Tejun Heo, Serge E. Hallyn
  Cc: linux-kernel, luto, keescook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/24/16 17:24, Tejun Heo wrote:
> Hello, Serge.
> 
> On Fri, Jun 24, 2016 at 11:59:10AM -0500, Serge E. Hallyn wrote:
>>> Just monitoring is less jarring than implementing security enforcement
>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>> process hierarchy monitoring which is in line with the whole facility
>>> is implemented anyway?
>>
>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>> child task, using its /proc/self/status is racy.  You might just miss that it
>> ever even existed, let alone that the "application" needed it.
> 
> But the parent can collect whatever its children used.  We already do
> that with other stats.

The parent might be able do it if proc/pid/xyz files are still
accessible after child exit but before its exit status is collected. But
if the parent doesn't do it (and you are not able to change it to do it)
and it collects the exit status without collecting other info, can you
suggest a different way how another process could collect it 100% reliably?

-Topi

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-26 19:14               ` Topi Miettinen
@ 2016-06-26 22:26                 ` Tejun Heo
  2016-06-27 14:54                   ` Serge E. Hallyn
  0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2016-06-26 22:26 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Serge E. Hallyn, lkml, luto, Kees Cook, Jonathan Corbet,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello, Topi.

On Sun, Jun 26, 2016 at 3:14 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> The parent might be able do it if proc/pid/xyz files are still
> accessible after child exit but before its exit status is collected. But
> if the parent doesn't do it (and you are not able to change it to do it)
> and it collects the exit status without collecting other info, can you
> suggest a different way how another process could collect it 100% reliably?

I'm not saying that there's such mechanism now. I'm suggesting that
that'd be a more fitting way of implementing a new mechanism to track
capability usages.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-26 22:26                 ` Tejun Heo
@ 2016-06-27 14:54                   ` Serge E. Hallyn
  2016-06-27 19:10                     ` Topi Miettinen
  0 siblings, 1 reply; 46+ messages in thread
From: Serge E. Hallyn @ 2016-06-27 14:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Topi Miettinen, Serge E. Hallyn, lkml, luto, Kees Cook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Quoting Tejun Heo (tj@kernel.org):
> Hello, Topi.
> 
> On Sun, Jun 26, 2016 at 3:14 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> > The parent might be able do it if proc/pid/xyz files are still
> > accessible after child exit but before its exit status is collected. But
> > if the parent doesn't do it (and you are not able to change it to do it)
> > and it collects the exit status without collecting other info, can you
> > suggest a different way how another process could collect it 100% reliably?
> 
> I'm not saying that there's such mechanism now. I'm suggesting that
> that'd be a more fitting way of implementing a new mechanism to track
> capability usages.

Hi Topi,

I think Eric was right a few emails earlier that the audit subsystem is
really the most appropriate answer to this.  (Perhaps sysctl-controllered?)
Combined with taskstats it would give you what you need.  Or you could even
use an empty new named cgroup controller, say 'none,name=caps', and then
look only at audit results for cgroup '/myapp' in the caps hierarchy.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-27 14:54                   ` Serge E. Hallyn
@ 2016-06-27 19:10                     ` Topi Miettinen
  2016-06-27 19:17                       ` Tejun Heo
  0 siblings, 1 reply; 46+ messages in thread
From: Topi Miettinen @ 2016-06-27 19:10 UTC (permalink / raw)
  To: Serge E. Hallyn, Tejun Heo
  Cc: lkml, luto, Kees Cook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/27/16 14:54, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj@kernel.org):
>> Hello, Topi.
>>
>> On Sun, Jun 26, 2016 at 3:14 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>>> The parent might be able do it if proc/pid/xyz files are still
>>> accessible after child exit but before its exit status is collected. But
>>> if the parent doesn't do it (and you are not able to change it to do it)
>>> and it collects the exit status without collecting other info, can you
>>> suggest a different way how another process could collect it 100% reliably?
>>
>> I'm not saying that there's such mechanism now. I'm suggesting that
>> that'd be a more fitting way of implementing a new mechanism to track
>> capability usages.
> 
> Hi Topi,
> 
> I think Eric was right a few emails earlier that the audit subsystem is
> really the most appropriate answer to this.  (Perhaps sysctl-controllered?)
> Combined with taskstats it would give you what you need.  Or you could even
> use an empty new named cgroup controller, say 'none,name=caps', and then
> look only at audit results for cgroup '/myapp' in the caps hierarchy.
> 

I'll have to study these more. But from what I saw so far, it looks to
me that a separate tool would be needed to read taskstats and if that
tool is not taken by distros, the users would not be any wiser, right?
With cgroup (or /proc), no new tools would be needed.

-Topi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-27 19:10                     ` Topi Miettinen
@ 2016-06-27 19:17                       ` Tejun Heo
  2016-06-27 19:49                           ` Serge E. Hallyn
  0 siblings, 1 reply; 46+ messages in thread
From: Tejun Heo @ 2016-06-27 19:17 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Serge E. Hallyn, lkml, luto, Kees Cook, Jonathan Corbet,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Hello,

On Mon, Jun 27, 2016 at 3:10 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> I'll have to study these more. But from what I saw so far, it looks to
> me that a separate tool would be needed to read taskstats and if that
> tool is not taken by distros, the users would not be any wiser, right?
> With cgroup (or /proc), no new tools would be needed.

That is a factor but shouldn't be a deciding factor in designing our
user-facing interfaces. Please also note that kernel source tree
already has tools/ subdirectory which contains userland tools which
are distributed along with the kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-27 19:49                           ` Serge E. Hallyn
  0 siblings, 0 replies; 46+ messages in thread
From: Serge E. Hallyn @ 2016-06-27 19:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Topi Miettinen, Serge E. Hallyn, lkml, luto, Kees Cook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Quoting Tejun Heo (tj@kernel.org):
> Hello,
> 
> On Mon, Jun 27, 2016 at 3:10 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
> > I'll have to study these more. But from what I saw so far, it looks to
> > me that a separate tool would be needed to read taskstats and if that
> > tool is not taken by distros, the users would not be any wiser, right?
> > With cgroup (or /proc), no new tools would be needed.
> 
> That is a factor but shouldn't be a deciding factor in designing our
> user-facing interfaces. Please also note that kernel source tree
> already has tools/ subdirectory which contains userland tools which
> are distributed along with the kernel.

And, if you take audit+cgroup approach then no tools are needed.  So long
as you can have audit print out the cgroups for a task as part of the
capability audit record.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-27 19:49                           ` Serge E. Hallyn
  0 siblings, 0 replies; 46+ messages in thread
From: Serge E. Hallyn @ 2016-06-27 19:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Topi Miettinen, Serge E. Hallyn, lkml,
	luto-DgEjT+Ai2ygdnm+yROfE0A, Kees Cook, Jonathan Corbet,
	Li Zefan, Johannes Weiner, Serge Hallyn, James Morris,
	Andrew Morton, David Howells, David Woodhouse, Ard Biesheuvel,
	Paul E. McKenney, Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello,
> 
> On Mon, Jun 27, 2016 at 3:10 PM, Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > I'll have to study these more. But from what I saw so far, it looks to
> > me that a separate tool would be needed to read taskstats and if that
> > tool is not taken by distros, the users would not be any wiser, right?
> > With cgroup (or /proc), no new tools would be needed.
> 
> That is a factor but shouldn't be a deciding factor in designing our
> user-facing interfaces. Please also note that kernel source tree
> already has tools/ subdirectory which contains userland tools which
> are distributed along with the kernel.

And, if you take audit+cgroup approach then no tools are needed.  So long
as you can have audit print out the cgroups for a task as part of the
capability audit record.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-26 19:03                 ` Topi Miettinen
@ 2016-06-28  4:57                   ` Eric W. Biederman
  -1 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2016-06-28  4:57 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Serge E. Hallyn, Tejun Heo, linux-kernel, luto, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Topi Miettinen <toiwoton@gmail.com> writes:

> On 06/24/16 17:21, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>>> Quoting Tejun Heo (tj@kernel.org):
>>>> Hello,
>>>>
>>>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>>>>> Quoting Tejun Heo (tj@kernel.org):
>>>>>> But isn't being recursive orthogonal to using cgroup?  Why not account
>>>>>> usages recursively along the process hierarchy?  Capabilities don't
>>>>>> have much to do with cgroup but everything with process hierarchy.
>>>>>> That's how they're distributed and modified.  If monitoring their
>>>>>> usages is necessary, it makes sense to do it in the same structure.
>>>>>
>>>>> That was my argument against using cgroups to enforce a new bounding
>>>>> set.  For tracking though, the cgroup process tracking seems as applicable
>>>>> to this as it does to systemd tracking of services.  It tracks a task and
>>>>> the children it forks.
>>>>
>>>> Just monitoring is less jarring than implementing security enforcement
>>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>>> process hierarchy monitoring which is in line with the whole facility
>>>> is implemented anyway?
>>>
>>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>>> child task, using its /proc/self/status is racy.  You might just miss that it
>>> ever even existed, let alone that the "application" needed it.
>>>
>>> Another alternative we've both mentioned is to use systemtap.  That's not
>>> as nice a solution as a cgroup, but then again this isn't really a common
>>> case, so maybe it is precisely what a tracing infrastructure is meant for.
>> 
>> Hmm.
>> 
>> We have capability use wired up into auditing.  So we might be able to
>> get away with just adding an appropriate audit message in
>> commoncap.c:cap_capable that honors the audit flag and logs an audit
>> message.  The hook in selinux already appears to do that.
>> 
>> Certainly audit sounds like the subsystem for this kind of work, as it's
>> whole point in life is logging things, then something in userspace can
>> just run over the audit longs and build a nice summary.
>
> Even simpler would be to avoid the complexity of audit subsystem and
> just printk() when a task starts using a capability first time (not on
> further uses by same task). There are not that many capability bits nor
> privileged processes, meaning not too many log entries. I know as this
> was actually my first approach. But it's also far less user friendly
> than just reading a summarized value which could be directly fed back to
> configuration.

Your loss.

> Logging/auditing approach also doesn't work well for other things I'd
> like to present meaningful values for the user. For example, consider
> RLIMIT_AS, where my goal is also to enable the users to be able to
> configure this limit for a service. Should there be an audit message
> whenever the address space limit grows (i.e. each mmap())? What about
> when it shrinks? For RLIMIT_NOFILE we'd have to report each
> open()/close()/dup()/socket()/etc. and track how many are opened at the
> same time. I think it's better to store the fully cooked (meaningful to
> user) value in kernel and present it only when asked.

That doesn't have anything to do with anything.

My suggestion was very much to do with capabilities which are already
logged with the audit subsystem with selinux.  The idea was to move
those audit calls into commoncap where they arguably belong allow anyone
to use them for anything.

That is a non-controversial code cleanup that happens to cover your
special case.  That is enough to build a tool in userspace that will
tell you which capabilities you need without penalizing the kernel, or
the vast majority of everyone who does not use your feature.

>From what I have seen of this conversation there is not and will not be
one interface to rule them all.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-06-28  4:57                   ` Eric W. Biederman
  0 siblings, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2016-06-28  4:57 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Serge E. Hallyn, Tejun Heo, linux-kernel, luto, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

Topi Miettinen <toiwoton@gmail.com> writes:

> On 06/24/16 17:21, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> 
>>> Quoting Tejun Heo (tj@kernel.org):
>>>> Hello,
>>>>
>>>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>>>>> Quoting Tejun Heo (tj@kernel.org):
>>>>>> But isn't being recursive orthogonal to using cgroup?  Why not account
>>>>>> usages recursively along the process hierarchy?  Capabilities don't
>>>>>> have much to do with cgroup but everything with process hierarchy.
>>>>>> That's how they're distributed and modified.  If monitoring their
>>>>>> usages is necessary, it makes sense to do it in the same structure.
>>>>>
>>>>> That was my argument against using cgroups to enforce a new bounding
>>>>> set.  For tracking though, the cgroup process tracking seems as applicable
>>>>> to this as it does to systemd tracking of services.  It tracks a task and
>>>>> the children it forks.
>>>>
>>>> Just monitoring is less jarring than implementing security enforcement
>>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>>> process hierarchy monitoring which is in line with the whole facility
>>>> is implemented anyway?
>>>
>>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>>> child task, using its /proc/self/status is racy.  You might just miss that it
>>> ever even existed, let alone that the "application" needed it.
>>>
>>> Another alternative we've both mentioned is to use systemtap.  That's not
>>> as nice a solution as a cgroup, but then again this isn't really a common
>>> case, so maybe it is precisely what a tracing infrastructure is meant for.
>> 
>> Hmm.
>> 
>> We have capability use wired up into auditing.  So we might be able to
>> get away with just adding an appropriate audit message in
>> commoncap.c:cap_capable that honors the audit flag and logs an audit
>> message.  The hook in selinux already appears to do that.
>> 
>> Certainly audit sounds like the subsystem for this kind of work, as it's
>> whole point in life is logging things, then something in userspace can
>> just run over the audit longs and build a nice summary.
>
> Even simpler would be to avoid the complexity of audit subsystem and
> just printk() when a task starts using a capability first time (not on
> further uses by same task). There are not that many capability bits nor
> privileged processes, meaning not too many log entries. I know as this
> was actually my first approach. But it's also far less user friendly
> than just reading a summarized value which could be directly fed back to
> configuration.

Your loss.

> Logging/auditing approach also doesn't work well for other things I'd
> like to present meaningful values for the user. For example, consider
> RLIMIT_AS, where my goal is also to enable the users to be able to
> configure this limit for a service. Should there be an audit message
> whenever the address space limit grows (i.e. each mmap())? What about
> when it shrinks? For RLIMIT_NOFILE we'd have to report each
> open()/close()/dup()/socket()/etc. and track how many are opened at the
> same time. I think it's better to store the fully cooked (meaningful to
> user) value in kernel and present it only when asked.

That doesn't have anything to do with anything.

My suggestion was very much to do with capabilities which are already
logged with the audit subsystem with selinux.  The idea was to move
those audit calls into commoncap where they arguably belong allow anyone
to use them for anything.

That is a non-controversial code cleanup that happens to cover your
special case.  That is enough to build a tool in userspace that will
tell you which capabilities you need without penalizing the kernel, or
the vast majority of everyone who does not use your feature.

From what I have seen of this conversation there is not and will not be
one interface to rule them all.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-07-02 11:20                     ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-02 11:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Tejun Heo, linux-kernel, luto, keescook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/28/16 04:57, Eric W. Biederman wrote:
> Topi Miettinen <toiwoton@gmail.com> writes:
> 
>> On 06/24/16 17:21, Eric W. Biederman wrote:
>>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>>
>>>> Quoting Tejun Heo (tj@kernel.org):
>>>>> Hello,
>>>>>
>>>>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>>>>>> Quoting Tejun Heo (tj@kernel.org):
>>>>>>> But isn't being recursive orthogonal to using cgroup?  Why not account
>>>>>>> usages recursively along the process hierarchy?  Capabilities don't
>>>>>>> have much to do with cgroup but everything with process hierarchy.
>>>>>>> That's how they're distributed and modified.  If monitoring their
>>>>>>> usages is necessary, it makes sense to do it in the same structure.
>>>>>>
>>>>>> That was my argument against using cgroups to enforce a new bounding
>>>>>> set.  For tracking though, the cgroup process tracking seems as applicable
>>>>>> to this as it does to systemd tracking of services.  It tracks a task and
>>>>>> the children it forks.
>>>>>
>>>>> Just monitoring is less jarring than implementing security enforcement
>>>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>>>> process hierarchy monitoring which is in line with the whole facility
>>>>> is implemented anyway?
>>>>
>>>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>>>> child task, using its /proc/self/status is racy.  You might just miss that it
>>>> ever even existed, let alone that the "application" needed it.
>>>>
>>>> Another alternative we've both mentioned is to use systemtap.  That's not
>>>> as nice a solution as a cgroup, but then again this isn't really a common
>>>> case, so maybe it is precisely what a tracing infrastructure is meant for.
>>>
>>> Hmm.
>>>
>>> We have capability use wired up into auditing.  So we might be able to
>>> get away with just adding an appropriate audit message in
>>> commoncap.c:cap_capable that honors the audit flag and logs an audit
>>> message.  The hook in selinux already appears to do that.
>>>
>>> Certainly audit sounds like the subsystem for this kind of work, as it's
>>> whole point in life is logging things, then something in userspace can
>>> just run over the audit longs and build a nice summary.
>>
>> Even simpler would be to avoid the complexity of audit subsystem and
>> just printk() when a task starts using a capability first time (not on
>> further uses by same task). There are not that many capability bits nor
>> privileged processes, meaning not too many log entries. I know as this
>> was actually my first approach. But it's also far less user friendly
>> than just reading a summarized value which could be directly fed back to
>> configuration.
> 
> Your loss.
> 
>> Logging/auditing approach also doesn't work well for other things I'd
>> like to present meaningful values for the user. For example, consider
>> RLIMIT_AS, where my goal is also to enable the users to be able to
>> configure this limit for a service. Should there be an audit message
>> whenever the address space limit grows (i.e. each mmap())? What about
>> when it shrinks? For RLIMIT_NOFILE we'd have to report each
>> open()/close()/dup()/socket()/etc. and track how many are opened at the
>> same time. I think it's better to store the fully cooked (meaningful to
>> user) value in kernel and present it only when asked.
> 
> That doesn't have anything to do with anything.
> 
> My suggestion was very much to do with capabilities which are already
> logged with the audit subsystem with selinux.  The idea was to move
> those audit calls into commoncap where they arguably belong allow anyone
> to use them for anything.
> 
> That is a non-controversial code cleanup that happens to cover your
> special case.  That is enough to build a tool in userspace that will
> tell you which capabilities you need without penalizing the kernel, or
> the vast majority of everyone who does not use your feature.
> 
> From what I have seen of this conversation there is not and will not be
> one interface to rule them all.

Now that I know taskstats better, it looks like a good choice for most
of the highwater marks, complemented with audit logging. The taskstats
interface is only available to privileged processes but that's OK. I'll
make new patches based on this approach.

-Topi


> 
> Eric
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-07-02 11:20                     ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-02 11:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Tejun Heo, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	luto-DgEjT+Ai2ygdnm+yROfE0A, keescook-F7+t8E8rja9g9hUCZPvPmw,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 06/28/16 04:57, Eric W. Biederman wrote:
> Topi Miettinen <toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
> 
>> On 06/24/16 17:21, Eric W. Biederman wrote:
>>> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>>>
>>>> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>>>>> Hello,
>>>>>
>>>>> On Fri, Jun 24, 2016 at 10:59:16AM -0500, Serge E. Hallyn wrote:
>>>>>> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>>>>>>> But isn't being recursive orthogonal to using cgroup?  Why not account
>>>>>>> usages recursively along the process hierarchy?  Capabilities don't
>>>>>>> have much to do with cgroup but everything with process hierarchy.
>>>>>>> That's how they're distributed and modified.  If monitoring their
>>>>>>> usages is necessary, it makes sense to do it in the same structure.
>>>>>>
>>>>>> That was my argument against using cgroups to enforce a new bounding
>>>>>> set.  For tracking though, the cgroup process tracking seems as applicable
>>>>>> to this as it does to systemd tracking of services.  It tracks a task and
>>>>>> the children it forks.
>>>>>
>>>>> Just monitoring is less jarring than implementing security enforcement
>>>>> via cgroup, but it is still jarring.  What's wrong with recursive
>>>>> process hierarchy monitoring which is in line with the whole facility
>>>>> is implemented anyway?
>>>>
>>>> As I think Topi pointed out, one shortcoming is that if there is a short-lived
>>>> child task, using its /proc/self/status is racy.  You might just miss that it
>>>> ever even existed, let alone that the "application" needed it.
>>>>
>>>> Another alternative we've both mentioned is to use systemtap.  That's not
>>>> as nice a solution as a cgroup, but then again this isn't really a common
>>>> case, so maybe it is precisely what a tracing infrastructure is meant for.
>>>
>>> Hmm.
>>>
>>> We have capability use wired up into auditing.  So we might be able to
>>> get away with just adding an appropriate audit message in
>>> commoncap.c:cap_capable that honors the audit flag and logs an audit
>>> message.  The hook in selinux already appears to do that.
>>>
>>> Certainly audit sounds like the subsystem for this kind of work, as it's
>>> whole point in life is logging things, then something in userspace can
>>> just run over the audit longs and build a nice summary.
>>
>> Even simpler would be to avoid the complexity of audit subsystem and
>> just printk() when a task starts using a capability first time (not on
>> further uses by same task). There are not that many capability bits nor
>> privileged processes, meaning not too many log entries. I know as this
>> was actually my first approach. But it's also far less user friendly
>> than just reading a summarized value which could be directly fed back to
>> configuration.
> 
> Your loss.
> 
>> Logging/auditing approach also doesn't work well for other things I'd
>> like to present meaningful values for the user. For example, consider
>> RLIMIT_AS, where my goal is also to enable the users to be able to
>> configure this limit for a service. Should there be an audit message
>> whenever the address space limit grows (i.e. each mmap())? What about
>> when it shrinks? For RLIMIT_NOFILE we'd have to report each
>> open()/close()/dup()/socket()/etc. and track how many are opened at the
>> same time. I think it's better to store the fully cooked (meaningful to
>> user) value in kernel and present it only when asked.
> 
> That doesn't have anything to do with anything.
> 
> My suggestion was very much to do with capabilities which are already
> logged with the audit subsystem with selinux.  The idea was to move
> those audit calls into commoncap where they arguably belong allow anyone
> to use them for anything.
> 
> That is a non-controversial code cleanup that happens to cover your
> special case.  That is enough to build a tool in userspace that will
> tell you which capabilities you need without penalizing the kernel, or
> the vast majority of everyone who does not use your feature.
> 
> From what I have seen of this conversation there is not and will not be
> one interface to rule them all.

Now that I know taskstats better, it looks like a good choice for most
of the highwater marks, complemented with audit logging. The taskstats
interface is only available to privileged processes but that's OK. I'll
make new patches based on this approach.

-Topi


> 
> Eric
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-06-27 19:49                           ` Serge E. Hallyn
@ 2016-07-03 15:08                             ` Topi Miettinen
  -1 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-03 15:08 UTC (permalink / raw)
  To: Serge E. Hallyn, Eric W. Biederman, Tejun Heo
  Cc: lkml, luto, Kees Cook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

[-- Attachment #1: Type: text/plain, Size: 2300 bytes --]

On 06/27/16 19:49, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj@kernel.org):
>> Hello,
>>
>> On Mon, Jun 27, 2016 at 3:10 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>>> I'll have to study these more. But from what I saw so far, it looks to
>>> me that a separate tool would be needed to read taskstats and if that
>>> tool is not taken by distros, the users would not be any wiser, right?
>>> With cgroup (or /proc), no new tools would be needed.
>>
>> That is a factor but shouldn't be a deciding factor in designing our
>> user-facing interfaces. Please also note that kernel source tree
>> already has tools/ subdirectory which contains userland tools which
>> are distributed along with the kernel.
> 
> And, if you take audit+cgroup approach then no tools are needed.  So long
> as you can have audit print out the cgroups for a task as part of the
> capability audit record.
> 

The attached patch would make any uses of capabilities generate audit
messages. It works for simple tests as you can see from the commit
message, but unfortunately the call to audit_cgroup_list() deadlocks the
system when booting a full blown OS. There's no deadlock when the call
is removed.

I guess that in some cases, cgroup_mutex and/or css_set_lock could be
already held earlier before entering audit_cgroup_list(). Holding the
locks is however required by task_cgroup_from_root(). Is there any way
to avoid this? For example, only print some kind of cgroup ID numbers
(are there unique and stable IDs, available without locks?) for those
cgroups where the task is registered in the audit message?

I could remove the cgroup part from the audit message entirely, but then
knowing which capabilities were used in what cgroup gets much more
difficult. The rest of the patch would be useful without it and of
course simpler.

In my earlier versions a per-task cap_used variable summarized all uses
of capabilities, but it was not clear when to reset the variable (fork?
exec? capset?), so it's gone for now. This was also used to rate limit
printing audit messages by only acting when each capability was first
used by the task, but now all uses of capabilities trigger audit
logging. Could that become a problem? I think it only makes sense to
summarize capability use per cgroup (via taskstats).

-Topi


[-- Attachment #2: 0001-capabilities-audit-capability-use.patch --]
[-- Type: text/x-patch, Size: 9402 bytes --]

>From 2d5248f91998873174dbcbcafe87e5b30c3858aa Mon Sep 17 00:00:00 2001
From: Topi Miettinen <toiwoton@gmail.com>
Date: Sat, 2 Jul 2016 16:25:20 +0300
Subject: [PATCH] capabilities: audit capability use

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find
out useful values for the limits, except blind trial and error.

Currently, there is no way to know which capabilities are actually used.
Even the source code is only implicit, in-depth knowledge of each
capability must be used when analyzing a program to judge which
capabilities the program will exercise.

Generate an audit message when capabilities are used. This can then be
used to configure capability sets for services by a software developer,
maintainer or system administrator.

Test case demonstrating basic capability monitoring with the new
message type 1330 and how the cgroups are displayed (boot to rdshell):

BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) cd /sys/fs
(initramfs) mount -t cgroup2 cgroup cgroup
[   16.503902] audit_printk_skb: 4026 callbacks suppressed
[   16.505059] audit: type=1330 audit(1467543885.733:469): cap_used=21 pid=214 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=
[   16.506845] audit: type=1330 audit(1467543885.733:469): cap_used=21 pid=214 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=
[   16.509234] audit: type=1300 audit(1467543885.733:469): arch=c000003e syscall=165 success=yes exit=0 a0=7ffc2f394e2d a1=7ffc2f394e34 a2=7ffc2f394e25 a3=8000 items=0 ppid=213 pid=214 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295 comm="mount" exe="/bin/mount" key=(null)
[   16.510134] audit: type=1327 audit(1467543885.733:469): proctitle=6D6F756E74002D74006367726F757032006367726F7570006367726F7570
(initramfs) cd cgroup
(initramfs) mkdir test; cd test
[   16.533829] audit: type=1330 audit(1467543885.765:470): cap_used=1 pid=215 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=:/;
[   16.536587] audit: type=1300 audit(1467543885.765:470): arch=c000003e syscall=83 success=yes exit=0 a0=7ffe4f0bfe29 a1=1ff a2=0 a3=1e2 items=0 ppid=213 pid=215 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295 comm="mkdir" exe="/bin/mkdir" key=(null)
[   16.537263] audit: type=1327 audit(1467543885.765:470): proctitle=6D6B6469720074657374
(initramfs) echo $$ >cgroup.procs
(initramfs) mknod /dev/z_$$ c 1 2
[   16.571516] audit: type=1330 audit(1467543885.801:471): cap_used=27 pid=216 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=:/test;
[   16.572812] audit: type=1300 audit(1467543885.801:471): arch=c000003e syscall=133 success=yes exit=0 a0=7ffe04fe3e11 a1=21b6 a2=102 a3=5c9 items=0 ppid=213 pid=216 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295 comm="mknod" exe="/bin/mknod" key=(null)
[   16.573571] audit: type=1327 audit(1467543885.801:471): proctitle=6D6B6E6F64002F6465762F7A5F323133006300310032

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 include/linux/audit.h      |  4 +++
 include/linux/cgroup.h     |  2 ++
 include/uapi/linux/audit.h |  1 +
 kernel/audit.c             | 22 ++++++++++++++++
 kernel/capability.c        |  5 ++--
 kernel/cgroup.c            | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index e38e3fc..971cb2e 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -438,6 +438,8 @@ static inline void audit_mmap_fd(int fd, int flags)
 		__audit_mmap_fd(fd, flags);
 }
 
+extern void audit_log_cap_use(int cap);
+
 extern int audit_n_rules;
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
@@ -545,6 +547,8 @@ static inline void audit_mmap_fd(int fd, int flags)
 { }
 static inline void audit_ptrace(struct task_struct *t)
 { }
+static inline void audit_log_cap_use(int cap)
+{ }
 #define audit_n_rules 0
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a20320c..b5dc8aa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -100,6 +100,8 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
+struct audit_buffer;
+void audit_cgroup_list(struct audit_buffer *ab);
 
 void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p);
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index d820aa9..a5c9a73 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -111,6 +111,7 @@
 #define AUDIT_PROCTITLE		1327	/* Proctitle emit event */
 #define AUDIT_FEATURE_CHANGE	1328	/* audit log listing feature changes */
 #define AUDIT_REPLACE		1329	/* Replace auditd if this packet unanswerd */
+#define AUDIT_CAPABILITY	1330	/* Record showing capability use */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index 8d528f9..370beb7 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -54,6 +54,7 @@
 #include <linux/kthread.h>
 #include <linux/kernel.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup.h>
 
 #include <linux/audit.h>
 
@@ -1709,6 +1710,27 @@ static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name)
 				 name->fcap.fE, name->fcap_ver);
 }
 
+void audit_log_cap_use(int cap)
+{
+	struct audit_context *context = current->audit_context;
+	struct audit_buffer *ab;
+	kuid_t uid;
+	kgid_t gid;
+
+	ab = audit_log_start(context, GFP_KERNEL, AUDIT_CAPABILITY);
+	audit_log_format(ab, "cap_used=%d", cap);
+	current_uid_gid(&uid, &gid);
+	audit_log_format(ab, " pid=%d auid=%u uid=%u gid=%u ses=%u",
+			 task_pid_nr(current),
+			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
+			 from_kuid(&init_user_ns, uid),
+			 from_kgid(&init_user_ns, gid),
+			 audit_get_sessionid(current));
+	audit_log_format(ab, " cgroups=");
+	audit_cgroup_list(ab);
+	audit_log_end(ab);
+}
+
 static inline int audit_copy_fcaps(struct audit_names *name,
 				   const struct dentry *dentry)
 {
diff --git a/kernel/capability.c b/kernel/capability.c
index 45432b5..d45d5b1 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -366,8 +366,8 @@ bool has_capability_noaudit(struct task_struct *t, int cap)
  * @ns:  The usernamespace we want the capability in
  * @cap: The capability to be tested for
  *
- * Return true if the current task has the given superior capability currently
- * available for use, false if not.
+ * Return true if the current task has the given superior capability
+ * currently available for use, false if not. Write an audit message.
  *
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
@@ -380,6 +380,7 @@ bool ns_capable(struct user_namespace *ns, int cap)
 	}
 
 	if (security_capable(current_cred(), ns, cap) == 0) {
+		audit_log_cap_use(cap);
 		current->flags |= PF_SUPERPRIV;
 		return true;
 	}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 75c0ff0..3b92e85 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -63,6 +63,7 @@
 #include <linux/nsproxy.h>
 #include <linux/proc_ns.h>
 #include <net/sock.h>
+#include <linux/audit.h>
 
 /*
  * pidlists linger the following amount before being destroyed.  The goal
@@ -5789,6 +5790,67 @@ out:
 	return retval;
 }
 
+/*
+ * audit_cgroup_list()
+ *  - Print task's cgroup paths with audit_log_format()
+ *  - Used for capability audit logging
+ *  - Otherwise very similar to proc_cgroup_show().
+ */
+void audit_cgroup_list(struct audit_buffer *ab)
+{
+	char *buf, *path;
+	struct cgroup_root *root;
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf)
+		return;
+
+	mutex_lock(&cgroup_mutex);
+	spin_lock_irq(&css_set_lock);
+
+	for_each_root(root) {
+		struct cgroup_subsys *ss;
+		struct cgroup *cgrp;
+		int ssid, count = 0;
+
+		if (root == &cgrp_dfl_root && !cgrp_dfl_visible)
+			continue;
+
+		if (root != &cgrp_dfl_root)
+			for_each_subsys(ss, ssid)
+				if (root->subsys_mask & (1 << ssid))
+					audit_log_format(ab, "%s%s",
+							 count++ ? "," : "",
+							 ss->legacy_name);
+		if (strlen(root->name))
+			audit_log_format(ab, "%sname=%s", count ? "," : "",
+					 root->name);
+		audit_log_format(ab, ":");
+
+		cgrp = task_cgroup_from_root(current, root);
+
+		if (cgroup_on_dfl(cgrp) || !(current->flags & PF_EXITING)) {
+			path = cgroup_path_ns_locked(cgrp, buf, PATH_MAX,
+						current->nsproxy->cgroup_ns);
+			if (!path)
+				goto out_unlock;
+		} else
+			path = "/";
+
+		audit_log_format(ab, "%s", path);
+
+		if (cgroup_on_dfl(cgrp) && cgroup_is_dead(cgrp))
+			audit_log_format(ab, " (deleted);");
+		else
+			audit_log_format(ab, ";");
+	}
+
+out_unlock:
+	spin_unlock_irq(&css_set_lock);
+	mutex_unlock(&cgroup_mutex);
+	kfree(buf);
+}
+
 /* Display information about each subsystem and each hierarchy */
 static int proc_cgroupstats_show(struct seq_file *m, void *v)
 {
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-07-03 15:08                             ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-03 15:08 UTC (permalink / raw)
  To: Serge E. Hallyn, Eric W. Biederman, Tejun Heo
  Cc: lkml, luto, Kees Cook, Jonathan Corbet, Li Zefan,
	Johannes Weiner, Serge Hallyn, James Morris, Andrew Morton,
	David Howells, David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	Petr Mladek, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

[-- Attachment #1: Type: text/plain, Size: 2300 bytes --]

On 06/27/16 19:49, Serge E. Hallyn wrote:
> Quoting Tejun Heo (tj@kernel.org):
>> Hello,
>>
>> On Mon, Jun 27, 2016 at 3:10 PM, Topi Miettinen <toiwoton@gmail.com> wrote:
>>> I'll have to study these more. But from what I saw so far, it looks to
>>> me that a separate tool would be needed to read taskstats and if that
>>> tool is not taken by distros, the users would not be any wiser, right?
>>> With cgroup (or /proc), no new tools would be needed.
>>
>> That is a factor but shouldn't be a deciding factor in designing our
>> user-facing interfaces. Please also note that kernel source tree
>> already has tools/ subdirectory which contains userland tools which
>> are distributed along with the kernel.
> 
> And, if you take audit+cgroup approach then no tools are needed.  So long
> as you can have audit print out the cgroups for a task as part of the
> capability audit record.
> 

The attached patch would make any uses of capabilities generate audit
messages. It works for simple tests as you can see from the commit
message, but unfortunately the call to audit_cgroup_list() deadlocks the
system when booting a full blown OS. There's no deadlock when the call
is removed.

I guess that in some cases, cgroup_mutex and/or css_set_lock could be
already held earlier before entering audit_cgroup_list(). Holding the
locks is however required by task_cgroup_from_root(). Is there any way
to avoid this? For example, only print some kind of cgroup ID numbers
(are there unique and stable IDs, available without locks?) for those
cgroups where the task is registered in the audit message?

I could remove the cgroup part from the audit message entirely, but then
knowing which capabilities were used in what cgroup gets much more
difficult. The rest of the patch would be useful without it and of
course simpler.

In my earlier versions a per-task cap_used variable summarized all uses
of capabilities, but it was not clear when to reset the variable (fork?
exec? capset?), so it's gone for now. This was also used to rate limit
printing audit messages by only acting when each capability was first
used by the task, but now all uses of capabilities trigger audit
logging. Could that become a problem? I think it only makes sense to
summarize capability use per cgroup (via taskstats).

-Topi


[-- Attachment #2: 0001-capabilities-audit-capability-use.patch --]
[-- Type: text/x-patch, Size: 9401 bytes --]

From 2d5248f91998873174dbcbcafe87e5b30c3858aa Mon Sep 17 00:00:00 2001
From: Topi Miettinen <toiwoton@gmail.com>
Date: Sat, 2 Jul 2016 16:25:20 +0300
Subject: [PATCH] capabilities: audit capability use

There are many basic ways to control processes, including capabilities,
cgroups and resource limits. However, there are far fewer ways to find
out useful values for the limits, except blind trial and error.

Currently, there is no way to know which capabilities are actually used.
Even the source code is only implicit, in-depth knowledge of each
capability must be used when analyzing a program to judge which
capabilities the program will exercise.

Generate an audit message when capabilities are used. This can then be
used to configure capability sets for services by a software developer,
maintainer or system administrator.

Test case demonstrating basic capability monitoring with the new
message type 1330 and how the cgroups are displayed (boot to rdshell):

BusyBox v1.22.1 (Debian 1:1.22.0-19) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) cd /sys/fs
(initramfs) mount -t cgroup2 cgroup cgroup
[   16.503902] audit_printk_skb: 4026 callbacks suppressed
[   16.505059] audit: type=1330 audit(1467543885.733:469): cap_used=21 pid=214 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=
[   16.506845] audit: type=1330 audit(1467543885.733:469): cap_used=21 pid=214 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=
[   16.509234] audit: type=1300 audit(1467543885.733:469): arch=c000003e syscall=165 success=yes exit=0 a0=7ffc2f394e2d a1=7ffc2f394e34 a2=7ffc2f394e25 a3=8000 items=0 ppid=213 pid=214 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295 comm="mount" exe="/bin/mount" key=(null)
[   16.510134] audit: type=1327 audit(1467543885.733:469): proctitle=6D6F756E74002D74006367726F757032006367726F7570006367726F7570
(initramfs) cd cgroup
(initramfs) mkdir test; cd test
[   16.533829] audit: type=1330 audit(1467543885.765:470): cap_used=1 pid=215 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=:/;
[   16.536587] audit: type=1300 audit(1467543885.765:470): arch=c000003e syscall=83 success=yes exit=0 a0=7ffe4f0bfe29 a1=1ff a2=0 a3=1e2 items=0 ppid=213 pid=215 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295 comm="mkdir" exe="/bin/mkdir" key=(null)
[   16.537263] audit: type=1327 audit(1467543885.765:470): proctitle=6D6B6469720074657374
(initramfs) echo $$ >cgroup.procs
(initramfs) mknod /dev/z_$$ c 1 2
[   16.571516] audit: type=1330 audit(1467543885.801:471): cap_used=27 pid=216 auid=4294967295 uid=0 gid=0 ses=4294967295 cgroups=:/test;
[   16.572812] audit: type=1300 audit(1467543885.801:471): arch=c000003e syscall=133 success=yes exit=0 a0=7ffe04fe3e11 a1=21b6 a2=102 a3=5c9 items=0 ppid=213 pid=216 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=4294967295 comm="mknod" exe="/bin/mknod" key=(null)
[   16.573571] audit: type=1327 audit(1467543885.801:471): proctitle=6D6B6E6F64002F6465762F7A5F323133006300310032

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 include/linux/audit.h      |  4 +++
 include/linux/cgroup.h     |  2 ++
 include/uapi/linux/audit.h |  1 +
 kernel/audit.c             | 22 ++++++++++++++++
 kernel/capability.c        |  5 ++--
 kernel/cgroup.c            | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index e38e3fc..971cb2e 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -438,6 +438,8 @@ static inline void audit_mmap_fd(int fd, int flags)
 		__audit_mmap_fd(fd, flags);
 }
 
+extern void audit_log_cap_use(int cap);
+
 extern int audit_n_rules;
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
@@ -545,6 +547,8 @@ static inline void audit_mmap_fd(int fd, int flags)
 { }
 static inline void audit_ptrace(struct task_struct *t)
 { }
+static inline void audit_log_cap_use(int cap)
+{ }
 #define audit_n_rules 0
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a20320c..b5dc8aa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -100,6 +100,8 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk);
+struct audit_buffer;
+void audit_cgroup_list(struct audit_buffer *ab);
 
 void cgroup_fork(struct task_struct *p);
 extern int cgroup_can_fork(struct task_struct *p);
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index d820aa9..a5c9a73 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -111,6 +111,7 @@
 #define AUDIT_PROCTITLE		1327	/* Proctitle emit event */
 #define AUDIT_FEATURE_CHANGE	1328	/* audit log listing feature changes */
 #define AUDIT_REPLACE		1329	/* Replace auditd if this packet unanswerd */
+#define AUDIT_CAPABILITY	1330	/* Record showing capability use */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index 8d528f9..370beb7 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -54,6 +54,7 @@
 #include <linux/kthread.h>
 #include <linux/kernel.h>
 #include <linux/syscalls.h>
+#include <linux/cgroup.h>
 
 #include <linux/audit.h>
 
@@ -1709,6 +1710,27 @@ static void audit_log_fcaps(struct audit_buffer *ab, struct audit_names *name)
 				 name->fcap.fE, name->fcap_ver);
 }
 
+void audit_log_cap_use(int cap)
+{
+	struct audit_context *context = current->audit_context;
+	struct audit_buffer *ab;
+	kuid_t uid;
+	kgid_t gid;
+
+	ab = audit_log_start(context, GFP_KERNEL, AUDIT_CAPABILITY);
+	audit_log_format(ab, "cap_used=%d", cap);
+	current_uid_gid(&uid, &gid);
+	audit_log_format(ab, " pid=%d auid=%u uid=%u gid=%u ses=%u",
+			 task_pid_nr(current),
+			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
+			 from_kuid(&init_user_ns, uid),
+			 from_kgid(&init_user_ns, gid),
+			 audit_get_sessionid(current));
+	audit_log_format(ab, " cgroups=");
+	audit_cgroup_list(ab);
+	audit_log_end(ab);
+}
+
 static inline int audit_copy_fcaps(struct audit_names *name,
 				   const struct dentry *dentry)
 {
diff --git a/kernel/capability.c b/kernel/capability.c
index 45432b5..d45d5b1 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -366,8 +366,8 @@ bool has_capability_noaudit(struct task_struct *t, int cap)
  * @ns:  The usernamespace we want the capability in
  * @cap: The capability to be tested for
  *
- * Return true if the current task has the given superior capability currently
- * available for use, false if not.
+ * Return true if the current task has the given superior capability
+ * currently available for use, false if not. Write an audit message.
  *
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
@@ -380,6 +380,7 @@ bool ns_capable(struct user_namespace *ns, int cap)
 	}
 
 	if (security_capable(current_cred(), ns, cap) == 0) {
+		audit_log_cap_use(cap);
 		current->flags |= PF_SUPERPRIV;
 		return true;
 	}
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 75c0ff0..3b92e85 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -63,6 +63,7 @@
 #include <linux/nsproxy.h>
 #include <linux/proc_ns.h>
 #include <net/sock.h>
+#include <linux/audit.h>
 
 /*
  * pidlists linger the following amount before being destroyed.  The goal
@@ -5789,6 +5790,67 @@ out:
 	return retval;
 }
 
+/*
+ * audit_cgroup_list()
+ *  - Print task's cgroup paths with audit_log_format()
+ *  - Used for capability audit logging
+ *  - Otherwise very similar to proc_cgroup_show().
+ */
+void audit_cgroup_list(struct audit_buffer *ab)
+{
+	char *buf, *path;
+	struct cgroup_root *root;
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf)
+		return;
+
+	mutex_lock(&cgroup_mutex);
+	spin_lock_irq(&css_set_lock);
+
+	for_each_root(root) {
+		struct cgroup_subsys *ss;
+		struct cgroup *cgrp;
+		int ssid, count = 0;
+
+		if (root == &cgrp_dfl_root && !cgrp_dfl_visible)
+			continue;
+
+		if (root != &cgrp_dfl_root)
+			for_each_subsys(ss, ssid)
+				if (root->subsys_mask & (1 << ssid))
+					audit_log_format(ab, "%s%s",
+							 count++ ? "," : "",
+							 ss->legacy_name);
+		if (strlen(root->name))
+			audit_log_format(ab, "%sname=%s", count ? "," : "",
+					 root->name);
+		audit_log_format(ab, ":");
+
+		cgrp = task_cgroup_from_root(current, root);
+
+		if (cgroup_on_dfl(cgrp) || !(current->flags & PF_EXITING)) {
+			path = cgroup_path_ns_locked(cgrp, buf, PATH_MAX,
+						current->nsproxy->cgroup_ns);
+			if (!path)
+				goto out_unlock;
+		} else
+			path = "/";
+
+		audit_log_format(ab, "%s", path);
+
+		if (cgroup_on_dfl(cgrp) && cgroup_is_dead(cgrp))
+			audit_log_format(ab, " (deleted);");
+		else
+			audit_log_format(ab, ";");
+	}
+
+out_unlock:
+	spin_unlock_irq(&css_set_lock);
+	mutex_unlock(&cgroup_mutex);
+	kfree(buf);
+}
+
 /* Display information about each subsystem and each hierarchy */
 static int proc_cgroupstats_show(struct seq_file *m, void *v)
 {
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: audit capability use
@ 2016-07-03 16:13                               ` kbuild test robot
  0 siblings, 0 replies; 46+ messages in thread
From: kbuild test robot @ 2016-07-03 16:13 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: kbuild-all, Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml,
	luto, Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, Andrew Morton, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

[-- Attachment #1: Type: text/plain, Size: 2500 bytes --]

Hi,

[auto build test ERROR on cgroup/for-next]
[also build test ERROR on v4.7-rc5]
[cannot apply to next-20160701]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Topi-Miettinen/capabilities-audit-capability-use/20160703-231120
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: microblaze-mmu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=microblaze 

All errors (new ones prefixed by >>):

>> kernel/audit.c:1713:6: error: redefinition of 'audit_log_cap_use'
    void audit_log_cap_use(int cap)
         ^
   In file included from kernel/audit.c:59:0:
   include/linux/audit.h:574:20: note: previous definition of 'audit_log_cap_use' was here
    static inline void audit_log_cap_use(int cap)
                       ^
   kernel/audit.c: In function 'audit_log_cap_use':
>> kernel/audit.c:1730:2: error: implicit declaration of function 'audit_cgroup_list' [-Werror=implicit-function-declaration]
     audit_cgroup_list(ab);
     ^
   cc1: some warnings being treated as errors

vim +/audit_log_cap_use +1713 kernel/audit.c

  1707	
  1708		if (log)
  1709			audit_log_format(ab, " cap_fe=%d cap_fver=%x",
  1710					 name->fcap.fE, name->fcap_ver);
  1711	}
  1712	
> 1713	void audit_log_cap_use(int cap)
  1714	{
  1715		struct audit_context *context = current->audit_context;
  1716		struct audit_buffer *ab;
  1717		kuid_t uid;
  1718		kgid_t gid;
  1719	
  1720		ab = audit_log_start(context, GFP_KERNEL, AUDIT_CAPABILITY);
  1721		audit_log_format(ab, "cap_used=%d", cap);
  1722		current_uid_gid(&uid, &gid);
  1723		audit_log_format(ab, " pid=%d auid=%u uid=%u gid=%u ses=%u",
  1724				 task_pid_nr(current),
  1725				 from_kuid(&init_user_ns, audit_get_loginuid(current)),
  1726				 from_kuid(&init_user_ns, uid),
  1727				 from_kgid(&init_user_ns, gid),
  1728				 audit_get_sessionid(current));
  1729		audit_log_format(ab, " cgroups=");
> 1730		audit_cgroup_list(ab);
  1731		audit_log_end(ab);
  1732	}
  1733	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 12626 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: audit capability use
@ 2016-07-03 16:13                               ` kbuild test robot
  0 siblings, 0 replies; 46+ messages in thread
From: kbuild test robot @ 2016-07-03 16:13 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: kbuild-all-JC7UmRfGjtg, Serge E. Hallyn, Eric W. Biederman,
	Tejun Heo, lkml, luto-DgEjT+Ai2ygdnm+yROfE0A, Kees Cook,
	Jonathan Corbet, Li Zefan, Johannes Weiner, Serge Hallyn,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, Petr Mladek,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

[-- Attachment #1: Type: text/plain, Size: 2500 bytes --]

Hi,

[auto build test ERROR on cgroup/for-next]
[also build test ERROR on v4.7-rc5]
[cannot apply to next-20160701]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Topi-Miettinen/capabilities-audit-capability-use/20160703-231120
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: microblaze-mmu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=microblaze 

All errors (new ones prefixed by >>):

>> kernel/audit.c:1713:6: error: redefinition of 'audit_log_cap_use'
    void audit_log_cap_use(int cap)
         ^
   In file included from kernel/audit.c:59:0:
   include/linux/audit.h:574:20: note: previous definition of 'audit_log_cap_use' was here
    static inline void audit_log_cap_use(int cap)
                       ^
   kernel/audit.c: In function 'audit_log_cap_use':
>> kernel/audit.c:1730:2: error: implicit declaration of function 'audit_cgroup_list' [-Werror=implicit-function-declaration]
     audit_cgroup_list(ab);
     ^
   cc1: some warnings being treated as errors

vim +/audit_log_cap_use +1713 kernel/audit.c

  1707	
  1708		if (log)
  1709			audit_log_format(ab, " cap_fe=%d cap_fver=%x",
  1710					 name->fcap.fE, name->fcap_ver);
  1711	}
  1712	
> 1713	void audit_log_cap_use(int cap)
  1714	{
  1715		struct audit_context *context = current->audit_context;
  1716		struct audit_buffer *ab;
  1717		kuid_t uid;
  1718		kgid_t gid;
  1719	
  1720		ab = audit_log_start(context, GFP_KERNEL, AUDIT_CAPABILITY);
  1721		audit_log_format(ab, "cap_used=%d", cap);
  1722		current_uid_gid(&uid, &gid);
  1723		audit_log_format(ab, " pid=%d auid=%u uid=%u gid=%u ses=%u",
  1724				 task_pid_nr(current),
  1725				 from_kuid(&init_user_ns, audit_get_loginuid(current)),
  1726				 from_kuid(&init_user_ns, uid),
  1727				 from_kgid(&init_user_ns, gid),
  1728				 audit_get_sessionid(current));
  1729		audit_log_format(ab, " cgroups=");
> 1730		audit_cgroup_list(ab);
  1731		audit_log_end(ab);
  1732	}
  1733	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 12626 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-07-03 15:08                             ` Topi Miettinen
  (?)
  (?)
@ 2016-07-07  9:16                             ` Petr Mladek
  2016-07-07 20:27                               ` Topi Miettinen
  -1 siblings, 1 reply; 46+ messages in thread
From: Petr Mladek @ 2016-07-07  9:16 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml, luto,
	Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, Andrew Morton, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
> The attached patch would make any uses of capabilities generate audit
> messages. It works for simple tests as you can see from the commit
> message, but unfortunately the call to audit_cgroup_list() deadlocks the
> system when booting a full blown OS. There's no deadlock when the call
> is removed.
> 
> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
> already held earlier before entering audit_cgroup_list(). Holding the
> locks is however required by task_cgroup_from_root(). Is there any way
> to avoid this? For example, only print some kind of cgroup ID numbers
> (are there unique and stable IDs, available without locks?) for those
> cgroups where the task is registered in the audit message?

I am not sure if anyone know what really happens here. I suggest to
enable lockdep. It might detect possible deadlock even before it
really happens, see Documentation/locking/lockdep-design.txt

It can be enabled by

   CONFIG_PROVE_LOCKING=y

It depends on

    CONFIG_DEBUG_KERNEL=y

and maybe some more options, see lib/Kconfig.debug


Best Regards,
Petr

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-07-07  9:16                             ` [PATCH] capabilities: add capability cgroup controller Petr Mladek
@ 2016-07-07 20:27                               ` Topi Miettinen
  2016-07-08  9:13                                 ` Petr Mladek
  0 siblings, 1 reply; 46+ messages in thread
From: Topi Miettinen @ 2016-07-07 20:27 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml, luto,
	Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, Andrew Morton, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 07/07/16 09:16, Petr Mladek wrote:
> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
>> The attached patch would make any uses of capabilities generate audit
>> messages. It works for simple tests as you can see from the commit
>> message, but unfortunately the call to audit_cgroup_list() deadlocks the
>> system when booting a full blown OS. There's no deadlock when the call
>> is removed.
>>
>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
>> already held earlier before entering audit_cgroup_list(). Holding the
>> locks is however required by task_cgroup_from_root(). Is there any way
>> to avoid this? For example, only print some kind of cgroup ID numbers
>> (are there unique and stable IDs, available without locks?) for those
>> cgroups where the task is registered in the audit message?
> 
> I am not sure if anyone know what really happens here. I suggest to
> enable lockdep. It might detect possible deadlock even before it
> really happens, see Documentation/locking/lockdep-design.txt
> 
> It can be enabled by
> 
>    CONFIG_PROVE_LOCKING=y
> 
> It depends on
> 
>     CONFIG_DEBUG_KERNEL=y
> 
> and maybe some more options, see lib/Kconfig.debug

Thanks a lot! I caught this stack dump:

starting version 230
[    3.416647] ------------[ cut here ]------------
[    3.417310] WARNING: CPU: 0 PID: 95 at
/home/topi/d/linux.git/kernel/locking/lockdep.c:2871
lockdep_trace_alloc+0xb4/0xc0
[    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[    3.417923] Modules linked in:
[    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
[    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
ffffffff813c9c45
[    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
ffffffff81091e9b
[    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
00000000ffffffff
[    3.419374] Call Trace:
[    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
[    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
[    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
[    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
[    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
[    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
[    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
[    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
[    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
[    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
[    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
[    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
[    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
[    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
[    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
[    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
[    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
[    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
[    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
[    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
[    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
[    3.420170] ---[ end trace fb586899fb556a5e ]---
[    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
available
[    4.014078] clocksource: Switched to clocksource tsc
Begin: Loading essential drivers ... done.

This is with qemu and the boot continues normally. With real computer,
there's no such output and system just seems to freeze.

Could it be possible that the deadlock happens because there's some IO
towards /sys/fs/cgroup, which causes a capability check and that in turn
causes locking problems when we try to print cgroup list?

-Topi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-07-07 20:27                               ` Topi Miettinen
@ 2016-07-08  9:13                                 ` Petr Mladek
  2016-07-09 16:38                                   ` Topi Miettinen
  2016-07-10  9:04                                     ` Topi Miettinen
  0 siblings, 2 replies; 46+ messages in thread
From: Petr Mladek @ 2016-07-08  9:13 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml, luto,
	Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	Serge Hallyn, James Morris, Andrew Morton, David Howells,
	David Woodhouse, Ard Biesheuvel, Paul E. McKenney,
	open list:DOCUMENTATION, open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
> On 07/07/16 09:16, Petr Mladek wrote:
> > On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
> >> The attached patch would make any uses of capabilities generate audit
> >> messages. It works for simple tests as you can see from the commit
> >> message, but unfortunately the call to audit_cgroup_list() deadlocks the
> >> system when booting a full blown OS. There's no deadlock when the call
> >> is removed.
> >>
> >> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
> >> already held earlier before entering audit_cgroup_list(). Holding the
> >> locks is however required by task_cgroup_from_root(). Is there any way
> >> to avoid this? For example, only print some kind of cgroup ID numbers
> >> (are there unique and stable IDs, available without locks?) for those
> >> cgroups where the task is registered in the audit message?
> > 
> > I am not sure if anyone know what really happens here. I suggest to
> > enable lockdep. It might detect possible deadlock even before it
> > really happens, see Documentation/locking/lockdep-design.txt
> > 
> > It can be enabled by
> > 
> >    CONFIG_PROVE_LOCKING=y
> > 
> > It depends on
> > 
> >     CONFIG_DEBUG_KERNEL=y
> > 
> > and maybe some more options, see lib/Kconfig.debug
> 
> Thanks a lot! I caught this stack dump:
> 
> starting version 230
> [    3.416647] ------------[ cut here ]------------
> [    3.417310] WARNING: CPU: 0 PID: 95 at
> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
> lockdep_trace_alloc+0xb4/0xc0
> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [    3.417923] Modules linked in:
> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Debian-1.8.2-1 04/01/2014
> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
> ffffffff813c9c45
> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
> ffffffff81091e9b
> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
> 00000000ffffffff
> [    3.419374] Call Trace:
> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
> [    3.420170] ---[ end trace fb586899fb556a5e ]---
> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
> available
> [    4.014078] clocksource: Switched to clocksource tsc
> Begin: Loading essential drivers ... done.
> 
> This is with qemu and the boot continues normally. With real computer,
> there's no such output and system just seems to freeze.
> 
> Could it be possible that the deadlock happens because there's some IO
> towards /sys/fs/cgroup, which causes a capability check and that in turn
> causes locking problems when we try to print cgroup list?

The above warning is printed by the code from
kernel/locking/lockdep.c:2871

static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
{
[...]
	/* We're only interested __GFP_FS allocations for now */
	if (!(gfp_mask & __GFP_FS))
		return;

	/*
	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
	 */
	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
		return;


The backtrace shows that your new audit_log_cap_use() is called
from vfs_write(). You might try to use audit_log_start() with
GFP_NOFS instead of GFP_KERNEL.

Note that this is rather intuitive advice. I still need to learn a lot
about memory management and kernel in general to be more sure about
a correct solution.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-07-08  9:13                                 ` Petr Mladek
@ 2016-07-09 16:38                                   ` Topi Miettinen
  2016-07-10  9:04                                     ` Topi Miettinen
  1 sibling, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-09 16:38 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml, luto,
	Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

On 07/08/16 09:13, Petr Mladek wrote:
> On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
>> On 07/07/16 09:16, Petr Mladek wrote:
>>> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
>>>> The attached patch would make any uses of capabilities generate audit
>>>> messages. It works for simple tests as you can see from the commit
>>>> message, but unfortunately the call to audit_cgroup_list() deadlocks the
>>>> system when booting a full blown OS. There's no deadlock when the call
>>>> is removed.
>>>>
>>>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
>>>> already held earlier before entering audit_cgroup_list(). Holding the
>>>> locks is however required by task_cgroup_from_root(). Is there any way
>>>> to avoid this? For example, only print some kind of cgroup ID numbers
>>>> (are there unique and stable IDs, available without locks?) for those
>>>> cgroups where the task is registered in the audit message?
>>>
>>> I am not sure if anyone know what really happens here. I suggest to
>>> enable lockdep. It might detect possible deadlock even before it
>>> really happens, see Documentation/locking/lockdep-design.txt
>>>
>>> It can be enabled by
>>>
>>>    CONFIG_PROVE_LOCKING=y
>>>
>>> It depends on
>>>
>>>     CONFIG_DEBUG_KERNEL=y
>>>
>>> and maybe some more options, see lib/Kconfig.debug
>>
>> Thanks a lot! I caught this stack dump:
>>
>> starting version 230
>> [    3.416647] ------------[ cut here ]------------
>> [    3.417310] WARNING: CPU: 0 PID: 95 at
>> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
>> lockdep_trace_alloc+0xb4/0xc0
>> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [    3.417923] Modules linked in:
>> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
>> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Debian-1.8.2-1 04/01/2014
>> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
>> ffffffff813c9c45
>> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
>> ffffffff81091e9b
>> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
>> 00000000ffffffff
>> [    3.419374] Call Trace:
>> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
>> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
>> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
>> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
>> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
>> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
>> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
>> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
>> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
>> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
>> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
>> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
>> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
>> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
>> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
>> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
>> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
>> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
>> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
>> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
>> [    3.420170] ---[ end trace fb586899fb556a5e ]---
>> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
>> available
>> [    4.014078] clocksource: Switched to clocksource tsc
>> Begin: Loading essential drivers ... done.
>>
>> This is with qemu and the boot continues normally. With real computer,
>> there's no such output and system just seems to freeze.
>>
>> Could it be possible that the deadlock happens because there's some IO
>> towards /sys/fs/cgroup, which causes a capability check and that in turn
>> causes locking problems when we try to print cgroup list?
> 
> The above warning is printed by the code from
> kernel/locking/lockdep.c:2871
> 
> static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> {
> [...]
> 	/* We're only interested __GFP_FS allocations for now */
> 	if (!(gfp_mask & __GFP_FS))
> 		return;
> 
> 	/*
> 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
> 	 */
> 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> 		return;
> 
> 
> The backtrace shows that your new audit_log_cap_use() is called
> from vfs_write(). You might try to use audit_log_start() with
> GFP_NOFS instead of GFP_KERNEL.
> 
> Note that this is rather intuitive advice. I still need to learn a lot
> about memory management and kernel in general to be more sure about
> a correct solution.

Here's what I got now:

[   18.043181]
[   18.044123] ======================================================
[   18.044123] [ INFO: possible circular locking dependency detected ]
[   18.044123] 4.7.0-rc5+ #99 Not tainted
[   18.044123] -------------------------------------------------------
[   18.044123] systemd/1 is trying to acquire lock:
[   18.044123]  (tasklist_lock){.+.+..}, at: [<ffffffff81137ae1>]
cgroup_mount+0x4f1/0xc10
[   18.044123]
[   18.044123] but task is already holding lock:
[   18.044123]  (css_set_lock){......}, at: [<ffffffff81137a9d>]
cgroup_mount+0x4ad/0xc10
[   18.044123]
[   18.044123] which lock already depends on the new lock.
[   18.044123]
[   18.044123]
[   18.044123] the existing dependency chain (in reverse order) is:
[   18.044123]
-> #3 (css_set_lock){......}:
[   18.044123]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[   18.044123]        [<ffffffff8184e187>] _raw_spin_lock_irq+0x37/0x50
[   18.044123]        [<ffffffff811374be>] cgroup_setup_root+0x19e/0x2d0
[   18.044123]        [<ffffffff821911fc>] cgroup_init+0xec/0x41d
[   18.044123]        [<ffffffff82171f68>] start_kernel+0x40c/0x465
[   18.044123]        [<ffffffff82171294>]
x86_64_start_reservations+0x2f/0x31
[   18.044123]        [<ffffffff8217140e>] x86_64_start_kernel+0x178/0x18b
[   18.044123]
-> #2 (cgroup_mutex){+.+...}:
[   18.044123]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[   18.044123]        [<ffffffff8184afaf>] mutex_lock_nested+0x5f/0x350
[   18.044123]        [<ffffffff8113967a>] audit_cgroup_list+0x4a/0x2f0
[   18.044123]        [<ffffffff81145d69>] audit_log_cap_use+0xd9/0xf0
[   18.044123]        [<ffffffff8109cd75>] ns_capable+0x45/0x70
[   18.044123]        [<ffffffff8109cdb7>] capable+0x17/0x20
[   18.044123]        [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
[   18.044123]        [<ffffffff81230997>] __vfs_write+0x37/0x160
[   18.044123]        [<ffffffff81231048>] vfs_write+0xb8/0x1b0
[   18.044123]        [<ffffffff81232078>] SyS_write+0x58/0xc0
[   18.044123]        [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[   18.044123]        [<ffffffff8184ea5a>] return_from_SYSCALL_64+0x0/0x7a
[   18.044123]
-> #1 (&(&sighand->siglock)->rlock){+.+...}:
[   18.044123]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[   18.044123]        [<ffffffff8184e011>] _raw_spin_lock+0x31/0x40
[   18.044123]        [<ffffffff810901d9>]
copy_process.part.34+0x10f9/0x1b40
[   18.044123]        [<ffffffff81090e23>] _do_fork+0xf3/0x6b0
[   18.044123]        [<ffffffff81091409>] kernel_thread+0x29/0x30
[   18.044123]        [<ffffffff810b71d7>] kthreadd+0x187/0x1e0
[   18.044123]        [<ffffffff8184ebbf>] ret_from_fork+0x1f/0x40
[   18.044123]
-> #0 (tasklist_lock){.+.+..}:
[   18.044123]        [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[   18.044123]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[   18.044123]        [<ffffffff8184e444>] _raw_read_lock+0x34/0x50
[   18.044123]        [<ffffffff81137ae1>] cgroup_mount+0x4f1/0xc10
[   18.044123]        [<ffffffff81234de8>] mount_fs+0x38/0x170
[   18.044123]        [<ffffffff812562bb>] vfs_kern_mount+0x6b/0x150
[   18.044123]        [<ffffffff81258fdc>] do_mount+0x24c/0xe30
[   18.044123]        [<ffffffff81259ef5>] SyS_mount+0x95/0xe0
[   18.044123]        [<ffffffff8184e9a5>]
entry_SYSCALL_64_fastpath+0x18/0xa8
[   18.044123]
[   18.044123] other info that might help us debug this:
[   18.044123]
[   18.044123] Chain exists of:
  tasklist_lock --> cgroup_mutex --> css_set_lock

[   18.044123]  Possible unsafe locking scenario:
[   18.044123]
[   18.044123]        CPU0                    CPU1
[   18.044123]        ----                    ----
[   18.044123]   lock(css_set_lock);
[   18.044123]                                lock(cgroup_mutex);
[   18.044123]                                lock(css_set_lock);
[   18.044123]   lock(tasklist_lock);
[   18.044123]
[   18.044123]  *** DEADLOCK ***
[   18.044123]
[   18.044123] 1 lock held by systemd/1:
[   18.044123]  #0:  (css_set_lock){......}, at: [<ffffffff81137a9d>]
cgroup_mount+0x4ad/0xc10
[   18.044123]
[   18.044123] stack backtrace:
[   18.044123] CPU: 0 PID: 1 Comm: systemd Not tainted 4.7.0-rc5+ #99
[   18.044123] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[   18.044123]  0000000000000086 0000000008966b11 ffff880006d13bb0
ffffffff813c9c45
[   18.044123]  ffffffff829dbed0 ffffffff829cf2a0 ffff880006d13bf0
ffffffff810e60a3
[   18.044123]  ffff880006d13c30 ffff880006d067b0 ffff880006d06040
0000000000000001
[   18.044123] Call Trace:
[   18.044123]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
[   18.044123]  [<ffffffff810e60a3>] print_circular_bug+0x1e3/0x250
[   18.044123]  [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[   18.044123]  [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[   18.044123]  [<ffffffff81137ae1>] ? cgroup_mount+0x4f1/0xc10
[   18.044123]  [<ffffffff8184e444>] _raw_read_lock+0x34/0x50
[   18.044123]  [<ffffffff81137ae1>] ? cgroup_mount+0x4f1/0xc10
[   18.044123]  [<ffffffff81137ae1>] cgroup_mount+0x4f1/0xc10
[   18.044123]  [<ffffffff810e5637>] ? lockdep_init_map+0x57/0x1f0
[   18.044123]  [<ffffffff81234de8>] mount_fs+0x38/0x170
[   18.044123]  [<ffffffff812562bb>] vfs_kern_mount+0x6b/0x150
[   18.044123]  [<ffffffff81258fdc>] do_mount+0x24c/0xe30
[   18.044123]  [<ffffffff8121060b>] ? kmem_cache_alloc_trace+0x28b/0x5e0
[   18.044123]  [<ffffffff811cc1c6>] ? strndup_user+0x46/0x80
[   18.044123]  [<ffffffff81259ef5>] SyS_mount+0x95/0xe0
[   18.044123]  [<ffffffff8184e9a5>] entry_SYSCALL_64_fastpath+0x18/0xa8

This is with GFP_KERNEL changed to GFP_NOFS for both allocations.

-Topi

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
  2016-07-08  9:13                                 ` Petr Mladek
@ 2016-07-10  9:04                                     ` Topi Miettinen
  2016-07-10  9:04                                     ` Topi Miettinen
  1 sibling, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-10  9:04 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml, luto,
	Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

[-- Attachment #1: Type: text/plain, Size: 10985 bytes --]

On 07/08/16 09:13, Petr Mladek wrote:
> On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
>> On 07/07/16 09:16, Petr Mladek wrote:
>>> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
>>>> The attached patch would make any uses of capabilities generate audit
>>>> messages. It works for simple tests as you can see from the commit
>>>> message, but unfortunately the call to audit_cgroup_list() deadlocks the
>>>> system when booting a full blown OS. There's no deadlock when the call
>>>> is removed.
>>>>
>>>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
>>>> already held earlier before entering audit_cgroup_list(). Holding the
>>>> locks is however required by task_cgroup_from_root(). Is there any way
>>>> to avoid this? For example, only print some kind of cgroup ID numbers
>>>> (are there unique and stable IDs, available without locks?) for those
>>>> cgroups where the task is registered in the audit message?
>>>
>>> I am not sure if anyone know what really happens here. I suggest to
>>> enable lockdep. It might detect possible deadlock even before it
>>> really happens, see Documentation/locking/lockdep-design.txt
>>>
>>> It can be enabled by
>>>
>>>    CONFIG_PROVE_LOCKING=y
>>>
>>> It depends on
>>>
>>>     CONFIG_DEBUG_KERNEL=y
>>>
>>> and maybe some more options, see lib/Kconfig.debug
>>
>> Thanks a lot! I caught this stack dump:
>>
>> starting version 230
>> [    3.416647] ------------[ cut here ]------------
>> [    3.417310] WARNING: CPU: 0 PID: 95 at
>> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
>> lockdep_trace_alloc+0xb4/0xc0
>> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [    3.417923] Modules linked in:
>> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
>> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Debian-1.8.2-1 04/01/2014
>> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
>> ffffffff813c9c45
>> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
>> ffffffff81091e9b
>> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
>> 00000000ffffffff
>> [    3.419374] Call Trace:
>> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
>> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
>> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
>> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
>> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
>> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
>> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
>> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
>> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
>> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
>> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
>> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
>> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
>> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
>> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
>> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
>> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
>> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
>> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
>> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
>> [    3.420170] ---[ end trace fb586899fb556a5e ]---
>> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
>> available
>> [    4.014078] clocksource: Switched to clocksource tsc
>> Begin: Loading essential drivers ... done.
>>
>> This is with qemu and the boot continues normally. With real computer,
>> there's no such output and system just seems to freeze.
>>
>> Could it be possible that the deadlock happens because there's some IO
>> towards /sys/fs/cgroup, which causes a capability check and that in turn
>> causes locking problems when we try to print cgroup list?
> 
> The above warning is printed by the code from
> kernel/locking/lockdep.c:2871
> 
> static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> {
> [...]
> 	/* We're only interested __GFP_FS allocations for now */
> 	if (!(gfp_mask & __GFP_FS))
> 		return;
> 
> 	/*
> 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
> 	 */
> 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> 		return;
> 
> 
> The backtrace shows that your new audit_log_cap_use() is called
> from vfs_write(). You might try to use audit_log_start() with
> GFP_NOFS instead of GFP_KERNEL.
> 
> Note that this is rather intuitive advice. I still need to learn a lot
> about memory management and kernel in general to be more sure about
> a correct solution.
> 
> Best Regards,
> Petr
> 

With the attached patch, the system boots without deadlock. Locking
problems still remain:

[    3.652221] ======================================================
[    3.652221] [ INFO: possible circular locking dependency detected ]
[    3.652221] 4.7.0-rc5+ #101 Not tainted
[    3.652221] -------------------------------------------------------
[    3.652221] systemd/1 is trying to acquire lock:
[    3.652221]  (tasklist_lock){.+.+..}, at: [<ffffffff81137ddd>]
cgroup_mount+0x7ed/0xbc0
[    3.652221]
               but task is already holding lock:
[    3.652221]  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               which lock already depends on the new lock.

[    3.652221]
               the existing dependency chain (in reverse order) is:
[    3.652221]
               -> #3 (css_set_lock){......}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e137>] _raw_spin_lock_irq+0x37/0x50
[    3.652221]        [<ffffffff811374be>] cgroup_setup_root+0x19e/0x2d0
[    3.652221]        [<ffffffff821911fc>] cgroup_init+0xec/0x41d
[    3.652221]        [<ffffffff82171f68>] start_kernel+0x40c/0x465
[    3.652221]        [<ffffffff82171294>]
x86_64_start_reservations+0x2f/0x31
[    3.652221]        [<ffffffff8217140e>] x86_64_start_kernel+0x178/0x18b
[    3.652221]
               -> #2 (cgroup_mutex){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184af5f>] mutex_lock_nested+0x5f/0x350
[    3.652221]        [<ffffffff8113962a>] audit_cgroup_list+0x4a/0x2f0
[    3.652221]        [<ffffffff81145d19>] audit_log_cap_use+0xd9/0xf0
[    3.652221]        [<ffffffff8109cd75>] ns_capable+0x45/0x70
[    3.652221]        [<ffffffff8109cdb7>] capable+0x17/0x20
[    3.652221]        [<ffffffff812a2f00>] oom_score_adj_write+0x150/0x2f0
[    3.652221]        [<ffffffff81230947>] __vfs_write+0x37/0x160
[    3.652221]        [<ffffffff81230ff8>] vfs_write+0xb8/0x1b0
[    3.652221]        [<ffffffff81232028>] SyS_write+0x58/0xc0
[    3.652221]        [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[    3.652221]        [<ffffffff8184ea1a>] return_from_SYSCALL_64+0x0/0x7a
[    3.652221]
               -> #1 (&(&sighand->siglock)->rlock){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184dfc1>] _raw_spin_lock+0x31/0x40
[    3.652221]        [<ffffffff810901d9>]
copy_process.part.34+0x10f9/0x1b40
[    3.652221]        [<ffffffff81090e23>] _do_fork+0xf3/0x6b0
[    3.652221]        [<ffffffff81091409>] kernel_thread+0x29/0x30
[    3.652221]        [<ffffffff810b71d7>] kthreadd+0x187/0x1e0
[    3.652221]        [<ffffffff8184eb7f>] ret_from_fork+0x1f/0x40
[    3.652221]
               -> #0 (tasklist_lock){.+.+..}:
[    3.652221]        [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]        [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]        [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]        [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]        [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]        [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]        [<ffffffff8184e965>]
entry_SYSCALL_64_fastpath+0x18/0xa8
[    3.652221]
               other info that might help us debug this:

[    3.652221] Chain exists of:
                 tasklist_lock --> cgroup_mutex --> css_set_lock

[    3.652221]  Possible unsafe locking scenario:

[    3.652221]        CPU0                    CPU1
[    3.652221]        ----                    ----
[    3.652221]   lock(css_set_lock);
[    3.652221]                                lock(cgroup_mutex);
[    3.652221]                                lock(css_set_lock);
[    3.652221]   lock(tasklist_lock);
[    3.652221]
                *** DEADLOCK ***

[    3.652221] 1 lock held by systemd/1:
[    3.652221]  #0:  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               stack backtrace:
[    3.652221] CPU: 0 PID: 1 Comm: systemd Not tainted 4.7.0-rc5+ #101
[    3.652221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[    3.652221]  0000000000000086 0000000024b7c1ed ffff880006d13bb0
ffffffff813c9bf5
[    3.652221]  ffffffff829dbd20 ffffffff829cf2a0 ffff880006d13bf0
ffffffff810e60a3
[    3.652221]  ffff880006d13c30 ffff880006d067b0 ffff880006d06040
0000000000000001
[    3.652221] Call Trace:
[    3.652221]  [<ffffffff813c9bf5>] dump_stack+0x67/0x92
[    3.652221]  [<ffffffff810e60a3>] print_circular_bug+0x1e3/0x250
[    3.652221]  [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]  [<ffffffff81210bcd>] ? __kmalloc_track_caller+0x2bd/0x630
[    3.652221]  [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff810e5637>] ? lockdep_init_map+0x57/0x1f0
[    3.652221]  [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]  [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]  [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]  [<ffffffff812105bb>] ? kmem_cache_alloc_trace+0x28b/0x5e0
[    3.652221]  [<ffffffff811cc176>] ? strndup_user+0x46/0x80
[    3.652221]  [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]  [<ffffffff8184e965>] entry_SYSCALL_64_fastpath+0x18/0xa8

Rate limiting would not be a bad idea, there were 329 audit log entries
about capability use.

-Topi


[-- Attachment #2: 0001-cgroup-check-options-and-capabilities-before-locking.patch --]
[-- Type: text/x-patch, Size: 2088 bytes --]

>From b857ec7d436f6fbb8c33e8d6a16d2f16175517d8 Mon Sep 17 00:00:00 2001
From: Topi Miettinen <toiwoton@gmail.com>
Date: Sun, 10 Jul 2016 11:45:30 +0300
Subject: [PATCH] cgroup: check options and capabilities before locking to
 avoid a deadlock

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/cgroup.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1931679..1190559 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2100,6 +2100,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EPERM);
 	}
 
+	/* First find the desired set of subsystems */
+	ret = parse_cgroupfs_options(data, &opts);
+	if (ret)
+		goto out_free;
+
+	/*
+	 * We know this subsystem has not yet been bound.  Users in a non-init
+	 * user namespace may only mount hierarchies with no bound subsystems,
+	 * i.e. 'none,name=user1'
+	 */
+	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_free;
+	}
+
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
 	 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2121,11 +2136,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 
 	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
-	/* First find the desired set of subsystems */
-	ret = parse_cgroupfs_options(data, &opts);
-	if (ret)
-		goto out_unlock;
-
 	/*
 	 * Destruction of cgroup root is asynchronous, so subsystems may
 	 * still be dying after the previous unmount.  Let's drain the
@@ -2216,16 +2226,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		goto out_unlock;
 	}
 
-	/*
-	 * We know this subsystem has not yet been bound.  Users in a non-init
-	 * user namespace may only mount hierarchies with no bound subsystems,
-	 * i.e. 'none,name=user1'
-	 */
-	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
-		ret = -EPERM;
-		goto out_unlock;
-	}
-
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		ret = -ENOMEM;
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] capabilities: add capability cgroup controller
@ 2016-07-10  9:04                                     ` Topi Miettinen
  0 siblings, 0 replies; 46+ messages in thread
From: Topi Miettinen @ 2016-07-10  9:04 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Serge E. Hallyn, Eric W. Biederman, Tejun Heo, lkml, luto,
	Kees Cook, Jonathan Corbet, Li Zefan, Johannes Weiner,
	James Morris, Andrew Morton, David Howells, David Woodhouse,
	Ard Biesheuvel, Paul E. McKenney, open list:DOCUMENTATION,
	open list:CONTROL GROUP (CGROUP),
	open list:CAPABILITIES

[-- Attachment #1: Type: text/plain, Size: 10985 bytes --]

On 07/08/16 09:13, Petr Mladek wrote:
> On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
>> On 07/07/16 09:16, Petr Mladek wrote:
>>> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
>>>> The attached patch would make any uses of capabilities generate audit
>>>> messages. It works for simple tests as you can see from the commit
>>>> message, but unfortunately the call to audit_cgroup_list() deadlocks the
>>>> system when booting a full blown OS. There's no deadlock when the call
>>>> is removed.
>>>>
>>>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
>>>> already held earlier before entering audit_cgroup_list(). Holding the
>>>> locks is however required by task_cgroup_from_root(). Is there any way
>>>> to avoid this? For example, only print some kind of cgroup ID numbers
>>>> (are there unique and stable IDs, available without locks?) for those
>>>> cgroups where the task is registered in the audit message?
>>>
>>> I am not sure if anyone know what really happens here. I suggest to
>>> enable lockdep. It might detect possible deadlock even before it
>>> really happens, see Documentation/locking/lockdep-design.txt
>>>
>>> It can be enabled by
>>>
>>>    CONFIG_PROVE_LOCKING=y
>>>
>>> It depends on
>>>
>>>     CONFIG_DEBUG_KERNEL=y
>>>
>>> and maybe some more options, see lib/Kconfig.debug
>>
>> Thanks a lot! I caught this stack dump:
>>
>> starting version 230
>> [    3.416647] ------------[ cut here ]------------
>> [    3.417310] WARNING: CPU: 0 PID: 95 at
>> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
>> lockdep_trace_alloc+0xb4/0xc0
>> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [    3.417923] Modules linked in:
>> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
>> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Debian-1.8.2-1 04/01/2014
>> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
>> ffffffff813c9c45
>> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
>> ffffffff81091e9b
>> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
>> 00000000ffffffff
>> [    3.419374] Call Trace:
>> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
>> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
>> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
>> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
>> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
>> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
>> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
>> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
>> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
>> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
>> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
>> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
>> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
>> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
>> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
>> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
>> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
>> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
>> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
>> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
>> [    3.420170] ---[ end trace fb586899fb556a5e ]---
>> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
>> available
>> [    4.014078] clocksource: Switched to clocksource tsc
>> Begin: Loading essential drivers ... done.
>>
>> This is with qemu and the boot continues normally. With real computer,
>> there's no such output and system just seems to freeze.
>>
>> Could it be possible that the deadlock happens because there's some IO
>> towards /sys/fs/cgroup, which causes a capability check and that in turn
>> causes locking problems when we try to print cgroup list?
> 
> The above warning is printed by the code from
> kernel/locking/lockdep.c:2871
> 
> static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> {
> [...]
> 	/* We're only interested __GFP_FS allocations for now */
> 	if (!(gfp_mask & __GFP_FS))
> 		return;
> 
> 	/*
> 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
> 	 */
> 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> 		return;
> 
> 
> The backtrace shows that your new audit_log_cap_use() is called
> from vfs_write(). You might try to use audit_log_start() with
> GFP_NOFS instead of GFP_KERNEL.
> 
> Note that this is rather intuitive advice. I still need to learn a lot
> about memory management and kernel in general to be more sure about
> a correct solution.
> 
> Best Regards,
> Petr
> 

With the attached patch, the system boots without deadlock. Locking
problems still remain:

[    3.652221] ======================================================
[    3.652221] [ INFO: possible circular locking dependency detected ]
[    3.652221] 4.7.0-rc5+ #101 Not tainted
[    3.652221] -------------------------------------------------------
[    3.652221] systemd/1 is trying to acquire lock:
[    3.652221]  (tasklist_lock){.+.+..}, at: [<ffffffff81137ddd>]
cgroup_mount+0x7ed/0xbc0
[    3.652221]
               but task is already holding lock:
[    3.652221]  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               which lock already depends on the new lock.

[    3.652221]
               the existing dependency chain (in reverse order) is:
[    3.652221]
               -> #3 (css_set_lock){......}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e137>] _raw_spin_lock_irq+0x37/0x50
[    3.652221]        [<ffffffff811374be>] cgroup_setup_root+0x19e/0x2d0
[    3.652221]        [<ffffffff821911fc>] cgroup_init+0xec/0x41d
[    3.652221]        [<ffffffff82171f68>] start_kernel+0x40c/0x465
[    3.652221]        [<ffffffff82171294>]
x86_64_start_reservations+0x2f/0x31
[    3.652221]        [<ffffffff8217140e>] x86_64_start_kernel+0x178/0x18b
[    3.652221]
               -> #2 (cgroup_mutex){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184af5f>] mutex_lock_nested+0x5f/0x350
[    3.652221]        [<ffffffff8113962a>] audit_cgroup_list+0x4a/0x2f0
[    3.652221]        [<ffffffff81145d19>] audit_log_cap_use+0xd9/0xf0
[    3.652221]        [<ffffffff8109cd75>] ns_capable+0x45/0x70
[    3.652221]        [<ffffffff8109cdb7>] capable+0x17/0x20
[    3.652221]        [<ffffffff812a2f00>] oom_score_adj_write+0x150/0x2f0
[    3.652221]        [<ffffffff81230947>] __vfs_write+0x37/0x160
[    3.652221]        [<ffffffff81230ff8>] vfs_write+0xb8/0x1b0
[    3.652221]        [<ffffffff81232028>] SyS_write+0x58/0xc0
[    3.652221]        [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[    3.652221]        [<ffffffff8184ea1a>] return_from_SYSCALL_64+0x0/0x7a
[    3.652221]
               -> #1 (&(&sighand->siglock)->rlock){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184dfc1>] _raw_spin_lock+0x31/0x40
[    3.652221]        [<ffffffff810901d9>]
copy_process.part.34+0x10f9/0x1b40
[    3.652221]        [<ffffffff81090e23>] _do_fork+0xf3/0x6b0
[    3.652221]        [<ffffffff81091409>] kernel_thread+0x29/0x30
[    3.652221]        [<ffffffff810b71d7>] kthreadd+0x187/0x1e0
[    3.652221]        [<ffffffff8184eb7f>] ret_from_fork+0x1f/0x40
[    3.652221]
               -> #0 (tasklist_lock){.+.+..}:
[    3.652221]        [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]        [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]        [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]        [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]        [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]        [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]        [<ffffffff8184e965>]
entry_SYSCALL_64_fastpath+0x18/0xa8
[    3.652221]
               other info that might help us debug this:

[    3.652221] Chain exists of:
                 tasklist_lock --> cgroup_mutex --> css_set_lock

[    3.652221]  Possible unsafe locking scenario:

[    3.652221]        CPU0                    CPU1
[    3.652221]        ----                    ----
[    3.652221]   lock(css_set_lock);
[    3.652221]                                lock(cgroup_mutex);
[    3.652221]                                lock(css_set_lock);
[    3.652221]   lock(tasklist_lock);
[    3.652221]
                *** DEADLOCK ***

[    3.652221] 1 lock held by systemd/1:
[    3.652221]  #0:  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               stack backtrace:
[    3.652221] CPU: 0 PID: 1 Comm: systemd Not tainted 4.7.0-rc5+ #101
[    3.652221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[    3.652221]  0000000000000086 0000000024b7c1ed ffff880006d13bb0
ffffffff813c9bf5
[    3.652221]  ffffffff829dbd20 ffffffff829cf2a0 ffff880006d13bf0
ffffffff810e60a3
[    3.652221]  ffff880006d13c30 ffff880006d067b0 ffff880006d06040
0000000000000001
[    3.652221] Call Trace:
[    3.652221]  [<ffffffff813c9bf5>] dump_stack+0x67/0x92
[    3.652221]  [<ffffffff810e60a3>] print_circular_bug+0x1e3/0x250
[    3.652221]  [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]  [<ffffffff81210bcd>] ? __kmalloc_track_caller+0x2bd/0x630
[    3.652221]  [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff810e5637>] ? lockdep_init_map+0x57/0x1f0
[    3.652221]  [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]  [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]  [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]  [<ffffffff812105bb>] ? kmem_cache_alloc_trace+0x28b/0x5e0
[    3.652221]  [<ffffffff811cc176>] ? strndup_user+0x46/0x80
[    3.652221]  [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]  [<ffffffff8184e965>] entry_SYSCALL_64_fastpath+0x18/0xa8

Rate limiting would not be a bad idea, there were 329 audit log entries
about capability use.

-Topi


[-- Attachment #2: 0001-cgroup-check-options-and-capabilities-before-locking.patch --]
[-- Type: text/x-patch, Size: 2087 bytes --]

From b857ec7d436f6fbb8c33e8d6a16d2f16175517d8 Mon Sep 17 00:00:00 2001
From: Topi Miettinen <toiwoton@gmail.com>
Date: Sun, 10 Jul 2016 11:45:30 +0300
Subject: [PATCH] cgroup: check options and capabilities before locking to
 avoid a deadlock

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/cgroup.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1931679..1190559 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2100,6 +2100,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EPERM);
 	}
 
+	/* First find the desired set of subsystems */
+	ret = parse_cgroupfs_options(data, &opts);
+	if (ret)
+		goto out_free;
+
+	/*
+	 * We know this subsystem has not yet been bound.  Users in a non-init
+	 * user namespace may only mount hierarchies with no bound subsystems,
+	 * i.e. 'none,name=user1'
+	 */
+	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_free;
+	}
+
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
 	 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2121,11 +2136,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 
 	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
-	/* First find the desired set of subsystems */
-	ret = parse_cgroupfs_options(data, &opts);
-	if (ret)
-		goto out_unlock;
-
 	/*
 	 * Destruction of cgroup root is asynchronous, so subsystems may
 	 * still be dying after the previous unmount.  Let's drain the
@@ -2216,16 +2226,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		goto out_unlock;
 	}
 
-	/*
-	 * We know this subsystem has not yet been bound.  Users in a non-init
-	 * user namespace may only mount hierarchies with no bound subsystems,
-	 * i.e. 'none,name=user1'
-	 */
-	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
-		ret = -EPERM;
-		goto out_unlock;
-	}
-
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		ret = -ENOMEM;
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2016-07-10  9:04 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-23 15:07 [PATCH] capabilities: add capability cgroup controller Topi Miettinen
2016-06-23 15:07 ` Topi Miettinen
2016-06-23 21:03 ` Kees Cook
2016-06-23 21:03   ` Kees Cook
2016-06-23 21:38 ` Tejun Heo
2016-06-24  0:22   ` Topi Miettinen
2016-06-24 15:48     ` Tejun Heo
2016-06-24 15:59       ` Serge E. Hallyn
2016-06-24 16:35         ` Tejun Heo
2016-06-24 16:35           ` Tejun Heo
2016-06-24 16:59           ` Serge E. Hallyn
2016-06-24 17:21             ` Eric W. Biederman
2016-06-24 17:21               ` Eric W. Biederman
2016-06-24 17:39               ` Serge E. Hallyn
2016-06-26 19:03               ` Topi Miettinen
2016-06-26 19:03                 ` Topi Miettinen
2016-06-28  4:57                 ` Eric W. Biederman
2016-06-28  4:57                   ` Eric W. Biederman
2016-07-02 11:20                   ` Topi Miettinen
2016-07-02 11:20                     ` Topi Miettinen
2016-06-24 17:24             ` Tejun Heo
2016-06-26 19:14               ` Topi Miettinen
2016-06-26 22:26                 ` Tejun Heo
2016-06-27 14:54                   ` Serge E. Hallyn
2016-06-27 19:10                     ` Topi Miettinen
2016-06-27 19:17                       ` Tejun Heo
2016-06-27 19:49                         ` Serge E. Hallyn
2016-06-27 19:49                           ` Serge E. Hallyn
2016-07-03 15:08                           ` Topi Miettinen
2016-07-03 15:08                             ` Topi Miettinen
2016-07-03 16:13                             ` [PATCH] capabilities: audit capability use kbuild test robot
2016-07-03 16:13                               ` kbuild test robot
2016-07-07  9:16                             ` [PATCH] capabilities: add capability cgroup controller Petr Mladek
2016-07-07 20:27                               ` Topi Miettinen
2016-07-08  9:13                                 ` Petr Mladek
2016-07-09 16:38                                   ` Topi Miettinen
2016-07-10  9:04                                   ` Topi Miettinen
2016-07-10  9:04                                     ` Topi Miettinen
2016-06-23 23:46 ` Andrew Morton
2016-06-23 23:46   ` Andrew Morton
2016-06-24  1:14   ` Topi Miettinen
2016-06-24  1:14     ` Topi Miettinen
2016-06-24  4:15     ` Andy Lutomirski
2016-06-24  4:15       ` Andy Lutomirski
2016-06-25 18:00       ` Djalal Harouni
2016-06-25 18:00         ` Djalal Harouni

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.