[RFC PATCH] perf: Container-aware tracing support

From: Aravinda Prasad @ 2015-07-15  9:08 UTC
  To: a.p.zijlstra, linux-kernel, rostedt, mingo, paulus, acme; +Cc: hbathini, ananth

Tracing infrastructure such as perf and ftrace currently reports
system-wide data even when invoked inside a container. When such tools
are invoked inside a container, the reported events should be
restricted to that container's context.

This RFC patch adds filtering of container-specific events to the
perf utility when it is invoked within a container, without any change
to the user interface; similar support still needs to be extended to
ftrace. The patch assumes that debugfs is available within the
container and that all the processes running inside a container are
grouped into a single cgroup of the perf_event subsystem. It
piggybacks on the existing support for tracing with cgroups [1] by
setting the cgrp member of the event structure to the cgroup of the
context the perf tool is invoked from.
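
The choice of which cgroup, if any, scopes an event can be modelled in
userspace roughly as below. This is only a sketch that mirrors the
branches of perf_cgroup_connect() in this patch; struct caller, enum
scope and pick_scope() are hypothetical names used for illustration
and are not kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the patch's branches; not kernel code. */
enum scope {
	SCOPE_EXPLICIT_CGROUP,	/* caller passed a cgroup fd */
	SCOPE_NONE,		/* event->cgrp left unset: no cgroup filter */
	SCOPE_CALLER_CGROUP,	/* event->cgrp set to the caller's cgroup */
};

struct caller {
	int cgroup_fd;		/* -1 when no cgroup fd was supplied */
	bool attach_task;	/* event is bound to a specific PID */
	bool in_init_pid_ns;	/* caller runs in the initial PID namespace */
	bool in_root_cgroup;	/* caller is in the root perf_event cgroup */
};

enum scope pick_scope(const struct caller *c)
{
	if (c->cgroup_fd != -1)
		return SCOPE_EXPLICIT_CGROUP;
	if (c->attach_task)
		return SCOPE_NONE;	/* tracing a PID */
	if (!c->in_init_pid_ns && !c->in_root_cgroup)
		return SCOPE_CALLER_CGROUP;	/* invoked inside a container */
	return SCOPE_NONE;	/* global context: include all events */
}
```

Only the third branch is new behaviour: a caller outside the initial
PID namespace that is not in the root cgroup has its events limited to
its own perf_event cgroup.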

However, this patch is not complete and requires more work to fully
support tracing inside a container. It is intended to initiate the
discussion on container-aware tracing support. What is supported and
which issues are still pending are explained in detail below.

Suggestions, feedback, flames are welcome.

[1] https://lkml.org/lkml/2011/2/14/40

--------------------------------------------------------------------
Details:

With this patch, perf-stat, perf-record (tracepoints, [ku]probes) and
perf-top, when executed within a container, report only the events
triggered in that container's context. However, there are a couple of
limitations on how this works for kprobes/uprobes and, in general,
for the ftrace infrastructure.

The problem arises from the use of the files
/sys/kernel/debug/tracing/[uk]probe_events. The perf utility inserts a
probe by writing into the [uk]probe_events file, which the kernel
parses to register an event. When debugfs is mounted inside
containers, the contents of these files are visible to all
containers. This means that a user within one container can list or
delete probes registered by other containers, leading to security
issues and/or denial of service (e.g. by deleting a probe from another
container every time it is registered). This could be undesirable
depending on the way containers are used (e.g. in a multi-tenant setup
where each user is assigned a container).
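
To make the exposure concrete, here is a minimal userspace sketch of
the command strings written to kprobe_events, using the "p:GRP/EVENT
SYMBOL" form to add a probe and "-:GRP/EVENT" to remove one. The
helper names and the group, event and symbol values used with them are
hypothetical; this is not how the perf tool itself is implemented.

```c
#include <stdio.h>

/* Format "p:GRP/EVENT SYMBOL", the command that registers a kprobe. */
int fmt_kprobe_add(char *buf, size_t len, const char *grp,
		   const char *event, const char *symbol)
{
	int n = snprintf(buf, len, "p:%s/%s %s", grp, event, symbol);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format "-:GRP/EVENT", the command that removes a registered probe. */
int fmt_kprobe_del(char *buf, size_t len, const char *grp,
		   const char *event)
{
	int n = snprintf(buf, len, "-:%s/%s", grp, event);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Append one command line to a tracing control file (needs privileges). */
int write_probe_cmd(const char *path, const char *cmd)
{
	FILE *f = fopen(path, "a");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%s\n", cmd) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

The removal command needs nothing beyond the group and event name,
both of which can be read back from the same file, so any container
that can see the file can remove another container's probe.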

The issues mentioned above exist for every tracing infrastructure
that uses the ftrace interface. One approach is to provide a
container-specific view of these files under
/sys/kernel/debug/tracing. At the moment, this seems to require a
significant rework of ftrace.

We are looking for feedback on the assumption that all the processes
running inside a container are grouped into a single cgroup of the
perf_event subsystem, and for any thoughts on extending such support
to ftrace.

Regards,
Aravinda

Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
---
 kernel/events/core.c |   49 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 81aa3a4..f6a1f89 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -589,17 +589,38 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 {
 	struct perf_cgroup *cgrp;
 	struct cgroup_subsys_state *css;
-	struct fd f = fdget(fd);
+	struct fd f;
 	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	if (fd != -1) {
+		f = fdget(fd);
+		if (!f.file)
+			return -EBADF;
 
-	css = css_tryget_online_from_dir(f.file->f_path.dentry,
+		css = css_tryget_online_from_dir(f.file->f_path.dentry,
 					 &perf_event_cgrp_subsys);
-	if (IS_ERR(css)) {
-		ret = PTR_ERR(css);
-		goto out;
+		if (IS_ERR(css)) {
+			ret = PTR_ERR(css);
+			fdput(f);
+			return ret;
+		}
+	} else if (event->attach_state == PERF_ATTACH_TASK) {
+		/* Tracing on a PID. No need to set event->cgrp */
+		return ret;
+	} else if (task_active_pid_ns(current) != &init_pid_ns) {
+		/* Don't set event->cgrp if task belongs to root cgroup */
+		if (task_css_is_root(current, perf_event_cgrp_id))
+			return ret;
+
+		css = task_css(current, perf_event_cgrp_id);
+		if (!css || !css_tryget_online(css))
+			return -ENOENT;
+	} else {
+		/*
+		 * perf invoked from global context and hence don't set
+		 * event->cgrp as all the events should be included
+		 */
+		return ret;
 	}
 
 	cgrp = container_of(css, struct perf_cgroup, css);
@@ -614,8 +635,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		perf_detach_cgroup(event);
 		ret = -EINVAL;
 	}
-out:
-	fdput(f);
+
+	if (fd != -1)
+		fdput(f);
+
 	return ret;
 }
 
@@ -7554,11 +7577,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
+	err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+	if (err)
+		goto err_ns;
 
 	pmu = perf_init_event(event);
 	if (!pmu)

