From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758579AbcG0V1v (ORCPT ); Wed, 27 Jul 2016 17:27:51 -0400
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:33522 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
	by vger.kernel.org with ESMTP id S1758254AbcG0V1t (ORCPT );
	Wed, 27 Jul 2016 17:27:49 -0400
X-IBM-Helo: d23dlp01.au.ibm.com
X-IBM-MailFrom: hbathini@linux.vnet.ibm.com
X-IBM-RcptTo: linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 1/3] perf: filter container events based on cgroup namespace
From: Hari Bathini
To: daniel@iogearbox.net, peterz@infradead.org, linux-kernel@vger.kernel.org,
	acme@kernel.org, alexander.shishkin@linux.intel.com, mingo@redhat.com,
	paulus@samba.org, ebiederm@xmission.com, kernel@kyup.com,
	rostedt@goodmis.org, viro@zeniv.linux.org.uk
Cc: aravinda@linux.vnet.ibm.com, ananth@in.ibm.com
Date: Thu, 28 Jul 2016 02:57:27 +0530
In-Reply-To: <146965470618.23765.7329786743211962695.stgit@hbathini.in.ibm.com>
References: <146965470618.23765.7329786743211962695.stgit@hbathini.in.ibm.com>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16072721-0040-0000-0000-000001C5E38E
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 16072721-0041-0000-0000-00000A2730D7
Message-Id: <146965484711.23765.5878825588596955069.stgit@hbathini.in.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,,
	definitions=2016-07-27_12:,, signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	spamscore=0 suspectscore=0 malwarescore=0 phishscore=0 adultscore=0
	bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1
	engine=8.0.1-1604210000 definitions=main-1607270211
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
From: Aravinda Prasad

This patch adds support for filtering container-specific events when the
perf utility is invoked within a container, without any change to the
user interface.

Our earlier patch [1] required the container to be created with a PID
namespace. However, during the discussion at Plumbers it was mentioned
that requiring a PID namespace is insufficient for containers that need
access to the host PID namespace [3]. Now that the kernel supports cgroup
namespaces, we modified the patch to look for a cgroup namespace instead
of a PID namespace to filter events. This keeps the basic idea of
approach [1] the same while addressing [3].

The patch assumes that tracefs is available within the container and that
all the processes running inside the container are grouped into a single
perf_event subsystem of cgroups.

When the command below is run inside a container with the global cgroup
namespace:

    $ perf record -e kmem:kmalloc -aR

perf report looks like this (with a lot of noise):

    $ perf report --sort pid,symbol -n
    #
    #
    # Total Lost Samples: 0
    #
    # Samples: 8K of event 'kmem:kmalloc'
    # Event count (approx.): 8487
    #
    # Overhead       Samples          Pid:Command  Symbol
    # ........  ............  ...................  ..........................
    #
        71.56%          6073    0:kworker/dying    [k] __kmalloc
        26.82%          2276    0:kworker/dying    [k] kmem_cache_alloc_trace
         1.48%           126    0:kworker/dying    [k] __kmalloc_track_caller
         0.07%             6    0:curl             [k] kmalloc_order_trace
         0.05%             4  186:perf             [k] __kmalloc
         0.02%             2   61:java             [k] __kmalloc
    $

When the same perf record command is run inside a container with a new
cgroup namespace, only samples that belong to this container are listed:

    $ perf report --sort pid,dso,symbol -n
    #
    #
    # Total Lost Samples: 0
    #
    # Samples: 3 of event 'kmem:kmalloc'
    # Event count (approx.): 3
    #
    # Overhead       Samples    Pid:Command  Symbol
    # ........  ............  .............  .............
    #
       100.00%             3   61:java       [k] __kmalloc
    $

In order to filter events specific to a container, this patch assumes the
container is created with a new cgroup namespace.

[1] https://lkml.org/lkml/2015/7/15/192
[2] http://linuxplumbersconf.org/2015/ocw/sessions/2667.html
[3] Notes for container-aware tracing:
    https://etherpad.openstack.org/p/LPC2015_Containers

Signed-off-by: Aravinda Prasad
Signed-off-by: Hari Bathini
---
 kernel/events/core.c |   51 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 15 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 43d43a2d..d7ef1e1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -764,17 +764,38 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 {
 	struct perf_cgroup *cgrp;
 	struct cgroup_subsys_state *css;
-	struct fd f = fdget(fd);
+	struct fd f;
 	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	if (fd != -1) {
+		f = fdget(fd);
+		if (!f.file)
+			return -EBADF;
 
-	css = css_tryget_online_from_dir(f.file->f_path.dentry,
-					 &perf_event_cgrp_subsys);
-	if (IS_ERR(css)) {
-		ret = PTR_ERR(css);
-		goto out;
+		css = css_tryget_online_from_dir(f.file->f_path.dentry,
+					&perf_event_cgrp_subsys);
+		if (IS_ERR(css)) {
+			ret = PTR_ERR(css);
+			fdput(f);
+			return ret;
+		}
+	} else if (event->attach_state == PERF_ATTACH_TASK) {
+		/* Tracing on a PID. No need to set event->cgrp */
+		return ret;
+	} else if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+		/* Don't set event->cgrp if task belongs to root cgroup */
+		if (task_css_is_root(current, perf_event_cgrp_id))
+			return ret;
+
+		css = task_css(current, perf_event_cgrp_id);
+		if (!css || !css_tryget_online(css))
+			return -ENOENT;
+	} else {
+		/*
+		 * perf invoked from global context and hence don't set
+		 * event->cgrp as all the events should be included
+		 */
+		return ret;
 	}
 
 	cgrp = container_of(css, struct perf_cgroup, css);
@@ -789,8 +810,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		perf_detach_cgroup(event);
 		ret = -EINVAL;
 	}
-out:
-	fdput(f);
+
+	if (fd != -1)
+		fdput(f);
+
 	return ret;
 }
 
@@ -8864,11 +8887,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
+	err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+	if (err)
+		goto err_ns;
 
 	pmu = perf_init_event(event);
 	if (!pmu)
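
For readers following the new branches in perf_cgroup_connect(), the selection order can be modelled in plain userspace C roughly as below. This is only an illustrative sketch and not part of the patch: the enum, the function name choose_cgroup() and its boolean parameters are hypothetical stand-ins for the kernel-side checks (fd != -1, event->attach_state == PERF_ATTACH_TASK, the comparison against init_cgroup_ns, and task_css_is_root()), and error paths such as -EBADF and -ENOENT are left out.

```c
#include <stdbool.h>

/*
 * Hypothetical outcome labels for illustration only; the kernel function
 * returns 0 or a negative errno and sets (or leaves unset) event->cgrp.
 */
enum cgrp_choice {
	USE_FD_CGROUP,     /* a cgroup fd was passed: attach to that cgroup */
	NO_FILTER_TASK,    /* tracing on a PID: event->cgrp stays unset */
	NO_FILTER_ROOT,    /* caller is in the root perf_event cgroup */
	USE_CALLER_CGROUP, /* new cgroup namespace: filter on caller's cgroup */
	NO_FILTER_GLOBAL,  /* global cgroup namespace: include all events */
};

/*
 * Mirrors the order of the if/else-if chain added by the patch:
 * explicit fd first, then per-task events, then the cgroup namespace check.
 */
enum cgrp_choice choose_cgroup(int fd, bool attached_to_task,
			       bool in_init_cgroup_ns, bool in_root_cgroup)
{
	if (fd != -1)
		return USE_FD_CGROUP;
	if (attached_to_task)
		return NO_FILTER_TASK;
	if (!in_init_cgroup_ns)
		return in_root_cgroup ? NO_FILTER_ROOT : USE_CALLER_CGROUP;
	return NO_FILTER_GLOBAL;
}
```

Under this model, a system-wide event opened without a cgroup fd from inside a new cgroup namespace, by a task outside the root cgroup, maps to USE_CALLER_CGROUP; this corresponds to the case in which the second perf report output above shows only the container's own samples.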