From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753333AbbGWOBp (ORCPT <rfc822;w@1wt.eu>);
	Thu, 23 Jul 2015 10:01:45 -0400
Received: from mail.kernel.org ([198.145.29.136]:40420 "EHLO mail.kernel.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753262AbbGWOBc (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 23 Jul 2015 10:01:32 -0400
Date: Thu, 23 Jul 2015 11:01:27 -0300
From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Hemant Kumar <hemant@linux.vnet.ibm.com>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>, linux-kernel@vger.kernel.org,
        Adrian Hunter <adrian.hunter@intel.com>,
        Ingo Molnar <mingo@redhat.com>, Jiri Olsa <jolsa@kernel.org>,
        Namhyung Kim <namhyung@kernel.org>, Borislav Petkov <bp@suse.de>
Subject: Re: Re: [RFC PATCH perf/core v2 00/16] perf-probe --cache and SDT
 support
Message-ID: <20150723140127.GD3152@kernel.org>
References: <20150715091352.8915.87480.stgit@localhost.localdomain>
 <55A7215F.40803@linux.vnet.ibm.com>
 <55A874C6.5030202@hitachi.com>
 <55AFA4E2.4040801@linux.vnet.ibm.com>
 <55B0E872.8030206@hitachi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <55B0E872.8030206@hitachi.com>
X-Url: http://acmel.wordpress.com
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Em Thu, Jul 23, 2015 at 10:13:22PM +0900, Masami Hiramatsu escreveu:
> On 2015/07/22 23:12, Hemant Kumar wrote:
> > On 07/17/2015 08:51 AM, Masami Hiramatsu wrote:
> >> On 2015/07/16 12:13, Hemant Kumar wrote:
> > The idea behind '%' was to identify the SDT events and take a different path
> > to lookup through the cache, put a probe, record and then delete the probe.
> > Or, do you want "perf record" to record any event this way (not just an sdt
> > event).
 
> I see, but I think that is not good by following reasons,
 
> - when we record event with "-e %provider:event", it will be shown as
>   "provider:event"
> - if perf-list shows the SDT(cached) events as "%provider:event", that
>   will not match the recorded result.
> - it is somewhat fragile that we temporary add the SDT event and remove it
>   after record, because the event will not hide from ftrace users (this
>   means that we'll fail removing the event by -EBUSY if someone use it
>   via ftrace)

We should avoid that, if we say record event "foo:bar", then we should
consistently show "foo:bar" everywhere this is referenced.

> - if we set SDT events perf-probe, it will be shown as "provider:event" name
>   because "%" will be rejected by ftrace. In that case, what the perf-list show
>   those events, both of %provider:event and provider:event ?
 
> thus I pushed the "%" as a "special remembering mark" only for looking
> up the event from cache by perf-probe.
 
> So I'd like to suggest that the following behavior
 
> 1) perf-list shows the cached-with-name and SDT events as Tracepoint events
>   even if it is not yet probed.

I can agree with 'perf list' showing what can be used as events, and
SDTs, AFAIK, match that definition, i.e. we have somewhere (in the DSOs,
right?) information about where to ask for an event to be enabled.

If there are details on how that needs to be obtained, then passed to
the kernel somehow to then become really, really accessible, then these
details are completely internal.

> # perf list
> 
> List of pre-defined events (to be used in -e):
> ...
>   libc:memory_heap_new                             [Tracepoint event]

Is it like this or is it like [ku]probes where we already have a
namespace qualifier, i.e.:

[root@zoo ~]# perf probe icmp_rcv
Added new event:
  probe:icmp_rcv       (on icmp_rcv)

You can now use it in all perf tools, such as:

	perf record -e probe:icmp_rcv -aR sleep 1

[root@zoo ~]#

[root@zoo ~]# perf probe /lib64/libc-2.20.so malloc
Added new events:
  probe_libc:malloc    (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_1  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_2  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_3  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_4  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_5  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_6  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_7  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_8  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_9  (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_10 (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_11 (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_12 (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_13 (on malloc in /lib64/libc-2.20.so)
  probe_libc:malloc_14 (on malloc in /lib64/libc-2.20.so)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:malloc_14 -aR sleep 1

[root@zoo ~]#

"probe" for kernel events, "probe_%s" % DSO basename for userspace
events.

Why not continue with that and have SDTs use the probe_%s: namespace?
Sorry if this was already discussed here...

If there is some ambiguity, that can be resolved by explicitely setting
a new name, 'perf probe' has provision for that, right? I.e.:

[root@zoo ~]# perf probe /lib64/libc-2.20.so another_name=malloc
Added new events:
<SNIP>
  probe_libc:another_name_14 (on malloc in /lib64/libc-2.20.so)

You can now use it in all perf tools, such as:

	perf record -e probe_libc:another_name_14 -aR sleep 1

[root@zoo ~]#

> ...
>   probes:myevent                                   [Tracepoint event]
> ...
 
> 2) perf-record -e with no-probed event should try to set up the given probe
>  by using perf-probe. It is possible to remove that the probe after recording,
>  but also ignore if it fails by -EBUSY. (anyway, there is no difference for
>  users)

Right, being able to add new probes _without_ calling 'perf probe', but
instead functions used by 'perf probe' is something the eBPF does (I'm
almost getting there... :) ) and that I want to do in other tools, like
'trace' as well...

There are permission problems about how to add new probes and how to
_enable_ them, i.e. I think it is ok to allow users to ask for
probe:foo, if they are monitoring just their workloads...

[root@zoo ~]# perf probe 'vfs_getname=getname_flags:72 pathname=filename:string'
Added new event:
  probe:vfs_getname    (on getname_flags:72 with pathname=filename:string)

You can now use it in all perf tools, such as:

	perf record -e probe:vfs_getname -aR sleep 1

[root@zoo ~]#

Then, as !root:

  [acme@zoo linux]$ trace cat /etc/passwd
  Error:	No permissions to read /sys/kernel/debug/tracing/events/raw_syscalls/sys_(enter|exit)
  Hint:	Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'

By default, !root users can't see debugfs, so, no dice in trying to use
the raw_syscalls tracepoints, bummer, then:

  [acme@zoo linux]$ sudo mount -o remount,mode=755 /sys/kernel/debug
  [sudo] password for acme: 
  [acme@zoo linux]$ trace cat /etc/passwd
  Error:	No permissions to read /sys/kernel/debug/tracing/events/raw_syscalls/sys_(enter|exit)
  Hint:	Try 'sudo mount -o remount,mode=755 /sys/kernel/debug/tracing'

Ouch, now this tracefs thing inside debugfs, at least sudo doesn't asks me for the
password this time!

But then, since 'perf trace' knows there is a probe:vfs_getname syscall in place, it
will try to use it, to monitor a workload it is about to start, but:

  [acme@zoo linux]$ sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
  [acme@zoo linux]$ trace cat /etc/passwd
  Error:	Operation not permitted.
  Hint:	Check /proc/sys/kernel/perf_event_paranoid setting.
  Hint:	For system wide tracing it needs to be set to -1.
  Hint:	Try: 'sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"'
  Hint:	The current value is 1.
  [acme@zoo linux]$

This needs refining, i.e. I should just warn that albeit "probe:vfs_getname" is available,
it can't be used, unless we open the doors wide.

Sorry for the digression, but these permission issues will hit us with SDT as well, no?
 
> This rule will solve the contradiction between the event name on recorded
> data and listed events. However, as we discussed there are other clashes.

> A) clash among binaries: Since the binary builders can freely use the
> provider name, it is possible to clash to other binaries' SDTs.

Yes, one way to disambiguate is to use the buildid, i.e. content based, when/if
the need arises. We should _always_ store the build-id, together with any probe
used, so that we can bail out when trying to run those with a non matching DSO
(kernel, module, library, whatever).

When specifying some SDT that is ambiguous, we should bail out and ask for further
namespacing, like:

[acme@zoo linux]$ git show --oneline 25b6
error: short SHA1 25b6 is ambiguous.
error: short SHA1 25b6 is ambiguous.
fatal: ambiguous argument '25b6': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
[acme@zoo linux]$ git show --oneline 25b67
error: short SHA1 25b67 is ambiguous.
error: short SHA1 25b67 is ambiguous.
fatal: ambiguous argument '25b67': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

It can be via build-id or by asking for the full pathname to the DSO, etc.
 
> B) clash among different versions: Of course the different versions of binaries
> can be co-exist on the system. Those usually have the same SDTs and same
> basename, just different build-ids.

Right, in that case the build-id or the full pathname needs to be specified somehow

> These issues are not solved by using "%" because it happens among SDTs.
> So we need to find another way to distinguish the SDTs.

>>From what I've read so far (not much): agreed.

- Arnaldo