From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757609AbcHCCvJ (ORCPT <rfc822;w@1wt.eu>);
	Tue, 2 Aug 2016 22:51:09 -0400
Received: from mail-yw0-f177.google.com ([209.85.161.177]:34207 "EHLO
	mail-yw0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1757559AbcHCCua (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 2 Aug 2016 22:50:30 -0400
MIME-Version: 1.0
In-Reply-To: <579C2059.9@huawei.com>
References: <1468970448-3309-1-git-send-email-bgregg@netflix.com>
 <CAJN39oibpBU9rZpAX2KJp2bNDj-gdrOMU+4xG14iuab5aPG-xg@mail.gmail.com> <579C2059.9@huawei.com>
From: Brendan Gregg <bgregg@netflix.com>
Date: Tue, 2 Aug 2016 19:44:17 -0700
Message-ID: <CAJN39ojbnFOv4SD-2o+zWpukfNfwrKv_SbM8WHQppj1rqeGDuQ@mail.gmail.com>
Subject: Re: [PATCH] perf/core: Add a tracepoint for perf sampling
To: "Wangnan (F)" <wangnan0@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        linux-kernel@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F) <wangnan0@huawei.com> wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgregg@netflix.com> wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be
>>> used
>>> by BPF to run additional logic on each sample. For example, BPF can fetch
>>> stack traces and frequency count them in kernel context, for an efficient
>>> profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes. The following example samples at 999 Hertz and emits the
instruction pointer only when it falls within a custom address range,
as checked by BPF:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
    -e ./sampleip_range.c/map:channel.event=evt/ \
    -a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]

# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
              dd  6501  3058.117379:
      BPF output: 0000: 3c 4c 21 81 ff ff ff ff  <L!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.130392:
      BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.131393:
      BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.149411:
      BPF output: 0000: e1 4b 21 81 ff ff ff ff  .K!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.155417:
      BPF output: 0000: 76 4c 21 81 ff ff ff ff  vL!.....
                  0008: 00 00 00 00              ....
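
To read those hex dumps: the first eight bytes of each payload are the
sampled instruction pointer in little-endian byte order (this assumes
an x86-64 host, as in the run above), and the trailing four zero bytes
look like padding. Here is a minimal sketch of decoding one payload in
user space and repeating the range check. The helper names decode_le64
and ip_in_range are made up for illustration and are not part of perf.

```c
#include <stddef.h>
#include <stdint.h>

/* Range copied from sampleip_range.c; it differs per kernel build. */
#define RANGE_START 0xffffffff81214b90ULL
#define RANGE_END   0xffffffff81214cd0ULL

/* Hypothetical helper: assemble a little-endian 64-bit value from the
 * first eight bytes of a BPF output payload. */
static uint64_t decode_le64(const unsigned char *buf)
{
    uint64_t v = 0;

    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)buf[i] << (8 * i);
    return v;
}

/* Hypothetical helper: the same half-open range test that the BPF
 * program applies to regs.ip. */
static int ip_in_range(uint64_t ip)
{
    return ip >= RANGE_START && ip < RANGE_END;
}
```

Feeding it the first sample's bytes (3c 4c 21 81 ff ff ff ff) gives
0xffffffff81214c3c, which lies inside the configured range.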

In that example, the outer perf loads the BPF program that emits the
filtered details, and it runs a second perf as its workload to
configure the sampling. We can certainly improve how this works, and
it will become much more interesting once perf can emit maps and a
perf BPF program can populate one.

Here's sampleip_range.c:

/************************ BEGIN **************************/
#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample. Eg, look in /proc/kallsyms. The addresses will change for each
 * kernel version and build.
 */
#define RANGE_START  0xffffffff81214b90
#define RANGE_END    0xffffffff81214cd0

struct bpf_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
    (void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
    (void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
    unsigned long) = (void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
    .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(u32),
    .max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
    unsigned long long pad;
    struct pt_regs *regs;
    struct perf_event *event;
};

SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
    struct pt_regs regs = {};

    /* Copy the sampled registers out of kernel memory. */
    probe_read(&regs, sizeof(regs), ctx->regs);

    /* Emit the instruction pointer only if it is inside the range. */
    if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
        perf_event_output(ctx, &channel, get_smp_processor_id(),
            &regs.ip, sizeof(regs.ip));
    }
    return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/
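
For picking RANGE_START and RANGE_END, each /proc/kallsyms line has
the form "<hex address> <type> <name>", optionally followed by a
module name in brackets. A rough sketch of parsing one such line in
user space follows. The function name parse_kallsyms_line, the
128-byte name buffer, and the symbol example_func are assumptions made
for illustration; they do not come from the patch or from perf.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse one /proc/kallsyms line of the form
 * "<hex address> <type> <name>". Returns 1 on success and 0 if the
 * line does not match. The name buffer must hold at least 128 bytes;
 * longer symbol names are truncated (an assumption of this sketch).
 */
static int parse_kallsyms_line(const char *line, unsigned long long *addr,
                               char *type, char *name)
{
    return sscanf(line, "%llx %c %127s", addr, type, name) == 3;
}
```

The start of the range would be the address of the function of
interest, and the end would be the address of the next symbol listed
after it. Note that unprivileged reads of /proc/kallsyms typically
show zeroed addresses, so this needs to be run as root.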

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
>   # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for every hrtimer expiry, not just
the perf sampling timer. The perf:perf_hrtimer tracepoint also passes
the sampled registers and the perf event as arguments, which profiling
programs can use.

Thanks for the comments,

Brendan