From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757609AbcHCCvJ (ORCPT ); Tue, 2 Aug 2016 22:51:09 -0400
Received: from mail-yw0-f177.google.com ([209.85.161.177]:34207 "EHLO
	mail-yw0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1757559AbcHCCua (ORCPT ); Tue, 2 Aug 2016 22:50:30 -0400
MIME-Version: 1.0
In-Reply-To: <579C2059.9@huawei.com>
References: <1468970448-3309-1-git-send-email-bgregg@netflix.com> <579C2059.9@huawei.com>
From: Brendan Gregg
Date: Tue, 2 Aug 2016 19:44:17 -0700
Message-ID:
Subject: Re: [PATCH] perf/core: Add a tracepoint for perf sampling
To: "Wangnan (F)"
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel@vger.kernel.org, Alexei Starovoitov
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F) wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be
>>> used by BPF to run additional logic on each sample. For example, BPF
>>> can fetch stack traces and frequency count them in kernel context, for
>>> an efficient profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes, the following example samples at 999 Hertz, and emits the
instruction pointer only when it is within a custom address range, as
checked by BPF.
Eg:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
    -e ./sampleip_range.c/map:channel.event=evt/ \
    -a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]

# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
              dd  6501  3058.117379:
      BPF output: 0000: 3c 4c 21 81 ff ff ff ff

#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample. Eg, look in /proc/kallsyms. The addresses will change for each
 * kernel version and build.
 */
#define RANGE_START  0xffffffff81214b90
#define RANGE_END    0xffffffff81214cd0

struct bpf_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
	(void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
	(void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
	unsigned long) = (void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
	unsigned long long pad;
	struct pt_regs *regs;
	struct perf_event *event;
};

SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
	struct pt_regs regs = {};

	/* copy the sampled registers out of kernel memory */
	probe_read(&regs, sizeof(regs), ctx->regs);

	/* emit the instruction pointer only if it is in [RANGE_START, RANGE_END) */
	if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
		perf_event_output(ctx, &channel, get_smp_processor_id(),
		    &regs.ip, sizeof(regs.ip));
	}

	return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
>  # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for much more than just the perf
timer. The perf:perf_hrtimer tracepoint also has registers and perf
context as arguments, which can be used for profiling programs.

Thanks for the comments,

Brendan
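
For reference, the single sample shown in the perf script output above
can be checked by hand. The following is a minimal userspace sketch, not
part of the patch or of perf. It assumes the 8 payload bytes are the raw
regs.ip value in little-endian order (as on x86-64). decode_le64() and
ip_in_range() are hypothetical helper names introduced only for this
illustration, and the range constants are copied from sampleip_range.c,
so they are specific to that kernel build.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from sampleip_range.c; valid only for that kernel build. */
#define RANGE_START 0xffffffff81214b90ULL
#define RANGE_END   0xffffffff81214cd0ULL

/*
 * Hypothetical helper: assemble a 64-bit value from the eight payload
 * bytes of a "BPF output" hex dump, assuming little-endian byte order.
 */
static uint64_t decode_le64(const uint8_t bytes[8])
{
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v |= (uint64_t)bytes[i] << (8 * i);
	return v;
}

/* Same half-open range test that the BPF program applies to regs.ip. */
static bool ip_in_range(uint64_t ip)
{
	return ip >= RANGE_START && ip < RANGE_END;
}
```

Passing the bytes 3c 4c 21 81 ff ff ff ff to decode_le64() gives
0xffffffff81214c3c, and ip_in_range() accepts that address, which is
consistent with the filter in the BPF program.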