* linux 4.4, perf & BPF, and bpf_perf_event_output
@ 2016-01-12  0:07 Brendan Gregg
  2016-01-12  2:36 ` Wangnan (F)
  0 siblings, 1 reply; 5+ messages in thread
From: Brendan Gregg @ 2016-01-12  0:07 UTC (permalink / raw)
  To: linux-perf-use., Alexei Starovoitov, Wang Nan

G'Day,

Congrats Wang Nan on the new perf bpf features in 4.4! I just
revisited where I was at with them now that it's official, and have a
couple of questions.

So I can run something like this:

---syncsnoop.c---
/* perf.data: default event record only; trace_pipe:
bpf_trace_printk() output */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("func=sys_sync")
int bpf_func__sys_sync(void *ctx)
{
    char fmt[] = "sync()\n";
    bpf_trace_printk(fmt, sizeof (fmt));
    return 1;
};

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
------

With: perf record -a --clang-opt "-DLINUX_VERSION_CODE=0x40400"
--event syncsnoop.c sleep 5

And it works. I get PERF_RECORD_SAMPLE entries in perf.data, and my
custom strings in /sys/kernel/debug/tracing/trace_pipe.
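
For reference, the 0x40400 passed through --clang-opt is the value the kernel's KERNEL_VERSION(4, 4, 0) macro evaluates to. A minimal sketch of that arithmetic in plain C follows; the function name kernel_version_code is made up for illustration and is not part of perf or the kernel headers.

```c
#include <assert.h>

/*
 * Same arithmetic as the kernel's KERNEL_VERSION(a, b, c) macro:
 * major in bits 16 and up, minor in bits 8-15, patch in bits 0-7.
 * The function name is hypothetical.
 */
unsigned int kernel_version_code(unsigned int major,
                                 unsigned int minor,
                                 unsigned int patch)
{
    /* Each of minor and patch must fit in one byte. */
    assert(minor < 256 && patch < 256);
    return (major << 16) + (minor << 8) + patch;
}
```

Calling kernel_version_code(4, 4, 0) gives 0x40400, the value used on the command line above.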

Q1. Given bpf_trace_printk() & trace_pipe is a hack, is there a way
yet to print my custom strings to perf.data? Eg, something like
bpf_perf_event_output() to change the format string of the event.
Currently I just get:

# perf script
            sync 30207 [002]  2509.576984: perf_bpf_probe:func:
(ffffffff81221b20)
            sync 30209 [003]  2509.766632: perf_bpf_probe:func:
(ffffffff81221b20)
            sync 30229 [003]  2509.936299: perf_bpf_probe:func:
(ffffffff81221b20)
            sync 30230 [004]  2510.099059: perf_bpf_probe:func:
(ffffffff81221b20)
            sync 30231 [004]  2510.289351: perf_bpf_probe:func:
(ffffffff81221b20)

Q2. Apart from per-event strings, is there a way to emit the maps?
bpf-script-example.c populates a map, but I don't think it ends up in
perf.data. I think a common case would be to emit it once for the run.
Another common case would be to emit it once per second. Any way we
can do this? I did try bpf_perf_event_output(), but I don't think it
ended up in perf.data (at least, I don't see it using "perf script
-D").

thanks,

Brendan


* Re: linux 4.4, perf & BPF, and bpf_perf_event_output
  2016-01-12  0:07 linux 4.4, perf & BPF, and bpf_perf_event_output Brendan Gregg
@ 2016-01-12  2:36 ` Wangnan (F)
  2016-01-12 15:27   ` Arnaldo Carvalho de Melo
  2016-01-12 20:56   ` Brendan Gregg
  0 siblings, 2 replies; 5+ messages in thread
From: Wangnan (F) @ 2016-01-12  2:36 UTC (permalink / raw)
  To: Brendan Gregg, linux-perf-use., Alexei Starovoitov



On 2016/1/12 8:07, Brendan Gregg wrote:
> G'Day,
>
> Congrats Wang Nan on the new perf bpf features in 4.4! I just
> revisited where I was at with them now that it's official, and have a
> couple of questions.
>
> So I can run something like this:
>
> ---syncsnoop.c---
> /* perf.data: default event record only; trace_pipe:
> bpf_trace_printk() output */
> #include <uapi/linux/bpf.h>
> #include "bpf_helpers.h"
>
> SEC("func=sys_sync")
> int bpf_func__sys_sync(void *ctx)
> {
>      char fmt[] = "sync()\n";
>      bpf_trace_printk(fmt, sizeof (fmt));
>      return 1;
> };
>
> char _license[] SEC("license") = "GPL";
> int _version SEC("version") = LINUX_VERSION_CODE;
> ------
>
> With: perf record -a --clang-opt "-DLINUX_VERSION_CODE=0x40400"
> --event syncsnoop.c sleep 5
>
> And it works. I get PERF_RECORD_SAMPLE entries in perf.data, and my
> custom strings in /sys/kernel/debug/tracing/trace_pipe.
>
> Q1. Given bpf_trace_printk() & trace_pipe is a hack, is there a way
> yet to print my custom strings to perf.data? Eg, something like
> bpf_perf_event_output() to change the format string of the event.
> Currently I just get:
>
> # perf script
>              sync 30207 [002]  2509.576984: perf_bpf_probe:func:
> (ffffffff81221b20)
>              sync 30209 [003]  2509.766632: perf_bpf_probe:func:
> (ffffffff81221b20)
>              sync 30229 [003]  2509.936299: perf_bpf_probe:func:
> (ffffffff81221b20)
>              sync 30230 [004]  2510.099059: perf_bpf_probe:func:
> (ffffffff81221b20)
>              sync 30231 [004]  2510.289351: perf_bpf_probe:func:
> (ffffffff81221b20)
Yes, I have implemented this feature. The patch has been posted, but it
is not in 4.4. I hope you will be able to use this feature in v4.5;
that depends on Arnaldo.

There is a small example in the commit message of [1]. The basic workflow is:

  1. Create a bpf-output map in your BPF file
  2. Output data to it with bpf_perf_event_output in the BPF source
  3. Create a bpf-output event on the perf command line
  4. Use perf-CTF conversion
  5. Use babeltrace to see your output in raw format
  6. Use a Python script for decoding

Currently you cannot see the string in normal 'perf script' output.
I sent a mail asking for suggestions about the perf script output format
in [2] on Oct. 28. Please check your mailbox.

At the end of this mail I provide a more detailed description.

[1] http://lkml.kernel.org/g/1452520124-2073-26-git-send-email-wangnan0@huawei.com
[2] http://lkml.kernel.org/g/5630AC2A.7030308@huawei.com

> Q2. Apart from per-event strings, is there a way to emit the maps?
> bpf-script-example.c populates a map, but I don't think it ends up in
> perf.data. I think a common case would be to emit it once for the run.
> Another common case would be to emit it once per second. Any way we
> can do this? I did try bpf_perf_event_output(), but I don't think it
> ended up in perf.data (at least, I don't see it using "perf script
> -D").

Not supported yet. Map data doesn't go to 'perf.data' at all. But I have
another idea.

I'm thinking about linking user-mode BPF into perf, so that we can put
everything together into one BPF source, including:

data sampling, filtering, accumulation, pretty-printing

which makes it similar to dtrace.

Here's an example. In this example we dump perf.data output when
we find that a write system call takes longer than expected (in my recent
patch series perf can act as a flight recorder), then print the
map data we collected during recording.

/* Define message passed between BPF script and perf */
enum perf_cmd {
   DUMP,
};
enum perf_cmd cmd;

/* define histogram map */
SEC("map")
struct bpf_map_def hist_map = { ... };

SEC("func=sys_write")
int sys_write_enter(...) {  }

SEC("func=sys_write%return")
int sys_write_exit(...) {
   enum perf_cmd cmd;

   /* check how long this syscall takes */
   /* maintain histogram */
   if (syscall_too_long) {
     cmd = DUMP;
     bpf_perf_event_output(..., &cmd, ...);   /* <-- trigger a dump */
   }
}

/* Following are user mode BPF scripts which are run in perf context */

SEC("mode=perf;action=recv_bpf_msg")
int perf_recv_bpf_msg(void *pcmd)
{
    ...
    switch (*pcmd) {
      ...
      case DUMP:
        /* make perf dump a perf.data. */
        perf_hist_output(&hist_map);  /* let perf output histogram */
        ...
    }
}

SEC("mode=perf;action=perf_end")
int perf_end(...)
{
   print_the_final_result(&hist_map);
   ...
}

Thank you.

> thanks,
>
> Brendan

Detailed description of perf event output:

First, you need to create a bpf-output map in your BPF file:

  struct bpf_map_def SEC("maps") channel = {
      .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
      .key_size = sizeof(int),
      .value_size = sizeof(u32),
      .max_entries = __NR_CPUS__,
  };

Then you output data to it with bpf_perf_event_output (this is its name in
samples/bpf/bpf_helpers.h; in my commit message it is perf_event_output):

  SEC("func=sys_write")
  int bpf_func__sys_write(void *ctx)
  {
      char fmt[] = "write()\n";
      bpf_perf_event_output(ctx, &channel,
                            bpf_get_smp_processor_id(),
                            fmt, sizeof (fmt));
      return 1;
  };

When running perf you need to create the channel by opening a bpf-output
event and connecting it to the 'channel' map defined in your BPF source file:

# perf record -e evt=bpf-output/no-inherit/ -e ./test.c/maps:channel.event=evt/ ls

Then in perf.data you will see two samples: one for the function you
probed (sys_write), the other for the BPF output.

# perf script
               ls 32115 1009664.006228: evt=bpf-output/no-inherit/: ffffffff811fcac1 sys_write ([kernel.kallsyms])
               ls 32115 [002] 1009664.006229: perf_bpf_probe:func: (ffffffff811fcac0)

You can't see the string in this output, because perf script doesn't yet
support decoding data passed through a bpf-output event. For now you can
see your string with 'perf script -D', like this:

  . ... raw event: size 56 bytes
  .  0000:  09 00 00 00 01 00 38 00 01 16 00 00 00 00 00 00 ......8.........
  .  0010:  c1 ca 1f 81 ff ff ff ff 73 7d 00 00 73 7d 00 00 ........s}..s}..
  .  0020:  bb 09 6f b8 48 96 03 00 0c 00 00 00 77 72 69 74 ..o.H.......writ      <--- *HERE*
  .  0030:  65 28 29 0a 00 00 00 00                          e().....
  .
  4294967295 1009664006228411 0x2ed0 [0x38]: PERF_RECORD_SAMPLE(IP, 0x1): 32115/32115: 0xffffffff811fcac1 period: 1 addr: 0
   ... thread: ls:32115
   ...... dso: /home/wangnan/.debug/.build-id/fb/0f83d011364583af77e563a390dfc1b78cce8d
                ls 32115 1009664.006228: evt=bpf-output/no-inherit/: ffffffff811fcac1 sys_write ([kernel.kallsyms])
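
The 56 raw bytes in the dump can be picked apart by hand. The sketch below does so in plain C. The field layout (header, id, ip, pid/tid, time, raw size, payload) is read off this one record and depends on the sample_type perf chose for the event, so it is an assumption tied to this dump rather than a general perf.data parser. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the one sample shown in the dump (layout is an assumption). */
struct bpf_output_sample {
    uint32_t type;       /* 9 == PERF_RECORD_SAMPLE */
    uint16_t misc;
    uint16_t size;       /* total record size in bytes */
    uint64_t id;
    uint64_t ip;
    uint32_t pid;
    uint32_t tid;
    uint64_t time;
    uint32_t raw_size;   /* bytes of BPF payload that follow */
    const uint8_t *raw;  /* points into the caller's buffer */
};

/* Read an nbytes-wide little-endian integer, independent of host order. */
static uint64_t read_le(const uint8_t *p, size_t nbytes)
{
    uint64_t v = 0;

    for (size_t i = 0; i < nbytes; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Returns 0 on success, -1 if the buffer is too short or inconsistent. */
int parse_bpf_output_sample(const uint8_t *buf, size_t len,
                            struct bpf_output_sample *out)
{
    /* header 8 + id 8 + ip 8 + pid/tid 8 + time 8 + raw_size 4 */
    enum { FIXED_LEN = 44 };

    assert(buf != NULL && out != NULL);
    if (len < FIXED_LEN)
        return -1;

    out->type     = (uint32_t)read_le(buf + 0, 4);
    out->misc     = (uint16_t)read_le(buf + 4, 2);
    out->size     = (uint16_t)read_le(buf + 6, 2);
    out->id       = read_le(buf + 8, 8);
    out->ip       = read_le(buf + 16, 8);
    out->pid      = (uint32_t)read_le(buf + 24, 4);
    out->tid      = (uint32_t)read_le(buf + 28, 4);
    out->time     = read_le(buf + 32, 8);
    out->raw_size = (uint32_t)read_le(buf + 40, 4);
    out->raw      = buf + FIXED_LEN;

    /* The payload must fit inside the record, and the record in the buffer. */
    if (out->size > len || (size_t)FIXED_LEN + out->raw_size > out->size)
        return -1;
    return 0;
}
```

Fed the bytes from the dump, this yields pid 32115 and a 12-byte payload that starts with the NUL-terminated string "write()\n"; the remaining payload bytes are padding.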

For me, I prefer CTF conversion:

  # ~/perf data convert --to-ctf ./out.ctf
  [ perf data convert: Converted 'perf.data' into CTF data './out.ctf' ]
  [ perf data convert: Converted and wrote 0.000 MB (2 samples) ]
  # babeltrace ./out.ctf/
  [16:27:44.006228411] (+?.?????????) evt=bpf-output/no-inherit/: { cpu_id = 0 }, { perf_ip = 0xFFFFFFFF811FCAC1, perf_tid = 32115, perf_pid = 32115, perf_id = 5633,  raw_len = 3, raw_data = [ [0] = 0x74697277, [1] = 0xA292865, [2] = 0x0 ] }
  [16:27:44.006229678] (+0.000001267) perf_bpf_probe:func: { cpu_id = 2 }, { perf_ip = 0xFFFFFFFF811FCAC1, perf_tid = 32115, perf_pid = 32115, perf_id = 5641, perf_period = 1, common_type = 1177, common_flags = 1, common_preempt_count = 0, common_pid = 32115, _probe_ip = 0xFFFFFFFF811FCAC0 }

Your string is there:

  raw_data = [ [0] = 0x74697277, [1] = 0xA292865, [2] = 0x0 ] }

That is still not very readable, but don't forget we have the babeltrace
Python bindings. A simple Python script can decode it for you.
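
The raw_data words are the string bytes packed as 32-bit little-endian integers. The mail suggests a Python script using the babeltrace bindings; the unpacking step on its own can be sketched in plain C as follows. The function name raw_words_to_string is made up for illustration, and reading the words out of the CTF trace is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Unpack 32-bit little-endian words, as printed in babeltrace's
 * raw_data array, into a C string. Stops at the first NUL byte or when
 * the output buffer is full. Always NUL-terminates and returns the
 * number of bytes written before the terminator.
 */
size_t raw_words_to_string(const uint32_t *words, size_t nwords,
                           char *out, size_t out_len)
{
    size_t n = 0;

    assert(words != NULL && out != NULL && out_len > 0);
    for (size_t i = 0; i < nwords; i++) {
        /* Lowest byte of each word comes first in memory. */
        for (int shift = 0; shift < 32; shift += 8) {
            char c = (char)((words[i] >> shift) & 0xff);

            if (c == '\0' || n + 1 >= out_len) {
                out[n] = '\0';
                return n;
            }
            out[n++] = c;
        }
    }
    out[n] = '\0';
    return n;
}
```

With the three words from the babeltrace output (0x74697277, 0xA292865, 0x0) this produces the 8-byte string "write()\n".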


* Re: linux 4.4, perf & BPF, and bpf_perf_event_output
  2016-01-12  2:36 ` Wangnan (F)
@ 2016-01-12 15:27   ` Arnaldo Carvalho de Melo
  2016-01-12 20:56   ` Brendan Gregg
  1 sibling, 0 replies; 5+ messages in thread
From: Arnaldo Carvalho de Melo @ 2016-01-12 15:27 UTC (permalink / raw)
  To: Wangnan (F); +Cc: Brendan Gregg, linux-perf-use., Alexei Starovoitov

Em Tue, Jan 12, 2016 at 10:36:41AM +0800, Wangnan (F) escreveu:
> 
> 
> On 2016/1/12 8:07, Brendan Gregg wrote:
> >G'Day,
> >
> >Congrats Wang Nan on the new perf bpf features in 4.4! I just
> >revisited where I was at with them now that it's official, and have a
> >couple of questions.
> >
> >So I can run something like this:
> >
> >---syncsnoop.c---
> >/* perf.data: default event record only; trace_pipe:
> >bpf_trace_printk() output */
> >#include <uapi/linux/bpf.h>
> >#include "bpf_helpers.h"
> >
> >SEC("func=sys_sync")
> >int bpf_func__sys_sync(void *ctx)
> >{
> >     char fmt[] = "sync()\n";
> >     bpf_trace_printk(fmt, sizeof (fmt));
> >     return 1;
> >};
> >
> >char _license[] SEC("license") = "GPL";
> >int _version SEC("version") = LINUX_VERSION_CODE;
> >------
> >
> >With: perf record -a --clang-opt "-DLINUX_VERSION_CODE=0x40400"
> >--event syncsnoop.c sleep 5
> >
> >And it works. I get PERF_RECORD_SAMPLE entries in perf.data, and my
> >custom strings in /sys/kernel/debug/tracing/trace_pipe.
> >
> >Q1. Given bpf_trace_printk() & trace_pipe is a hack, is there a way
> >yet to print my custom strings to perf.data? Eg, something like
> >bpf_perf_event_output() to change the format string of the event.
> >Currently I just get:
> >
> ># perf script
> >             sync 30207 [002]  2509.576984: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30209 [003]  2509.766632: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30229 [003]  2509.936299: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30230 [004]  2510.099059: perf_bpf_probe:func:
> >(ffffffff81221b20)
> >             sync 30231 [004]  2510.289351: perf_bpf_probe:func:
> >(ffffffff81221b20)
> Yes. I have implemented this feature. Patch has posted, but not
> in 4.4. I hope you will be able to use this feature in v4.5.
> It depends on Arnaldo.

Right, I have to go testing this stuff patch by patch; it's on my table
one more time this week.
 
[SNIP]


* Re: linux 4.4, perf & BPF, and bpf_perf_event_output
  2016-01-12  2:36 ` Wangnan (F)
  2016-01-12 15:27   ` Arnaldo Carvalho de Melo
@ 2016-01-12 20:56   ` Brendan Gregg
  2016-01-13  2:54     ` Wangnan (F)
  1 sibling, 1 reply; 5+ messages in thread
From: Brendan Gregg @ 2016-01-12 20:56 UTC (permalink / raw)
  To: Wangnan (F); +Cc: linux-perf-use., Alexei Starovoitov

On Mon, Jan 11, 2016 at 6:36 PM, Wangnan (F) <wangnan0@huawei.com> wrote:
>
>
> On 2016/1/12 8:07, Brendan Gregg wrote:
>>
>> G'Day,
>>
>> Congrats Wang Nan on the new perf bpf features in 4.4! I just
>> revisited where I was at with them now that it's official, and have a
>> couple of questions.
>>
>> So I can run something like this:
>>
>> ---syncsnoop.c---
>> /* perf.data: default event record only; trace_pipe:
>> bpf_trace_printk() output */
>> #include <uapi/linux/bpf.h>
>> #include "bpf_helpers.h"
>>
>> SEC("func=sys_sync")
>> int bpf_func__sys_sync(void *ctx)
>> {
>>      char fmt[] = "sync()\n";
>>      bpf_trace_printk(fmt, sizeof (fmt));
>>      return 1;
>> };
>>
>> char _license[] SEC("license") = "GPL";
>> int _version SEC("version") = LINUX_VERSION_CODE;
>> ------
>>
>> With: perf record -a --clang-opt "-DLINUX_VERSION_CODE=0x40400"
>> --event syncsnoop.c sleep 5
>>
>> And it works. I get PERF_RECORD_SAMPLE entries in perf.data, and my
>> custom strings in /sys/kernel/debug/tracing/trace_pipe.
>>
>> Q1. Given bpf_trace_printk() & trace_pipe is a hack, is there a way
>> yet to print my custom strings to perf.data? Eg, something like
>> bpf_perf_event_output() to change the format string of the event.
>> Currently I just get:
>>
>> # perf script
>>              sync 30207 [002]  2509.576984: perf_bpf_probe:func:
>> (ffffffff81221b20)
>>              sync 30209 [003]  2509.766632: perf_bpf_probe:func:
>> (ffffffff81221b20)
>>              sync 30229 [003]  2509.936299: perf_bpf_probe:func:
>> (ffffffff81221b20)
>>              sync 30230 [004]  2510.099059: perf_bpf_probe:func:
>> (ffffffff81221b20)
>>              sync 30231 [004]  2510.289351: perf_bpf_probe:func:
>> (ffffffff81221b20)
>
> Yes. I have implemented this feature. Patch has posted, but not
> in 4.4. I hope you will be able to use this feature in v4.5.
> It depends on Arnaldo.
>
> There is a small example at commit message of [1]. The basic workflow is:
>
>  1. Create a bpf-output map in your BPF file
>  2. Output data to it by bpf_perf_event_output in BPF source
>  3. Create bpf-output event in perf cmdline

Ok, I've browsed the examples, so considering this:

 # perf record -g -e evt=bpf-output/no-inherit/ \
                  -e ./test_bpf_output.c/maps.map_channel.event=evt/ -a ls

Please tell me if I'm understanding these correctly:

A. bpf-output is a dummy event used to pass data from kernel to user.
ie, I'll see them as PERF_RECORD_SAMPLE in "perf script -D".
B. bpf-output is triggered by bpf_perf_event_output().
C. The "evt=" is giving it an alias for later reference.
D. The "/no-inherit/" is to stop the dummy event from being used more
than once, by child tasks.
E. The "maps.map_channel.event=evt" ... I'm not sure what "event"
means here: is it associated with bpf_perf_event_output() being
called? ie, bpf_perf_event_output() -> bpf-output -> .event ?. ... So
I think this is saying that the map_channel map's
bpf_perf_event_output() calls should be emitted via the "evt" alias,
which we earlier defined as bpf-output.

Seems like "-e evt=bpf-output/no-inherit/" is redundant (or at least
could be an option, like "-x", but we seem to be running out of
letters!). If the user specifies a C program, then uses
bpf_perf_event_output(), then maybe perf should automatically begin
recording bpf-output without the user needing to specify it. After
all, lots of other stuff already goes into perf.data that I didn't
explicitly ask for (like PERF_RECORD_MMAP). :)

Also, "/maps.map_channel.event=evt/" seems redundant too, and could be
the default behavior. ie, I'd like to just run:

# perf record -g -e test_bpf_output.c -a ls

And then get dummy PERF_RECORD_SAMPLE events in my perf.data that have
the bpf_perf_event_output() details in them. If I want to customize them,
using the above -e syntax, then fine, but that would be optional.

While this mechanism looks like it can pass bpf_perf_event_output(), I
guess a separate question is how we can dump map data at the end of
runs. Eg, imagine I'm using a map to store a histogram, which I want
dumped once at the end of the run. I don't have a specific place to
put a bpf_perf_event_output().

PS. regarding SEC("func=sys_sync") -- any way to trace a kretprobe? :)

thanks,

Brendan


* Re: linux 4.4, perf & BPF, and bpf_perf_event_output
  2016-01-12 20:56   ` Brendan Gregg
@ 2016-01-13  2:54     ` Wangnan (F)
  0 siblings, 0 replies; 5+ messages in thread
From: Wangnan (F) @ 2016-01-13  2:54 UTC (permalink / raw)
  To: Brendan Gregg; +Cc: linux-perf-use., Alexei Starovoitov



On 2016/1/13 4:56, Brendan Gregg wrote:
> On Mon, Jan 11, 2016 at 6:36 PM, Wangnan (F) <wangnan0@huawei.com> wrote:
>>
>> On 2016/1/12 8:07, Brendan Gregg wrote:
>>> G'Day,
[SNIP]

>> Yes. I have implemented this feature. Patch has posted, but not
>> in 4.4. I hope you will be able to use this feature in v4.5.
>> It depends on Arnaldo.
>>
>> There is a small example at commit message of [1]. The basic workflow is:
>>
>>   1. Create a bpf-output map in your BPF file
>>   2. Output data to it by bpf_perf_event_output in BPF source
>>   3. Create bpf-output event in perf cmdline
> Ok, I've browsed the examples, so considering this:
>
>   # perf record -g -e evt=bpf-output/no-inherit/ \
>                    -e ./test_bpf_output.c/maps.map_channel.event=evt/ -a ls
>
> Please tell me if I'm understanding these correctly:
>
> A. bpf-output is a dummy event used to pass data from kernel to user.
> ie, I'll see them as PERF_RECORD_SAMPLE in "perf script -D".

Right.

> B. bpf-output is triggered by bpf_perf_event_output().
Right.
> C. The "evt=" is giving it an alias for later reference.

Right.
> D. The "/no-inherit/" is to stop the dummy event from being used more
> than once, by child tasks.

Yes, but it needs more work to explain. See below.

> E. The "maps.map_channel.event=evt" ...

maps:map_channel.event=evt

"." --> ":"


See below for an explanation.

> I'm not sure what "event"
> means here: is it associated with bpf_perf_event_output() being
> called? ie, bpf_perf_event_output() -> bpf-output -> .event ?. ... So
> I think this is saying that the map_channel map's
> bpf_perf_event_output() calls should be emitted via the "evt" alias,
> which we earlier defined as bpf-output.
>
> Seems like "-e evt=bpf-output/no-inherit/" is redundant (or at least
> could be an option, like "-x", but we seem to be running out of
> letters!). If the user specifies a C program, then uses
> bpf_perf_event_output(), then maybe perf should automatically begin
> recording bpf-output without the user needing to specify it. After
> all, lots of other stuff already goes into perf.data that I didn't
> explicitly ask for (like PERF_RECORD_MMAP). :)

That can be discussed. We could create some syntactic sugar. Could
you please give a detailed suggestion?

Without the sugar we can do other interesting things.
For example:

  # perf record -e sync_trace=bpf-output/no-inherit/ \
                -e display_trace=bpf-output/no-inherit/ \
                ...

Here we create two bpf-output events for different purposes. In the
BPF file we can simply output zero-size data to the different events
to indicate what happened. Then the 'perf script' output is enough for me,
with no need for CTF conversion.

Also, in the above example we can further add /call-graph=no/
to the bpf-output events, because we only need to know that 'something
is happening'; we don't need the full call graph at the point where we
find the unusual behavior.

>
> Also, "/maps.map_channel.event=evt/" seems redundant too, and could be
> the default behavior. ie, I'd like to just run:
>
> # perf record -g -e test_bpf_output.c -a ls
>
> And then get dummy PERF_RECORD_SAMPLE events in my perf.data that has
> the bpf_perf_event_output() details in. If I want to customize them,
> using the above -e syntax, then fine, but that would be optional.

See above. We can add sugar for it. Could you please give
a detailed suggestion?

> While this mechanism looks like it can pass bpf_perf_event_output(), I
> guess a separate question is how we can dump map data at the end of
> runs. Eg, imagine I'm using a map to store a histogram, which I want
> dumped once at the end of the run. I don't have a specific place to
> put a bpf_perf_event_output().
>
> PS. regarding SEC("func=sys_sync") -- anyway to trace a kretprobe? :)

You can use:

SEC("func=sys_sync%return")

Now let's discuss this part in detail.

1. perf creates multiple perf event instances for an event.
    Each instance is bound to a processor. For example, on an 8-core
    machine, '-e cycles' creates 8 perf event instances.

2. Because of 1, a BPF program needs to operate on multiple perf
    events.

3. Because of 2, a BPF program operates on perf events through a map
    of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. This is why the
    interface you see is not as straightforward as you might expect.
    It is also the reason why the perf event array needs at least
    __NR_CPUS__ slots.

4. Operating on an inherited perf event in a BPF program is dangerous, so
    the kernel doesn't allow inserting an inherited event into the map in 3.
    This is why we need /no-inherit/. (However, we could
    provide sugar to automatically turn off the inherit setting
    if the event is system-wide.)

5. So the workflow should be:
    1) Create perf events and give them names,
       using '-e evt=<events>'

    2) Fill them into the map, using:
       /maps:map_channel.event=evt/

6. Why do we need such a long string as '/maps:map_channel.event=evt/'?

    The full maps configuration syntax is:

    maps:[<arraymap>].value<indices>=[value]
    maps:[<eventmap>].event<indices>=[event]

    With this configuration we can not only fill perf events
    into a map, but also fill different initial values into a normal array map.
    For example, we can put the pid of a program into an array map and use
    that pid in the BPF program, without having to recompile the BPF program.
    Used this way, the map is very similar to global variables.

    maps:global_vars.value[0]=`ps -e | grep X | awk '{print $1}'`
     ^           ^      ^  ^
     |           |      |  |
   prefix        |      | only set the first element
              map name  |
                        |
                we are inserting value


    maps:map_channel.event=evt
     ^           ^      ^   ^
     |           |      |   |
   prefix        |      | event alias
              map name  |
                        |
                we are filling perf event

Thank you.

