[RFC PATCH 0/2] bpf: enable/disable events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/2] bpf: enable/disable events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling
@ 2015-10-12  9:02 Kaixu Xia
  2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
  2015-10-12  9:02 ` [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers Kaixu Xia
  0 siblings, 2 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-10-12  9:02 UTC (permalink / raw)
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

In some scenarios we don't want to output trace data when perf sampling
in order to reduce overhead. For example, perf can be run as daemon to
dump trace data when necessary, such as the system performance goes down.

This patchset adds the helpers bpf_perf_event_sample_enable/disable() to
implement this function. By applying these helpers, we can enable/disable
events stored in PERF_EVENT_ARRAY maps trace data output and get the
samples we are most interested in.

We also need to make the perf user side can adds the normal PMU events
from perf cmdline to PERF_EVENT_ARRAY maps. My colleague He Kuang is doing
this work. In the following example, the cycles will be stored in the
PERF_EVENT_ARRAY maps.

Before this patch,
   $ ./perf record -e cycles -a sleep 1
   $ ./perf report --stdio
	# To display the perf.data header info, please use --header/--header-only option
	#
	#
	# Total Lost Samples: 0
	#
	# Samples: 655  of event 'cycles'
	# Event count (approx.): 129323548
	...

After this patch,
   $ ./perf record -e pmux=cycles --event perf-bpf.o/my_cycles_map=pmux/ -a sleep 1
   $ ./perf report --stdio
	# To display the perf.data header info, please use --header/--header-only option
	#
	#
	# Total Lost Samples: 0
	#
	# Samples: 23  of event 'cycles'
	# Event count (approx.): 2064170
	...

The bpf program example:

  struct bpf_map_def SEC("maps") my_cycles_map = {
          .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
          .key_size = sizeof(int),
          .value_size = sizeof(u32),
          .max_entries = 32, 
  };

  SEC("enter=sys_write")
  int bpf_prog_1(struct pt_regs *ctx)
  {
          bpf_perf_event_sample_enable(&my_cycles_map);
          return 0;
  }

  SEC("exit=sys_write%return")
  int bpf_prog_2(struct pt_regs *ctx)
  {
          bpf_perf_event_sample_disable(&my_cycles_map);
          return 0;
  }


Kaixu Xia (2):
  perf: Add the flag sample_disable not to output data on samples
  bpf: Implement bpf_perf_event_sample_enable/disable() helpers

 include/linux/bpf.h        |  3 +++
 include/linux/perf_event.h |  2 ++
 include/uapi/linux/bpf.h   |  2 ++
 kernel/bpf/arraymap.c      |  5 +++++
 kernel/bpf/verifier.c      |  4 +++-
 kernel/events/core.c       |  3 +++
 kernel/trace/bpf_trace.c   | 34 ++++++++++++++++++++++++++++++++++
 7 files changed, 52 insertions(+), 1 deletion(-)

-- 
1.8.3.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12  9:02 [RFC PATCH 0/2] bpf: enable/disable events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling Kaixu Xia
@ 2015-10-12  9:02 ` Kaixu Xia
  2015-10-12 12:02   ` Peter Zijlstra
                     ` (3 more replies)
  2015-10-12  9:02 ` [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers Kaixu Xia
  1 sibling, 4 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-10-12  9:02 UTC (permalink / raw)
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

In some scenarios we don't want to output trace data when sampling
to reduce overhead. This patch adds the flag sample_disable to
implement this function. By setting this flag and integrating with
ebpf, we can control the data output process and get the samples we
are most interested in.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 include/linux/bpf.h        | 1 +
 include/linux/perf_event.h | 2 ++
 kernel/bpf/arraymap.c      | 5 +++++
 kernel/events/core.c       | 3 +++
 4 files changed, 11 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f57d7fe..25e073d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,7 @@ struct bpf_map {
 	u32 max_entries;
 	const struct bpf_map_ops *ops;
 	struct work_struct work;
+	atomic_t perf_sample_disable;
 };
 
 struct bpf_map_type_list {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 092a0e8..0606d1d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -483,6 +483,8 @@ struct perf_event {
 	perf_overflow_handler_t		overflow_handler;
 	void				*overflow_handler_context;
 
+	atomic_t			*sample_disable;
+
 #ifdef CONFIG_EVENT_TRACING
 	struct trace_event_call		*tp_event;
 	struct event_filter		*filter;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 29ace10..4ae82c9 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -51,6 +51,9 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 
 	array->elem_size = elem_size;
 
+	if (attr->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY)
+		atomic_set(&array->map.perf_sample_disable, 1);
+
 	return &array->map;
 }
 
@@ -298,6 +301,8 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
 		perf_event_release_kernel(event);
 		return ERR_PTR(-EINVAL);
 	}
+
+	event->sample_disable = &map->perf_sample_disable;
 	return event;
 }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..f6ef45c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
 		irq_work_queue(&event->pending);
 	}
 
+	if ((event->sample_disable) && atomic_read(event->sample_disable))
+		return ret;
+
 	if (event->overflow_handler)
 		event->overflow_handler(event, data, regs);
 	else
-- 
1.8.3.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
@ 2015-10-12 12:02   ` Peter Zijlstra
  2015-10-12 12:05     ` Wangnan (F)
  2015-10-12 14:14   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-12 12:02 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Mon, Oct 12, 2015 at 09:02:42AM +0000, Kaixu Xia wrote:
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -483,6 +483,8 @@ struct perf_event {
>  	perf_overflow_handler_t		overflow_handler;
>  	void				*overflow_handler_context;
>  
> +	atomic_t			*sample_disable;
> +
>  #ifdef CONFIG_EVENT_TRACING
>  	struct trace_event_call		*tp_event;
>  	struct event_filter		*filter;

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b11756f..f6ef45c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>  		irq_work_queue(&event->pending);
>  	}
>  
> +	if ((event->sample_disable) && atomic_read(event->sample_disable))
> +		return ret;
> +
>  	if (event->overflow_handler)
>  		event->overflow_handler(event, data, regs);
>  	else

Try and guarantee sample_disable lives in the same cacheline as
overflow_handler.

I think we should at the very least replace the kzalloc() currently used
with a cacheline aligned alloc, and check the structure layout to verify
these two do in fact share a cacheline.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12 12:02   ` Peter Zijlstra
@ 2015-10-12 12:05     ` Wangnan (F)
  2015-10-12 12:12       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Wangnan (F) @ 2015-10-12 12:05 UTC (permalink / raw)
  To: Peter Zijlstra, Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	linux-kernel, pi3orama, hekuang, netdev



On 2015/10/12 20:02, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 09:02:42AM +0000, Kaixu Xia wrote:
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -483,6 +483,8 @@ struct perf_event {
>>   	perf_overflow_handler_t		overflow_handler;
>>   	void				*overflow_handler_context;
>>   
>> +	atomic_t			*sample_disable;
>> +
>>   #ifdef CONFIG_EVENT_TRACING
>>   	struct trace_event_call		*tp_event;
>>   	struct event_filter		*filter;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b11756f..f6ef45c 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>>   		irq_work_queue(&event->pending);
>>   	}
>>   
>> +	if ((event->sample_disable) && atomic_read(event->sample_disable))
>> +		return ret;
>> +
>>   	if (event->overflow_handler)
>>   		event->overflow_handler(event, data, regs);
>>   	else
> Try and guarantee sample_disable lives in the same cacheline as
> overflow_handler.

Could you please explain why we need them to be in a same cacheline?

Thank you.

> I think we should at the very least replace the kzalloc() currently used
> with a cacheline aligned alloc, and check the structure layout to verify
> these two do in fact share a cacheline.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12 12:05     ` Wangnan (F)
@ 2015-10-12 12:12       ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-12 12:12 UTC (permalink / raw)
  To: Wangnan (F)
  Cc: Kaixu Xia, ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa,
	daniel, linux-kernel, pi3orama, hekuang, netdev

On Mon, Oct 12, 2015 at 08:05:20PM +0800, Wangnan (F) wrote:
> 
> 
> On 2015/10/12 20:02, Peter Zijlstra wrote:
> >On Mon, Oct 12, 2015 at 09:02:42AM +0000, Kaixu Xia wrote:
> >>--- a/include/linux/perf_event.h
> >>+++ b/include/linux/perf_event.h
> >>@@ -483,6 +483,8 @@ struct perf_event {
> >>  	perf_overflow_handler_t		overflow_handler;
> >>  	void				*overflow_handler_context;
> >>+	atomic_t			*sample_disable;
> >>+
> >>  #ifdef CONFIG_EVENT_TRACING
> >>  	struct trace_event_call		*tp_event;
> >>  	struct event_filter		*filter;
> >>diff --git a/kernel/events/core.c b/kernel/events/core.c
> >>index b11756f..f6ef45c 100644
> >>--- a/kernel/events/core.c
> >>+++ b/kernel/events/core.c
> >>@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
> >>  		irq_work_queue(&event->pending);
> >>  	}
> >>+	if ((event->sample_disable) && atomic_read(event->sample_disable))
> >>+		return ret;
> >>+
> >>  	if (event->overflow_handler)
> >>  		event->overflow_handler(event, data, regs);
> >>  	else
> >Try and guarantee sample_disable lives in the same cacheline as
> >overflow_handler.
> 
> Could you please explain why we need them to be in a same cacheline?

Because otherwise you've just added a cacheline miss to this relatively
hot path.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
  2015-10-12 12:02   ` Peter Zijlstra
@ 2015-10-12 14:14   ` kbuild test robot
  2015-10-12 19:20   ` Alexei Starovoitov
  2015-10-13 12:00   ` Sergei Shtylyov
  3 siblings, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2015-10-12 14:14 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: kbuild-all, ast, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel, xiakaixu, wangnan0,
	linux-kernel, pi3orama, hekuang, netdev

[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]

Hi Kaixu,

[auto build test ERROR on tip/perf/core -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/Kaixu-Xia/bpf-enable-disable-events-stored-in-PERF_EVENT_ARRAY-maps-trace-data-output-when-perf-sampling/20151012-170616
config: m68k-allyesconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   kernel/bpf/arraymap.c: In function 'perf_event_fd_array_get_ptr':
>> kernel/bpf/arraymap.c:305:7: error: 'struct perf_event' has no member named 'sample_disable'
     event->sample_disable = &map->perf_sample_disable;
          ^

vim +305 kernel/bpf/arraymap.c

   299		if (attr->type != PERF_TYPE_RAW &&
   300		    attr->type != PERF_TYPE_HARDWARE) {
   301			perf_event_release_kernel(event);
   302			return ERR_PTR(-EINVAL);
   303		}
   304	
 > 305		event->sample_disable = &map->perf_sample_disable;
   306		return event;
   307	}
   308	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 34573 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
  2015-10-12 12:02   ` Peter Zijlstra
  2015-10-12 14:14   ` kbuild test robot
@ 2015-10-12 19:20   ` Alexei Starovoitov
  2015-10-13  2:30     ` xiakaixu
  2015-10-13 12:00   ` Sergei Shtylyov
  3 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-12 19:20 UTC (permalink / raw)
  To: Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 10/12/15 2:02 AM, Kaixu Xia wrote:
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f57d7fe..25e073d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -39,6 +39,7 @@ struct bpf_map {
>   	u32 max_entries;
>   	const struct bpf_map_ops *ops;
>   	struct work_struct work;
> +	atomic_t perf_sample_disable;
>   };
>
>   struct bpf_map_type_list {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 092a0e8..0606d1d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -483,6 +483,8 @@ struct perf_event {
>   	perf_overflow_handler_t		overflow_handler;
>   	void				*overflow_handler_context;
>
> +	atomic_t			*sample_disable;

this looks fragile and unnecessary.
Why add such field to generic bpf_map and carry its pointer into perf_event?
Single extra field in perf_event would have been enough.
Even better is to avoid adding any fields.
There is already event->state why not to use that?
The proper perf_event_enable/disable are so heavy that another
mechanism needed? cpu_function_call is probably too much to do
from bpf program, but that can be simplified?
Based on the use case from cover letter, sounds like you want
something like soft_disable?
Then extending event->state would make the most sense.
Also consider the case of re-entrant event enable/disable.
So inc/dec of a flag may be needed?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12 19:20   ` Alexei Starovoitov
@ 2015-10-13  2:30     ` xiakaixu
  2015-10-13  3:10       ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: xiakaixu @ 2015-10-13  2:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel, wangnan0, linux-kernel, pi3orama, hekuang, netdev

于 2015/10/13 3:20, Alexei Starovoitov 写道:
> On 10/12/15 2:02 AM, Kaixu Xia wrote:
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index f57d7fe..25e073d 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -39,6 +39,7 @@ struct bpf_map {
>>       u32 max_entries;
>>       const struct bpf_map_ops *ops;
>>       struct work_struct work;
>> +    atomic_t perf_sample_disable;
>>   };
>>
>>   struct bpf_map_type_list {
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 092a0e8..0606d1d 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -483,6 +483,8 @@ struct perf_event {
>>       perf_overflow_handler_t        overflow_handler;
>>       void                *overflow_handler_context;
>>
>> +    atomic_t            *sample_disable;
> 
> this looks fragile and unnecessary.
> Why add such field to generic bpf_map and carry its pointer into perf_event?
> Single extra field in perf_event would have been enough.
> Even better is to avoid adding any fields.
> There is already event->state why not to use that?
> The proper perf_event_enable/disable are so heavy that another
> mechanism needed? cpu_function_call is probably too much to do
> from bpf program, but that can be simplified?
> Based on the use case from cover letter, sounds like you want
> something like soft_disable?
> Then extending event->state would make the most sense.
> Also consider the case of re-entrant event enable/disable.
> So inc/dec of a flag may be needed?

Thanks for your comments!
I've tried perf_event_enable/disable, but there is a warning caused
by cpu_function_call. The main reason as follows,
 int smp_call_function_single(...)
 {
	...
	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
		     && !oops_in_progress);
	...
}
So I added the extra atomic flag filed in order to avoid this problem.
> 
> 
> .
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-13  2:30     ` xiakaixu
@ 2015-10-13  3:10       ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-13  3:10 UTC (permalink / raw)
  To: xiakaixu
  Cc: davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel, wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 10/12/15 7:30 PM, xiakaixu wrote:
>> The proper perf_event_enable/disable are so heavy that another
>> >mechanism needed? cpu_function_call is probably too much to do
>> >from bpf program, but that can be simplified?
>> >Based on the use case from cover letter, sounds like you want
>> >something like soft_disable?
>> >Then extending event->state would make the most sense.
>> >Also consider the case of re-entrant event enable/disable.
>> >So inc/dec of a flag may be needed?
> Thanks for your comments!
> I've tried perf_event_enable/disable, but there is a warning caused
> by cpu_function_call. The main reason as follows,
>   int smp_call_function_single(...)
>   {
> 	...
> 	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> 		     && !oops_in_progress);

of course, that's what I meant by 'cpu_function_call is too much
to do from bpf program'. In this case it's running out of kprobe
with disabled irq, so you hit the warning, but even if it was regular
tracepoint, doing ipi from bpf is too much. All bpf helpers must be
deterministic without such side effects.

> So I added the extra atomic flag filed in order to avoid this problem.

that's a hammer approach. There are other ways to do it, like:
- extend event->state with this soft_disable-like functionality
  (Also consider the case of re-entrant event enable/disable.
   inc/dec may be needed)
- or tap into event->attr.sample_period
   may be it can be temporarily set to zero to indicate soft_disabled.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples
  2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
                     ` (2 preceding siblings ...)
  2015-10-12 19:20   ` Alexei Starovoitov
@ 2015-10-13 12:00   ` Sergei Shtylyov
  3 siblings, 0 replies; 22+ messages in thread
From: Sergei Shtylyov @ 2015-10-13 12:00 UTC (permalink / raw)
  To: Kaixu Xia, ast, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel
  Cc: wangnan0, linux-kernel, pi3orama, hekuang, netdev

Hello.

On 10/12/2015 12:02 PM, Kaixu Xia wrote:

> In some scenarios we don't want to output trace data when sampling
> to reduce overhead. This patch adds the flag sample_disable to
> implement this function. By setting this flag and integrating with
> ebpf, we can control the data output process and get the samples we
> are most interested in.
>
> Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>

[...]
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 29ace10..4ae82c9 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
[...]
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b11756f..f6ef45c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>   		irq_work_queue(&event->pending);
>   	}
>
> +	if ((event->sample_disable) && atomic_read(event->sample_disable))

    Inner parens not needed at all.

[...]

MBR, Sergei


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-12  9:02 [RFC PATCH 0/2] bpf: enable/disable events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling Kaixu Xia
  2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
@ 2015-10-12  9:02 ` Kaixu Xia
  2015-10-12 19:29   ` Alexei Starovoitov
  1 sibling, 1 reply; 22+ messages in thread
From: Kaixu Xia @ 2015-10-12  9:02 UTC (permalink / raw)
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

The functions bpf_perf_event_sample_enable/disable() can set the
flag sample_disable to enable/disable output trace data on samples.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 include/linux/bpf.h      |  2 ++
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/verifier.c    |  4 +++-
 kernel/trace/bpf_trace.c | 34 ++++++++++++++++++++++++++++++++++
 4 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 25e073d..09148ff 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -192,6 +192,8 @@ extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
 
 extern const struct bpf_func_proto bpf_perf_event_read_proto;
+extern const struct bpf_func_proto bpf_perf_event_sample_enable_proto;
+extern const struct bpf_func_proto bpf_perf_event_sample_disable_proto;
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
 extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2..5229c550 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -272,6 +272,8 @@ enum bpf_func_id {
 	BPF_FUNC_skb_get_tunnel_key,
 	BPF_FUNC_skb_set_tunnel_key,
 	BPF_FUNC_perf_event_read,	/* u64 bpf_perf_event_read(&map, index) */
+	BPF_FUNC_perf_event_sample_enable,	/* u64 bpf_perf_event_enable(&map) */
+	BPF_FUNC_perf_event_sample_disable,	/* u64 bpf_perf_event_disable(&map) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b074b23..6428daf 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -244,6 +244,8 @@ static const struct {
 } func_limit[] = {
 	{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
 	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_sample_enable},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_sample_disable},
 };
 
 static void print_verifier_state(struct verifier_env *env)
@@ -860,7 +862,7 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		 * don't allow any other map type to be passed into
 		 * the special func;
 		 */
-		if (bool_map != bool_func)
+		if (bool_func && bool_map != bool_func)
 			return -EINVAL;
 	}
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0fe96c7..abe943a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,6 +215,36 @@ const struct bpf_func_proto bpf_perf_event_read_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+static u64 bpf_perf_event_sample_enable(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+
+	atomic_set(&map->perf_sample_disable, 0);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_perf_event_sample_enable_proto = {
+       .func           = bpf_perf_event_sample_enable,
+       .gpl_only       = false,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_CONST_MAP_PTR,
+};
+
+static u64 bpf_perf_event_sample_disable(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+       struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+
+       atomic_set(&map->perf_sample_disable, 1);
+       return 0;
+}
+
+const struct bpf_func_proto bpf_perf_event_sample_disable_proto = {
+       .func           = bpf_perf_event_sample_disable,
+       .gpl_only       = false,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_CONST_MAP_PTR,
+};
+
 static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -242,6 +272,10 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_perf_event_read:
 		return &bpf_perf_event_read_proto;
+	case BPF_FUNC_perf_event_sample_enable:
+		return &bpf_perf_event_sample_enable_proto;
+	case BPF_FUNC_perf_event_sample_disable:
+		return &bpf_perf_event_sample_disable_proto;
 	default:
 		return NULL;
 	}
-- 
1.8.3.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-12  9:02 ` [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers Kaixu Xia
@ 2015-10-12 19:29   ` Alexei Starovoitov
  2015-10-13  3:27     ` Wangnan (F)
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-12 19:29 UTC (permalink / raw)
  To: Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 10/12/15 2:02 AM, Kaixu Xia wrote:
> +extern const struct bpf_func_proto bpf_perf_event_sample_enable_proto;
> +extern const struct bpf_func_proto bpf_perf_event_sample_disable_proto;

externs are unnecessary. Just make them static.
Also I prefer single helper that takes a flag, so we can extend it
instead of adding func_id for every little operation.

To avoid conflicts if you touch kernel/bpf/* or bpf.h please always
base your patches of net-next.

 > +	atomic_set(&map->perf_sample_disable, 0);

global flag per map is no go.
events are independent and should be treated as such.

Please squash these two patches, since they're part of one logical
feature. Splitting them like this only makes review harder.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-12 19:29   ` Alexei Starovoitov
@ 2015-10-13  3:27     ` Wangnan (F)
  2015-10-13  3:39       ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Wangnan (F) @ 2015-10-13  3:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Kaixu Xia, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev



On 2015/10/13 3:29, Alexei Starovoitov wrote:
> On 10/12/15 2:02 AM, Kaixu Xia wrote:
>> +extern const struct bpf_func_proto bpf_perf_event_sample_enable_proto;
>> +extern const struct bpf_func_proto bpf_perf_event_sample_disable_proto;
>
> externs are unnecessary. Just make them static.
> Also I prefer single helper that takes a flag, so we can extend it
> instead of adding func_id for every little operation.
>
> To avoid conflicts if you touch kernel/bpf/* or bpf.h please always
> base your patches of net-next.
>
> > +    atomic_set(&map->perf_sample_disable, 0);
>
> global flag per map is no go.
> events are independent and should be treated as such.
>

Then how to avoid racing? For example, when one core disabling all events
in a map, another core is enabling all of them. This racing may causes 
sereval
perf events in a map dump samples while other events not. To avoid such 
racing
I think some locking must be introduced, then cost is even higher.

The reason why we introduce an atomic pointer is because each operation 
should
controls a set of events, not one event, due to the per-cpu manner of 
perf events.

Thank you.

> Please squash these two patches, since they're part of one logical
> feature. Splitting them like this only makes review harder.
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  3:27     ` Wangnan (F)
@ 2015-10-13  3:39       ` Alexei Starovoitov
  2015-10-13  3:51         ` Wangnan (F)
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-13  3:39 UTC (permalink / raw)
  To: Wangnan (F),
	Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev

On 10/12/15 8:27 PM, Wangnan (F) wrote:
> Then how to avoid racing? For example, when one core disabling all events
> in a map, another core is enabling all of them. This racing may causes
> sereval
> perf events in a map dump samples while other events not. To avoid such
> racing
> I think some locking must be introduced, then cost is even higher.
>
> The reason why we introduce an atomic pointer is because each operation
> should
> controls a set of events, not one event, due to the per-cpu manner of
> perf events.

why 'set disable' is needed ?
the example given in cover letter shows the use case where you want
to receive samples only within sys_write() syscall.
The example makes sense, but sys_write() is running on this cpu, so just
disabling it on the current one is enough.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  3:39       ` Alexei Starovoitov
@ 2015-10-13  3:51         ` Wangnan (F)
  2015-10-13  4:16           ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Wangnan (F) @ 2015-10-13  3:51 UTC (permalink / raw)
  To: Alexei Starovoitov, Kaixu Xia, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev



On 2015/10/13 11:39, Alexei Starovoitov wrote:
> On 10/12/15 8:27 PM, Wangnan (F) wrote:
>> Then how to avoid racing? For example, when one core disabling all 
>> events
>> in a map, another core is enabling all of them. This racing may causes
>> sereval
>> perf events in a map dump samples while other events not. To avoid such
>> racing
>> I think some locking must be introduced, then cost is even higher.
>>
>> The reason why we introduce an atomic pointer is because each operation
>> should
>> controls a set of events, not one event, due to the per-cpu manner of
>> perf events.
>
> why 'set disable' is needed ?
> the example given in cover letter shows the use case where you want
> to receive samples only within sys_write() syscall.
> The example makes sense, but sys_write() is running on this cpu, so just
> disabling it on the current one is enough.
>

Our real use case is control of the system-wide sampling. For example,
we need sampling all CPUs when smartphone start refershing its display.
We need all CPUs because in Android system there are plenty of threads
get involed into this behavior. We can't achieve this by controling
sampling on only one CPU. This is the reason we need 'set enable'
and 'set disable'.

Thank you.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  3:51         ` Wangnan (F)
@ 2015-10-13  4:16           ` Alexei Starovoitov
  2015-10-13  4:34             ` Wangnan (F)
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-13  4:16 UTC (permalink / raw)
  To: Wangnan (F),
	Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev

On 10/12/15 8:51 PM, Wangnan (F) wrote:
>> why 'set disable' is needed ?
>> the example given in cover letter shows the use case where you want
>> to receive samples only within sys_write() syscall.
>> The example makes sense, but sys_write() is running on this cpu, so just
>> disabling it on the current one is enough.
>>
>
> Our real use case is control of the system-wide sampling. For example,
> we need sampling all CPUs when smartphone start refershing its display.
> We need all CPUs because in Android system there are plenty of threads
> get involed into this behavior. We can't achieve this by controling
> sampling on only one CPU. This is the reason we need 'set enable'
> and 'set disable'.

ok, but that use case may have different enable/disable pattern.
In sys_write example ultra-fast enable/disable is must have, since
the whole syscall is fast and overhead should be minimal.
but for display refresh? we're talking milliseconds, no?
Can you just ioctl() it from user space?
If cost of enable/disable is high or the time range between toggling is
long, then doing it from the bpf program doesn't make sense. Instead
the program can do bpf_perf_event_output() to send a notification to
user space that condition is met and the user space can ioctl() events.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  4:16           ` Alexei Starovoitov
@ 2015-10-13  4:34             ` Wangnan (F)
  2015-10-13  5:15               ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Wangnan (F) @ 2015-10-13  4:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Kaixu Xia, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev

On 2015/10/13 12:16, Alexei Starovoitov wrote:
> On 10/12/15 8:51 PM, Wangnan (F) wrote:
>>> why 'set disable' is needed ?
>>> the example given in cover letter shows the use case where you want
>>> to receive samples only within sys_write() syscall.
>>> The example makes sense, but sys_write() is running on this cpu, so 
>>> just
>>> disabling it on the current one is enough.
>>>
>>
>> Our real use case is control of the system-wide sampling. For example,
>> we need sampling all CPUs when smartphone start refershing its display.
>> We need all CPUs because in Android system there are plenty of threads
>> get involed into this behavior. We can't achieve this by controling
>> sampling on only one CPU. This is the reason we need 'set enable'
>> and 'set disable'.
>
> ok, but that use case may have different enable/disable pattern.
> In sys_write example ultra-fast enable/disable is must have, since
> the whole syscall is fast and overhead should be minimal.
> but for display refresh? we're talking milliseconds, no?
> Can you just ioctl() it from user space?
> If cost of enable/disable is high or the time range between toggling is
> long, then doing it from the bpf program doesn't make sense. Instead
> the program can do bpf_perf_event_output() to send a notification to
> user space that condition is met and the user space can ioctl() events.
>

OK. I think I understand your design principle that, everything inside BPF
should be as fast as possible.

Make userspace control events using ioctl make things harder. You know that
'perf record' itself doesn't care too much about events it reveived. It only
copies data to perf.data, but what we want is to use perf record simply like
this:

  # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100

And in control.o we create uprobe point to mark the start and finish of 
a frame:

  SEC("target=/a/b/c.o\nstartFrame=0x123456")
  int startFrame(void *) {
    bpf_pmu_enable(pmu);
    return 1;
  }

  SEC("target=/a/b/c.o\nfinishFrame=0x234568")
  int finishFrame(void *) {
    bpf_pmu_disable(pmu);
    return 1;
  }

I think it is make sence also.

I still think perf is not necessary be independent each other. You know 
we have
PERF_EVENT_IOC_SET_OUTPUT which can set multiple events output through one
ringbuffer. This way perf events are connected.

I think the 'set disable/enable' design in this patchset satisify the 
design goal
that in BPF program we only do simple and fast things. The only 
inconvience is
we add something into map, which is ugly. What about using similar 
implementation
like PERF_EVENT_IOC_SET_OUTPUT, creating a new ioctl like 
PERF_EVENT_IOC_SET_ENABLER,
then let perf to select an event as 'enabler', then BPF can still 
control one atomic
variable to enable/disable a set of events.

Thank you.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  4:34             ` Wangnan (F)
@ 2015-10-13  5:15               ` Alexei Starovoitov
  2015-10-13  6:57                 ` Wangnan (F)
  2015-10-13 10:54                 ` He Kuang
  0 siblings, 2 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-13  5:15 UTC (permalink / raw)
  To: Wangnan (F),
	Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev

On 10/12/15 9:34 PM, Wangnan (F) wrote:
>
>
> On 2015/10/13 12:16, Alexei Starovoitov wrote:
>> On 10/12/15 8:51 PM, Wangnan (F) wrote:
>>>> why 'set disable' is needed ?
>>>> the example given in cover letter shows the use case where you want
>>>> to receive samples only within sys_write() syscall.
>>>> The example makes sense, but sys_write() is running on this cpu, so
>>>> just
>>>> disabling it on the current one is enough.
>>>>
>>>
>>> Our real use case is control of the system-wide sampling. For example,
>>> we need sampling all CPUs when smartphone start refershing its display.
>>> We need all CPUs because in Android system there are plenty of threads
>>> get involed into this behavior. We can't achieve this by controling
>>> sampling on only one CPU. This is the reason we need 'set enable'
>>> and 'set disable'.
>>
>> ok, but that use case may have different enable/disable pattern.
>> In sys_write example ultra-fast enable/disable is must have, since
>> the whole syscall is fast and overhead should be minimal.
>> but for display refresh? we're talking milliseconds, no?
>> Can you just ioctl() it from user space?
>> If cost of enable/disable is high or the time range between toggling is
>> long, then doing it from the bpf program doesn't make sense. Instead
>> the program can do bpf_perf_event_output() to send a notification to
>> user space that condition is met and the user space can ioctl() events.
>>
>
> OK. I think I understand your design principle that, everything inside BPF
> should be as fast as possible.
>
> Make userspace control events using ioctl make things harder. You know that
> 'perf record' itself doesn't care too much about events it reveived. It
> only
> copies data to perf.data, but what we want is to use perf record simply
> like
> this:
>
>   # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100
>
> And in control.o we create uprobe point to mark the start and finish of
> a frame:
>
>   SEC("target=/a/b/c.o\nstartFrame=0x123456")
>   int startFrame(void *) {
>     bpf_pmu_enable(pmu);
>     return 1;
>   }
>
>   SEC("target=/a/b/c.o\nfinishFrame=0x234568")
>   int finishFrame(void *) {
>     bpf_pmu_disable(pmu);
>     return 1;
>   }
>
> I think it is make sence also.

yes. that looks quite useful,
but did you consider re-entrant startFrame() ?
start << here sampling starts
   start
   finish << here all samples disabled?!
finish
and startFrame()/finishFrame() running on all cpus of that user app ?
One cpu entering into startFrame() while another cpu doing finishFrame
what behavior should be? sampling is still enabled on all cpus? or off?
Either case doesn't seem to work with simple enable/disable.
Few emails in this thread back, I mentioned inc/dec of a flag
to solve that.

> What about using similar
> implementation
> like PERF_EVENT_IOC_SET_OUTPUT, creating a new ioctl like
> PERF_EVENT_IOC_SET_ENABLER,
> then let perf to select an event as 'enabler', then BPF can still
> control one atomic
> variable to enable/disable a set of events.

you lost me on that last sentence. How this 'enabler' will work?
Also I'm still missing what's wrong with perf doing ioctl() on
events on all cpus manually when bpf program tells it to do so.
Is it speed you concerned about or extra work in perf ?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  5:15               ` Alexei Starovoitov
@ 2015-10-13  6:57                 ` Wangnan (F)
  2015-10-13 10:54                 ` He Kuang
  1 sibling, 0 replies; 22+ messages in thread
From: Wangnan (F) @ 2015-10-13  6:57 UTC (permalink / raw)
  To: Alexei Starovoitov, Kaixu Xia, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel
  Cc: linux-kernel, pi3orama, hekuang, netdev



On 2015/10/13 13:15, Alexei Starovoitov wrote:
> On 10/12/15 9:34 PM, Wangnan (F) wrote:
>>
>>
>> On 2015/10/13 12:16, Alexei Starovoitov wrote:
>>> On 10/12/15 8:51 PM, Wangnan (F) wrote:
>>>>> why 'set disable' is needed ?
>>>>> the example given in cover letter shows the use case where you want
>>>>> to receive samples only within sys_write() syscall.
>>>>> The example makes sense, but sys_write() is running on this cpu, so
>>>>> just
>>>>> disabling it on the current one is enough.
>>>>>
>>>>
>>>> Our real use case is control of the system-wide sampling. For example,
>>>> we need sampling all CPUs when smartphone start refershing its 
>>>> display.
>>>> We need all CPUs because in Android system there are plenty of threads
>>>> get involed into this behavior. We can't achieve this by controling
>>>> sampling on only one CPU. This is the reason we need 'set enable'
>>>> and 'set disable'.
>>>
>>> ok, but that use case may have different enable/disable pattern.
>>> In sys_write example ultra-fast enable/disable is must have, since
>>> the whole syscall is fast and overhead should be minimal.
>>> but for display refresh? we're talking milliseconds, no?
>>> Can you just ioctl() it from user space?
>>> If cost of enable/disable is high or the time range between toggling is
>>> long, then doing it from the bpf program doesn't make sense. Instead
>>> the program can do bpf_perf_event_output() to send a notification to
>>> user space that condition is met and the user space can ioctl() events.
>>>
>>
>> OK. I think I understand your design principle that, everything 
>> inside BPF
>> should be as fast as possible.
>>
>> Make userspace control events using ioctl make things harder. You 
>> know that
>> 'perf record' itself doesn't care too much about events it reveived. It
>> only
>> copies data to perf.data, but what we want is to use perf record simply
>> like
>> this:
>>
>>   # perf record -e evt=cycles -e control.o/pmu=evt/ -a sleep 100
>>
>> And in control.o we create uprobe point to mark the start and finish of
>> a frame:
>>
>>   SEC("target=/a/b/c.o\nstartFrame=0x123456")
>>   int startFrame(void *) {
>>     bpf_pmu_enable(pmu);
>>     return 1;
>>   }
>>
>>   SEC("target=/a/b/c.o\nfinishFrame=0x234568")
>>   int finishFrame(void *) {
>>     bpf_pmu_disable(pmu);
>>     return 1;
>>   }
>>
>> I think it is make sence also.
>
> yes. that looks quite useful,
> but did you consider re-entrant startFrame() ?
> start << here sampling starts
>   start
>   finish << here all samples disabled?!
> finish
> and startFrame()/finishFrame() running on all cpus of that user app ?
> One cpu entering into startFrame() while another cpu doing finishFrame
> what behavior should be? sampling is still enabled on all cpus? or off?
> Either case doesn't seem to work with simple enable/disable.
> Few emails in this thread back, I mentioned inc/dec of a flag
> to solve that.

Correct.

>
>> What about using similar
>> implementation
>> like PERF_EVENT_IOC_SET_OUTPUT, creating a new ioctl like
>> PERF_EVENT_IOC_SET_ENABLER,
>> then let perf to select an event as 'enabler', then BPF can still
>> control one atomic
>> variable to enable/disable a set of events.
>
> you lost me on that last sentence. How this 'enabler' will work?

Like what we did in this patchset: add an atomic flag to perf_event,
make all perf_event connected to this enabler by PERF_EVENT_IOC_SET_ENABLER.
During running, check the enabler's atomic flag. So we use one atomic
variable to control a set of perf_event. Finally create a BPF helper
function to control that atomic variable.

> Also I'm still missing what's wrong with perf doing ioctl() on
> events on all cpus manually when bpf program tells it to do so.
> Is it speed you concerned about or extra work in perf ?
>

I think both speed and extra work need be concerned.

Say we use perf to enable/disable sampling. Use the above example to
describe, when smartphone starting refresing display, we write something
into ringbuffer, then display refreshing start. We have to wait for
perf be scheduled in, parse event it get (perf record doesn't do this
currently), discover the trigger event then enable sampling perf events
on all cpus. We make trigger and action asynchronous. I'm not sure how
many ns or ms it need, and I believe asynchronization itself introduces
complexity, which I think need to be avoided except we can explain the
advantages asynchronization can bring.

But yes, perf based implementation can shut down the PMU completely, which
is better than current light-weight implementation.

In summary:

  - In next version we will use counter based flag instead of current
    0/1 switcher in considering of reentering problem.

  - I think we both agree we need a light weight solution in which we can
    enable/disable sampling in function level. This light-weight solution
    can be applied to only one perf-event.

  - Our disagreement is whether to introduce a heavy-weight solution based
    on perf to enable/disable a group of perf event. For me, perf-based
    solution can shut down PMU completly, which is good. However, it
    introduces asynchronization and extra work on perf. I think we can
    do it in a much simpler, fully BPF way. Enabler solution I mentioned
    above is a candidate.

Thank you.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13  5:15               ` Alexei Starovoitov
  2015-10-13  6:57                 ` Wangnan (F)
@ 2015-10-13 10:54                 ` He Kuang
  2015-10-13 11:07                   ` Wangnan (F)
  2015-10-14  5:14                   ` Alexei Starovoitov
  1 sibling, 2 replies; 22+ messages in thread
From: He Kuang @ 2015-10-13 10:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Wangnan (F),
	Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: linux-kernel, pi3orama, netdev

hi, Alexei

>> What about using similar
>> implementation
>> like PERF_EVENT_IOC_SET_OUTPUT, creating a new ioctl like
>> PERF_EVENT_IOC_SET_ENABLER,
>> then let perf to select an event as 'enabler', then BPF can still
>> control one atomic
>> variable to enable/disable a set of events.
>
> you lost me on that last sentence. How this 'enabler' will work?
> Also I'm still missing what's wrong with perf doing ioctl() on
> events on all cpus manually when bpf program tells it to do so.
> Is it speed you concerned about or extra work in perf ?
>
>

For not having too much wakeups, perf ringbuffer has a watermark
limit to cache events and reduce the wakeups, which causes perf
userspace tool can not receive perf events immediately.

Here's a simple demo expamle to prove it, 'sleep_exec' does some
writes and prints a timestamp every second, and an lable is
printed when perf poll gets events.

   $ perf record -m 2 -e syscalls:sys_enter_write sleep_exec 1000
   userspace sleep time: 0 seconds
   userspace sleep time: 1 seconds
   userspace sleep time: 2 seconds
   userspace sleep time: 3 seconds
   perf record wakeup onetime 0
   userspace sleep time: 4 seconds
   userspace sleep time: 5 seconds
   userspace sleep time: 6 seconds
   userspace sleep time: 7 seconds
   perf record wakeup onetime 1
   userspace sleep time: 8 seconds
   perf record wakeup onetime 2
   ..

   $ perf record -m 1 -e syscalls:sys_enter_write sleep_exec 1000
   userspace sleep time: 0 seconds
   userspace sleep time: 1 seconds
   perf record wakeup onetime 0
   userspace sleep time: 2 seconds
   userspace sleep time: 3 seconds
   perf record wakeup onetime 1
   userspace sleep time: 4 seconds
   userspace sleep time: 5 seconds
   ..

By default, if no mmap_pages is specified, perf tools wakeup only
when the target executalbe finished:

   $ perf record -e syscalls:sys_enter_write sleep_exec 5
   userspace sleep time: 0 seconds
   userspace sleep time: 1 seconds
   userspace sleep time: 2 seconds
   userspace sleep time: 3 seconds
   userspace sleep time: 4 seconds
   perf record wakeup onetime 0
   [ perf record: Woken up 1 times to write data ]
   [ perf record: Captured and wrote 0.006 MB perf.data (54 samples) ]

If we want perf to reflect as soon as our sample event be generated,
--no-buffering should be used, but this option has a greater
impact on performance.

   $ perf record --no-buffering -e syscalls:sys_enter_write sleep_exec 1000
   userspace sleep time: 0 seconds
   perf record wakeup onetime 0
   perf record wakeup onetime 1
   perf record wakeup onetime 2
   perf record wakeup onetime 3
   perf record wakeup onetime 4
   perf record wakeup onetime 5
   perf record wakeup onetime 6
   ..

Thank you

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13 10:54                 ` He Kuang
@ 2015-10-13 11:07                   ` Wangnan (F)
  2015-10-14  5:14                   ` Alexei Starovoitov
  1 sibling, 0 replies; 22+ messages in thread
From: Wangnan (F) @ 2015-10-13 11:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: He Kuang, Kaixu Xia, davem, acme, mingo, a.p.zijlstra,
	masami.hiramatsu.pt, jolsa, daniel, linux-kernel, pi3orama,
	netdev



On 2015/10/13 18:54, He Kuang wrote:
> hi, Alexei
>
>>> What about using similar
>>> implementation
>>> like PERF_EVENT_IOC_SET_OUTPUT, creating a new ioctl like
>>> PERF_EVENT_IOC_SET_ENABLER,
>>> then let perf to select an event as 'enabler', then BPF can still
>>> control one atomic
>>> variable to enable/disable a set of events.
>>
>> you lost me on that last sentence. How this 'enabler' will work?
>> Also I'm still missing what's wrong with perf doing ioctl() on
>> events on all cpus manually when bpf program tells it to do so.
>> Is it speed you concerned about or extra work in perf ?
>>
>>
>
> For not having too much wakeups, perf ringbuffer has a watermark
> limit to cache events and reduce the wakeups, which causes perf
> userspace tool can not receive perf events immediately.
>
> Here's a simple demo expamle to prove it, 'sleep_exec' does some
> writes and prints a timestamp every second, and an lable is
> printed when perf poll gets events.
>
>   $ perf record -m 2 -e syscalls:sys_enter_write sleep_exec 1000
>   userspace sleep time: 0 seconds
>   userspace sleep time: 1 seconds
>   userspace sleep time: 2 seconds
>   userspace sleep time: 3 seconds
>   perf record wakeup onetime 0
>   userspace sleep time: 4 seconds
>   userspace sleep time: 5 seconds
>   userspace sleep time: 6 seconds
>   userspace sleep time: 7 seconds
>   perf record wakeup onetime 1
>   userspace sleep time: 8 seconds
>   perf record wakeup onetime 2
>   ..
>
>   $ perf record -m 1 -e syscalls:sys_enter_write sleep_exec 1000
>   userspace sleep time: 0 seconds
>   userspace sleep time: 1 seconds
>   perf record wakeup onetime 0
>   userspace sleep time: 2 seconds
>   userspace sleep time: 3 seconds
>   perf record wakeup onetime 1
>   userspace sleep time: 4 seconds
>   userspace sleep time: 5 seconds
>   ..
>
> By default, if no mmap_pages is specified, perf tools wakeup only
> when the target executalbe finished:
>
>   $ perf record -e syscalls:sys_enter_write sleep_exec 5
>   userspace sleep time: 0 seconds
>   userspace sleep time: 1 seconds
>   userspace sleep time: 2 seconds
>   userspace sleep time: 3 seconds
>   userspace sleep time: 4 seconds
>   perf record wakeup onetime 0
>   [ perf record: Woken up 1 times to write data ]
>   [ perf record: Captured and wrote 0.006 MB perf.data (54 samples) ]
>
> If we want perf to reflect as soon as our sample event be generated,
> --no-buffering should be used, but this option has a greater
> impact on performance.
>
>   $ perf record --no-buffering -e syscalls:sys_enter_write sleep_exec 
> 1000
>   userspace sleep time: 0 seconds
>   perf record wakeup onetime 0
>   perf record wakeup onetime 1
>   perf record wakeup onetime 2
>   perf record wakeup onetime 3
>   perf record wakeup onetime 4
>   perf record wakeup onetime 5
>   perf record wakeup onetime 6
>   ..
>

Hi Alexei,

Based on He Kuang's test result, if we choose to use perf to control 
perf event
and output trigger event through bof_output_data, with default setting we
have to wait for sereval seconds until perf can get first trigger event 
if the
trigger event's frequency is low. In my display refreshing example, it 
causes
losting of event triggering. From user's view, random frames would miss.

With --no-buffering, things can become faster, but --no-buffering causes 
perf
to be scheduled in faster than normal, which is conflict to the goal of 
event
disabling that we want to reduce recording overhead as much as possible.

Thank you.

> Thank you
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers
  2015-10-13 10:54                 ` He Kuang
  2015-10-13 11:07                   ` Wangnan (F)
@ 2015-10-14  5:14                   ` Alexei Starovoitov
  1 sibling, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2015-10-14  5:14 UTC (permalink / raw)
  To: He Kuang, Wangnan (F),
	Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: linux-kernel, pi3orama, netdev

On 10/13/15 3:54 AM, He Kuang wrote:
> If we want perf to reflect as soon as our sample event be generated,
> --no-buffering should be used, but this option has a greater
> impact on performance.

no_buffering doesn't have to be applied to all events obviously.


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-10-14  5:14 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-12  9:02 [RFC PATCH 0/2] bpf: enable/disable events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling Kaixu Xia
2015-10-12  9:02 ` [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples Kaixu Xia
2015-10-12 12:02   ` Peter Zijlstra
2015-10-12 12:05     ` Wangnan (F)
2015-10-12 12:12       ` Peter Zijlstra
2015-10-12 14:14   ` kbuild test robot
2015-10-12 19:20   ` Alexei Starovoitov
2015-10-13  2:30     ` xiakaixu
2015-10-13  3:10       ` Alexei Starovoitov
2015-10-13 12:00   ` Sergei Shtylyov
2015-10-12  9:02 ` [RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers Kaixu Xia
2015-10-12 19:29   ` Alexei Starovoitov
2015-10-13  3:27     ` Wangnan (F)
2015-10-13  3:39       ` Alexei Starovoitov
2015-10-13  3:51         ` Wangnan (F)
2015-10-13  4:16           ` Alexei Starovoitov
2015-10-13  4:34             ` Wangnan (F)
2015-10-13  5:15               ` Alexei Starovoitov
2015-10-13  6:57                 ` Wangnan (F)
2015-10-13 10:54                 ` He Kuang
2015-10-13 11:07                   ` Wangnan (F)
2015-10-14  5:14                   ` Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.