netdev.vger.kernel.org archive mirror
From: Daniel Borkmann <daniel@iogearbox.net>
To: Roman Gushchin <guro@fb.com>, Alexei Starovoitov <ast@kernel.org>
Cc: "davem@davemloft.net" <davem@davemloft.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 1/4] bpf: enable program stats
Date: Sat, 23 Feb 2019 02:06:56 +0100	[thread overview]
Message-ID: <8de20612-c32a-9370-254d-9ebf1df78efa@iogearbox.net> (raw)
In-Reply-To: <20190223003434.GA17559@tower.DHCP.thefacebook.com>

On 02/23/2019 01:34 AM, Roman Gushchin wrote:
> On Fri, Feb 22, 2019 at 03:36:41PM -0800, Alexei Starovoitov wrote:
>> JITed BPF programs are indistinguishable from kernel functions, but
>> unlike kernel code, BPF code can change often.
>> The typical approach of "perf record" + "perf report" profiling and
>> tuning of kernel code works just as well for BPF programs, but kernel
>> code doesn't need to be monitored whereas BPF programs do.
>> Users load and run large numbers of BPF programs.
>> These BPF stats allow tools to monitor the usage of BPF on the server.
>> The monitoring tools will turn sysctl kernel.bpf_stats_enabled on and
>> off for a few seconds to sample the average cost of the programs.
>> Aggregated data over hours and days will provide an insight into the
>> cost of BPF, and alarms can trigger in case a given program suddenly
>> gets more expensive.
>>
>> The cost of two sched_clock() calls per program invocation adds ~20 nsec.
>> Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
>> from ~40 nsec to ~60 nsec.
>> A static_key minimizes the cost of the stats collection.
>> There is no measurable difference before/after this patch
>> with kernel.bpf_stats_enabled=0.
>>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
>>  include/linux/bpf.h    |  7 +++++++
>>  include/linux/filter.h | 16 +++++++++++++++-
>>  kernel/bpf/core.c      | 13 +++++++++++++
>>  kernel/bpf/syscall.c   | 24 ++++++++++++++++++++++--
>>  kernel/bpf/verifier.c  |  5 +++++
>>  kernel/sysctl.c        | 34 ++++++++++++++++++++++++++++++++++
>>  6 files changed, 96 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index de18227b3d95..14260674bc57 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -340,6 +340,11 @@ enum bpf_cgroup_storage_type {
>>  
>>  #define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX
>>  
>> +struct bpf_prog_stats {
>> +	u64 cnt;
>> +	u64 nsecs;
>> +};
>> +
>>  struct bpf_prog_aux {
>>  	atomic_t refcnt;
>>  	u32 used_map_cnt;
>> @@ -389,6 +394,7 @@ struct bpf_prog_aux {
>>  	 * main prog always has linfo_idx == 0
>>  	 */
>>  	u32 linfo_idx;
>> +	struct bpf_prog_stats __percpu *stats;
>>  	union {
>>  		struct work_struct work;
>>  		struct rcu_head	rcu;
>> @@ -559,6 +565,7 @@ void bpf_map_area_free(void *base);
>>  void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
>>  
>>  extern int sysctl_unprivileged_bpf_disabled;
>> +extern int sysctl_bpf_stats_enabled;
>>  
>>  int bpf_map_new_fd(struct bpf_map *map, int flags);
>>  int bpf_prog_new_fd(struct bpf_prog *prog);
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index f32b3eca5a04..7b14235b6f7d 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -533,7 +533,21 @@ struct sk_filter {
>>  	struct bpf_prog	*prog;
>>  };
>>  
>> -#define BPF_PROG_RUN(filter, ctx)  ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); })
>> +DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
>> +
>> +#define BPF_PROG_RUN(prog, ctx)	({				\
>> +	u32 ret;						\
>> +	cant_sleep();						\
>> +	if (static_branch_unlikely(&bpf_stats_enabled_key)) {	\
>> +		u64 start = sched_clock();			\
>> +		ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);	\
>> +		this_cpu_inc(prog->aux->stats->cnt);		\
>> +		this_cpu_add(prog->aux->stats->nsecs,		\
>> +			     sched_clock() - start);		\
>> +	} else {						\
>> +		ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);	\
>> +	}							\
>> +	ret; })
> 
> Can we bump "cnt" unconditionally and gate only the "nsecs" calculation?
> 
> Why might it be useful?
> If the performance cost is too high to keep the statistics gathering
> always enabled, it will be possible to measure the average cost
> periodically and extrapolate.

In general, having some stats and timing info would be useful, but I
guess people might want to customize it even further in future, beyond
number of runs + time it takes. One thing that would be super useful
is to have some notion of __attribute__((constructor)) and
__attribute__((destructor)) support in BPF, which would then get
inlined into the prologue / epilogue of the program. E.g. such
infrastructure would allow one to mark an skb and measure the time it
takes through the BPF prog till it hits an exit point somewhere
(without having to explicitly code this into the program everywhere).
Other examples might be histograms or p99 latencies. Feels like for
monitoring, more ways to program this would be nice, moving it into
the BPF insn sequence itself (e.g. enforced globally, or by choice of
prog as another option)? Thoughts?


Thread overview: 16+ messages
2019-02-22 23:36 [PATCH bpf-next 0/4] bpf: per program stats Alexei Starovoitov
2019-02-22 23:36 ` [PATCH bpf-next 1/4] bpf: enable " Alexei Starovoitov
2019-02-23  0:05   ` Eric Dumazet
2019-02-23  2:29     ` Alexei Starovoitov
2019-02-23  0:34   ` Roman Gushchin
2019-02-23  1:06     ` Daniel Borkmann [this message]
2019-02-23  2:38       ` Alexei Starovoitov
2019-02-25 10:36         ` Daniel Borkmann
2019-02-26  4:27           ` Alexei Starovoitov
2019-02-26 15:29             ` Daniel Borkmann
2019-02-27  4:05               ` Alexei Starovoitov
2019-02-23  2:31     ` Alexei Starovoitov
2019-02-22 23:36 ` [PATCH bpf-next 2/4] bpf: expose program stats via bpf_prog_info Alexei Starovoitov
2019-02-22 23:36 ` [PATCH bpf-next 3/4] tools/bpf: sync bpf.h into tools Alexei Starovoitov
2019-02-22 23:36 ` [PATCH bpf-next 4/4] tools/bpftool: recognize bpf_prog_info runtime and runcnt Alexei Starovoitov
2019-02-23  4:17   ` Andrii Nakryiko
