Re: [PATCH v2 bpf-next] bpf: sharing bpf runtime stats with /dev/bpf_stats

From: Song Liu <songliubraving@fb.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>,
	"ast@kernel.org" <ast@kernel.org>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"yzaikin@google.com" <yzaikin@google.com>
Subject: Re: [PATCH v2 bpf-next] bpf: sharing bpf runtime stats with /dev/bpf_stats
Date: Wed, 18 Mar 2020 21:20:11 +0000	[thread overview]
Message-ID: <3E03D914-36FA-4956-AF14-CAFD784D013A@fb.com> (raw)
In-Reply-To: <ba62e0be-6de6-036c-a836-178c1a9c079a@iogearbox.net>



> On Mar 18, 2020, at 1:58 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
> On 3/18/20 7:33 AM, Song Liu wrote:
>>> On Mar 17, 2020, at 4:08 PM, Song Liu <songliubraving@fb.com> wrote:
>>>> On Mar 17, 2020, at 2:47 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>> 
>>>>>> Hm, true as well. Wouldn't long-term extending "bpftool prog profile" fentry/fexit
>>>>>> programs supersede this old bpf_stats infrastructure? Iow, can't we implement the
>>>>>> same (or even more elaborate stats aggregation) in BPF via fentry/fexit and then
>>>>>> potentially deprecate bpf_stats counters?
>>>>> I think run_time_ns has its own value as a simple monitoring framework. We can
>>>>> use it in tools like top (and variations). It will be easier for these tools to
>>>>> adopt run_time_ns than using fentry/fexit.
>>>> 
>>>> Agree that this is easier; I presume there is no such official integration today
>>>> in tools like top, right, or is there anything planned?
>>> 
>>> Yes, we do want more supports in different tools to increase the visibility.
>>> Here is the effort for atop: https://github.com/Atoptool/atop/pull/88 .
>>> 
>>> I wasn't pushing push hard on this one mostly because the sysctl interface requires
>>> a user space "owner".
>>> 
>>>>> On the other hand, in long term, we may include a few fentry/fexit based programs
>>>>> in the kernel binary (or the rpm), so that these tools can use them easily. At
>>>>> that time, we can fully deprecate run_time_ns. Maybe this is not too far away?
>>>> 
>>>> Did you check how feasible it is to have something like `bpftool prog profile top`
>>>> which then enables fentry/fexit for /all/ existing BPF programs in the system? It
>>>> could then sort the sample interval by run_cnt, cycles, cache misses, aggregated
>>>> runtime, etc in a top-like output. Wdyt?
>>> 
>>> I wonder whether we can achieve this with one bpf prog (or a trampoline) that covers
>>> all BPF programs, like a trampoline inside __BPF_PROG_RUN()?
>>> 
>>> For long term direction, I think we could compare two different approaches: add new
>>> tools (like bpftool prog profile top) vs. add BPF support to existing tools. The
>>> first approach is easier. The latter approach would show BPF information to users
>>> who are not expecting BPF programs in the systems. For many sysadmins, seeing BPF
>>> programs in top/ps, and controlling them via kill is more natural than learning
>>> bpftool. What's your thought on this?
>> More thoughts on this.
>> If we have a special trampoline that attach to all BPF programs at once, we really
>> don't need the run_time_ns stats anymore. Eventually, tools that monitor BPF
>> programs will depend on libbpf, so using fentry/fexit to monitor BPF programs doesn't
>> introduce extra dependency. I guess we also need a way to include BPF program in
>> libbpf.
>> To summarize this plan, we need:
>> 1) A global trampoline that attaches to all BPF programs at once;
> 
> Overall sounds good, I think the `at once` part might be tricky, at least it would
> need to patch one prog after another, each prog also needs to store its own metrics
> somewhere for later collection. The start-to-sample could be a shared global var (aka
> shared map between all the programs) which would flip the switch though.

I was thinking about adding bpf_global_trampoline and use it in __BPF_PROG_RUN. 
Something like:

diff --git i/include/linux/filter.h w/include/linux/filter.h
index 9b5aa5c483cc..ac9497d1fa7b 100644
--- i/include/linux/filter.h
+++ w/include/linux/filter.h
@@ -559,9 +559,14 @@ struct sk_filter {

 DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key);

+extern struct bpf_trampoline *bpf_global_trampoline;
+DECLARE_STATIC_KEY_FALSE(bpf_global_tr_active);
+
 #define __BPF_PROG_RUN(prog, ctx, dfunc)       ({                      \
        u32 ret;                                                        \
        cant_migrate();                                                 \
+       if (static_branch_unlikely(&bpf_global_tr_active))              \
+               run_the_trampoline();                                   \
        if (static_branch_unlikely(&bpf_stats_enabled_key)) {           \
                struct bpf_prog_stats *stats;                           \
                u64 start = sched_clock();                              \


I am not 100% sure this is OK. 

I am also not sure whether this is an overkill. Do we really want more complex
metric for all BPF programs? Or run_time_ns is enough? 

Thanks,
Song