From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: Re: [PATCH v2 net-next 1/6] bpf: introduce BPF_PROG_TEST_RUN command Date: Sat, 1 Apr 2017 22:42:55 +0200 Message-ID: <20170401224255.4f8780f1@redhat.com> References: <20170331044543.4075183-1-ast@fb.com> <20170331044543.4075183-2-ast@fb.com> <20170401091423.4ce1ef3b@redhat.com> <4085f538-2a92-0373-d81c-5f9396ba0d84@fb.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: "David S . Miller" , Daniel Borkmann , Wang Nan , Martin KaFai Lau , , , brouer@redhat.com To: Alexei Starovoitov Return-path: Received: from mx1.redhat.com ([209.132.183.28]:49978 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751526AbdDAUnC (ORCPT ); Sat, 1 Apr 2017 16:43:02 -0400 In-Reply-To: <4085f538-2a92-0373-d81c-5f9396ba0d84@fb.com> Sender: netdev-owner@vger.kernel.org List-ID: On Sat, 1 Apr 2017 08:45:01 -0700 Alexei Starovoitov wrote: > On 4/1/17 12:14 AM, Jesper Dangaard Brouer wrote: > > On Thu, 30 Mar 2017 21:45:38 -0700 > > Alexei Starovoitov wrote: > > > >> static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time) > >> +{ > >> + u64 time_start, time_spent = 0; > >> + u32 ret = 0, i; > >> + > >> + if (!repeat) > >> + repeat = 1; > >> + time_start = ktime_get_ns(); > > > > I've found that is useful to record the CPU cycles, as it is more > > useful for comparing between CPUs. The nanosec time measurement varies > > too much between CPUs and GHz. I do use nanosec measurements myself a > > lot, but that is mostly because it is easier to relate to pps rates. > > For eBPF code execution I think it is more useful to get a cycles cost > > count? > > for micro-benchmarking of an instruction or small primitives > like spin_lock and irq_save/restore, yes. Cycles are more interesting > to look at. Here it's the whole program which in case of networking > likely does at least a few map lookups. > Also this duration field is more of sanity test then actual metric. Okay, if it was only a sanity metric. > > I've been using tsc[1] (rdtsc) to get the CPU cycles, I believe > > get_cycles() the more generic call, which have arch specific impl. (but > > can return 0 if no arch support). > > > > The best solution would be to use the perf infrastructure and PMU > > counter to get both PMU cycles and instructions, as that also tell you > > about the pipeline efficiency like instructions per cycles. I only got > > this partly working in [1][2]. > > to use get_cycles() or perf_event_create_kernel_counter() the current > simple loop would become kthread pinned to cpu and so on. > imo it's an overkill. > The only reason 'duration' being reported is a sanity test with user > space measurements. > What this command allows to do is: > $ time ./my_bpf_benchmark > The reported time should match the kernel reported 'duration'. > The tiny difference will come from resched. That's sanity part. > Now we can also do > $ perf record ./my_bpf_benchmark Make perfect sense, to handle it this way. > and get all perf goodness for free without adding any kernel code. > I want this test_run command to stay execution only. All pmu and > performance metrics should stay on perf side. > In case of performance optimization of bpf programs we're trying > to improve perf by changing the way program is written, hence > we need perf to point out which line of C code is costly. > Second is improving performance by changing JIT, map implementations > and so on. Here we also want full perf tool power. > > Unfortunately there is an issue with perf today, since as soon as > my_bpf_benchmark exits, bpf prog is unloaded and ksym is gone, so > 'perf report' cannot associate addresses back to source code. > We discussed a solution with Arnaldo. So that's orthogonal work in > progress which is needed regardless of this test_run command. Yes, that is rather unfortunate. Good to hear there is work in this area. I've started using: sysctl net/core/bpf_jit_kallsyms=1 and adding --kallsyms=/proc/kallsyms to perf report, which is helpful. > User space can also pin itself to cpu instead of asking kernel to > do it and run the same program on multiple cpus in parallel testing > interaction between concurrent map accesses and so on. > So by keeping test_run command as execution only primitive we allow > user space to do all the fancy tricks and measurements. Sound good to me! :-) Acked-by: Jesper Dangaard Brouer -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer