Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs

From: Peter Zijlstra <peterz@infradead.org>
To: Song Liu <songliubraving@fb.com>
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
	acme@kernel.org, mingo@redhat.com, kernel-team@fb.com,
	Kan Liang <kan.liang@linux.intel.com>,
	Like Xu <like.xu@linux.intel.com>,
	Alexey Budankov <alexey.budankov@linux.intel.com>
Subject: Re: [RFC] bpf: lbr: enable reading LBR from tracing bpf programs
Date: Wed, 18 Aug 2021 11:15:44 +0200	[thread overview]
Message-ID: <YRzPwClswwxHXVHe@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20210818012937.2522409-1-songliubraving@fb.com>

On Tue, Aug 17, 2021 at 06:29:37PM -0700, Song Liu wrote:
> The typical way to access LBR is via hardware perf_event. For CPUs with
> FREEZE_LBRS_ON_PMI support, PMI could capture reliable LBR. On the other
> hand, LBR could also be useful in non-PMI scenario. For example, in
> kretprobe or bpf fexit program, LBR could provide a lot of information
> on what happened with the function.
> 
> In this RFC, we try to enable LBR for BPF program. This works like:
>   1. Create a hardware perf_event with PERF_SAMPLE_BRANCH_* on each CPU;
>   2. Call a new bpf helper (bpf_get_branch_trace) from the BPF program;
>   3. Before calling this bpf program, the kernel stops LBR on local CPU,
>      make a copy of LBR, and resumes LBR;
>   4. In the bpf program, the helper access the copy from #3.
> 
> Please see tools/testing/selftests/bpf/[progs|prog_tests]/get_call_trace.c
> for a detailed example. Not that, this process is far from ideal, but it
> allows quick prototype of this feature.
> 
> AFAICT, the biggest challenge here is that we are now sharing LBR in PMI
> and out of PMI, which could trigger some interesting race conditions.
> However, if we allow some level of missed/corrupted samples, this should
> still be very useful.
> 
> Please share your thoughts and comments on this. Thanks in advance!

> +int bpf_branch_record_read(void)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> +	intel_pmu_lbr_disable_all();
> +	intel_pmu_lbr_read();
> +	memcpy(this_cpu_ptr(&bpf_lbr_entries), cpuc->lbr_entries,
> +	       sizeof(struct perf_branch_entry) * x86_pmu.lbr_nr);
> +	*this_cpu_ptr(&bpf_lbr_cnt) = x86_pmu.lbr_nr;
> +	intel_pmu_lbr_enable_all(false);
> +	return 0;
> +}

Urgghhh.. I so really hate BPF specials like this. Also, the PMI race
you describe is because you're doing abysmal layer violations. If you'd
have used perf_pmu_disable() that wouldn't have been a problem.

I'd much rather see a generic 'fake/inject' PMI facility, something that
works across the board and isn't tied to x86/intel.