Re: [PATCH v3 bpf-next 2/3] bpf: introduce helper bpf_get_branch_snapshot

From: Song Liu <songliubraving@fb.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "open list:BPF (Safe dynamic programs and tools)" 
	<bpf@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Ingo Molnar <mingo@redhat.com>, kajoljain <kjain@linux.ibm.com>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH v3 bpf-next 2/3] bpf: introduce helper bpf_get_branch_snapshot
Date: Tue, 31 Aug 2021 21:24:46 +0000	[thread overview]
Message-ID: <F7738B87-1CD3-40F8-9278-DC69E9AB0395@fb.com> (raw)
In-Reply-To: <26B5B18A-EC0B-4661-BE6F-E5D96DCDE0B9@fb.com>

> On Aug 31, 2021, at 9:41 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Aug 31, 2021, at 8:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Mon, Aug 30, 2021 at 02:41:05PM -0700, Song Liu wrote:
>> 
>>> @@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
>>> u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
>>> 	__acquires(RCU)
>>> {
>> 	preempt_disable_notrace();
>> 
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +	/* Calling migrate_disable costs two entries in the LBR. To save
>>> +	 * some entries, we call perf_snapshot_branch_stack before
>>> +	 * migrate_disable to save some entries. This is OK because we
>>> +	 * care about the branch trace before entering the BPF program.
>>> +	 * If migrate happens exactly here, there isn't much we can do to
>>> +	 * preserve the data.
>>> +	 */
>>> +	if (prog->call_get_branch)
>>> +		static_call(perf_snapshot_branch_stack)(
>>> +			this_cpu_ptr(&bpf_perf_branch_snapshot));
>> 
>> Here the comment is accurate, but if you recall the calling context
>> requirements of perf_snapshot_branch_stack from the last patch, you'll
>> see it requires you have at the very least preemption disabled, which
>> you just violated.
> 
>> 
>> I think you'll find that (on x86 at least) the suggested
>> preempt_disable_notrace() incurs no additional branches.
>> 
>> Still, there is the next point to consider...
>> 
>>> +#endif
>>> 	rcu_read_lock();
>>> 	migrate_disable();
>> 
>> 	preempt_enable_notrace();
> 
> Do we want preempt_enable_notrace() after migrate_disable()? It feels a 
> little weird to me.
> 
>> 
>>> 	if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
>> 
>>> @@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
>>> 	preempt_enable();
>>> }
>>> 
>>> +DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
>>> +
>>> static __always_inline
>>> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>>> {
>>> +#ifdef CONFIG_PERF_EVENTS
>>> +	/* Calling migrate_disable costs two entries in the LBR. To save
>>> +	 * some entries, we call perf_snapshot_branch_stack before
>>> +	 * migrate_disable to save some entries. This is OK because we
>>> +	 * care about the branch trace before entering the BPF program.
>>> +	 * If migrate happens exactly here, there isn't much we can do to
>>> +	 * preserve the data.
>>> +	 */
>>> +	if (prog->call_get_branch)
>>> +		static_call(perf_snapshot_branch_stack)(
>>> +			this_cpu_ptr(&bpf_perf_branch_snapshot));
>>> +#endif
>>> 	cant_sleep();
>> 
>> In the face of ^^^^^^ the comment makes no sense. Still, what are the
>> nesting rules for __bpf_trace_run() and __bpf_prog_enter() ? I'm
>> thinking the trace one can nest inside an occurence of prog, at which
>> point you have pieces.
> 
> I think broken LBR records is something we cannot really avoid in case 
> of nesting. OTOH, these should be rare cases and will not hurt the results
> in most the use cases. 
> 
> I should probably tighten the rules in verifier to only apply it for 
> __bpf_prog_enter (where we have the primary use case). We can enable it
> for other program types when there are other use cases. 

Update about some offline discussion with Alexei and Andrii. We are planning
to move static_call(perf_snapshot_branch_stack) to inside the helper 
bpf_get_branch_snapshot. This change has a few benefit:

1. No need for extra check (prog->call_get_branch) before every program (even
   when the program doesn't use the helper).

2. No need to duplicate code of different BPF program hook. 
3. BPF program always run with migrate_disable(), so it is not necessary to 
   run add extra preempt_disable_notrace. 

It does flushes a few more LBR entries. But the result seems ok:

ID: 0 from intel_pmu_lbr_disable_all+58 to intel_pmu_lbr_disable_all+93
ID: 1 from intel_pmu_lbr_disable_all+54 to intel_pmu_lbr_disable_all+58
ID: 2 from intel_pmu_snapshot_branch_stack+88 to intel_pmu_lbr_disable_all+0
ID: 3 from bpf_get_branch_snapshot+28 to intel_pmu_snapshot_branch_stack+0
ID: 4 from <bpf_tramepoline> to bpf_get_branch_snapshot+0
ID: 5 from <bpf_tramepoline> to <bpf_tramepoline>
ID: 6 from __bpf_prog_enter+34 to <bpf_tramepoline>
ID: 7 from migrate_disable+60 to __bpf_prog_enter+9
ID: 8 from __bpf_prog_enter+4 to migrate_disable+0
ID: 9 from __bpf_prog_enter+4 to __bpf_prog_enter+0
ID: 10 from bpf_fexit_loop_test1+22 to __bpf_prog_enter+0
ID: 11 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 12 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 13 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 14 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 15 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13

We can save more by inlining intel_pmu_lbr_disable_all(). But it is probably
not necessary at the moment. 

Thanks,
Song