Re: [PATCH bpf] riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework

From: "Björn Töpel" <bjorn@kernel.org>
To: Pu Lehui <pulehui@huawei.com>, Pu Lehui <pulehui@huaweicloud.com>,
	bpf@vger.kernel.org, linux-riscv@lists.infradead.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Palmer Dabbelt <palmer@dabbelt.com>,
	Guo Ren <guoren@kernel.org>, Song Shuai <suagrfillet@gmail.com>
Subject: Re: [PATCH bpf] riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
Date: Fri, 21 Jul 2023 10:53:34 +0200	[thread overview]
Message-ID: <87o7k5fxwx.fsf@all.your.base.are.belong.to.us> (raw)
In-Reply-To: <b5977c5d-c434-7b4c-89f3-d575ee5d04e8@huawei.com>

Pu Lehui <pulehui@huawei.com> writes:

> On 2023/7/19 23:18, Björn Töpel wrote:
>> Pu Lehui <pulehui@huawei.com> writes:
>> 
>>> On 2023/7/19 4:06, Björn Töpel wrote:
>>>> Pu Lehui <pulehui@huaweicloud.com> writes:
>>>>
>>>>> From: Pu Lehui <pulehui@huawei.com>
>>>>>
>>>>> Commit 6724a76cff85 ("riscv: ftrace: Reduce the detour code size to
>>>>> half") optimizes the detour code size of kernel functions to half with
>>>>> T0 register and the upcoming DYNAMIC_FTRACE_WITH_DIRECT_CALLS of riscv
>>>>> is based on this optimization, we need to adapt riscv bpf trampoline
>>>>> based on this. One thing to do is to reduce detour code size of bpf
>>>>> programs, and the second is to deal with the return address after the
>>>>> execution of bpf trampoline. Meanwhile, add more comments and rename
>>>>> some variables to make more sense. The related tests have passed.
>>>>>
>>>>> This adaptation needs to be merged before the upcoming
>>>>> DYNAMIC_FTRACE_WITH_DIRECT_CALLS of riscv, otherwise it will crash due
>>>>> to a mismatch in the return address. So we target this modification to
>>>>> bpf tree and add fixes tag for locating.
>>>>
>>>> Thank you for working on this!
>>>>
>>>>> Fixes: 6724a76cff85 ("riscv: ftrace: Reduce the detour code size to half")
>>>>
>>>> This is not a fix. Nothing is broken. Only that this patch much come
>>>> before or as part of the ftrace series.
>>>
>>> Yep, it's really not a fix. I have no idea whether this patch target to
>>> bpf-next tree can be ahead of the ftrace series of riscv tree?
>> 
>> For this patch, I'd say it's easier to take it via the RISC-V tree, IFF
>> the ftrace series is in for-next.
>> 
>
> alright, so let's make it target to riscv-tree to avoid that cracsh.
>
>> [...]
>> 
>>>>> +#define DETOUR_NINSNS	2
>>>>
>>>> Better name? Maybe call this patchable function entry something? Also,
>>>
>>> How about RV_FENTRY_NINSNS?
>> 
>> Sure. And more importantly that it's actually used in the places where
>> nops/skips are done.
>
> the new one is suited up.
>
>> 
>>>> to catch future breaks like this -- would it make sense to have a
>>>> static_assert() combined with something tied to
>>>> -fpatchable-function-entry= from arch/riscv/Makefile?
>>>
>>> It is very necessary, but it doesn't seem to be easy. I try to find GCC
>>> related functions, something like __builtin_xxx, but I can't find it so
>>> far. Also try to make it as a CONFIG_PATCHABLE_FUNCTION_ENTRY=4 in
>>> Makefile and then static_assert, but obviously it shouldn't be done.
>>> Maybe we can deal with this later when we have a solution?
>> 
>> Ok!
>> 
>> [...]
>> 
>>>>> @@ -787,20 +762,19 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im,
>>>>>    	int i, ret, offset;
>>>>>    	int *branches_off = NULL;
>>>>>    	int stack_size = 0, nregs = m->nr_args;
>>>>> -	int retaddr_off, fp_off, retval_off, args_off;
>>>>> -	int nregs_off, ip_off, run_ctx_off, sreg_off;
>>>>> +	int fp_off, retval_off, args_off, nregs_off, ip_off, run_ctx_off, sreg_off;
>>>>>    	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
>>>>>    	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
>>>>>    	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>>>>>    	void *orig_call = func_addr;
>>>>> -	bool save_ret;
>>>>> +	bool save_retval, traced_ret;
>>>>>    	u32 insn;
>>>>>    
>>>>>    	/* Generated trampoline stack layout:
>>>>>    	 *
>>>>>    	 * FP - 8	    [ RA of parent func	] return address of parent
>>>>>    	 *					  function
>>>>> -	 * FP - retaddr_off [ RA of traced func	] return address of traced
>>>>> +	 * FP - 16	    [ RA of traced func	] return address of
>>>>>    	traced
>>>>
>>>> BPF code uses frame pointers. Shouldn't the trampoline frame look like a
>>>> regular frame [1], i.e. start with return address followed by previous
>>>> frame pointer?
>>>>
>>>
>>> oops, will fix it. Also we need to consider two types of trampoline
>>> stack layout, that is:
>>>
>>> * 1. trampoline called from function entry
>>> * --------------------------------------
>>> * FP + 8           [ RA of parent func ] return address of parent
>>> *                                        function
>>> * FP + 0           [ FP                ]
>>> *
>>> * FP - 8           [ RA of traced func ] return address of traced
>>> *                                        function
>>> * FP - 16          [ FP                ]
>>> * --------------------------------------
>>> *
>>> * 2. trampoline called directly
>>> * --------------------------------------
>>> * FP - 8           [ RA of caller func ] return address of caller
>>> *                                        function
>>> * FP - 16          [ FP                ]
>>> * --------------------------------------
>> 
>> Hmm, could you expand a bit on this? The stack frame top 16B (8+8)
>> should follow what the psabi suggests, regardless of the call site?
>> 
>
> Maybe I've missed something important! Or maybe I'm misunderstanding 
> what you mean. But anyway there is something to show. In my perspective, 
> we should construct a complete stack frame, otherwise one layer of stack 
> will be lost in calltrace when enable CONFIG_FRAME_POINTER.
>
> We can verify it by `echo 1 > 
> /sys/kernel/debug/tracing/options/stacktrace`, and the results as show 
> below:
>
> 1. complete stack frame
> * --------------------------------------
> * FP + 8           [ RA of parent func ] return address of parent
> *                                        function
> * FP + 0           [ FP                ]
> *
> * FP - 8           [ RA of traced func ] return address of traced
> *                                        function
> * FP - 16          [ FP                ]
> * --------------------------------------
> the stacktrace is:
>
>   => trace_event_raw_event_bpf_trace_printk
>   => bpf_trace_printk
>   => bpf_prog_ad7f62a5e7675635_bpf_prog
>   => bpf_trampoline_6442536643
>   => do_empty
>   => meminfo_proc_show
>   => seq_read_iter
>   => proc_reg_read_iter
>   => copy_splice_read
>   => vfs_splice_read
>   => splice_direct_to_actor
>   => do_splice_direct
>   => do_sendfile
>   => sys_sendfile64
>   => do_trap_ecall_u
>   => ret_from_exception
>
> 2. omit one FP
> * --------------------------------------
> * FP + 0           [ RA of parent func ] return address of parent
> *                                        function
> * FP - 8           [ RA of traced func ] return address of traced
> *                                        function
> * FP - 16          [ FP                ]
> * --------------------------------------
> the stacktrace is:
>
>   => trace_event_raw_event_bpf_trace_printk
>   => bpf_trace_printk
>   => bpf_prog_ad7f62a5e7675635_bpf_prog
>   => bpf_trampoline_6442491529
>   => do_empty
>   => seq_read_iter
>   => proc_reg_read_iter
>   => copy_splice_read
>   => vfs_splice_read
>   => splice_direct_to_actor
>   => do_splice_direct
>   => do_sendfile
>   => sys_sendfile64
>   => do_trap_ecall_u
>   => ret_from_exception
>
> it lost the layer of 'meminfo_proc_show'.

(Lehui was friendly enough to explain the details for me offlist.)

Aha, now I get what you mean! When we're getting into the trampoline
from the fentry-side, an additional stack frame needs to be
created. Otherwise, the unwinding will be incorrect.

So (for the rest of the readers ;-)), the BPF trampoline can be called
from:

A. A tracing point of view; Here, we're calling into the trampoline via
   the fentry/patchable entry. In this scenario, an additional stack
   frame needs to be constructed for proper unwinding.

B. For kfuncs. Here, the call into the trampoline is just a "regular
   call", and no additional stack frame is needed.

@Guo @Song Is the RISC-V ftrace code creating an additional stack frame,
or is the stack unwinding incorrect when the fentry is patched?

Thanks for clearing it up for me, Lehui!

Björn

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv