Re: [PATCH bpf-next v2 3/6] bpf: x86: Support in-register struct arguments

From: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next v2 3/6] bpf: x86: Support in-register struct arguments
Date: Wed, 24 Aug 2022 21:04:28 -0700	[thread overview]
Message-ID: <7de2c92f-844c-a12e-3dc4-c92821c18e3a@fb.com> (raw)
In-Reply-To: <CAEf4BzZOGWFxGOD8hMH9v4gJPGv0tf5464Aa0DivDFrRhenx0Q@mail.gmail.com>

On 8/24/22 12:05 PM, Andrii Nakryiko wrote:
> On Thu, Aug 18, 2022 at 1:44 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Wed, Aug 17, 2022 at 09:56:23PM -0700, Yonghong Song wrote:
>>>
>>>
>>> On 8/15/22 3:44 PM, Alexei Starovoitov wrote:
>>>> On Thu, Aug 11, 2022 at 10:24 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>
>>>>> In C, struct value can be passed as a function argument.
>>>>> For small structs, struct value may be passed in
>>>>> one or more registers. For trampoline based bpf programs,
>>>>> This would cause complication since one-to-one mapping between
>>>>> function argument and arch argument register is not valid
>>>>> any more.
>>>>>
>>>>> To support struct value argument and make bpf programs
>>>>> easy to write, the bpf program function parameter is
>>>>> changed from struct type to a pointer to struct type.
>>>>> The following is a simplified example.
>>>>>
>>>>> In one of later selftests, we have a bpf_testmod function:
>>>>>       struct bpf_testmod_struct_arg_2 {
>>>>>           long a;
>>>>>           long b;
>>>>>       };
>>>>>       noinline int
>>>>>       bpf_testmod_test_struct_arg_2(int a, struct bpf_testmod_struct_arg_2 b, int c) {
>>>>>           bpf_testmod_test_struct_arg_result = a + b.a + b.b + c;
>>>>>           return bpf_testmod_test_struct_arg_result;
>>>>>       }
>>>>>
>>>>> When a bpf program is attached to the bpf_testmod function
>>>>> bpf_testmod_test_struct_arg_2(), the bpf program may look like
>>>>>       SEC("fentry/bpf_testmod_test_struct_arg_2")
>>>>>       int BPF_PROG(test_struct_arg_3, int a, struct bpf_testmod_struct_arg_2 *b, int c)
>>>>>       {
>>>>>           t2_a = a;
>>>>>           t2_b_a = b->a;
>>>>>           t2_b_b = b->b;
>>>>>           t2_c = c;
>>>>>           return 0;
>>>>>       }
>>>>>
>>>>> Basically struct value becomes a pointer to the struct.
>>>>> The trampoline stack will be increased to store the stack values and
>>>>> the pointer to these values will be saved in the stack slot corresponding
>>>>> to that argument. For x86_64, the struct size is limited up to 16 bytes
>>>>> so the struct can fit in one or two registers. The struct size of more
>>>>> than 16 bytes is not supported now as our current use case is
>>>>> for sockptr_t in the argument. We could handle such large struct's
>>>>> in the future if we have concrete use cases.
>>>>>
>>>>> The main changes are in save_regs() and restore_regs(). The following
>>>>> illustrated the trampoline asm codes for save_regs() and restore_regs().
>>>>> save_regs():
>>>>>       /* first argument */
>>>>>       mov    DWORD PTR [rbp-0x18],edi
>>>>>       /* second argument: struct, save actual values and put the pointer to the slot */
>>>>>       lea    rax,[rbp-0x40]
>>>>>       mov    QWORD PTR [rbp-0x10],rax
>>>>>       mov    QWORD PTR [rbp-0x40],rsi
>>>>>       mov    QWORD PTR [rbp-0x38],rdx
>>>>>       /* third argument */
>>>>>       mov    DWORD PTR [rbp-0x8],esi
>>>>> restore_regs():
>>>>>       mov    edi,DWORD PTR [rbp-0x18]
>>>>>       mov    rsi,QWORD PTR [rbp-0x40]
>>>>>       mov    rdx,QWORD PTR [rbp-0x38]
>>>>>       mov    esi,DWORD PTR [rbp-0x8]
>>>>
>>>> Not sure whether it was discussed before, but
>>>> why cannot we adjust the bpf side instead?
>>>> Technically struct passing between bpf progs was never
>>>> officially supported. llvm generates something.
>>>> Probably always passes by reference, but we can adjust
>>>> that behavior without breaking any programs because
>>>> we don't have bpf libraries. Programs are fully contained
>>>> in one or few files. libbpf can do static linking, but
>>>> without any actual libraries the chance of breaking
>>>> backward compat is close to zero.
>>>
>>> Agree. At this point, we don't need to worry about
>>> compatibility between bpf program and bpf program libraries.
>>>
>>>> Can we teach llvm to pass sizeof(struct) <= 16 in
>>>> two bpf registers?
>>>
>>> Yes, we can. I just hacked llvm and was able to
>>> do that.
>>>
>>>> Then we wouldn't need to have a discrepancy between
>>>> kernel function prototype and bpf fentry prog proto.
>>>> Both will have struct by value in the same spot.
>>>> The trampoline generation will be simpler for x86 and
>>>> its runtime faster too.
>>>
>>> I tested x86 and arm64 both supports two registers
>>> for a 16 byte struct.
>>>
>>>> The other architectures that pass small structs by reference
>>>> can do a bit more work in the trampoline: copy up to 16 byte
>>>> and bpf prog side will see it as they were passed in 'registers'.
>>>> wdyt?
>>>
>>> I know systemz and Hexagon will pass by reference for any
>>> struct size >= 8 bytes. Didn't complete check others.
>>>
>>> But since x86 and arm64 supports direct value passing
>>> with two registers, we should be okay. As you mentioned,
>>> we could support systemz/hexagon style of struct passing
>>> by copying the values to the stack.
>>>
>>>
>>> But I have a problem how to define a user friendly
>>> macro like BPF_PROG for user to use.
>>>
>>> Let us say, we have a program like below:
>>> SEC("fentry/bpf_testmod_test_struct_arg_1")
>>> int BPF_PROG(test_struct_arg_1, struct bpf_testmod_struct_arg_2 *a, int b,
>>> int c) {
>>> ...
>>> }
>>>
>>> We have BPF_PROG macro definition here:
>>>
>>> #define BPF_PROG(name, args...)     \
>>> name(unsigned long long *ctx);     \
>>> static __always_inline typeof(name(0))     \
>>> ____##name(unsigned long long *ctx, ##args);     \
>>> typeof(name(0)) name(unsigned long long *ctx)     \
>>> {     \
>>>          _Pragma("GCC diagnostic push")      \
>>>          _Pragma("GCC diagnostic ignored \"-Wint-conversion\"")      \
>>>          return ____##name(___bpf_ctx_cast(args));      \
>>>          _Pragma("GCC diagnostic pop")      \
>>> }     \
>>> static __always_inline typeof(name(0))     \
>>> ____##name(unsigned long long *ctx, ##args)
>>>
>>> Some we have static function definition
>>>
>>> int ____test_struct_arg_1(unsigned long long *ctx, struct
>>> bpf_testmod_struct_arg_2 *a, int b, int c);
>>>
>>> But the function call inside the function test_struct_arg_1()
>>> is
>>>    ____test_struct_arg_1(ctx, ctx[0], ctx[1], ctx[2]);
>>>
>>> We have two problems here:
>>>    ____test_struct_arg_1(ctx, ctx[0], ctx[1], ctx[2])
>>> does not match the static function declaration.
>>> This is not problem if everything is int/ptr type.
>>> If one of argument is structure type, we will have
>>> type conversion problem. Let us this can be resolved
>>> somehow through some hack.
>>>
>>> More importantly, because some structure may take two
>>> registers,
>>>     ____test_struct_arg_1(ctx, ctx[0], ctx[1], ctx[2])
>>> may not be correct. In my above example, if the
>>> structure size is 16 bytes,
>>> then the actual call should be
>>>     ____test_struct_arg_1(ctx, ctx[0], ctx[1], ctx[2], ctx[3])
>>> So we need to provide how many extra registers are needed
>>> beyond ##args in the macro. I have not tried how to
>>> resolve this but this extra information in the macro
>>> definite is not user friendly.
>>>
>>> Not sure what is the best way to handle this issue (##args is not precise
>>> and needs addition registers for >8 struct arguments).
>>
>> The kernel is using this trick to cast 8 byte structs to u64:
>> /* cast any integer, pointer, or small struct to u64 */
>> #define UINTTYPE(size) \
>>          __typeof__(__builtin_choose_expr(size == 1,  (u8)1, \
>>                     __builtin_choose_expr(size == 2, (u16)2, \
>>                     __builtin_choose_expr(size == 4, (u32)3, \
>>                     __builtin_choose_expr(size == 8, (u64)4, \
>>                                           (void)5)))))
>> #define __CAST_TO_U64(x) ({ \
>>          typeof(x) __src = (x); \
>>          UINTTYPE(sizeof(x)) __dst; \
>>          memcpy(&__dst, &__src, sizeof(__dst)); \
>>          (u64)__dst; })
>>
>> casting 16 byte struct to two u64 can be similar.
>> Ideally we would declare bpf prog as:
>> SEC("fentry/bpf_testmod_test_struct_arg_1")
>> int BPF_PROG(test_struct_arg_1, struct bpf_testmod_struct_arg_2 a, int b, int c) {
>> note there is no '*'. It's struct by value.
> 
> Agree. So I tried to compile this:
> 
> $ git diff
> diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c
> b/tools/testing/selftests/bpf/progs/test_vmlinux.c
> index e9dfa0313d1b..dccb9ae9801f 100644
> --- a/tools/testing/selftests/bpf/progs/test_vmlinux.c
> +++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c
> @@ -15,6 +15,16 @@ bool tp_btf_called = false;
>   bool kprobe_called = false;
>   bool fentry_called = false;
> 
> +typedef struct {
> +       void *x;
> +       int t;
> +} sockptr;
> +
> +static int blah(sockptr x)
> +{
> +       return x.t;
> +}
> +
>   SEC("tp/syscalls/sys_enter_nanosleep")
>   int handle__tp(struct trace_event_raw_sys_enter *args)
>   {
> @@ -30,7 +40,14 @@ int handle__tp(struct trace_event_raw_sys_enter *args)
>                  return 0;
> 
>          tp_called = true;
> -       return 0;
> +
> +       return blah(({ union {
> +               struct { u64 x, y; } z;
> +               sockptr s;
> +               } tmp = {.z = {0, 1}};
> +
> +               tmp.s;
> +       }));
>   }
> 
>   SEC("raw_tp/sys_enter")
> 
> 
> And it compiled. So I think it's possible to do u64 to struct
> conversion using this approach.
> We'd have to define two variations of macro -- one for structs <= 8
> bytes, another for structs > 8 and <= 16 bytes. One will "consume"
> single ctx[] slot, another -- will consume both. And then each variant
> knows which other macro to refer to after itself.
> 
> A bit of macro hackery, but it should work.

Thanks for suggestion. This approach might work. Let me
give a try.

> 
> 
>> The main challenge is how to do the math in the BPF_PROG macros.
>> Currently it's doing:
>> #define ___bpf_ctx_cast1(x)           ___bpf_ctx_cast0(), (void *)ctx[0]
>> #define ___bpf_ctx_cast2(x, args...)  ___bpf_ctx_cast1(args), (void *)ctx[1]
>> #define ___bpf_ctx_cast3(x, args...)  ___bpf_ctx_cast2(args), (void *)ctx[2]
>> #define ___bpf_ctx_cast4(x, args...)  ___bpf_ctx_cast3(args), (void *)ctx[3]
>>
>> The ctx[index] is one-to-one with argument position.
>> That 'index' needs to be calculated.
>> Maybe something like UINTTYPE() that applies to previous arguments?
>> #define REG_CNT(arg) \
>>          __builtin_choose_expr(sizeof(arg) == 1,  1, \
>>          __builtin_choose_expr(sizeof(arg) == 2,  1, \
>>          __builtin_choose_expr(sizeof(arg) == 4,  1, \
>>          __builtin_choose_expr(sizeof(arg) == 8,  1, \
>>          __builtin_choose_expr(sizeof(arg) == 16,  2, \
>>                                           (void)0)))))
>>
>> #define ___bpf_reg_cnt0()            0
>> #define ___bpf_reg_cnt1(x)          ___bpf_reg_cnt0() + REG_CNT(x)
>> #define ___bpf_reg_cnt2(x, args...) ___bpf_reg_cnt1(args) + REG_CNT(x)
>> #define ___bpf_reg_cnt(args...)    ___bpf_apply(___bpf_reg_cnt, ___bpf_narg(args))(args)
>>
>> This way the macro will calculate the index inside ctx[] array.
>>
>> and then inside ___bpf_ctx_castN macro use ___bpf_reg_cnt.
>> Instead of:
>> ___bpf_ctx_cast3(x, args...)  ___bpf_ctx_cast2(args), (void *)ctx[2]
>> it will be
>> ___bpf_ctx_cast3(x, args...)  ___bpf_ctx_cast2(args), \
>>    __builtin_choose_expr(sizeof(x) <= 8, (void *)ctx[___bpf_reg_cnt(args)],
>>                          *(typeof(x) *) &ctx[___bpf_reg_cnt(args)])
>>
>> x - is one of the arguments.
>> args - all args before 'x'. Doing __bpf_reg_cnt on them should calculate the index.
>> *(typeof(x) *)& should type cast to struct of 16 bytes.
>>
>> Rough idea, of course.
>>
>> Another alternative is instead of:
>> #define BPF_PROG(name, args...)
>> name(unsigned long long *ctx);
>> do:
>> #define BPF_PROG(name, args...)
>> struct XX {
>>    macro inserts all 'args' here separated by ; so it becomes a proper struct
>> };
>> name(struct XX *ctx);
>>
>> and then instead of doing ___bpf_ctx_castN for each argument
>> do single cast of all of 'u64 ctx[N]' passed from fentry into 'struct XX *'.
>> The problem with this approach that small args like char, short, int needs to
>> be declared in struct XX with __align__(8).
>>
>> Both approaches may be workable?