* [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms @ 2020-07-24 17:35 Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski ` (5 more replies) 0 siblings, 6 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski v4->v5: simplify the fix from v3/v4 (Daniel) v3->v4: - be more careful around the fix from v3 v2->v3: - call map_poke_untrack() on each previously registered subprog's aux struct to prog array if adding poke descriptor or tracking the aux struct failed (Daniel) v1->v2: - include the rax->rcx conversion in first patch, target prog needs to be placed in rcx in the tailcall indirect routine (Daniel) - add error checks to routines that add poke descriptors to subprograms (Daniel) - don't allow this optimization when arch is different than x64 and when JIT is disabled (Daniel) - pull out the rename of poke desc members onto a separate patch (Daniel) - add a new poke member to store the bypass address so that calculation of it won't be necessary - avoid the special casing when old and new is null in map_poke_run (Daniel) - do not sync RCU when bypass target was not patched (Daniel) - do not introduce nop2 instruction to prologue for cBPF programs (Daniel) RFC->v1: - rename poke->ip/poke->ip_aux pair to poke->tailcall_target/poke->tailcall_bypass (Alexei) - get rid of x86-specific code in prog_array_map_poke_run (Alexei) - use synchronize_rcu in prog_array_map_poke_run so that other CPUs in the middle of execution will finish running the program and will not stumble upon the incorrect state (Alexei) - update performance reports - rebase Hello, today bpf2bpf calls and tailcalls exclude each other. This set makes them work together. To give you an overview how this work started, previously I posted RFC that was targetted at getting rid of push/pop instructions for callee saved registers in x86-64 JIT that are not used by the BPF program. Alexei saw a potential that that work could be lifted a bit and tailcalls could work with BPF subprograms. More on that in [1], [2]. For previous discussions on RFC version, see [3]. For v1, see [4]. v2 is in [5], v3 can be ignored. In [1], Alexei says: "The prologue will look like: nop5 xor eax,eax // two new bytes if bpf_tail_call() is used in this function push rbp mov rbp, rsp sub rsp, rounded_stack_depth push rax // zero init tail_call counter variable number of push rbx,r13,r14,r15 Then bpf_tail_call will pop variable number rbx,.. and final 'pop rax' Then 'add rsp, size_of_current_stack_frame' jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp' This way new function will set its own stack size and will init tail call counter with whatever value the parent had. If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'. Instead it would need to have 'nop2' in there." So basically I gave a shot at that suggestion. Patch 4 has a description of implementation. Quick overview of patches: Patch 1 changes BPF retpoline to use %rcx instead of %rax to store address of BPF tailcall target program Patch 2 propagates poke descriptors from main program to each subprogram Patch 3 renames poke->ip to poke->tailcall_target Patch 4 is the main dish in this set. It implements new prologue layout that was suggested by Alexei and reworks tailcall handling. Patch 5 relaxes verifier's restrictions about tailcalls being used with BPF subprograms for x64 JIT Patch 6 is the new selftest that proves tailcalls can be used from within BPF subprogram. ------------------------------------------------------------------- prog_array_map_poke_run changes: Before the tailcall and with the new prologue layout, stack need to be unwinded and callee saved registers need to be popped. Instructions responsible for that are generated, but they should not be executed if target program is not present. To address that, new poke target 'tailcall_bypass' is introduced to poke descriptor that will be used for skipping these instructions. This means there are two poke targets for handling direct tailcalls. Simplified flow can be presented as three sections: 1. skip call or nop (poke->tailcall_bypass) 2. stack unwind 3. call tail or nop (poke->tailcall_target) It would be possible under specific circumstances that one of CPU might be in point 2 and point 3 is not yet updated (nop), which would lead to problems mentioned in patch 4 commit message, IOW unwind section should not be executed if there is no target program. We can define the following state matrix for that (courtesy of Bjorn): A nop, unwind, nop B nop, unwind, tail C skip, unwind, nop D skip, unwind, tail A is forbidden (lead to incorrectness). The state transitions between tailcall install/update/remove will work as follows: First install tail call f: C->D->B(f) * poke the tailcall, after that get rid of the skip Update tail call f to f': B(f)->B(f') * poke the tailcall (poke->tailcall_target) and do NOT touch the poke->tailcall_bypass Remove tail call: B(f')->C(f') * poke->tailcall_bypass is poked back to jump, then we wait the RCU grace period so that other programs will finish its execution and after that we are safe to remove the poke->tailcall_target Install new tail call (f''): C(f')->D(f'')->B(f''). * same as first step This way CPU can never be exposed to "unwind, tail" state. ------------------------------------------------------------------- Performance impact: All of this work, as stated in [2], started as a way to speed up AF-XDP by dropping the push/pop of unused callee saved registers in prologue and epilogue. Impact is positive, 15% of performance gain. However, it is obvious that it will have a negative impact on BPF programs that utilize tailcalls, but we think its volume is acceptable for the feature that this set contains. Below are te numbers from 'perf stat' for two scenarios. First scenario is the output of command: $ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls tailcalls kselftest was modified in a following way: - only tailcall1 subtest is enabled - each of the bpf_prog_test_run() calls got set 'repeat' argument to 1000000 Numbers without this set: Performance counter stats for './test_progs -t tailcalls' (1024 runs): 261.68 msec task-clock # 0.998 CPUs utilized ( +- 0.12% ) 5 context-switches # 0.017 K/sec ( +- 0.54% ) 0 cpu-migrations # 0.000 K/sec ( +- 23.37% ) 113 page-faults # 0.433 K/sec ( +- 0.03% ) 877,156,850 cycles # 3.352 GHz ( +- 0.11% ) (30.31%) 1,379,322,515 instructions # 1.57 insn per cycle ( +- 0.02% ) (38.17%) 218,869,567 branches # 836.395 M/sec ( +- 0.01% ) (38.46%) 11,954,183 branch-misses # 5.46% of all branches ( +- 0.01% ) (38.74%) 283,350,418 L1-dcache-loads # 1082.805 M/sec ( +- 0.01% ) (39.00%) 156,323 L1-dcache-load-misses # 0.06% of all L1-dcache hits ( +- 0.74% ) (39.05%) 37,309 LLC-loads # 0.143 M/sec ( +- 1.02% ) (31.08%) 15,263 LLC-load-misses # 40.91% of all LL-cache hits ( +- 0.90% ) (30.95%) <not supported> L1-icache-loads 130,427 L1-icache-load-misses ( +- 0.45% ) (30.80%) 285,369,370 dTLB-loads # 1090.520 M/sec ( +- 0.01% ) (30.64%) 1,154 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 1.26% ) (30.46%) 2,015 iTLB-loads # 0.008 M/sec ( +- 1.12% ) (30.31%) 551 iTLB-load-misses # 27.34% of all iTLB cache hits ( +- 1.29% ) (30.20%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 0.262276 +- 0.000316 seconds time elapsed ( +- 0.12% ) With: Performance counter stats for './test_progs -t tailcalls' (1024 runs): 362.37 msec task-clock # 0.671 CPUs utilized ( +- 0.11% ) 28 context-switches # 0.077 K/sec ( +- 0.15% ) 0 cpu-migrations # 0.001 K/sec ( +- 4.46% ) 113 page-faults # 0.313 K/sec ( +- 0.03% ) 895,804,416 cycles # 2.472 GHz ( +- 0.08% ) (30.50%) 1,339,401,398 instructions # 1.50 insn per cycle ( +- 0.04% ) (38.29%) 302,718,849 branches # 835.385 M/sec ( +- 0.04% ) (38.39%) 11,962,089 branch-misses # 3.95% of all branches ( +- 0.05% ) (38.56%) 248,044,443 L1-dcache-loads # 684.505 M/sec ( +- 0.03% ) (38.70%) 239,882 L1-dcache-load-misses # 0.10% of all L1-dcache hits ( +- 0.49% ) (38.69%) 76,904 LLC-loads # 0.212 M/sec ( +- 0.96% ) (30.88%) 23,472 LLC-load-misses # 30.52% of all LL-cache hits ( +- 0.98% ) (30.85%) <not supported> L1-icache-loads 193,803 L1-icache-load-misses ( +- 0.53% ) (30.81%) 249,775,412 dTLB-loads # 689.282 M/sec ( +- 0.04% ) (30.81%) 2,176 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 1.53% ) (30.73%) 2,914 iTLB-loads # 0.008 M/sec ( +- 1.23% ) (30.59%) 978 iTLB-load-misses # 33.57% of all iTLB cache hits ( +- 1.29% ) (30.48%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 0.540236 +- 0.000454 seconds time elapsed ( +- 0.08% ) Second conducted measurement was on BPF kselftest flow_dissector that is using the progs/bpf_flow.c with 'repeat' argument on bpf_prog_test_run_xattr set also to 1000000. Without: Performance counter stats for './test_progs -t flow_dissector' (1024 runs): 1,355.18 msec task-clock # 0.989 CPUs utilized ( +- 0.11% ) 28 context-switches # 0.021 K/sec ( +- 0.49% ) 0 cpu-migrations # 0.000 K/sec ( +- 7.86% ) 125 page-faults # 0.093 K/sec ( +- 0.03% ) 4,609,228,676 cycles # 3.401 GHz ( +- 0.03% ) (30.70%) 6,735,946,489 instructions # 1.46 insn per cycle ( +- 0.01% ) (38.42%) 1,130,187,926 branches # 833.979 M/sec ( +- 0.01% ) (38.47%) 29,150,986 branch-misses # 2.58% of all branches ( +- 0.01% ) (38.51%) 1,737,548,851 L1-dcache-loads # 1282.158 M/sec ( +- 0.01% ) (38.56%) 659,851 L1-dcache-load-misses # 0.04% of all L1-dcache hits ( +- 0.78% ) (38.56%) 71,196 LLC-loads # 0.053 M/sec ( +- 0.97% ) (30.81%) 22,218 LLC-load-misses # 31.21% of all LL-cache hits ( +- 0.83% ) (30.79%) <not supported> L1-icache-loads 770,586 L1-icache-load-misses ( +- 0.67% ) (30.77%) 1,742,104,224 dTLB-loads # 1285.520 M/sec ( +- 0.01% ) (30.74%) 7,060 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 2.08% ) (30.72%) 4,282 iTLB-loads # 0.003 M/sec ( +- 16.98% ) (30.70%) 1,261 iTLB-load-misses # 29.46% of all iTLB cache hits ( +- 7.14% ) (30.68%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 1.37087 +- 0.00145 seconds time elapsed ( +- 0.11% ) With: Performance counter stats for './test_progs -t flow_dissector' (1024 runs): 1,385.56 msec task-clock # 0.989 CPUs utilized ( +- 0.06% ) 28 context-switches # 0.020 K/sec ( +- 0.48% ) 0 cpu-migrations # 0.000 K/sec ( +- 7.20% ) 125 page-faults # 0.091 K/sec ( +- 0.03% ) 4,642,599,630 cycles # 3.351 GHz ( +- 0.03% ) (30.69%) 6,901,261,616 instructions # 1.49 insn per cycle ( +- 0.01% ) (38.41%) 1,130,623,950 branches # 816.006 M/sec ( +- 0.01% ) (38.45%) 29,161,215 branch-misses # 2.58% of all branches ( +- 0.01% ) (38.50%) 1,796,850,740 L1-dcache-loads # 1296.842 M/sec ( +- 0.01% ) (38.55%) 673,908 L1-dcache-load-misses # 0.04% of all L1-dcache hits ( +- 0.89% ) (38.56%) 70,394 LLC-loads # 0.051 M/sec ( +- 1.08% ) (30.82%) 24,575 LLC-load-misses # 34.91% of all LL-cache hits ( +- 0.66% ) (30.80%) <not supported> L1-icache-loads 729,421 L1-icache-load-misses ( +- 0.85% ) (30.77%) 1,800,871,042 dTLB-loads # 1299.743 M/sec ( +- 0.01% ) (30.75%) 6,133 dTLB-load-misses # 0.00% of all dTLB cache hits ( +- 2.55% ) (30.73%) 1,998 iTLB-loads # 0.001 M/sec ( +- 9.36% ) (30.70%) 1,152 iTLB-load-misses # 57.66% of all iTLB cache hits ( +- 3.02% ) (30.68%) <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 1.400577 +- 0.000780 seconds time elapsed ( +- 0.06% ) Interesting fact is that I observed the huge iTLB-load-misses counts on clean kernel as well: flow_dissector test: 2,613 iTLB-loads # 0.002 M/sec ( +- 21.90% ) (30.70%) 16,483 iTLB-load-misses # 630.78% of all iTLB cache hits ( +- 79.63% ) (30.68%) tailcalls test: 1,996 iTLB-loads # 0.008 M/sec ( +- 1.08% ) (30.33%) 7,272 iTLB-load-misses # 364.24% of all iTLB cache hits ( +- 92.01% ) (30.21%) So probably Alexei's suspicion about get_random_int() doing strange things was right. ------------------------------------------------------------------- Thank you, Maciej [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/ [2]: https://lore.kernel.org/bpf/20200511143912.34086-1-maciej.fijalkowski@intel.com/ [3]: https://lore.kernel.org/bpf/20200702134930.4717-1-maciej.fijalkowski@intel.com/ [4]: https://lore.kernel.org/bpf/20200715233634.3868-1-maciej.fijalkowski@intel.com/ [5]: https://lore.kernel.org/netdev/20200721115321.3099-1-maciej.fijalkowski@intel.com/ Maciej Fijalkowski (6): bpf, x64: use %rcx instead of %rax for tail call retpolines bpf: propagate poke descriptors to subprograms bpf: rename poke descriptor's 'ip' member to 'tailcall_target' bpf, x64: rework pro/epilogue and tailcall handling in JIT bpf: allow for tailcalls in BPF subprograms for x64 JIT selftests: bpf: add dummy prog for bpf2bpf with tailcall arch/x86/include/asm/nospec-branch.h | 16 +- arch/x86/net/bpf_jit_comp.c | 249 +++++++++++++----- include/linux/bpf.h | 6 +- kernel/bpf/arraymap.c | 62 ++++- kernel/bpf/core.c | 3 +- kernel/bpf/verifier.c | 35 ++- .../selftests/bpf/prog_tests/tailcalls.c | 85 ++++++ tools/testing/selftests/bpf/progs/tailcall6.c | 38 +++ 8 files changed, 408 insertions(+), 86 deletions(-) create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c -- 2.20.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski @ 2020-07-24 17:35 ` Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 2/6] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski ` (4 subsequent siblings) 5 siblings, 0 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski Currently, %rax is used to store the jump target when BPF program is emitting the retpoline instructions that are handling the indirect tailcall. There is a plan to use %rax for different purpose, which is storing the tail call counter. In order to preserve this value across the tailcalls, adjust the BPF indirect tailcalls so that the target program will reside in %rcx and teach the retpoline instructions about new location of jump target. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- arch/x86/include/asm/nospec-branch.h | 16 ++++++++-------- arch/x86/net/bpf_jit_comp.c | 20 ++++++++++---------- 2 files changed, 18 insertions(+), 18 deletions(-) diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index e7752b4038ff..e491c3d9f227 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -314,19 +314,19 @@ static inline void mds_idle_clear_cpu_buffers(void) * lfence * jmp spec_trap * do_rop: - * mov %rax,(%rsp) for x86_64 + * mov %rcx,(%rsp) for x86_64 * mov %edx,(%esp) for x86_32 * retq * * Without retpolines configured: * - * jmp *%rax for x86_64 + * jmp *%rcx for x86_64 * jmp *%edx for x86_32 */ #ifdef CONFIG_RETPOLINE # ifdef CONFIG_X86_64 -# define RETPOLINE_RAX_BPF_JIT_SIZE 17 -# define RETPOLINE_RAX_BPF_JIT() \ +# define RETPOLINE_RCX_BPF_JIT_SIZE 17 +# define RETPOLINE_RCX_BPF_JIT() \ do { \ EMIT1_off32(0xE8, 7); /* callq do_rop */ \ /* spec_trap: */ \ @@ -334,7 +334,7 @@ do { \ EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \ EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \ /* do_rop: */ \ - EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \ + EMIT4(0x48, 0x89, 0x0C, 0x24); /* mov %rcx,(%rsp) */ \ EMIT1(0xC3); /* retq */ \ } while (0) # else /* !CONFIG_X86_64 */ @@ -352,9 +352,9 @@ do { \ # endif #else /* !CONFIG_RETPOLINE */ # ifdef CONFIG_X86_64 -# define RETPOLINE_RAX_BPF_JIT_SIZE 2 -# define RETPOLINE_RAX_BPF_JIT() \ - EMIT2(0xFF, 0xE0); /* jmp *%rax */ +# define RETPOLINE_RCX_BPF_JIT_SIZE 2 +# define RETPOLINE_RCX_BPF_JIT() \ + EMIT2(0xFF, 0xE1); /* jmp *%rcx */ # else /* !CONFIG_X86_64 */ # define RETPOLINE_EDX_BPF_JIT() \ EMIT2(0xFF, 0xE2) /* jmp *%edx */ diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 42b6709e6dc7..5b3f19799efb 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -370,7 +370,7 @@ static void emit_bpf_tail_call_indirect(u8 **pprog) EMIT2(0x89, 0xD2); /* mov edx, edx */ EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */ offsetof(struct bpf_array, map.max_entries)); -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */ +#define OFFSET1 (41 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */ EMIT2(X86_JBE, OFFSET1); /* jbe out */ label1 = cnt; @@ -380,36 +380,36 @@ static void emit_bpf_tail_call_indirect(u8 **pprog) */ EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */ EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE) +#define OFFSET2 (30 + RETPOLINE_RCX_BPF_JIT_SIZE) EMIT2(X86_JA, OFFSET2); /* ja out */ label2 = cnt; EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */ /* prog = array->ptrs[index]; */ - EMIT4_off32(0x48, 0x8B, 0x84, 0xD6, /* mov rax, [rsi + rdx * 8 + offsetof(...)] */ + EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */ offsetof(struct bpf_array, ptrs)); /* * if (prog == NULL) * goto out; */ - EMIT3(0x48, 0x85, 0xC0); /* test rax,rax */ -#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE) + EMIT3(0x48, 0x85, 0xC9); /* test rcx,rcx */ +#define OFFSET3 (8 + RETPOLINE_RCX_BPF_JIT_SIZE) EMIT2(X86_JE, OFFSET3); /* je out */ label3 = cnt; /* goto *(prog->bpf_func + prologue_size); */ - EMIT4(0x48, 0x8B, 0x40, /* mov rax, qword ptr [rax + 32] */ + EMIT4(0x48, 0x8B, 0x49, /* mov rcx, qword ptr [rcx + 32] */ offsetof(struct bpf_prog, bpf_func)); - EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE); /* add rax, prologue_size */ + EMIT4(0x48, 0x83, 0xC1, PROLOGUE_SIZE); /* add rcx, prologue_size */ /* - * Wow we're ready to jump into next BPF program + * Now we're ready to jump into next BPF program * rdi == ctx (1st arg) - * rax == prog->bpf_func + prologue_size + * rcx == prog->bpf_func + prologue_size */ - RETPOLINE_RAX_BPF_JIT(); + RETPOLINE_RCX_BPF_JIT(); /* out: */ BUILD_BUG_ON(cnt - label1 != OFFSET1); -- 2.20.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 bpf-next 2/6] bpf: propagate poke descriptors to subprograms 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski @ 2020-07-24 17:35 ` Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 3/6] bpf: rename poke descriptor's 'ip' member to 'tailcall_target' Maciej Fijalkowski ` (3 subsequent siblings) 5 siblings, 0 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski Previously, there was no need for poke descriptors being present in subprogram's bpf_prog_aux struct since tailcalls were simply not allowed in them. Each subprog is JITed independently so in order to enable JITing such subprograms, simply copy poke descriptors from main program to subprogram's poke tab. Add also subprog's aux struct to the BPF map poke_progs list by calling on it map_poke_track(). In case of any error, call the map_poke_untrack() on subprog's aux structs that have already been registered to prog array map. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- kernel/bpf/verifier.c | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9a6703bc3f36..f4955b4bf8a6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9900,6 +9900,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) { struct bpf_prog *prog = env->prog, **func, *tmp; int i, j, subprog_start, subprog_end = 0, len, subprog; + struct bpf_map *map_ptr; struct bpf_insn *insn; void *old_bpf_func; int err, num_exentries; @@ -9967,6 +9968,23 @@ static int jit_subprogs(struct bpf_verifier_env *env) func[i]->aux->btf = prog->aux->btf; func[i]->aux->func_info = prog->aux->func_info; + for (j = 0; j < prog->aux->size_poke_tab; j++) { + int ret; + + ret = bpf_jit_add_poke_descriptor(func[i], + &prog->aux->poke_tab[j]); + if (ret < 0) { + verbose(env, "adding tail call poke descriptor failed\n"); + goto out_free; + } + map_ptr = func[i]->aux->poke_tab[j].tail_call.map; + ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux); + if (ret < 0) { + verbose(env, "tracking tail call prog failed\n"); + goto out_free; + } + } + /* Use bpf_prog_F_tag to indicate functions in stack traces. * Long term would need debug info to populate names */ @@ -10060,9 +10078,16 @@ static int jit_subprogs(struct bpf_verifier_env *env) bpf_prog_free_unused_jited_linfo(prog); return 0; out_free: - for (i = 0; i < env->subprog_cnt; i++) - if (func[i]) - bpf_jit_free(func[i]); + for (i = 0; i < env->subprog_cnt; i++) { + if (!func[i]) + continue; + + for (j = 0; j < func[i]->aux->size_poke_tab; j++) { + map_ptr = func[i]->aux->poke_tab[j].tail_call.map; + map_ptr->ops->map_poke_untrack(map_ptr, func[i]->aux); + } + bpf_jit_free(func[i]); + } kfree(func); out_undo_insn: /* cleanup main prog to be interpreted */ -- 2.20.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 bpf-next 3/6] bpf: rename poke descriptor's 'ip' member to 'tailcall_target' 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 2/6] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski @ 2020-07-24 17:35 ` Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski ` (2 subsequent siblings) 5 siblings, 0 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski Reflect the actual purpose of poke->ip and rename it to poke->tailcall_target so that it will not the be confused with another poke target that will be introduced in next commit. While at it, do the same thing with poke->ip_stable - rename it to poke->tailcall_target_stable. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- arch/x86/net/bpf_jit_comp.c | 20 +++++++++++--------- include/linux/bpf.h | 4 ++-- kernel/bpf/arraymap.c | 17 +++++++++-------- 3 files changed, 22 insertions(+), 19 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 5b3f19799efb..44e64d406055 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -434,7 +434,7 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke, EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */ - poke->ip = image + (addr - X86_PATCH_SIZE); + poke->tailcall_target = image + (addr - X86_PATCH_SIZE); poke->adj_off = PROLOGUE_SIZE; memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE); @@ -453,7 +453,7 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog) for (i = 0; i < prog->aux->size_poke_tab; i++) { poke = &prog->aux->poke_tab[i]; - WARN_ON_ONCE(READ_ONCE(poke->ip_stable)); + WARN_ON_ONCE(READ_ONCE(poke->tailcall_target_stable)); if (poke->reason != BPF_POKE_REASON_TAIL_CALL) continue; @@ -464,18 +464,20 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog) if (target) { /* Plain memcpy is used when image is not live yet * and still not locked as read-only. Once poke - * location is active (poke->ip_stable), any parallel - * bpf_arch_text_poke() might occur still on the - * read-write image until we finally locked it as - * read-only. Both modifications on the given image - * are under text_mutex to avoid interference. + * location is active (poke->tailcall_target_stable), + * any parallel bpf_arch_text_poke() might occur + * still on the read-write image until we finally + * locked it as read-only. Both modifications on + * the given image are under text_mutex to avoid + * interference. */ - ret = __bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP, NULL, + ret = __bpf_arch_text_poke(poke->tailcall_target, + BPF_MOD_JUMP, NULL, (u8 *)target->bpf_func + poke->adj_off, false); BUG_ON(ret < 0); } - WRITE_ONCE(poke->ip_stable, true); + WRITE_ONCE(poke->tailcall_target_stable, true); mutex_unlock(&array->aux->poke_mutex); } } diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 72221aea1c60..aaa035519360 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -652,14 +652,14 @@ enum bpf_jit_poke_reason { /* Descriptor of pokes pointing /into/ the JITed image. */ struct bpf_jit_poke_descriptor { - void *ip; + void *tailcall_target; union { struct { struct bpf_map *map; u32 key; } tail_call; }; - bool ip_stable; + bool tailcall_target_stable; u8 adj_off; u16 reason; }; diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index c66e8273fccd..6fe6491fa17a 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -770,12 +770,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, * there could be danger of use after free otherwise. * 2) Initially when we start tracking aux, the program * is not JITed yet and also does not have a kallsyms - * entry. We skip these as poke->ip_stable is not - * active yet. The JIT will do the final fixup before - * setting it stable. The various poke->ip_stable are - * successively activated, so tail call updates can - * arrive from here while JIT is still finishing its - * final fixup for non-activated poke entries. + * entry. We skip these as poke->tailcall_target_stable + * is not active yet. The JIT will do the final fixup + * before setting it stable. The various + * poke->tailcall_target_stable are successively + * activated, so tail call updates can arrive from here + * while JIT is still finishing its final fixup for + * non-activated poke entries. * 3) On program teardown, the program's kallsym entry gets * removed out of RCU callback, but we can only untrack * from sleepable context, therefore bpf_arch_text_poke() @@ -792,7 +793,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, * 5) Any other error happening below from bpf_arch_text_poke() * is a unexpected bug. */ - if (!READ_ONCE(poke->ip_stable)) + if (!READ_ONCE(poke->tailcall_target_stable)) continue; if (poke->reason != BPF_POKE_REASON_TAIL_CALL) continue; @@ -800,7 +801,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, poke->tail_call.key != key) continue; - ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP, + ret = bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP, old ? (u8 *)old->bpf_func + poke->adj_off : NULL, new ? (u8 *)new->bpf_func + -- 2.20.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski ` (2 preceding siblings ...) 2020-07-24 17:35 ` [PATCH v5 bpf-next 3/6] bpf: rename poke descriptor's 'ip' member to 'tailcall_target' Maciej Fijalkowski @ 2020-07-24 17:35 ` Maciej Fijalkowski 2020-07-28 21:33 ` Daniel Borkmann 2020-07-28 22:07 ` Daniel Borkmann 2020-07-24 17:35 ` [PATCH v5 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski 5 siblings, 2 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski This commit serves two things: 1) it optimizes BPF prologue/epilogue generation 2) it makes possible to have tailcalls within BPF subprogram Both points are related to each other since without 1), 2) could not be achieved. In [1], Alexei says: "The prologue will look like: nop5 xor eax,eax // two new bytes if bpf_tail_call() is used in this // function push rbp mov rbp, rsp sub rsp, rounded_stack_depth push rax // zero init tail_call counter variable number of push rbx,r13,r14,r15 Then bpf_tail_call will pop variable number rbx,.. and final 'pop rax' Then 'add rsp, size_of_current_stack_frame' jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp' This way new function will set its own stack size and will init tail call counter with whatever value the parent had. If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'. Instead it would need to have 'nop2' in there." Implement that suggestion. Since the layout of stack is changed, tail call counter handling can not rely anymore on popping it to rbx just like it have been handled for constant prologue case and later overwrite of rbx with actual value of rbx pushed to stack. Therefore, let's use one of the register (%rcx) that is considered to be volatile/caller-saved and pop the value of tail call counter in there in the epilogue. Drop the BUILD_BUG_ON in emit_prologue and in emit_bpf_tail_call_indirect where instruction layout is not constant anymore. Introduce new poke target, 'tailcall_bypass' to poke descriptor that is dedicated for skipping the register pops and stack unwind that are generated right before the actual jump to target program. For case when the target program is not present, BPF program will skip the pop instructions and nop5 dedicated for jmpq $target. An example of such state when only R6 of callee saved registers is used by program: ffffffffc0513aa1: e9 0e 00 00 00 jmpq 0xffffffffc0513ab4 ffffffffc0513aa6: 5b pop %rbx ffffffffc0513aa7: 58 pop %rax ffffffffc0513aa8: 48 81 c4 00 00 00 00 add $0x0,%rsp ffffffffc0513aaf: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0513ab4: 48 89 df mov %rbx,%rdi When target program is inserted, the jump that was there to skip pops/nop5 will become the nop5, so CPU will go over pops and do the actual tailcall. One might ask why there simply can not be pushes after the nop5? In the following example snippet: ffffffffc037030c: 48 89 fb mov %rdi,%rbx (...) ffffffffc0370332: 5b pop %rbx ffffffffc0370333: 58 pop %rax ffffffffc0370334: 48 81 c4 00 00 00 00 add $0x0,%rsp ffffffffc037033b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffffc0370340: 48 81 ec 00 00 00 00 sub $0x0,%rsp ffffffffc0370347: 50 push %rax ffffffffc0370348: 53 push %rbx ffffffffc0370349: 48 89 df mov %rbx,%rdi ffffffffc037034c: e8 f7 21 00 00 callq 0xffffffffc0372548 There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall and jump target is not present. ctx is in %rbx register and BPF subprogram that we will call into on ffffffffc037034c is relying on it, e.g. it will pick ctx from there. Such code layout is therefore broken as we would overwrite the content of %rbx with the value that was pushed on the prologue. That is the reason for the 'bypass' approach. Special care needs to be taken during the install/update/remove of tailcall target. In case when target program is not present, the CPU must not execute the pop instructions that precede the tailcall. To address that, the following states can be defined: A nop, unwind, nop B nop, unwind, tail C skip, unwind, nop D skip, unwind, tail A is forbidden (lead to incorrectness). The state transitions between tailcall install/update/remove will work as follows: First install tail call f: C->D->B(f) * poke the tailcall, after that get rid of the skip Update tail call f to f': B(f)->B(f') * poke the tailcall (poke->tailcall_target) and do NOT touch the poke->tailcall_bypass Remove tail call: B(f')->C(f') * poke->tailcall_bypass is poked back to jump, then we wait the RCU grace period so that other programs will finish its execution and after that we are safe to remove the poke->tailcall_target Install new tail call (f''): C(f')->D(f'')->B(f''). * same as first step This way CPU can never be exposed to "unwind, tail" state. For regression checks, 'tailcalls' kselftest was executed: $ sudo ./test_progs -t tailcalls #64/1 tailcall_1:OK #64/2 tailcall_2:OK #64/3 tailcall_3:OK #64/4 tailcall_4:OK #64/5 tailcall_5:OK #64 tailcalls:OK Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED Tail call related cases from test_verifier kselftest are also working fine. Sample BPF programs that utilize tail calls (sockex3, tracex5) work properly as well. [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/ Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- arch/x86/net/bpf_jit_comp.c | 221 ++++++++++++++++++++++++++++-------- include/linux/bpf.h | 2 + kernel/bpf/arraymap.c | 47 +++++++- kernel/bpf/core.c | 3 +- 4 files changed, 220 insertions(+), 53 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 44e64d406055..880f283adb66 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -221,14 +221,48 @@ struct jit_context { /* Number of bytes emit_patch() needs to generate instructions */ #define X86_PATCH_SIZE 5 +/* Number of bytes that will be skipped on tailcall */ +#define X86_TAIL_CALL_OFFSET 11 -#define PROLOGUE_SIZE 25 +static void push_callee_regs(u8 **pprog, bool *callee_regs_used) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (callee_regs_used[0]) + EMIT1(0x53); /* push rbx */ + if (callee_regs_used[1]) + EMIT2(0x41, 0x55); /* push r13 */ + if (callee_regs_used[2]) + EMIT2(0x41, 0x56); /* push r14 */ + if (callee_regs_used[3]) + EMIT2(0x41, 0x57); /* push r15 */ + *pprog = prog; +} + +static void pop_callee_regs(u8 **pprog, bool *callee_regs_used) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (callee_regs_used[3]) + EMIT2(0x41, 0x5F); /* pop r15 */ + if (callee_regs_used[2]) + EMIT2(0x41, 0x5E); /* pop r14 */ + if (callee_regs_used[1]) + EMIT2(0x41, 0x5D); /* pop r13 */ + if (callee_regs_used[0]) + EMIT1(0x5B); /* pop rbx */ + *pprog = prog; +} /* - * Emit x86-64 prologue code for BPF program and check its size. - * bpf_tail_call helper will skip it while jumping into another program + * Emit x86-64 prologue code for BPF program. + * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes + * while jumping to another program */ -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf) +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, + bool tail_call) { u8 *prog = *pprog; int cnt = X86_PATCH_SIZE; @@ -238,19 +272,18 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf) */ memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt); prog += cnt; + if (!ebpf_from_cbpf) { + if (tail_call) + EMIT2(0x31, 0xC0); /* xor eax, eax */ + else + EMIT2(0x66, 0x90); /* nop2 */ + } EMIT1(0x55); /* push rbp */ EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */ /* sub rsp, rounded_stack_depth */ EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); - EMIT1(0x53); /* push rbx */ - EMIT2(0x41, 0x55); /* push r13 */ - EMIT2(0x41, 0x56); /* push r14 */ - EMIT2(0x41, 0x57); /* push r15 */ - if (!ebpf_from_cbpf) { - /* zero init tail_call_cnt */ - EMIT2(0x6a, 0x00); - BUILD_BUG_ON(cnt != PROLOGUE_SIZE); - } + if (!ebpf_from_cbpf && tail_call) + EMIT1(0x50); /* push rax */ *pprog = prog; } @@ -314,13 +347,14 @@ static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t, mutex_lock(&text_mutex); if (memcmp(ip, old_insn, X86_PATCH_SIZE)) goto out; + ret = 1; if (memcmp(ip, new_insn, X86_PATCH_SIZE)) { if (text_live) text_poke_bp(ip, new_insn, X86_PATCH_SIZE, NULL); else memcpy(ip, new_insn, X86_PATCH_SIZE); + ret = 0; } - ret = 0; out: mutex_unlock(&text_mutex); return ret; @@ -337,6 +371,22 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t, return __bpf_arch_text_poke(ip, t, old_addr, new_addr, true); } +static int get_pop_bytes(bool *callee_regs_used) +{ + int bytes = 0; + + if (callee_regs_used[3]) + bytes += 2; + if (callee_regs_used[2]) + bytes += 2; + if (callee_regs_used[1]) + bytes += 2; + if (callee_regs_used[0]) + bytes += 1; + + return bytes; +} + /* * Generate the following code: * @@ -351,12 +401,26 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t, * goto *(prog->bpf_func + prologue_size); * out: */ -static void emit_bpf_tail_call_indirect(u8 **pprog) +static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used, + u32 stack_depth) { + int tcc_off = -4 - round_up(stack_depth, 8); u8 *prog = *pprog; - int label1, label2, label3; + int pop_bytes = 0; + int off1 = 49; + int off2 = 38; + int off3 = 16; int cnt = 0; + /* count the additional bytes used for popping callee regs from stack + * that need to be taken into account for each of the offsets that + * are used for bailing out of the tail call + */ + pop_bytes = get_pop_bytes(callee_regs_used); + off1 += pop_bytes; + off2 += pop_bytes; + off3 += pop_bytes; + /* * rdi - pointer to ctx * rsi - pointer to bpf_array @@ -370,21 +434,19 @@ static void emit_bpf_tail_call_indirect(u8 **pprog) EMIT2(0x89, 0xD2); /* mov edx, edx */ EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */ offsetof(struct bpf_array, map.max_entries)); -#define OFFSET1 (41 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */ +#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */ EMIT2(X86_JBE, OFFSET1); /* jbe out */ - label1 = cnt; /* * if (tail_call_cnt > MAX_TAIL_CALL_CNT) * goto out; */ - EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */ + EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ -#define OFFSET2 (30 + RETPOLINE_RCX_BPF_JIT_SIZE) +#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE) EMIT2(X86_JA, OFFSET2); /* ja out */ - label2 = cnt; EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ - EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */ + EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ /* prog = array->ptrs[index]; */ EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */ @@ -394,48 +456,84 @@ static void emit_bpf_tail_call_indirect(u8 **pprog) * if (prog == NULL) * goto out; */ - EMIT3(0x48, 0x85, 0xC9); /* test rcx,rcx */ -#define OFFSET3 (8 + RETPOLINE_RCX_BPF_JIT_SIZE) + EMIT3(0x48, 0x85, 0xC9); /* test rcx,rcx */ +#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE) EMIT2(X86_JE, OFFSET3); /* je out */ - label3 = cnt; - /* goto *(prog->bpf_func + prologue_size); */ + *pprog = prog; + pop_callee_regs(pprog, callee_regs_used); + prog = *pprog; + + EMIT1(0x58); /* pop rax */ + EMIT3_off32(0x48, 0x81, 0xC4, /* add rsp, sd */ + round_up(stack_depth, 8)); + + /* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */ EMIT4(0x48, 0x8B, 0x49, /* mov rcx, qword ptr [rcx + 32] */ offsetof(struct bpf_prog, bpf_func)); - EMIT4(0x48, 0x83, 0xC1, PROLOGUE_SIZE); /* add rcx, prologue_size */ - + EMIT4(0x48, 0x83, 0xC1, /* add rcx, X86_TAIL_CALL_OFFSET */ + X86_TAIL_CALL_OFFSET); /* * Now we're ready to jump into next BPF program * rdi == ctx (1st arg) - * rcx == prog->bpf_func + prologue_size + * rcx == prog->bpf_func + X86_TAIL_CALL_OFFSET */ RETPOLINE_RCX_BPF_JIT(); /* out: */ - BUILD_BUG_ON(cnt - label1 != OFFSET1); - BUILD_BUG_ON(cnt - label2 != OFFSET2); - BUILD_BUG_ON(cnt - label3 != OFFSET3); *pprog = prog; } static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke, - u8 **pprog, int addr, u8 *image) + u8 **pprog, int addr, u8 *image, + bool *callee_regs_used, u32 stack_depth) { + int tcc_off = -4 - round_up(stack_depth, 8); u8 *prog = *pprog; + int pop_bytes = 0; + int off1 = 27; + int poke_off; int cnt = 0; + /* count the additional bytes used for popping callee regs to stack + * that need to be taken into account for jump offset that is used for + * bailing out from of the tail call when limit is reached + */ + pop_bytes = get_pop_bytes(callee_regs_used); + off1 += pop_bytes; + + /* + * total bytes for: + * - nop5/ jmpq $off + * - pop callee regs + * - sub rsp, $val + * - pop rax + */ + poke_off = X86_PATCH_SIZE + pop_bytes + 7 + 1; + /* * if (tail_call_cnt > MAX_TAIL_CALL_CNT) * goto out; */ - EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */ + EMIT2_off32(0x8B, 0x85, tcc_off); /* mov eax, dword ptr [rbp - tcc_off] */ EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */ - EMIT2(X86_JA, 14); /* ja out */ + EMIT2(X86_JA, off1); /* ja out */ EMIT3(0x83, 0xC0, 0x01); /* add eax, 1 */ - EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */ + EMIT2_off32(0x89, 0x85, tcc_off); /* mov dword ptr [rbp - tcc_off], eax */ + poke->tailcall_bypass = image + (addr - poke_off - X86_PATCH_SIZE); + poke->adj_off = X86_TAIL_CALL_OFFSET; poke->tailcall_target = image + (addr - X86_PATCH_SIZE); - poke->adj_off = PROLOGUE_SIZE; + poke->bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE; + + emit_jump(&prog, (u8 *)poke->tailcall_target + X86_PATCH_SIZE, + poke->tailcall_bypass); + + *pprog = prog; + pop_callee_regs(pprog, callee_regs_used); + prog = *pprog; + EMIT1(0x58); /* pop rax */ + EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE); prog += X86_PATCH_SIZE; @@ -476,6 +574,11 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog) (u8 *)target->bpf_func + poke->adj_off, false); BUG_ON(ret < 0); + ret = __bpf_arch_text_poke(poke->tailcall_bypass, + BPF_MOD_JUMP, + (u8 *)poke->tailcall_target + + X86_PATCH_SIZE, NULL, false); + BUG_ON(ret < 0); } WRITE_ONCE(poke->tailcall_target_stable, true); mutex_unlock(&array->aux->poke_mutex); @@ -654,19 +757,44 @@ static bool ex_handler_bpf(const struct exception_table_entry *x, return true; } +static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt, + bool *regs_used, bool *tail_call_seen) +{ + int i; + + for (i = 1; i <= insn_cnt; i++, insn++) { + if (insn->code == (BPF_JMP | BPF_TAIL_CALL)) + *tail_call_seen = true; + if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6) + regs_used[0] = true; + if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7) + regs_used[1] = true; + if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8) + regs_used[2] = true; + if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9) + regs_used[3] = true; + } +} + static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, int oldproglen, struct jit_context *ctx) { struct bpf_insn *insn = bpf_prog->insnsi; + bool callee_regs_used[4] = {}; int insn_cnt = bpf_prog->len; + bool tail_call_seen = false; bool seen_exit = false; u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY]; int i, cnt = 0, excnt = 0; int proglen = 0; u8 *prog = temp; + detect_reg_usage(insn, insn_cnt, callee_regs_used, + &tail_call_seen); + emit_prologue(&prog, bpf_prog->aux->stack_depth, - bpf_prog_was_classic(bpf_prog)); + bpf_prog_was_classic(bpf_prog), tail_call_seen); + push_callee_regs(&prog, callee_regs_used); addrs[0] = prog - temp; for (i = 1; i <= insn_cnt; i++, insn++) { @@ -1111,9 +1239,13 @@ xadd: if (is_imm8(insn->off)) case BPF_JMP | BPF_TAIL_CALL: if (imm32) emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1], - &prog, addrs[i], image); + &prog, addrs[i], image, + callee_regs_used, + bpf_prog->aux->stack_depth); else - emit_bpf_tail_call_indirect(&prog); + emit_bpf_tail_call_indirect(&prog, + callee_regs_used, + bpf_prog->aux->stack_depth); break; /* cond jump */ @@ -1296,12 +1428,9 @@ xadd: if (is_imm8(insn->off)) seen_exit = true; /* Update cleanup_addr */ ctx->cleanup_addr = proglen; - if (!bpf_prog_was_classic(bpf_prog)) - EMIT1(0x5B); /* get rid of tail_call_cnt */ - EMIT2(0x41, 0x5F); /* pop r15 */ - EMIT2(0x41, 0x5E); /* pop r14 */ - EMIT2(0x41, 0x5D); /* pop r13 */ - EMIT1(0x5B); /* pop rbx */ + pop_callee_regs(&prog, callee_regs_used); + if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen) + EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */ EMIT1(0xC9); /* leave */ EMIT1(0xC3); /* ret */ break; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index aaa035519360..14b796bf35de 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -653,6 +653,8 @@ enum bpf_jit_poke_reason { /* Descriptor of pokes pointing /into/ the JITed image. */ struct bpf_jit_poke_descriptor { void *tailcall_target; + void *tailcall_bypass; + void *bypass_addr; union { struct { struct bpf_map *map; diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 6fe6491fa17a..e9d62a60134b 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, struct bpf_prog *old, struct bpf_prog *new) { + u8 *old_addr, *new_addr, *old_bypass_addr; struct prog_poke_elem *elem; struct bpf_array_aux *aux; @@ -800,13 +801,47 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, if (poke->tail_call.map != map || poke->tail_call.key != key) continue; + /* protect against un-updated poke descriptors since + * we could fill them from subprog and the same desc + * is present on main's program poke tab + */ + if (!poke->tailcall_bypass || !poke->tailcall_target || + !poke->bypass_addr) + continue; - ret = bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP, - old ? (u8 *)old->bpf_func + - poke->adj_off : NULL, - new ? (u8 *)new->bpf_func + - poke->adj_off : NULL); - BUG_ON(ret < 0 && ret != -EINVAL); + old_bypass_addr = old ? NULL : poke->bypass_addr; + old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL; + new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL; + + if (new) { + ret = bpf_arch_text_poke(poke->tailcall_target, + BPF_MOD_JUMP, + old_addr, new_addr); + BUG_ON(ret < 0 && ret != -EINVAL); + if (!old) { + ret = bpf_arch_text_poke(poke->tailcall_bypass, + BPF_MOD_JUMP, + poke->bypass_addr, + NULL); + BUG_ON(ret < 0 && ret != -EINVAL); + } + } else { + ret = bpf_arch_text_poke(poke->tailcall_bypass, + BPF_MOD_JUMP, + old_bypass_addr, + poke->bypass_addr); + BUG_ON(ret < 0 && ret != -EINVAL); + /* let other CPUs finish the execution of program + * so that it will not possible to expose them + * to invalid nop, stack unwind, nop state + */ + if (!ret) + synchronize_rcu(); + ret = bpf_arch_text_poke(poke->tailcall_target, + BPF_MOD_JUMP, + old_addr, NULL); + BUG_ON(ret < 0 && ret != -EINVAL); + } } } } diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 7be02e555ab9..d86a35474d7b 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -773,7 +773,8 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog, if (size > poke_tab_max) return -ENOSPC; - if (poke->ip || poke->ip_stable || poke->adj_off) + if (poke->tailcall_target || poke->tailcall_target_stable || + poke->tailcall_bypass || poke->adj_off || poke->bypass_addr) return -EINVAL; switch (poke->reason) { -- 2.20.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-24 17:35 ` [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski @ 2020-07-28 21:33 ` Daniel Borkmann 2020-07-28 22:07 ` Daniel Borkmann 1 sibling, 0 replies; 13+ messages in thread From: Daniel Borkmann @ 2020-07-28 21:33 UTC (permalink / raw) To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson On 7/24/20 7:35 PM, Maciej Fijalkowski wrote: > This commit serves two things: > 1) it optimizes BPF prologue/epilogue generation > 2) it makes possible to have tailcalls within BPF subprogram [...] [...] > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 7be02e555ab9..d86a35474d7b 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -773,7 +773,8 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog, > > if (size > poke_tab_max) > return -ENOSPC; > - if (poke->ip || poke->ip_stable || poke->adj_off) > + if (poke->tailcall_target || poke->tailcall_target_stable || > + poke->tailcall_bypass || poke->adj_off || poke->bypass_addr) > return -EINVAL; Hmm, I thought we've been through this from prior review rounds, but these sort of changes break bisectability. You've already renamed the whole thing in patch 3/6 (poke->ip and the poke->ip_stable). So if you've applied up to patch 3, then build breaks right here. > switch (poke->reason) { > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-24 17:35 ` [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski 2020-07-28 21:33 ` Daniel Borkmann @ 2020-07-28 22:07 ` Daniel Borkmann 2020-07-29 16:10 ` Maciej Fijalkowski 1 sibling, 1 reply; 13+ messages in thread From: Daniel Borkmann @ 2020-07-28 22:07 UTC (permalink / raw) To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson On 7/24/20 7:35 PM, Maciej Fijalkowski wrote: > This commit serves two things: > 1) it optimizes BPF prologue/epilogue generation > 2) it makes possible to have tailcalls within BPF subprogram > > Both points are related to each other since without 1), 2) could not be > achieved. > [...] > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c > index 6fe6491fa17a..e9d62a60134b 100644 > --- a/kernel/bpf/arraymap.c > +++ b/kernel/bpf/arraymap.c > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > struct bpf_prog *old, > struct bpf_prog *new) > { > + u8 *old_addr, *new_addr, *old_bypass_addr; > struct prog_poke_elem *elem; > struct bpf_array_aux *aux; > > @@ -800,13 +801,47 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > if (poke->tail_call.map != map || > poke->tail_call.key != key) > continue; > + /* protect against un-updated poke descriptors since > + * we could fill them from subprog and the same desc > + * is present on main's program poke tab > + */ > + if (!poke->tailcall_bypass || !poke->tailcall_target || > + !poke->bypass_addr) > + continue; Thinking more about this, this check here is not sufficient. You basically need this here given you copy all poke descs over to each of the subprogs in jit_subprogs(). So for those that weren't handled by the subprog have the above addresses as NULL. But in jit_subprogs() once we filled out the target addresses for the bpf-in-bpf calls we loop over each subprog and do the extra/final pass in the JIT to complete the images. However, nothing protects bpf_tail_call_direct_fixup() as far as I can see from patching at the NULL addr if there is a target program loaded in the map at the given key. That will most likely blow up and hit the BUG_ON(). Instead of these above workarounds, did you try to go the path to only copy over the poke descs that are relevant for the individual subprog (but not all the others)? Thanks, Daniel ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-28 22:07 ` Daniel Borkmann @ 2020-07-29 16:10 ` Maciej Fijalkowski 2020-07-29 21:10 ` Maciej Fijalkowski 0 siblings, 1 reply; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-29 16:10 UTC (permalink / raw) To: Daniel Borkmann; +Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson On Wed, Jul 29, 2020 at 12:07:52AM +0200, Daniel Borkmann wrote: > On 7/24/20 7:35 PM, Maciej Fijalkowski wrote: > > This commit serves two things: > > 1) it optimizes BPF prologue/epilogue generation > > 2) it makes possible to have tailcalls within BPF subprogram > > > > Both points are related to each other since without 1), 2) could not be > > achieved. > > > [...] > > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c > > index 6fe6491fa17a..e9d62a60134b 100644 > > --- a/kernel/bpf/arraymap.c > > +++ b/kernel/bpf/arraymap.c > > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > > struct bpf_prog *old, > > struct bpf_prog *new) > > { > > + u8 *old_addr, *new_addr, *old_bypass_addr; > > struct prog_poke_elem *elem; > > struct bpf_array_aux *aux; > > @@ -800,13 +801,47 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > > if (poke->tail_call.map != map || > > poke->tail_call.key != key) > > continue; > > + /* protect against un-updated poke descriptors since > > + * we could fill them from subprog and the same desc > > + * is present on main's program poke tab > > + */ > > + if (!poke->tailcall_bypass || !poke->tailcall_target || > > + !poke->bypass_addr) > > + continue; > > Thinking more about this, this check here is not sufficient. You basically need this here > given you copy all poke descs over to each of the subprogs in jit_subprogs(). So for those > that weren't handled by the subprog have the above addresses as NULL. But in jit_subprogs() > once we filled out the target addresses for the bpf-in-bpf calls we loop over each subprog > and do the extra/final pass in the JIT to complete the images. However, nothing protects > bpf_tail_call_direct_fixup() as far as I can see from patching at the NULL addr if there is > a target program loaded in the map at the given key. That will most likely blow up and hit > the BUG_ON(). Okay, I agree with this reasoning but must admit that I don't understand when exactly during fixup the target prog for a given key might be already present? Could you shed some light on it? I recall that I was hitting this case in test_verifier kselftest, so maybe I'll dig onto that, but otherwise I didn't stumble upon this. > > Instead of these above workarounds, did you try to go the path to only copy over the poke > descs that are relevant for the individual subprog (but not all the others)? I was able to come up with something today, but I'd like to share it here and discuss whether you think it's correct approach before rushing with another revision. Generally in fixup_bpf_calls I store the index of tail call insn onto the generated poke descriptor, then in jit_subprogs() I check whether the given poke descriptor belongs to the current subprog by checking if that previously stored absolute index of tail call insn is in the scope of the insns of given subprog. Then the insn->imm needs to be updated with new poke descriptor slot so that while JITing we will be able to grab the proper poke desc - previously it worked because we emulated the main prog's poke tab state onto each subprog. This way the subprogs actually get only relevant poke descs, but I have a concern about the main prog's poke tab. Shouldn't we pull out the descs that have been copied to the subprog out of the main poke tab? If yes, then shouldn't the poke tab be converted to a linked list? The patch that I will merge onto the 2/6 if you would say that we can live with this approach, it's on top of this series: From 57baac74647a4627fe85bb3393365de906070eb1 Mon Sep 17 00:00:00 2001 From: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Date: Wed, 29 Jul 2020 17:51:59 +0200 Subject: [PATCH] bpf: propagate only those poke descs that are used in subprog Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- include/linux/bpf.h | 1 + kernel/bpf/verifier.c | 11 ++++++++++- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 14b796bf35de..74ab8ec2f2d3 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -664,6 +664,7 @@ struct bpf_jit_poke_descriptor { bool tailcall_target_stable; u8 adj_off; u16 reason; + u32 abs_insn_idx; }; /* reg_type info for ctx arguments */ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 3ea769555246..d6402dc05087 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9971,15 +9971,23 @@ static int jit_subprogs(struct bpf_verifier_env *env) func[i]->aux->func_info = prog->aux->func_info; for (j = 0; j < prog->aux->size_poke_tab; j++) { + u32 abs_insn_idx = prog->aux->poke_tab[j].abs_insn_idx; int ret; + if (!(abs_insn_idx >= subprog_start && + abs_insn_idx <= subprog_end)) + continue; + ret = bpf_jit_add_poke_descriptor(func[i], &prog->aux->poke_tab[j]); if (ret < 0) { verbose(env, "adding tail call poke descriptor failed\n"); goto out_free; } - map_ptr = func[i]->aux->poke_tab[j].tail_call.map; + + func[i]->insnsi[abs_insn_idx - subprog_start].imm = ret + 1; + + map_ptr = func[i]->aux->poke_tab[ret].tail_call.map; ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux); if (ret < 0) { verbose(env, "tracking tail call prog failed\n"); @@ -10309,6 +10317,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) .reason = BPF_POKE_REASON_TAIL_CALL, .tail_call.map = BPF_MAP_PTR(aux->map_ptr_state), .tail_call.key = bpf_map_key_immediate(aux), + .abs_insn_idx = i, }; ret = bpf_jit_add_poke_descriptor(prog, &desc); -- 2.20.1 > > Thanks, > Daniel ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-29 16:10 ` Maciej Fijalkowski @ 2020-07-29 21:10 ` Maciej Fijalkowski 2020-07-30 20:16 ` Daniel Borkmann 0 siblings, 1 reply; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-29 21:10 UTC (permalink / raw) To: Daniel Borkmann; +Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson On Wed, Jul 29, 2020 at 06:10:44PM +0200, Maciej Fijalkowski wrote: > On Wed, Jul 29, 2020 at 12:07:52AM +0200, Daniel Borkmann wrote: > > On 7/24/20 7:35 PM, Maciej Fijalkowski wrote: > > > This commit serves two things: > > > 1) it optimizes BPF prologue/epilogue generation > > > 2) it makes possible to have tailcalls within BPF subprogram > > > > > > Both points are related to each other since without 1), 2) could not be > > > achieved. > > > > > [...] > > > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c > > > index 6fe6491fa17a..e9d62a60134b 100644 > > > --- a/kernel/bpf/arraymap.c > > > +++ b/kernel/bpf/arraymap.c > > > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > > > struct bpf_prog *old, > > > struct bpf_prog *new) > > > { > > > + u8 *old_addr, *new_addr, *old_bypass_addr; > > > struct prog_poke_elem *elem; > > > struct bpf_array_aux *aux; > > > @@ -800,13 +801,47 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > > > if (poke->tail_call.map != map || > > > poke->tail_call.key != key) > > > continue; > > > + /* protect against un-updated poke descriptors since > > > + * we could fill them from subprog and the same desc > > > + * is present on main's program poke tab > > > + */ > > > + if (!poke->tailcall_bypass || !poke->tailcall_target || > > > + !poke->bypass_addr) > > > + continue; > > > > Thinking more about this, this check here is not sufficient. You basically need this here > > given you copy all poke descs over to each of the subprogs in jit_subprogs(). So for those > > that weren't handled by the subprog have the above addresses as NULL. But in jit_subprogs() > > once we filled out the target addresses for the bpf-in-bpf calls we loop over each subprog > > and do the extra/final pass in the JIT to complete the images. However, nothing protects > > bpf_tail_call_direct_fixup() as far as I can see from patching at the NULL addr if there is > > a target program loaded in the map at the given key. That will most likely blow up and hit > > the BUG_ON(). > > Okay, I agree with this reasoning but must admit that I don't understand > when exactly during fixup the target prog for a given key might be already > present? Could you shed some light on it? I recall that I was hitting > this case in test_verifier kselftest, so maybe I'll dig onto that, but > otherwise I didn't stumble upon this. > > > > > Instead of these above workarounds, did you try to go the path to only copy over the poke > > descs that are relevant for the individual subprog (but not all the others)? > > I was able to come up with something today, but I'd like to share it here > and discuss whether you think it's correct approach before rushing with > another revision. > > Generally in fixup_bpf_calls I store the index of tail call insn onto the > generated poke descriptor, then in jit_subprogs() I check whether the > given poke descriptor belongs to the current subprog by checking if that > previously stored absolute index of tail call insn is in the scope of the > insns of given subprog. Then the insn->imm needs to be updated with new > poke descriptor slot so that while JITing we will be able to grab the > proper poke desc - previously it worked because we emulated the main > prog's poke tab state onto each subprog. > > This way the subprogs actually get only relevant poke descs, but I have a > concern about the main prog's poke tab. Shouldn't we pull out the descs > that have been copied to the subprog out of the main poke tab? > > If yes, then shouldn't the poke tab be converted to a linked list? Thinking a bit more about this, I think we can just untrack the main prog's aux struct from prog array map. If there are subprograms then the main prog is treated as subprog 0 and with the logic below every poke desc will be propagated properly. I checked that doing: for (i = 0; i < prog->aux->size_poke_tab; i++) { map_ptr = prog->aux->poke_tab[i].tail_call.map; map_ptr->ops->map_poke_untrack(map_ptr, prog->aux); } after the initial JIT subprogs loop works just fine and we can drop the cumbersome check from map_poke_run(). wdyt? > > The patch that I will merge onto the 2/6 if you would say that we can live > with this approach, it's on top of this series: > > From 57baac74647a4627fe85bb3393365de906070eb1 Mon Sep 17 00:00:00 2001 > From: Maciej Fijalkowski <maciej.fijalkowski@intel.com> > Date: Wed, 29 Jul 2020 17:51:59 +0200 > Subject: [PATCH] bpf: propagate only those poke descs that are used in subprog > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> > --- > include/linux/bpf.h | 1 + > kernel/bpf/verifier.c | 11 ++++++++++- > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 14b796bf35de..74ab8ec2f2d3 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -664,6 +664,7 @@ struct bpf_jit_poke_descriptor { > bool tailcall_target_stable; > u8 adj_off; > u16 reason; > + u32 abs_insn_idx; > }; > > /* reg_type info for ctx arguments */ > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 3ea769555246..d6402dc05087 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -9971,15 +9971,23 @@ static int jit_subprogs(struct bpf_verifier_env *env) > func[i]->aux->func_info = prog->aux->func_info; > > for (j = 0; j < prog->aux->size_poke_tab; j++) { > + u32 abs_insn_idx = prog->aux->poke_tab[j].abs_insn_idx; > int ret; > > + if (!(abs_insn_idx >= subprog_start && > + abs_insn_idx <= subprog_end)) > + continue; > + > ret = bpf_jit_add_poke_descriptor(func[i], > &prog->aux->poke_tab[j]); > if (ret < 0) { > verbose(env, "adding tail call poke descriptor failed\n"); > goto out_free; > } > - map_ptr = func[i]->aux->poke_tab[j].tail_call.map; > + > + func[i]->insnsi[abs_insn_idx - subprog_start].imm = ret + 1; > + > + map_ptr = func[i]->aux->poke_tab[ret].tail_call.map; > ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux); > if (ret < 0) { > verbose(env, "tracking tail call prog failed\n"); > @@ -10309,6 +10317,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) > .reason = BPF_POKE_REASON_TAIL_CALL, > .tail_call.map = BPF_MAP_PTR(aux->map_ptr_state), > .tail_call.key = bpf_map_key_immediate(aux), > + .abs_insn_idx = i, > }; > > ret = bpf_jit_add_poke_descriptor(prog, &desc); > -- > 2.20.1 > > > > > Thanks, > > Daniel ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-29 21:10 ` Maciej Fijalkowski @ 2020-07-30 20:16 ` Daniel Borkmann 2020-07-30 22:58 ` Maciej Fijalkowski 0 siblings, 1 reply; 13+ messages in thread From: Daniel Borkmann @ 2020-07-30 20:16 UTC (permalink / raw) To: Maciej Fijalkowski; +Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson On 7/29/20 11:10 PM, Maciej Fijalkowski wrote: > On Wed, Jul 29, 2020 at 06:10:44PM +0200, Maciej Fijalkowski wrote: >> On Wed, Jul 29, 2020 at 12:07:52AM +0200, Daniel Borkmann wrote: >>> On 7/24/20 7:35 PM, Maciej Fijalkowski wrote: >>>> This commit serves two things: >>>> 1) it optimizes BPF prologue/epilogue generation >>>> 2) it makes possible to have tailcalls within BPF subprogram >>>> >>>> Both points are related to each other since without 1), 2) could not be >>>> achieved. >>>> >>> [...] >>>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c >>>> index 6fe6491fa17a..e9d62a60134b 100644 >>>> --- a/kernel/bpf/arraymap.c >>>> +++ b/kernel/bpf/arraymap.c >>>> @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, >>>> struct bpf_prog *old, >>>> struct bpf_prog *new) >>>> { >>>> + u8 *old_addr, *new_addr, *old_bypass_addr; >>>> struct prog_poke_elem *elem; >>>> struct bpf_array_aux *aux; >>>> @@ -800,13 +801,47 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, >>>> if (poke->tail_call.map != map || >>>> poke->tail_call.key != key) >>>> continue; >>>> + /* protect against un-updated poke descriptors since >>>> + * we could fill them from subprog and the same desc >>>> + * is present on main's program poke tab >>>> + */ >>>> + if (!poke->tailcall_bypass || !poke->tailcall_target || >>>> + !poke->bypass_addr) >>>> + continue; >>> >>> Thinking more about this, this check here is not sufficient. You basically need this here >>> given you copy all poke descs over to each of the subprogs in jit_subprogs(). So for those >>> that weren't handled by the subprog have the above addresses as NULL. But in jit_subprogs() >>> once we filled out the target addresses for the bpf-in-bpf calls we loop over each subprog >>> and do the extra/final pass in the JIT to complete the images. However, nothing protects >>> bpf_tail_call_direct_fixup() as far as I can see from patching at the NULL addr if there is >>> a target program loaded in the map at the given key. That will most likely blow up and hit >>> the BUG_ON(). >> >> Okay, I agree with this reasoning but must admit that I don't understand >> when exactly during fixup the target prog for a given key might be already >> present? Could you shed some light on it? I recall that I was hitting >> this case in test_verifier kselftest, so maybe I'll dig onto that, but >> otherwise I didn't stumble upon this. If the tail call map as first created and some programs attached to it, then you would hit this in bpf_tail_call_direct_fixup() for the subprogs where not all poke descs in the subprog's table belong to the actual prog. >>> Instead of these above workarounds, did you try to go the path to only copy over the poke >>> descs that are relevant for the individual subprog (but not all the others)? >> >> I was able to come up with something today, but I'd like to share it here >> and discuss whether you think it's correct approach before rushing with >> another revision. >> >> Generally in fixup_bpf_calls I store the index of tail call insn onto the >> generated poke descriptor, then in jit_subprogs() I check whether the >> given poke descriptor belongs to the current subprog by checking if that >> previously stored absolute index of tail call insn is in the scope of the >> insns of given subprog. Then the insn->imm needs to be updated with new >> poke descriptor slot so that while JITing we will be able to grab the >> proper poke desc - previously it worked because we emulated the main >> prog's poke tab state onto each subprog. >> >> This way the subprogs actually get only relevant poke descs, but I have a That sounds reasonable to me, yes, and the below code also looks good. >> concern about the main prog's poke tab. Shouldn't we pull out the descs >> that have been copied to the subprog out of the main poke tab? >> >> If yes, then shouldn't the poke tab be converted to a linked list? > > Thinking a bit more about this, I think we can just untrack the main > prog's aux struct from prog array map. If there are subprograms then the > main prog is treated as subprog 0 and with the logic below every poke desc > will be propagated properly. > > I checked that doing: > > for (i = 0; i < prog->aux->size_poke_tab; i++) { > map_ptr = prog->aux->poke_tab[i].tail_call.map; > > map_ptr->ops->map_poke_untrack(map_ptr, prog->aux); > } > > after the initial JIT subprogs loop works just fine and we can drop the > cumbersome check from map_poke_run(). > > wdyt? Yes, that is needed as well. Given we test on prog->aux for tracking, the subprogs enries will get added in prog_array_map_poke_track() individually given their aux pointer is different and untracking main progs aux then also works since it has no effect on subprogs. >> The patch that I will merge onto the 2/6 if you would say that we can live >> with this approach, it's on top of this series: >> >> From 57baac74647a4627fe85bb3393365de906070eb1 Mon Sep 17 00:00:00 2001 >> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com> >> Date: Wed, 29 Jul 2020 17:51:59 +0200 >> Subject: [PATCH] bpf: propagate only those poke descs that are used in subprog >> >> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> >> --- >> include/linux/bpf.h | 1 + >> kernel/bpf/verifier.c | 11 ++++++++++- >> 2 files changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 14b796bf35de..74ab8ec2f2d3 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -664,6 +664,7 @@ struct bpf_jit_poke_descriptor { >> bool tailcall_target_stable; >> u8 adj_off; >> u16 reason; >> + u32 abs_insn_idx; tiny nit: I think just calling insn_idx is sufficient. >> }; >> >> /* reg_type info for ctx arguments */ >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 3ea769555246..d6402dc05087 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -9971,15 +9971,23 @@ static int jit_subprogs(struct bpf_verifier_env *env) >> func[i]->aux->func_info = prog->aux->func_info; >> >> for (j = 0; j < prog->aux->size_poke_tab; j++) { >> + u32 abs_insn_idx = prog->aux->poke_tab[j].abs_insn_idx; >> int ret; >> >> + if (!(abs_insn_idx >= subprog_start && >> + abs_insn_idx <= subprog_end)) >> + continue; >> + >> ret = bpf_jit_add_poke_descriptor(func[i], >> &prog->aux->poke_tab[j]); >> if (ret < 0) { >> verbose(env, "adding tail call poke descriptor failed\n"); >> goto out_free; >> } >> - map_ptr = func[i]->aux->poke_tab[j].tail_call.map; >> + >> + func[i]->insnsi[abs_insn_idx - subprog_start].imm = ret + 1; >> + >> + map_ptr = func[i]->aux->poke_tab[ret].tail_call.map; >> ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux); >> if (ret < 0) { >> verbose(env, "tracking tail call prog failed\n"); >> @@ -10309,6 +10317,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) >> .reason = BPF_POKE_REASON_TAIL_CALL, >> .tail_call.map = BPF_MAP_PTR(aux->map_ptr_state), >> .tail_call.key = bpf_map_key_immediate(aux), >> + .abs_insn_idx = i, >> }; >> >> ret = bpf_jit_add_poke_descriptor(prog, &desc); >> -- >> 2.20.1 Lets ship it, thanks! Daniel ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT 2020-07-30 20:16 ` Daniel Borkmann @ 2020-07-30 22:58 ` Maciej Fijalkowski 0 siblings, 0 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-30 22:58 UTC (permalink / raw) To: Daniel Borkmann; +Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson On Thu, Jul 30, 2020 at 10:16:00PM +0200, Daniel Borkmann wrote: > On 7/29/20 11:10 PM, Maciej Fijalkowski wrote: > > On Wed, Jul 29, 2020 at 06:10:44PM +0200, Maciej Fijalkowski wrote: > > > On Wed, Jul 29, 2020 at 12:07:52AM +0200, Daniel Borkmann wrote: > > > > On 7/24/20 7:35 PM, Maciej Fijalkowski wrote: > > > > > This commit serves two things: > > > > > 1) it optimizes BPF prologue/epilogue generation > > > > > 2) it makes possible to have tailcalls within BPF subprogram > > > > > > > > > > Both points are related to each other since without 1), 2) could not be > > > > > achieved. > > > > > > > > > [...] > > > > > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c > > > > > index 6fe6491fa17a..e9d62a60134b 100644 > > > > > --- a/kernel/bpf/arraymap.c > > > > > +++ b/kernel/bpf/arraymap.c > > > > > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > > > > > struct bpf_prog *old, > > > > > struct bpf_prog *new) > > > > > { > > > > > + u8 *old_addr, *new_addr, *old_bypass_addr; > > > > > struct prog_poke_elem *elem; > > > > > struct bpf_array_aux *aux; > > > > > @@ -800,13 +801,47 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key, > > > > > if (poke->tail_call.map != map || > > > > > poke->tail_call.key != key) > > > > > continue; > > > > > + /* protect against un-updated poke descriptors since > > > > > + * we could fill them from subprog and the same desc > > > > > + * is present on main's program poke tab > > > > > + */ > > > > > + if (!poke->tailcall_bypass || !poke->tailcall_target || > > > > > + !poke->bypass_addr) > > > > > + continue; > > > > > > > > Thinking more about this, this check here is not sufficient. You basically need this here > > > > given you copy all poke descs over to each of the subprogs in jit_subprogs(). So for those > > > > that weren't handled by the subprog have the above addresses as NULL. But in jit_subprogs() > > > > once we filled out the target addresses for the bpf-in-bpf calls we loop over each subprog > > > > and do the extra/final pass in the JIT to complete the images. However, nothing protects > > > > bpf_tail_call_direct_fixup() as far as I can see from patching at the NULL addr if there is > > > > a target program loaded in the map at the given key. That will most likely blow up and hit > > > > the BUG_ON(). > > > > > > Okay, I agree with this reasoning but must admit that I don't understand > > > when exactly during fixup the target prog for a given key might be already > > > present? Could you shed some light on it? I recall that I was hitting > > > this case in test_verifier kselftest, so maybe I'll dig onto that, but > > > otherwise I didn't stumble upon this. > > If the tail call map as first created and some programs attached to it, then you > would hit this in bpf_tail_call_direct_fixup() for the subprogs where not all poke > descs in the subprog's table belong to the actual prog. Ah, got it. Thanks! I've cooked the v6 that includes what we agreed on here together with embarassing bisectability fix you spotted. Just waiting for build to be finished so that I'll be sure that there's no surprises after rebase. > > > > > Instead of these above workarounds, did you try to go the path to only copy over the poke > > > > descs that are relevant for the individual subprog (but not all the others)? > > > > > > I was able to come up with something today, but I'd like to share it here > > > and discuss whether you think it's correct approach before rushing with > > > another revision. > > > > > > Generally in fixup_bpf_calls I store the index of tail call insn onto the > > > generated poke descriptor, then in jit_subprogs() I check whether the > > > given poke descriptor belongs to the current subprog by checking if that > > > previously stored absolute index of tail call insn is in the scope of the > > > insns of given subprog. Then the insn->imm needs to be updated with new > > > poke descriptor slot so that while JITing we will be able to grab the > > > proper poke desc - previously it worked because we emulated the main > > > prog's poke tab state onto each subprog. > > > > > > This way the subprogs actually get only relevant poke descs, but I have a > > That sounds reasonable to me, yes, and the below code also looks good. > > > > concern about the main prog's poke tab. Shouldn't we pull out the descs > > > that have been copied to the subprog out of the main poke tab? > > > > > > If yes, then shouldn't the poke tab be converted to a linked list? > > > > Thinking a bit more about this, I think we can just untrack the main > > prog's aux struct from prog array map. If there are subprograms then the > > main prog is treated as subprog 0 and with the logic below every poke desc > > will be propagated properly. > > > > I checked that doing: > > > > for (i = 0; i < prog->aux->size_poke_tab; i++) { > > map_ptr = prog->aux->poke_tab[i].tail_call.map; > > > > map_ptr->ops->map_poke_untrack(map_ptr, prog->aux); > > } > > > > after the initial JIT subprogs loop works just fine and we can drop the > > cumbersome check from map_poke_run(). > > > > wdyt? > > Yes, that is needed as well. Given we test on prog->aux for tracking, the subprogs > enries will get added in prog_array_map_poke_track() individually given their aux > pointer is different and untracking main progs aux then also works since it has no > effect on subprogs. > > > > The patch that I will merge onto the 2/6 if you would say that we can live > > > with this approach, it's on top of this series: > > > > > > From 57baac74647a4627fe85bb3393365de906070eb1 Mon Sep 17 00:00:00 2001 > > > From: Maciej Fijalkowski <maciej.fijalkowski@intel.com> > > > Date: Wed, 29 Jul 2020 17:51:59 +0200 > > > Subject: [PATCH] bpf: propagate only those poke descs that are used in subprog > > > > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> > > > --- > > > include/linux/bpf.h | 1 + > > > kernel/bpf/verifier.c | 11 ++++++++++- > > > 2 files changed, 11 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index 14b796bf35de..74ab8ec2f2d3 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -664,6 +664,7 @@ struct bpf_jit_poke_descriptor { > > > bool tailcall_target_stable; > > > u8 adj_off; > > > u16 reason; > > > + u32 abs_insn_idx; > > tiny nit: I think just calling insn_idx is sufficient. > > > > }; > > > /* reg_type info for ctx arguments */ > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > > index 3ea769555246..d6402dc05087 100644 > > > --- a/kernel/bpf/verifier.c > > > +++ b/kernel/bpf/verifier.c > > > @@ -9971,15 +9971,23 @@ static int jit_subprogs(struct bpf_verifier_env *env) > > > func[i]->aux->func_info = prog->aux->func_info; > > > for (j = 0; j < prog->aux->size_poke_tab; j++) { > > > + u32 abs_insn_idx = prog->aux->poke_tab[j].abs_insn_idx; > > > int ret; > > > + if (!(abs_insn_idx >= subprog_start && > > > + abs_insn_idx <= subprog_end)) > > > + continue; > > > + > > > ret = bpf_jit_add_poke_descriptor(func[i], > > > &prog->aux->poke_tab[j]); > > > if (ret < 0) { > > > verbose(env, "adding tail call poke descriptor failed\n"); > > > goto out_free; > > > } > > > - map_ptr = func[i]->aux->poke_tab[j].tail_call.map; > > > + > > > + func[i]->insnsi[abs_insn_idx - subprog_start].imm = ret + 1; > > > + > > > + map_ptr = func[i]->aux->poke_tab[ret].tail_call.map; > > > ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux); > > > if (ret < 0) { > > > verbose(env, "tracking tail call prog failed\n"); > > > @@ -10309,6 +10317,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) > > > .reason = BPF_POKE_REASON_TAIL_CALL, > > > .tail_call.map = BPF_MAP_PTR(aux->map_ptr_state), > > > .tail_call.key = bpf_map_key_immediate(aux), > > > + .abs_insn_idx = i, > > > }; > > > ret = bpf_jit_add_poke_descriptor(prog, &desc); > > > -- > > > 2.20.1 > > Lets ship it, thanks! Super cool! :) > Daniel ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski ` (3 preceding siblings ...) 2020-07-24 17:35 ` [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski @ 2020-07-24 17:35 ` Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski 5 siblings, 0 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski Relax verifier's restriction that was meant to forbid tailcall usage when subprog count was higher than 1. Also, do not max out the stack depth of program that utilizes tailcalls. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- kernel/bpf/verifier.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f4955b4bf8a6..3ea769555246 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4176,10 +4176,12 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, case BPF_FUNC_tail_call: if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY) goto error; +#if !defined(CONFIG_X86_64) || !defined(CONFIG_BPF_JIT_ALWAYS_ON) if (env->subprog_cnt > 1) { verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n"); return -EINVAL; } +#endif break; case BPF_FUNC_perf_event_read: case BPF_FUNC_perf_event_output: @@ -10284,7 +10286,9 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) * the program array. */ prog->cb_access = 1; +#if !defined(CONFIG_X86_64) || !defined(CONFIG_BPF_JIT_ALWAYS_ON) env->prog->aux->stack_depth = MAX_BPF_STACK; +#endif env->prog->aux->max_pkt_offset = MAX_PACKET_OFF; /* mark bpf_tail_call as different opcode to avoid -- 2.20.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v5 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski ` (4 preceding siblings ...) 2020-07-24 17:35 ` [PATCH v5 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT Maciej Fijalkowski @ 2020-07-24 17:35 ` Maciej Fijalkowski 5 siblings, 0 replies; 13+ messages in thread From: Maciej Fijalkowski @ 2020-07-24 17:35 UTC (permalink / raw) To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski Introduce 6th test to taicalls kselftest that checks if tailcall can be correctly executed from the BPF subprogram. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> --- .../selftests/bpf/prog_tests/tailcalls.c | 85 +++++++++++++++++++ tools/testing/selftests/bpf/progs/tailcall6.c | 38 +++++++++ 2 files changed, 123 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c index bb8fe646dd9f..192c94896809 100644 --- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c +++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include <test_progs.h> +#include <network_helpers.h> /* test_tailcall_1 checks basic functionality by patching multiple locations * in a single program for a single tail call slot with nop->jmp, jmp->nop @@ -472,6 +473,88 @@ static void test_tailcall_5(void) bpf_object__close(obj); } +/* test_tailcall_6 purpose is to make sure that tailcalls are working + * correctly in correlation with BPF subprograms + */ +static void test_tailcall_6(void) +{ + int err, map_fd, prog_fd, main_fd, i; + struct bpf_map *prog_array; + struct bpf_program *prog; + struct bpf_object *obj; + __u32 retval, duration; + char prog_name[32]; + + err = bpf_prog_load("tailcall6.o", BPF_PROG_TYPE_SCHED_CLS, &obj, + &prog_fd); + if (CHECK_FAIL(err)) + return; + + prog = bpf_object__find_program_by_title(obj, "classifier"); + if (CHECK_FAIL(!prog)) + goto out; + + main_fd = bpf_program__fd(prog); + if (CHECK_FAIL(main_fd < 0)) + goto out; + + prog_array = bpf_object__find_map_by_name(obj, "jmp_table"); + if (CHECK_FAIL(!prog_array)) + goto out; + + map_fd = bpf_map__fd(prog_array); + if (CHECK_FAIL(map_fd < 0)) + goto out; + + /* nop -> jmp */ + for (i = 0; i < bpf_map__def(prog_array)->max_entries; i++) { + snprintf(prog_name, sizeof(prog_name), "classifier/%i", i); + + prog = bpf_object__find_program_by_title(obj, prog_name); + if (CHECK_FAIL(!prog)) + goto out; + + prog_fd = bpf_program__fd(prog); + if (CHECK_FAIL(prog_fd < 0)) + goto out; + + err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY); + if (CHECK_FAIL(err)) + goto out; + } + + err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0, + 0, &retval, &duration); + CHECK(err || retval != 1, "tailcall", + "err %d errno %d retval %d\n", err, errno, retval); + + /* jmp -> nop, call subprog that will do tailcall */ + i = 1; + err = bpf_map_delete_elem(map_fd, &i); + if (CHECK_FAIL(err)) + goto out; + + err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0, + 0, &retval, &duration); + CHECK(err || retval != 0, "tailcall", "err %d errno %d retval %d\n", + err, errno, retval); + + /* make sure that subprog can access ctx and entry prog that + * called this subprog can properly return + */ + i = 0; + err = bpf_map_delete_elem(map_fd, &i); + if (CHECK_FAIL(err)) + goto out; + + err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0, + 0, &retval, &duration); + CHECK(err || retval != 108, "tailcall", "err %d errno %d retval %d\n", + err, errno, retval); +out: + bpf_object__close(obj); +} + void test_tailcalls(void) { if (test__start_subtest("tailcall_1")) @@ -484,4 +567,6 @@ void test_tailcalls(void) test_tailcall_4(); if (test__start_subtest("tailcall_5")) test_tailcall_5(); + if (test__start_subtest("tailcall_6")) + test_tailcall_6(); } diff --git a/tools/testing/selftests/bpf/progs/tailcall6.c b/tools/testing/selftests/bpf/progs/tailcall6.c new file mode 100644 index 000000000000..e72ca5869b58 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/tailcall6.c @@ -0,0 +1,38 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> + +struct { + __uint(type, BPF_MAP_TYPE_PROG_ARRAY); + __uint(max_entries, 2); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(__u32)); +} jmp_table SEC(".maps"); + +#define TAIL_FUNC(x) \ + SEC("classifier/" #x) \ + int bpf_func_##x(struct __sk_buff *skb) \ + { \ + return x; \ + } +TAIL_FUNC(0) +TAIL_FUNC(1) + +static __attribute__ ((noinline)) +int subprog_tail(struct __sk_buff *skb) +{ + bpf_tail_call(skb, &jmp_table, 0); + + return skb->len * 2; +} + +SEC("classifier") +int entry(struct __sk_buff *skb) +{ + bpf_tail_call(skb, &jmp_table, 1); + + return subprog_tail(skb); +} + +char __license[] SEC("license") = "GPL"; +int _version SEC("version") = 1; -- 2.20.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads:[~2020-07-30 23:03 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-24 17:35 [PATCH v5 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 2/6] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 3/6] bpf: rename poke descriptor's 'ip' member to 'tailcall_target' Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski 2020-07-28 21:33 ` Daniel Borkmann 2020-07-28 22:07 ` Daniel Borkmann 2020-07-29 16:10 ` Maciej Fijalkowski 2020-07-29 21:10 ` Maciej Fijalkowski 2020-07-30 20:16 ` Daniel Borkmann 2020-07-30 22:58 ` Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT Maciej Fijalkowski 2020-07-24 17:35 ` [PATCH v5 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).