* [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

v5->v6:
- propagate only those poke descriptors that individual subprogram is
  actually using (Daniel)
- drop the cumbersome check if poke desc got filled in map_poke_run()
- move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
  to patch 3 to provide bisectability (Daniel)

v4->v5:
- simplify the fix from v3/v4 (Daniel)

v3->v4:
- be more careful around the fix from v3

v2->v3:
- call map_poke_untrack() on each previously registered subprog's aux
  struct to prog array if adding poke descriptor or tracking the aux
  struct failed (Daniel)

v1->v2:
- include the rax->rcx conversion in first patch, target prog needs to be
  placed in rcx in the tailcall indirect routine (Daniel)
- add error checks to routines that add poke descriptors to subprograms
  (Daniel)
- don't allow this optimization when arch is different than x64 and when JIT is
  disabled (Daniel)
- pull out the rename of poke desc members onto a separate patch (Daniel)
- add a new poke member to store the bypass address so that calculation of it
  won't be necessary
- avoid the special casing when old and new is null in map_poke_run (Daniel)
- do not sync RCU when bypass target was not patched (Daniel)
- do not introduce nop2 instruction to prologue for cBPF programs (Daniel)

RFC->v1:
- rename poke->ip/poke->ip_aux pair to
  poke->tailcall_target/poke->tailcall_bypass (Alexei)
- get rid of x86-specific code in prog_array_map_poke_run (Alexei)
- use synchronize_rcu in prog_array_map_poke_run so that other CPUs in
  the middle of execution will finish running the program and will not
  stumble upon the incorrect state (Alexei)
- update performance reports
- rebase


Hello,

Today, bpf2bpf calls and tailcalls are mutually exclusive. This set makes
them work together.

To give an overview of how this work started: I previously posted an RFC
that was targeted at getting rid of the push/pop instructions for callee
saved registers in the x86-64 JIT that are not used by the BPF program.
Alexei saw the potential to lift that work a bit further so that
tailcalls could work with BPF subprograms. More on that in [1], [2].

For previous discussions on the RFC version, see [3].
For v1, see [4]. v2 is in [5]; v3 can be ignored.
v5 is in [6].

In [1], Alexei says:

"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

So I gave that suggestion a shot. Patch 4 has a description of the
implementation.

Quick overview of patches:
Patch 1 changes the BPF retpoline to use %rcx instead of %rax to store
the address of the BPF tailcall target program.
Patch 2 propagates poke descriptors from the main program to each subprogram.
Patch 3 renames poke->ip to poke->tailcall_target.
Patch 4 is the main dish in this set. It implements the new prologue layout
that was suggested by Alexei and reworks tailcall handling.
Patch 5 relaxes the verifier's restriction on using tailcalls together with
BPF subprograms for the x64 JIT.
Patch 6 is the new selftest that proves tailcalls can be used from
within a BPF subprogram (an illustrative sketch follows below).


-------------------------------------------------------------------
prog_array_map_poke_run changes:

Before the tailcall, with the new prologue layout, the stack needs to be
unwound and the callee saved registers need to be popped. Instructions
responsible for that are generated, but they must not be executed if the
target program is not present. To address that, a new poke target,
'tailcall_bypass', is introduced to the poke descriptor; it is used for
skipping these instructions. This means there are two poke targets for
handling direct tailcalls. The simplified flow can be presented as three
sections:

1. skip call or nop (poke->tailcall_bypass)
2. stack unwind
3. call tail or nop (poke->tailcall_target)

Under specific circumstances, a CPU might be executing section 2 while
section 3 has not yet been updated (is still a nop), which would lead to
the problems mentioned in the patch 4 commit message. IOW, the unwind
section must not be executed if there is no target program.

We can define the following state matrix for that (courtesy of Bjorn):
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions between
tailcall install/update/remove work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to a jump, then we wait for an RCU
   grace period so that other CPUs finish executing the program, and
   after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way CPU can never be exposed to "unwind, tail" state.
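
Condensed, the ordering enforced in prog_array_map_poke_run() looks
roughly like the sketch below; error handling and the walk over tracked
poke descriptors from the actual patch 4 code are omitted:

	old_bypass_addr = old ? NULL : poke->bypass_addr;
	old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
	new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;

	if (new) {
		/* install/update: point tailcall_target at the new prog first */
		bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				   old_addr, new_addr);
		if (!old)
			/* first install: only now drop the bypass jump */
			bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
					   poke->bypass_addr, NULL);
	} else {
		/* removal: restore the bypass jump, let in-flight programs
		 * finish, and only then clear the tailcall target
		 */
		bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
				   old_bypass_addr, poke->bypass_addr);
		synchronize_rcu();
		bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				   old_addr, NULL);
	}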

-------------------------------------------------------------------
Performance impact:

All of this work, as stated in [2], started as a way to speed up AF-XDP
by dropping the push/pop of unused callee saved registers in the prologue
and epilogue. The impact is positive: a 15% performance gain.

However, it obviously has a negative impact on BPF programs that utilize
tailcalls; we think the magnitude is acceptable for the feature that this
set adds.

Below are the numbers from 'perf stat' for two scenarios.
The first scenario is the output of the command:

$ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls

The tailcalls kselftest was modified in the following way:
- only the tailcall1 subtest is enabled
- each of the bpf_prog_test_run() calls had its 'repeat' argument set to
  1000000 (see the sketch below)
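
A minimal sketch of that modification (not the exact diff; prog_fd and
the packet buffer are assumed to come from the existing tailcall1 test
setup):

#include <bpf/bpf.h>

static void run_tailcall1_repeated(int prog_fd, void *pkt, __u32 pkt_len)
{
	__u32 retval, duration;

	/* repeat = 1000000 instead of the default single run */
	bpf_prog_test_run(prog_fd, 1000000, pkt, pkt_len,
			  NULL, NULL, &retval, &duration);
}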

Numbers without this set:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

            261.68 msec task-clock                #    0.998 CPUs utilized            ( +-  0.12% )
                 5      context-switches          #    0.017 K/sec                    ( +-  0.54% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 23.37% )
               113      page-faults               #    0.433 K/sec                    ( +-  0.03% )
       877,156,850      cycles                    #    3.352 GHz                      ( +-  0.11% )  (30.31%)
     1,379,322,515      instructions              #    1.57  insn per cycle           ( +-  0.02% )  (38.17%)
       218,869,567      branches                  #  836.395 M/sec                    ( +-  0.01% )  (38.46%)
        11,954,183      branch-misses             #    5.46% of all branches          ( +-  0.01% )  (38.74%)
       283,350,418      L1-dcache-loads           # 1082.805 M/sec                    ( +-  0.01% )  (39.00%)
           156,323      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    ( +-  0.74% )  (39.05%)
            37,309      LLC-loads                 #    0.143 M/sec                    ( +-  1.02% )  (31.08%)
            15,263      LLC-load-misses           #   40.91% of all LL-cache hits     ( +-  0.90% )  (30.95%)
   <not supported>      L1-icache-loads
           130,427      L1-icache-load-misses                                         ( +-  0.45% )  (30.80%)
       285,369,370      dTLB-loads                # 1090.520 M/sec                    ( +-  0.01% )  (30.64%)
             1,154      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.26% )  (30.46%)
             2,015      iTLB-loads                #    0.008 M/sec                    ( +-  1.12% )  (30.31%)
               551      iTLB-load-misses          #   27.34% of all iTLB cache hits   ( +-  1.29% )  (30.20%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          0.262276 +- 0.000316 seconds time elapsed  ( +-  0.12% )

With:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

            362.37 msec task-clock                #    0.671 CPUs utilized            ( +-  0.11% )
                28      context-switches          #    0.077 K/sec                    ( +-  0.15% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +-  4.46% )
               113      page-faults               #    0.313 K/sec                    ( +-  0.03% )
       895,804,416      cycles                    #    2.472 GHz                      ( +-  0.08% )  (30.50%)
     1,339,401,398      instructions              #    1.50  insn per cycle           ( +-  0.04% )  (38.29%)
       302,718,849      branches                  #  835.385 M/sec                    ( +-  0.04% )  (38.39%)
        11,962,089      branch-misses             #    3.95% of all branches          ( +-  0.05% )  (38.56%)
       248,044,443      L1-dcache-loads           #  684.505 M/sec                    ( +-  0.03% )  (38.70%)
           239,882      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  0.49% )  (38.69%)
            76,904      LLC-loads                 #    0.212 M/sec                    ( +-  0.96% )  (30.88%)
            23,472      LLC-load-misses           #   30.52% of all LL-cache hits     ( +-  0.98% )  (30.85%)
   <not supported>      L1-icache-loads
           193,803      L1-icache-load-misses                                         ( +-  0.53% )  (30.81%)
       249,775,412      dTLB-loads                #  689.282 M/sec                    ( +-  0.04% )  (30.81%)
             2,176      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.53% )  (30.73%)
             2,914      iTLB-loads                #    0.008 M/sec                    ( +-  1.23% )  (30.59%)
               978      iTLB-load-misses          #   33.57% of all iTLB cache hits   ( +-  1.29% )  (30.48%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          0.540236 +- 0.000454 seconds time elapsed  ( +-  0.08% )

The second measurement was done on the BPF kselftest flow_dissector,
which uses progs/bpf_flow.c, with the 'repeat' argument of
bpf_prog_test_run_xattr also set to 1000000 (sketched below).
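
Again only as a sketch (field values other than 'repeat' follow the
existing flow_dissector test and are assumptions here):

#include <bpf/bpf.h>

static int run_flow_dissector_repeated(int prog_fd, void *pkt, __u32 pkt_len)
{
	struct bpf_prog_test_run_attr tattr = {
		.prog_fd      = prog_fd,
		.repeat       = 1000000,	/* instead of the default 1 */
		.data_in      = pkt,
		.data_size_in = pkt_len,
	};

	return bpf_prog_test_run_xattr(&tattr);
}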

Without:

Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,355.18 msec task-clock                #    0.989 CPUs utilized            ( +-  0.11% )
                28      context-switches          #    0.021 K/sec                    ( +-  0.49% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  7.86% )
               125      page-faults               #    0.093 K/sec                    ( +-  0.03% )
     4,609,228,676      cycles                    #    3.401 GHz                      ( +-  0.03% )  (30.70%)
     6,735,946,489      instructions              #    1.46  insn per cycle           ( +-  0.01% )  (38.42%)
     1,130,187,926      branches                  #  833.979 M/sec                    ( +-  0.01% )  (38.47%)
        29,150,986      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.51%)
     1,737,548,851      L1-dcache-loads           # 1282.158 M/sec                    ( +-  0.01% )  (38.56%)
           659,851      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.78% )  (38.56%)
            71,196      LLC-loads                 #    0.053 M/sec                    ( +-  0.97% )  (30.81%)
            22,218      LLC-load-misses           #   31.21% of all LL-cache hits     ( +-  0.83% )  (30.79%)
   <not supported>      L1-icache-loads
           770,586      L1-icache-load-misses                                         ( +-  0.67% )  (30.77%)
     1,742,104,224      dTLB-loads                # 1285.520 M/sec                    ( +-  0.01% )  (30.74%)
             7,060      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.08% )  (30.72%)
             4,282      iTLB-loads                #    0.003 M/sec                    ( +- 16.98% )  (30.70%)
             1,261      iTLB-load-misses          #   29.46% of all iTLB cache hits   ( +-  7.14% )  (30.68%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           1.37087 +- 0.00145 seconds time elapsed  ( +-  0.11% )

With:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,385.56 msec task-clock                #    0.989 CPUs utilized            ( +-  0.06% )
                28      context-switches          #    0.020 K/sec                    ( +-  0.48% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  7.20% )
               125      page-faults               #    0.091 K/sec                    ( +-  0.03% )
     4,642,599,630      cycles                    #    3.351 GHz                      ( +-  0.03% )  (30.69%)
     6,901,261,616      instructions              #    1.49  insn per cycle           ( +-  0.01% )  (38.41%)
     1,130,623,950      branches                  #  816.006 M/sec                    ( +-  0.01% )  (38.45%)
        29,161,215      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.50%)
     1,796,850,740      L1-dcache-loads           # 1296.842 M/sec                    ( +-  0.01% )  (38.55%)
           673,908      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.89% )  (38.56%)
            70,394      LLC-loads                 #    0.051 M/sec                    ( +-  1.08% )  (30.82%)
            24,575      LLC-load-misses           #   34.91% of all LL-cache hits     ( +-  0.66% )  (30.80%)
   <not supported>      L1-icache-loads
           729,421      L1-icache-load-misses                                         ( +-  0.85% )  (30.77%)
     1,800,871,042      dTLB-loads                # 1299.743 M/sec                    ( +-  0.01% )  (30.75%)
             6,133      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.55% )  (30.73%)
             1,998      iTLB-loads                #    0.001 M/sec                    ( +-  9.36% )  (30.70%)
             1,152      iTLB-load-misses          #   57.66% of all iTLB cache hits   ( +-  3.02% )  (30.68%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          1.400577 +- 0.000780 seconds time elapsed  ( +-  0.06% )


Interestingly, I observed the huge iTLB-load-misses counts on a clean
kernel as well:

flow_dissector test:
             2,613      iTLB-loads                #    0.002 M/sec ( +- 21.90% )  (30.70%)
            16,483      iTLB-load-misses          #  630.78% of all iTLB cache hits   ( +- 79.63% )  (30.68%)
tailcalls test:
             1,996      iTLB-loads                #    0.008 M/sec ( +-  1.08% )  (30.33%)
             7,272      iTLB-load-misses          #  364.24% of all iTLB cache hits   ( +- 92.01% )  (30.21%)

So Alexei's suspicion that get_random_int() is doing strange things was
probably right.

-------------------------------------------------------------------

Thank you,
Maciej

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
[2]: https://lore.kernel.org/bpf/20200511143912.34086-1-maciej.fijalkowski@intel.com/
[3]: https://lore.kernel.org/bpf/20200702134930.4717-1-maciej.fijalkowski@intel.com/
[4]: https://lore.kernel.org/bpf/20200715233634.3868-1-maciej.fijalkowski@intel.com/
[5]: https://lore.kernel.org/netdev/20200721115321.3099-1-maciej.fijalkowski@intel.com/
[6]: https://lore.kernel.org/netdev/20200724173557.5764-1-maciej.fijalkowski@intel.com/

Maciej Fijalkowski (6):
  bpf, x64: use %rcx instead of %rax for tail call retpolines
  bpf: propagate poke descriptors to subprograms
  bpf: rename poke descriptor's 'ip' member to 'tailcall_target'
  bpf, x64: rework pro/epilogue and tailcall handling in JIT
  bpf: allow for tailcalls in BPF subprograms for x64 JIT
  selftests: bpf: add dummy prog for bpf2bpf with tailcall

 arch/x86/include/asm/nospec-branch.h          |  16 +-
 arch/x86/net/bpf_jit_comp.c                   | 249 +++++++++++++-----
 include/linux/bpf.h                           |   7 +-
 kernel/bpf/arraymap.c                         |  55 +++-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/verifier.c                         |  57 +++-
 .../selftests/bpf/prog_tests/tailcalls.c      |  85 ++++++
 tools/testing/selftests/bpf/progs/tailcall6.c |  38 +++
 8 files changed, 424 insertions(+), 86 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

-- 
2.20.1



* [PATCH v6 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Currently, %rax is used to store the jump target when the BPF program
emits the retpoline instructions that handle the indirect tailcall.

There is a plan to use %rax for a different purpose: storing the tail
call counter. In order to preserve this value across tailcalls, adjust
the BPF indirect tailcalls so that the target program resides in %rcx,
and teach the retpoline instructions about the new location of the jump
target.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/include/asm/nospec-branch.h | 16 ++++++++--------
 arch/x86/net/bpf_jit_comp.c          | 20 ++++++++++----------
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index e7752b4038ff..e491c3d9f227 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -314,19 +314,19 @@ static inline void mds_idle_clear_cpu_buffers(void)
  *    lfence
  *    jmp spec_trap
  *  do_rop:
- *    mov %rax,(%rsp) for x86_64
+ *    mov %rcx,(%rsp) for x86_64
  *    mov %edx,(%esp) for x86_32
  *    retq
  *
  * Without retpolines configured:
  *
- *    jmp *%rax for x86_64
+ *    jmp *%rcx for x86_64
  *    jmp *%edx for x86_32
  */
 #ifdef CONFIG_RETPOLINE
 # ifdef CONFIG_X86_64
-#  define RETPOLINE_RAX_BPF_JIT_SIZE	17
-#  define RETPOLINE_RAX_BPF_JIT()				\
+#  define RETPOLINE_RCX_BPF_JIT_SIZE	17
+#  define RETPOLINE_RCX_BPF_JIT()				\
 do {								\
 	EMIT1_off32(0xE8, 7);	 /* callq do_rop */		\
 	/* spec_trap: */					\
@@ -334,7 +334,7 @@ do {								\
 	EMIT3(0x0F, 0xAE, 0xE8); /* lfence */			\
 	EMIT2(0xEB, 0xF9);       /* jmp spec_trap */		\
 	/* do_rop: */						\
-	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */	\
+	EMIT4(0x48, 0x89, 0x0C, 0x24); /* mov %rcx,(%rsp) */	\
 	EMIT1(0xC3);             /* retq */			\
 } while (0)
 # else /* !CONFIG_X86_64 */
@@ -352,9 +352,9 @@ do {								\
 # endif
 #else /* !CONFIG_RETPOLINE */
 # ifdef CONFIG_X86_64
-#  define RETPOLINE_RAX_BPF_JIT_SIZE	2
-#  define RETPOLINE_RAX_BPF_JIT()				\
-	EMIT2(0xFF, 0xE0);       /* jmp *%rax */
+#  define RETPOLINE_RCX_BPF_JIT_SIZE	2
+#  define RETPOLINE_RCX_BPF_JIT()				\
+	EMIT2(0xFF, 0xE1);       /* jmp *%rcx */
 # else /* !CONFIG_X86_64 */
 #  define RETPOLINE_EDX_BPF_JIT()				\
 	EMIT2(0xFF, 0xE2)        /* jmp *%edx */
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 42b6709e6dc7..5b3f19799efb 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -370,7 +370,7 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
 	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
 	      offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
+#define OFFSET1 (41 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
 	label1 = cnt;
 
@@ -380,36 +380,36 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	 */
 	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
+#define OFFSET2 (30 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
 	label2 = cnt;
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
 	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
 
 	/* prog = array->ptrs[index]; */
-	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
+	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
 		    offsetof(struct bpf_array, ptrs));
 
 	/*
 	 * if (prog == NULL)
 	 *	goto out;
 	 */
-	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
-#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
+	EMIT3(0x48, 0x85, 0xC9);		  /* test rcx,rcx */
+#define OFFSET3 (8 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JE, OFFSET3);                   /* je out */
 	label3 = cnt;
 
 	/* goto *(prog->bpf_func + prologue_size); */
-	EMIT4(0x48, 0x8B, 0x40,                   /* mov rax, qword ptr [rax + 32] */
+	EMIT4(0x48, 0x8B, 0x49,                   /* mov rcx, qword ptr [rcx + 32] */
 	      offsetof(struct bpf_prog, bpf_func));
-	EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
+	EMIT4(0x48, 0x83, 0xC1, PROLOGUE_SIZE);   /* add rcx, prologue_size */
 
 	/*
-	 * Wow we're ready to jump into next BPF program
+	 * Now we're ready to jump into next BPF program
 	 * rdi == ctx (1st arg)
-	 * rax == prog->bpf_func + prologue_size
+	 * rcx == prog->bpf_func + prologue_size
 	 */
-	RETPOLINE_RAX_BPF_JIT();
+	RETPOLINE_RCX_BPF_JIT();
 
 	/* out: */
 	BUILD_BUG_ON(cnt - label1 != OFFSET1);
-- 
2.20.1



* [PATCH v6 bpf-next 2/6] bpf: propagate poke descriptors to subprograms
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Previously, there was no need for poke descriptors to be present in a
subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
in them. Each subprog is JITed independently, so in order to enable
JITing of subprograms that use tailcalls, do the following:

- in fixup_bpf_calls(), store the index of the tailcall insn in the
  generated poke descriptor,
- then in jit_subprogs(), check whether a given poke descriptor belongs
  to the current subprog by checking if the previously stored absolute
  index of the tail call insn is within the range of the given subprog's
  insns,
- update insn->imm with the new poke descriptor slot so that the proper
  poke descriptor will be grabbed while JITing.

This way, each of the main program's poke descriptors is distributed
across the subprograms' poke descriptor arrays, so the main program's
descriptors can be untracked from the prog array map.

Also add the subprog's aux struct to the BPF map's poke_progs list by
calling map_poke_track() on it.

In case of any error, call map_poke_untrack() on the subprog aux structs
that have already been registered with the prog array map.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/verifier.c | 53 ++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 40c5e206ecf2..8d56b4fba2a6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -680,6 +680,7 @@ struct bpf_jit_poke_descriptor {
 	bool ip_stable;
 	u8 adj_off;
 	u16 reason;
+	u32 insn_idx;
 };
 
 /* reg_type info for ctx arguments */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b6ccfce3bf4c..96a339e24e93 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9982,6 +9982,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog, **func, *tmp;
 	int i, j, subprog_start, subprog_end = 0, len, subprog;
+	struct bpf_map *map_ptr;
 	struct bpf_insn *insn;
 	void *old_bpf_func;
 	int err, num_exentries;
@@ -10049,6 +10050,31 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->aux->btf = prog->aux->btf;
 		func[i]->aux->func_info = prog->aux->func_info;
 
+		for (j = 0; j < prog->aux->size_poke_tab; j++) {
+			u32 insn_idx = prog->aux->poke_tab[j].insn_idx;
+			int ret;
+
+			if (!(insn_idx >= subprog_start &&
+			      insn_idx <= subprog_end))
+				continue;
+
+			ret = bpf_jit_add_poke_descriptor(func[i],
+							  &prog->aux->poke_tab[j]);
+			if (ret < 0) {
+				verbose(env, "adding tail call poke descriptor failed\n");
+				goto out_free;
+			}
+
+			func[i]->insnsi[insn_idx - subprog_start].imm = ret + 1;
+
+			map_ptr = func[i]->aux->poke_tab[ret].tail_call.map;
+			ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux);
+			if (ret < 0) {
+				verbose(env, "tracking tail call prog failed\n");
+				goto out_free;
+			}
+		}
+
 		/* Use bpf_prog_F_tag to indicate functions in stack traces.
 		 * Long term would need debug info to populate names
 		 */
@@ -10074,6 +10100,19 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		}
 		cond_resched();
 	}
+
+	/* Untrack main program's aux structs so that during map_poke_run()
+	 * we will not stumble upon the unfilled poke descriptors; each
+	 * of the main program's poke descs got distributed across subprogs
+	 * and got tracked onto map, so we are sure that none of them will
+	 * be missed after the operation below
+	 */
+	for (i = 0; i < prog->aux->size_poke_tab; i++) {
+		map_ptr = prog->aux->poke_tab[i].tail_call.map;
+
+		map_ptr->ops->map_poke_untrack(map_ptr, prog->aux);
+	}
+
 	/* at this point all bpf functions were successfully JITed
 	 * now populate all bpf_calls with correct addresses and
 	 * run last pass of JIT
@@ -10142,9 +10181,16 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	bpf_prog_free_unused_jited_linfo(prog);
 	return 0;
 out_free:
-	for (i = 0; i < env->subprog_cnt; i++)
-		if (func[i])
-			bpf_jit_free(func[i]);
+	for (i = 0; i < env->subprog_cnt; i++) {
+		if (!func[i])
+			continue;
+
+		for (j = 0; j < func[i]->aux->size_poke_tab; j++) {
+			map_ptr = func[i]->aux->poke_tab[j].tail_call.map;
+			map_ptr->ops->map_poke_untrack(map_ptr, func[i]->aux);
+		}
+		bpf_jit_free(func[i]);
+	}
 	kfree(func);
 out_undo_insn:
 	/* cleanup main prog to be interpreted */
@@ -10362,6 +10408,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 					.reason = BPF_POKE_REASON_TAIL_CALL,
 					.tail_call.map = BPF_MAP_PTR(aux->map_ptr_state),
 					.tail_call.key = bpf_map_key_immediate(aux),
+					.insn_idx = i,
 				};
 
 				ret = bpf_jit_add_poke_descriptor(prog, &desc);
-- 
2.20.1



* [PATCH v6 bpf-next 3/6] bpf: rename poke descriptor's 'ip' member to 'tailcall_target'
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Reflect the actual purpose of poke->ip and rename it to
poke->tailcall_target so that it will not be confused with another poke
target that will be introduced in the next commit.

While at it, do the same thing with poke->ip_stable - rename it to
poke->tailcall_target_stable.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 20 +++++++++++---------
 include/linux/bpf.h         |  4 ++--
 kernel/bpf/arraymap.c       | 17 +++++++++--------
 kernel/bpf/core.c           |  3 ++-
 4 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 5b3f19799efb..44e64d406055 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -434,7 +434,7 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
 	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
 
-	poke->ip = image + (addr - X86_PATCH_SIZE);
+	poke->tailcall_target = image + (addr - X86_PATCH_SIZE);
 	poke->adj_off = PROLOGUE_SIZE;
 
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
@@ -453,7 +453,7 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
 
 	for (i = 0; i < prog->aux->size_poke_tab; i++) {
 		poke = &prog->aux->poke_tab[i];
-		WARN_ON_ONCE(READ_ONCE(poke->ip_stable));
+		WARN_ON_ONCE(READ_ONCE(poke->tailcall_target_stable));
 
 		if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
 			continue;
@@ -464,18 +464,20 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
 		if (target) {
 			/* Plain memcpy is used when image is not live yet
 			 * and still not locked as read-only. Once poke
-			 * location is active (poke->ip_stable), any parallel
-			 * bpf_arch_text_poke() might occur still on the
-			 * read-write image until we finally locked it as
-			 * read-only. Both modifications on the given image
-			 * are under text_mutex to avoid interference.
+			 * location is active (poke->tailcall_target_stable),
+			 * any parallel bpf_arch_text_poke() might occur
+			 * still on the read-write image until we finally
+			 * locked it as read-only. Both modifications on
+			 * the given image are under text_mutex to avoid
+			 * interference.
 			 */
-			ret = __bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP, NULL,
+			ret = __bpf_arch_text_poke(poke->tailcall_target,
+						   BPF_MOD_JUMP, NULL,
 						   (u8 *)target->bpf_func +
 						   poke->adj_off, false);
 			BUG_ON(ret < 0);
 		}
-		WRITE_ONCE(poke->ip_stable, true);
+		WRITE_ONCE(poke->tailcall_target_stable, true);
 		mutex_unlock(&array->aux->poke_mutex);
 	}
 }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8d56b4fba2a6..37a855d54162 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -670,14 +670,14 @@ enum bpf_jit_poke_reason {
 
 /* Descriptor of pokes pointing /into/ the JITed image. */
 struct bpf_jit_poke_descriptor {
-	void *ip;
+	void *tailcall_target;
 	union {
 		struct {
 			struct bpf_map *map;
 			u32 key;
 		} tail_call;
 	};
-	bool ip_stable;
+	bool tailcall_target_stable;
 	u8 adj_off;
 	u16 reason;
 	u32 insn_idx;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 8ff419b632a6..7e36b9a1827d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -908,12 +908,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			 *    there could be danger of use after free otherwise.
 			 * 2) Initially when we start tracking aux, the program
 			 *    is not JITed yet and also does not have a kallsyms
-			 *    entry. We skip these as poke->ip_stable is not
-			 *    active yet. The JIT will do the final fixup before
-			 *    setting it stable. The various poke->ip_stable are
-			 *    successively activated, so tail call updates can
-			 *    arrive from here while JIT is still finishing its
-			 *    final fixup for non-activated poke entries.
+			 *    entry. We skip these as poke->tailcall_target_stable
+			 *    is not active yet. The JIT will do the final fixup
+			 *    before setting it stable. The various
+			 *    poke->tailcall_target_stable are successively
+			 *    activated, so tail call updates can arrive from here
+			 *    while JIT is still finishing its final fixup for
+			 *    non-activated poke entries.
 			 * 3) On program teardown, the program's kallsym entry gets
 			 *    removed out of RCU callback, but we can only untrack
 			 *    from sleepable context, therefore bpf_arch_text_poke()
@@ -930,7 +931,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			 * 5) Any other error happening below from bpf_arch_text_poke()
 			 *    is a unexpected bug.
 			 */
-			if (!READ_ONCE(poke->ip_stable))
+			if (!READ_ONCE(poke->tailcall_target_stable))
 				continue;
 			if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
 				continue;
@@ -938,7 +939,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			    poke->tail_call.key != key)
 				continue;
 
-			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
+			ret = bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
 						 old ? (u8 *)old->bpf_func +
 						 poke->adj_off : NULL,
 						 new ? (u8 *)new->bpf_func +
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index bde93344164d..586f3a7330c3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -773,7 +773,8 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 
 	if (size > poke_tab_max)
 		return -ENOSPC;
-	if (poke->ip || poke->ip_stable || poke->adj_off)
+	if (poke->tailcall_target || poke->tailcall_target_stable ||
+	    poke->adj_off)
 		return -EINVAL;
 
 	switch (poke->reason) {
-- 
2.20.1



* [PATCH v6 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

This commit serves two purposes:
1) it optimizes BPF prologue/epilogue generation
2) it makes it possible to have tailcalls within BPF subprograms

Both points are related to each other since without 1), 2) could not be
achieved.

In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call
counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

Implement that suggestion.

Since the stack layout has changed, tail call counter handling can no
longer rely on popping it to rbx, as was done for the constant prologue
case, and later overwriting rbx with the actual value of rbx that was
pushed to the stack. Therefore, use one of the registers (%rcx) that is
considered volatile/caller-saved and pop the tail call counter value
into it in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue and in
emit_bpf_tail_call_indirect, where the instruction layout is not
constant anymore.

Introduce a new poke target, 'tailcall_bypass', to the poke descriptor;
it is dedicated to skipping the register pops and stack unwind that are
generated right before the actual jump to the target program.
For the case when the target program is not present, the BPF program
will skip the pop instructions and the nop5 dedicated for jmpq $target.
An example of such state when only R6 of the callee saved registers is
used by the program:

ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
ffffffffc0513aa6:       5b                      pop    %rbx
ffffffffc0513aa7:       58                      pop    %rax
ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

When the target program is inserted, the jump that was there to skip the
pops/nop5 becomes a nop5, so the CPU will go through the pops and do the
actual tailcall.

One might ask why there simply cannot be pushes after the nop5.
Consider the following example snippet:

ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
(...)
ffffffffc0370332:       5b                      pop    %rbx
ffffffffc0370333:       58                      pop    %rax
ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
ffffffffc0370347:       50                      push   %rax
ffffffffc0370348:       53                      push   %rbx
ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548

There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall,
and the jump target is not present. ctx is in the %rbx register and the
BPF subprogram that we call into at ffffffffc037034c relies on it, i.e.
it will pick up ctx from there. Such a code layout is therefore broken,
as we would overwrite the contents of %rbx with the value that was
pushed in the prologue. That is the reason for the 'bypass' approach.

Special care needs to be taken during the install/update/remove of the
tailcall target. In the case when the target program is not present, the
CPU must not execute the pop instructions that precede the tailcall.

To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions between
tailcall install/update/remove work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to a jump, then we wait for an RCU
   grace period so that other CPUs finish executing the program, and
   after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way CPU can never be exposed to "unwind, tail" state.

For regression checks, 'tailcalls' kselftest was executed:
$ sudo ./test_progs -t tailcalls
 #64/1 tailcall_1:OK
 #64/2 tailcall_2:OK
 #64/3 tailcall_3:OK
 #64/4 tailcall_4:OK
 #64/5 tailcall_5:OK
 #64 tailcalls:OK
Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 221 ++++++++++++++++++++++++++++--------
 include/linux/bpf.h         |   2 +
 kernel/bpf/arraymap.c       |  40 ++++++-
 kernel/bpf/core.c           |   2 +-
 4 files changed, 212 insertions(+), 53 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 44e64d406055..880f283adb66 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -221,14 +221,48 @@ struct jit_context {
 
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
+/* Number of bytes that will be skipped on tailcall */
+#define X86_TAIL_CALL_OFFSET	11
 
-#define PROLOGUE_SIZE		25
+static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[0])
+		EMIT1(0x53);         /* push rbx */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x55);   /* push r13 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x56);   /* push r14 */
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x57);   /* push r15 */
+	*pprog = prog;
+}
+
+static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x5F);   /* pop r15 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x5E);   /* pop r14 */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x5D);   /* pop r13 */
+	if (callee_regs_used[0])
+		EMIT1(0x5B);         /* pop rbx */
+	*pprog = prog;
+}
 
 /*
- * Emit x86-64 prologue code for BPF program and check its size.
- * bpf_tail_call helper will skip it while jumping into another program
+ * Emit x86-64 prologue code for BPF program.
+ * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
+ * while jumping to another program
  */
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
+static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
+			  bool tail_call)
 {
 	u8 *prog = *pprog;
 	int cnt = X86_PATCH_SIZE;
@@ -238,19 +272,18 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
 	 */
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
 	prog += cnt;
+	if (!ebpf_from_cbpf) {
+		if (tail_call)
+			EMIT2(0x31, 0xC0); /* xor eax, eax */
+		else
+			EMIT2(0x66, 0x90); /* nop2 */
+	}
 	EMIT1(0x55);             /* push rbp */
 	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	/* sub rsp, rounded_stack_depth */
 	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	EMIT1(0x53);             /* push rbx */
-	EMIT2(0x41, 0x55);       /* push r13 */
-	EMIT2(0x41, 0x56);       /* push r14 */
-	EMIT2(0x41, 0x57);       /* push r15 */
-	if (!ebpf_from_cbpf) {
-		/* zero init tail_call_cnt */
-		EMIT2(0x6a, 0x00);
-		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
-	}
+	if (!ebpf_from_cbpf && tail_call)
+		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
 
@@ -314,13 +347,14 @@ static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	mutex_lock(&text_mutex);
 	if (memcmp(ip, old_insn, X86_PATCH_SIZE))
 		goto out;
+	ret = 1;
 	if (memcmp(ip, new_insn, X86_PATCH_SIZE)) {
 		if (text_live)
 			text_poke_bp(ip, new_insn, X86_PATCH_SIZE, NULL);
 		else
 			memcpy(ip, new_insn, X86_PATCH_SIZE);
+		ret = 0;
 	}
-	ret = 0;
 out:
 	mutex_unlock(&text_mutex);
 	return ret;
@@ -337,6 +371,22 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr, true);
 }
 
+static int get_pop_bytes(bool *callee_regs_used)
+{
+	int bytes = 0;
+
+	if (callee_regs_used[3])
+		bytes += 2;
+	if (callee_regs_used[2])
+		bytes += 2;
+	if (callee_regs_used[1])
+		bytes += 2;
+	if (callee_regs_used[0])
+		bytes += 1;
+
+	return bytes;
+}
+
 /*
  * Generate the following code:
  *
@@ -351,12 +401,26 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
  *   goto *(prog->bpf_func + prologue_size);
  * out:
  */
-static void emit_bpf_tail_call_indirect(u8 **pprog)
+static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
+					u32 stack_depth)
 {
+	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog;
-	int label1, label2, label3;
+	int pop_bytes = 0;
+	int off1 = 49;
+	int off2 = 38;
+	int off3 = 16;
 	int cnt = 0;
 
+	/* count the additional bytes used for popping callee regs from stack
+	 * that need to be taken into account for each of the offsets that
+	 * are used for bailing out of the tail call
+	 */
+	pop_bytes = get_pop_bytes(callee_regs_used);
+	off1 += pop_bytes;
+	off2 += pop_bytes;
+	off3 += pop_bytes;
+
 	/*
 	 * rdi - pointer to ctx
 	 * rsi - pointer to bpf_array
@@ -370,21 +434,19 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
 	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
 	      offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 (41 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
+#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
-	label1 = cnt;
 
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 (30 + RETPOLINE_RCX_BPF_JIT_SIZE)
+#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
-	label2 = cnt;
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -394,48 +456,84 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	 * if (prog == NULL)
 	 *	goto out;
 	 */
-	EMIT3(0x48, 0x85, 0xC9);		  /* test rcx,rcx */
-#define OFFSET3 (8 + RETPOLINE_RCX_BPF_JIT_SIZE)
+	EMIT3(0x48, 0x85, 0xC9);                  /* test rcx,rcx */
+#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JE, OFFSET3);                   /* je out */
-	label3 = cnt;
 
-	/* goto *(prog->bpf_func + prologue_size); */
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
+
+	EMIT1(0x58);                              /* pop rax */
+	EMIT3_off32(0x48, 0x81, 0xC4,             /* add rsp, sd */
+		    round_up(stack_depth, 8));
+
+	/* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */
 	EMIT4(0x48, 0x8B, 0x49,                   /* mov rcx, qword ptr [rcx + 32] */
 	      offsetof(struct bpf_prog, bpf_func));
-	EMIT4(0x48, 0x83, 0xC1, PROLOGUE_SIZE);   /* add rcx, prologue_size */
-
+	EMIT4(0x48, 0x83, 0xC1,                   /* add rcx, X86_TAIL_CALL_OFFSET */
+	      X86_TAIL_CALL_OFFSET);
 	/*
 	 * Now we're ready to jump into next BPF program
 	 * rdi == ctx (1st arg)
-	 * rcx == prog->bpf_func + prologue_size
+	 * rcx == prog->bpf_func + X86_TAIL_CALL_OFFSET
 	 */
 	RETPOLINE_RCX_BPF_JIT();
 
 	/* out: */
-	BUILD_BUG_ON(cnt - label1 != OFFSET1);
-	BUILD_BUG_ON(cnt - label2 != OFFSET2);
-	BUILD_BUG_ON(cnt - label3 != OFFSET3);
 	*pprog = prog;
 }
 
 static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
-				      u8 **pprog, int addr, u8 *image)
+				      u8 **pprog, int addr, u8 *image,
+				      bool *callee_regs_used, u32 stack_depth)
 {
+	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog;
+	int pop_bytes = 0;
+	int off1 = 27;
+	int poke_off;
 	int cnt = 0;
 
+	/* count the additional bytes used for popping callee regs to stack
+	 * that need to be taken into account for jump offset that is used for
+	 * bailing out from of the tail call when limit is reached
+	 */
+	pop_bytes = get_pop_bytes(callee_regs_used);
+	off1 += pop_bytes;
+
+	/*
+	 * total bytes for:
+	 * - nop5/ jmpq $off
+	 * - pop callee regs
+	 * - sub rsp, $val
+	 * - pop rax
+	 */
+	poke_off = X86_PATCH_SIZE + pop_bytes + 7 + 1;
+
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
-	EMIT2(X86_JA, 14);                            /* ja out */
+	EMIT2(X86_JA, off1);                          /* ja out */
 	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
 
+	poke->tailcall_bypass = image + (addr - poke_off - X86_PATCH_SIZE);
+	poke->adj_off = X86_TAIL_CALL_OFFSET;
 	poke->tailcall_target = image + (addr - X86_PATCH_SIZE);
-	poke->adj_off = PROLOGUE_SIZE;
+	poke->bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;
+
+	emit_jump(&prog, (u8 *)poke->tailcall_target + X86_PATCH_SIZE,
+		  poke->tailcall_bypass);
+
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
+	EMIT1(0x58);                                  /* pop rax */
+	EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
 
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
 	prog += X86_PATCH_SIZE;
@@ -476,6 +574,11 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
 						   (u8 *)target->bpf_func +
 						   poke->adj_off, false);
 			BUG_ON(ret < 0);
+			ret = __bpf_arch_text_poke(poke->tailcall_bypass,
+						   BPF_MOD_JUMP,
+						   (u8 *)poke->tailcall_target +
+						   X86_PATCH_SIZE, NULL, false);
+			BUG_ON(ret < 0);
 		}
 		WRITE_ONCE(poke->tailcall_target_stable, true);
 		mutex_unlock(&array->aux->poke_mutex);
@@ -654,19 +757,44 @@ static bool ex_handler_bpf(const struct exception_table_entry *x,
 	return true;
 }
 
+static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
+			     bool *regs_used, bool *tail_call_seen)
+{
+	int i;
+
+	for (i = 1; i <= insn_cnt; i++, insn++) {
+		if (insn->code == (BPF_JMP | BPF_TAIL_CALL))
+			*tail_call_seen = true;
+		if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6)
+			regs_used[0] = true;
+		if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7)
+			regs_used[1] = true;
+		if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8)
+			regs_used[2] = true;
+		if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9)
+			regs_used[3] = true;
+	}
+}
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		  int oldproglen, struct jit_context *ctx)
 {
 	struct bpf_insn *insn = bpf_prog->insnsi;
+	bool callee_regs_used[4] = {};
 	int insn_cnt = bpf_prog->len;
+	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
 	int i, cnt = 0, excnt = 0;
 	int proglen = 0;
 	u8 *prog = temp;
 
+	detect_reg_usage(insn, insn_cnt, callee_regs_used,
+			 &tail_call_seen);
+
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
-		      bpf_prog_was_classic(bpf_prog));
+		      bpf_prog_was_classic(bpf_prog), tail_call_seen);
+	push_callee_regs(&prog, callee_regs_used);
 	addrs[0] = prog - temp;
 
 	for (i = 1; i <= insn_cnt; i++, insn++) {
@@ -1111,9 +1239,13 @@ xadd:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_TAIL_CALL:
 			if (imm32)
 				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
-							  &prog, addrs[i], image);
+							  &prog, addrs[i], image,
+							  callee_regs_used,
+							  bpf_prog->aux->stack_depth);
 			else
-				emit_bpf_tail_call_indirect(&prog);
+				emit_bpf_tail_call_indirect(&prog,
+							    callee_regs_used,
+							    bpf_prog->aux->stack_depth);
 			break;
 
 			/* cond jump */
@@ -1296,12 +1428,9 @@ xadd:			if (is_imm8(insn->off))
 			seen_exit = true;
 			/* Update cleanup_addr */
 			ctx->cleanup_addr = proglen;
-			if (!bpf_prog_was_classic(bpf_prog))
-				EMIT1(0x5B); /* get rid of tail_call_cnt */
-			EMIT2(0x41, 0x5F);   /* pop r15 */
-			EMIT2(0x41, 0x5E);   /* pop r14 */
-			EMIT2(0x41, 0x5D);   /* pop r13 */
-			EMIT1(0x5B);         /* pop rbx */
+			pop_callee_regs(&prog, callee_regs_used);
+			if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen)
+				EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */
 			EMIT1(0xC9);         /* leave */
 			EMIT1(0xC3);         /* ret */
 			break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 37a855d54162..b9d8937254d3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -671,6 +671,8 @@ enum bpf_jit_poke_reason {
 /* Descriptor of pokes pointing /into/ the JITed image. */
 struct bpf_jit_poke_descriptor {
 	void *tailcall_target;
+	void *tailcall_bypass;
+	void *bypass_addr;
 	union {
 		struct {
 			struct bpf_map *map;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7e36b9a1827d..85f49d8206de 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -888,6 +888,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 				    struct bpf_prog *old,
 				    struct bpf_prog *new)
 {
+	u8 *old_addr, *new_addr, *old_bypass_addr;
 	struct prog_poke_elem *elem;
 	struct bpf_array_aux *aux;
 
@@ -939,12 +940,39 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			    poke->tail_call.key != key)
 				continue;
 
-			ret = bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
-						 old ? (u8 *)old->bpf_func +
-						 poke->adj_off : NULL,
-						 new ? (u8 *)new->bpf_func +
-						 poke->adj_off : NULL);
-			BUG_ON(ret < 0 && ret != -EINVAL);
+			old_bypass_addr = old ? NULL : poke->bypass_addr;
+			old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
+			new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
+
+			if (new) {
+				ret = bpf_arch_text_poke(poke->tailcall_target,
+							 BPF_MOD_JUMP,
+							 old_addr, new_addr);
+				BUG_ON(ret < 0 && ret != -EINVAL);
+				if (!old) {
+					ret = bpf_arch_text_poke(poke->tailcall_bypass,
+								 BPF_MOD_JUMP,
+								 poke->bypass_addr,
+								 NULL);
+					BUG_ON(ret < 0 && ret != -EINVAL);
+				}
+			} else {
+				ret = bpf_arch_text_poke(poke->tailcall_bypass,
+							 BPF_MOD_JUMP,
+							 old_bypass_addr,
+							 poke->bypass_addr);
+				BUG_ON(ret < 0 && ret != -EINVAL);
+				/* let other CPUs finish the execution of program
+				 * so that it will not possible to expose them
+				 * to invalid nop, stack unwind, nop state
+				 */
+				if (!ret)
+					synchronize_rcu();
+				ret = bpf_arch_text_poke(poke->tailcall_target,
+							 BPF_MOD_JUMP,
+							 old_addr, NULL);
+				BUG_ON(ret < 0 && ret != -EINVAL);
+			}
 		}
 	}
 }
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 586f3a7330c3..48df54593444 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -774,7 +774,7 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	if (size > poke_tab_max)
 		return -ENOSPC;
 	if (poke->tailcall_target || poke->tailcall_target_stable ||
-	    poke->adj_off)
+	    poke->tailcall_bypass || poke->adj_off || poke->bypass_addr)
 		return -EINVAL;
 
 	switch (poke->reason) {
-- 
2.20.1



* [PATCH v6 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT
  2020-07-31  0:03 [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2020-07-31  0:03 ` [PATCH v6 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
@ 2020-07-31  0:03 ` Maciej Fijalkowski
  2020-07-31  0:03 ` [PATCH v6 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
  2020-08-01  1:03 ` [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Daniel Borkmann
  6 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Relax the verifier's restriction that forbade tail call usage when the
subprogram count was higher than 1.

Also, do not max out the stack depth of a program that utilizes tail calls.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 kernel/bpf/verifier.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 96a339e24e93..fefc91f350d8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4251,10 +4251,12 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_FUNC_tail_call:
 		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
 			goto error;
+#if !defined(CONFIG_X86_64) || !defined(CONFIG_BPF_JIT_ALWAYS_ON)
 		if (env->subprog_cnt > 1) {
 			verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
 			return -EINVAL;
 		}
+#endif
 		break;
 	case BPF_FUNC_perf_event_read:
 	case BPF_FUNC_perf_event_output:
@@ -10387,7 +10389,9 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			 * the program array.
 			 */
 			prog->cb_access = 1;
+#if !defined(CONFIG_X86_64) || !defined(CONFIG_BPF_JIT_ALWAYS_ON)
 			env->prog->aux->stack_depth = MAX_BPF_STACK;
+#endif
 			env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
 
 			/* mark bpf_tail_call as different opcode to avoid
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v6 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall
  2020-07-31  0:03 [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2020-07-31  0:03 ` [PATCH v6 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT Maciej Fijalkowski
@ 2020-07-31  0:03 ` Maciej Fijalkowski
  2020-08-01  1:03 ` [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Daniel Borkmann
  6 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2020-07-31  0:03 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Introduce a 6th test to the tailcalls kselftest that checks whether a tail
call can be correctly executed from within a BPF subprogram.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 85 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/tailcall6.c | 38 +++++++++
 2 files changed, 123 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index bb8fe646dd9f..192c94896809 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
+#include <network_helpers.h>
 
 /* test_tailcall_1 checks basic functionality by patching multiple locations
  * in a single program for a single tail call slot with nop->jmp, jmp->nop
@@ -472,6 +473,88 @@ static void test_tailcall_5(void)
 	bpf_object__close(obj);
 }
 
+/* test_tailcall_6 checks that tail calls work correctly when they are
+ * performed from within BPF subprograms
+ */
+static void test_tailcall_6(void)
+{
+	int err, map_fd, prog_fd, main_fd, i;
+	struct bpf_map *prog_array;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	__u32 retval, duration;
+	char prog_name[32];
+
+	err = bpf_prog_load("tailcall6.o", BPF_PROG_TYPE_SCHED_CLS, &obj,
+			    &prog_fd);
+	if (CHECK_FAIL(err))
+		return;
+
+	prog = bpf_object__find_program_by_title(obj, "classifier");
+	if (CHECK_FAIL(!prog))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (CHECK_FAIL(main_fd < 0))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (CHECK_FAIL(!prog_array))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (CHECK_FAIL(map_fd < 0))
+		goto out;
+
+	/* nop -> jmp */
+	for (i = 0; i < bpf_map__def(prog_array)->max_entries; i++) {
+		snprintf(prog_name, sizeof(prog_name), "classifier/%i", i);
+
+		prog = bpf_object__find_program_by_title(obj, prog_name);
+		if (CHECK_FAIL(!prog))
+			goto out;
+
+		prog_fd = bpf_program__fd(prog);
+		if (CHECK_FAIL(prog_fd < 0))
+			goto out;
+
+		err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+		if (CHECK_FAIL(err))
+			goto out;
+	}
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 1, "tailcall",
+	      "err %d errno %d retval %d\n", err, errno, retval);
+
+	/* jmp -> nop, call subprog that will do tailcall */
+	i = 1;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 0, "tailcall", "err %d errno %d retval %d\n",
+	      err, errno, retval);
+
+	/* make sure that the subprog can access ctx and that the entry
+	 * prog which called this subprog can properly return
+	 */
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 108, "tailcall", "err %d errno %d retval %d\n",
+	      err, errno, retval);
+out:
+	bpf_object__close(obj);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -484,4 +567,6 @@ void test_tailcalls(void)
 		test_tailcall_4();
 	if (test__start_subtest("tailcall_5"))
 		test_tailcall_5();
+	if (test__start_subtest("tailcall_6"))
+		test_tailcall_6();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall6.c b/tools/testing/selftests/bpf/progs/tailcall6.c
new file mode 100644
index 000000000000..e72ca5869b58
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall6.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+#define TAIL_FUNC(x) 				\
+	SEC("classifier/" #x)			\
+	int bpf_func_##x(struct __sk_buff *skb)	\
+	{					\
+		return x;			\
+	}
+TAIL_FUNC(0)
+TAIL_FUNC(1)
+
+static __attribute__ ((noinline))
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call(skb, &jmp_table, 0);
+
+	return skb->len * 2;
+}
+
+SEC("classifier")
+int entry(struct __sk_buff *skb)
+{
+	bpf_tail_call(skb, &jmp_table, 1);
+
+	return subprog_tail(skb);
+}
+
+char __license[] SEC("license") = "GPL";
+int _version SEC("version") = 1;
-- 
2.20.1
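
For reference, the retvals checked above follow from the program layout; a
small sketch of the reasoning (the 54-byte figure assumes the usual pkt_v4
frame from network_helpers, i.e. 14 B eth + 20 B ip + 20 B tcp):

/* sketch only, not part of the test:
 *  - both slots set:   entry() tail-calls slot 1        -> retval 1
 *  - slot 1 deleted:   subprog_tail() tail-calls slot 0 -> retval 0
 *  - both deleted:     subprog_tail() falls through     -> skb->len * 2
 */
static int expected_retval_when_jmp_table_empty(int skb_len)
{
	return skb_len * 2;	/* 54 * 2 = 108 for pkt_v4 */
}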


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-07-31  0:03 [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (5 preceding siblings ...)
  2020-07-31  0:03 ` [PATCH v6 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
@ 2020-08-01  1:03 ` Daniel Borkmann
  2020-08-01  7:13   ` Maciej Fijalkowski
  6 siblings, 1 reply; 16+ messages in thread
From: Daniel Borkmann @ 2020-08-01  1:03 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> v5->v6:
> - propagate only those poke descriptors that individual subprogram is
>    actually using (Daniel)
> - drop the cumbersome check if poke desc got filled in map_poke_run()
> - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
>    to patch 3 to provide bisectability (Daniel)

I did a basic test with Cilium on K8s with this set, spawning a few Pods
and checking connectivity & whether we're not crashing, since it has a bit
more elaborate tail call use. So far so good. I was inclined to push the
series out, but there is one more issue I noticed now and missed earlier
when reviewing, and that is the overall stack size:

What happens when you create a single program that has nested BPF to BPF
calls e.g. either up to the maximum nesting or one call that is using up
the max stack size which is then doing another BPF to BPF call that contains
the tail call. In the tail call map, you have the same program in there.
This means we create a worst case stack from BPF size of max_stack_size *
max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
we have a stack of arch/x86/include/asm/page_64_types.h:

   #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

So we end up with 16k in a typical case. And this will cause kernel stack
overflow; I'm at least not seeing where we handle this situation in the
set. Hm, need to think more, but maybe this needs tracking of max stack
across tail calls to force an upper limit..
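
Just to make the arithmetic concrete, a back-of-the-envelope sketch in plain
C (illustration only; the names merely mirror the kernel constants quoted
above, this is not verifier output):

#include <stdio.h>

#define MAX_BPF_STACK		512		/* per-program BPF stack limit */
#define MAX_TAIL_CALL_CNT	32		/* tail call chain limit */
#define X86_64_THREAD_SIZE	(16 * 1024)	/* 4 pages without KASAN */

int main(void)
{
	int worst = MAX_BPF_STACK * MAX_TAIL_CALL_CNT;

	/* 512 * 32 = 16384, i.e. the entire 16k kernel stack */
	printf("worst case BPF stack usage: %d of %d bytes\n",
	       worst, X86_64_THREAD_SIZE);
	return 0;
}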

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-01  1:03 ` [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Daniel Borkmann
@ 2020-08-01  7:13   ` Maciej Fijalkowski
  2020-08-02  3:07     ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2020-08-01  7:13 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
> On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> > v5->v6:
> > - propagate only those poke descriptors that individual subprogram is
> >    actually using (Daniel)
> > - drop the cumbersome check if poke desc got filled in map_poke_run()
> > - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
> >    to patch 3 to provide bisectability (Daniel)
> 
> I did a basic test with Cilium on K8s with this set, spawning a few Pods
> and checking connectivity & whether we're not crashing since it has bit more
> elaborate tail call use. So far so good. I was inclined to push the series
> out, but there is one more issue I noticed and didn't notice earlier when
> reviewing, and that is overall stack size:
> 
> What happens when you create a single program that has nested BPF to BPF
> calls e.g. either up to the maximum nesting or one call that is using up
> the max stack size which is then doing another BPF to BPF call that contains
> the tail call. In the tail call map, you have the same program in there.
> This means we create a worst case stack from BPF size of max_stack_size *
> max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
> we have a stack of arch/x86/include/asm/page_64_types.h:
> 
>   #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> 
> So we end up with 16k in a typical case. And this will cause kernel stack
> overflow; I'm at least not seeing where we handle this situation in the
> set. Hm, need to think more, but maybe this needs tracking of max stack
> across tail calls to force an upper limit..

My knee jerk reaction would be to decrement the allowed max tail calls,
but not sure if it's an option and if it would help.

Otherwise I'm not sure how to pass the stack size around just like we're
doing in this set with the tail call counter that sits in %rax.

> 
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-01  7:13   ` Maciej Fijalkowski
@ 2020-08-02  3:07     ` Alexei Starovoitov
  2020-08-03 14:00       ` Daniel Borkmann
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-08-02  3:07 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Sat, Aug 01, 2020 at 09:13:57AM +0200, Maciej Fijalkowski wrote:
> On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
> > On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> > > v5->v6:
> > > - propagate only those poke descriptors that individual subprogram is
> > >    actually using (Daniel)
> > > - drop the cumbersome check if poke desc got filled in map_poke_run()
> > > - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
> > >    to patch 3 to provide bisectability (Daniel)
> > 
> > I did a basic test with Cilium on K8s with this set, spawning a few Pods
> > and checking connectivity & whether we're not crashing since it has bit more
> > elaborate tail call use. So far so good. I was inclined to push the series
> > out, but there is one more issue I noticed and didn't notice earlier when
> > reviewing, and that is overall stack size:
> > 
> > What happens when you create a single program that has nested BPF to BPF
> > calls e.g. either up to the maximum nesting or one call that is using up
> > the max stack size which is then doing another BPF to BPF call that contains
> > the tail call. In the tail call map, you have the same program in there.
> > This means we create a worst case stack from BPF size of max_stack_size *
> > max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
> > we have a stack of arch/x86/include/asm/page_64_types.h:
> > 
> >   #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
> >  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> > 
> > So we end up with 16k in a typical case. And this will cause kernel stack
> > overflow; I'm at least not seeing where we handle this situation in the

Not quite. The subprog is always 32 byte stack (from safety pov).
The real stack (when JITed) can be lower or zero.
So the max stack is (512 - 32) * 32 = 15360.
So there is no overflow, but may be a bit too close to comfort.
Imo the room is ok to land the set and the better enforcement can
be done as a follow up later, like below idea...

> > set. Hm, need to think more, but maybe this needs tracking of max stack
> > across tail calls to force an upper limit..
> 
> My knee jerk reaction would be to decrement the allowed max tail calls,
> but not sure if it's an option and if it would help.

How about make the verifier use a lower bound for a function with a tail call ?
Something like 64 would work.
subprog_info[idx].stack_depth with tail_call will be >= 64.
Then the main function will be automatically limited to 512-64 and the worst
case stack = 14kbyte.
When the sub prog with tail call is not an empty body (malicious stack
abuser) then the lower bound won't affect anything.
A bit annoying that stack_depth will be used by JIT to actually allocate
that much. Some of it will not be used potentially, but I think it's fine.
It's much simpler solution than to keep two variables to track stack size.
Or may be check_max_stack_depth() can be a bit smarter and it can detect
that subprog is using tail_call without actually hacking stack_depth variable.
Essentially I'm proposing to tweak this formula:
depth += round_up(max_t(u32, subprog[idx].stack_depth, 1), 32);
and replace 1 with 64 for subprogs with tail_call.
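
A minimal sketch of that tweak in check_max_stack_depth() terms (assuming a
per-subprog has_tail_call flag, which the verifier would still have to
track):

/* sketch only: lower-bound the accounted depth of any subprog that
 * contains a tail call, so the remaining call chain cannot consume
 * the full 512 bytes
 */
static u32 accounted_depth(struct bpf_subprog_info *subprog, int idx)
{
	u32 floor = subprog[idx].has_tail_call ? 64 : 1;

	return round_up(max_t(u32, subprog[idx].stack_depth, floor), 32);
}

/* i.e. in check_max_stack_depth():
 *	depth += accounted_depth(subprog, idx);
 */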

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-02  3:07     ` Alexei Starovoitov
@ 2020-08-03 14:00       ` Daniel Borkmann
  2020-08-21 17:38         ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Borkmann @ 2020-08-03 14:00 UTC (permalink / raw)
  To: Alexei Starovoitov, Maciej Fijalkowski
  Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson

On 8/2/20 5:07 AM, Alexei Starovoitov wrote:
> On Sat, Aug 01, 2020 at 09:13:57AM +0200, Maciej Fijalkowski wrote:
>> On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
>>> On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
>>>> v5->v6:
>>>> - propagate only those poke descriptors that individual subprogram is
>>>>     actually using (Daniel)
>>>> - drop the cumbersome check if poke desc got filled in map_poke_run()
>>>> - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
>>>>     to patch 3 to provide bisectability (Daniel)
>>>
>>> I did a basic test with Cilium on K8s with this set, spawning a few Pods
>>> and checking connectivity & whether we're not crashing since it has bit more
>>> elaborate tail call use. So far so good. I was inclined to push the series
>>> out, but there is one more issue I noticed and didn't notice earlier when
>>> reviewing, and that is overall stack size:
>>>
>>> What happens when you create a single program that has nested BPF to BPF
>>> calls e.g. either up to the maximum nesting or one call that is using up
>>> the max stack size which is then doing another BPF to BPF call that contains
>>> the tail call. In the tail call map, you have the same program in there.
>>> This means we create a worst case stack from BPF size of max_stack_size *
>>> max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
>>> we have a stack of arch/x86/include/asm/page_64_types.h:
>>>
>>>    #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
>>>   #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>>>
>>> So we end up with 16k in a typical case. And this will cause kernel stack
>>> overflow; I'm at least not seeing where we handle this situation in the
> 
> Not quite. The subprog is always 32 byte stack (from safety pov).
> The real stack (when JITed) can be lower or zero.
> So the max stack is (512 - 32) * 32 = 15360.
> So there is no overflow, but may be a bit too close to comfort.

I did a check with adding `stack_not_used(current)` to various points which
provides some useful data under CONFIG_DEBUG_STACK_USAGE. From tc ingress side
I'm getting roughly 13k free stack space which is definitely less than 15k even
at tc layer. I also checked on sk_filter_trim_cap() on ingress and worst case I
saw is very close to 12k, so a malicious or by accident a buggy program would be
able to cause a stack overflow as-is.

> Imo the room is ok to land the set and the better enforcement can
> be done as a follow up later, like below idea...
> 
>>> set. Hm, need to think more, but maybe this needs tracking of max stack
>>> across tail calls to force an upper limit..
>>
>> My knee jerk reaction would be to decrement the allowed max tail calls,
>> but not sure if it's an option and if it would help.
> 
> How about make the verifier use a lower bound for a function with a tail call ?
> Something like 64 would work.
> subprog_info[idx].stack_depth with tail_call will be >= 64.
> Then the main function will be automatically limited to 512-64 and the worst
> case stack = 14kbyte.

Even 14k is way too close, see above. Some archs that are supported by the kernel
run under 8k total stack size. In the long run if more archs would support tail
calls with bpf-to-bpf calls, we might need a per-arch upper cap, but I think in
this context here an upper total cap on x86 that is 4k should be reasonable, it
sounds broken to me if more is indeed needed for the vast majority of use cases.

> When the sub prog with tail call is not an empty body (malicious stack
> abuser) then the lower bound won't affect anything.
> A bit annoying that stack_depth will be used by JIT to actually allocate
> that much. Some of it will not be used potentially, but I think it's fine.
> It's much simpler solution than to keep two variables to track stack size.
> Or may be check_max_stack_depth() can be a bit smarter and it can detect
> that subprog is using tail_call without actually hacking stack_depth variable.

+1, I think that would be better, maybe we could have a different cost function
for the tail call counter itself depending in which call-depth we are, but that
also requires two vars for tracking (tail call counter, call depth counter), so
more JIT changes & emitted insns required. :/ Otoh, what if tail call counter
is limited to 4k and we subtract stack usage instead with a min cost (e.g. 128)
if progs use less than that? Though the user experience will be really bad in
this case given these semantics feel less deterministic / hard to debug from
user PoV.

> Essentially I'm proposing to tweak this formula:
> depth += round_up(max_t(u32, subprog[idx].stack_depth, 1), 32);
> and replace 1 with 64 for subprogs with tail_call.
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-03 14:00       ` Daniel Borkmann
@ 2020-08-21 17:38         ` Maciej Fijalkowski
  2020-08-26 21:35           ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2020-08-21 17:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Mon, Aug 03, 2020 at 04:00:10PM +0200, Daniel Borkmann wrote:
> On 8/2/20 5:07 AM, Alexei Starovoitov wrote:
> > On Sat, Aug 01, 2020 at 09:13:57AM +0200, Maciej Fijalkowski wrote:
> > > On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
> > > > On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> > > > > v5->v6:
> > > > > - propagate only those poke descriptors that individual subprogram is
> > > > >     actually using (Daniel)
> > > > > - drop the cumbersome check if poke desc got filled in map_poke_run()
> > > > > - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
> > > > >     to patch 3 to provide bisectability (Daniel)
> > > > 
> > > > I did a basic test with Cilium on K8s with this set, spawning a few Pods
> > > > and checking connectivity & whether we're not crashing since it has bit more
> > > > elaborate tail call use. So far so good. I was inclined to push the series
> > > > out, but there is one more issue I noticed and didn't notice earlier when
> > > > reviewing, and that is overall stack size:
> > > > 
> > > > What happens when you create a single program that has nested BPF to BPF
> > > > calls e.g. either up to the maximum nesting or one call that is using up
> > > > the max stack size which is then doing another BPF to BPF call that contains
> > > > the tail call. In the tail call map, you have the same program in there.
> > > > This means we create a worst case stack from BPF size of max_stack_size *
> > > > max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
> > > > we have a stack of arch/x86/include/asm/page_64_types.h:
> > > > 
> > > >    #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
> > > >   #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> > > > 
> > > > So we end up with 16k in a typical case. And this will cause kernel stack
> > > > overflow; I'm at least not seeing where we handle this situation in the
> > 
> > Not quite. The subprog is always 32 byte stack (from safety pov).
> > The real stack (when JITed) can be lower or zero.
> > So the max stack is (512 - 32) * 32 = 15360.
> > So there is no overflow, but may be a bit too close to comfort.
> 
> I did a check with adding `stack_not_used(current)` to various points which
> provides some useful data under CONFIG_DEBUG_STACK_USAGE. From tc ingress side
> I'm getting roughly 13k free stack space which is definitely less than 15k even
> at tc layer. I also checked on sk_filter_trim_cap() on ingress and worst case I
> saw is very close to 12k, so a malicious or by accident a buggy program would be
> able to cause a stack overflow as-is.
> 
> > Imo the room is ok to land the set and the better enforcement can
> > be done as a follow up later, like below idea...
> > 
> > > > set. Hm, need to think more, but maybe this needs tracking of max stack
> > > > across tail calls to force an upper limit..
> > > 
> > > My knee jerk reaction would be to decrement the allowed max tail calls,
> > > but not sure if it's an option and if it would help.
> > 
> > How about make the verifier use a lower bound for a function with a tail call ?
> > Something like 64 would work.
> > subprog_info[idx].stack_depth with tail_call will be >= 64.
> > Then the main function will be automatically limited to 512-64 and the worst
> > case stack = 14kbyte.
> 
> Even 14k is way too close, see above. Some archs that are supported by the kernel
> run under 8k total stack size. In the long run if more archs would support tail
> calls with bpf-to-bpf calls, we might need a per-arch upper cap, but I think in
> this context here an upper total cap on x86 that is 4k should be reasonable, it
> sounds broken to me if more is indeed needed for the vast majority of use cases.
> 
> > When the sub prog with tail call is not an empty body (malicious stack
> > abuser) then the lower bound won't affect anything.
> > A bit annoying that stack_depth will be used by JIT to actually allocate
> > that much. Some of it will not be used potentially, but I think it's fine.
> > It's much simpler solution than to keep two variables to track stack size.
> > Or may be check_max_stack_depth() can be a bit smarter and it can detect
> > that subprog is using tail_call without actually hacking stack_depth variable.
> 
> +1, I think that would be better, maybe we could have a different cost function
> for the tail call counter itself depending in which call-depth we are, but that
> also requires two vars for tracking (tail call counter, call depth counter), so
> more JIT changes & emitted insns required. :/ Otoh, what if tail call counter
> is limited to 4k and we subtract stack usage instead with a min cost (e.g. 128)
> if progs use less than that? Though the user experience will be really bad in
> this case given these semantics feel less deterministic / hard to debug from
> user PoV.

Let's get this rolling again.
I like this approach, but from the opposite way - instead of decrementing
from 4k, let's start with 0 like we did before and add up the
max(stack_size, 128) on each tailcall as you suggested.

Reason for that is no need for changes in prologue, we can keep the xor
eax,eax insn which occupies 2 bytes whereas mov eax, 4096 needs 5 bytes
from what I see.

cmp eax, 4096 also needs more bytes than what cmp eax, MAX_TAIL_CALL_CNT
needed, but that's something we need as well as change mentioned below.

One last change is add eax, 1 becomes the add eax, max(stack_size, 128)
and it is also encoded differently.

Let me know if you're fine with that and if i can post v7.
Dirty patch below that I will squash onto patch 5 if it's fine.

From 01d2494eed07284ea56134f40c6a304b109090ab Mon Sep 17 00:00:00 2001
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 21 Aug 2020 14:04:27 +0200
Subject: [PATCH] bpf: track stack size in tailcall

WIP

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 37 ++++++++++++++++++++-----------------
 include/linux/bpf.h         |  1 +
 2 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 880f283adb66..56b38536b1dd 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -393,7 +393,7 @@ static int get_pop_bytes(bool *callee_regs_used)
  * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
  *   if (index >= array->map.max_entries)
  *     goto out;
- *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
+ *   if (tail_call_stack_depth + stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
  *     goto out;
  *   prog = array->ptrs[index];
  *   if (prog == NULL)
@@ -404,11 +404,12 @@ static int get_pop_bytes(bool *callee_regs_used)
 static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
 					u32 stack_depth)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	u16 sd = max_t(u16, round_up(stack_depth, 8), 128);
+	int tcsd_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog;
 	int pop_bytes = 0;
-	int off1 = 49;
-	int off2 = 38;
+	int off1 = 49 + 4;
+	int off2 = 38 + 2;
 	int off3 = 16;
 	int cnt = 0;
 
@@ -438,15 +439,16 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
 
 	/*
-	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 * if (tail_call_stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);         /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT2_off32(0x8B, 0x85, tcsd_off);        /* mov eax, dword ptr [rbp - tcsd_off] */
+	EMIT1_off32(0x3D,                         /* cmp eax, MAX_TAIL_CALL_STACK_DEPTH */
+		    MAX_TAIL_CALL_STACK_DEPTH);
 #define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
-	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);         /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT1_off32(0x05, sd);                    /* add eax, stack_depth */
+	EMIT2_off32(0x89, 0x85, tcsd_off);        /* mov dword ptr [rbp - tcsd_off], eax */
 
 	/* prog = array->ptrs[index]; */
 	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
@@ -488,10 +490,11 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 				      u8 **pprog, int addr, u8 *image,
 				      bool *callee_regs_used, u32 stack_depth)
 {
-	int tcc_off = -4 - round_up(stack_depth, 8);
+	u16 sd = max_t(u16, round_up(stack_depth, 8), 128);
+	int tcsd_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog;
 	int pop_bytes = 0;
-	int off1 = 27;
+	int off1 = 27 + 2;
 	int poke_off;
 	int cnt = 0;
 
@@ -512,14 +515,14 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 	poke_off = X86_PATCH_SIZE + pop_bytes + 7 + 1;
 
 	/*
-	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 * if (tail_call_stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, tcc_off);             /* mov eax, dword ptr [rbp - tcc_off] */
-	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
+	EMIT2_off32(0x8B, 0x85, tcsd_off);            /* mov eax, dword ptr [rbp - tcsd_off] */
+	EMIT1_off32(0x3D, MAX_TAIL_CALL_STACK_DEPTH); /* cmp eax, MAX_TAIL_CALL_STACK_DEPTH */
 	EMIT2(X86_JA, off1);                          /* ja out */
-	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, tcc_off);             /* mov dword ptr [rbp - tcc_off], eax */
+	EMIT1_off32(0x05, sd);                        /* add eax, stack_size */
+	EMIT2_off32(0x89, 0x85, tcsd_off);            /* mov dword ptr [rbp - tcsd_off], eax */
 
 	poke->tailcall_bypass = image + (addr - poke_off - X86_PATCH_SIZE);
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
@@ -1430,7 +1433,7 @@ xadd:			if (is_imm8(insn->off))
 			ctx->cleanup_addr = proglen;
 			pop_callee_regs(&prog, callee_regs_used);
 			if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen)
-				EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */
+				EMIT1(0x59); /* pop rcx, get rid of tail_call_stack_depth */
 			EMIT1(0xC9);         /* leave */
 			EMIT1(0xC3);         /* ret */
 			break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c9c460a437ed..5600dfd2217a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -895,6 +895,7 @@ struct bpf_array {
 
 #define BPF_COMPLEXITY_LIMIT_INSNS      1000000 /* yes. 1M insns */
 #define MAX_TAIL_CALL_CNT 32
+#define MAX_TAIL_CALL_STACK_DEPTH 4096
 
 #define BPF_F_ACCESS_MASK	(BPF_F_RDONLY |		\
 				 BPF_F_RDONLY_PROG |	\
-- 
2.20.1

> 
> > Essentially I'm proposing to tweak this formula:
> > depth += round_up(max_t(u32, subprog[idx].stack_depth, 1), 32);
> > and replace 1 with 64 for subprogs with tail_call.
> > 
> 

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-21 17:38         ` Maciej Fijalkowski
@ 2020-08-26 21:35           ` Alexei Starovoitov
  2020-08-29 23:19             ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-08-26 21:35 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Fri, Aug 21, 2020 at 07:38:15PM +0200, Maciej Fijalkowski wrote:
> On Mon, Aug 03, 2020 at 04:00:10PM +0200, Daniel Borkmann wrote:
> > On 8/2/20 5:07 AM, Alexei Starovoitov wrote:
> > > On Sat, Aug 01, 2020 at 09:13:57AM +0200, Maciej Fijalkowski wrote:
> > > > On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
> > > > > On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> > > > > > v5->v6:
> > > > > > - propagate only those poke descriptors that individual subprogram is
> > > > > >     actually using (Daniel)
> > > > > > - drop the cumbersome check if poke desc got filled in map_poke_run()
> > > > > > - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
> > > > > >     to patch 3 to provide bisectability (Daniel)
> > > > > 
> > > > > I did a basic test with Cilium on K8s with this set, spawning a few Pods
> > > > > and checking connectivity & whether we're not crashing since it has bit more
> > > > > elaborate tail call use. So far so good. I was inclined to push the series
> > > > > out, but there is one more issue I noticed and didn't notice earlier when
> > > > > reviewing, and that is overall stack size:
> > > > > 
> > > > > What happens when you create a single program that has nested BPF to BPF
> > > > > calls e.g. either up to the maximum nesting or one call that is using up
> > > > > the max stack size which is then doing another BPF to BPF call that contains
> > > > > the tail call. In the tail call map, you have the same program in there.
> > > > > This means we create a worst case stack from BPF size of max_stack_size *
> > > > > max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
> > > > > we have a stack of arch/x86/include/asm/page_64_types.h:
> > > > > 
> > > > >    #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
> > > > >   #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> > > > > 
> > > > > So we end up with 16k in a typical case. And this will cause kernel stack
> > > > > overflow; I'm at least not seeing where we handle this situation in the
> > > 
> > > Not quite. The subprog is always 32 byte stack (from safety pov).
> > > The real stack (when JITed) can be lower or zero.
> > > So the max stack is (512 - 32) * 32 = 15360.
> > > So there is no overflow, but may be a bit too close to comfort.
> > 
> > I did a check with adding `stack_not_used(current)` to various points which
> > provides some useful data under CONFIG_DEBUG_STACK_USAGE. From tc ingress side
> > I'm getting roughly 13k free stack space which is definitely less than 15k even
> > at tc layer. I also checked on sk_filter_trim_cap() on ingress and worst case I
> > saw is very close to 12k, so a malicious or by accident a buggy program would be
> > able to cause a stack overflow as-is.
> > 
> > > Imo the room is ok to land the set and the better enforcement can
> > > be done as a follow up later, like below idea...
> > > 
> > > > > set. Hm, need to think more, but maybe this needs tracking of max stack
> > > > > across tail calls to force an upper limit..
> > > > 
> > > > My knee jerk reaction would be to decrement the allowed max tail calls,
> > > > but not sure if it's an option and if it would help.
> > > 
> > > How about make the verifier use a lower bound for a function with a tail call ?
> > > Something like 64 would work.
> > > subprog_info[idx].stack_depth with tail_call will be >= 64.
> > > Then the main function will be automatically limited to 512-64 and the worst
> > > case stack = 14kbyte.
> > 
> > Even 14k is way too close, see above. Some archs that are supported by the kernel
> > run under 8k total stack size. In the long run if more archs would support tail
> > calls with bpf-to-bpf calls, we might need a per-arch upper cap, but I think in
> > this context here an upper total cap on x86 that is 4k should be reasonable, it
> > sounds broken to me if more is indeed needed for the vast majority of use cases.
> > 
> > > When the sub prog with tail call is not an empty body (malicious stack
> > > abuser) then the lower bound won't affect anything.
> > > A bit annoying that stack_depth will be used by JIT to actually allocate
> > > that much. Some of it will not be used potentially, but I think it's fine.
> > > It's much simpler solution than to keep two variables to track stack size.
> > > Or may be check_max_stack_depth() can be a bit smarter and it can detect
> > > that subprog is using tail_call without actually hacking stack_depth variable.
> > 
> > +1, I think that would be better, maybe we could have a different cost function
> > for the tail call counter itself depending in which call-depth we are, but that
> > also requires two vars for tracking (tail call counter, call depth counter), so
> > more JIT changes & emitted insns required. :/ Otoh, what if tail call counter
> > is limited to 4k and we subtract stack usage instead with a min cost (e.g. 128)
> > if progs use less than that? Though the user experience will be really bad in
> > this case given these semantics feel less deterministic / hard to debug from
> > user PoV.
> 
> Let's get this rolling again.
> I like this approach, but from the opposite way - instead of decrementing
> from 4k, let's start with 0 like we did before and add up the
> max(stack_size, 128) on each tailcall as you suggested.
> 
> Reason for that is no need for changes in prologue, we can keep the xor
> eax,eax insn which occupies 2 bytes whereas mov eax, 4096 needs 5 bytes
> from what I see.
> 
> cmp eax, 4096 also needs more bytes than what cmp eax, MAX_TAIL_CALL_CNT
> needed, but that's something we need as well as change mentioned below.
> 
> One last change is add eax, 1 becomes the add eax, max(stack_size, 128)
> and it is also encoded differently.
> 
> Let me know if you're fine with that and if i can post v7.
> Dirty patch below that I will squash onto patch 5 if it's fine.
> 
> From 01d2494eed07284ea56134f40c6a304b109090ab Mon Sep 17 00:00:00 2001
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Fri, 21 Aug 2020 14:04:27 +0200
> Subject: [PATCH] bpf: track stack size in tailcall
> 
> WIP
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 37 ++++++++++++++++++++-----------------
>  include/linux/bpf.h         |  1 +
>  2 files changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 880f283adb66..56b38536b1dd 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -393,7 +393,7 @@ static int get_pop_bytes(bool *callee_regs_used)
>   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
>   *   if (index >= array->map.max_entries)
>   *     goto out;
> - *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
> + *   if (tail_call_stack_depth + stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
>   *     goto out;

I don't think we can use this approach because it's not correct. Adding the
stack_depth of the current function doesn't count stack space accurately.
The bpf_tail_call will unwind the current stack. It's the caller's stack
(in case of bpf2bpf) that matters from stack overflow pov.
But this callee (that does tail_call eventually) can be called from multiple
callsites in the caller and potentially from different callers, so
the callee cannot know the stack value to subtract without additional verifier help.
We can try to keep the maximum depth of stack (including all call frames) in
the verifier that leads to that callee with bpf_tail_call() and then pass it
into JITs to do this stack accounting. It's reasonable additional complexity in
the verifier, but it's painful to add the interpreter support.
We would need to hack BPF_TAIL_CALL insn. Like we can store
max_stack_of_all_callsites into insn->off when we do fixup_bpf_calls().
Then interpreter will do:
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ed0b3578867c..9a8b54c1adb6 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1532,10 +1532,10 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6

                if (unlikely(index >= array->map.max_entries))
                        goto out;
-               if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
+               if (unlikely(tail_call_stack > MAX_TAIL_CALL_STACK /* 4096 */))
                        goto out;

-               tail_call_cnt++;
+               tail_call_stack += insn->off;

and similar thing JITs would have to do. That includes modifying all existing JITs.

When bpf_tail_call() is called from top frame (instead of bpf-to-bpf subprog)
we can init 'off' with 128, so the old 32 call limit will be preserved.
But if we go with such massive user visible change I'd rather init 'off' with 32.
Then the tail call cnt limit will be 4096/32 = 128 invocations.
At least it will address a complain from folks that were hitting 32 limit.
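
Roughly, the fixup side could then look like this (sketch only;
subprog_max_caller_stack() is a made-up helper for the maximum stack of all
call chains reaching instruction i, which the verifier would have to
compute):

/* sketch, in fixup_bpf_calls(), for a call to BPF_FUNC_tail_call */
if (insn->imm == BPF_FUNC_tail_call) {
	/* worst-case caller stack reaching this callsite; a floor of 128
	 * keeps the old 32-invocation limit (4096 / 128), as above
	 */
	int depth = subprog_max_caller_stack(env, i);	/* hypothetical */

	insn->off = max_t(int, depth, 128);
}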

Another approach is to use what I've suggested earlier.
Adjust the math in check_max_stack_depth():
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8a097a85d01b..9c6c909a1ab9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2982,6 +2982,11 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
        int ret_prog[MAX_CALL_FRAMES];

 process_func:
+       if (idx && subprog[idx].has_tail_call && depth >= 256) {
+               verbose(env, "Cannot do bpf_tail_call when call stack of previous frames is %d bytes. Too large\n",
+                       depth);
+               return -EACCES;
+       }
Then the worst case stack will be 256 * 32 = 8k while tail_call_cnt of 32 will stay as-is.
And no need to change interpreter or JITs.

I'm still thinking which way is better long term.
Thoughts?

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-26 21:35           ` Alexei Starovoitov
@ 2020-08-29 23:19             ` Maciej Fijalkowski
  2020-09-01 16:24               ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2020-08-29 23:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Wed, Aug 26, 2020 at 02:35:25PM -0700, Alexei Starovoitov wrote:
> On Fri, Aug 21, 2020 at 07:38:15PM +0200, Maciej Fijalkowski wrote:
> > On Mon, Aug 03, 2020 at 04:00:10PM +0200, Daniel Borkmann wrote:
> > > On 8/2/20 5:07 AM, Alexei Starovoitov wrote:
> > > > On Sat, Aug 01, 2020 at 09:13:57AM +0200, Maciej Fijalkowski wrote:
> > > > > On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
> > > > > > On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> > > > > > > v5->v6:
> > > > > > > - propagate only those poke descriptors that individual subprogram is
> > > > > > >     actually using (Daniel)
> > > > > > > - drop the cumbersome check if poke desc got filled in map_poke_run()
> > > > > > > - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
> > > > > > >     to patch 3 to provide bisectability (Daniel)
> > > > > > 
> > > > > > I did a basic test with Cilium on K8s with this set, spawning a few Pods
> > > > > > and checking connectivity & whether we're not crashing since it has bit more
> > > > > > elaborate tail call use. So far so good. I was inclined to push the series
> > > > > > out, but there is one more issue I noticed and didn't notice earlier when
> > > > > > reviewing, and that is overall stack size:
> > > > > > 
> > > > > > What happens when you create a single program that has nested BPF to BPF
> > > > > > calls e.g. either up to the maximum nesting or one call that is using up
> > > > > > the max stack size which is then doing another BPF to BPF call that contains
> > > > > > the tail call. In the tail call map, you have the same program in there.
> > > > > > This means we create a worst case stack from BPF size of max_stack_size *
> > > > > > max_tail_call_size, that is, 512*32. So that adds 16k worst case. For x86
> > > > > > we have a stack of arch/x86/include/asm/page_64_types.h:
> > > > > > 
> > > > > >    #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
> > > > > >   #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> > > > > > 
> > > > > > So we end up with 16k in a typical case. And this will cause kernel stack
> > > > > > overflow; I'm at least not seeing where we handle this situation in the
> > > > 
> > > > Not quite. The subprog is always 32 byte stack (from safety pov).
> > > > The real stack (when JITed) can be lower or zero.
> > > > So the max stack is (512 - 32) * 32 = 15360.
> > > > So there is no overflow, but may be a bit too close to comfort.
> > > 
> > > I did a check with adding `stack_not_used(current)` to various points which
> > > provides some useful data under CONFIG_DEBUG_STACK_USAGE. From tc ingress side
> > > I'm getting roughly 13k free stack space which is definitely less than 15k even
> > > at tc layer. I also checked on sk_filter_trim_cap() on ingress and worst case I
> > > saw is very close to 12k, so a malicious or by accident a buggy program would be
> > > able to cause a stack overflow as-is.
> > > 
> > > > Imo the room is ok to land the set and the better enforcement can
> > > > be done as a follow up later, like below idea...
> > > > 
> > > > > > set. Hm, need to think more, but maybe this needs tracking of max stack
> > > > > > across tail calls to force an upper limit..
> > > > > 
> > > > > My knee jerk reaction would be to decrement the allowed max tail calls,
> > > > > but not sure if it's an option and if it would help.
> > > > 
> > > > How about make the verifier use a lower bound for a function with a tail call ?
> > > > Something like 64 would work.
> > > > subprog_info[idx].stack_depth with tail_call will be >= 64.
> > > > Then the main function will be automatically limited to 512-64 and the worst
> > > > case stack = 14kbyte.
> > > 
> > > Even 14k is way too close, see above. Some archs that are supported by the kernel
> > > run under 8k total stack size. In the long run if more archs would support tail
> > > calls with bpf-to-bpf calls, we might need a per-arch upper cap, but I think in
> > > this context here an upper total cap on x86 that is 4k should be reasonable, it
> > > sounds broken to me if more is indeed needed for the vast majority of use cases.
> > > 
> > > > When the sub prog with tail call is not an empty body (malicious stack
> > > > abuser) then the lower bound won't affect anything.
> > > > A bit annoying that stack_depth will be used by JIT to actually allocate
> > > > that much. Some of it will not be used potentially, but I think it's fine.
> > > > It's much simpler solution than to keep two variables to track stack size.
> > > > Or may be check_max_stack_depth() can be a bit smarter and it can detect
> > > > that subprog is using tail_call without actually hacking stack_depth variable.
> > > 
> > > +1, I think that would be better, maybe we could have a different cost function
> > > for the tail call counter itself depending in which call-depth we are, but that
> > > also requires two vars for tracking (tail call counter, call depth counter), so
> > > more JIT changes & emitted insns required. :/ Otoh, what if tail call counter
> > > is limited to 4k and we subtract stack usage instead with a min cost (e.g. 128)
> > > if progs use less than that? Though the user experience will be really bad in
> > > this case given these semantics feel less deterministic / hard to debug from
> > > user PoV.
> > 
> > Let's get this rolling again.
> > I like this approach, but from the opposite way - instead of decrementing
> > from 4k, let's start with 0 like we did before and add up the
> > max(stack_size, 128) on each tailcall as you suggested.
> > 
> > Reason for that is no need for changes in prologue, we can keep the xor
> > eax,eax insn which occupies 2 bytes whereas mov eax, 4096 needs 5 bytes
> > from what I see.
> > 
> > cmp eax, 4096 also needs more bytes than what cmp eax, MAX_TAIL_CALL_CNT
> > needed, but that's something we need as well as change mentioned below.
> > 
> > One last change is add eax, 1 becomes the add eax, max(stack_size, 128)
> > and it is also encoded differently.
> > 
> > Let me know if you're fine with that and if i can post v7.
> > Dirty patch below that I will squash onto patch 5 if it's fine.
> > 
> > From 01d2494eed07284ea56134f40c6a304b109090ab Mon Sep 17 00:00:00 2001
> > From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > Date: Fri, 21 Aug 2020 14:04:27 +0200
> > Subject: [PATCH] bpf: track stack size in tailcall
> > 
> > WIP
> > 
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 37 ++++++++++++++++++++-----------------
> >  include/linux/bpf.h         |  1 +
> >  2 files changed, 21 insertions(+), 17 deletions(-)
> > 
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 880f283adb66..56b38536b1dd 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -393,7 +393,7 @@ static int get_pop_bytes(bool *callee_regs_used)
> >   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
> >   *   if (index >= array->map.max_entries)
> >   *     goto out;
> > - *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
> > + *   if (tail_call_stack_depth + stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
> >   *     goto out;
> 
> I don't think we can use this approach because it's not correct. Adding the
> stack_depth of the current function doesn't count stack space accurately.
> The bpf_tail_call will unwind the current stack. It's the caller's stack
> (in case of bpf2bpf) that matters from stack overflow pov.

I must admit I was puzzled when I came back to this stuff after a break,
because, as you're saying, before the actual tailcall we will unwind the
stack frame of the tailcall's caller (or the current stack frame, in
simpler terms).

So, to visualize a bit and so that I'm sure I follow:

func1 -> sub rsp, 128
  subfunc1 -> sub rsp, 256
  tailcall1 -> add rsp, 256
    func2 -> sub rsp, 256 (total stack size = 128 + 256 = 384)
    subfunc2 -> sub rsp, 64
    subfunc22 -> sub rsp, 128
    tailcall2 -> add rsp, 128
      func3 -> sub rsp, 256 (total stack size 128 + 256 + 64 + 256 = 704)

and so on. And this is what we have to address. If that's it, then thanks
for making it explicit that it's about the subprog caller's stack.

> But this callee (that does tail_call eventually) can be called from multiple
> callsites in the caller and potentially from different callers, so
> the callee cannot know the stack value to subtract without additional verifier help.
> We can try to keep the maximum depth of stack (including all call frames) in
> the verifier that leads to that callee with bpf_tail_call() and then pass it
> into JITs to do this stack accounting. It's reasonable additional complexity in
> the verifier, but it's painful to add the interpreter support.

Not sure if we're on the same page - we allow this set only for x64 arch.
Why do you mention the interpreter and other JITs?

We could introduce one of your suggestions to the verifier and surround it
with proper ifdefs like patch 5/6 does.

> We would need to hack BPF_TAIL_CALL insn. Like we can store
> max_stack_of_all_callsites into insn->off when we do fixup_bpf_calls().
> Then interpreter will do:
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index ed0b3578867c..9a8b54c1adb6 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1532,10 +1532,10 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6
> 
>                 if (unlikely(index >= array->map.max_entries))
>                         goto out;
> -               if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
> +               if (unlikely(tail_call_stack > MAX_TAIL_CALL_STACK /* 4096 */))
>                         goto out;
> 
> -               tail_call_cnt++;
> +               tail_call_stack += insn->off;
> 
> and similar thing JITs would have to do. That includes modifying all existing JITs.

Again, I don't get why we would have to address everything else besides
x64 JIT.

> 
> When bpf_tail_call() is called from top frame (instead of bpf-to-bpf subprog)
> we can init 'off' with 128, so the old 32 call limit will be preserved.
> But if we go with such massive user visible change I'd rather init 'off' with 32.
> Then the tail call cnt limit will be 4096/32 = 128 invocations.
> At least it will address a complain from folks that were hitting 32 limit.
> 
> Another approach is to use what I've suggested earlier.
> Adjust the math in check_max_stack_depth():
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 8a097a85d01b..9c6c909a1ab9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2982,6 +2982,11 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
>         int ret_prog[MAX_CALL_FRAMES];
> 
>  process_func:
> +       if (idx && subprog[idx].has_tail_call && depth >= 256) {
> +               verbose(env, "Cannot do bpf_tail_call when call stack of previous frames is %d bytes. Too large\n",
> +                       depth);
> +               return -EACCES;
> +       }
> Then the worst case stack will be 256 * 32 = 8k while tail_call_cnt of 32 will stay as-is.
> And no need to change interpreter or JITs.

I tend to lean towards simpler solutions as this work is already complex.
Let's hear Daniel's opinion though.

> 
> I'm still thinking which way is better long term.
> Thoughts?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-08-29 23:19             ` Maciej Fijalkowski
@ 2020-09-01 16:24               ` Alexei Starovoitov
  2020-09-02 20:01                 ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-01 16:24 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Sun, Aug 30, 2020 at 01:19:25AM +0200, Maciej Fijalkowski wrote:
> On Wed, Aug 26, 2020 at 02:35:25PM -0700, Alexei Starovoitov wrote:
> > On Fri, Aug 21, 2020 at 07:38:15PM +0200, Maciej Fijalkowski wrote:
> > > On Mon, Aug 03, 2020 at 04:00:10PM +0200, Daniel Borkmann wrote:
> > > > On 8/2/20 5:07 AM, Alexei Starovoitov wrote:
> > > > > On Sat, Aug 01, 2020 at 09:13:57AM +0200, Maciej Fijalkowski wrote:
> > > > > > On Sat, Aug 01, 2020 at 03:03:19AM +0200, Daniel Borkmann wrote:
> > > > > > > On 7/31/20 2:03 AM, Maciej Fijalkowski wrote:
> > > > > > > > v5->v6:
> > > > > > > > - propagate only those poke descriptors that individual subprogram is
> > > > > > > >     actually using (Daniel)
> > > > > > > > - drop the cumbersome check if poke desc got filled in map_poke_run()
> > > > > > > > - move poke->ip renaming in bpf_jit_add_poke_descriptor() from patch 4
> > > > > > > >     to patch 3 to provide bisectability (Daniel)
> > > > > > > 
> > > > > > > I did a basic test with Cilium on K8s with this set, spawning a few Pods
> > > > > > > and checking connectivity & whether we're not crashing since it has bit more
> > > > > > > elaborate tail call use. So far so good. I was inclined to push the series
> > > > > > > out, but there is one more issue I noticed and didn't notice earlier when
> > > > > > > reviewing, and that is overall stack size:
> > > > > > > 
> > > > > > > What happens when you create a single program that has nested BPF to BPF
> > > > > > > calls, e.g. either up to the maximum nesting, or one call that is using up
> > > > > > > the max stack size and is then doing another BPF to BPF call that contains
> > > > > > > the tail call? In the tail call map, you have the same program in there.
> > > > > > > This means we create a worst case stack usage from the BPF side of max_stack_size *
> > > > > > > max_tail_call_cnt, that is, 512*32. So that adds 16k worst case. For x86
> > > > > > > we have a stack of arch/x86/include/asm/page_64_types.h:
> > > > > > > 
> > > > > > >    #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
> > > > > > >   #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> > > > > > > 
> > > > > > > So we end up with 16k in a typical case. And this will cause kernel stack
> > > > > > > overflow; I'm at least not seeing where we handle this situation in the
> > > > > 
> > > > > Not quite. The subprog is always 32 byte stack (from safety pov).
> > > > > The real stack (when JITed) can be lower or zero.
> > > > > So the max stack is (512 - 32) * 32 = 15360.
> > > > > So there is no overflow, but may be a bit too close to comfort.
> > > > 
> > > > I did a check with adding `stack_not_used(current)` to various points which
> > > > provides some useful data under CONFIG_DEBUG_STACK_USAGE. From tc ingress side
> > > > I'm getting roughly 13k free stack space which is definitely less than 15k even
> > > > at the tc layer. I also checked sk_filter_trim_cap() on ingress and the worst case I
> > > > saw is very close to 12k, so a malicious or accidentally buggy program would be
> > > > able to cause a stack overflow as-is.
> > > > 
> > > > > Imo the room is ok to land the set and the better enforcement can
> > > > > be done as a follow-up later, like the idea below...
> > > > > 
> > > > > > > set. Hm, need to think more, but maybe this needs tracking of max stack
> > > > > > > across tail calls to force an upper limit..
> > > > > > 
> > > > > > My knee jerk reaction would be to decrement the allowed max tail calls,
> > > > > > but not sure if it's an option and if it would help.
> > > > > 
> > > > > How about making the verifier use a lower bound for a function with a tail call?
> > > > > Something like 64 would work.
> > > > > subprog_info[idx].stack_depth with tail_call will be >= 64.
> > > > > Then the main function will be automatically limited to 512-64 and the worst
> > > > > case stack = 14kbyte.
> > > > 
> > > > Even 14k is way too close, see above. Some archs that are supported by the kernel
> > > > run under 8k total stack size. In the long run if more archs would support tail
> > > > calls with bpf-to-bpf calls, we might need a per-arch upper cap, but I think in
> > > > this context here an upper total cap of 4k on x86 should be reasonable; it
> > > > sounds broken to me if the vast majority of use cases indeed needed more.
> > > > 
> > > > > When the sub prog with tail call is not an empty body (malicious stack
> > > > > abuser) then the lower bound won't affect anything.
> > > > > A bit annoying that stack_depth will be used by JIT to actually allocate
> > > > > that much. Some of it will not be used potentially, but I think it's fine.
> > > > > It's much simpler solution than to keep two variables to track stack size.
> > > > > Or maybe check_max_stack_depth() can be a bit smarter and it can detect
> > > > > that subprog is using tail_call without actually hacking stack_depth variable.
> > > > 
> > > > +1, I think that would be better, maybe we could have a different cost function
> > > > for the tail call counter itself depending on which call-depth we are at, but that
> > > > also requires two vars for tracking (tail call counter, call depth counter), so
> > > > more JIT changes & emitted insns required. :/ Otoh, what if tail call counter
> > > > is limited to 4k and we subtract stack usage instead with a min cost (e.g. 128)
> > > > if progs use less than that? Though the user experience will be really bad in
> > > > this case given these semantics feel less deterministic / hard to debug from
> > > > user PoV.
> > > 
> > > Let's get this rolling again.
> > > I like this approach, but the other way around - instead of decrementing
> > > from 4k, let's start with 0 like we did before and add up the
> > > max(stack_size, 128) on each tailcall as you suggested.
> > > 
> > > The reason is that no prologue changes are needed - we can keep the xor
> > > eax,eax insn which occupies 2 bytes whereas mov eax, 4096 needs 5 bytes
> > > from what I see.
> > > 
> > > cmp eax, 4096 also needs more bytes than cmp eax, MAX_TAIL_CALL_CNT
> > > did, but that's a change we need anyway, as is the one mentioned below.
> > > 
> > > One last change is that add eax, 1 becomes add eax, max(stack_size, 128),
> > > which is also encoded differently.
> > > 
> > > Let me know if you're fine with that and if I can post v7.
> > > Dirty patch below that I will squash onto patch 5 if it's fine.
> > > 
> > > From 01d2494eed07284ea56134f40c6a304b109090ab Mon Sep 17 00:00:00 2001
> > > From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > Date: Fri, 21 Aug 2020 14:04:27 +0200
> > > Subject: [PATCH] bpf: track stack size in tailcall
> > > 
> > > WIP
> > > 
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 37 ++++++++++++++++++++-----------------
> > >  include/linux/bpf.h         |  1 +
> > >  2 files changed, 21 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index 880f283adb66..56b38536b1dd 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -393,7 +393,7 @@ static int get_pop_bytes(bool *callee_regs_used)
> > >   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
> > >   *   if (index >= array->map.max_entries)
> > >   *     goto out;
> > > - *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
> > > + *   if (tail_call_stack_depth + stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
> > >   *     goto out;
> > 
> > I don't think we can use this approach because it's not correct. Adding the
> > stack_depth of the current function doesn't count stack space accurately.
> > The bpf_tail_call will unwind the current stack. It's the caller's stack
> > (in case of bpf2bpf) that matters from stack overflow pov.
> 
> I must admit I was puzzled when I came back to this stuff after a break,
> because as you're saying before the actual tailcall we will unwind the
> stack frame of tailcall's caller (or the current stack frame in simpler
> terms).
> 
> So, to visualize a bit and so that I'm sure I follow:
> 
> func1 -> sub rsp, 128
>   subfunc1 -> sub rsp, 256
>   tailcall1 -> add rsp, 256
>     func2 -> sub rsp, 256 (total stack size = 128 + 256 = 384)
>     subfunc2 -> sub rsp, 64
>     subfunc22 -> sub rsp, 128
>     tailcall2 -> add rsp, 128
>       func3 -> sub rsp, 256 (total stack size 128 + 256 + 64 + 256 = 704)
> 
> and so on. And this is what we have to address. If that's it, then thanks
> for making it explicit that it's about the subprog caller's stack.

Right. The above is correct. Could you add it to the code as a comment?
> Please replace the second and third use of 256 with different constants to
make it easier to see that 'sub rsp, X' in func1, func2, and func3 can
be different.

> > But this callee (that does tail_call eventually) can be called from multiple
> > callsites in the caller and potentially from different callers, so
> > the callee cannot know the stack value to subtract without additional verifier help.
> > We can try to keep the maximum depth of stack (including all call frames) in
> > the verifier that leads to that callee with bpf_tail_call() and then pass it
> > into JITs to do this stack accounting. It's reasonable additional complexity in
> > the verifier, but it's painful to add the interpreter support.
> 
> Not sure if we're on the same page - we allow this set only for x64 arch.
> Why do you mention the interpreter and other JITs?

It's not 100% mandatory to make the interpreter compatible with JIT,
but we should always try to keep the parity when possible.
Like when I was working on the BPF trampoline I considered supporting JITed code
only, since generation of the trampoline itself requires Just-In-Time code
generation. But I took the extra effort to make sure invoke_bpf_prog() in
arch/x86/net/bpf_jit_comp.c supports the interpreter as well. There could be bugs
in the interpreter or JIT. Having two ways to execute the program is useful for
many reasons.

In this case the new tail_call handling in the x86 JIT will unwind the current stack,
so the existing interpreter handling of tail_call won't quite work.
Take a look at bpf_patch_call_args(), JMP_CALL_ARGS:, and JMP_TAIL_CALL:.
The JMP_TAIL_CALL will sort-of unwind the current stack, but that stack area
will be reused for the tail_call target function.
Illustrating on your example:
 func1 -> sub rsp, 128
   subfunc1 -> sub rsp, 256
   tailcall1 -> add rsp, 256
     func2 -> sub rsp, 192 (total stack size = 128 + 192 = 320)
     subfunc2 -> sub rsp, 64
     subfunc22 -> sub rsp, 128
     tailcall2 -> add rsp, 128
       func3 -> sub rsp, 224 (total stack size 128 + 192 + 64 + 224 = 608)

The interpreter will call into subfunc1 with 256 bytes of the interpreter stack
and will reuse it for tail_call into func2.
If func2 needs 192, it's going to work fine, but if it needs more than 256
there will be stack overflow.
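
To make that failure mode concrete, a tiny userspace model of it (purely
illustrative, not interpreter code; the sizes just mirror the example above):

#include <stdio.h>

/* The interpreter reuses the frame of the function doing the tail call
 * for the tail call target, so the target only fits if it needs no more
 * stack than that reused frame provides.
 */
static void model_tail_call(int reused_frame, int target_need)
{
	if (target_need > reused_frame)
		printf("target needs %d > reused %d bytes: overflow\n",
		       target_need, reused_frame);
	else
		printf("target needs %d <= reused %d bytes: fits\n",
		       target_need, reused_frame);
}

int main(void)
{
	model_tail_call(256, 192);	/* subfunc1's 256 byte frame reused for func2 */
	model_tail_call(256, 320);	/* a larger func2 would not fit */
	return 0;
}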

We can disable mixing bpf2bpf calls and tail_calls when the interpreter is used
for now, but it would be good to support it somehow.

> We could introduce one of your suggestions to verifier and surround it with
> proper ifdefs like patch 5/6 is doing it.

If we use the approach of changing JMP_TAIL_CALL pseudo insn to do:
-               tail_call_cnt++;
+               tail_call_stack += insn->off;
then we have to update other JITs to do the same.
Other JITs do NOT need to support bpf2bpf calls with tail_calls.
They do NOT need to do current stack unwinding, but they have to match
the new behavior of JMP_TAIL_CALL; otherwise such an interpreter vs JIT
discrepancy will create plenty of unhappy users.
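
As a back-of-the-envelope check of the byte-budget variant quoted below
(a 4096 byte MAX_TAIL_CALL_STACK charged via insn->off), a tiny model of
the arithmetic; the constants come from this discussion, not from any
merged code:

#include <stdio.h>

#define MAX_TAIL_CALL_STACK 4096	/* budget floated in this thread */

/* Every tail call charges at least 'charge' bytes against the budget,
 * so the effective invocation limit becomes budget / charge.
 */
static int allowed_tail_calls(int charge)
{
	return MAX_TAIL_CALL_STACK / charge;
}

int main(void)
{
	printf("charge 128 -> %d tail calls (old 32 limit preserved)\n",
	       allowed_tail_calls(128));
	printf("charge  32 -> %d tail calls\n", allowed_tail_calls(32));
	return 0;
}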

> > We would need to hack BPF_TAIL_CALL insn. Like we can store
> > max_stack_of_all_callsites into insn->off when we do fixup_bpf_calls().
> > Then interpreter will do:
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index ed0b3578867c..9a8b54c1adb6 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -1532,10 +1532,10 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u6
> > 
> >                 if (unlikely(index >= array->map.max_entries))
> >                         goto out;
> > -               if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
> > +               if (unlikely(tail_call_stack > MAX_TAIL_CALL_STACK /* 4096 */))
> >                         goto out;
> > 
> > -               tail_call_cnt++;
> > +               tail_call_stack += insn->off;
> > 
> > and a similar thing the JITs would have to do. That includes modifying all existing JITs.
> 
> Again, I don't get why we would have to address everything else besides
> x64 JIT.
> 
> > 
> > When bpf_tail_call() is called from top frame (instead of bpf-to-bpf subprog)
> > we can init 'off' with 128, so the old 32 call limit will be preserved.
> > But if we go with such a massive user-visible change I'd rather init 'off' with 32.
> > Then the tail call cnt limit will be 4096/32 = 128 invocations.
> > At least it will address a complaint from folks who were hitting the 32 limit.
> > 
> > Another approach is to use what I've suggested earlier.
> > Adjust the math in check_max_stack_depth():
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 8a097a85d01b..9c6c909a1ab9 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -2982,6 +2982,11 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
> >         int ret_prog[MAX_CALL_FRAMES];
> > 
> >  process_func:
> > +       if (idx && subprog[idx].has_tail_call && depth >= 256) {
> > +               verbose(env, "Cannot do bpf_tail_call when call stack of previous frames is %d bytes. Too large\n",
> > +                       depth);
> > +               return -EACCES;
> > +       }
> > Then the worst case stack will be 256 * 32 = 8k while tail_call_cnt of 32 will stay as-is.
> > And no need to change interpreter or JITs.
> 
> I tend to lean towards simpler solutions as this work is already complex.
> Let's hear Daniel's opinion though.

Let's go with this simpler solution. We can add fancier stack
accounting later.
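
For completeness, the worst case under that check, as a quick model rather
than kernel code:

#include <stdio.h>

int main(void)
{
	const int max_caller_depth = 256;	/* depth >= 256 is rejected */
	const int max_tail_calls   = 32;	/* MAX_TAIL_CALL_CNT stays as-is */

	/* Each tail call unwinds its own frame but can leave up to
	 * max_caller_depth bytes of caller frames behind, matching the
	 * 256 * 32 = 8k estimate above.
	 */
	printf("worst case leftover caller stack: %d bytes\n",
	       max_caller_depth * max_tail_calls);
	return 0;
}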


* Re: [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms
  2020-09-01 16:24               ` Alexei Starovoitov
@ 2020-09-02 20:01                 ` Maciej Fijalkowski
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2020-09-02 20:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Tue, Sep 01, 2020 at 09:24:12AM -0700, Alexei Starovoitov wrote:

[...]

> > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > index 880f283adb66..56b38536b1dd 100644
> > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > @@ -393,7 +393,7 @@ static int get_pop_bytes(bool *callee_regs_used)
> > > >   * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
> > > >   *   if (index >= array->map.max_entries)
> > > >   *     goto out;
> > > > - *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
> > > > + *   if (tail_call_stack_depth + stack_depth > MAX_TAIL_CALL_STACK_DEPTH)
> > > >   *     goto out;
> > > 
> > > I don't think we can use this approach because it's not correct. Adding the
> > > stack_depth of the current function doesn't count stack space accurately.
> > > The bpf_tail_call will unwind the current stack. It's the caller's stack
> > > (in case of bpf2bpf) that matters from stack overflow pov.
> > 
> > I must admit I was puzzled when I came back to this stuff after a break,
> > because as you're saying before the actual tailcall we will unwind the
> > stack frame of tailcall's caller (or the current stack frame in simpler
> > terms).
> > 
> > So, to visualize a bit and so that I'm sure I follow:
> > 
> > func1 -> sub rsp, 128
> >   subfunc1 -> sub rsp, 256
> >   tailcall1 -> add rsp, 256
> >     func2 -> sub rsp, 256 (total stack size = 128 + 256 = 384)
> >     subfunc2 -> sub rsp, 64
> >     subfunc22 -> sub rsp, 128
> >     tailcall2 -> add rsp, 128
> >       func3 -> sub rsp, 256 (total stack size 128 + 256 + 64 + 256 = 704)
> > 
> > and so on. And this is what we have to address. If that's it, then thanks
> > for making it explicit that it's about the subprog caller's stack.
> 
> Right. The above is correct. Could you add it to the code as a comment?
> Please replace the second and third use of 256 with different constants to
> make it easier to see that 'sub rsp, X' in func1, func2, and func3 can
> be different.

Good stuff. I played with your suggestion to limit the caller's stack depth
to 256 when there is a tailcall in a subprogram: I created buffers on the
stack across the subprogs and used them in some silly way so that the
verifier wouldn't prune them, and the check seems to be doing its job. For a
moment I was confused whether everything was all right, since I saw a stack
depth of 480, but that was because the tailcall sat in the last subprog AND
the previous frames summed to less than 256. IOW, that last frame gets
unwound before the tailcall, as mentioned multiple times already.
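
For reference, the kind of test program I used looks roughly like the sketch
below (section name, map layout, and buffer sizes are made up; the point is
only that every function keeps a buffer live so the verifier counts its
stack, and the last subprog does the tail call):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

/* Last callee: keeps a buffer live, then tail calls. Its own frame gets
 * unwound before the jump, so it is the callers' frames that matter.
 */
static __attribute__((noinline)) int subprog_tail(struct __sk_buff *skb)
{
	volatile char buf[64] = {};

	bpf_tail_call(skb, &jmp_table, 0);
	return buf[13];
}

/* Middle callee: burns caller stack that stays allocated across the
 * tail call; this is what the depth >= 256 check has to account for.
 */
static __attribute__((noinline)) int subprog(struct __sk_buff *skb)
{
	volatile char buf[224] = {};

	buf[0] = 1;
	return subprog_tail(skb) + buf[0];
}

SEC("classifier")
int entry(struct __sk_buff *skb)
{
	volatile char buf[128] = {};

	buf[0] = 2;
	return subprog(skb) + buf[0];
}

char _license[] SEC("license") = "GPL";

Growing or shrinking the buffers flips it between passing verification and
tripping the new depth check.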

Thanks for all of the explanations, they are pretty educational, but I'm
glad that I don't have to get my hands dirty with the interpreter... yet :)

I wrapped the code you suggested with ifdefs, the same way as the patch that
allows for tailcalls in BPF subprogs.
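
For the record, what the gating plus the new check boil down to, shown as a
standalone model (helper name and return codes are made up, this is not the
actual diff):

#include <stdbool.h>
#include <stdio.h>

/* The bpf2bpf + tailcall mix is only allowed when the x64 JIT will run
 * the program, and a subprog containing a tail call is rejected once the
 * caller frames already use 256 bytes or more.
 */
static int check_tail_call_in_subprog(bool x64_jit, int caller_depth)
{
	if (!x64_jit)
		return -1;	/* keep the old "no mixing" restriction */
	if (caller_depth >= 256)
		return -2;	/* caller stack too large to tail call */
	return 0;
}

int main(void)
{
	printf("%d\n", check_tail_call_in_subprog(false, 128));	/* -1 */
	printf("%d\n", check_tail_call_in_subprog(true, 320));	/* -2 */
	printf("%d\n", check_tail_call_in_subprog(true, 192));	/*  0 */
	return 0;
}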

Sending v7.

[...]



Thread overview: 16+ messages
2020-07-31  0:03 [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-31  0:03 ` [PATCH v6 bpf-next 1/6] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
2020-07-31  0:03 ` [PATCH v6 bpf-next 2/6] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
2020-07-31  0:03 ` [PATCH v6 bpf-next 3/6] bpf: rename poke descriptor's 'ip' member to 'tailcall_target' Maciej Fijalkowski
2020-07-31  0:03 ` [PATCH v6 bpf-next 4/6] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
2020-07-31  0:03 ` [PATCH v6 bpf-next 5/6] bpf: allow for tailcalls in BPF subprograms for x64 JIT Maciej Fijalkowski
2020-07-31  0:03 ` [PATCH v6 bpf-next 6/6] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
2020-08-01  1:03 ` [PATCH v6 bpf-next 0/6] bpf: tailcalls in BPF subprograms Daniel Borkmann
2020-08-01  7:13   ` Maciej Fijalkowski
2020-08-02  3:07     ` Alexei Starovoitov
2020-08-03 14:00       ` Daniel Borkmann
2020-08-21 17:38         ` Maciej Fijalkowski
2020-08-26 21:35           ` Alexei Starovoitov
2020-08-29 23:19             ` Maciej Fijalkowski
2020-09-01 16:24               ` Alexei Starovoitov
2020-09-02 20:01                 ` Maciej Fijalkowski
