bpf.vger.kernel.org archive mirror
* [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms
@ 2020-07-15 23:36 Maciej Fijalkowski
  2020-07-15 23:36 ` [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-15 23:36 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

RFC->v1:
- rename poke->ip/poke->ip_aux pair to
  poke->tailcall_target/poke->tailcall_bypass (Alexei)
- get rid of x86-specific code in prog_array_map_poke_run (Alexei)
- use synchronize_rcu in prog_array_map_poke_run so that other CPUs in
  the middle of execution will finish running the program and will not
  stumble upon the incorrect state (Alexei)
- update performance reports
- drop the RFC-ish narration from descriptions
- rebase


Hello,

Today, bpf2bpf calls and tailcalls are mutually exclusive. This set
makes them work together.

To give you an overview of how this work started: I previously posted an
RFC that was targeted at getting rid of the push/pop instructions for
callee-saved registers in the x86-64 JIT when they are not used by the
BPF program. Alexei saw potential in lifting that work a bit so that
tailcalls could work with BPF subprograms. More on that in [1], [2].

For previous discussions on the RFC version, see [3].

In [1], Alexei says:

"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

So basically I gave that suggestion a shot. Patch 4 contains a
description of the implementation.

Quick overview of the patches:
Patch 1 changes the BPF retpoline to use %rcx instead of %rax to store
the address of the BPF tailcall target program.
Patch 2 relaxes the verifier's restriction on tailcalls being used
together with BPF subprograms.
Patch 3 propagates poke descriptors from the main program to each
subprogram.
Patch 4 is the main dish in this set. It implements the new prologue
layout that was suggested by Alexei and reworks the tailcall handling.
Patch 5 is a new selftest that proves tailcalls can be used from within
a BPF subprogram.

-------------------------------------------------------------------
prog_array_map_poke_run changes:

Before the tailcall, with the new prologue layout, the stack needs to be
unwound and the callee-saved registers need to be popped. The
instructions responsible for that are generated, but they must not be
executed if the target program is not present. To address that, a new
poke target, 'tailcall_bypass', is introduced to the poke descriptor; it
is used for skipping these instructions. This means there are two poke
targets for handling direct tailcalls. The simplified flow can be
presented as three sections:

1. skip call or nop (poke->tailcall_bypass)
2. stack unwind
3. call tail or nop (poke->tailcall_target)

Under specific circumstances, one of the CPUs might be at point 2 while
point 3 has not yet been updated (nop), which would lead to the problems
mentioned in the patch 4 commit message; IOW, the unwind section must
not be executed if there is no target program.

We can define the following state matrix for that (courtesy of Bjorn):
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions for
tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
   grace period so that other programs finish their execution and after
   that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way a CPU can never be exposed to the "unwind, tail" state.
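
For reference, a minimal sketch of that ordering (error handling and the
iteration over tracked poke entries dropped; see the changes to
prog_array_map_poke_run() in patch 4 for the full version):

	if (new) {
		/* C->D or B(f)->B(f'): update the target first ... */
		bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				   old_addr, new_addr);
		if (!old)
			/* ... then drop the bypass jump (D->B) */
			bpf_arch_text_poke(poke->tailcall_bypass,
					   BPF_MOD_JUMP, bypass_addr, NULL);
	} else {
		/* B(f')->C(f'): re-install the bypass jump first ... */
		bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
				   NULL, bypass_addr);
		/* ... wait for in-flight programs before removing target */
		synchronize_rcu();
		bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
				   old_addr, NULL);
	}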

-------------------------------------------------------------------
Performance impact:

All of this work, as stated in [2], started as a way to speed up AF_XDP
by dropping the push/pop of unused callee-saved registers in the
prologue and epilogue. The impact there is positive: a 15% performance
gain.

However, it will obviously have a negative impact on BPF programs that
utilize tailcalls, but we think its magnitude is acceptable for the
feature that this set provides.

Below are the numbers from 'perf stat' for two scenarios.
The first scenario is the output of the command:

$ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls

The tailcalls kselftest was modified in the following way:
- only the tailcall_1 subtest is enabled
- each of the bpf_prog_test_run() calls had its 'repeat' argument set to
  1000000 (see the sketch below)
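
As an illustration, a sketch of that tweak (not part of this series; the
actual call sites are in prog_tests/tailcalls.c):

	/* repeat bumped from 1 to 1000000 for the measurement */
	err = bpf_prog_test_run(main_fd, 1000000, &pkt_v4, sizeof(pkt_v4),
				0, 0, &retval, &duration);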

Numbers without this set:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

            261.68 msec task-clock                #    0.998 CPUs utilized            ( +-  0.12% )
                 5      context-switches          #    0.017 K/sec                    ( +-  0.54% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 23.37% )
               113      page-faults               #    0.433 K/sec                    ( +-  0.03% )
       877,156,850      cycles                    #    3.352 GHz                      ( +-  0.11% )  (30.31%)
     1,379,322,515      instructions              #    1.57  insn per cycle           ( +-  0.02% )  (38.17%)
       218,869,567      branches                  #  836.395 M/sec                    ( +-  0.01% )  (38.46%)
        11,954,183      branch-misses             #    5.46% of all branches          ( +-  0.01% )  (38.74%)
       283,350,418      L1-dcache-loads           # 1082.805 M/sec                    ( +-  0.01% )  (39.00%)
           156,323      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    ( +-  0.74% )  (39.05%)
            37,309      LLC-loads                 #    0.143 M/sec                    ( +-  1.02% )  (31.08%)
            15,263      LLC-load-misses           #   40.91% of all LL-cache hits     ( +-  0.90% )  (30.95%)
   <not supported>      L1-icache-loads
           130,427      L1-icache-load-misses                                         ( +-  0.45% )  (30.80%)
       285,369,370      dTLB-loads                # 1090.520 M/sec                    ( +-  0.01% )  (30.64%)
             1,154      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.26% )  (30.46%)
             2,015      iTLB-loads                #    0.008 M/sec                    ( +-  1.12% )  (30.31%)
               551      iTLB-load-misses          #   27.34% of all iTLB cache hits   ( +-  1.29% )  (30.20%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          0.262276 +- 0.000316 seconds time elapsed  ( +-  0.12% )

With:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

            362.37 msec task-clock                #    0.671 CPUs utilized            ( +-  0.11% )
                28      context-switches          #    0.077 K/sec                    ( +-  0.15% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +-  4.46% )
               113      page-faults               #    0.313 K/sec                    ( +-  0.03% )
       895,804,416      cycles                    #    2.472 GHz                      ( +-  0.08% )  (30.50%)
     1,339,401,398      instructions              #    1.50  insn per cycle           ( +-  0.04% )  (38.29%)
       302,718,849      branches                  #  835.385 M/sec                    ( +-  0.04% )  (38.39%)
        11,962,089      branch-misses             #    3.95% of all branches          ( +-  0.05% )  (38.56%)
       248,044,443      L1-dcache-loads           #  684.505 M/sec                    ( +-  0.03% )  (38.70%)
           239,882      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  0.49% )  (38.69%)
            76,904      LLC-loads                 #    0.212 M/sec                    ( +-  0.96% )  (30.88%)
            23,472      LLC-load-misses           #   30.52% of all LL-cache hits     ( +-  0.98% )  (30.85%)
   <not supported>      L1-icache-loads
           193,803      L1-icache-load-misses                                         ( +-  0.53% )  (30.81%)
       249,775,412      dTLB-loads                #  689.282 M/sec                    ( +-  0.04% )  (30.81%)
             2,176      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.53% )  (30.73%)
             2,914      iTLB-loads                #    0.008 M/sec                    ( +-  1.23% )  (30.59%)
               978      iTLB-load-misses          #   33.57% of all iTLB cache hits   ( +-  1.29% )  (30.48%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          0.540236 +- 0.000454 seconds time elapsed  ( +-  0.08% )

The second measurement was taken on the flow_dissector BPF kselftest,
which uses progs/bpf_flow.c, with the 'repeat' argument of
bpf_prog_test_run_xattr also set to 1000000 (see the sketch below).
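
Roughly, that tweak looks as follows (illustrative only; the exact call
site lives in the flow_dissector prog_test and uses libbpf's
struct bpf_prog_test_run_attr):

	struct bpf_prog_test_run_attr tattr = {
		.prog_fd      = prog_fd,
		.repeat       = 1000000, /* bumped for the measurement */
		.data_in      = &pkt_v4,
		.data_size_in = sizeof(pkt_v4),
	};

	err = bpf_prog_test_run_xattr(&tattr);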

Without:

Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,355.18 msec task-clock                #    0.989 CPUs utilized            ( +-  0.11% )
                28      context-switches          #    0.021 K/sec                    ( +-  0.49% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  7.86% )
               125      page-faults               #    0.093 K/sec                    ( +-  0.03% )
     4,609,228,676      cycles                    #    3.401 GHz                      ( +-  0.03% )  (30.70%)
     6,735,946,489      instructions              #    1.46  insn per cycle           ( +-  0.01% )  (38.42%)
     1,130,187,926      branches                  #  833.979 M/sec                    ( +-  0.01% )  (38.47%)
        29,150,986      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.51%)
     1,737,548,851      L1-dcache-loads           # 1282.158 M/sec                    ( +-  0.01% )  (38.56%)
           659,851      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.78% )  (38.56%)
            71,196      LLC-loads                 #    0.053 M/sec                    ( +-  0.97% )  (30.81%)
            22,218      LLC-load-misses           #   31.21% of all LL-cache hits     ( +-  0.83% )  (30.79%)
   <not supported>      L1-icache-loads
           770,586      L1-icache-load-misses                                         ( +-  0.67% )  (30.77%)
     1,742,104,224      dTLB-loads                # 1285.520 M/sec                    ( +-  0.01% )  (30.74%)
             7,060      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.08% )  (30.72%)
             4,282      iTLB-loads                #    0.003 M/sec                    ( +- 16.98% )  (30.70%)
             1,261      iTLB-load-misses          #   29.46% of all iTLB cache hits   ( +-  7.14% )  (30.68%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           1.37087 +- 0.00145 seconds time elapsed  ( +-  0.11% )

With:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,385.56 msec task-clock                #    0.989 CPUs utilized            ( +-  0.06% )
                28      context-switches          #    0.020 K/sec                    ( +-  0.48% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  7.20% )
               125      page-faults               #    0.091 K/sec                    ( +-  0.03% )
     4,642,599,630      cycles                    #    3.351 GHz                      ( +-  0.03% )  (30.69%)
     6,901,261,616      instructions              #    1.49  insn per cycle           ( +-  0.01% )  (38.41%)
     1,130,623,950      branches                  #  816.006 M/sec                    ( +-  0.01% )  (38.45%)
        29,161,215      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.50%)
     1,796,850,740      L1-dcache-loads           # 1296.842 M/sec                    ( +-  0.01% )  (38.55%)
           673,908      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.89% )  (38.56%)
            70,394      LLC-loads                 #    0.051 M/sec                    ( +-  1.08% )  (30.82%)
            24,575      LLC-load-misses           #   34.91% of all LL-cache hits     ( +-  0.66% )  (30.80%)
   <not supported>      L1-icache-loads
           729,421      L1-icache-load-misses                                         ( +-  0.85% )  (30.77%)
     1,800,871,042      dTLB-loads                # 1299.743 M/sec                    ( +-  0.01% )  (30.75%)
             6,133      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.55% )  (30.73%)
             1,998      iTLB-loads                #    0.001 M/sec                    ( +-  9.36% )  (30.70%)
             1,152      iTLB-load-misses          #   57.66% of all iTLB cache hits   ( +-  3.02% )  (30.68%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          1.400577 +- 0.000780 seconds time elapsed  ( +-  0.06% )


An interesting fact is that I observed huge iTLB-load-misses counts on a
clean kernel as well:

flow_dissector test:
             2,613      iTLB-loads                #    0.002 M/sec ( +- 21.90% )  (30.70%)
            16,483      iTLB-load-misses          #  630.78% of all iTLB cache hits   ( +- 79.63% )  (30.68%)
tailcalls test:
             1,996      iTLB-loads                #    0.008 M/sec ( +-  1.08% )  (30.33%)
             7,272      iTLB-load-misses          #  364.24% of all iTLB cache hits   ( +- 92.01% )  (30.21%)

So Alexei's suspicion that get_random_int() was doing something strange
was probably right.

-------------------------------------------------------------------

Thank you,
Maciej

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
[2]: https://lore.kernel.org/bpf/20200511143912.34086-1-maciej.fijalkowski@intel.com/
[3]: https://lore.kernel.org/bpf/20200702134930.4717-1-maciej.fijalkowski@intel.com/


Maciej Fijalkowski (5):
  bpf, x64: use %rcx instead of %rax for tail call retpolines
  bpf: allow for tailcalls in BPF subprograms
  bpf: propagate poke descriptors to subprograms
  bpf, x64: rework pro/epilogue and tailcall handling in JIT
  selftests: bpf: add dummy prog for bpf2bpf with tailcall

 arch/x86/include/asm/nospec-branch.h          |  16 +-
 arch/x86/net/bpf_jit_comp.c                   | 241 +++++++++++++-----
 include/linux/bpf.h                           |   8 +-
 kernel/bpf/arraymap.c                         |  61 ++++-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/verifier.c                         |  14 +-
 .../selftests/bpf/prog_tests/tailcalls.c      |  85 ++++++
 tools/testing/selftests/bpf/progs/tailcall6.c |  38 +++
 8 files changed, 379 insertions(+), 87 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines
  2020-07-15 23:36 [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
@ 2020-07-15 23:36 ` Maciej Fijalkowski
  2020-07-16 20:36   ` Daniel Borkmann
  2020-07-15 23:36 ` [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-15 23:36 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Currently, %rax is used to store the jump target when the x86-64 JIT
emits the retpoline instructions that handle the indirect tailcall.

There is a plan to use %rax for a different purpose: storing the tail
call counter. In order to preserve this value across tailcalls, use %rcx
instead for jump target storage in the retpoline instructions.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/include/asm/nospec-branch.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index e7752b4038ff..e491c3d9f227 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -314,19 +314,19 @@ static inline void mds_idle_clear_cpu_buffers(void)
  *    lfence
  *    jmp spec_trap
  *  do_rop:
- *    mov %rax,(%rsp) for x86_64
+ *    mov %rcx,(%rsp) for x86_64
  *    mov %edx,(%esp) for x86_32
  *    retq
  *
  * Without retpolines configured:
  *
- *    jmp *%rax for x86_64
+ *    jmp *%rcx for x86_64
  *    jmp *%edx for x86_32
  */
 #ifdef CONFIG_RETPOLINE
 # ifdef CONFIG_X86_64
-#  define RETPOLINE_RAX_BPF_JIT_SIZE	17
-#  define RETPOLINE_RAX_BPF_JIT()				\
+#  define RETPOLINE_RCX_BPF_JIT_SIZE	17
+#  define RETPOLINE_RCX_BPF_JIT()				\
 do {								\
 	EMIT1_off32(0xE8, 7);	 /* callq do_rop */		\
 	/* spec_trap: */					\
@@ -334,7 +334,7 @@ do {								\
 	EMIT3(0x0F, 0xAE, 0xE8); /* lfence */			\
 	EMIT2(0xEB, 0xF9);       /* jmp spec_trap */		\
 	/* do_rop: */						\
-	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */	\
+	EMIT4(0x48, 0x89, 0x0C, 0x24); /* mov %rcx,(%rsp) */	\
 	EMIT1(0xC3);             /* retq */			\
 } while (0)
 # else /* !CONFIG_X86_64 */
@@ -352,9 +352,9 @@ do {								\
 # endif
 #else /* !CONFIG_RETPOLINE */
 # ifdef CONFIG_X86_64
-#  define RETPOLINE_RAX_BPF_JIT_SIZE	2
-#  define RETPOLINE_RAX_BPF_JIT()				\
-	EMIT2(0xFF, 0xE0);       /* jmp *%rax */
+#  define RETPOLINE_RCX_BPF_JIT_SIZE	2
+#  define RETPOLINE_RCX_BPF_JIT()				\
+	EMIT2(0xFF, 0xE1);       /* jmp *%rcx */
 # else /* !CONFIG_X86_64 */
 #  define RETPOLINE_EDX_BPF_JIT()				\
 	EMIT2(0xFF, 0xE2)        /* jmp *%edx */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms
  2020-07-15 23:36 [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
  2020-07-15 23:36 ` [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
@ 2020-07-15 23:36 ` Maciej Fijalkowski
  2020-07-16 21:10   ` Daniel Borkmann
  2020-07-16 21:29   ` Daniel Borkmann
  2020-07-15 23:36 ` [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-15 23:36 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Relax the verifier's restriction that was meant to forbid tailcall usage
when the subprog count was higher than 1.

Also, do not max out the stack depth of a program that utilizes
tailcalls.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 kernel/bpf/verifier.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c1efc9d08fd..6481342b31ba 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4172,10 +4172,6 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_FUNC_tail_call:
 		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
 			goto error;
-		if (env->subprog_cnt > 1) {
-			verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
-			return -EINVAL;
-		}
 		break;
 	case BPF_FUNC_perf_event_read:
 	case BPF_FUNC_perf_event_output:
@@ -10252,7 +10248,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			 * the program array.
 			 */
 			prog->cb_access = 1;
-			env->prog->aux->stack_depth = MAX_BPF_STACK;
 			env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
 
 			/* mark bpf_tail_call as different opcode to avoid
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms
  2020-07-15 23:36 [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
  2020-07-15 23:36 ` [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
  2020-07-15 23:36 ` [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
@ 2020-07-15 23:36 ` Maciej Fijalkowski
  2020-07-16 21:16   ` Daniel Borkmann
  2020-07-15 23:36 ` [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
  2020-07-15 23:36 ` [PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
  4 siblings, 1 reply; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-15 23:36 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Previously, there was no need for poke descriptors to be present in a
subprogram's bpf_prog_aux struct, since tailcalls were simply not
allowed in subprograms. Each subprog is JITed independently, so in order
to enable JITing of such subprograms, simply copy the poke descriptors
from the main program to the subprogram's poke tab.

Also add the subprog's aux struct to the BPF map's poke_progs list by
calling map_poke_track() on it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 kernel/bpf/verifier.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6481342b31ba..3b406b2860ef 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9932,6 +9932,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		goto out_undo_insn;
 
 	for (i = 0; i < env->subprog_cnt; i++) {
+		struct bpf_map *map_ptr;
+		int j;
+
 		subprog_start = subprog_end;
 		subprog_end = env->subprog_info[i + 1].start;
 
@@ -9956,6 +9959,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->aux->btf = prog->aux->btf;
 		func[i]->aux->func_info = prog->aux->func_info;
 
+		for (j = 0; j < prog->aux->size_poke_tab; j++) {
+			bpf_jit_add_poke_descriptor(func[i], &prog->aux->poke_tab[j]);
+			map_ptr = func[i]->aux->poke_tab[j].tail_call.map;
+			map_ptr->ops->map_poke_track(map_ptr, func[i]->aux);
+		}
+
 		/* Use bpf_prog_F_tag to indicate functions in stack traces.
 		 * Long term would need debug info to populate names
 		 */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-15 23:36 [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2020-07-15 23:36 ` [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
@ 2020-07-15 23:36 ` Maciej Fijalkowski
  2020-07-16 23:06   ` Daniel Borkmann
  2020-07-15 23:36 ` [PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
  4 siblings, 1 reply; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-15 23:36 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

This commit serves two purposes:
1) it optimizes BPF prologue/epilogue generation
2) it makes it possible to have tailcalls within BPF subprograms

Both points are related to each other, since without 1), 2) could not be
achieved.

In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

Implement that suggestion.
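
For reference, a sketch of how the new prologue bytes lay out (for a
program that is not cBPF-derived and uses tail calls, as emitted by
emit_prologue() below):

	nop5                          5 bytes \
	xor eax, eax (or nop2)        2 bytes  \ skipped on tail call,
	push rbp                      1 byte   / X86_TAIL_CALL_OFFSET = 11
	mov rbp, rsp                  3 bytes /
	sub rsp, rounded_stack_depth  7 bytes
	push rax                      1 byte (re-push of tail call counter)

A tail call jumps to prog->bpf_func + 11, so it executes only the
callee's own stack setup plus the counter re-push; the counter itself
travels in %rax, into which the caller's unwind code popped it.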

Since the stack layout has changed, the tail call counter handling can
no longer rely on popping it to rbx, as was done for the constant
prologue case, and later overwriting rbx with the actual value that was
pushed to the stack. Therefore, let's use one of the registers (%rcx)
that is considered volatile/caller-saved and pop the tail call counter
value into it in the epilogue.

Drop the BUILD_BUG_ONs in emit_prologue and in
emit_bpf_tail_call_indirect, where the instruction layout is no longer
constant.

Introduce a new poke target, 'tailcall_bypass', to the poke descriptor;
it is dedicated to skipping the register pops and stack unwind that are
generated right before the actual jump to the target program. Also
reflect the actual purpose of poke->ip and rename it to
poke->tailcall_target so that it will not be confused with the poke
target being introduced here.
In the case when the target program is not present, the BPF program will
skip the pop instructions and the nop5 dedicated for jmpq $target. An
example of such a state when only R6 of the callee-saved registers is
used by the program:

ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
ffffffffc0513aa6:       5b                      pop    %rbx
ffffffffc0513aa7:       58                      pop    %rax
ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

When the target program is inserted, the jump that was there to skip the
pops/nop5 will become a nop5, so the CPU will go over the pops and do
the actual tailcall.

One might ask why there simply cannot be pushes after the nop5.
Consider the following snippet:

ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
(...)
ffffffffc0370332:       5b                      pop    %rbx
ffffffffc0370333:       58                      pop    %rax
ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
ffffffffc0370347:       50                      push   %rax
ffffffffc0370348:       53                      push   %rbx
ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548

There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
and the jump target is not present. ctx is in the %rbx register and the
BPF subprogram that we will call into at ffffffffc037034c relies on it,
e.g. it will pick up ctx from there. Such a code layout would therefore
be broken, as we would overwrite the contents of %rbx with the value
that was pushed in the prologue. That is the reason for the 'bypass'
approach.

Special care needs to be taken during install/update/remove of the
tailcall target. When the target program is not present, the CPU must
not execute the pop instructions that precede the tailcall.

To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions for
tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
   grace period so that other programs finish their execution and after
   that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way a CPU can never be exposed to the "unwind, tail" state.

For regression checks, 'tailcalls' kselftest was executed:
$ sudo ./test_progs -t tailcalls
 #64/1 tailcall_1:OK
 #64/2 tailcall_2:OK
 #64/3 tailcall_3:OK
 #64/4 tailcall_4:OK
 #64/5 tailcall_5:OK
 #64 tailcalls:OK
Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 241 +++++++++++++++++++++++++++---------
 include/linux/bpf.h         |   8 +-
 kernel/bpf/arraymap.c       |  61 +++++++--
 kernel/bpf/core.c           |   3 +-
 4 files changed, 239 insertions(+), 74 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 42b6709e6dc7..4e83b13fb19f 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -219,16 +219,47 @@ struct jit_context {
 #define BPF_MAX_INSN_SIZE	128
 #define BPF_INSN_SAFETY		64
 
-/* Number of bytes emit_patch() needs to generate instructions */
-#define X86_PATCH_SIZE		5
+/* Number of bytes that will be skipped on tailcall */
+#define X86_TAIL_CALL_OFFSET	11
 
-#define PROLOGUE_SIZE		25
+static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[0])
+		EMIT1(0x53);         /* push rbx */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x55);   /* push r13 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x56);   /* push r14 */
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x57);   /* push r15 */
+	*pprog = prog;
+}
+
+static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x5F);   /* pop r15 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x5E);   /* pop r14 */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x5D);   /* pop r13 */
+	if (callee_regs_used[0])
+		EMIT1(0x5B);         /* pop rbx */
+	*pprog = prog;
+}
 
 /*
- * Emit x86-64 prologue code for BPF program and check its size.
+ * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip it while jumping into another program
  */
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
+static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
+			  bool tail_call)
 {
 	u8 *prog = *pprog;
 	int cnt = X86_PATCH_SIZE;
@@ -238,19 +269,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
 	 */
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
 	prog += cnt;
+	if (!ebpf_from_cbpf && tail_call)
+		EMIT2(0x31, 0xC0);       /* xor eax, eax */
+	else
+		EMIT2(0x66, 0x90);       /* nop2 */
 	EMIT1(0x55);             /* push rbp */
 	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	/* sub rsp, rounded_stack_depth */
 	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	EMIT1(0x53);             /* push rbx */
-	EMIT2(0x41, 0x55);       /* push r13 */
-	EMIT2(0x41, 0x56);       /* push r14 */
-	EMIT2(0x41, 0x57);       /* push r15 */
-	if (!ebpf_from_cbpf) {
-		/* zero init tail_call_cnt */
-		EMIT2(0x6a, 0x00);
-		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
-	}
+	if (!ebpf_from_cbpf && tail_call)
+		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
 
@@ -337,6 +365,22 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr, true);
 }
 
+static int get_pop_bytes(bool *callee_regs_used)
+{
+	int bytes = 0;
+
+	if (callee_regs_used[3])
+		bytes += 2;
+	if (callee_regs_used[2])
+		bytes += 2;
+	if (callee_regs_used[1])
+		bytes += 2;
+	if (callee_regs_used[0])
+		bytes += 1;
+
+	return bytes;
+}
+
 /*
  * Generate the following code:
  *
@@ -351,12 +395,25 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
  *   goto *(prog->bpf_func + prologue_size);
  * out:
  */
-static void emit_bpf_tail_call_indirect(u8 **pprog)
+static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
+					u32 stack_depth)
 {
 	u8 *prog = *pprog;
-	int label1, label2, label3;
+	int pop_bytes = 0;
+	int off1 = 49;
+	int off2 = 38;
+	int off3 = 16;
 	int cnt = 0;
 
+	/* count the additional bytes used for popping callee regs from stack
+	 * that need to be taken into account for each of the offsets that
+	 * are used for bailing out of the tail call
+	 */
+	pop_bytes = get_pop_bytes(callee_regs_used);
+	off1 += pop_bytes;
+	off2 += pop_bytes;
+	off3 += pop_bytes;
+
 	/*
 	 * rdi - pointer to ctx
 	 * rsi - pointer to bpf_array
@@ -370,72 +427,108 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
 	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
 	      offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
+#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
-	label1 = cnt;
 
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85                    /* mov eax, dword ptr [rbp - (4 + sd)] */,
+		    -4 - round_up(stack_depth, 8));
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
+#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
-	label2 = cnt;
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
+		    -4 - round_up(stack_depth, 8));
 
 	/* prog = array->ptrs[index]; */
-	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
+	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,        /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
 		    offsetof(struct bpf_array, ptrs));
 
 	/*
 	 * if (prog == NULL)
 	 *	goto out;
 	 */
-	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
-#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
-	EMIT2(X86_JE, OFFSET3);                   /* je out */
-	label3 = cnt;
+	EMIT3(0x48, 0x85, 0xC9);                   /* test rcx,rcx */
+#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
+	EMIT2(X86_JE, OFFSET3);                    /* je out */
 
-	/* goto *(prog->bpf_func + prologue_size); */
-	EMIT4(0x48, 0x8B, 0x40,                   /* mov rax, qword ptr [rax + 32] */
-	      offsetof(struct bpf_prog, bpf_func));
-	EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
+
+	EMIT1(0x58);                               /* pop rax */
+	EMIT3_off32(0x48, 0x81, 0xC4,              /* add rsp, sd */
+		    round_up(stack_depth, 8));
 
+	/* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */
+	EMIT4(0x48, 0x8B, 0x49,                   /* mov rcx, qword ptr [rcx + 32] */
+	      offsetof(struct bpf_prog, bpf_func));
+	EMIT4(0x48, 0x83, 0xC1,                   /* add rcx, X86_TAIL_CALL_OFFSET */
+	      X86_TAIL_CALL_OFFSET);
 	/*
-	 * Wow we're ready to jump into next BPF program
+	 * Now we're ready to jump into next BPF program
 	 * rdi == ctx (1st arg)
-	 * rax == prog->bpf_func + prologue_size
+	 * rcx == prog->bpf_func + X86_TAIL_CALL_OFFSET
 	 */
-	RETPOLINE_RAX_BPF_JIT();
+	RETPOLINE_RCX_BPF_JIT();
 
 	/* out: */
-	BUILD_BUG_ON(cnt - label1 != OFFSET1);
-	BUILD_BUG_ON(cnt - label2 != OFFSET2);
-	BUILD_BUG_ON(cnt - label3 != OFFSET3);
 	*pprog = prog;
 }
 
 static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
-				      u8 **pprog, int addr, u8 *image)
+				      u8 **pprog, int addr, u8 *image,
+				      bool *callee_regs_used, u32 stack_depth)
 {
 	u8 *prog = *pprog;
+	int pop_bytes = 0;
+	int off1 = 27;
+	int poke_off;
 	int cnt = 0;
 
+	/* count the additional bytes used for popping callee regs to stack
+	 * that need to be taken into account for jump offset that is used for
+	 * bailing out from the tail call when limit is reached
+	 */
+	pop_bytes = get_pop_bytes(callee_regs_used);
+	off1 += pop_bytes;
+
+	/*
+	 * total bytes for:
+	 * - nop5/ jmpq $off
+	 * - pop callee regs
+	 * - sub rsp, $val
+	 * - pop rax
+	 */
+	poke_off = X86_PATCH_SIZE + pop_bytes + 7 + 1;
+
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85,
+		    -4 - round_up(stack_depth, 8));   /* mov eax, dword ptr [rbp - (4 + sd)] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
-	EMIT2(X86_JA, 14);                            /* ja out */
+	EMIT2(X86_JA, off1);                          /* ja out */
 	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85,
+		    -4 - round_up(stack_depth, 8));   /* mov dword ptr [rbp - (4 + sd)], eax */
+
+	poke->tailcall_bypass = image + (addr - poke_off - X86_PATCH_SIZE);
+	poke->adj_off = X86_TAIL_CALL_OFFSET;
+	poke->tailcall_target = image + (addr - X86_PATCH_SIZE);
+
+	emit_jump(&prog, (u8 *)poke->tailcall_target + X86_PATCH_SIZE,
+		  poke->tailcall_bypass);
 
-	poke->ip = image + (addr - X86_PATCH_SIZE);
-	poke->adj_off = PROLOGUE_SIZE;
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
+	EMIT1(0x58);                                  /* pop rax */
+	EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
 
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
 	prog += X86_PATCH_SIZE;
@@ -453,7 +546,7 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
 
 	for (i = 0; i < prog->aux->size_poke_tab; i++) {
 		poke = &prog->aux->poke_tab[i];
-		WARN_ON_ONCE(READ_ONCE(poke->ip_stable));
+		WARN_ON_ONCE(READ_ONCE(poke->tailcall_target_stable));
 
 		if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
 			continue;
@@ -462,20 +555,26 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
 		mutex_lock(&array->aux->poke_mutex);
 		target = array->ptrs[poke->tail_call.key];
 		if (target) {
-			/* Plain memcpy is used when image is not live yet
-			 * and still not locked as read-only. Once poke
-			 * location is active (poke->ip_stable), any parallel
+			/* Plain memcpy is used when image is not live yet and
+			 * still not locked as read-only. Once poke location is
+			 * active (poke->tailcall_target_stable), any parallel
 			 * bpf_arch_text_poke() might occur still on the
 			 * read-write image until we finally locked it as
 			 * read-only. Both modifications on the given image
 			 * are under text_mutex to avoid interference.
 			 */
-			ret = __bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP, NULL,
+			ret = __bpf_arch_text_poke(poke->tailcall_target,
+						   BPF_MOD_JUMP, NULL,
 						   (u8 *)target->bpf_func +
 						   poke->adj_off, false);
 			BUG_ON(ret < 0);
+			ret = __bpf_arch_text_poke(poke->tailcall_bypass,
+						   BPF_MOD_JUMP,
+						   (u8 *)poke->tailcall_target +
+						   X86_PATCH_SIZE, NULL, false);
+			BUG_ON(ret < 0);
 		}
-		WRITE_ONCE(poke->ip_stable, true);
+		WRITE_ONCE(poke->tailcall_target_stable, true);
 		mutex_unlock(&array->aux->poke_mutex);
 	}
 }
@@ -652,19 +751,44 @@ static bool ex_handler_bpf(const struct exception_table_entry *x,
 	return true;
 }
 
+static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
+			     bool *regs_used, bool *tail_call_seen)
+{
+	int i;
+
+	for (i = 1; i <= insn_cnt; i++, insn++) {
+		if (insn->code == (BPF_JMP | BPF_TAIL_CALL))
+			*tail_call_seen = true;
+		if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6)
+			regs_used[0] = true;
+		if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7)
+			regs_used[1] = true;
+		if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8)
+			regs_used[2] = true;
+		if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9)
+			regs_used[3] = true;
+	}
+}
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		  int oldproglen, struct jit_context *ctx)
 {
 	struct bpf_insn *insn = bpf_prog->insnsi;
+	bool callee_regs_used[4] = {};
 	int insn_cnt = bpf_prog->len;
+	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
 	int i, cnt = 0, excnt = 0;
 	int proglen = 0;
 	u8 *prog = temp;
 
+	detect_reg_usage(insn, insn_cnt, callee_regs_used,
+			 &tail_call_seen);
+
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
-		      bpf_prog_was_classic(bpf_prog));
+		      bpf_prog_was_classic(bpf_prog), tail_call_seen);
+	push_callee_regs(&prog, callee_regs_used);
 	addrs[0] = prog - temp;
 
 	for (i = 1; i <= insn_cnt; i++, insn++) {
@@ -1109,9 +1233,13 @@ xadd:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_TAIL_CALL:
 			if (imm32)
 				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
-							  &prog, addrs[i], image);
+							  &prog, addrs[i], image,
+							  callee_regs_used,
+							  bpf_prog->aux->stack_depth);
 			else
-				emit_bpf_tail_call_indirect(&prog);
+				emit_bpf_tail_call_indirect(&prog,
+							    callee_regs_used,
+							    bpf_prog->aux->stack_depth);
 			break;
 
 			/* cond jump */
@@ -1294,12 +1422,9 @@ xadd:			if (is_imm8(insn->off))
 			seen_exit = true;
 			/* Update cleanup_addr */
 			ctx->cleanup_addr = proglen;
-			if (!bpf_prog_was_classic(bpf_prog))
-				EMIT1(0x5B); /* get rid of tail_call_cnt */
-			EMIT2(0x41, 0x5F);   /* pop r15 */
-			EMIT2(0x41, 0x5E);   /* pop r14 */
-			EMIT2(0x41, 0x5D);   /* pop r13 */
-			EMIT1(0x5B);         /* pop rbx */
+			pop_callee_regs(&prog, callee_regs_used);
+			if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen)
+				EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */
 			EMIT1(0xC9);         /* leave */
 			EMIT1(0xC3);         /* ret */
 			break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c67c88ad35f8..38897b9c7d61 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -651,14 +651,15 @@ enum bpf_jit_poke_reason {
 
 /* Descriptor of pokes pointing /into/ the JITed image. */
 struct bpf_jit_poke_descriptor {
-	void *ip;
+	void *tailcall_target;
+	void *tailcall_bypass;
 	union {
 		struct {
 			struct bpf_map *map;
 			u32 key;
 		} tail_call;
 	};
-	bool ip_stable;
+	bool tailcall_target_stable;
 	u8 adj_off;
 	u16 reason;
 };
@@ -1775,6 +1776,9 @@ enum bpf_text_poke_type {
 	BPF_MOD_JUMP,
 };
 
+/* Number of bytes emit_patch() needs to generate instructions */
+#define X86_PATCH_SIZE		5
+
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index c66e8273fccd..d15729a3f46c 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 				    struct bpf_prog *old,
 				    struct bpf_prog *new)
 {
+	u8 *bypass_addr, *old_addr, *new_addr;
 	struct prog_poke_elem *elem;
 	struct bpf_array_aux *aux;
 
@@ -770,12 +771,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			 *    there could be danger of use after free otherwise.
 			 * 2) Initially when we start tracking aux, the program
 			 *    is not JITed yet and also does not have a kallsyms
-			 *    entry. We skip these as poke->ip_stable is not
-			 *    active yet. The JIT will do the final fixup before
-			 *    setting it stable. The various poke->ip_stable are
-			 *    successively activated, so tail call updates can
-			 *    arrive from here while JIT is still finishing its
-			 *    final fixup for non-activated poke entries.
+			 *    entry. We skip these as poke->tailcall_target_stable
+			 *    is not active yet. The JIT will do the final fixup
+			 *    before setting it stable. The various
+			 *    poke->tailcall_target_stable are successively activated,
+			 *    so tail call updates can arrive from here while JIT
+			 *    is still finishing its final fixup for non-activated
+			 *    poke entries.
 			 * 3) On program teardown, the program's kallsym entry gets
 			 *    removed out of RCU callback, but we can only untrack
 			 *    from sleepable context, therefore bpf_arch_text_poke()
@@ -792,20 +794,53 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			 * 5) Any other error happening below from bpf_arch_text_poke()
 			 *    is a unexpected bug.
 			 */
-			if (!READ_ONCE(poke->ip_stable))
+			if (!READ_ONCE(poke->tailcall_target_stable))
 				continue;
 			if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
 				continue;
 			if (poke->tail_call.map != map ||
 			    poke->tail_call.key != key)
 				continue;
+			/* protect against un-updated poke descriptors since
+			 * we could fill them from subprog and the same desc
+			 * is present on main's program poke tab
+			 */
+			if (!poke->tailcall_bypass || !poke->tailcall_target)
+				continue;
+
+			if (!old && !new)
+				continue;
 
-			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
-						 old ? (u8 *)old->bpf_func +
-						 poke->adj_off : NULL,
-						 new ? (u8 *)new->bpf_func +
-						 poke->adj_off : NULL);
-			BUG_ON(ret < 0 && ret != -EINVAL);
+			bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;
+			old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
+			new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
+
+			if (new) {
+				ret = bpf_arch_text_poke(poke->tailcall_target,
+							 BPF_MOD_JUMP,
+							 old_addr, new_addr);
+				BUG_ON(ret < 0 && ret != -EINVAL);
+				if (!old) {
+					ret = bpf_arch_text_poke(poke->tailcall_bypass,
+								 BPF_MOD_JUMP,
+								 bypass_addr, NULL);
+					BUG_ON(ret < 0 && ret != -EINVAL);
+				}
+			} else {
+				ret = bpf_arch_text_poke(poke->tailcall_bypass,
+							 BPF_MOD_JUMP,
+							 NULL, bypass_addr);
+				BUG_ON(ret < 0 && ret != -EINVAL);
+				/* let other CPUs finish the execution of program
+				 * so that it will not possible to expose them
+				 * to invalid nop, stack unwind, nop state
+				 */
+				synchronize_rcu();
+				ret = bpf_arch_text_poke(poke->tailcall_target,
+							 BPF_MOD_JUMP,
+							 old_addr, NULL);
+				BUG_ON(ret < 0 && ret != -EINVAL);
+			}
 		}
 	}
 }
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9df4cc9a2907..cebf12a2cbbf 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -773,7 +773,8 @@ int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 
 	if (size > poke_tab_max)
 		return -ENOSPC;
-	if (poke->ip || poke->ip_stable || poke->adj_off)
+	if (poke->tailcall_target || poke->tailcall_target_stable ||
+	    poke->tailcall_bypass || poke->adj_off)
 		return -EINVAL;
 
 	switch (poke->reason) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall
  2020-07-15 23:36 [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2020-07-15 23:36 ` [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
@ 2020-07-15 23:36 ` Maciej Fijalkowski
  4 siblings, 0 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-15 23:36 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Introduce a 6th test to the tailcalls kselftest that checks whether a
tailcall can be correctly executed from a BPF subprogram.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 85 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/tailcall6.c | 38 +++++++++
 2 files changed, 123 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index bb8fe646dd9f..192c94896809 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
+#include <network_helpers.h>
 
 /* test_tailcall_1 checks basic functionality by patching multiple locations
  * in a single program for a single tail call slot with nop->jmp, jmp->nop
@@ -472,6 +473,88 @@ static void test_tailcall_5(void)
 	bpf_object__close(obj);
 }
 
+/* test_tailcall_6 purpose is to make sure that tailcalls are working
+ * correctly in correlation with BPF subprograms
+ */
+static void test_tailcall_6(void)
+{
+	int err, map_fd, prog_fd, main_fd, i;
+	struct bpf_map *prog_array;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	__u32 retval, duration;
+	char prog_name[32];
+
+	err = bpf_prog_load("tailcall6.o", BPF_PROG_TYPE_SCHED_CLS, &obj,
+			    &prog_fd);
+	if (CHECK_FAIL(err))
+		return;
+
+	prog = bpf_object__find_program_by_title(obj, "classifier");
+	if (CHECK_FAIL(!prog))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (CHECK_FAIL(main_fd < 0))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (CHECK_FAIL(!prog_array))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (CHECK_FAIL(map_fd < 0))
+		goto out;
+
+	/* nop -> jmp */
+	for (i = 0; i < bpf_map__def(prog_array)->max_entries; i++) {
+		snprintf(prog_name, sizeof(prog_name), "classifier/%i", i);
+
+		prog = bpf_object__find_program_by_title(obj, prog_name);
+		if (CHECK_FAIL(!prog))
+			goto out;
+
+		prog_fd = bpf_program__fd(prog);
+		if (CHECK_FAIL(prog_fd < 0))
+			goto out;
+
+		err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+		if (CHECK_FAIL(err))
+			goto out;
+	}
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 1, "tailcall",
+	      "err %d errno %d retval %d\n", err, errno, retval);
+
+	/* jmp -> nop, call subprog that will do tailcall */
+	i = 1;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 0, "tailcall", "err %d errno %d retval %d\n",
+	      err, errno, retval);
+
+	/* make sure that subprog can access ctx and entry prog that
+	 * called this subprog can properly return
+	 */
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 108, "tailcall", "err %d errno %d retval %d\n",
+	      err, errno, retval);
+out:
+	bpf_object__close(obj);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -484,4 +567,6 @@ void test_tailcalls(void)
 		test_tailcall_4();
 	if (test__start_subtest("tailcall_5"))
 		test_tailcall_5();
+	if (test__start_subtest("tailcall_6"))
+		test_tailcall_6();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall6.c b/tools/testing/selftests/bpf/progs/tailcall6.c
new file mode 100644
index 000000000000..e72ca5869b58
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall6.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+#define TAIL_FUNC(x) 				\
+	SEC("classifier/" #x)			\
+	int bpf_func_##x(struct __sk_buff *skb)	\
+	{					\
+		return x;			\
+	}
+TAIL_FUNC(0)
+TAIL_FUNC(1)
+
+static __attribute__ ((noinline))
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call(skb, &jmp_table, 0);
+
+	return skb->len * 2;
+}
+
+SEC("classifier")
+int entry(struct __sk_buff *skb)
+{
+	bpf_tail_call(skb, &jmp_table, 1);
+
+	return subprog_tail(skb);
+}
+
+char __license[] SEC("license") = "GPL";
+int _version SEC("version") = 1;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines
  2020-07-15 23:36 ` [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
@ 2020-07-16 20:36   ` Daniel Borkmann
  2020-07-17  9:29     ` Maciej Fijalkowski
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2020-07-16 20:36 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> Currently, %rax is used to store the jump target when the x86-64 JIT
> emits the retpoline instructions that handle the indirect tailcall.
> 
> There is a plan to use %rax for a different purpose: storing the tail
> call counter. In order to preserve this value across tailcalls, use %rcx
> instead for jump target storage in the retpoline instructions.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>   arch/x86/include/asm/nospec-branch.h | 16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index e7752b4038ff..e491c3d9f227 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -314,19 +314,19 @@ static inline void mds_idle_clear_cpu_buffers(void)
>    *    lfence
>    *    jmp spec_trap
>    *  do_rop:
> - *    mov %rax,(%rsp) for x86_64
> + *    mov %rcx,(%rsp) for x86_64
>    *    mov %edx,(%esp) for x86_32
>    *    retq
>    *
>    * Without retpolines configured:
>    *
> - *    jmp *%rax for x86_64
> + *    jmp *%rcx for x86_64
>    *    jmp *%edx for x86_32
>    */
>   #ifdef CONFIG_RETPOLINE
>   # ifdef CONFIG_X86_64
> -#  define RETPOLINE_RAX_BPF_JIT_SIZE	17
> -#  define RETPOLINE_RAX_BPF_JIT()				\
> +#  define RETPOLINE_RCX_BPF_JIT_SIZE	17
> +#  define RETPOLINE_RCX_BPF_JIT()				\
>   do {								\
>   	EMIT1_off32(0xE8, 7);	 /* callq do_rop */		\
>   	/* spec_trap: */					\
> @@ -334,7 +334,7 @@ do {								\
>   	EMIT3(0x0F, 0xAE, 0xE8); /* lfence */			\
>   	EMIT2(0xEB, 0xF9);       /* jmp spec_trap */		\
>   	/* do_rop: */						\
> -	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */	\
> +	EMIT4(0x48, 0x89, 0x0C, 0x24); /* mov %rcx,(%rsp) */	\
>   	EMIT1(0xC3);             /* retq */			\
>   } while (0)
>   # else /* !CONFIG_X86_64 */
> @@ -352,9 +352,9 @@ do {								\
>   # endif
>   #else /* !CONFIG_RETPOLINE */
>   # ifdef CONFIG_X86_64
> -#  define RETPOLINE_RAX_BPF_JIT_SIZE	2
> -#  define RETPOLINE_RAX_BPF_JIT()				\
> -	EMIT2(0xFF, 0xE0);       /* jmp *%rax */
> +#  define RETPOLINE_RCX_BPF_JIT_SIZE	2
> +#  define RETPOLINE_RCX_BPF_JIT()				\
> +	EMIT2(0xFF, 0xE1);       /* jmp *%rcx */

Hmm, so the target prog gets loaded into rax in emit_bpf_tail_call_indirect()
but then you jump into rcx? What am I missing? This still needs to be bisectable.

>   # else /* !CONFIG_X86_64 */
>   #  define RETPOLINE_EDX_BPF_JIT()				\
>   	EMIT2(0xFF, 0xE2)        /* jmp *%edx */
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms
  2020-07-15 23:36 ` [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
@ 2020-07-16 21:10   ` Daniel Borkmann
  2020-07-16 21:29   ` Daniel Borkmann
  1 sibling, 0 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-07-16 21:10 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> Relax verifier's restriction that was meant to forbid tailcall usage
> when subprog count was higher than 1.
> 
> Also, do not max out the stack depth of program that utilizes tailcalls.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

[nit: this patch also needs reordering]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms
  2020-07-15 23:36 ` [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
@ 2020-07-16 21:16   ` Daniel Borkmann
  2020-07-17  9:36     ` Maciej Fijalkowski
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2020-07-16 21:16 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> Previously, there was no need for poke descriptors being present in
> subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
> in them. Each subprog is JITed independently so in order to enable
> JITing such subprograms, simply copy poke descriptors from main program
> to subprogram's poke tab.
> 
> Also add the subprog's aux struct to the BPF map's poke_progs list by
> calling map_poke_track() on it.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>   kernel/bpf/verifier.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6481342b31ba..3b406b2860ef 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -9932,6 +9932,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>   		goto out_undo_insn;
>   
>   	for (i = 0; i < env->subprog_cnt; i++) {
> +		struct bpf_map *map_ptr;
> +		int j;
> +
>   		subprog_start = subprog_end;
>   		subprog_end = env->subprog_info[i + 1].start;
>   
> @@ -9956,6 +9959,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>   		func[i]->aux->btf = prog->aux->btf;
>   		func[i]->aux->func_info = prog->aux->func_info;
>   
> +		for (j = 0; j < prog->aux->size_poke_tab; j++) {
> +			bpf_jit_add_poke_descriptor(func[i], &prog->aux->poke_tab[j]);
> +			map_ptr = func[i]->aux->poke_tab[j].tail_call.map;
> +			map_ptr->ops->map_poke_track(map_ptr, func[i]->aux);

Error checking missing for bpf_jit_add_poke_descriptor() and map_poke_track() ..? It
must be guaranteed that adding this to the tracker must not fail, otherwise this will
be a real pain to debug given the prog will never be patched.
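
A rough sketch of the kind of error handling meant here (just an illustration,
assuming the existing 'out_free' unwind path in jit_subprogs() and a local
'ret' can be reused; not the actual fix):

	for (j = 0; j < prog->aux->size_poke_tab; j++) {
		ret = bpf_jit_add_poke_descriptor(func[i],
						  &prog->aux->poke_tab[j]);
		if (ret < 0) {
			verbose(env, "adding poke descriptor failed\n");
			goto out_free;
		}

		map_ptr = func[i]->aux->poke_tab[j].tail_call.map;
		ret = map_ptr->ops->map_poke_track(map_ptr, func[i]->aux);
		if (ret < 0) {
			verbose(env, "tracking tail call prog failed\n");
			goto out_free;
		}
	}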

> +		}
> +
>   		/* Use bpf_prog_F_tag to indicate functions in stack traces.
>   		 * Long term would need debug info to populate names
>   		 */
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms
  2020-07-15 23:36 ` [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
  2020-07-16 21:10   ` Daniel Borkmann
@ 2020-07-16 21:29   ` Daniel Borkmann
  2020-07-16 22:46     ` Daniel Borkmann
  1 sibling, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2020-07-16 21:29 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> Relax verifier's restriction that was meant to forbid tailcall usage
> when subprog count was higher than 1.
> 
> Also, do not max out the stack depth of program that utilizes tailcalls.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>   kernel/bpf/verifier.c | 5 -----
>   1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3c1efc9d08fd..6481342b31ba 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4172,10 +4172,6 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>   	case BPF_FUNC_tail_call:
>   		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
>   			goto error;
> -		if (env->subprog_cnt > 1) {
> -			verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
> -			return -EINVAL;
> -		}
>   		break;
>   	case BPF_FUNC_perf_event_read:
>   	case BPF_FUNC_perf_event_output:
> @@ -10252,7 +10248,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>   			 * the program array.
>   			 */
>   			prog->cb_access = 1;
> -			env->prog->aux->stack_depth = MAX_BPF_STACK;
>   			env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
>   
>   			/* mark bpf_tail_call as different opcode to avoid

Also, isn't this broken when JIT is not used (as in stack oob access)?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms
  2020-07-16 21:29   ` Daniel Borkmann
@ 2020-07-16 22:46     ` Daniel Borkmann
  2020-07-17 11:39       ` Maciej Fijalkowski
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Borkmann @ 2020-07-16 22:46 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/16/20 11:29 PM, Daniel Borkmann wrote:
> On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
>> Relax verifier's restriction that was meant to forbid tailcall usage
>> when subprog count was higher than 1.
>>
>> Also, do not max out the stack depth of program that utilizes tailcalls.
>>
>> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> ---
>>   kernel/bpf/verifier.c | 5 -----
>>   1 file changed, 5 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 3c1efc9d08fd..6481342b31ba 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -4172,10 +4172,6 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
>>       case BPF_FUNC_tail_call:
>>           if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
>>               goto error;
>> -        if (env->subprog_cnt > 1) {
>> -            verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
>> -            return -EINVAL;
>> -        }
>>           break;
>>       case BPF_FUNC_perf_event_read:
>>       case BPF_FUNC_perf_event_output:
>> @@ -10252,7 +10248,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>>                * the program array.
>>                */
>>               prog->cb_access = 1;
>> -            env->prog->aux->stack_depth = MAX_BPF_STACK;
>>               env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
>>               /* mark bpf_tail_call as different opcode to avoid
> 
> Also, isn't this broken when JIT is not used (as in stack oob access)?

(Similarly for non-x86 archs after this set.)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-15 23:36 ` [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
@ 2020-07-16 23:06   ` Daniel Borkmann
  2020-07-17  2:16     ` Alexei Starovoitov
  2020-07-17 10:52     ` Maciej Fijalkowski
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Borkmann @ 2020-07-16 23:06 UTC (permalink / raw)
  To: Maciej Fijalkowski, ast; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson

On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> This commit serves two things:
> 1) it optimizes BPF prologue/epilogue generation
> 2) it makes it possible to have tailcalls within BPF subprograms
> 
> Both points are related to each other since without 1), 2) could not be
> achieved.
> 
> In [1], Alexei says:
> "The prologue will look like:
> nop5
> xor eax,eax  // two new bytes if bpf_tail_call() is used in this
>               // function
> push rbp
> mov rbp, rsp
> sub rsp, rounded_stack_depth
> push rax // zero init tail_call counter
> variable number of push rbx,r13,r14,r15
> 
> Then bpf_tail_call will pop variable number rbx,..
> and final 'pop rax'
> Then 'add rsp, size_of_current_stack_frame'
> jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
> rbp, rsp'
> 
> This way new function will set its own stack size and will init tail
> call
> counter with whatever value the parent had.
> 
> If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> Instead it would need to have 'nop2' in there."
> 
> Implement that suggestion.
> 
> Since the stack layout has changed, tail call counter handling can no
> longer rely on popping it to rbx just like it has been handled for the
> constant prologue case, where rbx was later overwritten with the actual
> value of rbx pushed to the stack. Therefore, let's use one of the
> registers (%rcx) that is considered volatile/caller-saved and pop the
> value of the tail call counter into it in the epilogue.
> 
> Drop the BUILD_BUG_ON in emit_prologue and in
> emit_bpf_tail_call_indirect where instruction layout is not constant
> anymore.
> 
> Introduce a new poke target, 'tailcall_bypass', to the poke descriptor;
> it is dedicated to skipping the register pops and stack unwind that are
> generated right before the actual jump to the target program. Also
> reflect the actual purpose of poke->ip and rename it to
> poke->tailcall_target so that it will not be confused with the poke
> target that is being introduced here.
> For the case when the target program is not present, the BPF program
> will skip the pop instructions and the nop5 dedicated for jmpq $target.
> An example of such a state when only R6 of the callee saved registers is
> used by the program:
> 
> ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
> ffffffffc0513aa6:       5b                      pop    %rbx
> ffffffffc0513aa7:       58                      pop    %rax
> ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
> 
> When target program is inserted, the jump that was there to skip
> pops/nop5 will become the nop5, so CPU will go over pops and do the
> actual tailcall.
> 
> One might ask why there simply can not be pushes after the nop5?
> In the following example snippet:
> 
> ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
> (...)
> ffffffffc0370332:       5b                      pop    %rbx
> ffffffffc0370333:       58                      pop    %rax
> ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
> ffffffffc0370347:       50                      push   %rax
> ffffffffc0370348:       53                      push   %rbx
> ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
> ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
> 
> There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
> and the jump target is not present. ctx is in the %rbx register and the
> BPF subprogram that we will call into at ffffffffc037034c relies on it,
> e.g. it will pick ctx up from there. Such a code layout is therefore
> broken, as we would overwrite the content of %rbx with the value that
> was pushed in the prologue. That is the reason for the 'bypass' approach.
> 
> Special care needs to be taken during the install/update/remove of
> tailcall target. In case when target program is not present, the CPU
> must not execute the pop instructions that precede the tailcall.
> 
> To address that, the following states can be defined:
> A nop, unwind, nop
> B nop, unwind, tail
> C skip, unwind, nop
> D skip, unwind, tail
> 
> A is forbidden (it leads to incorrectness). The state transitions between
> tailcall install/update/remove will work as follows:
> 
> First install tail call f: C->D->B(f)
>   * poke the tailcall, after that get rid of the skip
> Update tail call f to f': B(f)->B(f')
>   * poke the tailcall (poke->tailcall_target) and do NOT touch the
>     poke->tailcall_bypass
> Remove tail call: B(f')->C(f')
>   * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
>     grace period so that other CPUs will finish the program's execution
>     and after that we are safe to remove the poke->tailcall_target
> Install new tail call (f''): C(f')->D(f'')->B(f'').
>   * same as first step
> 
> This way CPU can never be exposed to "unwind, tail" state.
> 
> For regression checks, 'tailcalls' kselftest was executed:
> $ sudo ./test_progs -t tailcalls
>   #64/1 tailcall_1:OK
>   #64/2 tailcall_2:OK
>   #64/3 tailcall_3:OK
>   #64/4 tailcall_4:OK
>   #64/5 tailcall_5:OK
>   #64 tailcalls:OK
> Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED
> 
> Tail call related cases from test_verifier kselftest are also working
> fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
> work properly as well.
> 
> [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
> 
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Overall approach looks reasonable to me. The patch here could still be cleaned up a
bit further, still very rough. Just minor comments below:

> ---
>   arch/x86/net/bpf_jit_comp.c | 241 +++++++++++++++++++++++++++---------
>   include/linux/bpf.h         |   8 +-
>   kernel/bpf/arraymap.c       |  61 +++++++--
>   kernel/bpf/core.c           |   3 +-
>   4 files changed, 239 insertions(+), 74 deletions(-)
> 
[...]
>   /*
> - * Emit x86-64 prologue code for BPF program and check its size.
> + * Emit x86-64 prologue code for BPF program.
>    * bpf_tail_call helper will skip it while jumping into another program
>    */
> -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> +			  bool tail_call)
>   {
>   	u8 *prog = *pprog;
>   	int cnt = X86_PATCH_SIZE;
> @@ -238,19 +269,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
>   	 */
>   	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
>   	prog += cnt;
> +	if (!ebpf_from_cbpf && tail_call)
> +		EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +	else
> +		EMIT2(0x66, 0x90);       /* nop2 */

nit: Why does the ebpf_from_cbpf need the extra nop?
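
(One possible restructuring, as a sketch only; whether cbpf-derived progs can
drop the filler bytes entirely depends on how the tail call skip offset ends
up being computed:

	if (!ebpf_from_cbpf) {
		if (tail_call)
			EMIT2(0x31, 0xC0);	/* xor eax, eax */
		else
			EMIT2(0x66, 0x90);	/* nop2 */
	}
)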

>   	EMIT1(0x55);             /* push rbp */
>   	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
>   	/* sub rsp, rounded_stack_depth */
>   	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> -	EMIT1(0x53);             /* push rbx */
> -	EMIT2(0x41, 0x55);       /* push r13 */
> -	EMIT2(0x41, 0x56);       /* push r14 */
> -	EMIT2(0x41, 0x57);       /* push r15 */
> -	if (!ebpf_from_cbpf) {
> -		/* zero init tail_call_cnt */
> -		EMIT2(0x6a, 0x00);
> -		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> -	}
> +	if (!ebpf_from_cbpf && tail_call)
> +		EMIT1(0x50);         /* push rax */
>   	*pprog = prog;
>   }
>   
[...]
> -static void emit_bpf_tail_call_indirect(u8 **pprog)
> +static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> +					u32 stack_depth)
>   {
>   	u8 *prog = *pprog;
> -	int label1, label2, label3;
> +	int pop_bytes = 0;
> +	int off1 = 49;
> +	int off2 = 38;
> +	int off3 = 16;
>   	int cnt = 0;
>   
> +	/* count the additional bytes used for popping callee regs from stack
> +	 * that need to be taken into account for each of the offsets that
> +	 * are used for bailing out of the tail call
> +	 */
> +	pop_bytes = get_pop_bytes(callee_regs_used);
> +	off1 += pop_bytes;
> +	off2 += pop_bytes;
> +	off3 += pop_bytes;
> +
>   	/*
>   	 * rdi - pointer to ctx
>   	 * rsi - pointer to bpf_array
> @@ -370,72 +427,108 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
>   	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
>   	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
>   	      offsetof(struct bpf_array, map.max_entries));
> -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> +#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */

The whole rename belongs into the first patch to avoid breaking bisectability
as mentioned.

>   	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
> -	label1 = cnt;
>   
>   	/*
>   	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
>   	 *	goto out;
>   	 */
> -	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
> +	EMIT2_off32(0x8B, 0x85                    /* mov eax, dword ptr [rbp - (4 + sd)] */,
> +		    -4 - round_up(stack_depth, 8));
>   	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
> +#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
>   	EMIT2(X86_JA, OFFSET2);                   /* ja out */
> -	label2 = cnt;
>   	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
> +	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
> +		    -4 - round_up(stack_depth, 8));

nit: should probably sit in a var
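
E.g. something like this (a sketch; 'tcc_off' is just an illustrative name,
and it is negative, i.e. the counter slot sits below rbp):

	int tcc_off = -4 - round_up(stack_depth, 8);
	...
	EMIT2_off32(0x8B, 0x85, tcc_off);	/* mov eax, dword ptr [rbp + tcc_off] */
	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);	/* cmp eax, MAX_TAIL_CALL_CNT */
	...
	EMIT2_off32(0x89, 0x85, tcc_off);	/* mov dword ptr [rbp + tcc_off], eax */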

>   
>   	/* prog = array->ptrs[index]; */
> -	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
> +	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,        /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
>   		    offsetof(struct bpf_array, ptrs));
>   
>   	/*
>   	 * if (prog == NULL)
>   	 *	goto out;
>   	 */
> -	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
> -#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
> -	EMIT2(X86_JE, OFFSET3);                   /* je out */
> -	label3 = cnt;
> +	EMIT3(0x48, 0x85, 0xC9);                   /* test rcx,rcx */
> +#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
> +	EMIT2(X86_JE, OFFSET3);                    /* je out */
>   
[...]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c67c88ad35f8..38897b9c7d61 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -651,14 +651,15 @@ enum bpf_jit_poke_reason {
>   
>   /* Descriptor of pokes pointing /into/ the JITed image. */
>   struct bpf_jit_poke_descriptor {
> -	void *ip;
> +	void *tailcall_target;
> +	void *tailcall_bypass;
>   	union {
>   		struct {
>   			struct bpf_map *map;
>   			u32 key;
>   		} tail_call;
>   	};
> -	bool ip_stable;
> +	bool tailcall_target_stable;

Probably makes sense to split off the pure rename into a separate patch to
reduce this one slightly.

>   	u8 adj_off;
>   	u16 reason;
>   };
> @@ -1775,6 +1776,9 @@ enum bpf_text_poke_type {
>   	BPF_MOD_JUMP,
>   };
>   
> +/* Number of bytes emit_patch() needs to generate instructions */
> +#define X86_PATCH_SIZE		5

nit: this is arch specific, so should not be exposed in here, neither in
arraymap.c below

>   int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
>   		       void *addr1, void *addr2);
>   
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index c66e8273fccd..d15729a3f46c 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
>   				    struct bpf_prog *old,
>   				    struct bpf_prog *new)
>   {
> +	u8 *bypass_addr, *old_addr, *new_addr;
>   	struct prog_poke_elem *elem;
>   	struct bpf_array_aux *aux;
>   
> @@ -770,12 +771,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
>   			 *    there could be danger of use after free otherwise.
>   			 * 2) Initially when we start tracking aux, the program
>   			 *    is not JITed yet and also does not have a kallsyms
> -			 *    entry. We skip these as poke->ip_stable is not
> -			 *    active yet. The JIT will do the final fixup before
> -			 *    setting it stable. The various poke->ip_stable are
> -			 *    successively activated, so tail call updates can
> -			 *    arrive from here while JIT is still finishing its
> -			 *    final fixup for non-activated poke entries.
> +			 *    entry. We skip these as poke->tailcall_target_stable
> +			 *    is not active yet. The JIT will do the final fixup
> +			 *    before setting it stable. The various
> +			 *    poke->tailcall_target_stable are successively activated,
> +			 *    so tail call updates can arrive from here while JIT
> +			 *    is still finishing its final fixup for non-activated
> +			 *    poke entries.
>   			 * 3) On program teardown, the program's kallsym entry gets
>   			 *    removed out of RCU callback, but we can only untrack
>   			 *    from sleepable context, therefore bpf_arch_text_poke()
> @@ -792,20 +794,53 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
>   			 * 5) Any other error happening below from bpf_arch_text_poke()
>   			 *    is a unexpected bug.
>   			 */
> -			if (!READ_ONCE(poke->ip_stable))
> +			if (!READ_ONCE(poke->tailcall_target_stable))
>   				continue;
>   			if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
>   				continue;
>   			if (poke->tail_call.map != map ||
>   			    poke->tail_call.key != key)
>   				continue;
> +			/* protect against un-updated poke descriptors since
> +			 * we could fill them from subprog and the same desc
> +			 * is present on main's program poke tab
> +			 */
> +			if (!poke->tailcall_bypass || !poke->tailcall_target)
> +				continue;

Can't we avoid copying these descriptors over to the subprog in the first place?

> +			if (!old && !new)
> +				continue;

Could we avoid this above but instead signal via bpf_arch_text_poke() that nothing
had to be patched? Reason is that bpf_arch_text_poke() will still do the sanity
check to make sure reality meets expectation wrt current insns (which is also
why I didn't add this skip). In that case we could then just avoid the expensive
synchronize_rcu().

> -			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
> -						 old ? (u8 *)old->bpf_func +
> -						 poke->adj_off : NULL,
> -						 new ? (u8 *)new->bpf_func +
> -						 poke->adj_off : NULL);
> -			BUG_ON(ret < 0 && ret != -EINVAL);
> +			bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;
> +			old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
> +			new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
> +
> +			if (new) {
> +				ret = bpf_arch_text_poke(poke->tailcall_target,
> +							 BPF_MOD_JUMP,
> +							 old_addr, new_addr);
> +				BUG_ON(ret < 0 && ret != -EINVAL);
> +				if (!old) {
> +					ret = bpf_arch_text_poke(poke->tailcall_bypass,
> +								 BPF_MOD_JUMP,
> +								 bypass_addr, NULL);
> +					BUG_ON(ret < 0 && ret != -EINVAL);
> +				}
> +			} else {
> +				ret = bpf_arch_text_poke(poke->tailcall_bypass,
> +							 BPF_MOD_JUMP,
> +							 NULL, bypass_addr);
> +				BUG_ON(ret < 0 && ret != -EINVAL);
> +				/* let other CPUs finish the execution of program
> +				 * so that it will not possible to expose them
> +				 * to invalid nop, stack unwind, nop state
> +				 */
> +				synchronize_rcu();

Very heavyweight that we need to potentially call this /multiple/ times for just a
/single/ map update under poke mutex even ... but agree it's needed here to avoid
racing. :(

> +				ret = bpf_arch_text_poke(poke->tailcall_target,
> +							 BPF_MOD_JUMP,
> +							 old_addr, NULL);
> +				BUG_ON(ret < 0 && ret != -EINVAL);
> +			}
>   		}
>   	}
>   }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-16 23:06   ` Daniel Borkmann
@ 2020-07-17  2:16     ` Alexei Starovoitov
  2020-07-17 10:57       ` Maciej Fijalkowski
  2020-07-17 10:52     ` Maciej Fijalkowski
  1 sibling, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2020-07-17  2:16 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Maciej Fijalkowski, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> > +				ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +							 BPF_MOD_JUMP,
> > +							 NULL, bypass_addr);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +				/* let other CPUs finish the execution of program
> > +				 * so that it will not possible to expose them
> > +				 * to invalid nop, stack unwind, nop state
> > +				 */
> > +				synchronize_rcu();
> 
> Very heavyweight that we need to potentially call this /multiple/ times for just a
> /single/ map update under poke mutex even ... but agree it's needed here to avoid
> racing. :(

Yeah. I wasn't clear with my suggestion earlier.
I meant to say that synchronize_rcu() can be done between two loops.
list_for_each_entry(elem, &aux->poke_progs, list)
   for (i = 0; i < elem->aux->size_poke_tab; i++)
        bpf_arch_text_poke(poke->tailcall_bypass, ...
synchronize_rcu();
list_for_each_entry(elem, &aux->poke_progs, list)
   for (i = 0; i < elem->aux->size_poke_tab; i++)
        bpf_arch_text_poke(poke->tailcall_target, ...

Not sure how much better it will be though.
text_poke is heavy.
I think it's heavier than synchronize_rcu().
Long term we can do batch of text_poke-s.

I'm actually fine with above approach of synchronize_rcu() without splitting the loop.
This kind of optimizations can be done later as a follow up.
I'd really like to land this stuff in this bpf-next cycle.
It's a big improvement to tail_calls and bpf2bpf calls.
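
(Spelling the split-loop idea out a bit more, as a sketch only; the map/key
filtering, the stability checks and the error handling from the patch are
omitted here for brevity:

	list_for_each_entry(elem, &aux->poke_progs, list)
		for (i = 0; i < elem->aux->size_poke_tab; i++)
			/* 1st pass: make the unwind path unreachable again */
			bpf_arch_text_poke(elem->aux->poke_tab[i].tailcall_bypass,
					   BPF_MOD_JUMP, NULL, bypass_addr);

	/* one grace period for the whole update instead of one per descriptor */
	synchronize_rcu();

	list_for_each_entry(elem, &aux->poke_progs, list)
		for (i = 0; i < elem->aux->size_poke_tab; i++)
			/* 2nd pass: now the target can be cleared safely */
			bpf_arch_text_poke(elem->aux->poke_tab[i].tailcall_target,
					   BPF_MOD_JUMP, old_addr, NULL);
)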

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines
  2020-07-16 20:36   ` Daniel Borkmann
@ 2020-07-17  9:29     ` Maciej Fijalkowski
  0 siblings, 0 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-17  9:29 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Maciej Fijalkowski, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Thu, Jul 16, 2020 at 10:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> > Currently, %rax is used to store the jump target in the retpoline
> > instructions that the JIT emits for handling the indirect tailcall.
> >
> > There is a plan to use %rax for a different purpose, which is storing
> > the tail call counter. In order to preserve this value across tailcalls,
> > use %rcx instead for jump target storage in the retpoline instructions.
> >
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >   arch/x86/include/asm/nospec-branch.h | 16 ++++++++--------
> >   1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> > index e7752b4038ff..e491c3d9f227 100644
> > --- a/arch/x86/include/asm/nospec-branch.h
> > +++ b/arch/x86/include/asm/nospec-branch.h
> > @@ -314,19 +314,19 @@ static inline void mds_idle_clear_cpu_buffers(void)
> >    *    lfence
> >    *    jmp spec_trap
> >    *  do_rop:
> > - *    mov %rax,(%rsp) for x86_64
> > + *    mov %rcx,(%rsp) for x86_64
> >    *    mov %edx,(%esp) for x86_32
> >    *    retq
> >    *
> >    * Without retpolines configured:
> >    *
> > - *    jmp *%rax for x86_64
> > + *    jmp *%rcx for x86_64
> >    *    jmp *%edx for x86_32
> >    */
> >   #ifdef CONFIG_RETPOLINE
> >   # ifdef CONFIG_X86_64
> > -#  define RETPOLINE_RAX_BPF_JIT_SIZE 17
> > -#  define RETPOLINE_RAX_BPF_JIT()                            \
> > +#  define RETPOLINE_RCX_BPF_JIT_SIZE 17
> > +#  define RETPOLINE_RCX_BPF_JIT()                            \
> >   do {                                                                \
> >       EMIT1_off32(0xE8, 7);    /* callq do_rop */             \
> >       /* spec_trap: */                                        \
> > @@ -334,7 +334,7 @@ do {                                                              \
> >       EMIT3(0x0F, 0xAE, 0xE8); /* lfence */                   \
> >       EMIT2(0xEB, 0xF9);       /* jmp spec_trap */            \
> >       /* do_rop: */                                           \
> > -     EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */    \
> > +     EMIT4(0x48, 0x89, 0x0C, 0x24); /* mov %rcx,(%rsp) */    \
> >       EMIT1(0xC3);             /* retq */                     \
> >   } while (0)
> >   # else /* !CONFIG_X86_64 */
> > @@ -352,9 +352,9 @@ do {                                                              \
> >   # endif
> >   #else /* !CONFIG_RETPOLINE */
> >   # ifdef CONFIG_X86_64
> > -#  define RETPOLINE_RAX_BPF_JIT_SIZE 2
> > -#  define RETPOLINE_RAX_BPF_JIT()                            \
> > -     EMIT2(0xFF, 0xE0);       /* jmp *%rax */
> > +#  define RETPOLINE_RCX_BPF_JIT_SIZE 2
> > +#  define RETPOLINE_RCX_BPF_JIT()                            \
> > +     EMIT2(0xFF, 0xE1);       /* jmp *%rcx */
>
> Hmm, so the target prog gets loaded into rax in emit_bpf_tail_call_indirect()
> but then you jump into rcx? What am I missing? This still needs to be bisectable.

Somehow your comments on patches 1, 2 and 3 didn't arrive at my work mail.
I'm responding from web-gmail as my client seems to be broken and I am
in a bit of a hurry, so apologies for any inconvenience...

You are right of course, I will include the JIT change in this patch on v2.

>
> >   # else /* !CONFIG_X86_64 */
> >   #  define RETPOLINE_EDX_BPF_JIT()                           \
> >       EMIT2(0xFF, 0xE2)        /* jmp *%edx */
> >
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms
  2020-07-16 21:16   ` Daniel Borkmann
@ 2020-07-17  9:36     ` Maciej Fijalkowski
  0 siblings, 0 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-17  9:36 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Maciej Fijalkowski, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Thu, Jul 16, 2020 at 11:18 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> > Previously, there was no need for poke descriptors being present in
> > subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
> > in them. Each subprog is JITed independently so in order to enable
> > JITing such subprograms, simply copy poke descriptors from main program
> > to subprogram's poke tab.
> >
> > Also add the subprog's aux struct to the BPF map's poke_progs list by
> > calling map_poke_track() on it.
> >
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >   kernel/bpf/verifier.c | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 6481342b31ba..3b406b2860ef 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -9932,6 +9932,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> >               goto out_undo_insn;
> >
> >       for (i = 0; i < env->subprog_cnt; i++) {
> > +             struct bpf_map *map_ptr;
> > +             int j;
> > +
> >               subprog_start = subprog_end;
> >               subprog_end = env->subprog_info[i + 1].start;
> >
> > @@ -9956,6 +9959,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> >               func[i]->aux->btf = prog->aux->btf;
> >               func[i]->aux->func_info = prog->aux->func_info;
> >
> > +             for (j = 0; j < prog->aux->size_poke_tab; j++) {
> > +                     bpf_jit_add_poke_descriptor(func[i], &prog->aux->poke_tab[j]);
> > +                     map_ptr = func[i]->aux->poke_tab[j].tail_call.map;
> > +                     map_ptr->ops->map_poke_track(map_ptr, func[i]->aux);
>
> Error checking missing for bpf_jit_add_poke_descriptor() and map_poke_track() ..? It
> must be guaranteed that adding this to the tracker must not fail, otherwise this will
> be a real pain to debug given the prog will never be patched.

My bad, will fix it in v2.

>
> > +             }
> > +
> >               /* Use bpf_prog_F_tag to indicate functions in stack traces.
> >                * Long term would need debug info to populate names
> >                */
> >
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-16 23:06   ` Daniel Borkmann
  2020-07-17  2:16     ` Alexei Starovoitov
@ 2020-07-17 10:52     ` Maciej Fijalkowski
  1 sibling, 0 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-17 10:52 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> > This commit serves two things:
> > 1) it optimizes BPF prologue/epilogue generation
> > 2) it makes it possible to have tailcalls within BPF subprograms
> > 
> > Both points are related to each other since without 1), 2) could not be
> > achieved.
> > 
> > In [1], Alexei says:
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> >               // function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> > 
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
> > rbp, rsp'
> > 
> > This way new function will set its own stack size and will init tail
> > call
> > counter with whatever value the parent had.
> > 
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there."
> > 
> > Implement that suggestion.
> > 
> > Since the stack layout has changed, tail call counter handling can no
> > longer rely on popping it to rbx just like it has been handled for the
> > constant prologue case, where rbx was later overwritten with the actual
> > value of rbx pushed to the stack. Therefore, let's use one of the
> > registers (%rcx) that is considered volatile/caller-saved and pop the
> > value of the tail call counter into it in the epilogue.
> > 
> > Drop the BUILD_BUG_ON in emit_prologue and in
> > emit_bpf_tail_call_indirect where instruction layout is not constant
> > anymore.
> > 
> > Introduce a new poke target, 'tailcall_bypass', to the poke descriptor;
> > it is dedicated to skipping the register pops and stack unwind that are
> > generated right before the actual jump to the target program. Also
> > reflect the actual purpose of poke->ip and rename it to
> > poke->tailcall_target so that it will not be confused with the poke
> > target that is being introduced here.
> > For the case when the target program is not present, the BPF program
> > will skip the pop instructions and the nop5 dedicated for jmpq $target.
> > An example of such a state when only R6 of the callee saved registers is
> > used by the program:
> > 
> > ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
> > ffffffffc0513aa6:       5b                      pop    %rbx
> > ffffffffc0513aa7:       58                      pop    %rax
> > ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
> > 
> > When target program is inserted, the jump that was there to skip
> > pops/nop5 will become the nop5, so CPU will go over pops and do the
> > actual tailcall.
> > 
> > One might ask why there simply can not be pushes after the nop5?
> > In the following example snippet:
> > 
> > ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
> > (...)
> > ffffffffc0370332:       5b                      pop    %rbx
> > ffffffffc0370333:       58                      pop    %rax
> > ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
> > ffffffffc0370347:       50                      push   %rax
> > ffffffffc0370348:       53                      push   %rbx
> > ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
> > ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
> > 
> > There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
> > and the jump target is not present. ctx is in the %rbx register and the
> > BPF subprogram that we will call into at ffffffffc037034c relies on it,
> > e.g. it will pick ctx up from there. Such a code layout is therefore
> > broken, as we would overwrite the content of %rbx with the value that
> > was pushed in the prologue. That is the reason for the 'bypass' approach.
> > 
> > Special care needs to be taken during the install/update/remove of
> > tailcall target. In case when target program is not present, the CPU
> > must not execute the pop instructions that precede the tailcall.
> > 
> > To address that, the following states can be defined:
> > A nop, unwind, nop
> > B nop, unwind, tail
> > C skip, unwind, nop
> > D skip, unwind, tail
> > 
> > A is forbidden (it leads to incorrectness). The state transitions between
> > tailcall install/update/remove will work as follows:
> > 
> > First install tail call f: C->D->B(f)
> >   * poke the tailcall, after that get rid of the skip
> > Update tail call f to f': B(f)->B(f')
> >   * poke the tailcall (poke->tailcall_target) and do NOT touch the
> >     poke->tailcall_bypass
> > Remove tail call: B(f')->C(f')
> >   * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
> >     grace period so that other CPUs will finish the program's execution
> >     and after that we are safe to remove the poke->tailcall_target
> > Install new tail call (f''): C(f')->D(f'')->B(f'').
> >   * same as first step
> > 
> > This way CPU can never be exposed to "unwind, tail" state.
> > 
> > For regression checks, 'tailcalls' kselftest was executed:
> > $ sudo ./test_progs -t tailcalls
> >   #64/1 tailcall_1:OK
> >   #64/2 tailcall_2:OK
> >   #64/3 tailcall_3:OK
> >   #64/4 tailcall_4:OK
> >   #64/5 tailcall_5:OK
> >   #64 tailcalls:OK
> > Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > Tail call related cases from test_verifier kselftest are also working
> > fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
> > work properly as well.
> > 
> > [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
> > 
> > Suggested-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> Overall approach looks reasonable to me. The patch here could still be cleaned up a
> bit further, still very rough. Just minor comments below:

Thank you for spotting all of the issues. I will provide a v2 on Monday, as
from today until Sunday I will be out of reach.

> 
> > ---
> >   arch/x86/net/bpf_jit_comp.c | 241 +++++++++++++++++++++++++++---------
> >   include/linux/bpf.h         |   8 +-
> >   kernel/bpf/arraymap.c       |  61 +++++++--
> >   kernel/bpf/core.c           |   3 +-
> >   4 files changed, 239 insertions(+), 74 deletions(-)
> > 
> [...]
> >   /*
> > - * Emit x86-64 prologue code for BPF program and check its size.
> > + * Emit x86-64 prologue code for BPF program.
> >    * bpf_tail_call helper will skip it while jumping into another program
> >    */
> > -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> > +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > +			  bool tail_call)
> >   {
> >   	u8 *prog = *pprog;
> >   	int cnt = X86_PATCH_SIZE;
> > @@ -238,19 +269,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> >   	 */
> >   	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
> >   	prog += cnt;
> > +	if (!ebpf_from_cbpf && tail_call)
> > +		EMIT2(0x31, 0xC0);       /* xor eax, eax */
> > +	else
> > +		EMIT2(0x66, 0x90);       /* nop2 */
> 
> nit: Why does the ebpf_from_cbpf need the extra nop?

Good catch, it doesn't need it :)

> 
> >   	EMIT1(0x55);             /* push rbp */
> >   	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> >   	/* sub rsp, rounded_stack_depth */
> >   	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> > -	EMIT1(0x53);             /* push rbx */
> > -	EMIT2(0x41, 0x55);       /* push r13 */
> > -	EMIT2(0x41, 0x56);       /* push r14 */
> > -	EMIT2(0x41, 0x57);       /* push r15 */
> > -	if (!ebpf_from_cbpf) {
> > -		/* zero init tail_call_cnt */
> > -		EMIT2(0x6a, 0x00);
> > -		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> > -	}
> > +	if (!ebpf_from_cbpf && tail_call)
> > +		EMIT1(0x50);         /* push rax */
> >   	*pprog = prog;
> >   }
> [...]
> > -static void emit_bpf_tail_call_indirect(u8 **pprog)
> > +static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> > +					u32 stack_depth)
> >   {
> >   	u8 *prog = *pprog;
> > -	int label1, label2, label3;
> > +	int pop_bytes = 0;
> > +	int off1 = 49;
> > +	int off2 = 38;
> > +	int off3 = 16;
> >   	int cnt = 0;
> > +	/* count the additional bytes used for popping callee regs from stack
> > +	 * that need to be taken into account for each of the offsets that
> > +	 * are used for bailing out of the tail call
> > +	 */
> > +	pop_bytes = get_pop_bytes(callee_regs_used);
> > +	off1 += pop_bytes;
> > +	off2 += pop_bytes;
> > +	off3 += pop_bytes;
> > +
> >   	/*
> >   	 * rdi - pointer to ctx
> >   	 * rsi - pointer to bpf_array
> > @@ -370,72 +427,108 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
> >   	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
> >   	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
> >   	      offsetof(struct bpf_array, map.max_entries));
> > -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> > +#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
> 
> The whole rename belongs into the first patch to avoid breaking bisectability
> as mentioned.

Ack.

> 
> >   	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
> > -	label1 = cnt;
> >   	/*
> >   	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
> >   	 *	goto out;
> >   	 */
> > -	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
> > +	EMIT2_off32(0x8B, 0x85                    /* mov eax, dword ptr [rbp - (4 + sd)] */,
> > +		    -4 - round_up(stack_depth, 8));
> >   	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> > -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
> > +#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
> >   	EMIT2(X86_JA, OFFSET2);                   /* ja out */
> > -	label2 = cnt;
> >   	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> > -	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
> > +	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
> > +		    -4 - round_up(stack_depth, 8));
> 
> nit: should probably sit in a var

Sure.

> 
> >   	/* prog = array->ptrs[index]; */
> > -	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
> > +	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,        /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> >   		    offsetof(struct bpf_array, ptrs));
> >   	/*
> >   	 * if (prog == NULL)
> >   	 *	goto out;
> >   	 */
> > -	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
> > -#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
> > -	EMIT2(X86_JE, OFFSET3);                   /* je out */
> > -	label3 = cnt;
> > +	EMIT3(0x48, 0x85, 0xC9);                   /* test rcx,rcx */
> > +#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
> > +	EMIT2(X86_JE, OFFSET3);                    /* je out */
> [...]
> 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index c67c88ad35f8..38897b9c7d61 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -651,14 +651,15 @@ enum bpf_jit_poke_reason {
> >   /* Descriptor of pokes pointing /into/ the JITed image. */
> >   struct bpf_jit_poke_descriptor {
> > -	void *ip;
> > +	void *tailcall_target;
> > +	void *tailcall_bypass;
> >   	union {
> >   		struct {
> >   			struct bpf_map *map;
> >   			u32 key;
> >   		} tail_call;
> >   	};
> > -	bool ip_stable;
> > +	bool tailcall_target_stable;
> 
> Probably makes sense to split off the pure rename into a separate patch to
> reduce this one slightly.

I was thinking of that as well. I will pull out as you're suggesting.

> 
> >   	u8 adj_off;
> >   	u16 reason;
> >   };
> > @@ -1775,6 +1776,9 @@ enum bpf_text_poke_type {
> >   	BPF_MOD_JUMP,
> >   };
> > +/* Number of bytes emit_patch() needs to generate instructions */
> > +#define X86_PATCH_SIZE		5
> 
> nit: this is arch specific, so should not be exposed in here, neither in
> arraymap.c below

Okay, so I think that we should add another member to the poke descriptor
that will specifically hold the bypass address so that we wouldn't have to
calculate it in here. And I think this extension should go into this patch,
whereas the renaming should go into a separate one.
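
Something along these lines, perhaps (a sketch; 'bypass_addr' is just a
placeholder name for the new member):

	struct bpf_jit_poke_descriptor {
		void *tailcall_target;
		void *tailcall_bypass;
		void *bypass_addr;	/* filled in by the JIT, so the arch
					 * specific X86_PATCH_SIZE math stays
					 * out of arraymap.c
					 */
		union {
			struct {
				struct bpf_map *map;
				u32 key;
			} tail_call;
		};
		bool tailcall_target_stable;
		u8 adj_off;
		u16 reason;
	};

prog_array_map_poke_run() could then use poke->bypass_addr directly instead
of computing poke->tailcall_target + X86_PATCH_SIZE itself.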

> 
> >   int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
> >   		       void *addr1, void *addr2);
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index c66e8273fccd..d15729a3f46c 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >   				    struct bpf_prog *old,
> >   				    struct bpf_prog *new)
> >   {
> > +	u8 *bypass_addr, *old_addr, *new_addr;
> >   	struct prog_poke_elem *elem;
> >   	struct bpf_array_aux *aux;
> > @@ -770,12 +771,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >   			 *    there could be danger of use after free otherwise.
> >   			 * 2) Initially when we start tracking aux, the program
> >   			 *    is not JITed yet and also does not have a kallsyms
> > -			 *    entry. We skip these as poke->ip_stable is not
> > -			 *    active yet. The JIT will do the final fixup before
> > -			 *    setting it stable. The various poke->ip_stable are
> > -			 *    successively activated, so tail call updates can
> > -			 *    arrive from here while JIT is still finishing its
> > -			 *    final fixup for non-activated poke entries.
> > +			 *    entry. We skip these as poke->tailcall_target_stable
> > +			 *    is not active yet. The JIT will do the final fixup
> > +			 *    before setting it stable. The various
> > +			 *    poke->tailcall_target_stable are successively activated,
> > +			 *    so tail call updates can arrive from here while JIT
> > +			 *    is still finishing its final fixup for non-activated
> > +			 *    poke entries.
> >   			 * 3) On program teardown, the program's kallsym entry gets
> >   			 *    removed out of RCU callback, but we can only untrack
> >   			 *    from sleepable context, therefore bpf_arch_text_poke()
> > @@ -792,20 +794,53 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >   			 * 5) Any other error happening below from bpf_arch_text_poke()
> >   			 *    is a unexpected bug.
> >   			 */
> > -			if (!READ_ONCE(poke->ip_stable))
> > +			if (!READ_ONCE(poke->tailcall_target_stable))
> >   				continue;
> >   			if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
> >   				continue;
> >   			if (poke->tail_call.map != map ||
> >   			    poke->tail_call.key != key)
> >   				continue;
> > +			/* protect against un-updated poke descriptors since
> > +			 * we could fill them from subprog and the same desc
> > +			 * is present on main's program poke tab
> > +			 */
> > +			if (!poke->tailcall_bypass || !poke->tailcall_target)
> > +				continue;
> 
> Can't we avoid copying these descriptors over to the subprog in the first place?

I think we can, but can we consider it as something that we will do as a
follow-up?

> 
> > +			if (!old && !new)
> > +				continue;
> 
> Could we avoid this above but instead signal via bpf_arch_text_poke() that nothing
> had to be patched? Reason is that bpf_arch_text_poke() will still do the sanity
> check to make sure reality meets expectation wrt current insns (which is also
> why I didn't add this skip). In that case we could then just avoid the expensive
> synchronize_rcu().

I was even thinking of having such a check before walking through the poke
descriptors, so that's the opposite of what you suggest.

If you insist, I can play with this a bit on Monday, but I recall that it
was the only thing that was stopping Alexei's pseudo-code from being
fully functional (the nop->nop update).

> 
> > -			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
> > -						 old ? (u8 *)old->bpf_func +
> > -						 poke->adj_off : NULL,
> > -						 new ? (u8 *)new->bpf_func +
> > -						 poke->adj_off : NULL);
> > -			BUG_ON(ret < 0 && ret != -EINVAL);
> > +			bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;
> > +			old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
> > +			new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
> > +
> > +			if (new) {
> > +				ret = bpf_arch_text_poke(poke->tailcall_target,
> > +							 BPF_MOD_JUMP,
> > +							 old_addr, new_addr);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +				if (!old) {
> > +					ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +								 BPF_MOD_JUMP,
> > +								 bypass_addr, NULL);
> > +					BUG_ON(ret < 0 && ret != -EINVAL);
> > +				}
> > +			} else {
> > +				ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +							 BPF_MOD_JUMP,
> > +							 NULL, bypass_addr);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +				/* let other CPUs finish the execution of program
> > +				 * so that it will not possible to expose them
> > +				 * to invalid nop, stack unwind, nop state
> > +				 */
> > +				synchronize_rcu();
> 
> Very heavyweight that we need to potentially call this /multiple/ times for just a
> /single/ map update under poke mutex even ... but agree it's needed here to avoid
> racing. :(
> 
> > +				ret = bpf_arch_text_poke(poke->tailcall_target,
> > +							 BPF_MOD_JUMP,
> > +							 old_addr, NULL);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +			}
> >   		}
> >   	}
> >   }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-17  2:16     ` Alexei Starovoitov
@ 2020-07-17 10:57       ` Maciej Fijalkowski
  2020-07-17 16:16         ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-17 10:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Thu, Jul 16, 2020 at 07:16:24PM -0700, Alexei Starovoitov wrote:
> On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> > > +				ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > > +							 BPF_MOD_JUMP,
> > > +							 NULL, bypass_addr);
> > > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > > +				/* let other CPUs finish the execution of program
> > > +				 * so that it will not possible to expose them
> > > +				 * to invalid nop, stack unwind, nop state
> > > +				 */
> > > +				synchronize_rcu();
> > 
> > Very heavyweight that we need to potentially call this /multiple/ times for just a
> > /single/ map update under poke mutex even ... but agree it's needed here to avoid
> > racing. :(
> 
> Yeah. I wasn't clear with my suggestion earlier.
> I meant to say that synchronize_rcu() can be done between two loops.
> list_for_each_entry(elem, &aux->poke_progs, list)
>    for (i = 0; i < elem->aux->size_poke_tab; i++)
>         bpf_arch_text_poke(poke->tailcall_bypass, ...
> synchronize_rcu();
> list_for_each_entry(elem, &aux->poke_progs, list)
>    for (i = 0; i < elem->aux->size_poke_tab; i++)
>         bpf_arch_text_poke(poke->tailcall_target, ...
> 
> Not sure how much better it will be though.
> text_poke is heavy.
> I think it's heavier than synchronize_rcu().
> Long term we can do batch of text_poke-s.

Yeah, since we introduce another poke target we could come up with
preparing a vector of pokes, as you're saying?

> 
> I'm actually fine with above approach of synchronize_rcu() without splitting the loop.
> This kind of optimizations can be done later as a follow up.
> I'd really like to land this stuff in this bpf-next cycle.
> It's a big improvement to tail_calls and bpf2bpf calls.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms
  2020-07-16 22:46     ` Daniel Borkmann
@ 2020-07-17 11:39       ` Maciej Fijalkowski
  0 siblings, 0 replies; 19+ messages in thread
From: Maciej Fijalkowski @ 2020-07-17 11:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Maciej Fijalkowski, ast, bpf, netdev, bjorn.topel, magnus.karlsson

On Fri, Jul 17, 2020 at 1:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 7/16/20 11:29 PM, Daniel Borkmann wrote:
> > On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> >> Relax verifier's restriction that was meant to forbid tailcall usage
> >> when subprog count was higher than 1.
> >>
> >> Also, do not max out the stack depth of program that utilizes tailcalls.
> >>
> >> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> >> ---
> >>   kernel/bpf/verifier.c | 5 -----
> >>   1 file changed, 5 deletions(-)
> >>
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 3c1efc9d08fd..6481342b31ba 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -4172,10 +4172,6 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
> >>       case BPF_FUNC_tail_call:
> >>           if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
> >>               goto error;
> >> -        if (env->subprog_cnt > 1) {
> >> -            verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
> >> -            return -EINVAL;
> >> -        }
> >>           break;
> >>       case BPF_FUNC_perf_event_read:
> >>       case BPF_FUNC_perf_event_output:
> >> @@ -10252,7 +10248,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
> >>                * the program array.
> >>                */
> >>               prog->cb_access = 1;
> >> -            env->prog->aux->stack_depth = MAX_BPF_STACK;
> >>               env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
> >>               /* mark bpf_tail_call as different opcode to avoid
> >
> > Also, isn't this broken when JIT is not used (as in stack oob access)?
>
> (Similarly for non-x86 archs after this set.)

Honestly at this point I'm not sure how to approach it, but as I said I'm
in a bit of a rush, so probably not thinking clearly :)

So in the end we want to allow it *only* for the case when the underlying
arch is x86-64 and when the JIT is turned on, correct? Is this a matter of
#define juggling, or how do you see it?
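
(One conceivable shape of such a gate, purely as a sketch; the helper name
and the exact condition are made up here:

	static bool allow_tail_call_in_subprogs(struct bpf_verifier_env *env)
	{
		return env->prog->jit_requested && IS_ENABLED(CONFIG_X86_64);
	}

	...
	case BPF_FUNC_tail_call:
		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
			goto error;
		if (env->subprog_cnt > 1 && !allow_tail_call_in_subprogs(env)) {
			verbose(env, "tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls\n");
			return -EINVAL;
		}
		break;
)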

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-17 10:57       ` Maciej Fijalkowski
@ 2020-07-17 16:16         ` Alexei Starovoitov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2020-07-17 16:16 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Daniel Borkmann, Alexei Starovoitov, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Fri, Jul 17, 2020 at 4:02 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Thu, Jul 16, 2020 at 07:16:24PM -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> > > > +                         ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > > > +                                                  BPF_MOD_JUMP,
> > > > +                                                  NULL, bypass_addr);
> > > > +                         BUG_ON(ret < 0 && ret != -EINVAL);
> > > > +                         /* let other CPUs finish the execution of program
> > > > +                          * so that it will not possible to expose them
> > > > +                          * to invalid nop, stack unwind, nop state
> > > > +                          */
> > > > +                         synchronize_rcu();
> > >
> > > Very heavyweight that we need to potentially call this /multiple/ times for just a
> > > /single/ map update under poke mutex even ... but agree it's needed here to avoid
> > > racing. :(
> >
> > Yeah. I wasn't clear with my suggestion earlier.
> > I meant to say that synchronize_rcu() can be done between two loops.
> > list_for_each_entry(elem, &aux->poke_progs, list)
> >    for (i = 0; i < elem->aux->size_poke_tab; i++)
> >         bpf_arch_text_poke(poke->tailcall_bypass, ...
> > synchronize_rcu();
> > list_for_each_entry(elem, &aux->poke_progs, list)
> >    for (i = 0; i < elem->aux->size_poke_tab; i++)
> >         bpf_arch_text_poke(poke->tailcall_target, ...
> >
> > Not sure how much better it will be though.
> > text_poke is heavy.
> > I think it's heavier than synchronize_rcu().
> > Long term we can do batch of text_poke-s.
>
> Yeah since we introduce another poke target we could come up with
> preparing the vector of pokes as you're saying?

yes. eventually. Pls keep it simple for now.
The logic is already quite complex.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-07-17 16:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15 23:36 [PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-15 23:36 ` [PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
2020-07-16 20:36   ` Daniel Borkmann
2020-07-17  9:29     ` Maciej Fijalkowski
2020-07-15 23:36 ` [PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-16 21:10   ` Daniel Borkmann
2020-07-16 21:29   ` Daniel Borkmann
2020-07-16 22:46     ` Daniel Borkmann
2020-07-17 11:39       ` Maciej Fijalkowski
2020-07-15 23:36 ` [PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
2020-07-16 21:16   ` Daniel Borkmann
2020-07-17  9:36     ` Maciej Fijalkowski
2020-07-15 23:36 ` [PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
2020-07-16 23:06   ` Daniel Borkmann
2020-07-17  2:16     ` Alexei Starovoitov
2020-07-17 10:57       ` Maciej Fijalkowski
2020-07-17 16:16         ` Alexei Starovoitov
2020-07-17 10:52     ` Maciej Fijalkowski
2020-07-15 23:36 ` [PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).