bpf.vger.kernel.org archive mirror
* [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms
@ 2020-07-02 13:49 Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-02 13:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Hello,

Today, bpf2bpf calls and tailcalls exclude each other. This set is a
proposal to make them work together. It is still an RFC because we need
to decide whether the performance impact for BPF programs with tailcalls
is acceptable or not. Note that I have focused only on the x86
architecture; I am not sure whether other architectures have additional
restrictions that have been stopping us from having tailcalls in BPF
subprograms.

I would also like to get feedback on the prog_array_map_poke_run changes.

To give you an overview of how this work started: I previously posted an
RFC targeted at getting rid of the push/pop instructions for callee-saved
registers in the x86-64 JIT when they are not used by the BPF program.
Alexei saw the potential for that work to be lifted a bit so that
tailcalls could work with BPF subprograms. More on that in [1], [2].

In [1], Alexei says:

"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

So basically I gave that suggestion a shot. Patch 4 has a description of
the implementation.
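
For reference, the callee-side bytes that such a tail call skips add up
to the X86_TAIL_CALL_OFFSET constant introduced in patch 4; a sketch of
that accounting, based on the quoted layout:

/*
 * Bytes at the top of the callee's prologue that the tail call jumps over:
 *
 *   nop5                     5 bytes
 *   xor eax,eax (or nop2)    2 bytes   tail call counter init
 *   push rbp                 1 byte
 *   mov rbp, rsp             3 bytes
 *                           --------
 *                           11 bytes
 *
 * The jump therefore lands right before 'sub rsp, rounded_stack_depth',
 * so the callee sets up its own stack frame while inheriting the tail
 * call counter that the parent pushed via 'push rax'.
 */
#define X86_TAIL_CALL_OFFSET	11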

Quick overview of patches:
Patch 1 changes the BPF retpoline to use %rcx instead of %rax for storing
the address of the BPF tailcall target program.
Patch 2 relaxes the verifier's restrictions on tailcalls being used with
BPF subprograms.
Patch 3 propagates poke descriptors from the main program to each
subprogram.
Patch 4 is the main dish in this set. It implements the new prologue
layout suggested by Alexei and reworks the tailcall handling.
Patch 5 is a new selftest that proves tailcalls can be used from within
a BPF subprogram.

-------------------------------------------------------------------
Debatable prog_array_map_poke_run changes:

With the new prologue layout, before jumping to the tailcall target the
stack needs to be unwound and the callee-saved registers need to be
popped. The instructions responsible for that are generated, but they
should not be executed if the target program is not present. To address
that, a new poke target, 'ip_aux', is introduced into the poke
descriptor; it will be used for skipping these instructions. This means
there are two poke targets for handling direct tailcalls. The simplified
flow can be presented as three sections:

1. skip call or nop (poke->ip_aux)
2. stack unwind
3. call tail or nop (poke->ip)

It is possible that one CPU is in section 2 while section 3 is not yet
updated (nop), which would lead to the problems mentioned in the patch 4
commit message; IOW, the unwind section should not be executed if there
is no target program.

We can define the following state matrix for that (courtesy of Bjorn):
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (it leads to incorrectness). C is the starting state. What
if we just make sure we *never* enter that state, and never return to C?

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->ip) and do NOT touch the poke->ip_aux
Remove tail call: B(f')->D(f')
 * do NOT touch poke->ip, only poke the poke->ip_aux. Note that we do
   not get back to C(f')
Install new tail call (f''): D(f')->D(f'')->B(f'').
 * poke both targets, first ->ip then ->ip_aux

So, by avoiding A and never going back to C, another CPU can never be
exposed to the forbidden "unwind without tailcall" (A) state.
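
A minimal C sketch of that ordering (poke_ip() and poke_ip_aux() below
are hypothetical stand-ins for bpf_arch_text_poke() on the respective
poke targets, and the question of which 'old' instruction image to
compare against is glossed over here; this is not the actual arraymap.c
code):

#include <stdbool.h>
#include <stddef.h>

/* Stand-ins, sketch only: poke_ip() repoints the jump at poke->ip to a
 * new tailcall target (NULL means nop5), poke_ip_aux() arms or disarms
 * the jump at poke->ip_aux that skips over the unwind section.
 */
static void poke_ip(void *old_target, void *new_target) { (void)old_target; (void)new_target; }
static void poke_ip_aux(bool arm_skip) { (void)arm_skip; }

static void poke_run_order(void *old_prog, void *new_prog)
{
	if (new_prog && !old_prog) {		/* install: C -> D -> B(new) */
		poke_ip(NULL, new_prog);	/* arm the tailcall first... */
		poke_ip_aux(false);		/* ...then drop the skip jump */
	} else if (new_prog && old_prog) {	/* update: B(old) -> B(new) */
		poke_ip(old_prog, new_prog);	/* poke->ip only, ip_aux untouched */
	} else if (!new_prog && old_prog) {	/* remove: B(old) -> D(old) */
		poke_ip_aux(true);		/* re-arm the skip; never back to C */
	}
}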

Right now, due to 'faking' the bpf_arch_text_poke() arguments,
prog_array_map_poke_run() looks a bit messy. I dropped the usage of the
'old' argument altogether and instead do the reverse of the calculation
done by emit_patch(), so that memcmp(ip, old_insn, X86_PATCH_SIZE)
returns 0 and we do the actual poking.
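
A minimal sketch of that reverse calculation, assuming ip points at the
already emitted 5-byte jmp/call (the same rel32 arithmetic emit_patch()
uses, just inverted):

#include <stdint.h>
#include <string.h>

/* Sketch only: recover the absolute target of the 5-byte jmp/call rel32
 * previously written at 'ip', so it can be handed back as the 'old'
 * image that bpf_arch_text_poke() compares against.
 */
static void *old_target_from_ip(uint8_t *ip)
{
	int32_t rel;

	memcpy(&rel, ip + 1, sizeof(rel));	/* rel32 follows the opcode byte */
	return ip + 5 + rel;			/* rel32 is relative to the next insn */
}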

Presumably this is something to be fixed/improved, but first I would
like to hear others' opinions and reach a decision on whether it is
worth pursuing or not.

-------------------------------------------------------------------
Performance impact:

As requested, I am including the performance numbers that show the
impact of this set, but I have not analyzed them; let's do that on the
list. Also, please let me know whether these scenarios make sense and
are sufficient.

All of this work, as stated in [2], started as a way to speed up AF_XDP
by dropping the push/pop of unused callee-saved registers in the prologue
and epilogue. The impact there is positive: a 15% performance gain.

However, it is obvious that this will have a negative impact on BPF
programs that utilize tailcalls. I was asked to provide some numbers
that tell us how much these cases are actually hurt by this set.

Below are the numbers from 'perf stat' for two scenarios.
The first scenario is the output of the command:

$ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls

The tailcalls kselftest was modified in the following way:
- only the tailcall_1 subtest is enabled
- each of the bpf_prog_test_run() calls had its 'repeat' argument set to
  1000000 (as sketched below)
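
A sketch of what each of those calls then boils down to (main_fd and
pkt_v4 come from the selftest itself):

	__u32 retval, duration;
	int err;

	/* 'repeat' bumped from 1 to 1000000 so the tailcall path dominates */
	err = bpf_prog_test_run(main_fd, 1000000, &pkt_v4, sizeof(pkt_v4),
				NULL, NULL, &retval, &duration);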

Numbers without this set:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

            198.73 msec task-clock                #    0.997 CPUs utilized            ( +-  0.13% )
                 6      context-switches          #    0.030 K/sec                    ( +-  0.75% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 22.15% )
               108      page-faults               #    0.546 K/sec                    ( +-  0.03% )
       693,910,413      cycles                    #    3.492 GHz                      ( +-  0.11% )  (30.26%)
     1,067,635,122      instructions              #    1.54  insn per cycle           ( +-  0.03% )  (38.16%)
       165,308,809      branches                  #  831.822 M/sec                    ( +-  0.02% )  (38.46%)
         9,940,504      branch-misses             #    6.01% of all branches          ( +-  0.02% )  (38.77%)
       226,741,985      L1-dcache-loads           # 1140.949 M/sec                    ( +-  0.02% )  (39.07%)
           161,936      L1-dcache-load-misses     #    0.07% of all L1-dcache hits    ( +-  0.66% )  (39.12%)
            43,777      LLC-loads                 #    0.220 M/sec                    ( +-  0.97% )  (31.07%)
            11,773      LLC-load-misses           #   26.89% of all LL-cache hits     ( +-  0.99% )  (30.93%)
   <not supported>      L1-icache-loads
            97,692      L1-icache-load-misses                                         ( +-  0.51% )  (30.77%)
       229,069,211      dTLB-loads                # 1152.659 M/sec                    ( +-  0.02% )  (30.62%)
             1,031      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.28% )  (30.46%)
             2,236      iTLB-loads                #    0.011 M/sec                    ( +-  1.28% )  (30.30%)
               357      iTLB-load-misses          #   15.99% of all iTLB cache hits   ( +-  2.10% )  (30.16%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          0.199307 +- 0.000250 seconds time elapsed  ( +-  0.13% )

With:

 Performance counter stats for './test_progs -t tailcalls' (1024 runs):

            202.48 msec task-clock                #    0.997 CPUs utilized            ( +-  0.09% )
                 6      context-switches          #    0.032 K/sec                    ( +-  1.86% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 30.00% )
               108      page-faults               #    0.535 K/sec                    ( +-  0.03% )
       718,001,313      cycles                    #    3.546 GHz                      ( +-  0.06% )  (30.12%)
     1,041,618,306      instructions              #    1.45  insn per cycle           ( +-  0.03% )  (37.96%)
       226,386,119      branches                  # 1118.091 M/sec                    ( +-  0.03% )  (38.35%)
         9,882,436      branch-misses             #    4.37% of all branches          ( +-  0.02% )  (38.59%)
       196,832,137      L1-dcache-loads           #  972.128 M/sec                    ( +-  0.02% )  (39.15%)
           217,794      L1-dcache-load-misses     #    0.11% of all L1-dcache hits    ( +-  0.67% )  (39.23%)
            70,690      LLC-loads                 #    0.349 M/sec                    ( +-  0.90% )  (31.15%)
            18,802      LLC-load-misses           #   26.60% of all LL-cache hits     ( +-  0.84% )  (31.18%)
   <not supported>      L1-icache-loads
           106,461      L1-icache-load-misses                                         ( +-  0.51% )  (30.83%)
       198,887,011      dTLB-loads                #  982.277 M/sec                    ( +-  0.02% )  (30.66%)
             1,483      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.28% )  (30.50%)
             4,064      iTLB-loads                #    0.020 M/sec                    ( +- 21.43% )  (30.23%)
               488      iTLB-load-misses          #   12.00% of all iTLB cache hits   ( +-  1.95% )  (30.03%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          0.203081 +- 0.000187 seconds time elapsed  ( +-  0.09% )


The second measurement was conducted on the BPF kselftest
flow_dissector, which uses progs/bpf_flow.c, with the 'repeat' argument
of bpf_prog_test_run_xattr() also set to 1000000.

Without:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,340.52 msec task-clock                #    0.987 CPUs utilized            ( +-  0.05% )
                25      context-switches          #    0.018 K/sec                    ( +-  0.32% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  8.59% )
               122      page-faults               #    0.091 K/sec                    ( +-  0.03% )
     4,764,381,512      cycles                    #    3.554 GHz                      ( +-  0.04% )  (30.68%)
     7,674,803,496      instructions              #    1.61  insn per cycle           ( +-  0.01% )  (38.41%)
     1,118,346,714      branches                  #  834.261 M/sec                    ( +-  0.00% )  (38.46%)
        29,132,651      branch-misses             #    2.60% of all branches          ( +-  0.00% )  (38.50%)
     1,737,552,687      L1-dcache-loads           # 1296.174 M/sec                    ( +-  0.01% )  (38.55%)
         1,064,105      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    ( +-  1.28% )  (38.57%)
            50,356      LLC-loads                 #    0.038 M/sec                    ( +-  1.42% )  (30.82%)
            10,825      LLC-load-misses           #   21.50% of all LL-cache hits     ( +-  1.42% )  (30.79%)
   <not supported>      L1-icache-loads
           568,800      L1-icache-load-misses                                         ( +-  0.66% )  (30.77%)
     1,741,511,307      dTLB-loads                # 1299.127 M/sec                    ( +-  0.01% )  (30.75%)
             5,112      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.29% )  (30.73%)
             2,128      iTLB-loads                #    0.002 M/sec                    ( +-  2.06% )  (30.70%)
               571      iTLB-load-misses          #   26.85% of all iTLB cache hits   ( +-  3.10% )  (30.68%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          1.358653 +- 0.000741 seconds time elapsed  ( +-  0.05% )


With:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,426.95 msec task-clock                #    0.989 CPUs utilized            ( +-  0.04% )
                23      context-switches          #    0.016 K/sec                    ( +-  0.40% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  6.38% )
               122      page-faults               #    0.085 K/sec                    ( +-  0.03% )
     4,772,798,523      cycles                    #    3.345 GHz                      ( +-  0.03% )  (30.70%)
     7,837,101,633      instructions              #    1.64  insn per cycle           ( +-  0.00% )  (38.42%)
     1,118,716,987      branches                  #  783.992 M/sec                    ( +-  0.00% )  (38.46%)
        29,147,367      branch-misses             #    2.61% of all branches          ( +-  0.00% )  (38.51%)
     1,797,232,091      L1-dcache-loads           # 1259.492 M/sec                    ( +-  0.00% )  (38.55%)
         1,487,769      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  0.66% )  (38.55%)
            50,180      LLC-loads                 #    0.035 M/sec                    ( +-  1.37% )  (30.81%)
            14,709      LLC-load-misses           #   29.31% of all LL-cache hits     ( +-  1.11% )  (30.79%)
   <not supported>      L1-icache-loads
           626,633      L1-icache-load-misses                                         ( +-  0.58% )  (30.77%)
     1,800,278,668      dTLB-loads                # 1261.627 M/sec                    ( +-  0.01% )  (30.75%)
             3,809      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.71% )  (30.72%)
             1,745      iTLB-loads                #    0.001 M/sec                    ( +-  3.90% )  (30.70%)
            12,267      iTLB-load-misses          #  703.02% of all iTLB cache hits   ( +- 96.08% )  (30.68%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          1.442116 +- 0.000632 seconds time elapsed  ( +-  0.04% )
-------------------------------------------------------------------

Thank you,
Maciej

Maciej Fijalkowski (5):
  bpf, x64: use %rcx instead of %rax for tail call retpolines
  bpf: allow for tailcalls in BPF subprograms
  bpf: propagate poke descriptors to subprograms
  bpf, x64: rework pro/epilogue and tailcall handling in JIT
  selftests: bpf: add dummy prog for bpf2bpf with tailcall

 arch/x86/include/asm/nospec-branch.h          |  16 +-
 arch/x86/net/bpf_jit_comp.c                   | 224 ++++++++++++++----
 include/linux/bpf.h                           |   1 +
 kernel/bpf/arraymap.c                         |  27 ++-
 kernel/bpf/verifier.c                         |  14 +-
 .../selftests/bpf/prog_tests/tailcalls.c      |  85 +++++++
 tools/testing/selftests/bpf/progs/tailcall6.c |  38 +++
 7 files changed, 339 insertions(+), 66 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

-- 
2.20.1



* [RFC PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines
  2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
@ 2020-07-02 13:49 ` Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-02 13:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Currently, %rax is used to store the jump target when the x86-64 JIT is
emitting the retpoline instructions that handle the indirect tailcall.

There is a plan to use %rax for a different purpose, namely storing the
tail call counter. In order to preserve this value across tailcalls, use
%rcx instead for the jump target storage in the retpoline instructions.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/include/asm/nospec-branch.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index e7752b4038ff..e491c3d9f227 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -314,19 +314,19 @@ static inline void mds_idle_clear_cpu_buffers(void)
  *    lfence
  *    jmp spec_trap
  *  do_rop:
- *    mov %rax,(%rsp) for x86_64
+ *    mov %rcx,(%rsp) for x86_64
  *    mov %edx,(%esp) for x86_32
  *    retq
  *
  * Without retpolines configured:
  *
- *    jmp *%rax for x86_64
+ *    jmp *%rcx for x86_64
  *    jmp *%edx for x86_32
  */
 #ifdef CONFIG_RETPOLINE
 # ifdef CONFIG_X86_64
-#  define RETPOLINE_RAX_BPF_JIT_SIZE	17
-#  define RETPOLINE_RAX_BPF_JIT()				\
+#  define RETPOLINE_RCX_BPF_JIT_SIZE	17
+#  define RETPOLINE_RCX_BPF_JIT()				\
 do {								\
 	EMIT1_off32(0xE8, 7);	 /* callq do_rop */		\
 	/* spec_trap: */					\
@@ -334,7 +334,7 @@ do {								\
 	EMIT3(0x0F, 0xAE, 0xE8); /* lfence */			\
 	EMIT2(0xEB, 0xF9);       /* jmp spec_trap */		\
 	/* do_rop: */						\
-	EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */	\
+	EMIT4(0x48, 0x89, 0x0C, 0x24); /* mov %rcx,(%rsp) */	\
 	EMIT1(0xC3);             /* retq */			\
 } while (0)
 # else /* !CONFIG_X86_64 */
@@ -352,9 +352,9 @@ do {								\
 # endif
 #else /* !CONFIG_RETPOLINE */
 # ifdef CONFIG_X86_64
-#  define RETPOLINE_RAX_BPF_JIT_SIZE	2
-#  define RETPOLINE_RAX_BPF_JIT()				\
-	EMIT2(0xFF, 0xE0);       /* jmp *%rax */
+#  define RETPOLINE_RCX_BPF_JIT_SIZE	2
+#  define RETPOLINE_RCX_BPF_JIT()				\
+	EMIT2(0xFF, 0xE1);       /* jmp *%rcx */
 # else /* !CONFIG_X86_64 */
 #  define RETPOLINE_EDX_BPF_JIT()				\
 	EMIT2(0xFF, 0xE2)        /* jmp *%edx */
-- 
2.20.1



* [RFC PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms
  2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
@ 2020-07-02 13:49 ` Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-02 13:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Relax the verifier's restriction that forbade tailcall usage when the
subprogram count was higher than 1.

Also, do not max out the stack depth of a program that utilizes
tailcalls.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 kernel/bpf/verifier.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7de98906ddf4..e7b6b059e6e9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4173,10 +4173,6 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_FUNC_tail_call:
 		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
 			goto error;
-		if (env->subprog_cnt > 1) {
-			verbose(env, "tail_calls are not allowed in programs with bpf-to-bpf calls\n");
-			return -EINVAL;
-		}
 		break;
 	case BPF_FUNC_perf_event_read:
 	case BPF_FUNC_perf_event_output:
@@ -10243,7 +10239,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			 * the program array.
 			 */
 			prog->cb_access = 1;
-			env->prog->aux->stack_depth = MAX_BPF_STACK;
 			env->prog->aux->max_pkt_offset = MAX_PACKET_OFF;
 
 			/* mark bpf_tail_call as different opcode to avoid
-- 
2.20.1



* [RFC PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms
  2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
@ 2020-07-02 13:49 ` Maciej Fijalkowski
  2020-07-02 13:49 ` [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-02 13:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Previously, there was no need for poke descriptors to be present in a
subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
in them. Each subprogram is JITed independently, so in order to enable
JITing of such subprograms, simply copy the poke descriptors from the
main program to the subprogram's poke tab.

Also add the subprogram's aux struct to the BPF map's poke_progs list by
calling map_poke_track() on it.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 kernel/bpf/verifier.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e7b6b059e6e9..716538ada537 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9931,6 +9931,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		goto out_undo_insn;
 
 	for (i = 0; i < env->subprog_cnt; i++) {
+		struct bpf_map *map_ptr;
+		int j;
+
 		subprog_start = subprog_end;
 		subprog_end = env->subprog_info[i + 1].start;
 
@@ -9955,6 +9958,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->aux->btf = prog->aux->btf;
 		func[i]->aux->func_info = prog->aux->func_info;
 
+		for (j = 0; j < prog->aux->size_poke_tab; j++) {
+			bpf_jit_add_poke_descriptor(func[i], &prog->aux->poke_tab[j]);
+			map_ptr = func[i]->aux->poke_tab[j].tail_call.map;
+			map_ptr->ops->map_poke_track(map_ptr, func[i]->aux);
+		}
+
 		/* Use bpf_prog_F_tag to indicate functions in stack traces.
 		 * Long term would need debug info to populate names
 		 */
-- 
2.20.1



* [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2020-07-02 13:49 ` [RFC PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
@ 2020-07-02 13:49 ` Maciej Fijalkowski
  2020-07-10 23:56   ` Alexei Starovoitov
  2020-07-02 13:49 ` [RFC PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
  2020-07-11  0:10 ` [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Alexei Starovoitov
  5 siblings, 1 reply; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-02 13:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

This commit serves two purposes:
1) it optimizes BPF prologue/epilogue generation
2) it makes it possible to have tailcalls within a BPF subprogram

Both points are related to each other since without 1), 2) could not be
achieved.

In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

Implement that suggestion.

Since the stack layout has changed, the tail call counter handling can
no longer rely on popping it into %rbx, as was done for the constant
prologue case, and later overwriting %rbx with the actual value of %rbx
pushed to the stack. Therefore, let's use one of the registers that is
considered volatile/caller-saved (%rcx) and pop the value of the tail
call counter into it in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue() and in
emit_bpf_tail_call_indirect(), where the instruction layout is not
constant anymore.

Introduce a new poke target, 'ip_aux', to the poke descriptor; it is
dedicated to skipping the register pops and stack unwind that are
generated right before the actual jump to the target program.
For the case when the target program is not present, the BPF program
will skip the pop instructions and the nop5 dedicated for the
jmpq $target. An example of such a state when only R6 of the
callee-saved registers is used by the program:

ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
ffffffffc0513aa6:       5b                      pop    %rbx
ffffffffc0513aa7:       58                      pop    %rax
ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

When a target program is inserted, the jump that was there to skip the
pops/nop5 becomes a nop5 and the nop5 becomes the jump to the target, so
the CPU goes through the pops and does the actual tailcall.

One might ask why there simply cannot be pushes after the nop5.
Consider the following example snippet:

ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
(...)
ffffffffc0370332:       5b                      pop    %rbx
ffffffffc0370333:       58                      pop    %rax
ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
ffffffffc0370347:       50                      push   %rax
ffffffffc0370348:       53                      push   %rbx
ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548

There is a bpf2bpf call right after the tailcall and the jump target is
not present. ctx is in %rbx and the BPF subprogram that we call into at
ffffffffc037034c relies on it, e.g. it will pick up ctx from there. Such
a code layout is therefore broken, as we would overwrite the content of
%rbx with the value that was pushed in the prologue.

For regression checks, 'tailcalls' kselftest was executed:
$ sudo ./test_progs -t tailcalls
 #64/1 tailcall_1:OK
 #64/2 tailcall_2:OK
 #64/3 tailcall_3:OK
 #64/4 tailcall_4:OK
 #64/5 tailcall_5:OK
 #64 tailcalls:OK
Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 224 ++++++++++++++++++++++++++++--------
 include/linux/bpf.h         |   1 +
 kernel/bpf/arraymap.c       |  27 ++++-
 3 files changed, 199 insertions(+), 53 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 42b6709e6dc7..45136270b02b 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -222,13 +222,47 @@ struct jit_context {
 /* Number of bytes emit_patch() needs to generate instructions */
 #define X86_PATCH_SIZE		5
 
-#define PROLOGUE_SIZE		25
+/* Number of bytes that will be skipped on tailcall */
+#define X86_TAIL_CALL_OFFSET	11
+
+static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[0])
+		EMIT1(0x53);         /* push rbx */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x55);   /* push r13 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x56);   /* push r14 */
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x57);   /* push r15 */
+	*pprog = prog;
+}
+
+static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (callee_regs_used[3])
+		EMIT2(0x41, 0x5F);   /* pop r15 */
+	if (callee_regs_used[2])
+		EMIT2(0x41, 0x5E);   /* pop r14 */
+	if (callee_regs_used[1])
+		EMIT2(0x41, 0x5D);   /* pop r13 */
+	if (callee_regs_used[0])
+		EMIT1(0x5B);         /* pop rbx */
+	*pprog = prog;
+}
 
 /*
- * Emit x86-64 prologue code for BPF program and check its size.
+ * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip it while jumping into another program
  */
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
+static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
+			  bool tail_call)
 {
 	u8 *prog = *pprog;
 	int cnt = X86_PATCH_SIZE;
@@ -238,19 +272,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
 	 */
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
 	prog += cnt;
+	if (!ebpf_from_cbpf && tail_call)
+		EMIT2(0x31, 0xC0);       /* xor eax, eax */
+	else
+		EMIT2(0x66, 0x90);       /* nop2 */
 	EMIT1(0x55);             /* push rbp */
 	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
 	/* sub rsp, rounded_stack_depth */
 	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
-	EMIT1(0x53);             /* push rbx */
-	EMIT2(0x41, 0x55);       /* push r13 */
-	EMIT2(0x41, 0x56);       /* push r14 */
-	EMIT2(0x41, 0x57);       /* push r15 */
-	if (!ebpf_from_cbpf) {
-		/* zero init tail_call_cnt */
-		EMIT2(0x6a, 0x00);
-		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
-	}
+	if (!ebpf_from_cbpf && tail_call)
+		EMIT1(0x50);         /* push rax */
 	*pprog = prog;
 }
 
@@ -337,6 +368,22 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return __bpf_arch_text_poke(ip, t, old_addr, new_addr, true);
 }
 
+static int get_pop_bytes(bool *callee_regs_used)
+{
+	int bytes = 0;
+
+	if (callee_regs_used[3])
+		bytes += 2;
+	if (callee_regs_used[2])
+		bytes += 2;
+	if (callee_regs_used[1])
+		bytes += 2;
+	if (callee_regs_used[0])
+		bytes += 1;
+
+	return bytes;
+}
+
 /*
  * Generate the following code:
  *
@@ -351,12 +398,25 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
  *   goto *(prog->bpf_func + prologue_size);
  * out:
  */
-static void emit_bpf_tail_call_indirect(u8 **pprog)
+static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
+					u32 stack_depth)
 {
 	u8 *prog = *pprog;
-	int label1, label2, label3;
+	int pop_bytes = 0;
+	int off1 = 49;
+	int off2 = 38;
+	int off3 = 16;
 	int cnt = 0;
 
+	/* count the additional bytes used for popping callee regs from stack
+	 * that need to be taken into account for each of the offsets that
+	 * are used for bailing out of the tail call
+	 */
+	pop_bytes = get_pop_bytes(callee_regs_used);
+	off1 += pop_bytes;
+	off2 += pop_bytes;
+	off3 += pop_bytes;
+
 	/*
 	 * rdi - pointer to ctx
 	 * rsi - pointer to bpf_array
@@ -370,75 +430,111 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
 	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
 	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
 	      offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
+#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
 	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
-	label1 = cnt;
 
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85                    /* mov eax, dword ptr [rbp - (4 + sd)] */,
+		    -4 - round_up(stack_depth, 8));
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
+#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
 	EMIT2(X86_JA, OFFSET2);                   /* ja out */
-	label2 = cnt;
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
+		    -4 - round_up(stack_depth, 8));
 
 	/* prog = array->ptrs[index]; */
-	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
+	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,        /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
 		    offsetof(struct bpf_array, ptrs));
 
 	/*
 	 * if (prog == NULL)
 	 *	goto out;
 	 */
-	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
-#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
-	EMIT2(X86_JE, OFFSET3);                   /* je out */
-	label3 = cnt;
+	EMIT3(0x48, 0x85, 0xC9);                   /* test rcx,rcx */
+#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
+	EMIT2(X86_JE, OFFSET3);                    /* je out */
 
-	/* goto *(prog->bpf_func + prologue_size); */
-	EMIT4(0x48, 0x8B, 0x40,                   /* mov rax, qword ptr [rax + 32] */
-	      offsetof(struct bpf_prog, bpf_func));
-	EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
+
+	EMIT1(0x58);                               /* pop rax */
+	EMIT3_off32(0x48, 0x81, 0xC4,              /* add rsp, sd */
+		    round_up(stack_depth, 8));
 
+	/* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */
+	EMIT4(0x48, 0x8B, 0x49,                   /* mov rcx, qword ptr [rcx + 32] */
+	      offsetof(struct bpf_prog, bpf_func));
+	EMIT4(0x48, 0x83, 0xC1,                   /* add rcx, X86_TAIL_CALL_OFFSET */
+	      X86_TAIL_CALL_OFFSET);
 	/*
-	 * Wow we're ready to jump into next BPF program
+	 * Now we're ready to jump into next BPF program
 	 * rdi == ctx (1st arg)
-	 * rax == prog->bpf_func + prologue_size
+	 * rcx == prog->bpf_func + X86_TAIL_CALL_OFFSET
 	 */
-	RETPOLINE_RAX_BPF_JIT();
+	RETPOLINE_RCX_BPF_JIT();
 
 	/* out: */
-	BUILD_BUG_ON(cnt - label1 != OFFSET1);
-	BUILD_BUG_ON(cnt - label2 != OFFSET2);
-	BUILD_BUG_ON(cnt - label3 != OFFSET3);
 	*pprog = prog;
 }
 
 static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
-				      u8 **pprog, int addr, u8 *image)
+				      u8 **pprog, int addr, u8 *image,
+				      bool *callee_regs_used, u32 stack_depth)
 {
 	u8 *prog = *pprog;
+	int pop_bytes = 0;
+	int off1 = 27;
+	int poke_off;
 	int cnt = 0;
 
+	/* count the additional bytes used for popping callee regs from stack
+	 * that need to be taken into account for the offset used for bailing
+	 * out when the tail call limit is reached and for poke->ip placement
+	 */
+	pop_bytes = get_pop_bytes(callee_regs_used);
+	off1 += pop_bytes;
+
+	/*
+	 * total bytes for:
+	 * - nop5/ jmpq $off
+	 * - pop callee regs
+	 * - pop rax
+	 * - add rsp, $val
+	 */
+	poke_off = X86_PATCH_SIZE + pop_bytes + 7 + 1;
+
 	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *	goto out;
 	 */
-	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
+	EMIT2_off32(0x8B, 0x85,
+		    -4 - round_up(stack_depth, 8));   /* mov eax, dword ptr [rbp - (4 + sd)] */
 	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
-	EMIT2(X86_JA, 14);                            /* ja out */
+	EMIT2(X86_JA, off1);                          /* ja out */
 	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
-	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
+	EMIT2_off32(0x89, 0x85,
+		    -4 - round_up(stack_depth, 8));   /* mov dword ptr [rbp - (4 + sd)], eax */
 
+	poke->ip_aux = image + (addr - poke_off - X86_PATCH_SIZE);
+	poke->adj_off = X86_TAIL_CALL_OFFSET;
 	poke->ip = image + (addr - X86_PATCH_SIZE);
-	poke->adj_off = PROLOGUE_SIZE;
+
+	emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
+
+	*pprog = prog;
+	pop_callee_regs(pprog, callee_regs_used);
+	prog = *pprog;
+	EMIT1(0x58);                                  /* pop rax */
+	EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
 
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
 	prog += X86_PATCH_SIZE;
+
 	/* out: */
 
 	*pprog = prog;
@@ -474,6 +570,10 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
 						   (u8 *)target->bpf_func +
 						   poke->adj_off, false);
 			BUG_ON(ret < 0);
+			ret = __bpf_arch_text_poke(poke->ip_aux, BPF_MOD_JUMP,
+						   (u8 *)poke->ip + X86_PATCH_SIZE,
+						   NULL, false);
+			BUG_ON(ret < 0);
 		}
 		WRITE_ONCE(poke->ip_stable, true);
 		mutex_unlock(&array->aux->poke_mutex);
@@ -652,19 +752,44 @@ static bool ex_handler_bpf(const struct exception_table_entry *x,
 	return true;
 }
 
+static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
+			     bool *regs_used, bool *tail_call_seen)
+{
+	int i;
+
+	for (i = 1; i <= insn_cnt; i++, insn++) {
+		if (insn->code == (BPF_JMP | BPF_TAIL_CALL))
+			*tail_call_seen = true;
+		if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6)
+			regs_used[0] = true;
+		if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7)
+			regs_used[1] = true;
+		if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8)
+			regs_used[2] = true;
+		if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9)
+			regs_used[3] = true;
+	}
+}
+
 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		  int oldproglen, struct jit_context *ctx)
 {
 	struct bpf_insn *insn = bpf_prog->insnsi;
+	bool callee_regs_used[4] = {};
 	int insn_cnt = bpf_prog->len;
+	bool tail_call_seen = false;
 	bool seen_exit = false;
 	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
 	int i, cnt = 0, excnt = 0;
 	int proglen = 0;
 	u8 *prog = temp;
 
+	detect_reg_usage(insn, insn_cnt, callee_regs_used,
+			 &tail_call_seen);
+
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
-		      bpf_prog_was_classic(bpf_prog));
+		      bpf_prog_was_classic(bpf_prog), tail_call_seen);
+	push_callee_regs(&prog, callee_regs_used);
 	addrs[0] = prog - temp;
 
 	for (i = 1; i <= insn_cnt; i++, insn++) {
@@ -1109,9 +1234,13 @@ xadd:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_TAIL_CALL:
 			if (imm32)
 				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
-							  &prog, addrs[i], image);
+							  &prog, addrs[i], image,
+							  callee_regs_used,
+							  bpf_prog->aux->stack_depth);
 			else
-				emit_bpf_tail_call_indirect(&prog);
+				emit_bpf_tail_call_indirect(&prog,
+							    callee_regs_used,
+							    bpf_prog->aux->stack_depth);
 			break;
 
 			/* cond jump */
@@ -1294,12 +1423,9 @@ xadd:			if (is_imm8(insn->off))
 			seen_exit = true;
 			/* Update cleanup_addr */
 			ctx->cleanup_addr = proglen;
-			if (!bpf_prog_was_classic(bpf_prog))
-				EMIT1(0x5B); /* get rid of tail_call_cnt */
-			EMIT2(0x41, 0x5F);   /* pop r15 */
-			EMIT2(0x41, 0x5E);   /* pop r14 */
-			EMIT2(0x41, 0x5D);   /* pop r13 */
-			EMIT1(0x5B);         /* pop rbx */
+			pop_callee_regs(&prog, callee_regs_used);
+			if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen)
+				EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */
 			EMIT1(0xC9);         /* leave */
 			EMIT1(0xC3);         /* ret */
 			break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3d2ade703a35..0554af067e61 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -652,6 +652,7 @@ enum bpf_jit_poke_reason {
 /* Descriptor of pokes pointing /into/ the JITed image. */
 struct bpf_jit_poke_descriptor {
 	void *ip;
+	void *ip_aux;
 	union {
 		struct {
 			struct bpf_map *map;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index ec5cd11032aa..60423467997d 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -761,6 +761,8 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 {
 	struct prog_poke_elem *elem;
 	struct bpf_array_aux *aux;
+	bool is_nop;
+	s32 *off;
 
 	aux = container_of(map, struct bpf_array, map)->aux;
 	WARN_ON_ONCE(!mutex_is_locked(&aux->poke_mutex));
@@ -808,12 +810,29 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
 			if (poke->tail_call.map != map ||
 			    poke->tail_call.key != key)
 				continue;
+			/* protect against un-updated poke descriptors since
+			 * we could fill them from subprog and the same desc
+			 * is present on main's program poke tab
+			 */
+			if (!poke->ip_aux || !poke->ip)
+				continue;
 
+			if (!new)
+				goto skip_poke;
+
+			off = (s32 *)((u8 *)(poke->ip + 1));
+			is_nop = !memcmp(poke->ip, ideal_nops[NOP_ATOMIC5], 5);
 			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
-						 old ? (u8 *)old->bpf_func +
-						 poke->adj_off : NULL,
-						 new ? (u8 *)new->bpf_func +
-						 poke->adj_off : NULL);
+						 is_nop ? NULL : (u8 *)poke->ip +
+						 *off + 5,
+						 (u8 *)new->bpf_func +
+						 poke->adj_off);
+			BUG_ON(ret < 0 && ret != -EINVAL);
+skip_poke:
+			is_nop = !memcmp(poke->ip_aux, ideal_nops[NOP_ATOMIC5], 5);
+			ret = bpf_arch_text_poke(poke->ip_aux, BPF_MOD_JUMP,
+						 is_nop ? NULL : (u8 *)poke->ip + 5,
+						 new ? NULL : (u8 *)poke->ip + 5);
 			BUG_ON(ret < 0 && ret != -EINVAL);
 		}
 	}
-- 
2.20.1



* [RFC PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall
  2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2020-07-02 13:49 ` [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
@ 2020-07-02 13:49 ` Maciej Fijalkowski
  2020-07-11  0:10 ` [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Alexei Starovoitov
  5 siblings, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-02 13:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: bpf, netdev, bjorn.topel, magnus.karlsson, Maciej Fijalkowski

Introduce a 6th test to the tailcalls kselftest that checks whether a
tailcall can be correctly executed from within a BPF subprogram.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/tailcalls.c      | 85 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/tailcall6.c | 38 +++++++++
 2 files changed, 123 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tailcall6.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tailcalls.c b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
index bb8fe646dd9f..192c94896809 100644
--- a/tools/testing/selftests/bpf/prog_tests/tailcalls.c
+++ b/tools/testing/selftests/bpf/prog_tests/tailcalls.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
+#include <network_helpers.h>
 
 /* test_tailcall_1 checks basic functionality by patching multiple locations
  * in a single program for a single tail call slot with nop->jmp, jmp->nop
@@ -472,6 +473,88 @@ static void test_tailcall_5(void)
 	bpf_object__close(obj);
 }
 
+/* test_tailcall_6 purpose is to make sure that tailcalls are working
+ * correctly in combination with BPF subprograms
+ */
+static void test_tailcall_6(void)
+{
+	int err, map_fd, prog_fd, main_fd, i;
+	struct bpf_map *prog_array;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	__u32 retval, duration;
+	char prog_name[32];
+
+	err = bpf_prog_load("tailcall6.o", BPF_PROG_TYPE_SCHED_CLS, &obj,
+			    &prog_fd);
+	if (CHECK_FAIL(err))
+		return;
+
+	prog = bpf_object__find_program_by_title(obj, "classifier");
+	if (CHECK_FAIL(!prog))
+		goto out;
+
+	main_fd = bpf_program__fd(prog);
+	if (CHECK_FAIL(main_fd < 0))
+		goto out;
+
+	prog_array = bpf_object__find_map_by_name(obj, "jmp_table");
+	if (CHECK_FAIL(!prog_array))
+		goto out;
+
+	map_fd = bpf_map__fd(prog_array);
+	if (CHECK_FAIL(map_fd < 0))
+		goto out;
+
+	/* nop -> jmp */
+	for (i = 0; i < bpf_map__def(prog_array)->max_entries; i++) {
+		snprintf(prog_name, sizeof(prog_name), "classifier/%i", i);
+
+		prog = bpf_object__find_program_by_title(obj, prog_name);
+		if (CHECK_FAIL(!prog))
+			goto out;
+
+		prog_fd = bpf_program__fd(prog);
+		if (CHECK_FAIL(prog_fd < 0))
+			goto out;
+
+		err = bpf_map_update_elem(map_fd, &i, &prog_fd, BPF_ANY);
+		if (CHECK_FAIL(err))
+			goto out;
+	}
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 1, "tailcall",
+	      "err %d errno %d retval %d\n", err, errno, retval);
+
+	/* jmp -> nop, call subprog that will do tailcall */
+	i = 1;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 0, "tailcall", "err %d errno %d retval %d\n",
+	      err, errno, retval);
+
+	/* make sure that subprog can access ctx and entry prog that
+	 * called this subprog can properly return
+	 */
+	i = 0;
+	err = bpf_map_delete_elem(map_fd, &i);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	err = bpf_prog_test_run(main_fd, 1, &pkt_v4, sizeof(pkt_v4), 0,
+				0, &retval, &duration);
+	CHECK(err || retval != 108, "tailcall", "err %d errno %d retval %d\n",
+	      err, errno, retval);
+out:
+	bpf_object__close(obj);
+}
+
 void test_tailcalls(void)
 {
 	if (test__start_subtest("tailcall_1"))
@@ -484,4 +567,6 @@ void test_tailcalls(void)
 		test_tailcall_4();
 	if (test__start_subtest("tailcall_5"))
 		test_tailcall_5();
+	if (test__start_subtest("tailcall_6"))
+		test_tailcall_6();
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall6.c b/tools/testing/selftests/bpf/progs/tailcall6.c
new file mode 100644
index 000000000000..e72ca5869b58
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tailcall6.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 2);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+#define TAIL_FUNC(x) 				\
+	SEC("classifier/" #x)			\
+	int bpf_func_##x(struct __sk_buff *skb)	\
+	{					\
+		return x;			\
+	}
+TAIL_FUNC(0)
+TAIL_FUNC(1)
+
+static __attribute__ ((noinline))
+int subprog_tail(struct __sk_buff *skb)
+{
+	bpf_tail_call(skb, &jmp_table, 0);
+
+	return skb->len * 2;
+}
+
+SEC("classifier")
+int entry(struct __sk_buff *skb)
+{
+	bpf_tail_call(skb, &jmp_table, 1);
+
+	return subprog_tail(skb);
+}
+
+char __license[] SEC("license") = "GPL";
+int _version SEC("version") = 1;
-- 
2.20.1



* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-02 13:49 ` [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
@ 2020-07-10 23:56   ` Alexei Starovoitov
  2020-07-11  3:20     ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2020-07-10 23:56 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: ast, daniel, bpf, netdev, bjorn.topel, magnus.karlsson

On Thu, Jul 02, 2020 at 03:49:29PM +0200, Maciej Fijalkowski wrote:
> This commit serves two things:
> 1) it optimizes BPF prologue/epilogue generation
> 2) it makes possible to have tailcalls within BPF subprogram
> 
> Both points are related to each other since without 1), 2) could not be
> achieved.
> 
> In [1], Alexei says:
> "The prologue will look like:
> nop5
> xor eax,eax  // two new bytes if bpf_tail_call() is used in this
>              // function
> push rbp
> mov rbp, rsp
> sub rsp, rounded_stack_depth
> push rax // zero init tail_call counter
> variable number of push rbx,r13,r14,r15
> 
> Then bpf_tail_call will pop variable number rbx,..
> and final 'pop rax'
> Then 'add rsp, size_of_current_stack_frame'
> jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
> rbp, rsp'
> 
> This way new function will set its own stack size and will init tail
> call
> counter with whatever value the parent had.
> 
> If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> Instead it would need to have 'nop2' in there."
> 
> Implement that suggestion.
> 
> Since the layout of stack is changed, tail call counter handling can not
> rely anymore on popping it to rbx just like it have been handled for
> constant prologue case and later overwrite of rbx with actual value of
> rbx pushed to stack. Therefore, let's use one of the register (%rcx) that
> is considered to be volatile/caller-saved and pop the value of tail call
> counter in there in the epilogue.
> 
> Drop the BUILD_BUG_ON in emit_prologue and in
> emit_bpf_tail_call_indirect where instruction layout is not constant
> anymore.
> 
> Introduce new poke target, 'ip_aux' to poke descriptor that is dedicated

imo ip_aux approach has too much x86 specific code in kernel/bpf/arraymap.c
Ex. NOP_ATOMIC5 is x86 only and will break build on all other archs.

But I'm not sure ip_aux is really necessary.
It's nice to optimize the case when tail_call target is NULL, but
redundant unwind + nop5 + push_regs_again makes for much simpler design
without worrying about state transitions.

So I don't think optimizing the case of target==NULL is really worth the complexity.

> for skipping the register pops and stack unwind that are generated right
> before the actual jump to target program.
> For case when the target program is not present, BPF program will skip
> the pop instructions and nop5 dedicated for jmpq $target. An example of
> such state when only R6 of callee saved registers is used by program:
> 
> ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
> ffffffffc0513aa6:       5b                      pop    %rbx
> ffffffffc0513aa7:       58                      pop    %rax
> ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

so this last rbx->rdi insn is not part of bpf_tail_call insn, right?
That is just 'R1 = R6;' bpf insn jited.

> 
> When target program is inserted, the jump that was there to skip
> pops/nop5 will become the nop5, so CPU will go over pops and do the
> actual tailcall.
> 
> One might ask why there simply can not be pushes after the nop5?

exactly... and...

> In the following example snippet:
> 
> ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
> (...)
> ffffffffc0370332:       5b                      pop    %rbx
> ffffffffc0370333:       58                      pop    %rax
> ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
> ffffffffc0370347:       50                      push   %rax
> ffffffffc0370348:       53                      push   %rbx
> ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
> ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
> 
> There is the bpf2bpf call right after the tailcall and jump target is
> not present. ctx is %rbx and BPF subprogram that we will call into on
> ffffffffc037034c is relying on it, e.g. it will pick ctx from there.
> Such code layout is therefore broken as we would overwrite the content
> of %rbx with the value that was pushed on the prologue.

I don't understand above explanation.
Are you saying 'callq  0xffffffffc0372548' above is a call to bpf subprogram?
The 'mov %rbx,%rdi' was 'R1 = R6' before JIT.
The code is storing ctx into R1 to pass into bpf subprogram.
It's not part of proposed emit_bpf_tail_call_direct() handling.
It's part of BPF program.
I don't see what breaks.

The new emit_bpf_tail_call_indirect() looks correct to me.

But emit_bpf_tail_call_direct() doesn't need
+ emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
and messy poke->ip_aux.

It can do:
pop_callee_regs()
memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
push_callee_regs()

When target is NULL the pairs of pop/push overall is a nop.
They don't affect correctness.
When prog_array_map_poke_run() is called it will replace a nop5
with a jump. So still all good.

Yes there will be tiny overhead when tail_call target is NULL,
but x86 will execute pop/push pair in _one_ cpu cycle.
As far as I recall x86 hardware has special logic to recognize
such push/pop sequences so they are really fast.

What am I missing?

> 
> For regression checks, 'tailcalls' kselftest was executed:
> $ sudo ./test_progs -t tailcalls
>  #64/1 tailcall_1:OK
>  #64/2 tailcall_2:OK
>  #64/3 tailcall_3:OK
>  #64/4 tailcall_4:OK
>  #64/5 tailcall_5:OK
>  #64 tailcalls:OK
> Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED
> 
> Tail call related cases from test_verifier kselftest are also working
> fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
> work properly as well.
> 
> [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
> 
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 224 ++++++++++++++++++++++++++++--------
>  include/linux/bpf.h         |   1 +
>  kernel/bpf/arraymap.c       |  27 ++++-
>  3 files changed, 199 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 42b6709e6dc7..45136270b02b 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -222,13 +222,47 @@ struct jit_context {
>  /* Number of bytes emit_patch() needs to generate instructions */
>  #define X86_PATCH_SIZE		5
>  
> -#define PROLOGUE_SIZE		25
> +/* Number of bytes that will be skipped on tailcall */
> +#define X86_TAIL_CALL_OFFSET	11
> +
> +static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
> +{
> +	u8 *prog = *pprog;
> +	int cnt = 0;
> +
> +	if (callee_regs_used[0])
> +		EMIT1(0x53);         /* push rbx */
> +	if (callee_regs_used[1])
> +		EMIT2(0x41, 0x55);   /* push r13 */
> +	if (callee_regs_used[2])
> +		EMIT2(0x41, 0x56);   /* push r14 */
> +	if (callee_regs_used[3])
> +		EMIT2(0x41, 0x57);   /* push r15 */
> +	*pprog = prog;
> +}
> +
> +static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
> +{
> +	u8 *prog = *pprog;
> +	int cnt = 0;
> +
> +	if (callee_regs_used[3])
> +		EMIT2(0x41, 0x5F);   /* pop r15 */
> +	if (callee_regs_used[2])
> +		EMIT2(0x41, 0x5E);   /* pop r14 */
> +	if (callee_regs_used[1])
> +		EMIT2(0x41, 0x5D);   /* pop r13 */
> +	if (callee_regs_used[0])
> +		EMIT1(0x5B);         /* pop rbx */
> +	*pprog = prog;
> +}
>  
>  /*
> - * Emit x86-64 prologue code for BPF program and check its size.
> + * Emit x86-64 prologue code for BPF program.
>   * bpf_tail_call helper will skip it while jumping into another program
>   */
> -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> +			  bool tail_call)
>  {
>  	u8 *prog = *pprog;
>  	int cnt = X86_PATCH_SIZE;
> @@ -238,19 +272,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
>  	 */
>  	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
>  	prog += cnt;
> +	if (!ebpf_from_cbpf && tail_call)
> +		EMIT2(0x31, 0xC0);       /* xor eax, eax */
> +	else
> +		EMIT2(0x66, 0x90);       /* nop2 */
>  	EMIT1(0x55);             /* push rbp */
>  	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
>  	/* sub rsp, rounded_stack_depth */
>  	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> -	EMIT1(0x53);             /* push rbx */
> -	EMIT2(0x41, 0x55);       /* push r13 */
> -	EMIT2(0x41, 0x56);       /* push r14 */
> -	EMIT2(0x41, 0x57);       /* push r15 */
> -	if (!ebpf_from_cbpf) {
> -		/* zero init tail_call_cnt */
> -		EMIT2(0x6a, 0x00);
> -		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> -	}
> +	if (!ebpf_from_cbpf && tail_call)
> +		EMIT1(0x50);         /* push rax */
>  	*pprog = prog;
>  }
>  
> @@ -337,6 +368,22 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
>  	return __bpf_arch_text_poke(ip, t, old_addr, new_addr, true);
>  }
>  
> +static int get_pop_bytes(bool *callee_regs_used)
> +{
> +	int bytes = 0;
> +
> +	if (callee_regs_used[3])
> +		bytes += 2;
> +	if (callee_regs_used[2])
> +		bytes += 2;
> +	if (callee_regs_used[1])
> +		bytes += 2;
> +	if (callee_regs_used[0])
> +		bytes += 1;
> +
> +	return bytes;
> +}
> +
>  /*
>   * Generate the following code:
>   *
> @@ -351,12 +398,25 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
>   *   goto *(prog->bpf_func + prologue_size);
>   * out:
>   */
> -static void emit_bpf_tail_call_indirect(u8 **pprog)
> +static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> +					u32 stack_depth)
>  {
>  	u8 *prog = *pprog;
> -	int label1, label2, label3;
> +	int pop_bytes = 0;
> +	int off1 = 49;
> +	int off2 = 38;
> +	int off3 = 16;
>  	int cnt = 0;
>  
> +	/* count the additional bytes used for popping callee regs from stack
> +	 * that need to be taken into account for each of the offsets that
> +	 * are used for bailing out of the tail call
> +	 */
> +	pop_bytes = get_pop_bytes(callee_regs_used);
> +	off1 += pop_bytes;
> +	off2 += pop_bytes;
> +	off3 += pop_bytes;
> +
>  	/*
>  	 * rdi - pointer to ctx
>  	 * rsi - pointer to bpf_array
> @@ -370,75 +430,111 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
>  	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
>  	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
>  	      offsetof(struct bpf_array, map.max_entries));
> -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> +#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
>  	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
> -	label1 = cnt;
>  
>  	/*
>  	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
> +	EMIT2_off32(0x8B, 0x85                    /* mov eax, dword ptr [rbp - (4 + sd)] */,
> +		    -4 - round_up(stack_depth, 8));
>  	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
> +#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
>  	EMIT2(X86_JA, OFFSET2);                   /* ja out */
> -	label2 = cnt;
>  	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
> +	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
> +		    -4 - round_up(stack_depth, 8));
>  
>  	/* prog = array->ptrs[index]; */
> -	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
> +	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,        /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
>  		    offsetof(struct bpf_array, ptrs));
>  
>  	/*
>  	 * if (prog == NULL)
>  	 *	goto out;
>  	 */
> -	EMIT3(0x48, 0x85, 0xC0);		  /* test rax,rax */
> -#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
> -	EMIT2(X86_JE, OFFSET3);                   /* je out */
> -	label3 = cnt;
> +	EMIT3(0x48, 0x85, 0xC9);                   /* test rcx,rcx */
> +#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
> +	EMIT2(X86_JE, OFFSET3);                    /* je out */
>  
> -	/* goto *(prog->bpf_func + prologue_size); */
> -	EMIT4(0x48, 0x8B, 0x40,                   /* mov rax, qword ptr [rax + 32] */
> -	      offsetof(struct bpf_prog, bpf_func));
> -	EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
> +	*pprog = prog;
> +	pop_callee_regs(pprog, callee_regs_used);
> +	prog = *pprog;
> +
> +	EMIT1(0x58);                               /* pop rax */
> +	EMIT3_off32(0x48, 0x81, 0xC4,              /* add rsp, sd */
> +		    round_up(stack_depth, 8));
>  
> +	/* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */
> +	EMIT4(0x48, 0x8B, 0x49,                   /* mov rcx, qword ptr [rcx + 32] */
> +	      offsetof(struct bpf_prog, bpf_func));
> +	EMIT4(0x48, 0x83, 0xC1,                   /* add rcx, X86_TAIL_CALL_OFFSET */
> +	      X86_TAIL_CALL_OFFSET);
>  	/*
> -	 * Wow we're ready to jump into next BPF program
> +	 * Now we're ready to jump into next BPF program
>  	 * rdi == ctx (1st arg)
> -	 * rax == prog->bpf_func + prologue_size
> +	 * rcx == prog->bpf_func + X86_TAIL_CALL_OFFSET
>  	 */
> -	RETPOLINE_RAX_BPF_JIT();
> +	RETPOLINE_RCX_BPF_JIT();
>  
>  	/* out: */
> -	BUILD_BUG_ON(cnt - label1 != OFFSET1);
> -	BUILD_BUG_ON(cnt - label2 != OFFSET2);
> -	BUILD_BUG_ON(cnt - label3 != OFFSET3);
>  	*pprog = prog;
>  }
>  
>  static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
> -				      u8 **pprog, int addr, u8 *image)
> +				      u8 **pprog, int addr, u8 *image,
> +				      bool *callee_regs_used, u32 stack_depth)
>  {
>  	u8 *prog = *pprog;
> +	int pop_bytes = 0;
> +	int off1 = 27;
> +	int poke_off;
>  	int cnt = 0;
>  
> +	/* count the additional bytes used for popping callee regs to stack
> +	 * that need to be taken into account for offset that is used for
> +	 * bailing out of the tail call limit is reached and the poke->ip
> +	 */
> +	pop_bytes = get_pop_bytes(callee_regs_used);
> +	off1 += pop_bytes;
> +
> +	/*
> +	 * total bytes for:
> +	 * - nop5/ jmpq $off
> +	 * - pop callee regs
> +	 * - sub rsp, $val
> +	 * - pop rax
> +	 */
> +	poke_off = X86_PATCH_SIZE + pop_bytes + 7 + 1;
> +
>  	/*
>  	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
>  	 *	goto out;
>  	 */
> -	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
> +	EMIT2_off32(0x8B, 0x85,
> +		    -4 - round_up(stack_depth, 8));   /* mov eax, dword ptr [rbp - (4 + sd)] */
>  	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);         /* cmp eax, MAX_TAIL_CALL_CNT */
> -	EMIT2(X86_JA, 14);                            /* ja out */
> +	EMIT2(X86_JA, off1);                          /* ja out */
>  	EMIT3(0x83, 0xC0, 0x01);                      /* add eax, 1 */
> -	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
> +	EMIT2_off32(0x89, 0x85,
> +		    -4 - round_up(stack_depth, 8));   /* mov dword ptr [rbp - (4 + sd)], eax */
>  
> +	poke->ip_aux = image + (addr - poke_off - X86_PATCH_SIZE);
> +	poke->adj_off = X86_TAIL_CALL_OFFSET;
>  	poke->ip = image + (addr - X86_PATCH_SIZE);
> -	poke->adj_off = PROLOGUE_SIZE;
> +
> +	emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
> +
> +	*pprog = prog;
> +	pop_callee_regs(pprog, callee_regs_used);
> +	prog = *pprog;
> +	EMIT1(0x58);                                  /* pop rax */
> +	EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
>  
>  	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
>  	prog += X86_PATCH_SIZE;
> +
>  	/* out: */
>  
>  	*pprog = prog;
> @@ -474,6 +570,10 @@ static void bpf_tail_call_direct_fixup(struct bpf_prog *prog)
>  						   (u8 *)target->bpf_func +
>  						   poke->adj_off, false);
>  			BUG_ON(ret < 0);
> +			ret = __bpf_arch_text_poke(poke->ip_aux, BPF_MOD_JUMP,
> +						   (u8 *)poke->ip + X86_PATCH_SIZE,
> +						   NULL, false);
> +			BUG_ON(ret < 0);
>  		}
>  		WRITE_ONCE(poke->ip_stable, true);
>  		mutex_unlock(&array->aux->poke_mutex);
> @@ -652,19 +752,44 @@ static bool ex_handler_bpf(const struct exception_table_entry *x,
>  	return true;
>  }
>  
> +static void detect_reg_usage(struct bpf_insn *insn, int insn_cnt,
> +			     bool *regs_used, bool *tail_call_seen)
> +{
> +	int i;
> +
> +	for (i = 1; i <= insn_cnt; i++, insn++) {
> +		if (insn->code == (BPF_JMP | BPF_TAIL_CALL))
> +			*tail_call_seen = true;
> +		if (insn->dst_reg == BPF_REG_6 || insn->src_reg == BPF_REG_6)
> +			regs_used[0] = true;
> +		if (insn->dst_reg == BPF_REG_7 || insn->src_reg == BPF_REG_7)
> +			regs_used[1] = true;
> +		if (insn->dst_reg == BPF_REG_8 || insn->src_reg == BPF_REG_8)
> +			regs_used[2] = true;
> +		if (insn->dst_reg == BPF_REG_9 || insn->src_reg == BPF_REG_9)
> +			regs_used[3] = true;
> +	}
> +}
> +
>  static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  		  int oldproglen, struct jit_context *ctx)
>  {
>  	struct bpf_insn *insn = bpf_prog->insnsi;
> +	bool callee_regs_used[4] = {};
>  	int insn_cnt = bpf_prog->len;
> +	bool tail_call_seen = false;
>  	bool seen_exit = false;
>  	u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
>  	int i, cnt = 0, excnt = 0;
>  	int proglen = 0;
>  	u8 *prog = temp;
>  
> +	detect_reg_usage(insn, insn_cnt, callee_regs_used,
> +			 &tail_call_seen);
> +
>  	emit_prologue(&prog, bpf_prog->aux->stack_depth,
> -		      bpf_prog_was_classic(bpf_prog));
> +		      bpf_prog_was_classic(bpf_prog), tail_call_seen);
> +	push_callee_regs(&prog, callee_regs_used);
>  	addrs[0] = prog - temp;
>  
>  	for (i = 1; i <= insn_cnt; i++, insn++) {
> @@ -1109,9 +1234,13 @@ xadd:			if (is_imm8(insn->off))
>  		case BPF_JMP | BPF_TAIL_CALL:
>  			if (imm32)
>  				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
> -							  &prog, addrs[i], image);
> +							  &prog, addrs[i], image,
> +							  callee_regs_used,
> +							  bpf_prog->aux->stack_depth);
>  			else
> -				emit_bpf_tail_call_indirect(&prog);
> +				emit_bpf_tail_call_indirect(&prog,
> +							    callee_regs_used,
> +							    bpf_prog->aux->stack_depth);
>  			break;
>  
>  			/* cond jump */
> @@ -1294,12 +1423,9 @@ xadd:			if (is_imm8(insn->off))
>  			seen_exit = true;
>  			/* Update cleanup_addr */
>  			ctx->cleanup_addr = proglen;
> -			if (!bpf_prog_was_classic(bpf_prog))
> -				EMIT1(0x5B); /* get rid of tail_call_cnt */
> -			EMIT2(0x41, 0x5F);   /* pop r15 */
> -			EMIT2(0x41, 0x5E);   /* pop r14 */
> -			EMIT2(0x41, 0x5D);   /* pop r13 */
> -			EMIT1(0x5B);         /* pop rbx */
> +			pop_callee_regs(&prog, callee_regs_used);
> +			if (!bpf_prog_was_classic(bpf_prog) && tail_call_seen)
> +				EMIT1(0x59); /* pop rcx, get rid of tail_call_cnt */
>  			EMIT1(0xC9);         /* leave */
>  			EMIT1(0xC3);         /* ret */
>  			break;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3d2ade703a35..0554af067e61 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -652,6 +652,7 @@ enum bpf_jit_poke_reason {
>  /* Descriptor of pokes pointing /into/ the JITed image. */
>  struct bpf_jit_poke_descriptor {
>  	void *ip;
> +	void *ip_aux;
>  	union {
>  		struct {
>  			struct bpf_map *map;
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index ec5cd11032aa..60423467997d 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -761,6 +761,8 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
>  {
>  	struct prog_poke_elem *elem;
>  	struct bpf_array_aux *aux;
> +	bool is_nop;
> +	s32 *off;
>  
>  	aux = container_of(map, struct bpf_array, map)->aux;
>  	WARN_ON_ONCE(!mutex_is_locked(&aux->poke_mutex));
> @@ -808,12 +810,29 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
>  			if (poke->tail_call.map != map ||
>  			    poke->tail_call.key != key)
>  				continue;
> +			/* protect against un-updated poke descriptors since
> +			 * we could fill them from subprog and the same desc
> +			 * is present on main's program poke tab
> +			 */
> +			if (!poke->ip_aux || !poke->ip)
> +				continue;
>  
> +			if (!new)
> +				goto skip_poke;
> +
> +			off = (s32 *)((u8 *)(poke->ip + 1));
> +			is_nop = !memcmp(poke->ip, ideal_nops[NOP_ATOMIC5], 5);
>  			ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
> -						 old ? (u8 *)old->bpf_func +
> -						 poke->adj_off : NULL,
> -						 new ? (u8 *)new->bpf_func +
> -						 poke->adj_off : NULL);
> +						 is_nop ? NULL : (u8 *)poke->ip +
> +						 *off + 5,
> +						 (u8 *)new->bpf_func +
> +						 poke->adj_off);
> +			BUG_ON(ret < 0 && ret != -EINVAL);
> +skip_poke:
> +			is_nop = !memcmp(poke->ip_aux, ideal_nops[NOP_ATOMIC5], 5);
> +			ret = bpf_arch_text_poke(poke->ip_aux, BPF_MOD_JUMP,
> +						 is_nop ? NULL : (u8 *)poke->ip + 5,
> +						 new ? NULL : (u8 *)poke->ip + 5);
>  			BUG_ON(ret < 0 && ret != -EINVAL);
>  		}
>  	}
> -- 
> 2.20.1
> 


* Re: [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms
  2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2020-07-02 13:49 ` [RFC PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
@ 2020-07-11  0:10 ` Alexei Starovoitov
  2020-07-14  0:22   ` Maciej Fijalkowski
  5 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2020-07-11  0:10 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: ast, daniel, bpf, netdev, bjorn.topel, magnus.karlsson

On Thu, Jul 02, 2020 at 03:49:25PM +0200, Maciej Fijalkowski wrote:
> Hello,
> 
> today bpf2bpf calls and tailcalls exclude each other. This set is a
> proposal to make them work together. It is still a RFC because we need
> to decide if the performance impact for BPF programs with tailcalls is
> acceptable or not. Note that I have been focused only on x86
> architecture, I am not sure if other architectures have some other
> restrictions that were stopping us to have tailcalls in BPF subprograms.
> 
> I would also like to get a feedback on prog_array_map_poke_run changes.
> 
> To give you an overview how this work started, previously I posted RFC
> that was targetted at getting rid of push/pop instructions for callee
> saved registers in x86-64 JIT that are not used by the BPF program.
> Alexei saw a potential that that work could be lifted a bit and
> tailcalls could work with BPF subprograms. More on that in [1], [2].
> 
> In [1], Alexei says:
> 
> "The prologue will look like:
> nop5
> xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> function
> push rbp
> mov rbp, rsp
> sub rsp, rounded_stack_depth
> push rax // zero init tail_call counter
> variable number of push rbx,r13,r14,r15
> 
> Then bpf_tail_call will pop variable number rbx,..
> and final 'pop rax'
> Then 'add rsp, size_of_current_stack_frame'
> jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
> rbp, rsp'
> 
> This way new function will set its own stack size and will init tail
> call counter with whatever value the parent had.
> 
> If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> Instead it would need to have 'nop2' in there."
> 
> So basically I gave a shot at that suggestion. Patch 4 has a description
> of implementation.
> 
> Quick overview of patches:
> Patch 1 changes BPF retpoline to use %rcx instead of %rax to store
> address of BPF tailcall target program
> Patch 2 relaxes verifier's restrictions about tailcalls being used with
> BPF subprograms
> Patch 3 propagates poke descriptors from main program to each subprogram
> Patch 4 is the main dish in this set. It implements new prologue layout
> that was suggested by Alexei and reworks tailcall handling.
> Patch 5 is the new selftest that proves tailcalls can be used from
> within BPF subprogram.
> 
> -------------------------------------------------------------------
> Debatable prog_array_map_poke_run changes:
> 
> Before the tailcall and with the new prologue layout, stack need to be
> unwinded and callee saved registers need to be popped. Instructions
> responsible for that are generated, but they should not be executed if
> target program is not present. To address that, new poke target 'ip_aux'
> is introduced to poke descriptor that will be used for skipping these
> instructions. This means there are two poke targets for handling direct
> tailcalls. Simplified flow can be presented as three sections:
> 
> 1. skip call or nop (poke->ip_aux)
> 2. stack unwind
> 3. call tail or nop (poke->ip)
> 
> It would be possible that one of CPU might be in point 2 and point 3 is
> not yet updated (nop), which would lead to problems mentioned in patch 4
> commit message, IOW unwind section should not be executed if there is no
> target program.
> 
> We can define the following state matrix for that (courtesy of Bjorn):
> A nop, unwind, nop
> B nop, unwind, tail
> C skip, unwind, nop
> D skip, unwind, tail
> 
> A is forbidden (it leads to incorrectness). C is the starting state. What if
> we just make sure we *never* enter that state, and never return to C?
> 
> First install tail call f: C->D->B(f)
>  * poke the tailcall, after that get rid of the skip
> Update tail call f to f': B(f)->B(f')
>  * poke the tailcall (poke->ip) and do NOT touch the poke->ip_aux
> Remove tail call: B(f')->D(f')
>  * do NOT touch poke->ip, only poke the poke->ip_aux. Note that we do
>    not get back to C(f')
> Install new tail call (f''): D(f')->D(f'')->B(f'').
>  * poke both targets, first ->ip then ->ip_aux
> 
> So, by avoiding A and never going back to C another CPU can never be
> exposed to the "unwind, tail" state.
> 
> Right now, due to 'faking' the bpf_arch_text_poke,
> prog_array_map_poke_run looks a bit messy. I dropped the 'old' argument
> usage altogether and instead do the reverse calculation that is done
> by emit_patch, so that the result of memcmp(ip, old_insn, X86_PATCH_SIZE)
> is 0 and we do the actual poking.
> 
> Presumably this is something to be fixed/improved, but at first, I would
> like to hear opinions of others and have some decision whether it is worth
> pursuing, or not.

above state transitions break my mind.
I replied in the patch 3. I hope we don't need this extra first nop5
and ip_aux.

> 
> -------------------------------------------------------------------
> Performance impact:
> 
> As requested, I'm including the performance numbers that show an
> impact of that set, but I did not analyze it. Let's do this on list.
> Also, please let me know if these scenarios make sense and are
> sufficient.
> 
> All of this work, as stated in [2], started as a way to speed up AF-XDP
> by dropping the push/pop of unused callee saved registers in prologue
> and epilogue. Impact is positive, 15% of performance gain.
> 
> However, it is obvious that it will have a negative impact on BPF
> programs that utilize tailcalls. I was asked to provide some numbers
> that will tell us how much these cases are actually hurt by this
> set.
> 
> Below are the numbers from 'perf stat' for two scenarios.
> First scenario is the output of command:
> 
> $ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls
> 
> the tailcalls kselftest was modified in the following way:
> - only the tailcall1 subtest is enabled
> - each of the bpf_prog_test_run() calls had its 'repeat' argument set to
>   1000000
> 
> Numbers without this set:
> 
>  Performance counter stats for './test_progs -t tailcalls' (1024 runs):
> 
>             198.73 msec task-clock                #    0.997 CPUs utilized            ( +-  0.13% )
>                  6      context-switches          #    0.030 K/sec                    ( +-  0.75% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +- 22.15% )
>                108      page-faults               #    0.546 K/sec                    ( +-  0.03% )
>        693,910,413      cycles                    #    3.492 GHz                      ( +-  0.11% )  (30.26%)
>      1,067,635,122      instructions              #    1.54  insn per cycle           ( +-  0.03% )  (38.16%)
>        165,308,809      branches                  #  831.822 M/sec                    ( +-  0.02% )  (38.46%)
>          9,940,504      branch-misses             #    6.01% of all branches          ( +-  0.02% )  (38.77%)
>        226,741,985      L1-dcache-loads           # 1140.949 M/sec                    ( +-  0.02% )  (39.07%)
>            161,936      L1-dcache-load-misses     #    0.07% of all L1-dcache hits    ( +-  0.66% )  (39.12%)
>             43,777      LLC-loads                 #    0.220 M/sec                    ( +-  0.97% )  (31.07%)
>             11,773      LLC-load-misses           #   26.89% of all LL-cache hits     ( +-  0.99% )  (30.93%)
>    <not supported>      L1-icache-loads
>             97,692      L1-icache-load-misses                                         ( +-  0.51% )  (30.77%)
>        229,069,211      dTLB-loads                # 1152.659 M/sec                    ( +-  0.02% )  (30.62%)
>              1,031      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.28% )  (30.46%)
>              2,236      iTLB-loads                #    0.011 M/sec                    ( +-  1.28% )  (30.30%)
>                357      iTLB-load-misses          #   15.99% of all iTLB cache hits   ( +-  2.10% )  (30.16%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
> 
>           0.199307 +- 0.000250 seconds time elapsed  ( +-  0.13% )
> 
> With:
> 
>  Performance counter stats for './test_progs -t tailcalls' (1024 runs):
> 
>             202.48 msec task-clock                #    0.997 CPUs utilized            ( +-  0.09% )

I think this extra overhead is totally acceptable for such an important feature.

>                  6      context-switches          #    0.032 K/sec                    ( +-  1.86% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +- 30.00% )
>                108      page-faults               #    0.535 K/sec                    ( +-  0.03% )
>        718,001,313      cycles                    #    3.546 GHz                      ( +-  0.06% )  (30.12%)
>      1,041,618,306      instructions              #    1.45  insn per cycle           ( +-  0.03% )  (37.96%)
>        226,386,119      branches                  # 1118.091 M/sec                    ( +-  0.03% )  (38.35%)
>          9,882,436      branch-misses             #    4.37% of all branches          ( +-  0.02% )  (38.59%)
>        196,832,137      L1-dcache-loads           #  972.128 M/sec                    ( +-  0.02% )  (39.15%)
>            217,794      L1-dcache-load-misses     #    0.11% of all L1-dcache hits    ( +-  0.67% )  (39.23%)
>             70,690      LLC-loads                 #    0.349 M/sec                    ( +-  0.90% )  (31.15%)
>             18,802      LLC-load-misses           #   26.60% of all LL-cache hits     ( +-  0.84% )  (31.18%)
>    <not supported>      L1-icache-loads
>            106,461      L1-icache-load-misses                                         ( +-  0.51% )  (30.83%)
>        198,887,011      dTLB-loads                #  982.277 M/sec                    ( +-  0.02% )  (30.66%)
>              1,483      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.28% )  (30.50%)
>              4,064      iTLB-loads                #    0.020 M/sec                    ( +- 21.43% )  (30.23%)
>                488      iTLB-load-misses          #   12.00% of all iTLB cache hits   ( +-  1.95% )  (30.03%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
> 
>           0.203081 +- 0.000187 seconds time elapsed  ( +-  0.09% )
> 
> 
> The second measurement was done on the BPF kselftest flow_dissector, which
> uses progs/bpf_flow.c, with the 'repeat' argument of
> bpf_prog_test_run_xattr also set to 1000000.
> 
> Without:
> 
>  Performance counter stats for './test_progs -t flow_dissector' (1024 runs):
> 
>           1,340.52 msec task-clock                #    0.987 CPUs utilized            ( +-  0.05% )
>                 25      context-switches          #    0.018 K/sec                    ( +-  0.32% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-  8.59% )
>                122      page-faults               #    0.091 K/sec                    ( +-  0.03% )
>      4,764,381,512      cycles                    #    3.554 GHz                      ( +-  0.04% )  (30.68%)
>      7,674,803,496      instructions              #    1.61  insn per cycle           ( +-  0.01% )  (38.41%)
>      1,118,346,714      branches                  #  834.261 M/sec                    ( +-  0.00% )  (38.46%)
>         29,132,651      branch-misses             #    2.60% of all branches          ( +-  0.00% )  (38.50%)
>      1,737,552,687      L1-dcache-loads           # 1296.174 M/sec                    ( +-  0.01% )  (38.55%)
>          1,064,105      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    ( +-  1.28% )  (38.57%)
>             50,356      LLC-loads                 #    0.038 M/sec                    ( +-  1.42% )  (30.82%)
>             10,825      LLC-load-misses           #   21.50% of all LL-cache hits     ( +-  1.42% )  (30.79%)
>    <not supported>      L1-icache-loads
>            568,800      L1-icache-load-misses                                         ( +-  0.66% )  (30.77%)
>      1,741,511,307      dTLB-loads                # 1299.127 M/sec                    ( +-  0.01% )  (30.75%)
>              5,112      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.29% )  (30.73%)
>              2,128      iTLB-loads                #    0.002 M/sec                    ( +-  2.06% )  (30.70%)
>                571      iTLB-load-misses          #   26.85% of all iTLB cache hits   ( +-  3.10% )  (30.68%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
> 
>           1.358653 +- 0.000741 seconds time elapsed  ( +-  0.05% )
> 
> 
> With:
> 
>  Performance counter stats for './test_progs -t flow_dissector' (1024 runs):
> 
>           1,426.95 msec task-clock                #    0.989 CPUs utilized            ( +-  0.04% )

are you saying the patches add ~6% overhead?

>                 23      context-switches          #    0.016 K/sec                    ( +-  0.40% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-  6.38% )
>                122      page-faults               #    0.085 K/sec                    ( +-  0.03% )
>      4,772,798,523      cycles                    #    3.345 GHz                      ( +-  0.03% )  (30.70%)
>      7,837,101,633      instructions              #    1.64  insn per cycle           ( +-  0.00% )  (38.42%)

but the overhead cannot be due to extra instructions.

>      1,118,716,987      branches                  #  783.992 M/sec                    ( +-  0.00% )  (38.46%)
>         29,147,367      branch-misses             #    2.61% of all branches          ( +-  0.00% )  (38.51%)
>      1,797,232,091      L1-dcache-loads           # 1259.492 M/sec                    ( +-  0.00% )  (38.55%)
>          1,487,769      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  0.66% )  (38.55%)
>             50,180      LLC-loads                 #    0.035 M/sec                    ( +-  1.37% )  (30.81%)
>             14,709      LLC-load-misses           #   29.31% of all LL-cache hits     ( +-  1.11% )  (30.79%)
>    <not supported>      L1-icache-loads
>            626,633      L1-icache-load-misses                                         ( +-  0.58% )  (30.77%)
>      1,800,278,668      dTLB-loads                # 1261.627 M/sec                    ( +-  0.01% )  (30.75%)
>              3,809      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.71% )  (30.72%)
>              1,745      iTLB-loads                #    0.001 M/sec                    ( +-  3.90% )  (30.70%)
>             12,267      iTLB-load-misses          #  703.02% of all iTLB cache hits   ( +- 96.08% )  (30.68%)

Looks like that's where the perf is suffering. The number of iTLB misses jumps.
It could be due to ip_aux. Just a guess.

Could you try the unwind+nop5+push approach and compare before/after,
but this time please share an annotated 'perf report'.
If iTLB is still struggling, 'perf report' should be able to pinpoint
the instruction that is causing it.
Maybe the jmp target needs to be 16-byte aligned or something.
Or we simply got unlucky by pushing into a different cache line.

Also when you do this microbenchmarking please make sure
that bpf_jit_binary_alloc() does not use get_random_int().
It can cause nasty surprises and run-to-run variations.
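
For reference, one way to take that randomization out of the measurement is
to temporarily pin the start offset that bpf_jit_binary_alloc() picks for the
image -- a throwaway, benchmarking-only sketch, with the 'hole'/'start' names
recalled from memory rather than copied from the source:

	/* make the JIT image start offset deterministic so code layout,
	 * and therefore i-cache/iTLB behaviour, is identical across runs
	 */
-	start = (get_random_int() % hole) & ~(alignment - 1);
+	start = 0;

Not something to ship, of course, just a way to remove one source of
run-to-run variation while measuring.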


* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-10 23:56   ` Alexei Starovoitov
@ 2020-07-11  3:20     ` Alexei Starovoitov
  2020-07-11  3:25       ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2020-07-11  3:20 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Fri, Jul 10, 2020 at 4:56 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 02, 2020 at 03:49:29PM +0200, Maciej Fijalkowski wrote:
> > This commit serves two things:
> > 1) it optimizes BPF prologue/epilogue generation
> > 2) it makes possible to have tailcalls within BPF subprogram
> >
> > Both points are related to each other since without 1), 2) could not be
> > achieved.
> >
> > In [1], Alexei says:
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> >              // function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> >
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
> > rbp, rsp'
> >
> > This way new function will set its own stack size and will init tail
> > call
> > counter with whatever value the parent had.
> >
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there."
> >
> > Implement that suggestion.
> >
> > Since the layout of stack is changed, tail call counter handling can not
> > rely anymore on popping it to rbx just like it have been handled for
> > constant prologue case and later overwrite of rbx with actual value of
> > rbx pushed to stack. Therefore, let's use one of the register (%rcx) that
> > is considered to be volatile/caller-saved and pop the value of tail call
> > counter in there in the epilogue.
> >
> > Drop the BUILD_BUG_ON in emit_prologue and in
> > emit_bpf_tail_call_indirect where instruction layout is not constant
> > anymore.
> >
> > Introduce new poke target, 'ip_aux' to poke descriptor that is dedicated
>
> imo ip_aux approach has too much x86 specific code in kernel/bpf/arraymap.c
> Ex. NOP_ATOMIC5 is x86 only and will break build on all other archs.
>
> But I'm not sure ip_aux is really necessary.
> It's nice to optimize the case when tail_call target is NULL, but
> redundant unwind + nop5 + push_regs_again makes for much simpler design
> without worrying about state transitions.
>
> So I don't think optimizing the case of target==NULL is really worth the complexity.
>
> > for skipping the register pops and stack unwind that are generated right
> > before the actual jump to target program.
> > For case when the target program is not present, BPF program will skip
> > the pop instructions and nop5 dedicated for jmpq $target. An example of
> > such state when only R6 of callee saved registers is used by program:
> >
> > ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
> > ffffffffc0513aa6:       5b                      pop    %rbx
> > ffffffffc0513aa7:       58                      pop    %rax
> > ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
>
> so this last rbx->rdi insn is not part of bpf_tail_call insn, right?
> That is just 'R1 = R6;' bpf insn jited.
>
> >
> > When target program is inserted, the jump that was there to skip
> > pops/nop5 will become the nop5, so CPU will go over pops and do the
> > actual tailcall.
> >
> > One might ask why there simply can not be pushes after the nop5?
>
> exactly... and...
>
> > In the following example snippet:
> >
> > ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
> > (...)
> > ffffffffc0370332:       5b                      pop    %rbx
> > ffffffffc0370333:       58                      pop    %rax
> > ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
> > ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> > ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
> > ffffffffc0370347:       50                      push   %rax
> > ffffffffc0370348:       53                      push   %rbx
> > ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
> > ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
> >
> > There is the bpf2bpf call right after the tailcall and jump target is
> > not present. ctx is %rbx and BPF subprogram that we will call into on
> > ffffffffc037034c is relying on it, e.g. it will pick ctx from there.
> > Such code layout is therefore broken as we would overwrite the content
> > of %rbx with the value that was pushed on the prologue.
>
> I don't understand above explanation.
> Are you saying 'callq  0xffffffffc0372548' above is a call to bpf subprogram?
> The 'mov %rbx,%rdi' was 'R1 = R6' before JIT.
> The code is storing ctx into R1 to pass into bpf subprogram.
> It's not part of proposed emit_bpf_tail_call_direct() handling.
> It's part of BPF program.
> I don't see what breaks.
>
> The new emit_bpf_tail_call_indirect() looks correct to me.
>
> But emit_bpf_tail_call_direct() doesn't need
> + emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
> and messy poke->ip_aux.
>
> It can do:
> pop_callee_regs()
> memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
> push_callee_regs()
>
> When target is NULL the pairs of pop/push overall is a nop.
> They don't affect correctness.
> When prog_array_map_poke_run() is called it will replace a nop5
> with a jump. So still all good.
>
> Yes there will be tiny overhead when tail_call target is NULL,
> but x86 will execute pop/push pair in _one_ cpu cycle.
> As far as I recall x86 hardware has special logic to recognize
> such push/pop sequences so they are really fast.
>
> What am I missing?

Of course you are right.
pop+nop+push is incorrect.

How about the following instead:
- during JIT:
emit_jump(to_skip_below)  <- poke->tailcall_bypass
pop_callee_regs
emit_jump(to_tailcall_target) <- poke->tailcall_target

- Transition from one target to another:
text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
if (new_jmp != NULL)
  text_poke(poke->tailcall_bypass, MOD jmp into nop);
else
  text_poke(poke->tailcall_bypass, MOD nop into jmp);

In other words, let's keep the jmp always valid, so the race
you've described in the cover letter won't ever happen.

The kernel/bpf/arraymap.c will stay arch independent too.
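
For concreteness, the code emitted at a direct tail call site would then look
roughly like this (schematic only; the register choice and the trailing
'mov %rbx,%rdi' are carried over from the earlier disassembly as an example,
not taken from an actual build):

	jmp  out                  // poke->tailcall_bypass: flipped to nop5
	                          // once a target program is installed
	pop  %rbx                 // pop the used callee-saved regs
	pop  %rax                 // pop the tail_call counter
	add  $stack_depth, %rsp   // unwind the current stack frame
	nop5                      // poke->tailcall_target: flipped to
	                          // jmp target + X86_TAIL_CALL_OFFSET
out:
	mov  %rbx,%rdi            // first insn after the tail call site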

Thoughts?


* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-11  3:20     ` Alexei Starovoitov
@ 2020-07-11  3:25       ` Alexei Starovoitov
  2020-07-14  1:00         ` Maciej Fijalkowski
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2020-07-11  3:25 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Fri, Jul 10, 2020 at 8:20 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Of course you are right.
> pop+nop+push is incorrect.
>
> How about the following instead:
> - during JIT:
> emit_jump(to_skip_below)  <- poke->tailcall_bypass
> pop_callee_regs
> emit_jump(to_tailcall_target) <- poke->tailcall_target
>
> - Transition from one target to another:
> text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> if (new_jmp != NULL)
>   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> else
>   text_poke(poke->tailcall_bypass, MOD nop into jmp);

One more correction. I meant:

if (new_jmp != NULL) {
  text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
  text_poke(poke->tailcall_bypass, MOD jmp into nop);
} else {
  text_poke(poke->tailcall_bypass, MOD nop into jmp);
}
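
Fleshing that out, a minimal sketch of how the arch-independent update could
look (the tailcall_target/tailcall_bypass fields and the bypass_addr
computation are assumptions that follow the proposal above, not existing
kernel API):

	/* sketch: flip a direct tail call site between "bypassed" and
	 * "active" while keeping the pop/unwind sequence unreachable
	 * whenever no target program is installed
	 */
	static void poke_tailcall_site(struct bpf_jit_poke_descriptor *poke,
				       struct bpf_prog *old,
				       struct bpf_prog *new)
	{
		u8 *old_jmp = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
		u8 *new_jmp = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
		/* where the bypass jump lands: right after the target site */
		u8 *bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;

		if (new_jmp) {
			/* make the target jump valid first ... */
			bpf_arch_text_poke(poke->tailcall_target, BPF_MOD_JUMP,
					   old_jmp, new_jmp);
			/* ... then turn the bypass jmp into nop5 so execution
			 * falls through into the unwind; returns -EINVAL (and
			 * changes nothing) if the bypass is already nop5
			 */
			bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
					   bypass_addr, NULL);
		} else {
			/* re-arm the bypass; the now-stale target jump is
			 * never reached, so it can be left as is
			 */
			bpf_arch_text_poke(poke->tailcall_bypass, BPF_MOD_JUMP,
					   NULL, bypass_addr);
		}
	}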


* Re: [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms
  2020-07-11  0:10 ` [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Alexei Starovoitov
@ 2020-07-14  0:22   ` Maciej Fijalkowski
  0 siblings, 0 replies; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-14  0:22 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: ast, daniel, bpf, netdev, bjorn.topel, magnus.karlsson

On Fri, Jul 10, 2020 at 05:10:08PM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 02, 2020 at 03:49:25PM +0200, Maciej Fijalkowski wrote:
> > Hello,
> > 
> > today bpf2bpf calls and tailcalls exclude each other. This set is a
> > proposal to make them work together. It is still a RFC because we need
> > to decide if the performance impact for BPF programs with tailcalls is
> > acceptable or not. Note that I have been focused only on x86
> > architecture, I am not sure if other architectures have some other
> > restrictions that were stopping us to have tailcalls in BPF subprograms.
> > 
> > I would also like to get a feedback on prog_array_map_poke_run changes.
> > 
> > To give you an overview how this work started, previously I posted RFC
> > that was targetted at getting rid of push/pop instructions for callee
> > saved registers in x86-64 JIT that are not used by the BPF program.
> > Alexei saw a potential that that work could be lifted a bit and
> > tailcalls could work with BPF subprograms. More on that in [1], [2].
> > 
> > In [1], Alexei says:
> > 
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> > function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> > 
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
> > rbp, rsp'
> > 
> > This way new function will set its own stack size and will init tail
> > call counter with whatever value the parent had.
> > 
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there."
> > 
> > So basically I gave a shot at that suggestion. Patch 4 has a description
> > of implementation.
> > 
> > Quick overview of patches:
> > Patch 1 changes BPF retpoline to use %rcx instead of %rax to store
> > address of BPF tailcall target program
> > Patch 2 relaxes verifier's restrictions about tailcalls being used with
> > BPF subprograms
> > Patch 3 propagates poke descriptors from main program to each subprogram
> > Patch 4 is the main dish in this set. It implements new prologue layout
> > that was suggested by Alexei and reworks tailcall handling.
> > Patch 5 is the new selftest that proves tailcalls can be used from
> > within BPF subprogram.
> > 
> > -------------------------------------------------------------------
> > Debatable prog_array_map_poke_run changes:
> > 
> > Before the tailcall and with the new prologue layout, stack need to be
> > unwinded and callee saved registers need to be popped. Instructions
> > responsible for that are generated, but they should not be executed if
> > target program is not present. To address that, new poke target 'ip_aux'
> > is introduced to poke descriptor that will be used for skipping these
> > instructions. This means there are two poke targets for handling direct
> > tailcalls. Simplified flow can be presented as three sections:
> > 
> > 1. skip call or nop (poke->ip_aux)
> > 2. stack unwind
> > 3. call tail or nop (poke->ip)
> > 
> > It would be possible that one of CPU might be in point 2 and point 3 is
> > not yet updated (nop), which would lead to problems mentioned in patch 4
> > commit message, IOW unwind section should not be executed if there is no
> > target program.
> > 
> > We can define the following state matrix for that (courtesy of Bjorn):
> > A nop, unwind, nop
> > B nop, unwind, tail
> > C skip, unwind, nop
> > D skip, unwind, tail
> > 
> > A is forbidden (it leads to incorrectness). C is the starting state. What if
> > we just make sure we *never* enter that state, and never return to C?
> > 
> > First install tail call f: C->D->B(f)
> >  * poke the tailcall, after that get rid of the skip
> > Update tail call f to f': B(f)->B(f')
> >  * poke the tailcall (poke->ip) and do NOT touch the poke->ip_aux
> > Remove tail call: B(f')->D(f')
> >  * do NOT touch poke->ip, only poke the poke->ip_aux. Note that we do
> >    not get back to C(f')
> > Install new tail call (f''): D(f')->D(f'')->B(f'').
> >  * poke both targets, first ->ip then ->ip_aux
> > 
> > So, by avoiding A and never going back to C another CPU can never be
> > exposed to the "unwind, tail" state.
> > 
> > Right now, due to 'faking' the bpf_arch_text_poke,
> > prog_array_map_poke_run looks a bit messy. I dropped the 'old' argument
> > usage altogether and instead do the reverse calculation that is done
> > by emit_patch, so that the result of memcmp(ip, old_insn, X86_PATCH_SIZE)
> > is 0 and we do the actual poking.
> > 
> > Presumably this is something to be fixed/improved, but at first, I would
> > like to hear opinions of others and have some decision whether it is worth
> > pursuing, or not.
> 
> above state transitions break my mind.
> I replied in the patch 3. I hope we don't need this extra first nop5
> and ip_aux.
> 
> > 
> > -------------------------------------------------------------------
> > Performance impact:
> > 
> > As requested, I'm including the performance numbers that show an
> > impact of that set, but I did not analyze it. Let's do this on list.
> > Also, please let me know if these scenarios make sense and are
> > sufficient.
> > 
> > All of this work, as stated in [2], started as a way to speed up AF-XDP
> > by dropping the push/pop of unused callee saved registers in prologue
> > and epilogue. Impact is positive, 15% of performance gain.
> > 
> > However, it is obvious that it will have a negative impact on BPF
> > programs that utilize tailcalls. I was asked to provide some numbers
> > that will tell us how much these cases are actually hurt by this
> > set.
> > 
> > Below are the numbers from 'perf stat' for two scenarios.
> > First scenario is the output of command:
> > 
> > $ sudo perf stat -ddd -r 1024 ./test_progs -t tailcalls
> > 
> > the tailcalls kselftest was modified in the following way:
> > - only the tailcall1 subtest is enabled
> > - each of the bpf_prog_test_run() calls had its 'repeat' argument set to
> >   1000000
> > 
> > Numbers without this set:
> > 
> >  Performance counter stats for './test_progs -t tailcalls' (1024 runs):
> > 
> >             198.73 msec task-clock                #    0.997 CPUs utilized            ( +-  0.13% )
> >                  6      context-switches          #    0.030 K/sec                    ( +-  0.75% )
> >                  0      cpu-migrations            #    0.000 K/sec                    ( +- 22.15% )
> >                108      page-faults               #    0.546 K/sec                    ( +-  0.03% )
> >        693,910,413      cycles                    #    3.492 GHz                      ( +-  0.11% )  (30.26%)
> >      1,067,635,122      instructions              #    1.54  insn per cycle           ( +-  0.03% )  (38.16%)
> >        165,308,809      branches                  #  831.822 M/sec                    ( +-  0.02% )  (38.46%)
> >          9,940,504      branch-misses             #    6.01% of all branches          ( +-  0.02% )  (38.77%)
> >        226,741,985      L1-dcache-loads           # 1140.949 M/sec                    ( +-  0.02% )  (39.07%)
> >            161,936      L1-dcache-load-misses     #    0.07% of all L1-dcache hits    ( +-  0.66% )  (39.12%)
> >             43,777      LLC-loads                 #    0.220 M/sec                    ( +-  0.97% )  (31.07%)
> >             11,773      LLC-load-misses           #   26.89% of all LL-cache hits     ( +-  0.99% )  (30.93%)
> >    <not supported>      L1-icache-loads
> >             97,692      L1-icache-load-misses                                         ( +-  0.51% )  (30.77%)
> >        229,069,211      dTLB-loads                # 1152.659 M/sec                    ( +-  0.02% )  (30.62%)
> >              1,031      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.28% )  (30.46%)
> >              2,236      iTLB-loads                #    0.011 M/sec                    ( +-  1.28% )  (30.30%)
> >                357      iTLB-load-misses          #   15.99% of all iTLB cache hits   ( +-  2.10% )  (30.16%)
> >    <not supported>      L1-dcache-prefetches
> >    <not supported>      L1-dcache-prefetch-misses
> > 
> >           0.199307 +- 0.000250 seconds time elapsed  ( +-  0.13% )
> > 
> > With:
> > 
> >  Performance counter stats for './test_progs -t tailcalls' (1024 runs):
> > 
> >             202.48 msec task-clock                #    0.997 CPUs utilized            ( +-  0.09% )
> 
> I think this extra overhead is totally acceptable for such an important feature.
> 
> >                  6      context-switches          #    0.032 K/sec                    ( +-  1.86% )
> >                  0      cpu-migrations            #    0.000 K/sec                    ( +- 30.00% )
> >                108      page-faults               #    0.535 K/sec                    ( +-  0.03% )
> >        718,001,313      cycles                    #    3.546 GHz                      ( +-  0.06% )  (30.12%)
> >      1,041,618,306      instructions              #    1.45  insn per cycle           ( +-  0.03% )  (37.96%)
> >        226,386,119      branches                  # 1118.091 M/sec                    ( +-  0.03% )  (38.35%)
> >          9,882,436      branch-misses             #    4.37% of all branches          ( +-  0.02% )  (38.59%)
> >        196,832,137      L1-dcache-loads           #  972.128 M/sec                    ( +-  0.02% )  (39.15%)
> >            217,794      L1-dcache-load-misses     #    0.11% of all L1-dcache hits    ( +-  0.67% )  (39.23%)
> >             70,690      LLC-loads                 #    0.349 M/sec                    ( +-  0.90% )  (31.15%)
> >             18,802      LLC-load-misses           #   26.60% of all LL-cache hits     ( +-  0.84% )  (31.18%)
> >    <not supported>      L1-icache-loads
> >            106,461      L1-icache-load-misses                                         ( +-  0.51% )  (30.83%)
> >        198,887,011      dTLB-loads                #  982.277 M/sec                    ( +-  0.02% )  (30.66%)
> >              1,483      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  1.28% )  (30.50%)
> >              4,064      iTLB-loads                #    0.020 M/sec                    ( +- 21.43% )  (30.23%)
> >                488      iTLB-load-misses          #   12.00% of all iTLB cache hits   ( +-  1.95% )  (30.03%)
> >    <not supported>      L1-dcache-prefetches
> >    <not supported>      L1-dcache-prefetch-misses
> > 
> >           0.203081 +- 0.000187 seconds time elapsed  ( +-  0.09% )
> > 
> > 
> > The second measurement was done on the BPF kselftest flow_dissector, which
> > uses progs/bpf_flow.c, with the 'repeat' argument of
> > bpf_prog_test_run_xattr also set to 1000000.
> > 
> > Without:
> > 
> >  Performance counter stats for './test_progs -t flow_dissector' (1024 runs):
> > 
> >           1,340.52 msec task-clock                #    0.987 CPUs utilized            ( +-  0.05% )
> >                 25      context-switches          #    0.018 K/sec                    ( +-  0.32% )
> >                  0      cpu-migrations            #    0.000 K/sec                    ( +-  8.59% )
> >                122      page-faults               #    0.091 K/sec                    ( +-  0.03% )
> >      4,764,381,512      cycles                    #    3.554 GHz                      ( +-  0.04% )  (30.68%)
> >      7,674,803,496      instructions              #    1.61  insn per cycle           ( +-  0.01% )  (38.41%)
> >      1,118,346,714      branches                  #  834.261 M/sec                    ( +-  0.00% )  (38.46%)
> >         29,132,651      branch-misses             #    2.60% of all branches          ( +-  0.00% )  (38.50%)
> >      1,737,552,687      L1-dcache-loads           # 1296.174 M/sec                    ( +-  0.01% )  (38.55%)
> >          1,064,105      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    ( +-  1.28% )  (38.57%)
> >             50,356      LLC-loads                 #    0.038 M/sec                    ( +-  1.42% )  (30.82%)
> >             10,825      LLC-load-misses           #   21.50% of all LL-cache hits     ( +-  1.42% )  (30.79%)
> >    <not supported>      L1-icache-loads
> >            568,800      L1-icache-load-misses                                         ( +-  0.66% )  (30.77%)
> >      1,741,511,307      dTLB-loads                # 1299.127 M/sec                    ( +-  0.01% )  (30.75%)
> >              5,112      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.29% )  (30.73%)
> >              2,128      iTLB-loads                #    0.002 M/sec                    ( +-  2.06% )  (30.70%)
> >                571      iTLB-load-misses          #   26.85% of all iTLB cache hits   ( +-  3.10% )  (30.68%)
> >    <not supported>      L1-dcache-prefetches
> >    <not supported>      L1-dcache-prefetch-misses
> > 
> >           1.358653 +- 0.000741 seconds time elapsed  ( +-  0.05% )
> > 
> > 
> > With:
> > 
> >  Performance counter stats for './test_progs -t flow_dissector' (1024 runs):
> > 
> >           1,426.95 msec task-clock                #    0.989 CPUs utilized            ( +-  0.04% )
> 
> are you saying the patches add ~6% overhead?
> 
> >                 23      context-switches          #    0.016 K/sec                    ( +-  0.40% )
> >                  0      cpu-migrations            #    0.000 K/sec                    ( +-  6.38% )
> >                122      page-faults               #    0.085 K/sec                    ( +-  0.03% )
> >      4,772,798,523      cycles                    #    3.345 GHz                      ( +-  0.03% )  (30.70%)
> >      7,837,101,633      instructions              #    1.64  insn per cycle           ( +-  0.00% )  (38.42%)
> 
> but the overhead cannot be due to extra instructions.
> 
> >      1,118,716,987      branches                  #  783.992 M/sec                    ( +-  0.00% )  (38.46%)
> >         29,147,367      branch-misses             #    2.61% of all branches          ( +-  0.00% )  (38.51%)
> >      1,797,232,091      L1-dcache-loads           # 1259.492 M/sec                    ( +-  0.00% )  (38.55%)
> >          1,487,769      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  0.66% )  (38.55%)
> >             50,180      LLC-loads                 #    0.035 M/sec                    ( +-  1.37% )  (30.81%)
> >             14,709      LLC-load-misses           #   29.31% of all LL-cache hits     ( +-  1.11% )  (30.79%)
> >    <not supported>      L1-icache-loads
> >            626,633      L1-icache-load-misses                                         ( +-  0.58% )  (30.77%)
> >      1,800,278,668      dTLB-loads                # 1261.627 M/sec                    ( +-  0.01% )  (30.75%)
> >              3,809      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.71% )  (30.72%)
> >              1,745      iTLB-loads                #    0.001 M/sec                    ( +-  3.90% )  (30.70%)
> >             12,267      iTLB-load-misses          #  703.02% of all iTLB cache hits   ( +- 96.08% )  (30.68%)
> 
> Looks like that's where the perf is suffering. The number of iTLB misses jumps.
> It could be due to ip_aux. Just a guess.

That's fishy to me. I can only hit this case of huge iTLB-load-misses about
once in every few dozen perf runs. The standard numbers are more or less as
follows:

 Performance counter stats for './test_progs -t flow_dissector' (1024 runs):

          1,321.55 msec task-clock                #    0.989 CPUs utilized            ( +-  0.07% )
                33      context-switches          #    0.025 K/sec                    ( +-  0.34% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-  9.46% )
               127      page-faults               #    0.096 K/sec                    ( +-  0.03% )
     4,589,283,427      cycles                    #    3.473 GHz                      ( +-  0.03% )  (30.69%)
     6,900,115,113      instructions              #    1.50  insn per cycle           ( +-  0.01% )  (38.42%)
     1,129,942,194      branches                  #  855.011 M/sec                    ( +-  0.01% )  (38.47%)
        29,146,505      branch-misses             #    2.58% of all branches          ( +-  0.01% )  (38.52%)
     1,795,475,517      L1-dcache-loads           # 1358.610 M/sec                    ( +-  0.01% )  (38.57%)
           656,831      L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +-  0.65% )  (38.57%)
            67,896      LLC-loads                 #    0.051 M/sec                    ( +-  0.91% )  (30.82%)
            14,321      LLC-load-misses           #   21.09% of all LL-cache hits     ( +-  1.32% )  (30.79%)
   <not supported>      L1-icache-loads
           683,386      L1-icache-load-misses                                         ( +-  0.73% )  (30.77%)
     1,801,883,740      dTLB-loads                # 1363.459 M/sec                    ( +-  0.01% )  (30.74%)
             6,362      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  2.07% )  (30.72%)
             2,901      iTLB-loads                #    0.002 M/sec                    ( +-  2.96% )  (30.69%)
             1,195      iTLB-load-misses          #   41.18% of all iTLB cache hits   ( +-  2.14% )  (30.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

          1.335695 +- 0.000977 seconds time elapsed  ( +-  0.07% )

So these numbers seem acceptable to me.

> 
> Could you try the unwind+nop5+push approach and compare before/after,
> but this time please share an annotated 'perf report'.

perf record makes the iTLB-load-misses drop down to a silly 2% of all iTLB
cache hits, so unfortunately I can't provide the report :<

> If iTLB is still struggling, 'perf report' should be able to pinpoint
> the instruction that is causing it.
> Maybe the jmp target needs to be 16-byte aligned or something.
> Or we simply got unlucky by pushing into a different cache line.

Probably we should align the jump target that is reached via poke->ip_aux,
which means placing the nops right after poke->ip. I would like to give it
a shot and see if the huge-but-sporadic iTLB-load-misses still occur.

I started to work on that, but I'm not there yet. I suppose it's better to
share the struggle rather than being silent.

With the dirty patch made on top of this series, pasted below:

@@ -486,12 +525,14 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 				      u8 **pprog, int addr, u8 *image,
 				      bool *callee_regs_used, u32 stack_depth)
 {
+	u8 *aligned_target;
 	u8 *prog = *pprog;
 	int pop_bytes = 0;
 	int off1 = 27;
 	int poke_off;
 	int cnt = 0;
 
+
 	/* count the additional bytes used for popping callee regs to stack
 	 * that need to be taken into account for offset that is used for
 	 * bailing out of the tail call limit is reached and the poke->ip
@@ -524,7 +565,9 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 	poke->adj_off = X86_TAIL_CALL_OFFSET;
 	poke->ip = image + (addr - X86_PATCH_SIZE);
 
-	emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
+	aligned_target = PTR_ALIGN((u8 *)poke->ip + X86_PATCH_SIZE, 16);
+	noplen = aligned_target - ((u8 *)poke->ip + X86_PATCH_SIZE);
+	emit_jump(&prog, aligned_target, poke->ip_aux);
 
 	*pprog = prog;
 	pop_callee_regs(pprog, callee_regs_used);
@@ -535,8 +578,9 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 	memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
 	prog += X86_PATCH_SIZE;
 
-	/* out: */
+	emit_nops(&prog, noplen);
 
+	/* out: */
 	*pprog = prog;
 }

I'm hitting the following check in do_jit():

	if (unlikely(proglen + ilen > oldproglen)) {
		pr_err("bpf_jit: fatal error\n");
		return -EFAULT;
	}

Presumably this is because the JIT image shrinks across passes and the
number of nops needed can differ from one pass to the next? I need to do
more digging.
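
For reference, a simplified paraphrase of the pass loop in
bpf_int_jit_compile() (from memory, error handling and details omitted)
together with my current theory of why an address-dependent nop count
trips the check above:

	for (pass = 0; pass < 20 || image; pass++) {
		proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
		if (image)
			break;		/* final pass emitted into the image */
		if (proglen == oldproglen)
			/* sizes converged -> allocate image, run one more pass */
			header = bpf_jit_binary_alloc(proglen, &image, 1,
						      jit_fill_hole);
		oldproglen = proglen;
	}

	/* With the alignment hack, noplen depends on the absolute value of
	 * poke->ip, i.e. on the image address.  During the sizing passes
	 * image is NULL, so a different noplen is computed than in the
	 * final pass that writes into the real (randomized) image; the
	 * emitted code can then grow past oldproglen and do_jit() hits the
	 * fatal error above.
	 */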

Here's the instruction layout we're striving for, if this is the correct
approach:

ffffffffc0468777:  e9 14 00 00 00          jmpq   0xffffffffc0468790 // poke->ip_aux
ffffffffc046877c:  5b                      pop    %rbx
ffffffffc046877d:  58                      pop    %rax
ffffffffc046877e:  48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0468785:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   // poke->ip
ffffffffc046878a:  66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)   // aligning nop
ffffffffc0468790:  48 89 df                mov    %rbx,%rdi          // aligned target
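
Plugging the addresses from the dump above into the dirty patch's
arithmetic, as a sanity check of the layout (after_patch is just a local
name for illustration; aligned_target/noplen are the variables from the
diff):

	u8 *ip             = (u8 *)0xffffffffc0468785UL;    /* poke->ip      */
	u8 *after_patch    = ip + X86_PATCH_SIZE;            /* 0x...c046878a */
	u8 *aligned_target = PTR_ALIGN(after_patch, 16);     /* 0x...c0468790 */
	int noplen         = aligned_target - after_patch;   /* 6, i.e. the
							      * 6-byte nopw   */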

> 
> Also when you do this microbenchmarking please make sure
> that bpf_jit_binary_alloc() does not use get_random_int().
> It can cause nasty surprises and run-to-run variations.

Can you explain under what circumstances bpf_jit_binary_alloc() would not
use get_random_int() ? Out of curiosity as from a quick look I can't tell
when.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-11  3:25       ` Alexei Starovoitov
@ 2020-07-14  1:00         ` Maciej Fijalkowski
  2020-07-14  3:36           ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-14  1:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Fri, Jul 10, 2020 at 08:25:20PM -0700, Alexei Starovoitov wrote:
> On Fri, Jul 10, 2020 at 8:20 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Of course you are right.
> > pop+nop+push is incorrect.
> >
> > How about the following instead:
> > - during JIT:
> > emit_jump(to_skip_below)  <- poke->tailcall_bypass

That's the jump to the instruction right after the poke->tailcall_target.

> > pop_callee_regs
> > emit_jump(to_tailcall_target) <- poke->tailcall_target

During JIT there's no tailcall_target so this will be nop5, right?

> >
> > - Transition from one target to another:
> > text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > if (new_jmp != NULL)
> >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > else
> >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> 
> One more correction. I meant:
> 
> if (new_jmp != NULL) {
>   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)

Problem with having the old_jmp here is that you could have the
tailcall_target removed followed by the new program being inserted. So for
that case old_jmp is NULL but we decided to not poke the
poke->tailcall_target when removing the program, only the tailcall_bypass
is poked back to jmp from nop. IOW old_jmp is not equal to what
poke->tailcall_target currently stores. This means that
bpf_arch_text_poke() would not be successful for this update and that is
the reason of faking it in this patch.
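
To spell out the problematic sequence (a sketch using the current ip/ip_aux
names; map_fd/key_n/prog_b_fd are made-up names for illustration):

	/* Slot N holds prog A: poke->ip encodes 'jmp A', poke->ip_aux is nop. */

	bpf_map_delete_elem(map_fd, &key_n);
		/* only poke->ip_aux is poked from nop back to jmp;
		 * poke->ip is left as 'jmp A'
		 */

	bpf_map_update_elem(map_fd, &key_n, &prog_b_fd, 0);
		/* slot N was empty, so map_poke_run() passes old_jmp == NULL,
		 * but the bytes at poke->ip are still 'jmp A', not nop5:
		 * bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP, NULL, new_jmp)
		 * fails its old-instruction check
		 */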

>   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> } else {
>   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> }

I think that's what we currently (mostly) have. map_poke_run() is skipping
the poke of poke->tailcall_target if new bpf_prog is NULL, just like
you're proposing above. Of course I can rename the members in poke
descriptor to names you're suggesting. I also assume that by text_poke you
meant the bpf_arch_text_poke?

I've been able to hide the nop5 detection within the bpf_arch_text_poke so
map_poke_run() is arch-independent in that approach. My feeling is that
we don't need the old bpf_prog at all.

Some bits might change here due to the jump target alignment that I'm
trying to introduce.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-14  1:00         ` Maciej Fijalkowski
@ 2020-07-14  3:36           ` Alexei Starovoitov
  2020-07-14 20:50             ` Maciej Fijalkowski
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2020-07-14  3:36 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Tue, Jul 14, 2020 at 03:00:45AM +0200, Maciej Fijalkowski wrote:
> On Fri, Jul 10, 2020 at 08:25:20PM -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 10, 2020 at 8:20 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > Of course you are right.
> > > pop+nop+push is incorrect.
> > >
> > > How about the following instead:
> > > - during JIT:
> > > emit_jump(to_skip_below)  <- poke->tailcall_bypass
> 
> That's the jump to the instruction right after the poke->tailcall_target.

right. Mainly looking for better names than ip and ip_aux.

> > > pop_callee_regs
> > > emit_jump(to_tailcall_target) <- poke->tailcall_target
> 
> During JIT there's no tailcall_target so this will be nop5, right?

I thought it will be always jmp, but with new info I agree that
it will start with nop.

> 
> > >
> > > - Transition from one target to another:
> > > text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > > if (new_jmp != NULL)
> > >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > > else
> > >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > 
> > One more correction. I meant:
> > 
> > if (new_jmp != NULL) {
> >   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> 
> Problem with having the old_jmp here is that you could have the
> tailcall_target removed followed by the new program being inserted. So for
> that case old_jmp is NULL but we decided to not poke the
> poke->tailcall_target when removing the program, only the tailcall_bypass
> is poked back to jmp from nop. IOW old_jmp is not equal to what
> poke->tailcall_target currently stores. This means that
> bpf_arch_text_poke() would not be successful for this update and that is
> the reason of faking it in this patch.

got it.
I think it can be solved two ways:
1. add synchronize_rcu() after poking of tailcall_bypass into jmp
and then update tailcall_target into nop.
so the race you've described in cover letter won't happen.
In the future with sleepable progs we'd need to call sync_rcu_tasks_trace too.
Which will make poke_run even slower.

2. add a flag to bpf_arch_text_poke() to ignore 5 bytes in there
and update tailcall_target to new jmp.
The speed of poke_run will be faster,
but considering the speed of text_poke_bp() it's starting to feel like
premature optimization.

I think approach 1 is cleaner.
Then the pseudo code will be:
if (new_jmp != NULL) {
   text_poke(poke->tailcall_target, MOD_JMP, old ? old_jmp : NULL, new_jmp);
   if (!old)
     text_poke(poke->tailcall_bypass, MOD_JMP, bypass_addr, NULL /* into nop */);
} else {
   text_poke(poke->tailcall_bypass, MOD_JMP, NULL /* from nop */, bypass_addr);
   sync_rcu(); /* let progs finish */
   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, NULL /* into nop */)
}

> 
> >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > } else {
> >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > }
> 
> I think that's what we currently (mostly) have. map_poke_run() is skipping
> the poke of poke->tailcall_target if new bpf_prog is NULL, just like
> you're proposing above. Of course I can rename the members in poke
> descriptor to names you're suggesting. I also assume that by text_poke you
> meant the bpf_arch_text_poke?

yep.

> 
> I've been able to hide the nop5 detection within the bpf_arch_text_poke so
> map_poke_run() is arch-independent in that approach. My feeling is that
> we don't need the old bpf_prog at all.
> 
> Some bits might change here due to the jump target alignment that I'm
> trying to introduce.

> Can you explain under what circumstances bpf_jit_binary_alloc() would not
> use get_random_int() ? Out of curiosity as from a quick look I can't tell
> when.

I meant when you're doing benchmarking get rid of that randomization
from bpf_jit_binary_alloc in your test kernel.
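
i.e. a local, benchmark-only hack along these lines in
bpf_jit_binary_alloc() in kernel/bpf/core.c (paraphrased from memory, not
meant for upstream):

-	start = (get_random_int() % hole) & ~(alignment - 1);
+	/* pin the image offset for stable run-to-run measurements */
+	start = 0;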

> I'm hitting the following check in do_jit():

I think aligning bypass_addr is a bit too much. Let it all be linear for now.
Since iTLB is sporadic it could be due to randomization and nothing to do
with additional jmp and unwind that this set is introducing.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-14  3:36           ` Alexei Starovoitov
@ 2020-07-14 20:50             ` Maciej Fijalkowski
  2020-07-14 22:34               ` Alexei Starovoitov
  0 siblings, 1 reply; 15+ messages in thread
From: Maciej Fijalkowski @ 2020-07-14 20:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Mon, Jul 13, 2020 at 08:36:30PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 14, 2020 at 03:00:45AM +0200, Maciej Fijalkowski wrote:
> > On Fri, Jul 10, 2020 at 08:25:20PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Jul 10, 2020 at 8:20 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > Of course you are right.
> > > > pop+nop+push is incorrect.
> > > >
> > > > How about the following instead:
> > > > - during JIT:
> > > > emit_jump(to_skip_below)  <- poke->tailcall_bypass
> > 
> > That's the jump to the instruction right after the poke->tailcall_target.
> 
> right. Mainly looking for better names than ip and ip_aux.
> 
> > > > pop_callee_regs
> > > > emit_jump(to_tailcall_target) <- poke->tailcall_target
> > 
> > During JIT there's no tailcall_target so this will be nop5, right?
> 
> I thought it will be always jmp, but with new info I agree that
> it will start with nop.
> 
> > 
> > > >
> > > > - Transition from one target to another:
> > > > text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > > > if (new_jmp != NULL)
> > > >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > > > else
> > > >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > > 
> > > One more correction. I meant:
> > > 
> > > if (new_jmp != NULL) {
> > >   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > 
> > Problem with having the old_jmp here is that you could have the
> > tailcall_target removed followed by the new program being inserted. So for
> > that case old_jmp is NULL but we decided to not poke the
> > poke->tailcall_target when removing the program, only the tailcall_bypass
> > is poked back to jmp from nop. IOW old_jmp is not equal to what
> > poke->tailcall_target currently stores. This means that
> > bpf_arch_text_poke() would not be successful for this update and that is
> > the reason of faking it in this patch.
> 
> got it.
> I think it can be solved two ways:
> 1. add synchronize_rcu() after poking of tailcall_bypass into jmp
> and then update tailcall_target into nop.
> so the race you've described in cover letter won't happen.
> In the future with sleepable progs we'd need to call sync_rcu_tasks_trace too.
> Which will make poke_run even slower.
> 
> 2. add a flag to bpf_arch_text_poke() to ignore 5 bytes in there
> and update tailcall_target to new jmp.
> The speed of poke_run will be faster,
> but considering the speed of text_poke_bp() it's starting to feel like
> premature optimization.
> 
> I think approach 1 is cleaner.
> Then the pseudo code will be:
> if (new_jmp != NULL) {
>    text_poke(poke->tailcall_target, MOD_JMP, old ? old_jmp : NULL, new_jmp);
>    if (!old)
>      text_poke(poke->tailcall_bypass, MOD_JMP, bypass_addr, NULL /* into nop */);
> } else {
>    text_poke(poke->tailcall_bypass, MOD_JMP, NULL /* from nop */, bypass_addr);
>    sync_rcu(); /* let progs finish */
>    text_poke(poke->tailcall_target, MOD_JMP, old_jmp, NULL /* into nop */)
> }

Seems like this does the job :) clever stuff with sync_rcu.
I tried this approach and one last thing that needs to be covered
separately is the case of nop->nop update. We should simply avoid poking
in this case. With this in place everything is functional.

I will update the patch and descriptions and send the non-RFC revision, if
you don't mind of course.
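
The guard itself is tiny, something along these lines in map_poke_run()
before calling bpf_arch_text_poke() (sketch only; old/new are the old and
new programs for the given slot):

		/* nop -> nop transition: nothing to patch for this slot */
		if (!old && !new)
			continue;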

> 
> > 
> > >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > > } else {
> > >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > > }
> > 
> > I think that's what we currently (mostly) have. map_poke_run() is skipping
> > the poke of poke->tailcall_target if new bpf_prog is NULL, just like
> > you're proposing above. Of course I can rename the members in poke
> > descriptor to names you're suggesting. I also assume that by text_poke you
> > meant the bpf_arch_text_poke?
> 
> yep.
> 
> > 
> > I've been able to hide the nop5 detection within the bpf_arch_text_poke so
> > map_poke_run() is arch-independent in that approach. My feeling is that
> > we don't need the old bpf_prog at all.
> > 
> > Some bits might change here due to the jump target alignment that I'm
> > trying to introduce.
> 
> > Can you explain under what circumstances bpf_jit_binary_alloc() would not
> > use get_random_int() ? Out of curiosity as from a quick look I can't tell
> > when.
> 
> I meant when you're doing benchmarking get rid of that randomization
> from bpf_jit_binary_alloc in your test kernel.
> 
> > I'm hitting the following check in do_jit():
> 
> I think aligning bypass_addr is a bit too much. Let it all be linear for now.
> Since iTLB is sporadic it could be due to randomization and nothing to do
> with additional jmp and unwind that this set is introducing.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT
  2020-07-14 20:50             ` Maciej Fijalkowski
@ 2020-07-14 22:34               ` Alexei Starovoitov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2020-07-14 22:34 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
	Björn Töpel, Karlsson, Magnus

On Tue, Jul 14, 2020 at 1:55 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Mon, Jul 13, 2020 at 08:36:30PM -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 14, 2020 at 03:00:45AM +0200, Maciej Fijalkowski wrote:
> > > On Fri, Jul 10, 2020 at 08:25:20PM -0700, Alexei Starovoitov wrote:
> > > > On Fri, Jul 10, 2020 at 8:20 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > Of course you are right.
> > > > > pop+nop+push is incorrect.
> > > > >
> > > > > How about the following instead:
> > > > > - during JIT:
> > > > > emit_jump(to_skip_below)  <- poke->tailcall_bypass
> > >
> > > That's the jump to the instruction right after the poke->tailcall_target.
> >
> > right. Mainly looking for better names than ip and ip_aux.
> >
> > > > > pop_callee_regs
> > > > > emit_jump(to_tailcall_target) <- poke->tailcall_target
> > >
> > > During JIT there's no tailcall_target so this will be nop5, right?
> >
> > I thought it will be always jmp, but with new info I agree that
> > it will start with nop.
> >
> > >
> > > > >
> > > > > - Transition from one target to another:
> > > > > text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > > > > if (new_jmp != NULL)
> > > > >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > > > > else
> > > > >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > > >
> > > > One more correction. I meant:
> > > >
> > > > if (new_jmp != NULL) {
> > > >   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > >
> > > Problem with having the old_jmp here is that you could have the
> > > tailcall_target removed followed by the new program being inserted. So for
> > > that case old_jmp is NULL but we decided to not poke the
> > > poke->tailcall_target when removing the program, only the tailcall_bypass
> > > is poked back to jmp from nop. IOW old_jmp is not equal to what
> > > poke->tailcall_target currently stores. This means that
> > > bpf_arch_text_poke() would not be successful for this update and that is
> > > the reason of faking it in this patch.
> >
> > got it.
> > I think it can be solved two ways:
> > 1. add synchronize_rcu() after poking of tailcall_bypass into jmp
> > and then update tailcall_target into nop.
> > so the race you've described in cover letter won't happen.
> > In the future with sleepable progs we'd need to call sync_rcu_tasks_trace too.
> > Which will make poke_run even slower.
> >
> > 2. add a flag to bpf_arch_text_poke() to ignore 5 bytes in there
> > and update tailcall_target to new jmp.
> > The speed of poke_run will be faster,
> > but considering the speed of text_poke_bp() it's starting to feel like
> > premature optimization.
> >
> > I think approach 1 is cleaner.
> > Then the pseudo code will be:
> > if (new_jmp != NULL) {
> >    text_poke(poke->tailcall_target, MOD_JMP, old ? old_jmp : NULL, new_jmp);
> >    if (!old)
> >      text_poke(poke->tailcall_bypass, MOD_JMP, bypass_addr, NULL /* into nop */);
> > } else {
> >    text_poke(poke->tailcall_bypass, MOD_JMP, NULL /* from nop */, bypass_addr);
> >    sync_rcu(); /* let progs finish */
> >    text_poke(poke->tailcall_target, MOD_JMP, old_jmp, NULL /* into nop */)
> > }
>
> Seems like this does the job :) clever stuff with sync_rcu.
> I tried this approach and one last thing that needs to be covered
> separately is the case of nop->nop update. We should simply avoid poking
> in this case. With this in place everything is functional.
>
> I will update the patch and descriptions and send the non-RFC revision, if
> you don't mind of course.

Yes. Please. Cannot wait actually :)

Please think through Daniel's comment in prog_array_map_poke_run().
Especially points 3 and 4. The new logic will be hitting the same cases,
but in a more elaborate way.
That comment also makes clear why memcmp(poke->ip, nop5...);
was not the correct approach... the poke->ip address can be gone at that time.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-07-14 22:34 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02 13:49 [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 1/5] bpf, x64: use %rcx instead of %rax for tail call retpolines Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 2/5] bpf: allow for tailcalls in BPF subprograms Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 3/5] bpf: propagate poke descriptors to subprograms Maciej Fijalkowski
2020-07-02 13:49 ` [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT Maciej Fijalkowski
2020-07-10 23:56   ` Alexei Starovoitov
2020-07-11  3:20     ` Alexei Starovoitov
2020-07-11  3:25       ` Alexei Starovoitov
2020-07-14  1:00         ` Maciej Fijalkowski
2020-07-14  3:36           ` Alexei Starovoitov
2020-07-14 20:50             ` Maciej Fijalkowski
2020-07-14 22:34               ` Alexei Starovoitov
2020-07-02 13:49 ` [RFC PATCH bpf-next 5/5] selftests: bpf: add dummy prog for bpf2bpf with tailcall Maciej Fijalkowski
2020-07-11  0:10 ` [RFC PATCH bpf-next 0/5] bpf: tailcalls in BPF subprograms Alexei Starovoitov
2020-07-14  0:22   ` Maciej Fijalkowski
