* [PATCH bpf-next v6 0/4] bpf trampoline for arm64
From: Xu Kuohai @ 2022-06-25 16:12 UTC (permalink / raw)
  To: bpf, linux-arm-kernel, linux-kernel, netdev
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

This patchset introduces the bpf trampoline on arm64. A bpf trampoline
converts the native calling convention to the bpf calling convention and is
used to implement various bpf features, such as fentry, fexit, fmod_ret and
struct_ops.

The trampoline introduced here does essentially the same thing as the bpf
trampoline on x86.

Tested on raspberry pi 4b and qemu:

 #18 /1     bpf_tcp_ca/dctcp:OK
 #18 /2     bpf_tcp_ca/cubic:OK
 #18 /3     bpf_tcp_ca/invalid_license:OK
 #18 /4     bpf_tcp_ca/dctcp_fallback:OK
 #18 /5     bpf_tcp_ca/rel_setsockopt:OK
 #18        bpf_tcp_ca:OK
 #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
 #51 /2     dummy_st_ops/dummy_init_ret_value:OK
 #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
 #51 /4     dummy_st_ops/dummy_multiple_args:OK
 #51        dummy_st_ops:OK
 #57 /1     fexit_bpf2bpf/target_no_callees:OK
 #57 /2     fexit_bpf2bpf/target_yes_callees:OK
 #57 /3     fexit_bpf2bpf/func_replace:OK
 #57 /4     fexit_bpf2bpf/func_replace_verify:OK
 #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
 #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
 #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
 #57 /8     fexit_bpf2bpf/func_replace_multi:OK
 #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
 #57        fexit_bpf2bpf:OK
 #237       xdp_bpf2bpf:OK

v6:
- Since Mark is refactoring arm64 ftrace to support long jumps and reduce the
  ftrace trampoline overhead, it's not clear how we'll attach the bpf trampoline
  to regular kernel functions, so remove the ftrace-related patches for now.
- Add long jump support for attaching a bpf trampoline to a bpf prog. Since bpf
  trampolines and bpf progs are allocated via vmalloc, there is a chance the
  distance between them exceeds the maximum branch range.
- Collect Acked-by and Reviewed-by tags. Not sure whether the tags for
  bpf_arch_text_poke() should be kept, since the changes to it are not trivial
- Update some commit messages and comments

v5: https://lore.kernel.org/bpf/20220518131638.3401509-1-xukuohai@huawei.com/
- As Alexei suggested, remove is_valid_bpf_tramp_flags()

v4: https://lore.kernel.org/bpf/20220517071838.3366093-1-xukuohai@huawei.com/
- Run the test cases on raspberry pi 4b
- Rebase and add cookie to trampoline
- As Steve suggested, move trace_direct_tramp() back to entry-ftrace.S to
  avoid messing up generic code with architecture-specific code
- As Jakub suggested, merge patch 4 and patch 5 of v3 to provide the full
  function in one patch
- As Mark suggested, add a comment for the use of aarch64_insn_patch_text_nosync()
- Do not generate a trampoline for long jumps to avoid triggering ftrace_bug
- Round the stack size to a multiple of 16 bytes to avoid SP alignment faults
- Use the callee-saved register x20 to reduce the use of mov_i64
- Add missing BTI J instructions
- Trivial spelling and code style fixes

v3: https://lore.kernel.org/bpf/20220424154028.1698685-1-xukuohai@huawei.com/
- Append test results for bpf_tcp_ca, dummy_st_ops, fexit_bpf2bpf,
  xdp_bpf2bpf
- Add support for poking bpf progs
- Fix the return value of arch_prepare_bpf_trampoline() to be the total
  number of bytes instead of the number of instructions
- Do not check whether CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled in
  arch_prepare_bpf_trampoline, since the trampoline may be hooked to a bpf
  prog
- Restrict bpf_arch_text_poke() to poke bpf text only, as kernel functions
  are poked by ftrace
- Rewrite trace_direct_tramp() in inline assembly in trace_selftest.c
  to avoid messing entry-ftrace.S
- Isolate arch_ftrace_set_direct_caller() with the
  CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS macro to avoid a compile
  error when this option is disabled
- Some trivial code style fixes

v2: https://lore.kernel.org/bpf/20220414162220.1985095-1-xukuohai@huawei.com/
- Add Song's ACK
- Change the multi-line comment in is_valid_bpf_tramp_flags() into net
  style (patch 3)
- Fix a deadloop issue in ftrace selftest (patch 2)
- Replace pt_regs->x0 with pt_regs->orig_x0 in patch 1 commit message 
- Replace "bpf trampoline" with "custom trampoline" in patch 1, as
  ftrace direct call is not only used by bpf trampoline.

v1: https://lore.kernel.org/bpf/20220413054959.1053668-1-xukuohai@huawei.com/

Xu Kuohai (4):
  bpf: Remove is_valid_bpf_tramp_flags()
  arm64: Add LDR (literal) instruction
  bpf, arm64: Implement bpf_arch_text_poke() for arm64
  bpf, arm64: bpf trampoline for arm64

 arch/arm64/include/asm/insn.h |   3 +
 arch/arm64/lib/insn.c         |  30 +-
 arch/arm64/net/bpf_jit.h      |   7 +
 arch/arm64/net/bpf_jit_comp.c | 717 +++++++++++++++++++++++++++++++++-
 arch/x86/net/bpf_jit_comp.c   |  20 -
 kernel/bpf/bpf_struct_ops.c   |   3 +
 kernel/bpf/trampoline.c       |   3 +
 7 files changed, 742 insertions(+), 41 deletions(-)

-- 
2.30.2



* [PATCH bpf-next v6 1/4] bpf: Remove is_valid_bpf_tramp_flags()
From: Xu Kuohai @ 2022-06-25 16:12 UTC (permalink / raw)
  To: bpf, linux-arm-kernel, linux-kernel, netdev
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

Before generating the bpf trampoline, x86 calls is_valid_bpf_tramp_flags()
to check the input flags. This check is architecture-independent. So, to be
consistent with x86, arm64 should also do this check before generating its
bpf trampoline.

However, the BPF_TRAMP_F_XXX flags are not used by user code and the
flags argument is almost constant at compile time, so this run-time
check is a bit redundant.

Remove is_valid_bpf_tramp_flags() and add some comments to the usage of
BPF_TRAMP_F_XXX flags, as suggested by Alexei.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/net/bpf_jit_comp.c | 20 --------------------
 kernel/bpf/bpf_struct_ops.c |  3 +++
 kernel/bpf/trampoline.c     |  3 +++
 3 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 2c51ca9f7cec..4f8938db03b1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1927,23 +1927,6 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
 	return 0;
 }
 
-static bool is_valid_bpf_tramp_flags(unsigned int flags)
-{
-	if ((flags & BPF_TRAMP_F_RESTORE_REGS) &&
-	    (flags & BPF_TRAMP_F_SKIP_FRAME))
-		return false;
-
-	/*
-	 * BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops,
-	 * and it must be used alone.
-	 */
-	if ((flags & BPF_TRAMP_F_RET_FENTRY_RET) &&
-	    (flags & ~BPF_TRAMP_F_RET_FENTRY_RET))
-		return false;
-
-	return true;
-}
-
 /* Example:
  * __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
  * its 'struct btf_func_model' will be nr_args=2
@@ -2022,9 +2005,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 	if (nr_args > 6)
 		return -ENOTSUPP;
 
-	if (!is_valid_bpf_tramp_flags(flags))
-		return -EINVAL;
-
 	/* Generated trampoline stack layout:
 	 *
 	 * RBP + 8         [ return address  ]
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 7e0068c3399c..84b2d9dba79a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -341,6 +341,9 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks,
 
 	tlinks[BPF_TRAMP_FENTRY].links[0] = link;
 	tlinks[BPF_TRAMP_FENTRY].nr_links = 1;
+	/* BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops,
+	 * and it must be used alone.
+	 */
 	flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0;
 	return arch_prepare_bpf_trampoline(NULL, image, image_end,
 					   model, flags, tlinks, NULL);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 93c7675f0c9e..bd3f2e673874 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -358,6 +358,9 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
 
 	if (tlinks[BPF_TRAMP_FEXIT].nr_links ||
 	    tlinks[BPF_TRAMP_MODIFY_RETURN].nr_links)
+		/* NOTE: BPF_TRAMP_F_RESTORE_REGS and BPF_TRAMP_F_SKIP_FRAME
+		 * should not be set together.
+		 */
 		flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
 
 	if (ip_arg)
-- 
2.30.2



* [PATCH bpf-next v6 2/4] arm64: Add LDR (literal) instruction
From: Xu Kuohai @ 2022-06-25 16:12 UTC (permalink / raw)
  To: bpf, linux-arm-kernel, linux-kernel, netdev
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

Add the LDR (literal) instruction to load data from an address relative to
the PC. This instruction will be used to implement a long jump from a bpf
prog to a bpf trampoline in the follow-up patch.

The instruction encoding:

    3       2   2     2                                     0        0
    0       7   6     4                                     5        0
+-----+-------+---+-----+-------------------------------------+--------+
| 0 x | 0 1 1 | 0 | 0 0 |                imm19                |   Rt   |
+-----+-------+---+-----+-------------------------------------+--------+

for 32-bit, variant x == 0; for 64-bit, x == 1.

branch_imm_common() is used to check the distance between the pc and the
target address. Since it's reused by this patch and LDR (literal) is not a
branch instruction, rename it to aarch64_imm_common().
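
As an illustration of how the new helper composes an encoding (a sketch,
not from the patch): emitting "ldr x10, #16", i.e. a 64-bit literal load
from 16 bytes past the instruction. This is also how the bpf JIT uses it
in the next patch, with pc passed as 0 and addr as the byte offset:

  /* Sketch only: the base opcode for the 64-bit variant is 0x58000000
   * (x == 1 above), imm19 sits in bits [23:5] and Rt in bits [4:0].
   * With pc = 0 and addr = 16, imm19 = 16 >> 2 = 4, so:
   * insn == 0x58000000 | (4 << 5) | 10 == 0x5800008a, i.e. "ldr x10, #16"
   */
  u32 insn = aarch64_insn_gen_load_literal(0, 16, AARCH64_INSN_REG_10, true);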

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 arch/arm64/include/asm/insn.h |  3 +++
 arch/arm64/lib/insn.c         | 30 ++++++++++++++++++++++++++----
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 6aa2dc836db1..834bff720582 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -510,6 +510,9 @@ u32 aarch64_insn_gen_load_store_imm(enum aarch64_insn_register reg,
 				    unsigned int imm,
 				    enum aarch64_insn_size_type size,
 				    enum aarch64_insn_ldst_type type);
+u32 aarch64_insn_gen_load_literal(unsigned long pc, unsigned long addr,
+				  enum aarch64_insn_register reg,
+				  bool is64bit);
 u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1,
 				     enum aarch64_insn_register reg2,
 				     enum aarch64_insn_register base,
diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c
index 695d7368fadc..12f7d03595af 100644
--- a/arch/arm64/lib/insn.c
+++ b/arch/arm64/lib/insn.c
@@ -323,7 +323,7 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type,
 	return insn;
 }
 
-static inline long branch_imm_common(unsigned long pc, unsigned long addr,
+static inline long aarch64_imm_common(unsigned long pc, unsigned long addr,
 				     long range)
 {
 	long offset;
@@ -354,7 +354,7 @@ u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr,
 	 * ARM64 virtual address arrangement guarantees all kernel and module
 	 * texts are within +/-128M.
 	 */
-	offset = branch_imm_common(pc, addr, SZ_128M);
+	offset = aarch64_imm_common(pc, addr, SZ_128M);
 	if (offset >= SZ_128M)
 		return AARCH64_BREAK_FAULT;
 
@@ -382,7 +382,7 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr,
 	u32 insn;
 	long offset;
 
-	offset = branch_imm_common(pc, addr, SZ_1M);
+	offset = aarch64_imm_common(pc, addr, SZ_1M);
 	if (offset >= SZ_1M)
 		return AARCH64_BREAK_FAULT;
 
@@ -421,7 +421,7 @@ u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr,
 	u32 insn;
 	long offset;
 
-	offset = branch_imm_common(pc, addr, SZ_1M);
+	offset = aarch64_imm_common(pc, addr, SZ_1M);
 
 	insn = aarch64_insn_get_bcond_value();
 
@@ -543,6 +543,28 @@ u32 aarch64_insn_gen_load_store_imm(enum aarch64_insn_register reg,
 	return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm);
 }
 
+u32 aarch64_insn_gen_load_literal(unsigned long pc, unsigned long addr,
+				  enum aarch64_insn_register reg,
+				  bool is64bit)
+{
+	u32 insn;
+	long offset;
+
+	offset = aarch64_imm_common(pc, addr, SZ_1M);
+	if (offset >= SZ_1M)
+		return AARCH64_BREAK_FAULT;
+
+	insn = aarch64_insn_get_ldr_lit_value();
+
+	if (is64bit)
+		insn |= BIT(30);
+
+	insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, reg);
+
+	return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_19, insn,
+					     offset >> 2);
+}
+
 u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1,
 				     enum aarch64_insn_register reg2,
 				     enum aarch64_insn_register base,
-- 
2.30.2



* [PATCH bpf-next v6 3/4] bpf, arm64: Implement bpf_arch_text_poke() for arm64
From: Xu Kuohai @ 2022-06-25 16:12 UTC (permalink / raw)
  To: bpf, linux-arm-kernel, linux-kernel, netdev
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

Implement bpf_arch_text_poke() for arm64, so that a bpf prog or a bpf
trampoline can be patched with it.

When the target address is NULL, the original instruction is patched to a
NOP.

When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.

To support attaching a bpf trampoline to both regular kernel functions and
bpf progs, we follow the ftrace patchsite approach for bpf progs. That is,
two instructions are inserted at the beginning of the bpf prog: the first
saves the return address to x9, and the second is a nop, which will be
patched to a bl instruction when a bpf trampoline is attached.

However, when a bpf trampoline is attached to a bpf prog, the distance
between the target address and the source address may exceed 128MB, the
maximum branch range, because the bpf trampoline and the bpf prog are
allocated separately with vmalloc. So long jumps must be handled.

When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:

        bpf_prog:
                mov x9, lr
                nop // patchsite
                ...
                ret

        plt:
                ldr x10, target
                br x10
        target:
                .quad dummy_tramp // plt target

This is also the state when no trampoline is attached.

When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:

        bpf_prog:
                mov x9, lr
                bl <short-jump bpf trampoline address> // patchsite
                ...
                ret

        plt:
                ldr x10, target
                br x10
        target:
                .quad dummy_tramp // plt target

When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:

        bpf_prog:
                mov x9, lr
                bl plt // patchsite
                ...
                ret

        plt:
                ldr x10, target
                br x10
        target:
                .quad <long-jump bpf trampoline address>

dummy_tramp is used to prevent another CPU from jumping to an unknown
location during the patching process, which makes the patching simpler.

The patching process is as follows:

1. when neither the old address nor the new address is a long jump, the
   patchsite is replaced with a bl to the new address, or a nop if the new
   address is NULL;

2. when the old address is not a long jump but the new one is, the
   branch target address is written to the plt first, then the patchsite
   is replaced with a bl instruction to the plt;

3. when the old address is a long jump but the new one is not, the address
   of dummy_tramp is written to the plt first, then the patchsite is replaced
   with a bl to the new address, or a nop if the new address is NULL;

4. when both the old address and the new address are long jumps, the
   new address is written to the plt and the patchsite is not changed.
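
The four cases above reduce to roughly the following decision logic (a
simplified sketch, not the actual bpf_arch_text_poke() implementation below;
gen_insn() and poke() are placeholders for the real instruction generation
and text patching helpers):

  if (is_long_jump(ip, new_addr))
          plt->target = (u64)new_addr;         /* cases 2 and 4 */
  else if (is_long_jump(ip, old_addr))
          plt->target = (u64)&dummy_tramp;     /* case 3: keep plt harmless */

  /* bl to the plt for long jumps, bl to the address itself otherwise,
   * or a nop when the address is NULL
   */
  old_insn = gen_insn(ip, old_addr, plt);
  new_insn = gen_insn(ip, new_addr, plt);

  if (old_insn != new_insn)                    /* case 4: patchsite unchanged */
          poke(ip, old_insn, new_insn);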

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
---
 arch/arm64/net/bpf_jit.h      |   7 +
 arch/arm64/net/bpf_jit_comp.c | 330 ++++++++++++++++++++++++++++++++--
 2 files changed, 323 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index 194c95ccc1cf..a6acb94ea3d6 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -80,6 +80,12 @@
 #define A64_STR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, STORE)
 #define A64_LDR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, LOAD)
 
+/* LDR (literal) */
+#define A64_LDR32LIT(Wt, offset) \
+	aarch64_insn_gen_load_literal(0, offset, Wt, false)
+#define A64_LDR64LIT(Xt, offset) \
+	aarch64_insn_gen_load_literal(0, offset, Xt, true)
+
 /* Load/store register pair */
 #define A64_LS_PAIR(Rt, Rt2, Rn, offset, ls, type) \
 	aarch64_insn_gen_load_store_pair(Rt, Rt2, Rn, offset, \
@@ -270,6 +276,7 @@
 #define A64_BTI_C  A64_HINT(AARCH64_INSN_HINT_BTIC)
 #define A64_BTI_J  A64_HINT(AARCH64_INSN_HINT_BTIJ)
 #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC)
+#define A64_NOP    A64_HINT(AARCH64_INSN_HINT_NOP)
 
 /* DMB */
 #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f08a4447d363..e0e9c705a2e4 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -9,6 +9,7 @@
 
 #include <linux/bitfield.h>
 #include <linux/bpf.h>
+#include <linux/memory.h>
 #include <linux/filter.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
@@ -18,6 +19,7 @@
 #include <asm/cacheflush.h>
 #include <asm/debug-monitors.h>
 #include <asm/insn.h>
+#include <asm/patching.h>
 #include <asm/set_memory.h>
 
 #include "bpf_jit.h"
@@ -78,6 +80,15 @@ struct jit_ctx {
 	int fpb_offset;
 };
 
+struct bpf_plt {
+	u32 insn_ldr; /* load target */
+	u32 insn_br;  /* branch to target */
+	u64 target;   /* target value */
+} __packed;
+
+#define PLT_TARGET_SIZE   sizeof_field(struct bpf_plt, target)
+#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
+
 static inline void emit(const u32 insn, struct jit_ctx *ctx)
 {
 	if (ctx->image != NULL)
@@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val,
 	}
 }
 
+static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
+{
+	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
+		emit(insn, ctx);
+}
+
 /*
  * Kernel addresses in the vmalloc space use at most 48 bits, and the
  * remaining bits are guaranteed to be 0x1. So we can compose the address
@@ -235,13 +252,30 @@ static bool is_lsi_offset(int offset, int scale)
 	return true;
 }
 
+/* generated prologue:
+ *      bti c // if CONFIG_ARM64_BTI_KERNEL
+ *      mov x9, lr
+ *      nop  // POKE_OFFSET
+ *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
+ *      stp x29, lr, [sp, #-16]!
+ *      mov x29, sp
+ *      stp x19, x20, [sp, #-16]!
+ *      stp x21, x22, [sp, #-16]!
+ *      stp x25, x26, [sp, #-16]!
+ *      stp x27, x28, [sp, #-16]!
+ *      mov x25, sp
+ *      mov tcc, #0
+ *      // PROLOGUE_OFFSET
+ */
+
+#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
+#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
+
+/* Offset of nop instruction in bpf prog entry to be poked */
+#define POKE_OFFSET (BTI_INSNS + 1)
+
 /* Tail call offset to jump into */
-#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
-	IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9
-#else
-#define PROLOGUE_OFFSET 8
-#endif
+#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
 
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
@@ -280,12 +314,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 	 *
 	 */
 
+	emit_bti(A64_BTI_C, ctx);
+
+	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
+	emit(A64_NOP, ctx);
+
 	/* Sign lr */
 	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
 		emit(A64_PACIASP, ctx);
-	/* BTI landing pad */
-	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-		emit(A64_BTI_C, ctx);
 
 	/* Save FP and LR registers to stay align with ARM64 AAPCS */
 	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
@@ -312,8 +348,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 		}
 
 		/* BTI landing pad for the tail call, done with a BR */
-		if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-			emit(A64_BTI_J, ctx);
+		emit_bti(A64_BTI_J, ctx);
 	}
 
 	emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
@@ -557,6 +592,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
 	return 0;
 }
 
+void dummy_tramp(void);
+
+asm (
+"	.pushsection .text, \"ax\", @progbits\n"
+"	.type dummy_tramp, %function\n"
+"dummy_tramp:"
+#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
+"	bti j\n" /* dummy_tramp is called via "br x10" */
+#endif
+"	mov x10, lr\n"
+"	mov lr, x9\n"
+"	ret x10\n"
+"	.size dummy_tramp, .-dummy_tramp\n"
+"	.popsection\n"
+);
+
+/* build a plt initialized like this:
+ *
+ * plt:
+ *      ldr tmp, target
+ *      br tmp
+ * target:
+ *      .quad dummy_tramp
+ *
+ * when a long jump trampoline is attached, target is filled with the
+ * trampoline address, and when the trampoline is removed, target is
+ * restored to dummy_tramp address.
+ */
+static void build_plt(struct jit_ctx *ctx, bool write_target)
+{
+	const u8 tmp = bpf2a64[TMP_REG_1];
+	struct bpf_plt *plt = NULL;
+
+	/* make sure target is 64-bit aligned */
+	if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
+		emit(A64_NOP, ctx);
+
+	plt = (struct bpf_plt *)(ctx->image + ctx->idx);
+	/* plt is called via bl, no BTI needed here */
+	emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
+	emit(A64_BR(tmp), ctx);
+
+	/* false write_target means target space is not allocated yet */
+	if (write_target)
+		plt->target = (u64)&dummy_tramp;
+}
+
 static void build_epilogue(struct jit_ctx *ctx)
 {
 	const u8 r0 = bpf2a64[BPF_REG_0];
@@ -1356,7 +1438,7 @@ struct arm64_jit_data {
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
-	int image_size, prog_size, extable_size;
+	int image_size, prog_size, extable_size, extable_align, extable_offset;
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
 	struct arm64_jit_data *jit_data;
@@ -1426,13 +1508,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
 	ctx.epilogue_offset = ctx.idx;
 	build_epilogue(&ctx);
+	build_plt(&ctx, false);
 
+	extable_align = __alignof__(struct exception_table_entry);
 	extable_size = prog->aux->num_exentries *
 		sizeof(struct exception_table_entry);
 
 	/* Now we know the actual image size. */
 	prog_size = sizeof(u32) * ctx.idx;
-	image_size = prog_size + extable_size;
+	/* also allocate space for plt target */
+	extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
+	image_size = extable_offset + extable_size;
 	header = bpf_jit_binary_alloc(image_size, &image_ptr,
 				      sizeof(u32), jit_fill_hole);
 	if (header == NULL) {
@@ -1444,7 +1530,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
 	ctx.image = (__le32 *)image_ptr;
 	if (extable_size)
-		prog->aux->extable = (void *)image_ptr + prog_size;
+		prog->aux->extable = (void *)image_ptr + extable_offset;
 skip_init_ctx:
 	ctx.idx = 0;
 	ctx.exentry_idx = 0;
@@ -1458,6 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	}
 
 	build_epilogue(&ctx);
+	build_plt(&ctx, true);
 
 	/* 3. Extra pass to validate JITed code. */
 	if (validate_code(&ctx)) {
@@ -1537,3 +1624,218 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 {
 	return true;
 }
+
+static bool is_long_jump(void *ip, void *target)
+{
+	long offset;
+
+	/* NULL target means this is a NOP */
+	if (!target)
+		return false;
+
+	offset = (long)target - (long)ip;
+	return offset < -SZ_128M || offset >= SZ_128M;
+}
+
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
+			     void *addr, void *plt, u32 *insn)
+{
+	void *target;
+
+	if (!addr) {
+		*insn = aarch64_insn_gen_nop();
+		return 0;
+	}
+
+	if (is_long_jump(ip, addr))
+		target = plt;
+	else
+		target = addr;
+
+	*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
+					    (unsigned long)target,
+					    type);
+
+	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+
+/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
+ * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
+ * or @new_addr is NULL, the old or new instruction is NOP.
+ *
+ * When @ip is the bpf prog entry, a bpf trampoline is being attached or
+ * detached. Since bpf trampoline and bpf prog are allocated separately with
+ * vmalloc, the address distance may exceed 128MB, the maximum branch range.
+ * So long jump should be handled.
+ *
+ * When a bpf prog is constructed, a plt pointing to empty trampoline
+ * dummy_tramp is placed at the end:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              nop // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad dummy_tramp // plt target
+ *
+ * This is also the state when no trampoline is attached.
+ *
+ * When a short-jump bpf trampoline is attached, the patchsite is patched
+ * to a bl instruction to the trampoline directly:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              bl <short-jump bpf trampoline address> // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad dummy_tramp // plt target
+ *
+ * When a long-jump bpf trampoline is attached, the plt target is filled with
+ * the trampoline address and the patchsite is patched to a bl instruction to
+ * the plt:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              bl plt // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad <long-jump bpf trampoline address> // plt target
+ *
+ * The dummy_tramp is used to prevent another CPU from jumping to unknown
+ * locations during the patching process, making the patching process easier.
+ */
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
+		       void *old_addr, void *new_addr)
+{
+	int ret;
+	u32 old_insn;
+	u32 new_insn;
+	u32 replaced;
+	struct bpf_plt *plt = NULL;
+	unsigned long size = 0UL;
+	unsigned long offset = ~0UL;
+	enum aarch64_insn_branch_type branch_type;
+	char namebuf[KSYM_NAME_LEN];
+	void *image = NULL;
+	u64 plt_target = 0ULL;
+	bool poking_bpf_entry;
+
+	if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
+		/* Only poking bpf text is supported. Since kernel function
+		 * entry is set up by ftrace, we rely on ftrace to poke kernel
+		 * functions.
+		 */
+		return -ENOTSUPP;
+
+	image = ip - offset;
+	/* zero offset means we're poking bpf prog entry */
+	poking_bpf_entry = (offset == 0UL);
+
+	/* bpf prog entry, find plt and the real patchsite */
+	if (poking_bpf_entry) {
+		/* plt locates at the end of bpf prog */
+		plt = image + size - PLT_TARGET_OFFSET;
+
+		/* skip to the nop instruction in bpf prog entry:
+		 * bti c // if BTI enabled
+		 * mov x9, x30
+		 * nop
+		 */
+		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
+	}
+
+	/* long jump is only possible at bpf prog entry */
+	if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
+		    !poking_bpf_entry))
+		return -EINVAL;
+
+	if (poke_type == BPF_MOD_CALL)
+		branch_type = AARCH64_INSN_BRANCH_LINK;
+	else
+		branch_type = AARCH64_INSN_BRANCH_NOLINK;
+
+	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
+		return -EFAULT;
+
+	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
+		return -EFAULT;
+
+	if (is_long_jump(ip, new_addr))
+		plt_target = (u64)new_addr;
+	else if (is_long_jump(ip, old_addr))
+		/* if the old target is a long jump and the new target is not,
+		 * restore the plt target to dummy_tramp, so there is always a
+		 * legal and harmless address stored in plt target, and we'll
+		 * never jump from plt to an unknown place.
+		 */
+		plt_target = (u64)&dummy_tramp;
+
+	if (plt_target) {
+		/* non-zero plt_target indicates we're patching a bpf prog,
+		 * which is read only.
+		 */
+		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
+			return -EFAULT;
+		WRITE_ONCE(plt->target, plt_target);
+		set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
+		/* since plt target points to either the new trampoline
+		 * or dummy_tramp, even if another CPU reads the old plt
+		 * target value before fetching the bl instruction to plt,
+		 * it will be brought back by dummy_tramp, so no barrier is
+		 * required here.
+		 */
+	}
+
+	/* if the old target and the new target are both long jumps, no
+	 * patching is required
+	 */
+	if (old_insn == new_insn)
+		return 0;
+
+	mutex_lock(&text_mutex);
+	if (aarch64_insn_read(ip, &replaced)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (replaced != old_insn) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* We call aarch64_insn_patch_text_nosync() to replace instruction
+	 * atomically, so no other CPUs will fetch a half-new and half-old
+	 * instruction. But there is chance that another CPU executes the
+	 * old instruction after the patching operation finishes (e.g.,
+	 * pipeline not flushed, or icache not synchronized yet).
+	 *
+	 * 1. when a new trampoline is attached, it is not a problem for
+	 *    different CPUs to jump to different trampolines temporarily.
+	 *
+	 * 2. when an old trampoline is freed, we should wait for all other
+	 *    CPUs to exit the trampoline and make sure the trampoline is no
+	 *    longer reachable, since bpf_tramp_image_put() function already
+	 *    uses percpu_ref and task rcu to do the sync, no need to call
+	 *    the sync version here, see bpf_tramp_image_put() for details.
+	 */
+	ret = aarch64_insn_patch_text_nosync(ip, new_insn);
+out:
+	mutex_unlock(&text_mutex);
+
+	return ret;
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v6 3/4] bpf, arm64: Impelment bpf_arch_text_poke() for arm64
@ 2022-06-25 16:12   ` Xu Kuohai
  0 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-06-25 16:12 UTC (permalink / raw)
  To: bpf, linux-arm-kernel, linux-kernel, netdev
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

Impelment bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
can be patched with it.

When the target address is NULL, the original instruction is patched to a
NOP.

When the target address and the source address are within the branch
range, the original instruction is patched to a bl instruction to the
target address directly.

To support attaching bpf trampoline to both regular kernel function and
bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
instructions are inserted at the beginning of bpf prog, the first one
saves the return address to x9, and the second is a nop which will be
patched to a bl instruction when a bpf trampoline is attached.

However, when a bpf trmapoline is attached to bpf prog, the distance
between target address and source address may exceed 128MB, the maximum
branch range, because bpf trampoline and bpf prog are allocated
separately with vmalloc. So long jump should be handled.

When a bpf prog is constructed, a plt pointing to empty trampoline
dummy_tramp is placed at the end:

        bpf_prog:
                mov x9, lr
                nop // patchsite
                ...
                ret

        plt:
                ldr x10, target
                br x10
        target:
                .quad dummy_tramp // plt target

This is also the state when no trampoline is attached.

When a short-jump bpf trampoline is attached, the patchsite is patched to
a bl instruction to the trampoline directly:

        bpf_prog:
                mov x9, lr
                bl <short-jump bpf trampoline address> // patchsite
                ...
                ret

        plt:
                ldr x10, target
                br x10
        target:
                .quad dummy_tramp // plt target

When a long-jump bpf trampoline is attached, the plt target is filled with
the trampoline address and the patchsite is patched to a bl instruction to
the plt:

        bpf_prog:
                mov x9, lr
                bl plt // patchsite
                ...
                ret

        plt:
                ldr x10, target
                br x10
        target:
                .quad <long-jump bpf trampoline address>

dummy_tramp  is used to prevent another CPU from jumping to an unknown
location during the patching process, making the patching process easier.

The patching process is as follows:

1. when neither the old address or the new address is a long jump, the
   patchsite is replaced with a bl to the new address, or nop if the new
   address is NULL;

2. when the old address is not long jump but the new one is, the
   branch target address is written to plt first, then the patchsite
   is replaced with a bl instruction to the plt;

3. when the old address is long jump but the new one is not, the address
   of dummy_tramp is written to plt first, then the patchsite is replaced
   with a bl to the new address, or a nop if the new address is NULL;

4. when both the old address and the new address are long jump, the
   new address is written to plt and the patchsite is not changed.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
---
 arch/arm64/net/bpf_jit.h      |   7 +
 arch/arm64/net/bpf_jit_comp.c | 330 ++++++++++++++++++++++++++++++++--
 2 files changed, 323 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index 194c95ccc1cf..a6acb94ea3d6 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -80,6 +80,12 @@
 #define A64_STR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, STORE)
 #define A64_LDR64I(Xt, Xn, imm) A64_LS_IMM(Xt, Xn, imm, 64, LOAD)
 
+/* LDR (literal) */
+#define A64_LDR32LIT(Wt, offset) \
+	aarch64_insn_gen_load_literal(0, offset, Wt, false)
+#define A64_LDR64LIT(Xt, offset) \
+	aarch64_insn_gen_load_literal(0, offset, Xt, true)
+
 /* Load/store register pair */
 #define A64_LS_PAIR(Rt, Rt2, Rn, offset, ls, type) \
 	aarch64_insn_gen_load_store_pair(Rt, Rt2, Rn, offset, \
@@ -270,6 +276,7 @@
 #define A64_BTI_C  A64_HINT(AARCH64_INSN_HINT_BTIC)
 #define A64_BTI_J  A64_HINT(AARCH64_INSN_HINT_BTIJ)
 #define A64_BTI_JC A64_HINT(AARCH64_INSN_HINT_BTIJC)
+#define A64_NOP    A64_HINT(AARCH64_INSN_HINT_NOP)
 
 /* DMB */
 #define A64_DMB_ISH aarch64_insn_gen_dmb(AARCH64_INSN_MB_ISH)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f08a4447d363..e0e9c705a2e4 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -9,6 +9,7 @@
 
 #include <linux/bitfield.h>
 #include <linux/bpf.h>
+#include <linux/memory.h>
 #include <linux/filter.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
@@ -18,6 +19,7 @@
 #include <asm/cacheflush.h>
 #include <asm/debug-monitors.h>
 #include <asm/insn.h>
+#include <asm/patching.h>
 #include <asm/set_memory.h>
 
 #include "bpf_jit.h"
@@ -78,6 +80,15 @@ struct jit_ctx {
 	int fpb_offset;
 };
 
+struct bpf_plt {
+	u32 insn_ldr; /* load target */
+	u32 insn_br;  /* branch to target */
+	u64 target;   /* target value */
+} __packed;
+
+#define PLT_TARGET_SIZE   sizeof_field(struct bpf_plt, target)
+#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
+
 static inline void emit(const u32 insn, struct jit_ctx *ctx)
 {
 	if (ctx->image != NULL)
@@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val,
 	}
 }
 
+static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
+{
+	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
+		emit(insn, ctx);
+}
+
 /*
  * Kernel addresses in the vmalloc space use at most 48 bits, and the
  * remaining bits are guaranteed to be 0x1. So we can compose the address
@@ -235,13 +252,30 @@ static bool is_lsi_offset(int offset, int scale)
 	return true;
 }
 
+/* generated prologue:
+ *      bti c // if CONFIG_ARM64_BTI_KERNEL
+ *      mov x9, lr
+ *      nop  // POKE_OFFSET
+ *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
+ *      stp x29, lr, [sp, #-16]!
+ *      mov x29, sp
+ *      stp x19, x20, [sp, #-16]!
+ *      stp x21, x22, [sp, #-16]!
+ *      stp x25, x26, [sp, #-16]!
+ *      stp x27, x28, [sp, #-16]!
+ *      mov x25, sp
+ *      mov tcc, #0
+ *      // PROLOGUE_OFFSET
+ */
+
+#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
+#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
+
+/* Offset of nop instruction in bpf prog entry to be poked */
+#define POKE_OFFSET (BTI_INSNS + 1)
+
 /* Tail call offset to jump into */
-#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
-	IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
-#define PROLOGUE_OFFSET 9
-#else
-#define PROLOGUE_OFFSET 8
-#endif
+#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
 
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
@@ -280,12 +314,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 	 *
 	 */
 
+	emit_bti(A64_BTI_C, ctx);
+
+	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
+	emit(A64_NOP, ctx);
+
 	/* Sign lr */
 	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
 		emit(A64_PACIASP, ctx);
-	/* BTI landing pad */
-	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-		emit(A64_BTI_C, ctx);
 
 	/* Save FP and LR registers to stay align with ARM64 AAPCS */
 	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
@@ -312,8 +348,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 		}
 
 		/* BTI landing pad for the tail call, done with a BR */
-		if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
-			emit(A64_BTI_J, ctx);
+		emit_bti(A64_BTI_J, ctx);
 	}
 
 	emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
@@ -557,6 +592,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
 	return 0;
 }
 
+void dummy_tramp(void);
+
+asm (
+"	.pushsection .text, \"ax\", @progbits\n"
+"	.type dummy_tramp, %function\n"
+"dummy_tramp:"
+#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
+"	bti j\n" /* dummy_tramp is called via "br x10" */
+#endif
+"	mov x10, lr\n"
+"	mov lr, x9\n"
+"	ret x10\n"
+"	.size dummy_tramp, .-dummy_tramp\n"
+"	.popsection\n"
+);
+
+/* build a plt initialized like this:
+ *
+ * plt:
+ *      ldr tmp, target
+ *      br tmp
+ * target:
+ *      .quad dummy_tramp
+ *
+ * when a long jump trampoline is attached, target is filled with the
+ * trampoline address, and when the trampoine is removed, target is
+ * restored to dummy_tramp address.
+ */
+static void build_plt(struct jit_ctx *ctx, bool write_target)
+{
+	const u8 tmp = bpf2a64[TMP_REG_1];
+	struct bpf_plt *plt = NULL;
+
+	/* make sure target is 64-bit aligend */
+	if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
+		emit(A64_NOP, ctx);
+
+	plt = (struct bpf_plt *)(ctx->image + ctx->idx);
+	/* plt is called via bl, no BTI needed here */
+	emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
+	emit(A64_BR(tmp), ctx);
+
+	/* false write_target means target space is not allocated yet */
+	if (write_target)
+		plt->target = (u64)&dummy_tramp;
+}
+
 static void build_epilogue(struct jit_ctx *ctx)
 {
 	const u8 r0 = bpf2a64[BPF_REG_0];
@@ -1356,7 +1438,7 @@ struct arm64_jit_data {
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
-	int image_size, prog_size, extable_size;
+	int image_size, prog_size, extable_size, extable_align, extable_offset;
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
 	struct arm64_jit_data *jit_data;
@@ -1426,13 +1508,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
 	ctx.epilogue_offset = ctx.idx;
 	build_epilogue(&ctx);
+	build_plt(&ctx, false);
 
+	extable_align = __alignof__(struct exception_table_entry);
 	extable_size = prog->aux->num_exentries *
 		sizeof(struct exception_table_entry);
 
 	/* Now we know the actual image size. */
 	prog_size = sizeof(u32) * ctx.idx;
-	image_size = prog_size + extable_size;
+	/* also allocate space for plt target */
+	extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
+	image_size = extable_offset + extable_size;
 	header = bpf_jit_binary_alloc(image_size, &image_ptr,
 				      sizeof(u32), jit_fill_hole);
 	if (header == NULL) {
@@ -1444,7 +1530,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 
 	ctx.image = (__le32 *)image_ptr;
 	if (extable_size)
-		prog->aux->extable = (void *)image_ptr + prog_size;
+		prog->aux->extable = (void *)image_ptr + extable_offset;
 skip_init_ctx:
 	ctx.idx = 0;
 	ctx.exentry_idx = 0;
@@ -1458,6 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	}
 
 	build_epilogue(&ctx);
+	build_plt(&ctx, true);
 
 	/* 3. Extra pass to validate JITed code. */
 	if (validate_code(&ctx)) {
@@ -1537,3 +1624,218 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 {
 	return true;
 }
+
+static bool is_long_jump(void *ip, void *target)
+{
+	long offset;
+
+	/* NULL target means this is a NOP */
+	if (!target)
+		return false;
+
+	offset = (long)target - (long)ip;
+	return offset < -SZ_128M || offset >= SZ_128M;
+}
+
+static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
+			     void *addr, void *plt, u32 *insn)
+{
+	void *target;
+
+	if (!addr) {
+		*insn = aarch64_insn_gen_nop();
+		return 0;
+	}
+
+	if (is_long_jump(ip, addr))
+		target = plt;
+	else
+		target = addr;
+
+	*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
+					    (unsigned long)target,
+					    type);
+
+	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
+}
+
+/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
+ * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
+ * or @new_addr is NULL, the old or new instruction is NOP.
+ *
+ * When @ip is the bpf prog entry, a bpf trampoline is being attached or
+ * detached. Since bpf trampoline and bpf prog are allocated separately with
+ * vmalloc, the address distance may exceed 128MB, the maximum branch range.
+ * So long jumps need to be handled.
+ *
+ * When a bpf prog is constructed, a plt pointing to empty trampoline
+ * dummy_tramp is placed at the end:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              nop // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad dummy_tramp // plt target
+ *
+ * This is also the state when no trampoline is attached.
+ *
+ * When a short-jump bpf trampoline is attached, the patchsite is patched
+ * to a bl instruction to the trampoline directly:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              bl <short-jump bpf trampoline address> // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad dummy_tramp // plt target
+ *
+ * When a long-jump bpf trampoline is attached, the plt target is filled with
+ * the trampoline address and the patchsite is patched to a bl instruction to
+ * the plt:
+ *
+ *      bpf_prog:
+ *              mov x9, lr
+ *              bl plt // patchsite
+ *              ...
+ *              ret
+ *
+ *      plt:
+ *              ldr x10, target
+ *              br x10
+ *      target:
+ *              .quad <long-jump bpf trampoline address> // plt target
+ *
+ * The dummy_tramp is used to prevent another CPU from jumping to unknown
+ * locations during patching, which simplifies the patching process.
+ */
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
+		       void *old_addr, void *new_addr)
+{
+	int ret;
+	u32 old_insn;
+	u32 new_insn;
+	u32 replaced;
+	struct bpf_plt *plt = NULL;
+	unsigned long size = 0UL;
+	unsigned long offset = ~0UL;
+	enum aarch64_insn_branch_type branch_type;
+	char namebuf[KSYM_NAME_LEN];
+	void *image = NULL;
+	u64 plt_target = 0ULL;
+	bool poking_bpf_entry;
+
+	if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
+		/* Only poking bpf text is supported. Since kernel function
+		 * entry is set up by ftrace, we rely on ftrace to poke kernel
+		 * functions.
+		 */
+		return -ENOTSUPP;
+
+	image = ip - offset;
+	/* zero offset means we're poking bpf prog entry */
+	poking_bpf_entry = (offset == 0UL);
+
+	/* bpf prog entry, find plt and the real patchsite */
+	if (poking_bpf_entry) {
+		/* the plt is located at the end of the bpf prog */
+		plt = image + size - PLT_TARGET_OFFSET;
+
+		/* skip to the nop instruction in bpf prog entry:
+		 * bti c // if BTI enabled
+		 * mov x9, x30
+		 * nop
+		 */
+		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
+	}
+
+	/* long jump is only possible at bpf prog entry */
+	if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
+		    !poking_bpf_entry))
+		return -EINVAL;
+
+	if (poke_type == BPF_MOD_CALL)
+		branch_type = AARCH64_INSN_BRANCH_LINK;
+	else
+		branch_type = AARCH64_INSN_BRANCH_NOLINK;
+
+	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
+		return -EFAULT;
+
+	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
+		return -EFAULT;
+
+	if (is_long_jump(ip, new_addr))
+		plt_target = (u64)new_addr;
+	else if (is_long_jump(ip, old_addr))
+		/* if the old target is a long jump and the new target is not,
+		 * restore the plt target to dummy_tramp, so there is always a
+		 * legal and harmless address stored in plt target, and we'll
+		 * never jump from plt to an unknown place.
+		 */
+		plt_target = (u64)&dummy_tramp;
+
+	if (plt_target) {
+		/* non-zero plt_target indicates we're patching a bpf prog,
+		 * which is read only.
+		 */
+		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
+			return -EFAULT;
+		WRITE_ONCE(plt->target, plt_target);
+		set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
+		/* since plt target points to either the new trampoline
+		 * or dummy_tramp, even if another CPU reads the old plt
+		 * target value before fetching the bl instruction to plt,
+		 * it will be brought back by dummy_tramp, so no barrier is
+		 * required here.
+		 */
+	}
+
+	/* if the old target and the new target are both long jumps, no
+	 * patching is required
+	 */
+	if (old_insn == new_insn)
+		return 0;
+
+	mutex_lock(&text_mutex);
+	if (aarch64_insn_read(ip, &replaced)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (replaced != old_insn) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* We call aarch64_insn_patch_text_nosync() to replace the instruction
+	 * atomically, so no other CPUs will fetch a half-new and half-old
+	 * instruction. But there is a chance that another CPU executes the
+	 * old instruction after the patching operation finishes (e.g.,
+	 * pipeline not flushed, or icache not synchronized yet).
+	 *
+	 * 1. when a new trampoline is attached, it is not a problem for
+	 *    different CPUs to jump to different trampolines temporarily.
+	 *
+	 * 2. when an old trampoline is freed, we should wait for all other
+	 *    CPUs to exit the trampoline and make sure the trampoline is no
+	 *    longer reachable. Since bpf_tramp_image_put() already uses
+	 *    percpu_ref and task rcu to do the sync, there is no need to call
+	 *    the sync version here, see bpf_tramp_image_put() for details.
+	 */
+	ret = aarch64_insn_patch_text_nosync(ip, new_insn);
+out:
+	mutex_unlock(&text_mutex);
+
+	return ret;
+}
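
A rough sketch of how the generic trampoline code is expected to drive the
three states described above when the target is a bpf prog; the prog_entry,
old_tramp and new_tramp names are illustrative and not taken from this patch:

    /* attach: the entry nop becomes a bl, short vs. long jump is
     * resolved internally via the plt
     */
    bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, NULL, new_tramp);

    /* update: swap one trampoline image for another */
    bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, old_tramp, new_tramp);

    /* detach: the bl becomes a nop again, the plt target falls back
     * to dummy_tramp
     */
    bpf_arch_text_poke(prog_entry, BPF_MOD_CALL, old_tramp, NULL);
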
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline for arm64
  2022-06-25 16:12 ` Xu Kuohai
@ 2022-06-25 16:12   ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-06-25 16:12 UTC (permalink / raw)
  To: bpf, linux-arm-kernel, linux-kernel, netdev
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

This is the arm64 version of commit fec56f5890d9 ("bpf: Introduce BPF
trampoline"). A bpf trampoline converts native calling convention to bpf
calling convention and is used to implement various bpf features, such
as fentry, fexit, fmod_ret and struct_ops.

This patch does essentially the same thing that the bpf trampoline does on x86.
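
As an illustration of what this makes possible on arm64, a minimal fentry
program of the kind exercised by the tests below could look like this (the
traced function is the kernel's bpf_fentry_test1 self-test hook; the program
name and message are arbitrary examples, not part of this patch):

    // SPDX-License-Identifier: GPL-2.0
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    /* entered through the bpf trampoline on entry of the traced function */
    SEC("fentry/bpf_fentry_test1")
    int BPF_PROG(trace_entry, int a)
    {
            bpf_printk("bpf_fentry_test1 called with a = %d", a);
            return 0;
    }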

Tested on raspberry pi 4b and qemu:

 #18 /1     bpf_tcp_ca/dctcp:OK
 #18 /2     bpf_tcp_ca/cubic:OK
 #18 /3     bpf_tcp_ca/invalid_license:OK
 #18 /4     bpf_tcp_ca/dctcp_fallback:OK
 #18 /5     bpf_tcp_ca/rel_setsockopt:OK
 #18        bpf_tcp_ca:OK
 #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
 #51 /2     dummy_st_ops/dummy_init_ret_value:OK
 #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
 #51 /4     dummy_st_ops/dummy_multiple_args:OK
 #51        dummy_st_ops:OK
 #57 /1     fexit_bpf2bpf/target_no_callees:OK
 #57 /2     fexit_bpf2bpf/target_yes_callees:OK
 #57 /3     fexit_bpf2bpf/func_replace:OK
 #57 /4     fexit_bpf2bpf/func_replace_verify:OK
 #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
 #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
 #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
 #57 /8     fexit_bpf2bpf/func_replace_multi:OK
 #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
 #57        fexit_bpf2bpf:OK
 #237       xdp_bpf2bpf:OK

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
---
 arch/arm64/net/bpf_jit_comp.c | 387 +++++++++++++++++++++++++++++++++-
 1 file changed, 384 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index e0e9c705a2e4..dd5a843601b8 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -176,6 +176,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val,
 	}
 }
 
+static inline void emit_call(u64 target, struct jit_ctx *ctx)
+{
+	u8 tmp = bpf2a64[TMP_REG_1];
+
+	emit_addr_mov_i64(tmp, target, ctx);
+	emit(A64_BLR(tmp), ctx);
+}
+
 static inline int bpf2a64_offset(int bpf_insn, int off,
 				 const struct jit_ctx *ctx)
 {
@@ -1073,8 +1081,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
 					    &func_addr, &func_addr_fixed);
 		if (ret < 0)
 			return ret;
-		emit_addr_mov_i64(tmp, func_addr, ctx);
-		emit(A64_BLR(tmp), ctx);
+		emit_call(func_addr, ctx);
 		emit(A64_MOV(1, r0, A64_R(0)), ctx);
 		break;
 	}
@@ -1418,6 +1425,13 @@ static int validate_code(struct jit_ctx *ctx)
 		if (a64_insn == AARCH64_BREAK_FAULT)
 			return -1;
 	}
+	return 0;
+}
+
+static int validate_ctx(struct jit_ctx *ctx)
+{
+	if (validate_code(ctx))
+		return -1;
 
 	if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries))
 		return -1;
@@ -1547,7 +1561,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	build_plt(&ctx, true);
 
 	/* 3. Extra pass to validate JITed code. */
-	if (validate_code(&ctx)) {
+	if (validate_ctx(&ctx)) {
 		bpf_jit_binary_free(header);
 		prog = orig_prog;
 		goto out_off;
@@ -1625,6 +1639,373 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 	return true;
 }
 
+static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
+			    int args_off, int retval_off, int run_ctx_off,
+			    bool save_ret)
+{
+	u32 *branch;
+	u64 enter_prog;
+	u64 exit_prog;
+	u8 tmp = bpf2a64[TMP_REG_1];
+	u8 r0 = bpf2a64[BPF_REG_0];
+	struct bpf_prog *p = l->link.prog;
+	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
+
+	if (p->aux->sleepable) {
+		enter_prog = (u64)__bpf_prog_enter_sleepable;
+		exit_prog = (u64)__bpf_prog_exit_sleepable;
+	} else {
+		enter_prog = (u64)__bpf_prog_enter;
+		exit_prog = (u64)__bpf_prog_exit;
+	}
+
+	if (l->cookie == 0) {
+		/* if cookie is zero, one instruction is enough to store it */
+		emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx);
+	} else {
+		emit_a64_mov_i64(tmp, l->cookie, ctx);
+		emit(A64_STR64I(tmp, A64_SP, run_ctx_off + cookie_off), ctx);
+	}
+
+	/* save p to callee saved register x19 to avoid loading p with mov_i64
+	 * each time.
+	 */
+	emit_addr_mov_i64(A64_R(19), (const u64)p, ctx);
+
+	/* arg1: prog */
+	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
+	/* arg2: &run_ctx */
+	emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx);
+
+	emit_call(enter_prog, ctx);
+
+	/* if (__bpf_prog_enter(prog) == 0)
+	 *         goto skip_exec_of_prog;
+	 */
+	branch = ctx->image + ctx->idx;
+	emit(A64_NOP, ctx);
+
+	/* save return value to callee saved register x20 */
+	emit(A64_MOV(1, A64_R(20), r0), ctx);
+
+	emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
+	if (!p->jited)
+		emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
+
+	emit_call((const u64)p->bpf_func, ctx);
+
+	/* store return value */
+	if (save_ret)
+		emit(A64_STR64I(r0, A64_SP, retval_off), ctx);
+
+	if (ctx->image) {
+		int offset = &ctx->image[ctx->idx] - branch;
+		*branch = A64_CBZ(1, A64_R(0), offset);
+	}
+
+	/* arg1: prog */
+	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
+	/* arg2: start time */
+	emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx);
+	/* arg3: &run_ctx */
+	emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx);
+
+	emit_call(exit_prog, ctx);
+}
+
+static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl,
+			       int args_off, int retval_off, int run_ctx_off,
+			       u32 **branches)
+{
+	int i;
+
+	/* The first fmod_ret program will receive a garbage return value.
+	 * Set this to 0 to avoid confusing the program.
+	 */
+	emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx);
+	for (i = 0; i < tl->nr_links; i++) {
+		invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off,
+				run_ctx_off, true);
+		/* if (*(u64 *)(sp + retval_off) !=  0)
+		 *	goto do_fexit;
+		 */
+		emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx);
+		/* Save the location of branch, and generate a nop.
+		 * This nop will be replaced with a cbnz later.
+		 */
+		branches[i] = ctx->image + ctx->idx;
+		emit(A64_NOP, ctx);
+	}
+}
+
+static void save_args(struct jit_ctx *ctx, int args_off, int nargs)
+{
+	int i;
+
+	for (i = 0; i < nargs; i++) {
+		emit(A64_STR64I(i, A64_SP, args_off), ctx);
+		args_off += 8;
+	}
+}
+
+static void restore_args(struct jit_ctx *ctx, int args_off, int nargs)
+{
+	int i;
+
+	for (i = 0; i < nargs; i++) {
+		emit(A64_LDR64I(i, A64_SP, args_off), ctx);
+		args_off += 8;
+	}
+}
+
+/* Based on the x86 implementation of arch_prepare_bpf_trampoline().
+ *
+ * bpf prog and function entry before the bpf trampoline is hooked:
+ *   mov x9, lr
+ *   nop
+ *
+ * bpf prog and function entry after the bpf trampoline is hooked:
+ *   mov x9, lr
+ *   bl  <bpf_trampoline or plt>
+ *
+ */
+static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
+			      struct bpf_tramp_links *tlinks, void *orig_call,
+			      int nargs, u32 flags)
+{
+	int i;
+	int stack_size;
+	int retaddr_off;
+	int regs_off;
+	int retval_off;
+	int args_off;
+	int nargs_off;
+	int ip_off;
+	int run_ctx_off;
+	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
+	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
+	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
+	bool save_ret;
+	u32 **branches = NULL;
+
+	/* trampoline stack layout:
+	 *                  [ parent ip         ]
+	 *                  [ FP                ]
+	 * SP + retaddr_off [ self ip           ]
+	 *                  [ FP                ]
+	 *
+	 *                  [ padding           ] align SP to multiples of 16
+	 *
+	 *                  [ x20               ] callee saved reg x20
+	 * SP + regs_off    [ x19               ] callee saved reg x19
+	 *
+	 * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
+	 *                                        BPF_TRAMP_F_RET_FENTRY_RET
+	 *
+	 *                  [ argN              ]
+	 *                  [ ...               ]
+	 * SP + args_off    [ arg1              ]
+	 *
+	 * SP + nargs_off   [ args count        ]
+	 *
+	 * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
+	 *
+	 * SP + run_ctx_off [ bpf_tramp_run_ctx ]
+	 */
+
+	stack_size = 0;
+	run_ctx_off = stack_size;
+	/* room for bpf_tramp_run_ctx */
+	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
+
+	ip_off = stack_size;
+	/* room for IP address argument */
+	if (flags & BPF_TRAMP_F_IP_ARG)
+		stack_size += 8;
+
+	nargs_off = stack_size;
+	/* room for args count */
+	stack_size += 8;
+
+	args_off = stack_size;
+	/* room for args */
+	stack_size += nargs * 8;
+
+	/* room for return value */
+	retval_off = stack_size;
+	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
+	if (save_ret)
+		stack_size += 8;
+
+	/* room for callee saved registers, currently x19 and x20 are used */
+	regs_off = stack_size;
+	stack_size += 16;
+
+	/* round up to multiples of 16 to avoid SPAlignmentFault */
+	stack_size = round_up(stack_size, 16);
+
+	/* the return address is located above FP */
+	retaddr_off = stack_size + 8;
+
+	/* bpf trampoline may be invoked by 3 instruction types:
+	 * 1. bl, attached to bpf prog or kernel function via short jump
+	 * 2. br, attached to bpf prog or kernel function via long jump
+	 * 3. blr, working as a function pointer, used by struct_ops.
+	 * So BTI_JC should be used here to support both br and blr.
+	 */
+	emit_bti(A64_BTI_JC, ctx);
+
+	/* frame for parent function */
+	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
+	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+
+	/* frame for patched function */
+	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
+	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+
+	/* allocate stack space */
+	emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+
+	if (flags & BPF_TRAMP_F_IP_ARG) {
+		/* save ip address of the traced function */
+		emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx);
+		emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx);
+	}
+
+	/* save args count */
+	emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx);
+	emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx);
+
+	/* save args */
+	save_args(ctx, args_off, nargs);
+
+	/* save callee saved registers */
+	emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx);
+	emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
+		emit_call((const u64)__bpf_tramp_enter, ctx);
+	}
+
+	for (i = 0; i < fentry->nr_links; i++)
+		invoke_bpf_prog(ctx, fentry->links[i], args_off,
+				retval_off, run_ctx_off,
+				flags & BPF_TRAMP_F_RET_FENTRY_RET);
+
+	if (fmod_ret->nr_links) {
+		branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *),
+				   GFP_KERNEL);
+		if (!branches)
+			return -ENOMEM;
+
+		invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off,
+				   run_ctx_off, branches);
+	}
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		restore_args(ctx, args_off, nargs);
+		/* call original func */
+		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
+		emit(A64_BLR(A64_R(10)), ctx);
+		/* store return value */
+		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
+		/* reserve a nop for bpf_tramp_image_put */
+		im->ip_after_call = ctx->image + ctx->idx;
+		emit(A64_NOP, ctx);
+	}
+
+	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
+	for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) {
+		int offset = &ctx->image[ctx->idx] - branches[i];
+		*branches[i] = A64_CBNZ(1, A64_R(10), offset);
+	}
+
+	for (i = 0; i < fexit->nr_links; i++)
+		invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off,
+				run_ctx_off, false);
+
+	if (flags & BPF_TRAMP_F_RESTORE_REGS)
+		restore_args(ctx, args_off, nargs);
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		im->ip_epilogue = ctx->image + ctx->idx;
+		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
+		emit_call((const u64)__bpf_tramp_exit, ctx);
+	}
+
+	/* restore callee saved register x19 and x20 */
+	emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx);
+	emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
+
+	if (save_ret)
+		emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx);
+
+	/* reset SP  */
+	emit(A64_MOV(1, A64_SP, A64_FP), ctx);
+
+	/* pop frames  */
+	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
+	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
+
+	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
+		/* skip patched function, return to parent */
+		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
+		emit(A64_RET(A64_R(9)), ctx);
+	} else {
+		/* return to patched function */
+		emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
+		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
+		emit(A64_RET(A64_R(10)), ctx);
+	}
+
+	if (ctx->image)
+		bpf_flush_icache(ctx->image, ctx->image + ctx->idx);
+
+	kfree(branches);
+
+	return ctx->idx;
+}
+
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
+				void *image_end, const struct btf_func_model *m,
+				u32 flags, struct bpf_tramp_links *tlinks,
+				void *orig_call)
+{
+	int ret;
+	int nargs = m->nr_args;
+	int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE;
+	struct jit_ctx ctx = {
+		.image = NULL,
+		.idx = 0,
+	};
+
+	/* the first 8 arguments are passed in registers */
+	if (nargs > 8)
+		return -ENOTSUPP;
+
+	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
+	if (ret < 0)
+		return ret;
+
+	if (ret > max_insns)
+		return -EFBIG;
+
+	ctx.image = image;
+	ctx.idx = 0;
+
+	jit_fill_hole(image, (unsigned int)(image_end - image));
+	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
+
+	if (ret > 0 && validate_code(&ctx) < 0)
+		ret = -EINVAL;
+
+	if (ret > 0)
+		ret *= AARCH64_INSN_SIZE;
+
+	return ret;
+}
+
 static bool is_long_jump(void *ip, void *target)
 {
 	long offset;
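
For reference, the code generated by prepare_trampoline() for a single fentry
program with no special flags roughly takes the following shape (hand-written
sketch, not generated output; register choices and frame layout follow the
code above):

    /*
     * bti jc                      // if CONFIG_ARM64_BTI_KERNEL
     * stp x29, x9,  [sp, #-16]!   // frame for the parent function
     * mov x29, sp
     * stp x29, x30, [sp, #-16]!   // frame for the patched function
     * mov x29, sp
     * sub sp, sp, #stack_size
     * ... store nargs, args, x19 and x20 ...
     * ... __bpf_prog_enter, prog->bpf_func, __bpf_prog_exit ...
     * ... reload x19 and x20 ...
     * mov sp, x29
     * ldp x29, x30, [sp], #16
     * ldp x29, x9,  [sp], #16
     * mov x10, lr                 // return to the patched function
     * mov lr, x9
     * ret x10
     */
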
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-06-25 16:12 ` Xu Kuohai
@ 2022-06-30 21:12   ` Daniel Borkmann
  -1 siblings, 0 replies; 42+ messages in thread
From: Daniel Borkmann @ 2022-06-30 21:12 UTC (permalink / raw)
  To: Xu Kuohai, bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland
  Cc: Catalin Marinas, Will Deacon, Alexei Starovoitov, Zi Shen Lim,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, David S . Miller, Hideaki YOSHIFUJI,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Jakub Kicinski,
	Jesper Dangaard Brouer, Russell King, James Morse, Hou Tao,
	Jason Wang

Hey Mark,

On 6/25/22 6:12 PM, Xu Kuohai wrote:
> This patchset introduces bpf trampoline on arm64. A bpf trampoline converts
> native calling convention to bpf calling convention and is used to implement
> various bpf features, such as fentry, fexit, fmod_ret and struct_ops.
> 
> The trampoline introduced does essentially the same thing as the bpf
> trampoline does on x86.
> 
> Tested on raspberry pi 4b and qemu:
> 
>   #18 /1     bpf_tcp_ca/dctcp:OK
>   #18 /2     bpf_tcp_ca/cubic:OK
>   #18 /3     bpf_tcp_ca/invalid_license:OK
>   #18 /4     bpf_tcp_ca/dctcp_fallback:OK
>   #18 /5     bpf_tcp_ca/rel_setsockopt:OK
>   #18        bpf_tcp_ca:OK
>   #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
>   #51 /2     dummy_st_ops/dummy_init_ret_value:OK
>   #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
>   #51 /4     dummy_st_ops/dummy_multiple_args:OK
>   #51        dummy_st_ops:OK
>   #57 /1     fexit_bpf2bpf/target_no_callees:OK
>   #57 /2     fexit_bpf2bpf/target_yes_callees:OK
>   #57 /3     fexit_bpf2bpf/func_replace:OK
>   #57 /4     fexit_bpf2bpf/func_replace_verify:OK
>   #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
>   #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
>   #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
>   #57 /8     fexit_bpf2bpf/func_replace_multi:OK
>   #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
>   #57        fexit_bpf2bpf:OK
>   #237       xdp_bpf2bpf:OK
> 
> v6:
> - Since Mark is refactoring arm64 ftrace to support long jump and reduce the
>    ftrace trampoline overhead, it's not clear how we'll attach bpf trampoline
>    to regular kernel functions, so remove ftrace related patches for now.
> - Add long jump support for attaching bpf trampoline to bpf prog, since bpf
>    trampoline and bpf prog are allocated via vmalloc, there is chance the
>    distance exceeds the max branch range.
> - Collect ACK/Review-by, not sure if the ACK and Review-bys for bpf_arch_text_poke()
>    should be kept, since the changes to it is not trivial
> - Update some commit messages and comments

Given you've been taking a look and had objections in v5, would be great if you
can find some cycles for this v6.

Thanks a lot,
Daniel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-06-30 21:12   ` Daniel Borkmann
@ 2022-07-05 16:00     ` Will Deacon
  -1 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2022-07-05 16:00 UTC (permalink / raw)
  To: Daniel Borkmann, jean-philippe.brucker
  Cc: Xu Kuohai, bpf, linux-arm-kernel, linux-kernel, netdev,
	Mark Rutland, Catalin Marinas, Alexei Starovoitov, Zi Shen Lim,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, David S . Miller, Hideaki YOSHIFUJI,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Jakub Kicinski,
	Jesper Dangaard Brouer, Russell King, James Morse, Hou Tao,
	Jason Wang

Hi Daniel,

On Thu, Jun 30, 2022 at 11:12:54PM +0200, Daniel Borkmann wrote:
> On 6/25/22 6:12 PM, Xu Kuohai wrote:
> > This patchset introduces bpf trampoline on arm64. A bpf trampoline converts
> > native calling convention to bpf calling convention and is used to implement
> > various bpf features, such as fentry, fexit, fmod_ret and struct_ops.
> > 
> > The trampoline introduced does essentially the same thing as the bpf
> > trampoline does on x86.
> > 
> > Tested on raspberry pi 4b and qemu:
> > 
> >   #18 /1     bpf_tcp_ca/dctcp:OK
> >   #18 /2     bpf_tcp_ca/cubic:OK
> >   #18 /3     bpf_tcp_ca/invalid_license:OK
> >   #18 /4     bpf_tcp_ca/dctcp_fallback:OK
> >   #18 /5     bpf_tcp_ca/rel_setsockopt:OK
> >   #18        bpf_tcp_ca:OK
> >   #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
> >   #51 /2     dummy_st_ops/dummy_init_ret_value:OK
> >   #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
> >   #51 /4     dummy_st_ops/dummy_multiple_args:OK
> >   #51        dummy_st_ops:OK
> >   #57 /1     fexit_bpf2bpf/target_no_callees:OK
> >   #57 /2     fexit_bpf2bpf/target_yes_callees:OK
> >   #57 /3     fexit_bpf2bpf/func_replace:OK
> >   #57 /4     fexit_bpf2bpf/func_replace_verify:OK
> >   #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
> >   #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
> >   #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
> >   #57 /8     fexit_bpf2bpf/func_replace_multi:OK
> >   #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
> >   #57        fexit_bpf2bpf:OK
> >   #237       xdp_bpf2bpf:OK
> > 
> > v6:
> > - Since Mark is refactoring arm64 ftrace to support long jump and reduce the
> >    ftrace trampoline overhead, it's not clear how we'll attach bpf trampoline
> >    to regular kernel functions, so remove ftrace related patches for now.
> > - Add long jump support for attaching bpf trampoline to bpf prog, since bpf
> >    trampoline and bpf prog are allocated via vmalloc, there is chance the
> >    distance exceeds the max branch range.
> > - Collect ACK/Review-by, not sure if the ACK and Review-bys for bpf_arch_text_poke()
> >    should be kept, since the changes to it is not trivial
> > - Update some commit messages and comments
> 
> Given you've been taking a look and had objections in v5, would be great if you
> can find some cycles for this v6.

Mark's out at the moment, so I wouldn't hold this series up pending his ack.
However, I agree that it would be good if _somebody_ from the Arm side can
give it the once over, so I've added Jean-Philippe to cc in case he has time
for a quick review. KP said he would also have a look, as he is interested
in this series landing.

Failing that, I'll try to look this week, but I'm off next week and I don't
want this to miss the merge window on my account.

Cheers,

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 2/4] arm64: Add LDR (literal) instruction
  2022-06-25 16:12   ` Xu Kuohai
@ 2022-07-05 16:39     ` Will Deacon
  -1 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2022-07-05 16:39 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Daniel Borkmann, Alexei Starovoitov,
	Zi Shen Lim, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, David S . Miller,
	Hideaki YOSHIFUJI, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Jakub Kicinski, Jesper Dangaard Brouer, Russell King,
	James Morse, Hou Tao, Jason Wang

On Sat, Jun 25, 2022 at 12:12:53PM -0400, Xu Kuohai wrote:
> Add LDR (literal) instruction to load data from address relative to PC.
> This instruction will be used to implement long jump from bpf prog to
> bpf rampoline in the follow-up patch.

typo: trampoline

> 
> The instruction encoding:
> 
>     3       2   2     2                                     0        0
>     0       7   6     4                                     5        0
> +-----+-------+---+-----+-------------------------------------+--------+
> | 0 x | 0 1 1 | 0 | 0 0 |                imm19                |   Rt   |
> +-----+-------+---+-----+-------------------------------------+--------+
> 
> for 32-bit, variant x == 0; for 64-bit, x == 1.
> 
> branch_imm_common() is used to check the distance between pc and target
> address, since it's reused by this patch and LDR (literal) is not a branch
> instruction, rename it to aarch64_imm_common().

nit, but I think "label_imm_common()" would be a better name.

Anyway, I checked the encodings and the code looks good, so:

Acked-by: Will Deacon <will@kernel.org>

Will
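
As a quick cross-check of the layout quoted above, the 64-bit form used by
the bpf plt earlier in this series ("ldr x10, #8", loading the literal placed
two instructions after the ldr) would assemble like this (a sketch, not code
from the series):

    /* 64-bit variant: x = 1, imm19 = byte offset / 4, Rt = 10 */
    u32 insn = 0x58000000 | ((8 / 4) << 5) | 10;    /* == 0x5800004a */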

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-07-05 16:00     ` Will Deacon
@ 2022-07-05 18:34       ` KP Singh
  -1 siblings, 0 replies; 42+ messages in thread
From: KP Singh @ 2022-07-05 18:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Daniel Borkmann, jean-philippe.brucker, Xu Kuohai, bpf,
	linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Alexei Starovoitov, Zi Shen Lim,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On Tue, Jul 5, 2022 at 6:00 PM Will Deacon <will@kernel.org> wrote:
>
> Hi Daniel,
>
> On Thu, Jun 30, 2022 at 11:12:54PM +0200, Daniel Borkmann wrote:
> > On 6/25/22 6:12 PM, Xu Kuohai wrote:
> > > This patchset introduces bpf trampoline on arm64. A bpf trampoline converts
> > > native calling convention to bpf calling convention and is used to implement
> > > various bpf features, such as fentry, fexit, fmod_ret and struct_ops.
> > >
> > > The trampoline introduced does essentially the same thing as the bpf
> > > trampoline does on x86.
> > >
> > > Tested on raspberry pi 4b and qemu:
> > >
> > >   #18 /1     bpf_tcp_ca/dctcp:OK
> > >   #18 /2     bpf_tcp_ca/cubic:OK
> > >   #18 /3     bpf_tcp_ca/invalid_license:OK
> > >   #18 /4     bpf_tcp_ca/dctcp_fallback:OK
> > >   #18 /5     bpf_tcp_ca/rel_setsockopt:OK
> > >   #18        bpf_tcp_ca:OK
> > >   #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
> > >   #51 /2     dummy_st_ops/dummy_init_ret_value:OK
> > >   #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
> > >   #51 /4     dummy_st_ops/dummy_multiple_args:OK
> > >   #51        dummy_st_ops:OK
> > >   #57 /1     fexit_bpf2bpf/target_no_callees:OK
> > >   #57 /2     fexit_bpf2bpf/target_yes_callees:OK
> > >   #57 /3     fexit_bpf2bpf/func_replace:OK
> > >   #57 /4     fexit_bpf2bpf/func_replace_verify:OK
> > >   #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
> > >   #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
> > >   #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
> > >   #57 /8     fexit_bpf2bpf/func_replace_multi:OK
> > >   #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
> > >   #57        fexit_bpf2bpf:OK
> > >   #237       xdp_bpf2bpf:OK
> > >
> > > v6:
> > > - Since Mark is refactoring arm64 ftrace to support long jump and reduce the
> > >    ftrace trampoline overhead, it's not clear how we'll attach bpf trampoline
> > >    to regular kernel functions, so remove ftrace related patches for now.
> > > - Add long jump support for attaching bpf trampoline to bpf prog, since bpf
> > >    trampoline and bpf prog are allocated via vmalloc, there is chance the
> > >    distance exceeds the max branch range.
> > > - Collect ACK/Review-by, not sure if the ACK and Review-bys for bpf_arch_text_poke()
> > >    should be kept, since the changes to it is not trivial

+1 I need to give it another pass.

> > > - Update some commit messages and comments
> >
> > Given you've been taking a look and had objections in v5, would be great if you
> > can find some cycles for this v6.
>
> Mark's out at the moment, so I wouldn't hold this series up pending his ack.
> However, I agree that it would be good if _somebody_ from the Arm side can
> give it the once over, so I've added Jean-Philippe to cc in case he has time

Makes sense,  Jean-Philippe had worked on BPF trampolines for ARM.

> for a quick review. KP said he would also have a look, as he is interested

Thank you so much Will, I will give this another pass before the end
of the week.

> in this series landing.
>
> Failing that, I'll try to look this week, but I'm off next week and I don't
> want this to miss the merge window on my account.

Thanks for being considerate. Much appreciated.

- KP

>
> Cheers,
>
> Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 2/4] arm64: Add LDR (literal) instruction
  2022-07-05 16:39     ` Will Deacon
@ 2022-07-06  1:43       ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-06  1:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Daniel Borkmann, Alexei Starovoitov,
	Zi Shen Lim, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, David S . Miller,
	Hideaki YOSHIFUJI, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Jakub Kicinski, Jesper Dangaard Brouer, Russell King,
	James Morse, Hou Tao, Jason Wang

On 7/6/2022 12:39 AM, Will Deacon wrote:
> On Sat, Jun 25, 2022 at 12:12:53PM -0400, Xu Kuohai wrote:
>> Add LDR (literal) instruction to load data from address relative to PC.
>> This instruction will be used to implement long jump from bpf prog to
>> bpf rampoline in the follow-up patch.
> 
> typo: trampoline
>

will fix

>>
>> The instruction encoding:
>>
>>     3       2   2     2                                     0        0
>>     0       7   6     4                                     5        0
>> +-----+-------+---+-----+-------------------------------------+--------+
>> | 0 x | 0 1 1 | 0 | 0 0 |                imm19                |   Rt   |
>> +-----+-------+---+-----+-------------------------------------+--------+
>>
>> for 32-bit, variant x == 0; for 64-bit, x == 1.
>>
>> branch_imm_common() is used to check the distance between pc and target
>> address, since it's reused by this patch and LDR (literal) is not a branch
>> instruction, rename it to aarch64_imm_common().
> 
> nit, but I think "label_imm_common()" would be a better name.
> 

will rename

> Anyway, I checked the encodings and the code looks good, so:
> 
> Acked-by: Will Deacon <will@kernel.org>
> 
> Will
> .


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-07-05 16:00     ` Will Deacon
@ 2022-07-06 16:08       ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 42+ messages in thread
From: Jean-Philippe Brucker @ 2022-07-06 16:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: Daniel Borkmann, Xu Kuohai, bpf, linux-arm-kernel, linux-kernel,
	netdev, Mark Rutland, Catalin Marinas, Alexei Starovoitov,
	Zi Shen Lim, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, David S . Miller,
	Hideaki YOSHIFUJI, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Jakub Kicinski, Jesper Dangaard Brouer, Russell King,
	James Morse, Hou Tao, Jason Wang

On Tue, Jul 05, 2022 at 05:00:46PM +0100, Will Deacon wrote:
> > Given you've been taking a look and had objections in v5, would be great if you
> > can find some cycles for this v6.
> 
> Mark's out at the moment, so I wouldn't hold this series up pending his ack.
> However, I agree that it would be good if _somebody_ from the Arm side can
> give it the once over, so I've added Jean-Philippe to cc in case he has time
> for a quick review.

I'll take a look. Sorry for not catching this earlier, all versions of the
series somehow ended up in my spams :/

Thanks,
Jean

> KP said he would also have a look, as he is interested
> in this series landing.
> 
> Failing that, I'll try to look this week, but I'm off next week and I don't
> want this to miss the merge window on my account.
> 
> Cheers,
> 
> Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-07-06 16:08       ` Jean-Philippe Brucker
@ 2022-07-06 16:11         ` Will Deacon
  -1 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2022-07-06 16:11 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Daniel Borkmann, Xu Kuohai, bpf, linux-arm-kernel, linux-kernel,
	netdev, Mark Rutland, Catalin Marinas, Alexei Starovoitov,
	Zi Shen Lim, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, David S . Miller,
	Hideaki YOSHIFUJI, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Jakub Kicinski, Jesper Dangaard Brouer, Russell King,
	James Morse, Hou Tao, Jason Wang

On Wed, Jul 06, 2022 at 05:08:49PM +0100, Jean-Philippe Brucker wrote:
> On Tue, Jul 05, 2022 at 05:00:46PM +0100, Will Deacon wrote:
> > > Given you've been taking a look and had objections in v5, would be great if you
> > > can find some cycles for this v6.
> > 
> > Mark's out at the moment, so I wouldn't hold this series up pending his ack.
> > However, I agree that it would be good if _somebody_ from the Arm side can
> > give it the once over, so I've added Jean-Philippe to cc in case he has time
> > for a quick review.
> 
> I'll take a look. Sorry for not catching this earlier, all versions of the
> series somehow ended up in my spams :/

Yeah, same here. It was only Daniel's mail that hit my inbox!

Will

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-07-06 16:11         ` Will Deacon
@ 2022-07-07  2:56           ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-07  2:56 UTC (permalink / raw)
  To: Will Deacon, Jean-Philippe Brucker
  Cc: Daniel Borkmann, bpf, linux-arm-kernel, linux-kernel, netdev,
	Mark Rutland, Catalin Marinas, Alexei Starovoitov, Zi Shen Lim,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, David S . Miller, Hideaki YOSHIFUJI,
	David Ahern, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Jakub Kicinski,
	Jesper Dangaard Brouer, Russell King, James Morse, Hou Tao,
	Jason Wang

On 7/7/2022 12:11 AM, Will Deacon wrote:
> On Wed, Jul 06, 2022 at 05:08:49PM +0100, Jean-Philippe Brucker wrote:
>> On Tue, Jul 05, 2022 at 05:00:46PM +0100, Will Deacon wrote:
>>>> Given you've been taking a look and had objections in v5, would be great if you
>>>> can find some cycles for this v6.
>>>
>>> Mark's out at the moment, so I wouldn't hold this series up pending his ack.
>>> However, I agree that it would be good if _somebody_ from the Arm side can
>>> give it the once over, so I've added Jean-Philippe to cc in case he has time
>>> for a quick review.
>>
>> I'll take a look. Sorry for not catching this earlier, all versions of the
>> series somehow ended up in my spams :/
> 
> Yeah, same here. It was only Daniel's mail that hit my inbox!
> 
> Will
> .

Sorry, there is a misconfiguration in the huawei.com mail server:

https://lore.kernel.org/all/20220523152516.7sr247i3bzwhr44w@quack3.lan/

Our IT admins are working on this issue and hopefully they'll fix it soon.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 0/4] bpf trampoline for arm64
  2022-07-05 18:34       ` KP Singh
@ 2022-07-07  3:35         ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-07  3:35 UTC (permalink / raw)
  To: KP Singh, Will Deacon
  Cc: Daniel Borkmann, jean-philippe.brucker, bpf, linux-arm-kernel,
	linux-kernel, netdev, Mark Rutland, Catalin Marinas,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On 7/6/2022 2:34 AM, KP Singh wrote:
> On Tue, Jul 5, 2022 at 6:00 PM Will Deacon <will@kernel.org> wrote:
>>
>> Hi Daniel,
>>
>> On Thu, Jun 30, 2022 at 11:12:54PM +0200, Daniel Borkmann wrote:
>>> On 6/25/22 6:12 PM, Xu Kuohai wrote:
>>>> This patchset introduces bpf trampoline on arm64. A bpf trampoline converts
>>>> native calling convention to bpf calling convention and is used to implement
>>>> various bpf features, such as fentry, fexit, fmod_ret and struct_ops.
>>>>
>>>> The trampoline introduced does essentially the same thing as the bpf
>>>> trampoline does on x86.
>>>>
>>>> Tested on raspberry pi 4b and qemu:
>>>>
>>>>   #18 /1     bpf_tcp_ca/dctcp:OK
>>>>   #18 /2     bpf_tcp_ca/cubic:OK
>>>>   #18 /3     bpf_tcp_ca/invalid_license:OK
>>>>   #18 /4     bpf_tcp_ca/dctcp_fallback:OK
>>>>   #18 /5     bpf_tcp_ca/rel_setsockopt:OK
>>>>   #18        bpf_tcp_ca:OK
>>>>   #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
>>>>   #51 /2     dummy_st_ops/dummy_init_ret_value:OK
>>>>   #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
>>>>   #51 /4     dummy_st_ops/dummy_multiple_args:OK
>>>>   #51        dummy_st_ops:OK
>>>>   #57 /1     fexit_bpf2bpf/target_no_callees:OK
>>>>   #57 /2     fexit_bpf2bpf/target_yes_callees:OK
>>>>   #57 /3     fexit_bpf2bpf/func_replace:OK
>>>>   #57 /4     fexit_bpf2bpf/func_replace_verify:OK
>>>>   #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
>>>>   #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
>>>>   #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
>>>>   #57 /8     fexit_bpf2bpf/func_replace_multi:OK
>>>>   #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
>>>>   #57        fexit_bpf2bpf:OK
>>>>   #237       xdp_bpf2bpf:OK
>>>>
>>>> v6:
>>>> - Since Mark is refactoring arm64 ftrace to support long jump and reduce the
>>>>    ftrace trampoline overhead, it's not clear how we'll attach bpf trampoline
>>>>    to regular kernel functions, so remove ftrace related patches for now.
>>>> - Add long jump support for attaching bpf trampoline to bpf prog, since bpf
>>>>    trampoline and bpf prog are allocated via vmalloc, there is chance the
>>>>    distance exceeds the max branch range.
>>>> - Collect ACK/Review-by, not sure if the ACK and Review-bys for bpf_arch_text_poke()
>>>>    should be kept, since the changes to it is not trivial
> 
> +1 I need to give it another pass.
>

Thank you very much! But I have to admit a problem. This patchset does
not support attaching a bpf trampoline to regular kernel functions with
ftrace, so LSM still does not work, since the LSM hooks themselves are
regular kernel functions. Sorry about that, and hopefully we'll find an
acceptable solution soon.
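
For context, the main user of that support is BPF LSM, since LSM hooks are
regular kernel functions that the trampoline would attach to; a rough
sketch of such a program, using the usual libbpf conventions (hook and
program names are only an example):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* LSM hooks are regular kernel functions, so attaching this program
   * still needs the ftrace-based attach path on arm64.
   */
  SEC("lsm/bprm_check_security")
  int BPF_PROG(demo_bprm_check, struct linux_binprm *bprm)
  {
          return 0;       /* 0 == allow */
  }

  char LICENSE[] SEC("license") = "GPL";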

>>>> - Update some commit messages and comments
>>>
>>> Given you've been taking a look and had objections in v5, would be great if you
>>> can find some cycles for this v6.
>>
>> Mark's out at the moment, so I wouldn't hold this series up pending his ack.
>> However, I agree that it would be good if _somebody_ from the Arm side can
>> give it the once over, so I've added Jean-Philippe to cc in case he has time
> 
> Makes sense,  Jean-Philippe had worked on BPF trampolines for ARM.
> 
>> for a quick review. KP said he would also have a look, as he is interested
> 
> Thank you so much Will, I will give this another pass before the end
> of the week.
> 
>> in this series landing.
>>
>> Failing that, I'll try to look this week, but I'm off next week and I don't
>> want this to miss the merge window on my account.
> 
> Thanks for being considerate. Much appreciated.
> 
> - KP
> 
>>
>> Cheers,
>>
>> Will
> .


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline for arm64
  2022-06-25 16:12   ` Xu Kuohai
@ 2022-07-07 16:37     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 42+ messages in thread
From: Jean-Philippe Brucker @ 2022-07-07 16:37 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

Nice!  Looks good overall, I just have a few comments inline.

On Sat, Jun 25, 2022 at 12:12:55PM -0400, Xu Kuohai wrote:
> This is arm64 version of commit fec56f5890d9 ("bpf: Introduce BPF
> trampoline"). A bpf trampoline converts native calling convention to bpf
> calling convention and is used to implement various bpf features, such
> as fentry, fexit, fmod_ret and struct_ops.
> 
> This patch does essentially the same thing that bpf trampoline does on x86.
> 
> Tested on raspberry pi 4b and qemu:
> 
>  #18 /1     bpf_tcp_ca/dctcp:OK
>  #18 /2     bpf_tcp_ca/cubic:OK
>  #18 /3     bpf_tcp_ca/invalid_license:OK
>  #18 /4     bpf_tcp_ca/dctcp_fallback:OK
>  #18 /5     bpf_tcp_ca/rel_setsockopt:OK
>  #18        bpf_tcp_ca:OK
>  #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
>  #51 /2     dummy_st_ops/dummy_init_ret_value:OK
>  #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
>  #51 /4     dummy_st_ops/dummy_multiple_args:OK
>  #51        dummy_st_ops:OK
>  #57 /1     fexit_bpf2bpf/target_no_callees:OK
>  #57 /2     fexit_bpf2bpf/target_yes_callees:OK
>  #57 /3     fexit_bpf2bpf/func_replace:OK
>  #57 /4     fexit_bpf2bpf/func_replace_verify:OK
>  #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
>  #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
>  #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
>  #57 /8     fexit_bpf2bpf/func_replace_multi:OK
>  #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
>  #57        fexit_bpf2bpf:OK
>  #237       xdp_bpf2bpf:OK
> 
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> Acked-by: Song Liu <songliubraving@fb.com>
> Acked-by: KP Singh <kpsingh@kernel.org>
> ---
>  arch/arm64/net/bpf_jit_comp.c | 387 +++++++++++++++++++++++++++++++++-
>  1 file changed, 384 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index e0e9c705a2e4..dd5a843601b8 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -176,6 +176,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val,
>  	}
>  }
>  
> +static inline void emit_call(u64 target, struct jit_ctx *ctx)
> +{
> +	u8 tmp = bpf2a64[TMP_REG_1];
> +
> +	emit_addr_mov_i64(tmp, target, ctx);
> +	emit(A64_BLR(tmp), ctx);
> +}
> +
>  static inline int bpf2a64_offset(int bpf_insn, int off,
>  				 const struct jit_ctx *ctx)
>  {
> @@ -1073,8 +1081,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
>  					    &func_addr, &func_addr_fixed);
>  		if (ret < 0)
>  			return ret;
> -		emit_addr_mov_i64(tmp, func_addr, ctx);
> -		emit(A64_BLR(tmp), ctx);
> +		emit_call(func_addr, ctx);
>  		emit(A64_MOV(1, r0, A64_R(0)), ctx);
>  		break;
>  	}
> @@ -1418,6 +1425,13 @@ static int validate_code(struct jit_ctx *ctx)
>  		if (a64_insn == AARCH64_BREAK_FAULT)
>  			return -1;
>  	}
> +	return 0;
> +}
> +
> +static int validate_ctx(struct jit_ctx *ctx)
> +{
> +	if (validate_code(ctx))
> +		return -1;
>  
>  	if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries))
>  		return -1;
> @@ -1547,7 +1561,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  	build_plt(&ctx, true);
>  
>  	/* 3. Extra pass to validate JITed code. */
> -	if (validate_code(&ctx)) {
> +	if (validate_ctx(&ctx)) {
>  		bpf_jit_binary_free(header);
>  		prog = orig_prog;
>  		goto out_off;
> @@ -1625,6 +1639,373 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>  	return true;
>  }
>  
> +static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
> +			    int args_off, int retval_off, int run_ctx_off,
> +			    bool save_ret)
> +{
> +	u32 *branch;
> +	u64 enter_prog;
> +	u64 exit_prog;
> +	u8 tmp = bpf2a64[TMP_REG_1];

I wonder if we should stick with A64_R(x) rather than bpf2a64[y]. After
all this isn't translated BPF code but direct arm64 assembly. In any case
it should be consistent (below functions use A64_R(10))
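
For example, one way to make it consistent would be to open-code the
register here as well (just a sketch):

  /* use the arm64 register directly, as the helpers below do */
  u8 tmp = A64_R(10);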

> +	u8 r0 = bpf2a64[BPF_REG_0];
> +	struct bpf_prog *p = l->link.prog;
> +	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
> +
> +	if (p->aux->sleepable) {
> +		enter_prog = (u64)__bpf_prog_enter_sleepable;
> +		exit_prog = (u64)__bpf_prog_exit_sleepable;
> +	} else {
> +		enter_prog = (u64)__bpf_prog_enter;
> +		exit_prog = (u64)__bpf_prog_exit;
> +	}
> +
> +	if (l->cookie == 0) {
> +		/* if cookie is zero, one instruction is enough to store it */
> +		emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx);
> +	} else {
> +		emit_a64_mov_i64(tmp, l->cookie, ctx);
> +		emit(A64_STR64I(tmp, A64_SP, run_ctx_off + cookie_off), ctx);
> +	}
> +
> +	/* save p to callee saved register x19 to avoid loading p with mov_i64
> +	 * each time.
> +	 */
> +	emit_addr_mov_i64(A64_R(19), (const u64)p, ctx);
> +
> +	/* arg1: prog */
> +	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
> +	/* arg2: &run_ctx */
> +	emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx);
> +
> +	emit_call(enter_prog, ctx);
> +
> +	/* if (__bpf_prog_enter(prog) == 0)
> +	 *         goto skip_exec_of_prog;
> +	 */
> +	branch = ctx->image + ctx->idx;
> +	emit(A64_NOP, ctx);
> +
> +	/* save return value to callee saved register x20 */
> +	emit(A64_MOV(1, A64_R(20), r0), ctx);

Shouldn't that be x0?  r0 is x7

> +
> +	emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
> +	if (!p->jited)
> +		emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
> +
> +	emit_call((const u64)p->bpf_func, ctx);
> +
> +	/* store return value */
> +	if (save_ret)
> +		emit(A64_STR64I(r0, A64_SP, retval_off), ctx);

Here too I think it should be x0. I'm guessing r0 may work for jitted
functions but not interpreted ones
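
For illustration, the change being suggested at both sites might look
roughly like this (only a sketch, not the final patch):

  /* __bpf_prog_enter() returns the start time in x0 (AAPCS) */
  emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);

  /* ... and bpf_func() also returns its value in x0, whether the prog
   * is JITed or interpreted
   */
  if (save_ret)
          emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);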

> +
> +	if (ctx->image) {
> +		int offset = &ctx->image[ctx->idx] - branch;
> +		*branch = A64_CBZ(1, A64_R(0), offset);
> +	}
> +
> +	/* arg1: prog */
> +	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
> +	/* arg2: start time */
> +	emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx);

By the way, it looks like the timestamp could be moved into
bpf_tramp_run_ctx now?  Nothing to do with this series, just a general
cleanup

> +	/* arg3: &run_ctx */
> +	emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx);
> +
> +	emit_call(exit_prog, ctx);
> +}
> +
> +static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl,
> +			       int args_off, int retval_off, int run_ctx_off,
> +			       u32 **branches)
> +{
> +	int i;
> +
> +	/* The first fmod_ret program will receive a garbage return value.
> +	 * Set this to 0 to avoid confusing the program.
> +	 */
> +	emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx);
> +	for (i = 0; i < tl->nr_links; i++) {
> +		invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off,
> +				run_ctx_off, true);
> +		/* if (*(u64 *)(sp + retval_off) !=  0)
> +		 *	goto do_fexit;
> +		 */
> +		emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx);
> +		/* Save the location of branch, and generate a nop.
> +		 * This nop will be replaced with a cbnz later.
> +		 */
> +		branches[i] = ctx->image + ctx->idx;
> +		emit(A64_NOP, ctx);
> +	}
> +}
> +
> +static void save_args(struct jit_ctx *ctx, int args_off, int nargs)
> +{
> +	int i;
> +
> +	for (i = 0; i < nargs; i++) {
> +		emit(A64_STR64I(i, A64_SP, args_off), ctx);
> +		args_off += 8;
> +	}
> +}
> +
> +static void restore_args(struct jit_ctx *ctx, int args_off, int nargs)
> +{
> +	int i;
> +
> +	for (i = 0; i < nargs; i++) {
> +		emit(A64_LDR64I(i, A64_SP, args_off), ctx);
> +		args_off += 8;
> +	}
> +}
> +
> +/* Based on the x86's implementation of arch_prepare_bpf_trampoline().
> + *
> + * bpf prog and function entry before bpf trampoline hooked:
> + *   mov x9, lr
> + *   nop
> + *
> + * bpf prog and function entry after bpf trampoline hooked:
> + *   mov x9, lr
> + *   bl  <bpf_trampoline or plt>
> + *
> + */
> +static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
> +			      struct bpf_tramp_links *tlinks, void *orig_call,
> +			      int nargs, u32 flags)
> +{
> +	int i;
> +	int stack_size;
> +	int retaddr_off;
> +	int regs_off;
> +	int retval_off;
> +	int args_off;
> +	int nargs_off;
> +	int ip_off;
> +	int run_ctx_off;
> +	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> +	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> +	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> +	bool save_ret;
> +	u32 **branches = NULL;
> +
> +	/* trampoline stack layout:
> +	 *                  [ parent ip         ]

nit: maybe s/ip/pc/ here and elsewhere

> +	 *                  [ FP                ]
> +	 * SP + retaddr_off [ self ip           ]
> +	 *                  [ FP                ]
> +	 *
> +	 *                  [ padding           ] align SP to multiples of 16
> +	 *
> +	 *                  [ x20               ] callee saved reg x20
> +	 * SP + regs_off    [ x19               ] callee saved reg x19
> +	 *
> +	 * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
> +	 *                                        BPF_TRAMP_F_RET_FENTRY_RET
> +	 *
> +	 *                  [ argN              ]
> +	 *                  [ ...               ]
> +	 * SP + args_off    [ arg1              ]
> +	 *
> +	 * SP + nargs_off   [ args count        ]
> +	 *
> +	 * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
> +	 *
> +	 * SP + run_ctx_off [ bpf_tramp_run_ctx ]
> +	 */
> +
> +	stack_size = 0;
> +	run_ctx_off = stack_size;
> +	/* room for bpf_tramp_run_ctx */
> +	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
> +
> +	ip_off = stack_size;
> +	/* room for IP address argument */
> +	if (flags & BPF_TRAMP_F_IP_ARG)
> +		stack_size += 8;
> +
> +	nargs_off = stack_size;
> +	/* room for args count */
> +	stack_size += 8;
> +
> +	args_off = stack_size;
> +	/* room for args */
> +	stack_size += nargs * 8;
> +
> +	/* room for return value */
> +	retval_off = stack_size;
> +	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
> +	if (save_ret)
> +		stack_size += 8;
> +
> +	/* room for callee saved registers, currently x19 and x20 are used */
> +	regs_off = stack_size;
> +	stack_size += 16;
> +
> +	/* round up to multiples of 16 to avoid SPAlignmentFault */
> +	stack_size = round_up(stack_size, 16);
> +
> +	/* return address locates above FP */
> +	retaddr_off = stack_size + 8;
> +
> +	/* bpf trampoline may be invoked by 3 instruction types:
> +	 * 1. bl, attached to bpf prog or kernel function via short jump
> +	 * 2. br, attached to bpf prog or kernel function via long jump
> +	 * 3. blr, working as a function pointer, used by struct_ops.
> +	 * So BTI_JC should used here to support both br and blr.
> +	 */
> +	emit_bti(A64_BTI_JC, ctx);
> +
> +	/* frame for parent function */
> +	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
> +	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> +
> +	/* frame for patched function */
> +	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
> +	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> +
> +	/* allocate stack space */
> +	emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
> +
> +	if (flags & BPF_TRAMP_F_IP_ARG) {
> +		/* save ip address of the traced function */
> +		emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx);
> +		emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx);
> +	}
> +
> +	/* save args count*/
> +	emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx);
> +	emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx);
> +
> +	/* save args */
> +	save_args(ctx, args_off, nargs);
> +
> +	/* save callee saved registers */
> +	emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx);
> +	emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
> +		emit_call((const u64)__bpf_tramp_enter, ctx);
> +	}
> +
> +	for (i = 0; i < fentry->nr_links; i++)
> +		invoke_bpf_prog(ctx, fentry->links[i], args_off,
> +				retval_off, run_ctx_off,
> +				flags & BPF_TRAMP_F_RET_FENTRY_RET);
> +
> +	if (fmod_ret->nr_links) {
> +		branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *),
> +				   GFP_KERNEL);
> +		if (!branches)
> +			return -ENOMEM;
> +
> +		invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off,
> +				   run_ctx_off, branches);
> +	}
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		restore_args(ctx, args_off, nargs);
> +		/* call original func */
> +		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
> +		emit(A64_BLR(A64_R(10)), ctx);

I don't think we can do this when BTI is enabled because we're not jumping
to a BTI instruction. We could introduce one in a patched BPF function
(there currently is one if CONFIG_ARM64_PTR_AUTH_KERNEL), but probably not
in a kernel function.
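
To illustrate the concern, reusing the entry layout from the patch comment
above (illustration only):

  mov x9, lr          // patchable entry
  bl  <trampoline>    // patchable entry
  <body>              // "blr x10" from the trampoline lands here,
                      // where no BTI landing pad is guaranteed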

We could do what FUNCTION_GRAPH_TRACER does and return to the patched
function after modifying its LR. Not sure whether that works with pointer
auth, though.

> +		/* store return value */
> +		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
> +		/* reserve a nop for bpf_tramp_image_put */
> +		im->ip_after_call = ctx->image + ctx->idx;
> +		emit(A64_NOP, ctx);
> +	}
> +
> +	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
> +	for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) {
> +		int offset = &ctx->image[ctx->idx] - branches[i];
> +		*branches[i] = A64_CBNZ(1, A64_R(10), offset);
> +	}
> +
> +	for (i = 0; i < fexit->nr_links; i++)
> +		invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off,
> +				run_ctx_off, false);
> +
> +	if (flags & BPF_TRAMP_F_RESTORE_REGS)
> +		restore_args(ctx, args_off, nargs);

I guess the combination RESTORE_REGS | CALL_ORIG doesn't make much sense,
but it's not disallowed by the documentation. So it might be safer to move
this after the next if() to avoid clobbering the regs.
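
For illustration, the reordering suggested here would roughly be (a sketch
built from the two hunks above, not the final patch):

  if (flags & BPF_TRAMP_F_CALL_ORIG) {
          im->ip_epilogue = ctx->image + ctx->idx;
          emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
          emit_call((const u64)__bpf_tramp_exit, ctx);
  }

  /* restore the traced function's argument registers only after the
   * last helper call, so __bpf_tramp_exit() cannot clobber them
   */
  if (flags & BPF_TRAMP_F_RESTORE_REGS)
          restore_args(ctx, args_off, nargs);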

Thanks,
Jean

> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		im->ip_epilogue = ctx->image + ctx->idx;
> +		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
> +		emit_call((const u64)__bpf_tramp_exit, ctx);
> +	}
> +
> +	/* restore callee saved register x19 and x20 */
> +	emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx);
> +	emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
> +
> +	if (save_ret)
> +		emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx);
> +
> +	/* reset SP  */
> +	emit(A64_MOV(1, A64_SP, A64_FP), ctx);
> +
> +	/* pop frames  */
> +	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
> +	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
> +
> +	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> +		/* skip patched function, return to parent */
> +		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
> +		emit(A64_RET(A64_R(9)), ctx);
> +	} else {
> +		/* return to patched function */
> +		emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
> +		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
> +		emit(A64_RET(A64_R(10)), ctx);
> +	}
> +
> +	if (ctx->image)
> +		bpf_flush_icache(ctx->image, ctx->image + ctx->idx);
> +
> +	kfree(branches);
> +
> +	return ctx->idx;
> +}
> +
> +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
> +				void *image_end, const struct btf_func_model *m,
> +				u32 flags, struct bpf_tramp_links *tlinks,
> +				void *orig_call)
> +{
> +	int ret;
> +	int nargs = m->nr_args;
> +	int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE;
> +	struct jit_ctx ctx = {
> +		.image = NULL,
> +		.idx = 0,
> +	};
> +
> +	/* the first 8 arguments are passed by registers */
> +	if (nargs > 8)
> +		return -ENOTSUPP;
> +
> +	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret > max_insns)
> +		return -EFBIG;
> +
> +	ctx.image = image;
> +	ctx.idx = 0;
> +
> +	jit_fill_hole(image, (unsigned int)(image_end - image));
> +	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
> +
> +	if (ret > 0 && validate_code(&ctx) < 0)
> +		ret = -EINVAL;
> +
> +	if (ret > 0)
> +		ret *= AARCH64_INSN_SIZE;
> +
> +	return ret;
> +}
> +
>  static bool is_long_jump(void *ip, void *target)
>  {
>  	long offset;
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline for arm64
@ 2022-07-07 16:37     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 42+ messages in thread
From: Jean-Philippe Brucker @ 2022-07-07 16:37 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

Nice!  Looks good overall, I just have a few comments inline.

On Sat, Jun 25, 2022 at 12:12:55PM -0400, Xu Kuohai wrote:
> This is arm64 version of commit fec56f5890d9 ("bpf: Introduce BPF
> trampoline"). A bpf trampoline converts native calling convention to bpf
> calling convention and is used to implement various bpf features, such
> as fentry, fexit, fmod_ret and struct_ops.
> 
> This patch does essentially the same thing that bpf trampoline does on x86.
> 
> Tested on raspberry pi 4b and qemu:
> 
>  #18 /1     bpf_tcp_ca/dctcp:OK
>  #18 /2     bpf_tcp_ca/cubic:OK
>  #18 /3     bpf_tcp_ca/invalid_license:OK
>  #18 /4     bpf_tcp_ca/dctcp_fallback:OK
>  #18 /5     bpf_tcp_ca/rel_setsockopt:OK
>  #18        bpf_tcp_ca:OK
>  #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
>  #51 /2     dummy_st_ops/dummy_init_ret_value:OK
>  #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
>  #51 /4     dummy_st_ops/dummy_multiple_args:OK
>  #51        dummy_st_ops:OK
>  #57 /1     fexit_bpf2bpf/target_no_callees:OK
>  #57 /2     fexit_bpf2bpf/target_yes_callees:OK
>  #57 /3     fexit_bpf2bpf/func_replace:OK
>  #57 /4     fexit_bpf2bpf/func_replace_verify:OK
>  #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
>  #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
>  #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
>  #57 /8     fexit_bpf2bpf/func_replace_multi:OK
>  #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
>  #57        fexit_bpf2bpf:OK
>  #237       xdp_bpf2bpf:OK
> 
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> Acked-by: Song Liu <songliubraving@fb.com>
> Acked-by: KP Singh <kpsingh@kernel.org>
> ---
>  arch/arm64/net/bpf_jit_comp.c | 387 +++++++++++++++++++++++++++++++++-
>  1 file changed, 384 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index e0e9c705a2e4..dd5a843601b8 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -176,6 +176,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val,
>  	}
>  }
>  
> +static inline void emit_call(u64 target, struct jit_ctx *ctx)
> +{
> +	u8 tmp = bpf2a64[TMP_REG_1];
> +
> +	emit_addr_mov_i64(tmp, target, ctx);
> +	emit(A64_BLR(tmp), ctx);
> +}
> +
>  static inline int bpf2a64_offset(int bpf_insn, int off,
>  				 const struct jit_ctx *ctx)
>  {
> @@ -1073,8 +1081,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
>  					    &func_addr, &func_addr_fixed);
>  		if (ret < 0)
>  			return ret;
> -		emit_addr_mov_i64(tmp, func_addr, ctx);
> -		emit(A64_BLR(tmp), ctx);
> +		emit_call(func_addr, ctx);
>  		emit(A64_MOV(1, r0, A64_R(0)), ctx);
>  		break;
>  	}
> @@ -1418,6 +1425,13 @@ static int validate_code(struct jit_ctx *ctx)
>  		if (a64_insn == AARCH64_BREAK_FAULT)
>  			return -1;
>  	}
> +	return 0;
> +}
> +
> +static int validate_ctx(struct jit_ctx *ctx)
> +{
> +	if (validate_code(ctx))
> +		return -1;
>  
>  	if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries))
>  		return -1;
> @@ -1547,7 +1561,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  	build_plt(&ctx, true);
>  
>  	/* 3. Extra pass to validate JITed code. */
> -	if (validate_code(&ctx)) {
> +	if (validate_ctx(&ctx)) {
>  		bpf_jit_binary_free(header);
>  		prog = orig_prog;
>  		goto out_off;
> @@ -1625,6 +1639,373 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>  	return true;
>  }
>  
> +static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
> +			    int args_off, int retval_off, int run_ctx_off,
> +			    bool save_ret)
> +{
> +	u32 *branch;
> +	u64 enter_prog;
> +	u64 exit_prog;
> +	u8 tmp = bpf2a64[TMP_REG_1];

I wonder if we should stick with A64_R(x) rather than bpf2a64[y]. After
all this isn't translated BPF code but direct arm64 assembly. In any case
it should be consistent (below functions use A64_R(10))

> +	u8 r0 = bpf2a64[BPF_REG_0];
> +	struct bpf_prog *p = l->link.prog;
> +	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
> +
> +	if (p->aux->sleepable) {
> +		enter_prog = (u64)__bpf_prog_enter_sleepable;
> +		exit_prog = (u64)__bpf_prog_exit_sleepable;
> +	} else {
> +		enter_prog = (u64)__bpf_prog_enter;
> +		exit_prog = (u64)__bpf_prog_exit;
> +	}
> +
> +	if (l->cookie == 0) {
> +		/* if cookie is zero, one instruction is enough to store it */
> +		emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx);
> +	} else {
> +		emit_a64_mov_i64(tmp, l->cookie, ctx);
> +		emit(A64_STR64I(tmp, A64_SP, run_ctx_off + cookie_off), ctx);
> +	}
> +
> +	/* save p to callee saved register x19 to avoid loading p with mov_i64
> +	 * each time.
> +	 */
> +	emit_addr_mov_i64(A64_R(19), (const u64)p, ctx);
> +
> +	/* arg1: prog */
> +	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
> +	/* arg2: &run_ctx */
> +	emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx);
> +
> +	emit_call(enter_prog, ctx);
> +
> +	/* if (__bpf_prog_enter(prog) == 0)
> +	 *         goto skip_exec_of_prog;
> +	 */
> +	branch = ctx->image + ctx->idx;
> +	emit(A64_NOP, ctx);
> +
> +	/* save return value to callee saved register x20 */
> +	emit(A64_MOV(1, A64_R(20), r0), ctx);

Shouldn't that be x0?  r0 is x7

> +
> +	emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
> +	if (!p->jited)
> +		emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
> +
> +	emit_call((const u64)p->bpf_func, ctx);
> +
> +	/* store return value */
> +	if (save_ret)
> +		emit(A64_STR64I(r0, A64_SP, retval_off), ctx);

Here too I think it should be x0. I'm guessing r0 may work for jitted
functions but not interpreted ones

> +
> +	if (ctx->image) {
> +		int offset = &ctx->image[ctx->idx] - branch;
> +		*branch = A64_CBZ(1, A64_R(0), offset);
> +	}
> +
> +	/* arg1: prog */
> +	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
> +	/* arg2: start time */
> +	emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx);

By the way, it looks like the timestamp could be moved into
bpf_tramp_run_ctx now?  Nothing to do with this series, just a general
cleanup

> +	/* arg3: &run_ctx */
> +	emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx);
> +
> +	emit_call(exit_prog, ctx);
> +}
> +
> +static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl,
> +			       int args_off, int retval_off, int run_ctx_off,
> +			       u32 **branches)
> +{
> +	int i;
> +
> +	/* The first fmod_ret program will receive a garbage return value.
> +	 * Set this to 0 to avoid confusing the program.
> +	 */
> +	emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx);
> +	for (i = 0; i < tl->nr_links; i++) {
> +		invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off,
> +				run_ctx_off, true);
> +		/* if (*(u64 *)(sp + retval_off) !=  0)
> +		 *	goto do_fexit;
> +		 */
> +		emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx);
> +		/* Save the location of branch, and generate a nop.
> +		 * This nop will be replaced with a cbnz later.
> +		 */
> +		branches[i] = ctx->image + ctx->idx;
> +		emit(A64_NOP, ctx);
> +	}
> +}
> +
> +static void save_args(struct jit_ctx *ctx, int args_off, int nargs)
> +{
> +	int i;
> +
> +	for (i = 0; i < nargs; i++) {
> +		emit(A64_STR64I(i, A64_SP, args_off), ctx);
> +		args_off += 8;
> +	}
> +}
> +
> +static void restore_args(struct jit_ctx *ctx, int args_off, int nargs)
> +{
> +	int i;
> +
> +	for (i = 0; i < nargs; i++) {
> +		emit(A64_LDR64I(i, A64_SP, args_off), ctx);
> +		args_off += 8;
> +	}
> +}
> +
> +/* Based on the x86's implementation of arch_prepare_bpf_trampoline().
> + *
> + * bpf prog and function entry before bpf trampoline hooked:
> + *   mov x9, lr
> + *   nop
> + *
> + * bpf prog and function entry after bpf trampoline hooked:
> + *   mov x9, lr
> + *   bl  <bpf_trampoline or plt>
> + *
> + */
> +static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
> +			      struct bpf_tramp_links *tlinks, void *orig_call,
> +			      int nargs, u32 flags)
> +{
> +	int i;
> +	int stack_size;
> +	int retaddr_off;
> +	int regs_off;
> +	int retval_off;
> +	int args_off;
> +	int nargs_off;
> +	int ip_off;
> +	int run_ctx_off;
> +	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> +	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> +	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> +	bool save_ret;
> +	u32 **branches = NULL;
> +
> +	/* trampoline stack layout:
> +	 *                  [ parent ip         ]

nit: maybe s/ip/pc/ here and elsewhere

> +	 *                  [ FP                ]
> +	 * SP + retaddr_off [ self ip           ]
> +	 *                  [ FP                ]
> +	 *
> +	 *                  [ padding           ] align SP to multiples of 16
> +	 *
> +	 *                  [ x20               ] callee saved reg x20
> +	 * SP + regs_off    [ x19               ] callee saved reg x19
> +	 *
> +	 * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
> +	 *                                        BPF_TRAMP_F_RET_FENTRY_RET
> +	 *
> +	 *                  [ argN              ]
> +	 *                  [ ...               ]
> +	 * SP + args_off    [ arg1              ]
> +	 *
> +	 * SP + nargs_off   [ args count        ]
> +	 *
> +	 * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
> +	 *
> +	 * SP + run_ctx_off [ bpf_tramp_run_ctx ]
> +	 */
> +
> +	stack_size = 0;
> +	run_ctx_off = stack_size;
> +	/* room for bpf_tramp_run_ctx */
> +	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
> +
> +	ip_off = stack_size;
> +	/* room for IP address argument */
> +	if (flags & BPF_TRAMP_F_IP_ARG)
> +		stack_size += 8;
> +
> +	nargs_off = stack_size;
> +	/* room for args count */
> +	stack_size += 8;
> +
> +	args_off = stack_size;
> +	/* room for args */
> +	stack_size += nargs * 8;
> +
> +	/* room for return value */
> +	retval_off = stack_size;
> +	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
> +	if (save_ret)
> +		stack_size += 8;
> +
> +	/* room for callee saved registers, currently x19 and x20 are used */
> +	regs_off = stack_size;
> +	stack_size += 16;
> +
> +	/* round up to multiples of 16 to avoid SPAlignmentFault */
> +	stack_size = round_up(stack_size, 16);
> +
> +	/* return address locates above FP */
> +	retaddr_off = stack_size + 8;
> +
> +	/* bpf trampoline may be invoked by 3 instruction types:
> +	 * 1. bl, attached to bpf prog or kernel function via short jump
> +	 * 2. br, attached to bpf prog or kernel function via long jump
> +	 * 3. blr, working as a function pointer, used by struct_ops.
> +	 * So BTI_JC should used here to support both br and blr.
> +	 */
> +	emit_bti(A64_BTI_JC, ctx);
> +
> +	/* frame for parent function */
> +	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
> +	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> +
> +	/* frame for patched function */
> +	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
> +	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> +
> +	/* allocate stack space */
> +	emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
> +
> +	if (flags & BPF_TRAMP_F_IP_ARG) {
> +		/* save ip address of the traced function */
> +		emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx);
> +		emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx);
> +	}
> +
> +	/* save args count*/
> +	emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx);
> +	emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx);
> +
> +	/* save args */
> +	save_args(ctx, args_off, nargs);
> +
> +	/* save callee saved registers */
> +	emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx);
> +	emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
> +		emit_call((const u64)__bpf_tramp_enter, ctx);
> +	}
> +
> +	for (i = 0; i < fentry->nr_links; i++)
> +		invoke_bpf_prog(ctx, fentry->links[i], args_off,
> +				retval_off, run_ctx_off,
> +				flags & BPF_TRAMP_F_RET_FENTRY_RET);
> +
> +	if (fmod_ret->nr_links) {
> +		branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *),
> +				   GFP_KERNEL);
> +		if (!branches)
> +			return -ENOMEM;
> +
> +		invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off,
> +				   run_ctx_off, branches);
> +	}
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		restore_args(ctx, args_off, nargs);
> +		/* call original func */
> +		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
> +		emit(A64_BLR(A64_R(10)), ctx);

I don't think we can do this when BTI is enabled because we're not jumping
to a BTI instruction. We could introduce one in a patched BPF function
(there currently is one if CONFIG_ARM64_PTR_AUTH_KERNEL), but probably not
in a kernel function.

We could do what FUNCTION_GRAPH_TRACER does and return to the patched
function after modifying its LR. Not sure whether that works with pointer
auth though.
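
Something like this, perhaps (rough sketch only, untested; "resume" is a
made-up label inside the trampoline, retaddr_off/retval_off as in your
patch):

	ldr	x10, [sp, #retaddr_off]	// address right after the patchsite
	adr	lr, resume		// patched function returns here
	ret	x10			// RET needs no BTI landing pad
resume:
	str	x0, [sp, #retval_off]	// store return value as before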

> +		/* store return value */
> +		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
> +		/* reserve a nop for bpf_tramp_image_put */
> +		im->ip_after_call = ctx->image + ctx->idx;
> +		emit(A64_NOP, ctx);
> +	}
> +
> +	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
> +	for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) {
> +		int offset = &ctx->image[ctx->idx] - branches[i];
> +		*branches[i] = A64_CBNZ(1, A64_R(10), offset);
> +	}
> +
> +	for (i = 0; i < fexit->nr_links; i++)
> +		invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off,
> +				run_ctx_off, false);
> +
> +	if (flags & BPF_TRAMP_F_RESTORE_REGS)
> +		restore_args(ctx, args_off, nargs);

I guess the combination RESTORE_REGS | CALL_ORIG doesn't make much sense,
but it's not disallowed by the documentation. So it might be safer to move
this after the next if() to avoid clobbering the regs.
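
i.e. roughly (only to illustrate the ordering, not tested):

	if (flags & BPF_TRAMP_F_CALL_ORIG) {
		im->ip_epilogue = ctx->image + ctx->idx;
		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
		emit_call((const u64)__bpf_tramp_exit, ctx);
	}

	/* restore args only after __bpf_tramp_exit(), which may clobber
	 * the argument registers
	 */
	if (flags & BPF_TRAMP_F_RESTORE_REGS)
		restore_args(ctx, args_off, nargs);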

Thanks,
Jean

> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		im->ip_epilogue = ctx->image + ctx->idx;
> +		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
> +		emit_call((const u64)__bpf_tramp_exit, ctx);
> +	}
> +
> +	/* restore callee saved register x19 and x20 */
> +	emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx);
> +	emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
> +
> +	if (save_ret)
> +		emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx);
> +
> +	/* reset SP  */
> +	emit(A64_MOV(1, A64_SP, A64_FP), ctx);
> +
> +	/* pop frames  */
> +	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
> +	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
> +
> +	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> +		/* skip patched function, return to parent */
> +		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
> +		emit(A64_RET(A64_R(9)), ctx);
> +	} else {
> +		/* return to patched function */
> +		emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
> +		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
> +		emit(A64_RET(A64_R(10)), ctx);
> +	}
> +
> +	if (ctx->image)
> +		bpf_flush_icache(ctx->image, ctx->image + ctx->idx);
> +
> +	kfree(branches);
> +
> +	return ctx->idx;
> +}
> +
> +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
> +				void *image_end, const struct btf_func_model *m,
> +				u32 flags, struct bpf_tramp_links *tlinks,
> +				void *orig_call)
> +{
> +	int ret;
> +	int nargs = m->nr_args;
> +	int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE;
> +	struct jit_ctx ctx = {
> +		.image = NULL,
> +		.idx = 0,
> +	};
> +
> +	/* the first 8 arguments are passed by registers */
> +	if (nargs > 8)
> +		return -ENOTSUPP;
> +
> +	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret > max_insns)
> +		return -EFBIG;
> +
> +	ctx.image = image;
> +	ctx.idx = 0;
> +
> +	jit_fill_hole(image, (unsigned int)(image_end - image));
> +	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
> +
> +	if (ret > 0 && validate_code(&ctx) < 0)
> +		ret = -EINVAL;
> +
> +	if (ret > 0)
> +		ret *= AARCH64_INSN_SIZE;
> +
> +	return ret;
> +}
> +
>  static bool is_long_jump(void *ip, void *target)
>  {
>  	long offset;
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 3/4] bpf, arm64: Impelment bpf_arch_text_poke() for arm64
  2022-06-25 16:12   ` Xu Kuohai
@ 2022-07-07 16:41     ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 42+ messages in thread
From: Jean-Philippe Brucker @ 2022-07-07 16:41 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On Sat, Jun 25, 2022 at 12:12:54PM -0400, Xu Kuohai wrote:
> Impelment bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline

Implement

> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index f08a4447d363..e0e9c705a2e4 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -9,6 +9,7 @@
>  
>  #include <linux/bitfield.h>
>  #include <linux/bpf.h>
> +#include <linux/memory.h>

nit: keep sorted

>  #include <linux/filter.h>
>  #include <linux/printk.h>
>  #include <linux/slab.h>
> @@ -18,6 +19,7 @@
>  #include <asm/cacheflush.h>
>  #include <asm/debug-monitors.h>
>  #include <asm/insn.h>
> +#include <asm/patching.h>
>  #include <asm/set_memory.h>
>  
>  #include "bpf_jit.h"
> @@ -78,6 +80,15 @@ struct jit_ctx {
>  	int fpb_offset;
>  };
>  
> +struct bpf_plt {
> +	u32 insn_ldr; /* load target */
> +	u32 insn_br;  /* branch to target */
> +	u64 target;   /* target value */
> +} __packed;

don't need __packed

> +
> +#define PLT_TARGET_SIZE   sizeof_field(struct bpf_plt, target)
> +#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
> +
>  static inline void emit(const u32 insn, struct jit_ctx *ctx)
>  {
>  	if (ctx->image != NULL)
> @@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val,
>  	}
>  }
>  
> +static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
> +{
> +	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> +		emit(insn, ctx);
> +}
> +
>  /*
>   * Kernel addresses in the vmalloc space use at most 48 bits, and the
>   * remaining bits are guaranteed to be 0x1. So we can compose the address
> @@ -235,13 +252,30 @@ static bool is_lsi_offset(int offset, int scale)
>  	return true;
>  }
>  
> +/* generated prologue:
> + *      bti c // if CONFIG_ARM64_BTI_KERNEL
> + *      mov x9, lr
> + *      nop  // POKE_OFFSET
> + *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL

Any reason for the change regarding BTI and pointer auth?  We used to put
'bti c' at the function entry if (BTI && !PA), or 'paciasp' if (BTI && PA),
because 'paciasp' is an implicit BTI.

> + *      stp x29, lr, [sp, #-16]!
> + *      mov x29, sp
> + *      stp x19, x20, [sp, #-16]!
> + *      stp x21, x22, [sp, #-16]!
> + *      stp x25, x26, [sp, #-16]!
> + *      stp x27, x28, [sp, #-16]!
> + *      mov x25, sp
> + *      mov tcc, #0
> + *      // PROLOGUE_OFFSET
> + */
> +
> +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
> +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
> +
> +/* Offset of nop instruction in bpf prog entry to be poked */
> +#define POKE_OFFSET (BTI_INSNS + 1)
> +
>  /* Tail call offset to jump into */
> -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
> -	IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
> -#define PROLOGUE_OFFSET 9
> -#else
> -#define PROLOGUE_OFFSET 8
> -#endif
> +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
>  
>  static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>  {
> @@ -280,12 +314,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>  	 *
>  	 */
>  
> +	emit_bti(A64_BTI_C, ctx);
> +
> +	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
> +	emit(A64_NOP, ctx);
> +
>  	/* Sign lr */
>  	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
>  		emit(A64_PACIASP, ctx);
> -	/* BTI landing pad */
> -	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> -		emit(A64_BTI_C, ctx);
>  
>  	/* Save FP and LR registers to stay align with ARM64 AAPCS */
>  	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
> @@ -312,8 +348,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>  		}
>  
>  		/* BTI landing pad for the tail call, done with a BR */
> -		if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> -			emit(A64_BTI_J, ctx);
> +		emit_bti(A64_BTI_J, ctx);
>  	}
>  
>  	emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
> @@ -557,6 +592,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
>  	return 0;
>  }
>  
> +void dummy_tramp(void);
> +
> +asm (
> +"	.pushsection .text, \"ax\", @progbits\n"
> +"	.type dummy_tramp, %function\n"
> +"dummy_tramp:"
> +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
> +"	bti j\n" /* dummy_tramp is called via "br x10" */
> +#endif
> +"	mov x10, lr\n"
> +"	mov lr, x9\n"
> +"	ret x10\n"
> +"	.size dummy_tramp, .-dummy_tramp\n"
> +"	.popsection\n"
> +);
> +
> +/* build a plt initialized like this:
> + *
> + * plt:
> + *      ldr tmp, target
> + *      br tmp
> + * target:
> + *      .quad dummy_tramp
> + *
> + * when a long jump trampoline is attached, target is filled with the
> + * trampoline address, and when the trampoine is removed, target is

s/trampoine/trampoline/

> + * restored to dummy_tramp address.
> + */
> +static void build_plt(struct jit_ctx *ctx, bool write_target)
> +{
> +	const u8 tmp = bpf2a64[TMP_REG_1];
> +	struct bpf_plt *plt = NULL;
> +
> +	/* make sure target is 64-bit aligend */

aligned

> +	if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
> +		emit(A64_NOP, ctx);
> +
> +	plt = (struct bpf_plt *)(ctx->image + ctx->idx);
> +	/* plt is called via bl, no BTI needed here */
> +	emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
> +	emit(A64_BR(tmp), ctx);
> +
> +	/* false write_target means target space is not allocated yet */
> +	if (write_target)

How about "if (ctx->image)", to be consistent

> +		plt->target = (u64)&dummy_tramp;
> +}
> +
>  static void build_epilogue(struct jit_ctx *ctx)
>  {
>  	const u8 r0 = bpf2a64[BPF_REG_0];
> @@ -1356,7 +1438,7 @@ struct arm64_jit_data {
>  
>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  {
> -	int image_size, prog_size, extable_size;
> +	int image_size, prog_size, extable_size, extable_align, extable_offset;
>  	struct bpf_prog *tmp, *orig_prog = prog;
>  	struct bpf_binary_header *header;
>  	struct arm64_jit_data *jit_data;
> @@ -1426,13 +1508,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  
>  	ctx.epilogue_offset = ctx.idx;
>  	build_epilogue(&ctx);
> +	build_plt(&ctx, false);
>  
> +	extable_align = __alignof__(struct exception_table_entry);
>  	extable_size = prog->aux->num_exentries *
>  		sizeof(struct exception_table_entry);
>  
>  	/* Now we know the actual image size. */
>  	prog_size = sizeof(u32) * ctx.idx;
> -	image_size = prog_size + extable_size;
> +	/* also allocate space for plt target */
> +	extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
> +	image_size = extable_offset + extable_size;
>  	header = bpf_jit_binary_alloc(image_size, &image_ptr,
>  				      sizeof(u32), jit_fill_hole);
>  	if (header == NULL) {
> @@ -1444,7 +1530,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  
>  	ctx.image = (__le32 *)image_ptr;
>  	if (extable_size)
> -		prog->aux->extable = (void *)image_ptr + prog_size;
> +		prog->aux->extable = (void *)image_ptr + extable_offset;
>  skip_init_ctx:
>  	ctx.idx = 0;
>  	ctx.exentry_idx = 0;
> @@ -1458,6 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  	}
>  
>  	build_epilogue(&ctx);
> +	build_plt(&ctx, true);
>  
>  	/* 3. Extra pass to validate JITed code. */
>  	if (validate_code(&ctx)) {
> @@ -1537,3 +1624,218 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>  {
>  	return true;
>  }
> +
> +static bool is_long_jump(void *ip, void *target)
> +{
> +	long offset;
> +
> +	/* NULL target means this is a NOP */
> +	if (!target)
> +		return false;
> +
> +	offset = (long)target - (long)ip;
> +	return offset < -SZ_128M || offset >= SZ_128M;
> +}
> +
> +static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
> +			     void *addr, void *plt, u32 *insn)
> +{
> +	void *target;
> +
> +	if (!addr) {
> +		*insn = aarch64_insn_gen_nop();
> +		return 0;
> +	}
> +
> +	if (is_long_jump(ip, addr))
> +		target = plt;
> +	else
> +		target = addr;
> +
> +	*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
> +					    (unsigned long)target,
> +					    type);
> +
> +	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
> +}
> +
> +/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
> + * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
> + * or @new_addr is NULL, the old or new instruction is NOP.
> + *
> + * When @ip is the bpf prog entry, a bpf trampoline is being attached or
> + * detached. Since bpf trampoline and bpf prog are allocated separately with
> + * vmalloc, the address distance may exceed 128MB, the maximum branch range.
> + * So long jump should be handled.
> + *
> + * When a bpf prog is constructed, a plt pointing to empty trampoline
> + * dummy_tramp is placed at the end:
> + *
> + *      bpf_prog:
> + *              mov x9, lr
> + *              nop // patchsite
> + *              ...
> + *              ret
> + *
> + *      plt:
> + *              ldr x10, target
> + *              br x10
> + *      target:
> + *              .quad dummy_tramp // plt target
> + *
> + * This is also the state when no trampoline is attached.
> + *
> + * When a short-jump bpf trampoline is attached, the patchsite is patched
> + * to a bl instruction to the trampoline directly:
> + *
> + *      bpf_prog:
> + *              mov x9, lr
> + *              bl <short-jump bpf trampoline address> // patchsite
> + *              ...
> + *              ret
> + *
> + *      plt:
> + *              ldr x10, target
> + *              br x10
> + *      target:
> + *              .quad dummy_tramp // plt target
> + *
> + * When a long-jump bpf trampoline is attached, the plt target is filled with
> + * the trampoline address and the patchsite is patched to a bl instruction to
> + * the plt:
> + *
> + *      bpf_prog:
> + *              mov x9, lr
> + *              bl plt // patchsite
> + *              ...
> + *              ret
> + *
> + *      plt:
> + *              ldr x10, target
> + *              br x10
> + *      target:
> + *              .quad <long-jump bpf trampoline address> // plt target
> + *
> + * The dummy_tramp is used to prevent another CPU from jumping to unknown
> + * locations during the patching process, making the patching process easier.
> + */
> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
> +		       void *old_addr, void *new_addr)
> +{
> +	int ret;
> +	u32 old_insn;
> +	u32 new_insn;
> +	u32 replaced;
> +	struct bpf_plt *plt = NULL;
> +	unsigned long size = 0UL;
> +	unsigned long offset = ~0UL;
> +	enum aarch64_insn_branch_type branch_type;
> +	char namebuf[KSYM_NAME_LEN];
> +	void *image = NULL;
> +	u64 plt_target = 0ULL;
> +	bool poking_bpf_entry;
> +
> +	if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
> +		/* Only poking bpf text is supported. Since kernel function
> +		 * entry is set up by ftrace, we reply on ftrace to poke kernel
> +		 * functions.
> +		 */
> +		return -ENOTSUPP;
> +
> +	image = ip - offset;
> +	/* zero offset means we're poking bpf prog entry */
> +	poking_bpf_entry = (offset == 0UL);
> +
> +	/* bpf prog entry, find plt and the real patchsite */
> +	if (poking_bpf_entry) {
> +		/* plt locates at the end of bpf prog */
> +		plt = image + size - PLT_TARGET_OFFSET;
> +
> +		/* skip to the nop instruction in bpf prog entry:
> +		 * bti c // if BTI enabled
> +		 * mov x9, x30
> +		 * nop
> +		 */
> +		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
> +	}
> +
> +	/* long jump is only possible at bpf prog entry */
> +	if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
> +		    !poking_bpf_entry))
> +		return -EINVAL;
> +
> +	if (poke_type == BPF_MOD_CALL)
> +		branch_type = AARCH64_INSN_BRANCH_LINK;
> +	else
> +		branch_type = AARCH64_INSN_BRANCH_NOLINK;
> +
> +	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
> +		return -EFAULT;
> +
> +	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
> +		return -EFAULT;
> +
> +	if (is_long_jump(ip, new_addr))
> +		plt_target = (u64)new_addr;
> +	else if (is_long_jump(ip, old_addr))
> +		/* if the old target is a long jump and the new target is not,
> +		 * restore the plt target to dummy_tramp, so there is always a
> +		 * legal and harmless address stored in plt target, and we'll
> +		 * never jump from plt to an unknown place.
> +		 */
> +		plt_target = (u64)&dummy_tramp;
> +
> +	if (plt_target) {
> +		/* non-zero plt_target indicates we're patching a bpf prog,
> +		 * which is read only.
> +		 */
> +		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
> +			return -EFAULT;
> +		WRITE_ONCE(plt->target, plt_target);
> +		set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
> +		/* since plt target points to either the new trmapoline

trampoline

> +		 * or dummy_tramp, even if aother CPU reads the old plt

another

Thanks,
Jean

> +		 * target value before fetching the bl instruction to plt,
> +		 * it will be brought back by dummy_tramp, so no barrier is
> +		 * required here.
> +		 */
> +	}
> +
> +	/* if the old target and the new target are both long jumps, no
> +	 * patching is required
> +	 */
> +	if (old_insn == new_insn)
> +		return 0;
> +
> +	mutex_lock(&text_mutex);
> +	if (aarch64_insn_read(ip, &replaced)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (replaced != old_insn) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	/* We call aarch64_insn_patch_text_nosync() to replace instruction
> +	 * atomically, so no other CPUs will fetch a half-new and half-old
> +	 * instruction. But there is chance that another CPU executes the
> +	 * old instruction after the patching operation finishes (e.g.,
> +	 * pipeline not flushed, or icache not synchronized yet).
> +	 *
> +	 * 1. when a new trampoline is attached, it is not a problem for
> +	 *    different CPUs to jump to different trampolines temporarily.
> +	 *
> +	 * 2. when an old trampoline is freed, we should wait for all other
> +	 *    CPUs to exit the trampoline and make sure the trampoline is no
> +	 *    longer reachable, since bpf_tramp_image_put() function already
> +	 *    uses percpu_ref and task rcu to do the sync, no need to call
> +	 *    the sync version here, see bpf_tramp_image_put() for details.
> +	 */
> +	ret = aarch64_insn_patch_text_nosync(ip, new_insn);
> +out:
> +	mutex_unlock(&text_mutex);
> +
> +	return ret;
> +}
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 3/4] bpf, arm64: Impelment bpf_arch_text_poke() for arm64
  2022-07-07 16:41     ` Jean-Philippe Brucker
@ 2022-07-08  2:41       ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-08  2:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On 7/8/2022 12:41 AM, Jean-Philippe Brucker wrote:
> On Sat, Jun 25, 2022 at 12:12:54PM -0400, Xu Kuohai wrote:
>> Impelment bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
> 
> Implement
> 

will fix

>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>> index f08a4447d363..e0e9c705a2e4 100644
>> --- a/arch/arm64/net/bpf_jit_comp.c
>> +++ b/arch/arm64/net/bpf_jit_comp.c
>> @@ -9,6 +9,7 @@
>>  
>>  #include <linux/bitfield.h>
>>  #include <linux/bpf.h>
>> +#include <linux/memory.h>
> 
> nit: keep sorted
> 

will fix

>>  #include <linux/filter.h>
>>  #include <linux/printk.h>
>>  #include <linux/slab.h>
>> @@ -18,6 +19,7 @@
>>  #include <asm/cacheflush.h>
>>  #include <asm/debug-monitors.h>
>>  #include <asm/insn.h>
>> +#include <asm/patching.h>
>>  #include <asm/set_memory.h>
>>  
>>  #include "bpf_jit.h"
>> @@ -78,6 +80,15 @@ struct jit_ctx {
>>  	int fpb_offset;
>>  };
>>  
>> +struct bpf_plt {
>> +	u32 insn_ldr; /* load target */
>> +	u32 insn_br;  /* branch to target */
>> +	u64 target;   /* target value */
>> +} __packed;
> 
> don't need __packed
> 

will fix

>> +
>> +#define PLT_TARGET_SIZE   sizeof_field(struct bpf_plt, target)
>> +#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
>> +
>>  static inline void emit(const u32 insn, struct jit_ctx *ctx)
>>  {
>>  	if (ctx->image != NULL)
>> @@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val,
>>  	}
>>  }
>>  
>> +static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
>> +{
>> +	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
>> +		emit(insn, ctx);
>> +}
>> +
>>  /*
>>   * Kernel addresses in the vmalloc space use at most 48 bits, and the
>>   * remaining bits are guaranteed to be 0x1. So we can compose the address
>> @@ -235,13 +252,30 @@ static bool is_lsi_offset(int offset, int scale)
>>  	return true;
>>  }
>>  
>> +/* generated prologue:
>> + *      bti c // if CONFIG_ARM64_BTI_KERNEL
>> + *      mov x9, lr
>> + *      nop  // POKE_OFFSET
>> + *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
> 
> Any reason for the change regarding BTI and pointer auth?  We used to put
> 'bti c' at the function entry if (BTI && !PA), or 'paciasp' if (BTI && PA),
> because 'paciasp' is an implicit BTI.
> 

Assuming paciasp is the first instruction when (BTI && PA), then when a
trampoline with the BPF_TRAMP_F_CALL_ORIG flag is attached, we run into
the following scenario.

bpf_prog:
        paciasp // LR1
        mov x9, lr
        bl <trampoline> ----> trampoline:
                                      ....
                                      mov x10, <entry_for_CALL_ORIG>
                                      blr x10
                                        |
CALL_ORIG_entry:                        |
        bti c        <------------------|
        stp x29, lr, [sp, #- 16]!
        ...
        autiasp // LR2
        ret

Because LR1 and LR2 are not equal, the autiasp will fail!

To make this scenario work properly, the first instruction should be
'bti c'.

>> + *      stp x29, lr, [sp, #-16]!
>> + *      mov x29, sp
>> + *      stp x19, x20, [sp, #-16]!
>> + *      stp x21, x22, [sp, #-16]!
>> + *      stp x25, x26, [sp, #-16]!
>> + *      stp x27, x28, [sp, #-16]!
>> + *      mov x25, sp
>> + *      mov tcc, #0
>> + *      // PROLOGUE_OFFSET
>> + */
>> +
>> +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
>> +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
>> +
>> +/* Offset of nop instruction in bpf prog entry to be poked */
>> +#define POKE_OFFSET (BTI_INSNS + 1)
>> +
>>  /* Tail call offset to jump into */
>> -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
>> -	IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
>> -#define PROLOGUE_OFFSET 9
>> -#else
>> -#define PROLOGUE_OFFSET 8
>> -#endif
>> +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
>>  
>>  static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>  {
>> @@ -280,12 +314,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>  	 *
>>  	 */
>>  
>> +	emit_bti(A64_BTI_C, ctx);
>> +
>> +	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
>> +	emit(A64_NOP, ctx);
>> +
>>  	/* Sign lr */
>>  	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
>>  		emit(A64_PACIASP, ctx);
>> -	/* BTI landing pad */
>> -	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
>> -		emit(A64_BTI_C, ctx);
>>  
>>  	/* Save FP and LR registers to stay align with ARM64 AAPCS */
>>  	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
>> @@ -312,8 +348,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>  		}
>>  
>>  		/* BTI landing pad for the tail call, done with a BR */
>> -		if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
>> -			emit(A64_BTI_J, ctx);
>> +		emit_bti(A64_BTI_J, ctx);
>>  	}
>>  
>>  	emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
>> @@ -557,6 +592,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
>>  	return 0;
>>  }
>>  
>> +void dummy_tramp(void);
>> +
>> +asm (
>> +"	.pushsection .text, \"ax\", @progbits\n"
>> +"	.type dummy_tramp, %function\n"
>> +"dummy_tramp:"
>> +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
>> +"	bti j\n" /* dummy_tramp is called via "br x10" */
>> +#endif
>> +"	mov x10, lr\n"
>> +"	mov lr, x9\n"
>> +"	ret x10\n"
>> +"	.size dummy_tramp, .-dummy_tramp\n"
>> +"	.popsection\n"
>> +);
>> +
>> +/* build a plt initialized like this:
>> + *
>> + * plt:
>> + *      ldr tmp, target
>> + *      br tmp
>> + * target:
>> + *      .quad dummy_tramp
>> + *
>> + * when a long jump trampoline is attached, target is filled with the
>> + * trampoline address, and when the trampoine is removed, target is
> 
> s/trampoine/trampoline/
> 

will fix, thanks

>> + * restored to dummy_tramp address.
>> + */
>> +static void build_plt(struct jit_ctx *ctx, bool write_target)
>> +{
>> +	const u8 tmp = bpf2a64[TMP_REG_1];
>> +	struct bpf_plt *plt = NULL;
>> +
>> +	/* make sure target is 64-bit aligend */
> 
> aligned
>

will fix, thanks

>> +	if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
>> +		emit(A64_NOP, ctx);
>> +
>> +	plt = (struct bpf_plt *)(ctx->image + ctx->idx);
>> +	/* plt is called via bl, no BTI needed here */
>> +	emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
>> +	emit(A64_BR(tmp), ctx);
>> +
>> +	/* false write_target means target space is not allocated yet */
>> +	if (write_target)
> 
> How about "if (ctx->image)", to be consistent
> 

great, thanks

>> +		plt->target = (u64)&dummy_tramp;
>> +}
>> +
>>  static void build_epilogue(struct jit_ctx *ctx)
>>  {
>>  	const u8 r0 = bpf2a64[BPF_REG_0];
>> @@ -1356,7 +1438,7 @@ struct arm64_jit_data {
>>  
>>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  {
>> -	int image_size, prog_size, extable_size;
>> +	int image_size, prog_size, extable_size, extable_align, extable_offset;
>>  	struct bpf_prog *tmp, *orig_prog = prog;
>>  	struct bpf_binary_header *header;
>>  	struct arm64_jit_data *jit_data;
>> @@ -1426,13 +1508,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  
>>  	ctx.epilogue_offset = ctx.idx;
>>  	build_epilogue(&ctx);
>> +	build_plt(&ctx, false);
>>  
>> +	extable_align = __alignof__(struct exception_table_entry);
>>  	extable_size = prog->aux->num_exentries *
>>  		sizeof(struct exception_table_entry);
>>  
>>  	/* Now we know the actual image size. */
>>  	prog_size = sizeof(u32) * ctx.idx;
>> -	image_size = prog_size + extable_size;
>> +	/* also allocate space for plt target */
>> +	extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
>> +	image_size = extable_offset + extable_size;
>>  	header = bpf_jit_binary_alloc(image_size, &image_ptr,
>>  				      sizeof(u32), jit_fill_hole);
>>  	if (header == NULL) {
>> @@ -1444,7 +1530,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  
>>  	ctx.image = (__le32 *)image_ptr;
>>  	if (extable_size)
>> -		prog->aux->extable = (void *)image_ptr + prog_size;
>> +		prog->aux->extable = (void *)image_ptr + extable_offset;
>>  skip_init_ctx:
>>  	ctx.idx = 0;
>>  	ctx.exentry_idx = 0;
>> @@ -1458,6 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  	}
>>  
>>  	build_epilogue(&ctx);
>> +	build_plt(&ctx, true);
>>  
>>  	/* 3. Extra pass to validate JITed code. */
>>  	if (validate_code(&ctx)) {
>> @@ -1537,3 +1624,218 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>>  {
>>  	return true;
>>  }
>> +
>> +static bool is_long_jump(void *ip, void *target)
>> +{
>> +	long offset;
>> +
>> +	/* NULL target means this is a NOP */
>> +	if (!target)
>> +		return false;
>> +
>> +	offset = (long)target - (long)ip;
>> +	return offset < -SZ_128M || offset >= SZ_128M;
>> +}
>> +
>> +static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
>> +			     void *addr, void *plt, u32 *insn)
>> +{
>> +	void *target;
>> +
>> +	if (!addr) {
>> +		*insn = aarch64_insn_gen_nop();
>> +		return 0;
>> +	}
>> +
>> +	if (is_long_jump(ip, addr))
>> +		target = plt;
>> +	else
>> +		target = addr;
>> +
>> +	*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
>> +					    (unsigned long)target,
>> +					    type);
>> +
>> +	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
>> +}
>> +
>> +/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
>> + * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
>> + * or @new_addr is NULL, the old or new instruction is NOP.
>> + *
>> + * When @ip is the bpf prog entry, a bpf trampoline is being attached or
>> + * detached. Since bpf trampoline and bpf prog are allocated separately with
>> + * vmalloc, the address distance may exceed 128MB, the maximum branch range.
>> + * So long jump should be handled.
>> + *
>> + * When a bpf prog is constructed, a plt pointing to empty trampoline
>> + * dummy_tramp is placed at the end:
>> + *
>> + *      bpf_prog:
>> + *              mov x9, lr
>> + *              nop // patchsite
>> + *              ...
>> + *              ret
>> + *
>> + *      plt:
>> + *              ldr x10, target
>> + *              br x10
>> + *      target:
>> + *              .quad dummy_tramp // plt target
>> + *
>> + * This is also the state when no trampoline is attached.
>> + *
>> + * When a short-jump bpf trampoline is attached, the patchsite is patched
>> + * to a bl instruction to the trampoline directly:
>> + *
>> + *      bpf_prog:
>> + *              mov x9, lr
>> + *              bl <short-jump bpf trampoline address> // patchsite
>> + *              ...
>> + *              ret
>> + *
>> + *      plt:
>> + *              ldr x10, target
>> + *              br x10
>> + *      target:
>> + *              .quad dummy_tramp // plt target
>> + *
>> + * When a long-jump bpf trampoline is attached, the plt target is filled with
>> + * the trampoline address and the patchsite is patched to a bl instruction to
>> + * the plt:
>> + *
>> + *      bpf_prog:
>> + *              mov x9, lr
>> + *              bl plt // patchsite
>> + *              ...
>> + *              ret
>> + *
>> + *      plt:
>> + *              ldr x10, target
>> + *              br x10
>> + *      target:
>> + *              .quad <long-jump bpf trampoline address> // plt target
>> + *
>> + * The dummy_tramp is used to prevent another CPU from jumping to unknown
>> + * locations during the patching process, making the patching process easier.
>> + */
>> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>> +		       void *old_addr, void *new_addr)
>> +{
>> +	int ret;
>> +	u32 old_insn;
>> +	u32 new_insn;
>> +	u32 replaced;
>> +	struct bpf_plt *plt = NULL;
>> +	unsigned long size = 0UL;
>> +	unsigned long offset = ~0UL;
>> +	enum aarch64_insn_branch_type branch_type;
>> +	char namebuf[KSYM_NAME_LEN];
>> +	void *image = NULL;
>> +	u64 plt_target = 0ULL;
>> +	bool poking_bpf_entry;
>> +
>> +	if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
>> +		/* Only poking bpf text is supported. Since kernel function
>> +		 * entry is set up by ftrace, we reply on ftrace to poke kernel
>> +		 * functions.
>> +		 */
>> +		return -ENOTSUPP;
>> +
>> +	image = ip - offset;
>> +	/* zero offset means we're poking bpf prog entry */
>> +	poking_bpf_entry = (offset == 0UL);
>> +
>> +	/* bpf prog entry, find plt and the real patchsite */
>> +	if (poking_bpf_entry) {
>> +		/* plt locates at the end of bpf prog */
>> +		plt = image + size - PLT_TARGET_OFFSET;
>> +
>> +		/* skip to the nop instruction in bpf prog entry:
>> +		 * bti c // if BTI enabled
>> +		 * mov x9, x30
>> +		 * nop
>> +		 */
>> +		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
>> +	}
>> +
>> +	/* long jump is only possible at bpf prog entry */
>> +	if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
>> +		    !poking_bpf_entry))
>> +		return -EINVAL;
>> +
>> +	if (poke_type == BPF_MOD_CALL)
>> +		branch_type = AARCH64_INSN_BRANCH_LINK;
>> +	else
>> +		branch_type = AARCH64_INSN_BRANCH_NOLINK;
>> +
>> +	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
>> +		return -EFAULT;
>> +
>> +	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
>> +		return -EFAULT;
>> +
>> +	if (is_long_jump(ip, new_addr))
>> +		plt_target = (u64)new_addr;
>> +	else if (is_long_jump(ip, old_addr))
>> +		/* if the old target is a long jump and the new target is not,
>> +		 * restore the plt target to dummy_tramp, so there is always a
>> +		 * legal and harmless address stored in plt target, and we'll
>> +		 * never jump from plt to an unknown place.
>> +		 */
>> +		plt_target = (u64)&dummy_tramp;
>> +
>> +	if (plt_target) {
>> +		/* non-zero plt_target indicates we're patching a bpf prog,
>> +		 * which is read only.
>> +		 */
>> +		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
>> +			return -EFAULT;
>> +		WRITE_ONCE(plt->target, plt_target);
>> +		set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
>> +		/* since plt target points to either the new trmapoline
> 
> trampoline

will fix

> 
>> +		 * or dummy_tramp, even if aother CPU reads the old plt
> 
> another

will fix

> 
> Thanks,
> Jean
> 

sorry for so many typos, thanks a lot!

>> +		 * target value before fetching the bl instruction to plt,
>> +		 * it will be brought back by dummy_tramp, so no barrier is
>> +		 * required here.
>> +		 */
>> +	}
>> +
>> +	/* if the old target and the new target are both long jumps, no
>> +	 * patching is required
>> +	 */
>> +	if (old_insn == new_insn)
>> +		return 0;
>> +
>> +	mutex_lock(&text_mutex);
>> +	if (aarch64_insn_read(ip, &replaced)) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	if (replaced != old_insn) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	/* We call aarch64_insn_patch_text_nosync() to replace instruction
>> +	 * atomically, so no other CPUs will fetch a half-new and half-old
>> +	 * instruction. But there is chance that another CPU executes the
>> +	 * old instruction after the patching operation finishes (e.g.,
>> +	 * pipeline not flushed, or icache not synchronized yet).
>> +	 *
>> +	 * 1. when a new trampoline is attached, it is not a problem for
>> +	 *    different CPUs to jump to different trampolines temporarily.
>> +	 *
>> +	 * 2. when an old trampoline is freed, we should wait for all other
>> +	 *    CPUs to exit the trampoline and make sure the trampoline is no
>> +	 *    longer reachable, since bpf_tramp_image_put() function already
>> +	 *    uses percpu_ref and task rcu to do the sync, no need to call
>> +	 *    the sync version here, see bpf_tramp_image_put() for details.
>> +	 */
>> +	ret = aarch64_insn_patch_text_nosync(ip, new_insn);
>> +out:
>> +	mutex_unlock(&text_mutex);
>> +
>> +	return ret;
>> +}
>> -- 
>> 2.30.2
>>
> .


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 3/4] bpf, arm64: Impelment bpf_arch_text_poke() for arm64
@ 2022-07-08  2:41       ` Xu Kuohai
  0 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-08  2:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On 7/8/2022 12:41 AM, Jean-Philippe Brucker wrote:
> On Sat, Jun 25, 2022 at 12:12:54PM -0400, Xu Kuohai wrote:
>> Impelment bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
> 
> Implement
> 

will fix

>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>> index f08a4447d363..e0e9c705a2e4 100644
>> --- a/arch/arm64/net/bpf_jit_comp.c
>> +++ b/arch/arm64/net/bpf_jit_comp.c
>> @@ -9,6 +9,7 @@
>>  
>>  #include <linux/bitfield.h>
>>  #include <linux/bpf.h>
>> +#include <linux/memory.h>
> 
> nit: keep sorted
> 

will fix

>>  #include <linux/filter.h>
>>  #include <linux/printk.h>
>>  #include <linux/slab.h>
>> @@ -18,6 +19,7 @@
>>  #include <asm/cacheflush.h>
>>  #include <asm/debug-monitors.h>
>>  #include <asm/insn.h>
>> +#include <asm/patching.h>
>>  #include <asm/set_memory.h>
>>  
>>  #include "bpf_jit.h"
>> @@ -78,6 +80,15 @@ struct jit_ctx {
>>  	int fpb_offset;
>>  };
>>  
>> +struct bpf_plt {
>> +	u32 insn_ldr; /* load target */
>> +	u32 insn_br;  /* branch to target */
>> +	u64 target;   /* target value */
>> +} __packed;
> 
> don't need __packed
> 

will fix

>> +
>> +#define PLT_TARGET_SIZE   sizeof_field(struct bpf_plt, target)
>> +#define PLT_TARGET_OFFSET offsetof(struct bpf_plt, target)
>> +
>>  static inline void emit(const u32 insn, struct jit_ctx *ctx)
>>  {
>>  	if (ctx->image != NULL)
>> @@ -140,6 +151,12 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val,
>>  	}
>>  }
>>  
>> +static inline void emit_bti(u32 insn, struct jit_ctx *ctx)
>> +{
>> +	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
>> +		emit(insn, ctx);
>> +}
>> +
>>  /*
>>   * Kernel addresses in the vmalloc space use at most 48 bits, and the
>>   * remaining bits are guaranteed to be 0x1. So we can compose the address
>> @@ -235,13 +252,30 @@ static bool is_lsi_offset(int offset, int scale)
>>  	return true;
>>  }
>>  
>> +/* generated prologue:
>> + *      bti c // if CONFIG_ARM64_BTI_KERNEL
>> + *      mov x9, lr
>> + *      nop  // POKE_OFFSET
>> + *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
> 
> Any reason for the change regarding BTI and pointer auth?  We used to put
> 'bti c' at the function entry if (BTI && !PA), or 'paciasp' if (BTI && PA),
> because 'paciasp' is an implicit BTI.
> 

Assuming paciasp is the first instruction when (BTI && PA), then when a
trampoline with the BPF_TRAMP_F_CALL_ORIG flag is attached, we run into
the following scenario.

bpf_prog:
        paciasp // LR1
        mov x9, lr
        bl <trampoline> ----> trampoline:
                                      ....
                                      mov x10, <entry_for_CALL_ORIG>
                                      blr x10
                                        |
CALL_ORIG_entry:                        |
        bti c        <------------------|
        stp x29, lr, [sp, #- 16]!
        ...
        autiasp // LR2
        ret

Because LR1 and LR2 are not equal, the autiasp will fail!

To make this scenario work properly, the first instruction should be
'bti c'.

>> + *      stp x29, lr, [sp, #-16]!
>> + *      mov x29, sp
>> + *      stp x19, x20, [sp, #-16]!
>> + *      stp x21, x22, [sp, #-16]!
>> + *      stp x25, x26, [sp, #-16]!
>> + *      stp x27, x28, [sp, #-16]!
>> + *      mov x25, sp
>> + *      mov tcc, #0
>> + *      // PROLOGUE_OFFSET
>> + */
>> +
>> +#define BTI_INSNS (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) ? 1 : 0)
>> +#define PAC_INSNS (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) ? 1 : 0)
>> +
>> +/* Offset of nop instruction in bpf prog entry to be poked */
>> +#define POKE_OFFSET (BTI_INSNS + 1)
>> +
>>  /* Tail call offset to jump into */
>> -#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) || \
>> -	IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL)
>> -#define PROLOGUE_OFFSET 9
>> -#else
>> -#define PROLOGUE_OFFSET 8
>> -#endif
>> +#define PROLOGUE_OFFSET (BTI_INSNS + 2 + PAC_INSNS + 8)
>>  
>>  static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>  {
>> @@ -280,12 +314,14 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>  	 *
>>  	 */
>>  
>> +	emit_bti(A64_BTI_C, ctx);
>> +
>> +	emit(A64_MOV(1, A64_R(9), A64_LR), ctx);
>> +	emit(A64_NOP, ctx);
>> +
>>  	/* Sign lr */
>>  	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
>>  		emit(A64_PACIASP, ctx);
>> -	/* BTI landing pad */
>> -	else if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
>> -		emit(A64_BTI_C, ctx);
>>  
>>  	/* Save FP and LR registers to stay align with ARM64 AAPCS */
>>  	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
>> @@ -312,8 +348,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
>>  		}
>>  
>>  		/* BTI landing pad for the tail call, done with a BR */
>> -		if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
>> -			emit(A64_BTI_J, ctx);
>> +		emit_bti(A64_BTI_J, ctx);
>>  	}
>>  
>>  	emit(A64_SUB_I(1, fpb, fp, ctx->fpb_offset), ctx);
>> @@ -557,6 +592,53 @@ static int emit_ll_sc_atomic(const struct bpf_insn *insn, struct jit_ctx *ctx)
>>  	return 0;
>>  }
>>  
>> +void dummy_tramp(void);
>> +
>> +asm (
>> +"	.pushsection .text, \"ax\", @progbits\n"
>> +"	.type dummy_tramp, %function\n"
>> +"dummy_tramp:"
>> +#if IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)
>> +"	bti j\n" /* dummy_tramp is called via "br x10" */
>> +#endif
>> +"	mov x10, lr\n"
>> +"	mov lr, x9\n"
>> +"	ret x10\n"
>> +"	.size dummy_tramp, .-dummy_tramp\n"
>> +"	.popsection\n"
>> +);
>> +
>> +/* build a plt initialized like this:
>> + *
>> + * plt:
>> + *      ldr tmp, target
>> + *      br tmp
>> + * target:
>> + *      .quad dummy_tramp
>> + *
>> + * when a long jump trampoline is attached, target is filled with the
>> + * trampoline address, and when the trampoine is removed, target is
> 
> s/trampoine/trampoline/
> 

will fix, thanks

>> + * restored to dummy_tramp address.
>> + */
>> +static void build_plt(struct jit_ctx *ctx, bool write_target)
>> +{
>> +	const u8 tmp = bpf2a64[TMP_REG_1];
>> +	struct bpf_plt *plt = NULL;
>> +
>> +	/* make sure target is 64-bit aligend */
> 
> aligned
>

will fix, thanks

>> +	if ((ctx->idx + PLT_TARGET_OFFSET / AARCH64_INSN_SIZE) % 2)
>> +		emit(A64_NOP, ctx);
>> +
>> +	plt = (struct bpf_plt *)(ctx->image + ctx->idx);
>> +	/* plt is called via bl, no BTI needed here */
>> +	emit(A64_LDR64LIT(tmp, 2 * AARCH64_INSN_SIZE), ctx);
>> +	emit(A64_BR(tmp), ctx);
>> +
>> +	/* false write_target means target space is not allocated yet */
>> +	if (write_target)
> 
> How about "if (ctx->image)", to be consistent
> 

great, thanks

>> +		plt->target = (u64)&dummy_tramp;
>> +}
>> +
>>  static void build_epilogue(struct jit_ctx *ctx)
>>  {
>>  	const u8 r0 = bpf2a64[BPF_REG_0];
>> @@ -1356,7 +1438,7 @@ struct arm64_jit_data {
>>  
>>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  {
>> -	int image_size, prog_size, extable_size;
>> +	int image_size, prog_size, extable_size, extable_align, extable_offset;
>>  	struct bpf_prog *tmp, *orig_prog = prog;
>>  	struct bpf_binary_header *header;
>>  	struct arm64_jit_data *jit_data;
>> @@ -1426,13 +1508,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  
>>  	ctx.epilogue_offset = ctx.idx;
>>  	build_epilogue(&ctx);
>> +	build_plt(&ctx, false);
>>  
>> +	extable_align = __alignof__(struct exception_table_entry);
>>  	extable_size = prog->aux->num_exentries *
>>  		sizeof(struct exception_table_entry);
>>  
>>  	/* Now we know the actual image size. */
>>  	prog_size = sizeof(u32) * ctx.idx;
>> -	image_size = prog_size + extable_size;
>> +	/* also allocate space for plt target */
>> +	extable_offset = round_up(prog_size + PLT_TARGET_SIZE, extable_align);
>> +	image_size = extable_offset + extable_size;
>>  	header = bpf_jit_binary_alloc(image_size, &image_ptr,
>>  				      sizeof(u32), jit_fill_hole);
>>  	if (header == NULL) {
>> @@ -1444,7 +1530,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  
>>  	ctx.image = (__le32 *)image_ptr;
>>  	if (extable_size)
>> -		prog->aux->extable = (void *)image_ptr + prog_size;
>> +		prog->aux->extable = (void *)image_ptr + extable_offset;
>>  skip_init_ctx:
>>  	ctx.idx = 0;
>>  	ctx.exentry_idx = 0;
>> @@ -1458,6 +1544,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  	}
>>  
>>  	build_epilogue(&ctx);
>> +	build_plt(&ctx, true);
>>  
>>  	/* 3. Extra pass to validate JITed code. */
>>  	if (validate_code(&ctx)) {
>> @@ -1537,3 +1624,218 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>>  {
>>  	return true;
>>  }
>> +
>> +static bool is_long_jump(void *ip, void *target)
>> +{
>> +	long offset;
>> +
>> +	/* NULL target means this is a NOP */
>> +	if (!target)
>> +		return false;
>> +
>> +	offset = (long)target - (long)ip;
>> +	return offset < -SZ_128M || offset >= SZ_128M;
>> +}
>> +
>> +static int gen_branch_or_nop(enum aarch64_insn_branch_type type, void *ip,
>> +			     void *addr, void *plt, u32 *insn)
>> +{
>> +	void *target;
>> +
>> +	if (!addr) {
>> +		*insn = aarch64_insn_gen_nop();
>> +		return 0;
>> +	}
>> +
>> +	if (is_long_jump(ip, addr))
>> +		target = plt;
>> +	else
>> +		target = addr;
>> +
>> +	*insn = aarch64_insn_gen_branch_imm((unsigned long)ip,
>> +					    (unsigned long)target,
>> +					    type);
>> +
>> +	return *insn != AARCH64_BREAK_FAULT ? 0 : -EFAULT;
>> +}
>> +
>> +/* Replace the branch instruction from @ip to @old_addr in a bpf prog or a bpf
>> + * trampoline with the branch instruction from @ip to @new_addr. If @old_addr
>> + * or @new_addr is NULL, the old or new instruction is NOP.
>> + *
>> + * When @ip is the bpf prog entry, a bpf trampoline is being attached or
>> + * detached. Since bpf trampoline and bpf prog are allocated separately with
>> + * vmalloc, the address distance may exceed 128MB, the maximum branch range.
>> + * So long jump should be handled.
>> + *
>> + * When a bpf prog is constructed, a plt pointing to empty trampoline
>> + * dummy_tramp is placed at the end:
>> + *
>> + *      bpf_prog:
>> + *              mov x9, lr
>> + *              nop // patchsite
>> + *              ...
>> + *              ret
>> + *
>> + *      plt:
>> + *              ldr x10, target
>> + *              br x10
>> + *      target:
>> + *              .quad dummy_tramp // plt target
>> + *
>> + * This is also the state when no trampoline is attached.
>> + *
>> + * When a short-jump bpf trampoline is attached, the patchsite is patched
>> + * to a bl instruction to the trampoline directly:
>> + *
>> + *      bpf_prog:
>> + *              mov x9, lr
>> + *              bl <short-jump bpf trampoline address> // patchsite
>> + *              ...
>> + *              ret
>> + *
>> + *      plt:
>> + *              ldr x10, target
>> + *              br x10
>> + *      target:
>> + *              .quad dummy_tramp // plt target
>> + *
>> + * When a long-jump bpf trampoline is attached, the plt target is filled with
>> + * the trampoline address and the patchsite is patched to a bl instruction to
>> + * the plt:
>> + *
>> + *      bpf_prog:
>> + *              mov x9, lr
>> + *              bl plt // patchsite
>> + *              ...
>> + *              ret
>> + *
>> + *      plt:
>> + *              ldr x10, target
>> + *              br x10
>> + *      target:
>> + *              .quad <long-jump bpf trampoline address> // plt target
>> + *
>> + * The dummy_tramp is used to prevent another CPU from jumping to unknown
>> + * locations during the patching process, making the patching process easier.
>> + */
>> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>> +		       void *old_addr, void *new_addr)
>> +{
>> +	int ret;
>> +	u32 old_insn;
>> +	u32 new_insn;
>> +	u32 replaced;
>> +	struct bpf_plt *plt = NULL;
>> +	unsigned long size = 0UL;
>> +	unsigned long offset = ~0UL;
>> +	enum aarch64_insn_branch_type branch_type;
>> +	char namebuf[KSYM_NAME_LEN];
>> +	void *image = NULL;
>> +	u64 plt_target = 0ULL;
>> +	bool poking_bpf_entry;
>> +
>> +	if (!__bpf_address_lookup((unsigned long)ip, &size, &offset, namebuf))
>> +		/* Only poking bpf text is supported. Since kernel function
>> +		 * entry is set up by ftrace, we rely on ftrace to poke kernel
>> +		 * functions.
>> +		 */
>> +		return -ENOTSUPP;
>> +
>> +	image = ip - offset;
>> +	/* zero offset means we're poking bpf prog entry */
>> +	poking_bpf_entry = (offset == 0UL);
>> +
>> +	/* bpf prog entry, find plt and the real patchsite */
>> +	if (poking_bpf_entry) {
>> +		/* plt is located at the end of bpf prog */
>> +		plt = image + size - PLT_TARGET_OFFSET;
>> +
>> +		/* skip to the nop instruction in bpf prog entry:
>> +		 * bti c // if BTI enabled
>> +		 * mov x9, x30
>> +		 * nop
>> +		 */
>> +		ip = image + POKE_OFFSET * AARCH64_INSN_SIZE;
>> +	}
>> +
>> +	/* long jump is only possible at bpf prog entry */
>> +	if (WARN_ON((is_long_jump(ip, new_addr) || is_long_jump(ip, old_addr)) &&
>> +		    !poking_bpf_entry))
>> +		return -EINVAL;
>> +
>> +	if (poke_type == BPF_MOD_CALL)
>> +		branch_type = AARCH64_INSN_BRANCH_LINK;
>> +	else
>> +		branch_type = AARCH64_INSN_BRANCH_NOLINK;
>> +
>> +	if (gen_branch_or_nop(branch_type, ip, old_addr, plt, &old_insn) < 0)
>> +		return -EFAULT;
>> +
>> +	if (gen_branch_or_nop(branch_type, ip, new_addr, plt, &new_insn) < 0)
>> +		return -EFAULT;
>> +
>> +	if (is_long_jump(ip, new_addr))
>> +		plt_target = (u64)new_addr;
>> +	else if (is_long_jump(ip, old_addr))
>> +		/* if the old target is a long jump and the new target is not,
>> +		 * restore the plt target to dummy_tramp, so there is always a
>> +		 * legal and harmless address stored in plt target, and we'll
>> +		 * never jump from plt to an unknown place.
>> +		 */
>> +		plt_target = (u64)&dummy_tramp;
>> +
>> +	if (plt_target) {
>> +		/* non-zero plt_target indicates we're patching a bpf prog,
>> +		 * which is read only.
>> +		 */
>> +		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target), 1))
>> +			return -EFAULT;
>> +		WRITE_ONCE(plt->target, plt_target);
>> +		set_memory_ro(PAGE_MASK & ((uintptr_t)&plt->target), 1);
>> +		/* since plt target points to either the new trmapoline
> 
> trampoline

will fix

> 
>> +		 * or dummy_tramp, even if aother CPU reads the old plt
> 
> another

will fix

> 
> Thanks,
> Jean
> 

sorry for so many typos, thanks a lot!

>> +		 * target value before fetching the bl instruction to plt,
>> +		 * it will be brought back by dummy_tramp, so no barrier is
>> +		 * required here.
>> +		 */
>> +	}
>> +
>> +	/* if the old target and the new target are both long jumps, no
>> +	 * patching is required
>> +	 */
>> +	if (old_insn == new_insn)
>> +		return 0;
>> +
>> +	mutex_lock(&text_mutex);
>> +	if (aarch64_insn_read(ip, &replaced)) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	if (replaced != old_insn) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	/* We call aarch64_insn_patch_text_nosync() to replace instruction
>> +	 * atomically, so no other CPUs will fetch a half-new and half-old
>> +	 * instruction. But there is chance that another CPU executes the
>> +	 * old instruction after the patching operation finishes (e.g.,
>> +	 * pipeline not flushed, or icache not synchronized yet).
>> +	 *
>> +	 * 1. when a new trampoline is attached, it is not a problem for
>> +	 *    different CPUs to jump to different trampolines temporarily.
>> +	 *
>> +	 * 2. when an old trampoline is freed, we should wait for all other
>> +	 *    CPUs to exit the trampoline and make sure the trampoline is no
>> +	 *    longer reachable, since bpf_tramp_image_put() function already
>> +	 *    uses percpu_ref and task rcu to do the sync, no need to call
>> +	 *    the sync version here, see bpf_tramp_image_put() for details.
>> +	 */
>> +	ret = aarch64_insn_patch_text_nosync(ip, new_insn);
>> +out:
>> +	mutex_unlock(&text_mutex);
>> +
>> +	return ret;
>> +}
>> -- 
>> 2.30.2
>>
> .



* Re: [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline for arm64
  2022-07-07 16:37     ` Jean-Philippe Brucker
@ 2022-07-08  4:35       ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-08  4:35 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On 7/8/2022 12:37 AM, Jean-Philippe Brucker wrote:
> Nice!  Looks good overall, I just have a few comments inline.
> 
> On Sat, Jun 25, 2022 at 12:12:55PM -0400, Xu Kuohai wrote:
>> This is arm64 version of commit fec56f5890d9 ("bpf: Introduce BPF
>> trampoline"). A bpf trampoline converts native calling convention to bpf
>> calling convention and is used to implement various bpf features, such
>> as fentry, fexit, fmod_ret and struct_ops.
>>
>> This patch does essentially the same thing that bpf trampoline does on x86.
>>
>> Tested on raspberry pi 4b and qemu:
>>
>>  #18 /1     bpf_tcp_ca/dctcp:OK
>>  #18 /2     bpf_tcp_ca/cubic:OK
>>  #18 /3     bpf_tcp_ca/invalid_license:OK
>>  #18 /4     bpf_tcp_ca/dctcp_fallback:OK
>>  #18 /5     bpf_tcp_ca/rel_setsockopt:OK
>>  #18        bpf_tcp_ca:OK
>>  #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
>>  #51 /2     dummy_st_ops/dummy_init_ret_value:OK
>>  #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
>>  #51 /4     dummy_st_ops/dummy_multiple_args:OK
>>  #51        dummy_st_ops:OK
>>  #57 /1     fexit_bpf2bpf/target_no_callees:OK
>>  #57 /2     fexit_bpf2bpf/target_yes_callees:OK
>>  #57 /3     fexit_bpf2bpf/func_replace:OK
>>  #57 /4     fexit_bpf2bpf/func_replace_verify:OK
>>  #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
>>  #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
>>  #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
>>  #57 /8     fexit_bpf2bpf/func_replace_multi:OK
>>  #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
>>  #57        fexit_bpf2bpf:OK
>>  #237       xdp_bpf2bpf:OK
>>
>> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
>> Acked-by: Song Liu <songliubraving@fb.com>
>> Acked-by: KP Singh <kpsingh@kernel.org>
>> ---
>>  arch/arm64/net/bpf_jit_comp.c | 387 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 384 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>> index e0e9c705a2e4..dd5a843601b8 100644
>> --- a/arch/arm64/net/bpf_jit_comp.c
>> +++ b/arch/arm64/net/bpf_jit_comp.c
>> @@ -176,6 +176,14 @@ static inline void emit_addr_mov_i64(const int reg, const u64 val,
>>  	}
>>  }
>>  
>> +static inline void emit_call(u64 target, struct jit_ctx *ctx)
>> +{
>> +	u8 tmp = bpf2a64[TMP_REG_1];
>> +
>> +	emit_addr_mov_i64(tmp, target, ctx);
>> +	emit(A64_BLR(tmp), ctx);
>> +}
>> +
>>  static inline int bpf2a64_offset(int bpf_insn, int off,
>>  				 const struct jit_ctx *ctx)
>>  {
>> @@ -1073,8 +1081,7 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
>>  					    &func_addr, &func_addr_fixed);
>>  		if (ret < 0)
>>  			return ret;
>> -		emit_addr_mov_i64(tmp, func_addr, ctx);
>> -		emit(A64_BLR(tmp), ctx);
>> +		emit_call(func_addr, ctx);
>>  		emit(A64_MOV(1, r0, A64_R(0)), ctx);
>>  		break;
>>  	}
>> @@ -1418,6 +1425,13 @@ static int validate_code(struct jit_ctx *ctx)
>>  		if (a64_insn == AARCH64_BREAK_FAULT)
>>  			return -1;
>>  	}
>> +	return 0;
>> +}
>> +
>> +static int validate_ctx(struct jit_ctx *ctx)
>> +{
>> +	if (validate_code(ctx))
>> +		return -1;
>>  
>>  	if (WARN_ON_ONCE(ctx->exentry_idx != ctx->prog->aux->num_exentries))
>>  		return -1;
>> @@ -1547,7 +1561,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  	build_plt(&ctx, true);
>>  
>>  	/* 3. Extra pass to validate JITed code. */
>> -	if (validate_code(&ctx)) {
>> +	if (validate_ctx(&ctx)) {
>>  		bpf_jit_binary_free(header);
>>  		prog = orig_prog;
>>  		goto out_off;
>> @@ -1625,6 +1639,373 @@ bool bpf_jit_supports_subprog_tailcalls(void)
>>  	return true;
>>  }
>>  
>> +static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
>> +			    int args_off, int retval_off, int run_ctx_off,
>> +			    bool save_ret)
>> +{
>> +	u32 *branch;
>> +	u64 enter_prog;
>> +	u64 exit_prog;
>> +	u8 tmp = bpf2a64[TMP_REG_1];
> 
> I wonder if we should stick with A64_R(x) rather than bpf2a64[y]. After
> all this isn't translated BPF code but direct arm64 assembly. In any case
> it should be consistent (below functions use A64_R(10))
> 

will replace it with A64_R, thanks.

>> +	u8 r0 = bpf2a64[BPF_REG_0];
>> +	struct bpf_prog *p = l->link.prog;
>> +	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
>> +
>> +	if (p->aux->sleepable) {
>> +		enter_prog = (u64)__bpf_prog_enter_sleepable;
>> +		exit_prog = (u64)__bpf_prog_exit_sleepable;
>> +	} else {
>> +		enter_prog = (u64)__bpf_prog_enter;
>> +		exit_prog = (u64)__bpf_prog_exit;
>> +	}
>> +
>> +	if (l->cookie == 0) {
>> +		/* if cookie is zero, one instruction is enough to store it */
>> +		emit(A64_STR64I(A64_ZR, A64_SP, run_ctx_off + cookie_off), ctx);
>> +	} else {
>> +		emit_a64_mov_i64(tmp, l->cookie, ctx);
>> +		emit(A64_STR64I(tmp, A64_SP, run_ctx_off + cookie_off), ctx);
>> +	}
>> +
>> +	/* save p to callee saved register x19 to avoid loading p with mov_i64
>> +	 * each time.
>> +	 */
>> +	emit_addr_mov_i64(A64_R(19), (const u64)p, ctx);
>> +
>> +	/* arg1: prog */
>> +	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
>> +	/* arg2: &run_ctx */
>> +	emit(A64_ADD_I(1, A64_R(1), A64_SP, run_ctx_off), ctx);
>> +
>> +	emit_call(enter_prog, ctx);
>> +
>> +	/* if (__bpf_prog_enter(prog) == 0)
>> +	 *         goto skip_exec_of_prog;
>> +	 */
>> +	branch = ctx->image + ctx->idx;
>> +	emit(A64_NOP, ctx);
>> +
>> +	/* save return value to callee saved register x20 */
>> +	emit(A64_MOV(1, A64_R(20), r0), ctx);
> 
> Shouldn't that be x0?  r0 is x7
> 

Yes, it should be x0, thanks.
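
i.e. something like this minimal fix, reusing the existing helpers (untested
sketch):

        /* save the return value of __bpf_prog_enter() (in x0) to x20 */
        emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);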

>> +
>> +	emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
>> +	if (!p->jited)
>> +		emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
>> +
>> +	emit_call((const u64)p->bpf_func, ctx);
>> +
>> +	/* store return value */
>> +	if (save_ret)
>> +		emit(A64_STR64I(r0, A64_SP, retval_off), ctx);
> 
> Here too I think it should be x0. I'm guessing r0 may work for jitted
> functions but not interpreted ones
> 

Yes, r0 is only correct for jitted code, will fix it to:

if (save_ret)
        emit(A64_STR64I(p->jited ? r0 : A64_R(0), A64_SP, retval_off),
             ctx);

>> +
>> +	if (ctx->image) {
>> +		int offset = &ctx->image[ctx->idx] - branch;
>> +		*branch = A64_CBZ(1, A64_R(0), offset);
>> +	}
>> +
>> +	/* arg1: prog */
>> +	emit(A64_MOV(1, A64_R(0), A64_R(19)), ctx);
>> +	/* arg2: start time */
>> +	emit(A64_MOV(1, A64_R(1), A64_R(20)), ctx);
> 
> By the way, it looks like the timestamp could be moved into
> bpf_tramp_run_ctx now?  Nothing to do with this series, just a general
> cleanup
>

It should work, but I haven't tested it.

>> +	/* arg3: &run_ctx */
>> +	emit(A64_ADD_I(1, A64_R(2), A64_SP, run_ctx_off), ctx);
>> +
>> +	emit_call(exit_prog, ctx);
>> +}
>> +
>> +static void invoke_bpf_mod_ret(struct jit_ctx *ctx, struct bpf_tramp_links *tl,
>> +			       int args_off, int retval_off, int run_ctx_off,
>> +			       u32 **branches)
>> +{
>> +	int i;
>> +
>> +	/* The first fmod_ret program will receive a garbage return value.
>> +	 * Set this to 0 to avoid confusing the program.
>> +	 */
>> +	emit(A64_STR64I(A64_ZR, A64_SP, retval_off), ctx);
>> +	for (i = 0; i < tl->nr_links; i++) {
>> +		invoke_bpf_prog(ctx, tl->links[i], args_off, retval_off,
>> +				run_ctx_off, true);
>> +		/* if (*(u64 *)(sp + retval_off) !=  0)
>> +		 *	goto do_fexit;
>> +		 */
>> +		emit(A64_LDR64I(A64_R(10), A64_SP, retval_off), ctx);
>> +		/* Save the location of branch, and generate a nop.
>> +		 * This nop will be replaced with a cbnz later.
>> +		 */
>> +		branches[i] = ctx->image + ctx->idx;
>> +		emit(A64_NOP, ctx);
>> +	}
>> +}
>> +
>> +static void save_args(struct jit_ctx *ctx, int args_off, int nargs)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nargs; i++) {
>> +		emit(A64_STR64I(i, A64_SP, args_off), ctx);
>> +		args_off += 8;
>> +	}
>> +}
>> +
>> +static void restore_args(struct jit_ctx *ctx, int args_off, int nargs)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nargs; i++) {
>> +		emit(A64_LDR64I(i, A64_SP, args_off), ctx);
>> +		args_off += 8;
>> +	}
>> +}
>> +
>> +/* Based on the x86's implementation of arch_prepare_bpf_trampoline().
>> + *
>> + * bpf prog and function entry before bpf trampoline hooked:
>> + *   mov x9, lr
>> + *   nop
>> + *
>> + * bpf prog and function entry after bpf trampoline hooked:
>> + *   mov x9, lr
>> + *   bl  <bpf_trampoline or plt>
>> + *
>> + */
>> +static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>> +			      struct bpf_tramp_links *tlinks, void *orig_call,
>> +			      int nargs, u32 flags)
>> +{
>> +	int i;
>> +	int stack_size;
>> +	int retaddr_off;
>> +	int regs_off;
>> +	int retval_off;
>> +	int args_off;
>> +	int nargs_off;
>> +	int ip_off;
>> +	int run_ctx_off;
>> +	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
>> +	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
>> +	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>> +	bool save_ret;
>> +	u32 **branches = NULL;
>> +
>> +	/* trampoline stack layout:
>> +	 *                  [ parent ip         ]
> 
> nit: maybe s/ip/pc/ here and elsewhere
> 

It seems that "ip" is more consistent in the bpf world, e.g.:

int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
                              void *addr1, void *addr2)

static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void
                         *new_addr)
{
        void *ip = tr->func.addr;
        ...
}

struct bpf_tramp_image {
        ...
        void *ip_after_call;
        void *ip_epilogue;
        ...
};

>> +	 *                  [ FP                ]
>> +	 * SP + retaddr_off [ self ip           ]
>> +	 *                  [ FP                ]
>> +	 *
>> +	 *                  [ padding           ] align SP to multiples of 16
>> +	 *
>> +	 *                  [ x20               ] callee saved reg x20
>> +	 * SP + regs_off    [ x19               ] callee saved reg x19
>> +	 *
>> +	 * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
>> +	 *                                        BPF_TRAMP_F_RET_FENTRY_RET
>> +	 *
>> +	 *                  [ argN              ]
>> +	 *                  [ ...               ]
>> +	 * SP + args_off    [ arg1              ]
>> +	 *
>> +	 * SP + nargs_off   [ args count        ]
>> +	 *
>> +	 * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
>> +	 *
>> +	 * SP + run_ctx_off [ bpf_tramp_run_ctx ]
>> +	 */
>> +
>> +	stack_size = 0;
>> +	run_ctx_off = stack_size;
>> +	/* room for bpf_tramp_run_ctx */
>> +	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
>> +
>> +	ip_off = stack_size;
>> +	/* room for IP address argument */
>> +	if (flags & BPF_TRAMP_F_IP_ARG)
>> +		stack_size += 8;
>> +
>> +	nargs_off = stack_size;
>> +	/* room for args count */
>> +	stack_size += 8;
>> +
>> +	args_off = stack_size;
>> +	/* room for args */
>> +	stack_size += nargs * 8;
>> +
>> +	/* room for return value */
>> +	retval_off = stack_size;
>> +	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
>> +	if (save_ret)
>> +		stack_size += 8;
>> +
>> +	/* room for callee saved registers, currently x19 and x20 are used */
>> +	regs_off = stack_size;
>> +	stack_size += 16;
>> +
>> +	/* round up to multiples of 16 to avoid SPAlignmentFault */
>> +	stack_size = round_up(stack_size, 16);
>> +
>> +	/* return address is located above FP */
>> +	retaddr_off = stack_size + 8;
>> +
>> +	/* bpf trampoline may be invoked by 3 instruction types:
>> +	 * 1. bl, attached to bpf prog or kernel function via short jump
>> +	 * 2. br, attached to bpf prog or kernel function via long jump
>> +	 * 3. blr, working as a function pointer, used by struct_ops.
>> +	 * So BTI_JC should be used here to support both br and blr.
>> +	 */
>> +	emit_bti(A64_BTI_JC, ctx);
>> +
>> +	/* frame for parent function */
>> +	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
>> +	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
>> +
>> +	/* frame for patched function */
>> +	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
>> +	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
>> +
>> +	/* allocate stack space */
>> +	emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
>> +
>> +	if (flags & BPF_TRAMP_F_IP_ARG) {
>> +		/* save ip address of the traced function */
>> +		emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx);
>> +		emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx);
>> +	}
>> +
>> +	/* save args count */
>> +	emit(A64_MOVZ(1, A64_R(10), nargs, 0), ctx);
>> +	emit(A64_STR64I(A64_R(10), A64_SP, nargs_off), ctx);
>> +
>> +	/* save args */
>> +	save_args(ctx, args_off, nargs);
>> +
>> +	/* save callee saved registers */
>> +	emit(A64_STR64I(A64_R(19), A64_SP, regs_off), ctx);
>> +	emit(A64_STR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
>> +
>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>> +		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
>> +		emit_call((const u64)__bpf_tramp_enter, ctx);
>> +	}
>> +
>> +	for (i = 0; i < fentry->nr_links; i++)
>> +		invoke_bpf_prog(ctx, fentry->links[i], args_off,
>> +				retval_off, run_ctx_off,
>> +				flags & BPF_TRAMP_F_RET_FENTRY_RET);
>> +
>> +	if (fmod_ret->nr_links) {
>> +		branches = kcalloc(fmod_ret->nr_links, sizeof(u32 *),
>> +				   GFP_KERNEL);
>> +		if (!branches)
>> +			return -ENOMEM;
>> +
>> +		invoke_bpf_mod_ret(ctx, fmod_ret, args_off, retval_off,
>> +				   run_ctx_off, branches);
>> +	}
>> +
>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>> +		restore_args(ctx, args_off, nargs);
>> +		/* call original func */
>> +		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
>> +		emit(A64_BLR(A64_R(10)), ctx);
> 
> I don't think we can do this when BTI is enabled because we're not jumping
> to a BTI instruction. We could introduce one in a patched BPF function
> (there currently is one if CONFIG_ARM64_PTR_AUTH_KERNEL), but probably not
> in a kernel function.
> 
> We could do like FUNCTION_GRAPH_TRACER does and return to the patched
> function after modifying its LR. Not sure whether that works with pointer
> auth though.
> 

Yes, the blr instruction should be replaced with a ret instruction, thanks!

The layout for a bpf prog and a regular kernel function is as follows, with
bti always coming first and paciasp immediately after the patchsite, so the
ret instruction should work in all cases.

bpf prog or kernel function:
        bti c // if BTI
        mov x9, lr
        bl <trampoline>    ------> trampoline:
                                           ...
                                           mov lr, <return_entry>
                                           mov x10, <ORIG_CALL_entry>
ORIG_CALL_entry:           <-------        ret x10
                                   return_entry:
                                           ...
        paciasp // if PA
        ...
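
With that, the CALL_ORIG sequence could look roughly like the following
untested sketch (it assumes an A64_ADR()-style helper, which is not in
bpf_jit.h yet, to point lr at the return label):

        /* load the patched function's entry saved at retaddr_off */
        emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
        /* hypothetical helper: set lr to the instruction after "ret x10" */
        emit(A64_ADR(A64_LR, 2 * AARCH64_INSN_SIZE), ctx);
        /* branch to the patched function; unlike blr, ret does not touch lr */
        emit(A64_RET(A64_R(10)), ctx);
        /* store return value */
        emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);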

>> +		/* store return value */
>> +		emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);
>> +		/* reserve a nop for bpf_tramp_image_put */
>> +		im->ip_after_call = ctx->image + ctx->idx;
>> +		emit(A64_NOP, ctx);
>> +	}
>> +
>> +	/* update the branches saved in invoke_bpf_mod_ret with cbnz */
>> +	for (i = 0; i < fmod_ret->nr_links && ctx->image != NULL; i++) {
>> +		int offset = &ctx->image[ctx->idx] - branches[i];
>> +		*branches[i] = A64_CBNZ(1, A64_R(10), offset);
>> +	}
>> +
>> +	for (i = 0; i < fexit->nr_links; i++)
>> +		invoke_bpf_prog(ctx, fexit->links[i], args_off, retval_off,
>> +				run_ctx_off, false);
>> +
>> +	if (flags & BPF_TRAMP_F_RESTORE_REGS)
>> +		restore_args(ctx, args_off, nargs);
> 
> I guess the combination RESTORE_REGS | CALL_ORIG doesn't make much sense,
> but it's not disallowed by the documentation. So it might be safer to move
> this after the next if() to avoid clobbering the regs.
> 

agree, will move it, thanks.
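
i.e. roughly (untested), so that the restored x0-x7 are not clobbered by the
epilogue call:

        if (flags & BPF_TRAMP_F_CALL_ORIG) {
                im->ip_epilogue = ctx->image + ctx->idx;
                emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
                emit_call((const u64)__bpf_tramp_exit, ctx);
        }

        if (flags & BPF_TRAMP_F_RESTORE_REGS)
                restore_args(ctx, args_off, nargs);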

> Thanks,
> Jean
> 
>> +
>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>> +		im->ip_epilogue = ctx->image + ctx->idx;
>> +		emit_addr_mov_i64(A64_R(0), (const u64)im, ctx);
>> +		emit_call((const u64)__bpf_tramp_exit, ctx);
>> +	}
>> +
>> +	/* restore callee saved register x19 and x20 */
>> +	emit(A64_LDR64I(A64_R(19), A64_SP, regs_off), ctx);
>> +	emit(A64_LDR64I(A64_R(20), A64_SP, regs_off + 8), ctx);
>> +
>> +	if (save_ret)
>> +		emit(A64_LDR64I(A64_R(0), A64_SP, retval_off), ctx);
>> +
>> +	/* reset SP  */
>> +	emit(A64_MOV(1, A64_SP, A64_FP), ctx);
>> +
>> +	/* pop frames  */
>> +	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>> +	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
>> +
>> +	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
>> +		/* skip patched function, return to parent */
>> +		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
>> +		emit(A64_RET(A64_R(9)), ctx);
>> +	} else {
>> +		/* return to patched function */
>> +		emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
>> +		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
>> +		emit(A64_RET(A64_R(10)), ctx);
>> +	}
>> +
>> +	if (ctx->image)
>> +		bpf_flush_icache(ctx->image, ctx->image + ctx->idx);
>> +
>> +	kfree(branches);
>> +
>> +	return ctx->idx;
>> +}
>> +
>> +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
>> +				void *image_end, const struct btf_func_model *m,
>> +				u32 flags, struct bpf_tramp_links *tlinks,
>> +				void *orig_call)
>> +{
>> +	int ret;
>> +	int nargs = m->nr_args;
>> +	int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE;
>> +	struct jit_ctx ctx = {
>> +		.image = NULL,
>> +		.idx = 0,
>> +	};
>> +
>> +	/* the first 8 arguments are passed by registers */
>> +	if (nargs > 8)
>> +		return -ENOTSUPP;
>> +
>> +	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (ret > max_insns)
>> +		return -EFBIG;
>> +
>> +	ctx.image = image;
>> +	ctx.idx = 0;
>> +
>> +	jit_fill_hole(image, (unsigned int)(image_end - image));
>> +	ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nargs, flags);
>> +
>> +	if (ret > 0 && validate_code(&ctx) < 0)
>> +		ret = -EINVAL;
>> +
>> +	if (ret > 0)
>> +		ret *= AARCH64_INSN_SIZE;
>> +
>> +	return ret;
>> +}
>> +
>>  static bool is_long_jump(void *ip, void *target)
>>  {
>>  	long offset;
>> -- 
>> 2.30.2
>>
> .


* Re: [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline for arm64
  2022-07-08  4:35       ` Xu Kuohai
@ 2022-07-08  8:24         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 42+ messages in thread
From: Jean-Philippe Brucker @ 2022-07-08  8:24 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On Fri, Jul 08, 2022 at 12:35:33PM +0800, Xu Kuohai wrote:
> >> +
> >> +	emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
> >> +	if (!p->jited)
> >> +		emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
> >> +
> >> +	emit_call((const u64)p->bpf_func, ctx);
> >> +
> >> +	/* store return value */
> >> +	if (save_ret)
> >> +		emit(A64_STR64I(r0, A64_SP, retval_off), ctx);
> > 
> > Here too I think it should be x0. I'm guessing r0 may work for jitted
> > functions but not interpreted ones
> > 
> 
> Yes, r0 is only correct for jitted code, will fix it to:
> 
> if (save_ret)
>         emit(A64_STR64I(p->jited ? r0 : A64_R(0), A64_SP, retval_off),
>              ctx);

I don't think we need this test because x0 should be correct in all cases.
x7 happens to equal x0 when jitted due to the way build_epilogue() builds
the function at the moment, but we shouldn't rely on that.
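
i.e. the store could simply be (sketch):

        /* x0 holds the return value for both jitted and interpreted progs */
        emit(A64_STR64I(A64_R(0), A64_SP, retval_off), ctx);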


> >> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> >> +		restore_args(ctx, args_off, nargs);
> >> +		/* call original func */
> >> +		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
> >> +		emit(A64_BLR(A64_R(10)), ctx);
> > 
> > I don't think we can do this when BTI is enabled because we're not jumping
> > to a BTI instruction. We could introduce one in a patched BPF function
> > (there currently is one if CONFIG_ARM64_PTR_AUTH_KERNEL), but probably not
> > in a kernel function.
> > 
> > We could do like FUNCTION_GRAPH_TRACER does and return to the patched
> > function after modifying its LR. Not sure whether that works with pointer
> > auth though.
> > 
> 
> Yes, the blr instruction should be replaced with a ret instruction, thanks!
> 
> The layout for a bpf prog and a regular kernel function is as follows, with
> bti always coming first and paciasp immediately after the patchsite, so the
> ret instruction should work in all cases.
> 
> bpf prog or kernel function:
>         bti c // if BTI
>         mov x9, lr
>         bl <trampoline>    ------> trampoline:
>                                            ...
>                                            mov lr, <return_entry>
>                                            mov x10, <ORIG_CALL_entry>
> ORIG_CALL_entry:           <-------        ret x10
>                                    return_entry:
>                                            ...
>         paciasp // if PA
>         ...

Actually I just noticed that CONFIG_ARM64_BTI_KERNEL depends on
CONFIG_ARM64_PTR_AUTH_KERNEL, so we should be able to rely on there always
being a PACIASP at ORIG_CALL_entry, and since it's a landing pad for BLR
we don't need to make this a RET

 92e2294d870b ("arm64: bti: Support building kernel C code using BTI")

Thanks,
Jean


* Re: [PATCH bpf-next v6 3/4] bpf, arm64: Implement bpf_arch_text_poke() for arm64
  2022-07-08  2:41       ` Xu Kuohai
@ 2022-07-08  8:25         ` Jean-Philippe Brucker
  -1 siblings, 0 replies; 42+ messages in thread
From: Jean-Philippe Brucker @ 2022-07-08  8:25 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On Fri, Jul 08, 2022 at 10:41:46AM +0800, Xu Kuohai wrote:
> >> +/* generated prologue:
> >> + *      bti c // if CONFIG_ARM64_BTI_KERNEL
> >> + *      mov x9, lr
> >> + *      nop  // POKE_OFFSET
> >> + *      paciasp // if CONFIG_ARM64_PTR_AUTH_KERNEL
> > 
> > Any reason for the change regarding BTI and pointer auth?  We used to put
> > 'bti c' at the function entry if (BTI && !PA), or 'paciasp' if (BTI && PA),
> > because 'paciasp' is an implicit BTI.
> > 
> 
> Assuming paciasp is the first instruction if (BTI && PA), then when a
> trampoline with the BPF_TRAMP_F_CALL_ORIG flag is attached we'll encounter
> the following scenario.
> 
> bpf_prog:
>         paciasp // LR1
>         mov x9, lr
>         bl <trampoline> ----> trampoline:
>                                       ....
>                                       mov x10, <entry_for_CALL_ORIG>
>                                       blr x10
>                                         |
> CALL_ORIG_entry:                        |
>         bti c        <------------------|
>         stp x29, lr, [sp, #- 16]!
>         ...
>         autiasp // LR2
>         ret
> 
> Because LR1 and LR2 are not equal, the autiasp will fail!
> 
> To make this scenario work properly, the first instruction should be
> 'bti c'.

Right, my mistake. This layout is also what GCC generates for normal kernel
functions when (BTI && PA), so it makes sense to use the same here.
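
As a minimal sketch of the agreed ordering (the A64_* hint macros and the
IS_ENABLED() checks below are assumed from the patch under review, so take
the exact names as illustrative), the prologue would be emitted roughly as:

	/* bti c at the entry: bpf progs are entered via blr, and with
	 * paciasp moved after the patchsite it no longer provides the
	 * implicit landing pad at the entry
	 */
	if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
		emit(A64_BTI_C, ctx);
	emit(A64_MOV(1, A64_R(9), A64_R(30)), ctx);	/* mov x9, lr */
	emit(A64_NOP, ctx);				/* nop, the POKE_OFFSET patchsite */
	/* paciasp after the patchsite: it signs the LR that the matching
	 * autiasp in the epilogue authenticates (so LR1 == LR2 even when
	 * the trampoline calls back here), and it is itself a landing
	 * pad for the trampoline's blr
	 */
	if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL))
		emit(A64_PACIASP, ctx);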

Thanks,
Jean

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline for arm64
  2022-07-08  8:24         ` Jean-Philippe Brucker
@ 2022-07-08  9:08           ` Xu Kuohai
  -1 siblings, 0 replies; 42+ messages in thread
From: Xu Kuohai @ 2022-07-08  9:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, Mark Rutland,
	Catalin Marinas, Will Deacon, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Jakub Kicinski, Jesper Dangaard Brouer,
	Russell King, James Morse, Hou Tao, Jason Wang

On 7/8/2022 4:24 PM, Jean-Philippe Brucker wrote:
> On Fri, Jul 08, 2022 at 12:35:33PM +0800, Xu Kuohai wrote:
>>>> +
>>>> +	emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
>>>> +	if (!p->jited)
>>>> +		emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
>>>> +
>>>> +	emit_call((const u64)p->bpf_func, ctx);
>>>> +
>>>> +	/* store return value */
>>>> +	if (save_ret)
>>>> +		emit(A64_STR64I(r0, A64_SP, retval_off), ctx);
>>>
>>> Here too I think it should be x0. I'm guessing r0 may work for jitted
>>> functions but not interpreted ones
>>>
>>
>> Yes, r0 is only correct for jitted code, will fix it to:
>>
>> if (save_ret)
>>         emit(A64_STR64I(p->jited ? r0 : A64_R(0), A64_SP, retval_off),
>>              ctx);
> 
> I don't think we need this test because x0 should be correct in all cases.
> x7 happens to equal x0 when jitted due to the way build_epilogue() builds
> the function at the moment, but we shouldn't rely on that.
> 
> 
>>>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>>>> +		restore_args(ctx, args_off, nargs);
>>>> +		/* call original func */
>>>> +		emit(A64_LDR64I(A64_R(10), A64_SP, retaddr_off), ctx);
>>>> +		emit(A64_BLR(A64_R(10)), ctx);
>>>
>>> I don't think we can do this when BTI is enabled because we're not jumping
>>> to a BTI instruction. We could introduce one in a patched BPF function
>>> (there currently is one if CONFIG_ARM64_PTR_AUTH_KERNEL), but probably not
>>> in a kernel function.
>>>
>>> We could do like FUNCTION_GRAPH_TRACER does and return to the patched
>>> function after modifying its LR. Not sure whether that works with pointer
>>> auth though.
>>>
>>
>> Yes, the blr instruction should be replaced with a ret instruction, thanks!
>>
>> The layout for a bpf prog and a regular kernel function is as follows, with
>> bti always coming first and paciasp immediately after the patchsite, so the
>> ret instruction should work in all cases.
>>
>> bpf prog or kernel function:
>>         bti c // if BTI
>>         mov x9, lr
>>         bl <trampoline>    ------> trampoline:
>>                                            ...
>>                                            mov lr, <return_entry>
>>                                            mov x10, <ORIG_CALL_entry>
>> ORIG_CALL_entry:           <-------        ret x10
>>                                    return_entry:
>>                                            ...
>>         paciasp // if PA
>>         ...
> 
> Actually, I just noticed that CONFIG_ARM64_BTI_KERNEL depends on
> CONFIG_ARM64_PTR_AUTH_KERNEL, so we should be able to rely on there always
> being a PACIASP at ORIG_CALL_entry. Since that PACIASP is a landing pad for
> BLR, we don't need to make this a RET:
> 
>  92e2294d870b ("arm64: bti: Support building kernel C code using BTI")
> 

oh, yeah, thanks

> Thanks,
> Jean
> 
> .


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2022-07-08  9:10 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-25 16:12 [PATCH bpf-next v6 0/4] bpf trampoline for arm64 Xu Kuohai
2022-06-25 16:12 ` Xu Kuohai
2022-06-25 16:12 ` [PATCH bpf-next v6 1/4] bpf: Remove is_valid_bpf_tramp_flags() Xu Kuohai
2022-06-25 16:12   ` Xu Kuohai
2022-06-25 16:12 ` [PATCH bpf-next v6 2/4] arm64: Add LDR (literal) instruction Xu Kuohai
2022-06-25 16:12   ` Xu Kuohai
2022-07-05 16:39   ` Will Deacon
2022-07-05 16:39     ` Will Deacon
2022-07-06  1:43     ` Xu Kuohai
2022-07-06  1:43       ` Xu Kuohai
2022-06-25 16:12 ` [PATCH bpf-next v6 3/4] bpf, arm64: Implement bpf_arch_text_poke() for arm64 Xu Kuohai
2022-06-25 16:12   ` Xu Kuohai
2022-07-07 16:41   ` Jean-Philippe Brucker
2022-07-07 16:41     ` Jean-Philippe Brucker
2022-07-08  2:41     ` Xu Kuohai
2022-07-08  2:41       ` Xu Kuohai
2022-07-08  8:25       ` Jean-Philippe Brucker
2022-07-08  8:25         ` Jean-Philippe Brucker
2022-06-25 16:12 ` [PATCH bpf-next v6 4/4] bpf, arm64: bpf trampoline " Xu Kuohai
2022-06-25 16:12   ` Xu Kuohai
2022-07-07 16:37   ` Jean-Philippe Brucker
2022-07-07 16:37     ` Jean-Philippe Brucker
2022-07-08  4:35     ` Xu Kuohai
2022-07-08  4:35       ` Xu Kuohai
2022-07-08  8:24       ` Jean-Philippe Brucker
2022-07-08  8:24         ` Jean-Philippe Brucker
2022-07-08  9:08         ` Xu Kuohai
2022-07-08  9:08           ` Xu Kuohai
2022-06-30 21:12 ` [PATCH bpf-next v6 0/4] " Daniel Borkmann
2022-06-30 21:12   ` Daniel Borkmann
2022-07-05 16:00   ` Will Deacon
2022-07-05 16:00     ` Will Deacon
2022-07-05 18:34     ` KP Singh
2022-07-05 18:34       ` KP Singh
2022-07-07  3:35       ` Xu Kuohai
2022-07-07  3:35         ` Xu Kuohai
2022-07-06 16:08     ` Jean-Philippe Brucker
2022-07-06 16:08       ` Jean-Philippe Brucker
2022-07-06 16:11       ` Will Deacon
2022-07-06 16:11         ` Will Deacon
2022-07-07  2:56         ` Xu Kuohai
2022-07-07  2:56           ` Xu Kuohai
