* [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64
@ 2022-12-20  2:13 Pu Lehui
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call Pu Lehui
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Pu Lehui @ 2022-12-20  2:13 UTC (permalink / raw)
  To: bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui, Pu Lehui

BPF trampoline is a critical piece of infrastructure in
the bpf subsystem, acting as a mediator between kernel
functions and BPF programs. Numerous important features,
such as using eBPF programs for zero-overhead kernel
introspection, rely on this key component. We can't wait
to support bpf trampoline on RV64. The implementation
closely follows x86 and arm64 to ease future development.

As most riscv CPUs support unaligned memory accesses, we
temporarily use patch [1] to facilitate testing. The test
results are as follows, and test_verifier reports no new
failure cases.

- fexit_test:OK
- fentry_test:OK
- fentry_fexit:OK
- fexit_stress:OK
- fexit_bpf2bpf:OK
- dummy_st_ops:OK
- modify_return:OK
- get_func_ip_test:OK
- get_func_args_test:OK
- trampoline_count:OK

[1] https://lore.kernel.org/linux-riscv/20210916130855.4054926-2-chenhuang5@huawei.com/
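
For reference, the trampoline-related selftests above can be run via
the standard BPF selftests. An illustrative invocation (test names as
listed above; paths and cross/QEMU setup left to the reader) is:

  # from tools/testing/selftests/bpf
  ./test_progs -t fentry_test,fexit_test,fentry_fexit,fexit_stress
  ./test_progs -t fexit_bpf2bpf,dummy_st_ops,modify_return
  ./test_progs -t get_func_ip_test,get_func_args_test,trampoline_count
  ./test_verifier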

Pu Lehui (4):
  bpf: Rollback to text_poke when arch not supported ftrace direct call
  riscv, bpf: Factor out emit_call for kernel and bpf context
  riscv, bpf: Add bpf_arch_text_poke support for RV64
  riscv, bpf: Add bpf trampoline support for RV64

 arch/riscv/net/bpf_jit.h        |   5 +
 arch/riscv/net/bpf_jit_comp64.c | 483 ++++++++++++++++++++++++++++++--
 kernel/bpf/trampoline.c         |   8 +-
 3 files changed, 471 insertions(+), 25 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call
  2022-12-20  2:13 [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64 Pu Lehui
@ 2022-12-20  2:13 ` Pu Lehui
  2022-12-20  2:32   ` Xu Kuohai
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 2/4] riscv, bpf: Factor out emit_call for kernel and bpf context Pu Lehui
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Pu Lehui @ 2022-12-20  2:13 UTC (permalink / raw)
  To: bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui, Pu Lehui

From: Pu Lehui <pulehui@huawei.com>

The bpf trampoline currently attaches to kernel functions via the ftrace
direct call API, while text_poke is used for bpf2bpf attachment and tail
call optimization. For architectures that do not support ftrace direct
calls, text_poke is still able to attach a bpf trampoline to kernel
functions. Relax the check by falling back to text_poke when the
architecture does not support ftrace direct calls.
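
For clarity, the attach decision in register_fentry() after this patch
boils down to the following (a sketch based on the hunk below, with the
ftrace-managed branch and error handling elided):

  void *ip = tr->func.addr;

  /* ftrace-managed only when the arch supports ftrace direct calls and
   * ip is a real ftrace patch site; otherwise fall back to text poke.
   */
  if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) &&
      ftrace_location((unsigned long)ip))
          tr->func.ftrace_managed = true;

  ...

  if (tr->func.ftrace_managed)
          ret = ...;      /* ftrace direct call API */
  else
          ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);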

Signed-off-by: Pu Lehui <pulehui@huawei.com>
---
 kernel/bpf/trampoline.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index d6395215b849..386197a7952c 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -228,15 +228,11 @@ static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_ad
 static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
 {
 	void *ip = tr->func.addr;
-	unsigned long faddr;
 	int ret;
 
-	faddr = ftrace_location((unsigned long)ip);
-	if (faddr) {
-		if (!tr->fops)
-			return -ENOTSUPP;
+	if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) &&
+	    !!ftrace_location((unsigned long)ip))
 		tr->func.ftrace_managed = true;
-	}
 
 	if (bpf_trampoline_module_get(tr))
 		return -ENOENT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH RESEND bpf-next 2/4] riscv, bpf: Factor out emit_call for kernel and bpf context
  2022-12-20  2:13 [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64 Pu Lehui
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call Pu Lehui
@ 2022-12-20  2:13 ` Pu Lehui
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64 Pu Lehui
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Pu Lehui @ 2022-12-20  2:13 UTC (permalink / raw)
  To: bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui, Pu Lehui

From: Pu Lehui <pulehui@huawei.com>

The current emit_call function is not suitable for calling
kernel functions, as it stores the return value into the
bpf R0 register. Separate that out so emit_call can be
shared by kernel and bpf contexts. Meanwhile, simplify the
selection logic: a fixed function address can use jal or
auipc+jalr, while a non-fixed address can only use
auipc+jalr.
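
To illustrate the resulting split (a sketch based on the hunks below):
kernel-context callers get a plain call that leaves the return value in
a0, while the BPF_JMP | BPF_CALL path moves a0 into BPF R0 explicitly
at the call site:

  /* kernel-context call, e.g. from upcoming trampoline code
   * (kernel_func_addr is just a placeholder here)
   */
  ret = emit_call(kernel_func_addr, true, ctx);

  /* bpf-context call (BPF_JMP | BPF_CALL) */
  ret = emit_call(addr, fixed_addr, ctx);
  if (ret)
          return ret;
  emit_mv(bpf_to_rv_reg(BPF_REG_0, ctx), RV_REG_A0, ctx);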

Signed-off-by: Pu Lehui <pulehui@huawei.com>
---
 arch/riscv/net/bpf_jit_comp64.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 5b568ba6dcfe..bf4721a99a09 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -428,12 +428,12 @@ static void emit_sext_32_rd(u8 *rd, struct rv_jit_context *ctx)
 	*rd = RV_REG_T2;
 }
 
-static int emit_jump_and_link(u8 rd, s64 rvoff, bool force_jalr,
+static int emit_jump_and_link(u8 rd, s64 rvoff, bool fixed_addr,
 			      struct rv_jit_context *ctx)
 {
 	s64 upper, lower;
 
-	if (rvoff && is_21b_int(rvoff) && !force_jalr) {
+	if (rvoff && fixed_addr && is_21b_int(rvoff)) {
 		emit(rv_jal(rd, rvoff >> 1), ctx);
 		return 0;
 	} else if (in_auipc_jalr_range(rvoff)) {
@@ -454,24 +454,17 @@ static bool is_signed_bpf_cond(u8 cond)
 		cond == BPF_JSGE || cond == BPF_JSLE;
 }
 
-static int emit_call(bool fixed, u64 addr, struct rv_jit_context *ctx)
+static int emit_call(u64 addr, bool fixed_addr, struct rv_jit_context *ctx)
 {
 	s64 off = 0;
 	u64 ip;
-	u8 rd;
-	int ret;
 
 	if (addr && ctx->insns) {
 		ip = (u64)(long)(ctx->insns + ctx->ninsns);
 		off = addr - ip;
 	}
 
-	ret = emit_jump_and_link(RV_REG_RA, off, !fixed, ctx);
-	if (ret)
-		return ret;
-	rd = bpf_to_rv_reg(BPF_REG_0, ctx);
-	emit_mv(rd, RV_REG_A0, ctx);
-	return 0;
+	return emit_jump_and_link(RV_REG_RA, off, fixed_addr, ctx);
 }
 
 static void emit_atomic(u8 rd, u8 rs, s16 off, s32 imm, bool is64,
@@ -913,7 +906,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 	/* JUMP off */
 	case BPF_JMP | BPF_JA:
 		rvoff = rv_offset(i, off, ctx);
-		ret = emit_jump_and_link(RV_REG_ZERO, rvoff, false, ctx);
+		ret = emit_jump_and_link(RV_REG_ZERO, rvoff, true, ctx);
 		if (ret)
 			return ret;
 		break;
@@ -1032,17 +1025,20 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 	/* function call */
 	case BPF_JMP | BPF_CALL:
 	{
-		bool fixed;
+		bool fixed_addr;
 		u64 addr;
 
 		mark_call(ctx);
-		ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass, &addr,
-					    &fixed);
+		ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
+					    &addr, &fixed_addr);
 		if (ret < 0)
 			return ret;
-		ret = emit_call(fixed, addr, ctx);
+
+		ret = emit_call(addr, fixed_addr, ctx);
 		if (ret)
 			return ret;
+
+		emit_mv(bpf_to_rv_reg(BPF_REG_0, ctx), RV_REG_A0, ctx);
 		break;
 	}
 	/* tail call */
@@ -1057,7 +1053,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 			break;
 
 		rvoff = epilogue_offset(ctx);
-		ret = emit_jump_and_link(RV_REG_ZERO, rvoff, false, ctx);
+		ret = emit_jump_and_link(RV_REG_ZERO, rvoff, true, ctx);
 		if (ret)
 			return ret;
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64
  2022-12-20  2:13 [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64 Pu Lehui
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call Pu Lehui
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 2/4] riscv, bpf: Factor out emit_call for kernel and bpf context Pu Lehui
@ 2022-12-20  2:13 ` Pu Lehui
  2022-12-20  3:00   ` Xu Kuohai
  2023-01-03 18:12   ` Björn Töpel
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline " Pu Lehui
  2022-12-22 13:00 ` [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline " Björn Töpel
  4 siblings, 2 replies; 18+ messages in thread
From: Pu Lehui @ 2022-12-20  2:13 UTC (permalink / raw)
  To: bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui, Pu Lehui

From: Pu Lehui <pulehui@huawei.com>

Implement bpf_arch_text_poke for RV64. For the call scenario,
the ftrace framework reserves 4 nops at the entry of each RV64
kernel function and uses auipc+jalr instructions to call kernel
or module functions. However, since patching the auipc+jalr
pair is not an atomic operation, we need to use stop_machine to
make sure the instructions are patched in an atomic context.
As for the jump scenario, since we only jump inside the
trampoline, a single jal instruction is sufficient.
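
Concretely, the 4 reserved instructions at the patch site are rewritten
by gen_call_or_nops() below into a sequence that spills ra, performs an
auipc+jalr call to the trampoline, and restores ra (a sketch; rvoff is
the pc-relative offset to the trampoline):

  emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);       /* sd ra, -8(sp)     */
  emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx); /* auipc t1; jalr ra */
  emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);       /* ld ra, -8(sp)     */

This is also why the tail-call path in __build_epilogue() now skips 20
bytes (the 4 reserved instructions plus the TCC init) instead of 4.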

Signed-off-by: Pu Lehui <pulehui@huawei.com>
---
 arch/riscv/net/bpf_jit.h        |   5 ++
 arch/riscv/net/bpf_jit_comp64.c | 131 +++++++++++++++++++++++++++++++-
 2 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/net/bpf_jit.h b/arch/riscv/net/bpf_jit.h
index d926e0f7ef57..bf9802a63061 100644
--- a/arch/riscv/net/bpf_jit.h
+++ b/arch/riscv/net/bpf_jit.h
@@ -573,6 +573,11 @@ static inline u32 rv_fence(u8 pred, u8 succ)
 	return rv_i_insn(imm11_0, 0, 0, 0, 0xf);
 }
 
+static inline u32 rv_nop(void)
+{
+	return rv_i_insn(0, 0, 0, 0, 0x13);
+}
+
 /* RVC instrutions. */
 
 static inline u16 rvc_addi4spn(u8 rd, u32 imm10)
diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index bf4721a99a09..fa8b03c52463 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -8,6 +8,8 @@
 #include <linux/bitfield.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/memory.h>
+#include <linux/stop_machine.h>
 #include "bpf_jit.h"
 
 #define RV_REG_TCC RV_REG_A6
@@ -238,7 +240,7 @@ static void __build_epilogue(bool is_tail_call, struct rv_jit_context *ctx)
 	if (!is_tail_call)
 		emit_mv(RV_REG_A0, RV_REG_A5, ctx);
 	emit_jalr(RV_REG_ZERO, is_tail_call ? RV_REG_T3 : RV_REG_RA,
-		  is_tail_call ? 4 : 0, /* skip TCC init */
+		  is_tail_call ? 20 : 0, /* skip reserved nops and TCC init */
 		  ctx);
 }
 
@@ -615,6 +617,127 @@ static int add_exception_handler(const struct bpf_insn *insn,
 	return 0;
 }
 
+struct text_poke_args {
+	void *addr;
+	const void *insns;
+	size_t len;
+	atomic_t cpu_count;
+};
+
+static int do_text_poke(void *data)
+{
+	int ret = 0;
+	struct text_poke_args *patch = data;
+
+	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
+		ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
+		atomic_inc(&patch->cpu_count);
+	} else {
+		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
+			cpu_relax();
+		smp_mb();
+	}
+
+	return ret;
+}
+
+static int bpf_text_poke_stop_machine(void *addr, const void *insns, size_t len)
+{
+	struct text_poke_args patch = {
+		.addr = addr,
+		.insns = insns,
+		.len = len,
+		.cpu_count = ATOMIC_INIT(0),
+	};
+
+	return stop_machine(do_text_poke, &patch, cpu_online_mask);
+}
+
+static int gen_call_or_nops(void *target, void *ip, u32 *insns)
+{
+	int i, ret;
+	s64 rvoff;
+	struct rv_jit_context ctx;
+
+	ctx.ninsns = 0;
+	ctx.insns = (u16 *)insns;
+
+	if (!target) {
+		for (i = 0; i < 4; i++)
+			emit(rv_nop(), &ctx);
+		return 0;
+	}
+
+	rvoff = (s64)(target - ip);
+	emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);
+	ret = emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx);
+	if (ret)
+		return ret;
+	emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);
+
+	return 0;
+
+}
+
+static int bpf_text_poke_call(void *ip, void *old_addr, void *new_addr)
+{
+	int ret;
+	u32 old_insns[4], new_insns[4];
+
+	ret = gen_call_or_nops(old_addr, ip + 4, old_insns);
+	if (ret)
+		return ret;
+
+	ret = gen_call_or_nops(new_addr, ip + 4, new_insns);
+	if (ret)
+		return ret;
+
+	mutex_lock(&text_mutex);
+	if (memcmp(ip, old_insns, sizeof(old_insns))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (memcmp(ip, new_insns, sizeof(new_insns)))
+		ret = bpf_text_poke_stop_machine(ip, new_insns, sizeof(new_insns));
+out:
+	mutex_unlock(&text_mutex);
+	return ret;
+}
+
+static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
+{
+	int ret;
+	u32 old_insn, new_insn;
+
+	old_insn = old_addr ? rv_jal(RV_REG_ZERO, (s64)(old_addr - ip) >> 1) : rv_nop();
+	new_insn = new_addr ? rv_jal(RV_REG_ZERO, (s64)(new_addr - ip) >> 1) : rv_nop();
+
+	mutex_lock(&text_mutex);
+	if (memcmp(ip, &old_insn, sizeof(old_insn))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (memcmp(ip, &new_insn, sizeof(new_insn)))
+		ret = patch_text_nosync(ip, &new_insn, sizeof(new_insn));
+out:
+	mutex_unlock(&text_mutex);
+	return ret;
+}
+
+int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
+		       void *old_addr, void *new_addr)
+{
+	if (!is_kernel_text((unsigned long)ip) &&
+	    !is_bpf_text_address((unsigned long)ip))
+		return -ENOTSUPP;
+
+	return poke_type == BPF_MOD_CALL ?
+	       bpf_text_poke_call(ip, old_addr, new_addr) :
+	       bpf_text_poke_jump(ip, old_addr, new_addr);
+}
+
 int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 		      bool extra_pass)
 {
@@ -1266,7 +1389,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 
 void bpf_jit_build_prologue(struct rv_jit_context *ctx)
 {
-	int stack_adjust = 0, store_offset, bpf_stack_adjust;
+	int i, stack_adjust = 0, store_offset, bpf_stack_adjust;
 	bool is_main_prog = ctx->prog->aux->func_idx == 0;
 
 	bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
@@ -1294,6 +1417,10 @@ void bpf_jit_build_prologue(struct rv_jit_context *ctx)
 
 	store_offset = stack_adjust - 8;
 
+	/* reserve 4 nop insns */
+	for (i = 0; i < 4; i++)
+		emit(rv_nop(), ctx);
+
 	/* First instruction is always setting the tail-call-counter
 	 * (TCC) register. This instruction is skipped for tail calls.
 	 * Force using a 4-byte (non-compressed) instruction.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline support for RV64
  2022-12-20  2:13 [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64 Pu Lehui
                   ` (2 preceding siblings ...)
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64 Pu Lehui
@ 2022-12-20  2:13 ` Pu Lehui
  2023-01-03 19:16   ` Björn Töpel
  2022-12-22 13:00 ` [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline " Björn Töpel
  4 siblings, 1 reply; 18+ messages in thread
From: Pu Lehui @ 2022-12-20  2:13 UTC (permalink / raw)
  To: bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui, Pu Lehui

From: Pu Lehui <pulehui@huawei.com>

BPF trampoline is a critical piece of infrastructure in
the bpf subsystem, acting as a mediator between kernel
functions and BPF programs. Numerous important features,
such as using eBPF programs for zero-overhead kernel
introspection, rely on this key component. We can't wait
to support bpf trampoline on RV64. The implementation
closely follows x86 and arm64 to ease future development.
The related trampoline selftests pass, and test_verifier
reports no new failure cases.
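
As a usage illustration (not part of this patch), with the trampoline
in place a fentry program can attach to a kernel function on RV64. The
BPF side of the fentry_test selftest looks roughly like this:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  __u64 test1_result = 0;

  SEC("fentry/bpf_fentry_test1")
  int BPF_PROG(test1, int a)
  {
          /* runs through the JITed trampoline on entry to
           * bpf_fentry_test1()
           */
          test1_result = (a == 1);
          return 0;
  }

  char _license[] SEC("license") = "GPL";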

Signed-off-by: Pu Lehui <pulehui@huawei.com>
---
 arch/riscv/net/bpf_jit_comp64.c | 322 ++++++++++++++++++++++++++++++++
 1 file changed, 322 insertions(+)

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index fa8b03c52463..11c001782e7b 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -738,6 +738,328 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
 	       bpf_text_poke_jump(ip, old_addr, new_addr);
 }
 
+static void store_args(int nregs, int args_off, struct rv_jit_context *ctx)
+{
+	int i;
+
+	for (i = 0; i < nregs; i++) {
+		emit_sd(RV_REG_FP, -args_off, RV_REG_A0 + i, ctx);
+		args_off -= 8;
+	}
+}
+
+static void restore_args(int nregs, int args_off, struct rv_jit_context *ctx)
+{
+	int i;
+
+	for (i = 0; i < nregs; i++) {
+		emit_ld(RV_REG_A0 + i, -args_off, RV_REG_FP, ctx);
+		args_off -= 8;
+	}
+}
+
+static int invoke_bpf_prog(struct bpf_tramp_link *l, int args_off, int retval_off,
+			   int run_ctx_off, bool save_ret, struct rv_jit_context *ctx)
+{
+	u32 insn;
+	int ret, branch_off, offset;
+	struct bpf_prog *p = l->link.prog;
+	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
+
+	if (l->cookie) {
+		emit_imm(RV_REG_T1, l->cookie, ctx);
+		emit_sd(RV_REG_FP, -run_ctx_off + cookie_off, RV_REG_T1, ctx);
+	} else {
+		emit_sd(RV_REG_FP, -run_ctx_off + cookie_off, RV_REG_ZERO, ctx);
+	}
+
+	/* arg1: prog */
+	emit_imm(RV_REG_A0, (const s64)p, ctx);
+	/* arg2: &run_ctx */
+	emit_addi(RV_REG_A1, RV_REG_FP, -run_ctx_off, ctx);
+	ret = emit_call((const u64)bpf_trampoline_enter(p), true, ctx);
+	if (ret)
+		return ret;
+
+	/* if (__bpf_prog_enter(prog) == 0)
+	 *	goto skip_exec_of_prog;
+	 */
+	branch_off = ctx->ninsns;
+	/* nop reserved for conditional jump */
+	emit(rv_nop(), ctx);
+
+	/* store prog start time */
+	emit_mv(RV_REG_S1, RV_REG_A0, ctx);
+
+	/* arg1: &args_off */
+	emit_addi(RV_REG_A0, RV_REG_FP, -args_off, ctx);
+	if (!p->jited)
+		/* arg2: progs[i]->insnsi for interpreter */
+		emit_imm(RV_REG_A1, (const s64)p->insnsi, ctx);
+	ret = emit_call((const u64)p->bpf_func, true, ctx);
+	if (ret)
+		return ret;
+
+	if (save_ret)
+		emit_sd(RV_REG_FP, -retval_off, regmap[BPF_REG_0], ctx);
+
+	/* update branch with beqz */
+	offset = ninsns_rvoff(ctx->ninsns - branch_off);
+	insn = rv_beq(RV_REG_A0, RV_REG_ZERO, offset >> 1);
+	*(u32 *)(ctx->insns + branch_off) = insn;
+
+	/* arg1: prog */
+	emit_imm(RV_REG_A0, (const s64)p, ctx);
+	/* arg2: prog start time */
+	emit_mv(RV_REG_A1, RV_REG_S1, ctx);
+	/* arg3: &run_ctx */
+	emit_addi(RV_REG_A2, RV_REG_FP, -run_ctx_off, ctx);
+	ret = emit_call((const u64)bpf_trampoline_exit(p), true, ctx);
+
+	return ret;
+}
+
+static int invoke_bpf_mod_ret(struct bpf_tramp_links *tl, int args_off, int retval_off,
+			      int run_ctx_off, int *branches_off, struct rv_jit_context *ctx)
+{
+	int i, ret;
+
+	/* cleanup to avoid garbage return value confusion */
+	emit_sd(RV_REG_FP, -retval_off, RV_REG_ZERO, ctx);
+	for (i = 0; i < tl->nr_links; i++) {
+		ret = invoke_bpf_prog(tl->links[i], args_off, retval_off,
+				run_ctx_off, true, ctx);
+		if (ret)
+			return ret;
+		emit_ld(RV_REG_T1, -retval_off, RV_REG_FP, ctx);
+		branches_off[i] = ctx->ninsns;
+		/* nop reserved for conditional jump */
+		emit(rv_nop(), ctx);
+	}
+
+	return 0;
+}
+
+static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im,
+					 const struct btf_func_model *m,
+					 struct bpf_tramp_links *tlinks,
+					 void *func_addr, u32 flags,
+					 struct rv_jit_context *ctx)
+{
+	int i, ret, offset;
+	int *branches_off = NULL;
+	int stack_size = 0, nregs = m->nr_args;
+	int retaddr_off, fp_off, retval_off, args_off;
+	int nregs_off, ip_off, run_ctx_off, sreg_off;
+	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
+	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
+	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
+	void *orig_call = func_addr;
+	bool save_ret;
+	u32 insn;
+
+	/* Generated trampoline stack layout:
+	 *
+	 * FP - 8	    [ RA of parent func	] return address of parent
+	 *					  function
+	 * FP - retaddr_off [ RA of traced func	] return address of traced
+	 *					  function
+	 * FP - fp_off	    [ FP of parent func ]
+	 *
+	 * FP - retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
+	 *					  BPF_TRAMP_F_RET_FENTRY_RET
+	 *                  [ argN              ]
+	 *                  [ ...               ]
+	 * FP - args_off    [ arg1              ]
+	 *
+	 * FP - nregs_off   [ regs count        ]
+	 *
+	 * FP - ip_off      [ traced func	] BPF_TRAMP_F_IP_ARG
+	 *
+	 * FP - run_ctx_off [ bpf_tramp_run_ctx ]
+	 *
+	 * FP - sreg_off    [ callee saved reg	]
+	 *
+	 *		    [ pads              ] pads for 16 bytes alignment
+	 */
+
+	if (flags & (BPF_TRAMP_F_ORIG_STACK | BPF_TRAMP_F_SHARE_IPMODIFY))
+		return -ENOTSUPP;
+
+	/* extra regiters for struct arguments */
+	for (i = 0; i < m->nr_args; i++)
+		if (m->arg_flags[i] & BTF_FMODEL_STRUCT_ARG)
+			nregs += round_up(m->arg_size[i], 8) / 8 - 1;
+
+	/* 8 arguments passed by registers */
+	if (nregs > 8)
+		return -ENOTSUPP;
+
+	/* room for parent function return address */
+	stack_size += 8;
+
+	stack_size += 8;
+	retaddr_off = stack_size;
+
+	stack_size += 8;
+	fp_off = stack_size;
+
+	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
+	if (save_ret) {
+		stack_size += 8;
+		retval_off = stack_size;
+	}
+
+	stack_size += nregs * 8;
+	args_off = stack_size;
+
+	stack_size += 8;
+	nregs_off = stack_size;
+
+	if (flags & BPF_TRAMP_F_IP_ARG) {
+		stack_size += 8;
+		ip_off = stack_size;
+	}
+
+	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
+	run_ctx_off = stack_size;
+
+	stack_size += 8;
+	sreg_off = stack_size;
+
+	stack_size = round_up(stack_size, 16);
+
+	emit_addi(RV_REG_SP, RV_REG_SP, -stack_size, ctx);
+
+	emit_sd(RV_REG_SP, stack_size - retaddr_off, RV_REG_RA, ctx);
+	emit_sd(RV_REG_SP, stack_size - fp_off, RV_REG_FP, ctx);
+
+	emit_addi(RV_REG_FP, RV_REG_SP, stack_size, ctx);
+
+	/* callee saved register S1 to pass start time */
+	emit_sd(RV_REG_FP, -sreg_off, RV_REG_S1, ctx);
+
+	/* store ip address of the traced function */
+	if (flags & BPF_TRAMP_F_IP_ARG) {
+		emit_imm(RV_REG_T1, (const s64)func_addr, ctx);
+		emit_sd(RV_REG_FP, -ip_off, RV_REG_T1, ctx);
+	}
+
+	emit_li(RV_REG_T1, nregs, ctx);
+	emit_sd(RV_REG_FP, -nregs_off, RV_REG_T1, ctx);
+
+	store_args(nregs, args_off, ctx);
+
+	/* skip to actual body of traced function */
+	if (flags & BPF_TRAMP_F_SKIP_FRAME)
+		orig_call += 16;
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		emit_imm(RV_REG_A0, (const s64)im, ctx);
+		ret = emit_call((const u64)__bpf_tramp_enter, true, ctx);
+		if (ret)
+			return ret;
+	}
+
+	for (i = 0; i < fentry->nr_links; i++) {
+		ret = invoke_bpf_prog(fentry->links[i], args_off, retval_off, run_ctx_off,
+				      flags & BPF_TRAMP_F_RET_FENTRY_RET, ctx);
+		if (ret)
+			return ret;
+	}
+
+	if (fmod_ret->nr_links) {
+		branches_off = kcalloc(fmod_ret->nr_links, sizeof(int), GFP_KERNEL);
+		if (!branches_off)
+			return -ENOMEM;
+
+		ret = invoke_bpf_mod_ret(fmod_ret, args_off, retval_off, run_ctx_off,
+					 branches_off, ctx);
+		if (ret)
+			return ret;
+	}
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		restore_args(nregs, args_off, ctx);
+		ret = emit_call((const u64)orig_call, true, ctx);
+		if (ret)
+			return ret;
+		emit_sd(RV_REG_FP, -retval_off, RV_REG_A0, ctx);
+		/* nop reserved for bpf_tramp_image_put */
+		im->ip_after_call = ctx->insns + ctx->ninsns;
+		emit(rv_nop(), ctx);
+	}
+
+	/* update branches saved in invoke_bpf_mod_ret with bnez */
+	for (i = 0; i < fmod_ret->nr_links; i++) {
+		offset = ninsns_rvoff(ctx->ninsns - branches_off[i]);
+		insn = rv_bne(RV_REG_T1, RV_REG_ZERO, offset >> 1);
+		*(u32 *)(ctx->insns + branches_off[i]) = insn;
+	}
+
+	for (i = 0; i < fexit->nr_links; i++) {
+		ret = invoke_bpf_prog(fexit->links[i], args_off, retval_off,
+				      run_ctx_off, false, ctx);
+		if (ret)
+			return ret;
+	}
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		im->ip_epilogue = ctx->insns + ctx->ninsns;
+		emit_imm(RV_REG_A0, (const s64)im, ctx);
+		ret = emit_call((const u64)__bpf_tramp_exit, true, ctx);
+		if (ret)
+			return ret;
+	}
+
+	if (flags & BPF_TRAMP_F_RESTORE_REGS)
+		restore_args(nregs, args_off, ctx);
+
+	if (save_ret)
+		emit_ld(RV_REG_A0, -retval_off, RV_REG_FP, ctx);
+
+	emit_ld(RV_REG_S1, -sreg_off, RV_REG_FP, ctx);
+
+	if (flags & BPF_TRAMP_F_SKIP_FRAME)
+		/* return address of parent function */
+		emit_ld(RV_REG_RA, stack_size - 8, RV_REG_SP, ctx);
+	else
+		/* return address of traced function */
+		emit_ld(RV_REG_RA, stack_size - retaddr_off, RV_REG_SP, ctx);
+
+	emit_ld(RV_REG_FP, stack_size - fp_off, RV_REG_SP, ctx);
+	emit_addi(RV_REG_SP, RV_REG_SP, stack_size, ctx);
+
+	emit_jalr(RV_REG_ZERO, RV_REG_RA, 0, ctx);
+
+	bpf_flush_icache(ctx->insns, ctx->insns + ctx->ninsns);
+
+	kfree(branches_off);
+
+	return ctx->ninsns;
+
+}
+
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
+				void *image_end, const struct btf_func_model *m,
+				u32 flags, struct bpf_tramp_links *tlinks,
+				void *func_addr)
+{
+	int ret;
+	struct rv_jit_context ctx;
+
+	ctx.ninsns = 0;
+	ctx.insns = image;
+	ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx);
+	if (ret < 0)
+		return ret;
+
+	if (ninsns_rvoff(ret) > (long)image_end - (long)image)
+		return -EFBIG;
+
+	return ninsns_rvoff(ret);
+}
+
 int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 		      bool extra_pass)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call Pu Lehui
@ 2022-12-20  2:32   ` Xu Kuohai
  2022-12-21  2:36     ` Pu Lehui
  0 siblings, 1 reply; 18+ messages in thread
From: Xu Kuohai @ 2022-12-20  2:32 UTC (permalink / raw)
  To: Pu Lehui, bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui

On 12/20/2022 10:13 AM, Pu Lehui wrote:
> From: Pu Lehui <pulehui@huawei.com>
> 
> The current bpf trampoline attach to kernel functions via ftrace direct
> call API, while text_poke is applied for bpf2bpf attach and tail call
> optimization. For architectures that do not support ftrace direct call,
> text_poke is still able to attach bpf trampoline to kernel functions.
> Let's relax it by rollback to text_poke when architecture not supported.
> 
> Signed-off-by: Pu Lehui <pulehui@huawei.com>
> ---
>   kernel/bpf/trampoline.c | 8 ++------
>   1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> index d6395215b849..386197a7952c 100644
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -228,15 +228,11 @@ static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_ad
>   static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
>   {
>   	void *ip = tr->func.addr;
> -	unsigned long faddr;
>   	int ret;
>   
> -	faddr = ftrace_location((unsigned long)ip);
> -	if (faddr) {
> -		if (!tr->fops)
> -			return -ENOTSUPP;
> +	if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) &&
> +	    !!ftrace_location((unsigned long)ip))
>   		tr->func.ftrace_managed = true;
> -	}
>

After this patch, a kernel function with a valid ftrace_location will be patched
by bpf when CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is disabled, which means
that a kernel function may be patched by both bpf and ftrace in a mutually
unaware way. This will cause ftrace and bpf_arch_text_poke to fail in a
somewhat random way if the function to be patched was already patched
by the other.

>   	if (bpf_trampoline_module_get(tr))
>   		return -ENOENT;


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64 Pu Lehui
@ 2022-12-20  3:00   ` Xu Kuohai
  2022-12-22  3:01     ` Pu Lehui
  2023-01-03 18:12   ` Björn Töpel
  1 sibling, 1 reply; 18+ messages in thread
From: Xu Kuohai @ 2022-12-20  3:00 UTC (permalink / raw)
  To: Pu Lehui, bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui

On 12/20/2022 10:13 AM, Pu Lehui wrote:
> From: Pu Lehui <pulehui@huawei.com>
> 
> Implement bpf_arch_text_poke for RV64. For call scenario,
> ftrace framework reserve 4 nops for RV64 kernel function
> as function entry, and use auipc+jalr instructions to call
> kernel or module functions. However, since the auipc+jalr
> call instructions is non-atomic operation, we need to use
> stop-machine to make sure instruction patching in atomic
> context. As for jump scenario, since we only jump inside
> the trampoline, a jal instruction is sufficient.
> 
> Signed-off-by: Pu Lehui <pulehui@huawei.com>
> ---
>   arch/riscv/net/bpf_jit.h        |   5 ++
>   arch/riscv/net/bpf_jit_comp64.c | 131 +++++++++++++++++++++++++++++++-
>   2 files changed, 134 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/riscv/net/bpf_jit.h b/arch/riscv/net/bpf_jit.h
> index d926e0f7ef57..bf9802a63061 100644
> --- a/arch/riscv/net/bpf_jit.h
> +++ b/arch/riscv/net/bpf_jit.h
> @@ -573,6 +573,11 @@ static inline u32 rv_fence(u8 pred, u8 succ)
>   	return rv_i_insn(imm11_0, 0, 0, 0, 0xf);
>   }
>   
> +static inline u32 rv_nop(void)
> +{
> +	return rv_i_insn(0, 0, 0, 0, 0x13);
> +}
> +
>   /* RVC instrutions. */
>   
>   static inline u16 rvc_addi4spn(u8 rd, u32 imm10)
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index bf4721a99a09..fa8b03c52463 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -8,6 +8,8 @@
>   #include <linux/bitfield.h>
>   #include <linux/bpf.h>
>   #include <linux/filter.h>
> +#include <linux/memory.h>
> +#include <linux/stop_machine.h>
>   #include "bpf_jit.h"
>   
>   #define RV_REG_TCC RV_REG_A6
> @@ -238,7 +240,7 @@ static void __build_epilogue(bool is_tail_call, struct rv_jit_context *ctx)
>   	if (!is_tail_call)
>   		emit_mv(RV_REG_A0, RV_REG_A5, ctx);
>   	emit_jalr(RV_REG_ZERO, is_tail_call ? RV_REG_T3 : RV_REG_RA,
> -		  is_tail_call ? 4 : 0, /* skip TCC init */
> +		  is_tail_call ? 20 : 0, /* skip reserved nops and TCC init */
>   		  ctx);
>   }
>   
> @@ -615,6 +617,127 @@ static int add_exception_handler(const struct bpf_insn *insn,
>   	return 0;
>   }
>   
> +struct text_poke_args {
> +	void *addr;
> +	const void *insns;
> +	size_t len;
> +	atomic_t cpu_count;
> +};
> +
> +static int do_text_poke(void *data)
> +{
> +	int ret = 0;
> +	struct text_poke_args *patch = data;
> +
> +	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {

It seems this sync is not needed; why not call stop_machine like this:

stop_machine(do_text_poke, &patch, NULL);

> +		ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
> +		atomic_inc(&patch->cpu_count);
> +	} else {
> +		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
> +			cpu_relax();
> +		smp_mb();
> +	}
> +
> +	return ret;
> +}
> +
> +static int bpf_text_poke_stop_machine(void *addr, const void *insns, size_t len)
> +{
> +	struct text_poke_args patch = {
> +		.addr = addr,
> +		.insns = insns,
> +		.len = len,
> +		.cpu_count = ATOMIC_INIT(0),
> +	};
> +
> +	return stop_machine(do_text_poke, &patch, cpu_online_mask);
> +}
> +
> +static int gen_call_or_nops(void *target, void *ip, u32 *insns)
> +{
> +	int i, ret;
> +	s64 rvoff;
> +	struct rv_jit_context ctx;
> +
> +	ctx.ninsns = 0;
> +	ctx.insns = (u16 *)insns;
> +
> +	if (!target) {
> +		for (i = 0; i < 4; i++)
> +			emit(rv_nop(), &ctx);
> +		return 0;
> +	}
> +
> +	rvoff = (s64)(target - ip);
> +	emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);
> +	ret = emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx);
> +	if (ret)
> +		return ret;
> +	emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);
> +
> +	return 0;
> +
> +}
> +
> +static int bpf_text_poke_call(void *ip, void *old_addr, void *new_addr)
> +{
> +	int ret;
> +	u32 old_insns[4], new_insns[4];
> +
> +	ret = gen_call_or_nops(old_addr, ip + 4, old_insns);
> +	if (ret)
> +		return ret;
> +
> +	ret = gen_call_or_nops(new_addr, ip + 4, new_insns);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&text_mutex);
> +	if (memcmp(ip, old_insns, sizeof(old_insns))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (memcmp(ip, new_insns, sizeof(new_insns)))
> +		ret = bpf_text_poke_stop_machine(ip, new_insns, sizeof(new_insns));

Since there are 4 instructions being replaced, it is possible that
several old instructions were already executed before stop_machine.

> +out:
> +	mutex_unlock(&text_mutex);
> +	return ret;
> +}
> +
> +static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
> +{
> +	int ret;
> +	u32 old_insn, new_insn;
> +
> +	old_insn = old_addr ? rv_jal(RV_REG_ZERO, (s64)(old_addr - ip) >> 1) : rv_nop();
> +	new_insn = new_addr ? rv_jal(RV_REG_ZERO, (s64)(new_addr - ip) >> 1) : rv_nop();
> +
> +	mutex_lock(&text_mutex);
> +	if (memcmp(ip, &old_insn, sizeof(old_insn))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (memcmp(ip, &new_insn, sizeof(new_insn)))
> +		ret = patch_text_nosync(ip, &new_insn, sizeof(new_insn));
> +out:
> +	mutex_unlock(&text_mutex);
> +	return ret;
> +}
> +
> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
> +		       void *old_addr, void *new_addr)
> +{
> +	if (!is_kernel_text((unsigned long)ip) &&
> +	    !is_bpf_text_address((unsigned long)ip))
> +		return -ENOTSUPP;
> +
> +	return poke_type == BPF_MOD_CALL ?
> +	       bpf_text_poke_call(ip, old_addr, new_addr) :
> +	       bpf_text_poke_jump(ip, old_addr, new_addr);
> +}
> +
>   int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>   		      bool extra_pass)
>   {
> @@ -1266,7 +1389,7 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>   
>   void bpf_jit_build_prologue(struct rv_jit_context *ctx)
>   {
> -	int stack_adjust = 0, store_offset, bpf_stack_adjust;
> +	int i, stack_adjust = 0, store_offset, bpf_stack_adjust;
>   	bool is_main_prog = ctx->prog->aux->func_idx == 0;
>   
>   	bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
> @@ -1294,6 +1417,10 @@ void bpf_jit_build_prologue(struct rv_jit_context *ctx)
>   
>   	store_offset = stack_adjust - 8;
>   
> +	/* reserve 4 nop insns */
> +	for (i = 0; i < 4; i++)
> +		emit(rv_nop(), ctx);
> +
>   	/* First instruction is always setting the tail-call-counter
>   	 * (TCC) register. This instruction is skipped for tail calls.
>   	 * Force using a 4-byte (non-compressed) instruction.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call
  2022-12-20  2:32   ` Xu Kuohai
@ 2022-12-21  2:36     ` Pu Lehui
  2023-01-03 12:05       ` Björn Töpel
  0 siblings, 1 reply; 18+ messages in thread
From: Pu Lehui @ 2022-12-21  2:36 UTC (permalink / raw)
  To: Xu Kuohai, bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui



On 2022/12/20 10:32, Xu Kuohai wrote:
> On 12/20/2022 10:13 AM, Pu Lehui wrote:
>> From: Pu Lehui <pulehui@huawei.com>
>>
>> The current bpf trampoline attach to kernel functions via ftrace direct
>> call API, while text_poke is applied for bpf2bpf attach and tail call
>> optimization. For architectures that do not support ftrace direct call,
>> text_poke is still able to attach bpf trampoline to kernel functions.
>> Let's relax it by rollback to text_poke when architecture not supported.
>>
>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>> ---
>>   kernel/bpf/trampoline.c | 8 ++------
>>   1 file changed, 2 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
>> index d6395215b849..386197a7952c 100644
>> --- a/kernel/bpf/trampoline.c
>> +++ b/kernel/bpf/trampoline.c
>> @@ -228,15 +228,11 @@ static int modify_fentry(struct bpf_trampoline 
>> *tr, void *old_addr, void *new_ad
>>   static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
>>   {
>>       void *ip = tr->func.addr;
>> -    unsigned long faddr;
>>       int ret;
>> -    faddr = ftrace_location((unsigned long)ip);
>> -    if (faddr) {
>> -        if (!tr->fops)
>> -            return -ENOTSUPP;
>> +    if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) &&
>> +        !!ftrace_location((unsigned long)ip))
>>           tr->func.ftrace_managed = true;
>> -    }
>>
> 
> After this patch, a kernel function with true trace_location will be 
> patched
> by bpf when CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is disabled, which 
> means
> that a kernel function may be patched by both bpf and ftrace in a mutually
> unaware way. This will cause ftrace and bpf_arch_text_poke to fail in a
> somewhat random way if the function to be patched was already patched
> by the other.

Thanks for your review. And yes, this is a backward compatible solution 
for architectures that do not support DYNAMIC_FTRACE_WITH_DIRECT_CALLS.

> 
>>       if (bpf_trampoline_module_get(tr))
>>           return -ENOENT;


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64
  2022-12-20  3:00   ` Xu Kuohai
@ 2022-12-22  3:01     ` Pu Lehui
  0 siblings, 0 replies; 18+ messages in thread
From: Pu Lehui @ 2022-12-22  3:01 UTC (permalink / raw)
  To: Xu Kuohai, bpf, linux-riscv, linux-kernel
  Cc: Björn Töpel, Luke Nelson, Xi Wang, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Pu Lehui



On 2022/12/20 11:00, Xu Kuohai wrote:
> On 12/20/2022 10:13 AM, Pu Lehui wrote:
>> From: Pu Lehui <pulehui@huawei.com>
>>
>> Implement bpf_arch_text_poke for RV64. For call scenario,
>> ftrace framework reserve 4 nops for RV64 kernel function
>> as function entry, and use auipc+jalr instructions to call
>> kernel or module functions. However, since the auipc+jalr
>> call instructions is non-atomic operation, we need to use
>> stop-machine to make sure instruction patching in atomic
>> context. As for jump scenario, since we only jump inside
>> the trampoline, a jal instruction is sufficient.
>>
>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>> ---
>>   arch/riscv/net/bpf_jit.h        |   5 ++
>>   arch/riscv/net/bpf_jit_comp64.c | 131 +++++++++++++++++++++++++++++++-
>>   2 files changed, 134 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/riscv/net/bpf_jit.h b/arch/riscv/net/bpf_jit.h
>> index d926e0f7ef57..bf9802a63061 100644
>> --- a/arch/riscv/net/bpf_jit.h
>> +++ b/arch/riscv/net/bpf_jit.h
>> @@ -573,6 +573,11 @@ static inline u32 rv_fence(u8 pred, u8 succ)
>>       return rv_i_insn(imm11_0, 0, 0, 0, 0xf);
>>   }
>> +static inline u32 rv_nop(void)
>> +{
>> +    return rv_i_insn(0, 0, 0, 0, 0x13);
>> +}
>> +
>>   /* RVC instrutions. */
>>   static inline u16 rvc_addi4spn(u8 rd, u32 imm10)
>> diff --git a/arch/riscv/net/bpf_jit_comp64.c 
>> b/arch/riscv/net/bpf_jit_comp64.c
>> index bf4721a99a09..fa8b03c52463 100644
>> --- a/arch/riscv/net/bpf_jit_comp64.c
>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>> @@ -8,6 +8,8 @@
>>   #include <linux/bitfield.h>
>>   #include <linux/bpf.h>
>>   #include <linux/filter.h>
>> +#include <linux/memory.h>
>> +#include <linux/stop_machine.h>
>>   #include "bpf_jit.h"
>>   #define RV_REG_TCC RV_REG_A6
>> @@ -238,7 +240,7 @@ static void __build_epilogue(bool is_tail_call, 
>> struct rv_jit_context *ctx)
>>       if (!is_tail_call)
>>           emit_mv(RV_REG_A0, RV_REG_A5, ctx);
>>       emit_jalr(RV_REG_ZERO, is_tail_call ? RV_REG_T3 : RV_REG_RA,
>> -          is_tail_call ? 4 : 0, /* skip TCC init */
>> +          is_tail_call ? 20 : 0, /* skip reserved nops and TCC init */
>>             ctx);
>>   }
>> @@ -615,6 +617,127 @@ static int add_exception_handler(const struct 
>> bpf_insn *insn,
>>       return 0;
>>   }
>> +struct text_poke_args {
>> +    void *addr;
>> +    const void *insns;
>> +    size_t len;
>> +    atomic_t cpu_count;
>> +};
>> +
>> +static int do_text_poke(void *data)
>> +{
>> +    int ret = 0;
>> +    struct text_poke_args *patch = data;
>> +
>> +    if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
> 
> seems this sync is not needed, why not calling stop machine like this:
> 
> stop_machine(do_text_poke, &patch, NULL);
> 

I think both versions should be ok. While the NULL version uses the first 
cpu as the master to patch the text and lets the others touch the NMI 
watchdog, this version makes the other cpus spin in cpu_relax() so that 
the master cpu can own as many system resources as possible.

>> +        ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
>> +        atomic_inc(&patch->cpu_count);
>> +    } else {
>> +        while (atomic_read(&patch->cpu_count) <= num_online_cpus())
>> +            cpu_relax();
>> +        smp_mb();
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static int bpf_text_poke_stop_machine(void *addr, const void *insns, 
>> size_t len)
>> +{
>> +    struct text_poke_args patch = {
>> +        .addr = addr,
>> +        .insns = insns,
>> +        .len = len,
>> +        .cpu_count = ATOMIC_INIT(0),
>> +    };
>> +
>> +    return stop_machine(do_text_poke, &patch, cpu_online_mask);
>> +}
>> +
>> +static int gen_call_or_nops(void *target, void *ip, u32 *insns)
>> +{
>> +    int i, ret;
>> +    s64 rvoff;
>> +    struct rv_jit_context ctx;
>> +
>> +    ctx.ninsns = 0;
>> +    ctx.insns = (u16 *)insns;
>> +
>> +    if (!target) {
>> +        for (i = 0; i < 4; i++)
>> +            emit(rv_nop(), &ctx);
>> +        return 0;
>> +    }
>> +
>> +    rvoff = (s64)(target - ip);
>> +    emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);
>> +    ret = emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx);
>> +    if (ret)
>> +        return ret;
>> +    emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);
>> +
>> +    return 0;
>> +
>> +}
>> +
>> +static int bpf_text_poke_call(void *ip, void *old_addr, void *new_addr)
>> +{
>> +    int ret;
>> +    u32 old_insns[4], new_insns[4];
>> +
>> +    ret = gen_call_or_nops(old_addr, ip + 4, old_insns);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = gen_call_or_nops(new_addr, ip + 4, new_insns);
>> +    if (ret)
>> +        return ret;
>> +
>> +    mutex_lock(&text_mutex);
>> +    if (memcmp(ip, old_insns, sizeof(old_insns))) {
>> +        ret = -EFAULT;
>> +        goto out;
>> +    }
>> +
>> +    if (memcmp(ip, new_insns, sizeof(new_insns)))
>> +        ret = bpf_text_poke_stop_machine(ip, new_insns, 
>> sizeof(new_insns));
> 
> since there are 4 instructions being replaced, it is possible that
> serveral old instructions were already executed before stop machine.
> 

Yes, there are unpredictable problems after stop_machine in that 
situation. In addition, probing the stop_machine sub-functions may lead 
to nesting loops. This is also an existing problem with riscv ftrace.

The good news is that Andy's work [1] is expected to solve these 
problems. We can synchronize with those changes in the future.

[1] 
https://lore.kernel.org/linux-riscv/20220913094252.3555240-1-andy.chiu@sifive.com

>> +out:
>> +    mutex_unlock(&text_mutex);
>> +    return ret;
>> +}
>> +
>> +static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
>> +{
>> +    int ret;
>> +    u32 old_insn, new_insn;
>> +
>> +    old_insn = old_addr ? rv_jal(RV_REG_ZERO, (s64)(old_addr - ip) >> 
>> 1) : rv_nop();
>> +    new_insn = new_addr ? rv_jal(RV_REG_ZERO, (s64)(new_addr - ip) >> 
>> 1) : rv_nop();
>> +
>> +    mutex_lock(&text_mutex);
>> +    if (memcmp(ip, &old_insn, sizeof(old_insn))) {
>> +        ret = -EFAULT;
>> +        goto out;
>> +    }
>> +
>> +    if (memcmp(ip, &new_insn, sizeof(new_insn)))
>> +        ret = patch_text_nosync(ip, &new_insn, sizeof(new_insn));
>> +out:
>> +    mutex_unlock(&text_mutex);
>> +    return ret;
>> +}
>> +
>> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>> +               void *old_addr, void *new_addr)
>> +{
>> +    if (!is_kernel_text((unsigned long)ip) &&
>> +        !is_bpf_text_address((unsigned long)ip))
>> +        return -ENOTSUPP;
>> +
>> +    return poke_type == BPF_MOD_CALL ?
>> +           bpf_text_poke_call(ip, old_addr, new_addr) :
>> +           bpf_text_poke_jump(ip, old_addr, new_addr);
>> +}
>> +
>>   int bpf_jit_emit_insn(const struct bpf_insn *insn, struct 
>> rv_jit_context *ctx,
>>                 bool extra_pass)
>>   {
>> @@ -1266,7 +1389,7 @@ int bpf_jit_emit_insn(const struct bpf_insn 
>> *insn, struct rv_jit_context *ctx,
>>   void bpf_jit_build_prologue(struct rv_jit_context *ctx)
>>   {
>> -    int stack_adjust = 0, store_offset, bpf_stack_adjust;
>> +    int i, stack_adjust = 0, store_offset, bpf_stack_adjust;
>>       bool is_main_prog = ctx->prog->aux->func_idx == 0;
>>       bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
>> @@ -1294,6 +1417,10 @@ void bpf_jit_build_prologue(struct 
>> rv_jit_context *ctx)
>>       store_offset = stack_adjust - 8;
>> +    /* reserve 4 nop insns */
>> +    for (i = 0; i < 4; i++)
>> +        emit(rv_nop(), ctx);
>> +
>>       /* First instruction is always setting the tail-call-counter
>>        * (TCC) register. This instruction is skipped for tail calls.
>>        * Force using a 4-byte (non-compressed) instruction.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64
  2022-12-20  2:13 [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64 Pu Lehui
                   ` (3 preceding siblings ...)
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline " Pu Lehui
@ 2022-12-22 13:00 ` Björn Töpel
  2022-12-24  6:22   ` Pu Lehui
  4 siblings, 1 reply; 18+ messages in thread
From: Björn Töpel @ 2022-12-22 13:00 UTC (permalink / raw)
  To: Pu Lehui, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui, Pu Lehui

Pu Lehui <pulehui@huaweicloud.com> writes:

> BPF trampoline is the critical infrastructure of the bpf
> subsystem, acting as a mediator between kernel functions
> and BPF programs. Numerous important features, such as
> using ebpf program for zero overhead kernel introspection,
> rely on this key component. We can't wait to support bpf
> trampoline on RV64. The implementation of bpf trampoline
> was closely to x86 and arm64 for future development.

Thank you for working on this! BPF trampoline is the "missing piece"
for getting proper kfunc support.

Unfortunately, I wont be able to do a proper review until next week.


Happy holidays,
Björn

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64
  2022-12-22 13:00 ` [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline " Björn Töpel
@ 2022-12-24  6:22   ` Pu Lehui
  0 siblings, 0 replies; 18+ messages in thread
From: Pu Lehui @ 2022-12-24  6:22 UTC (permalink / raw)
  To: Björn Töpel, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui



On 2022/12/22 21:00, Björn Töpel wrote:
> Pu Lehui <pulehui@huaweicloud.com> writes:
> 
>> BPF trampoline is the critical infrastructure of the bpf
>> subsystem, acting as a mediator between kernel functions
>> and BPF programs. Numerous important features, such as
>> using ebpf program for zero overhead kernel introspection,
>> rely on this key component. We can't wait to support bpf
>> trampoline on RV64. The implementation of bpf trampoline
>> was closely to x86 and arm64 for future development.
> 
> Thank you for working on this! BPF trampoline is the "missing piece"
> from getting proper kfunc support.
> 
> Unfortunately, I wont be able to do a proper review until next week.
> 

Take your time, it might take several rounds of optimization. Yep, riscv 
bpf will look more complete once bpf trampoline and kfunc are supported.

Anyway, have a nice Christmas holiday.😄

Lehui

> 
> Happy holidays,
> Björn


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call
  2022-12-21  2:36     ` Pu Lehui
@ 2023-01-03 12:05       ` Björn Töpel
  2023-01-06  2:35         ` Pu Lehui
  0 siblings, 1 reply; 18+ messages in thread
From: Björn Töpel @ 2023-01-03 12:05 UTC (permalink / raw)
  To: Pu Lehui, Xu Kuohai, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui

Pu Lehui <pulehui@huaweicloud.com> writes:

> On 2022/12/20 10:32, Xu Kuohai wrote:
>> On 12/20/2022 10:13 AM, Pu Lehui wrote:
>>> From: Pu Lehui <pulehui@huawei.com>
>>>
>>> The current bpf trampoline attach to kernel functions via ftrace direct
>>> call API, while text_poke is applied for bpf2bpf attach and tail call
>>> optimization. For architectures that do not support ftrace direct call,
>>> text_poke is still able to attach bpf trampoline to kernel functions.
>>> Let's relax it by rollback to text_poke when architecture not supported.
>>>
>>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>>> ---
>>>   kernel/bpf/trampoline.c | 8 ++------
>>>   1 file changed, 2 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
>>> index d6395215b849..386197a7952c 100644
>>> --- a/kernel/bpf/trampoline.c
>>> +++ b/kernel/bpf/trampoline.c
>>> @@ -228,15 +228,11 @@ static int modify_fentry(struct bpf_trampoline 
>>> *tr, void *old_addr, void *new_ad
>>>   static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
>>>   {
>>>       void *ip = tr->func.addr;
>>> -    unsigned long faddr;
>>>       int ret;
>>> -    faddr = ftrace_location((unsigned long)ip);
>>> -    if (faddr) {
>>> -        if (!tr->fops)
>>> -            return -ENOTSUPP;
>>> +    if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) &&
>>> +        !!ftrace_location((unsigned long)ip))
>>>           tr->func.ftrace_managed = true;
>>> -    }
>>>
>> 
>> After this patch, a kernel function with true trace_location will be 
>> patched
>> by bpf when CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is disabled, which 
>> means
>> that a kernel function may be patched by both bpf and ftrace in a mutually
>> unaware way. This will cause ftrace and bpf_arch_text_poke to fail in a
>> somewhat random way if the function to be patched was already patched
>> by the other.
>
> Thanks for your review. And yes, this is a backward compatible solution 
> for architectures that not support DYNAMIC_FTRACE_WITH_DIRECT_CALLS.

It's not "backward compatible". Reiterating what Kuohai said; The BPF
trampoline must be aware of ftrace's state -- with this patch, the
trampoline can't blindly poke the text managed my ftrace.

I'd recommend to remove this patch from the series.


Björn

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64 Pu Lehui
  2022-12-20  3:00   ` Xu Kuohai
@ 2023-01-03 18:12   ` Björn Töpel
  2023-01-06  1:57     ` Pu Lehui
  1 sibling, 1 reply; 18+ messages in thread
From: Björn Töpel @ 2023-01-03 18:12 UTC (permalink / raw)
  To: Pu Lehui, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui, Pu Lehui

Pu Lehui <pulehui@huaweicloud.com> writes:

> From: Pu Lehui <pulehui@huawei.com>
>
> Implement bpf_arch_text_poke for RV64. For call scenario,
> ftrace framework reserve 4 nops for RV64 kernel function
> as function entry, and use auipc+jalr instructions to call
> kernel or module functions. However, since the auipc+jalr
> call instructions is non-atomic operation, we need to use
> stop-machine to make sure instruction patching in atomic
> context. As for jump scenario, since we only jump inside
> the trampoline, a jal instruction is sufficient.

Hmm, is that really true? More below!

>
> Signed-off-by: Pu Lehui <pulehui@huawei.com>
> ---
>  arch/riscv/net/bpf_jit.h        |   5 ++
>  arch/riscv/net/bpf_jit_comp64.c | 131 +++++++++++++++++++++++++++++++-
>  2 files changed, 134 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/net/bpf_jit.h b/arch/riscv/net/bpf_jit.h
> index d926e0f7ef57..bf9802a63061 100644
> --- a/arch/riscv/net/bpf_jit.h
> +++ b/arch/riscv/net/bpf_jit.h
> @@ -573,6 +573,11 @@ static inline u32 rv_fence(u8 pred, u8 succ)
>  	return rv_i_insn(imm11_0, 0, 0, 0, 0xf);
>  }
>  
> +static inline u32 rv_nop(void)
> +{
> +	return rv_i_insn(0, 0, 0, 0, 0x13);
> +}
> +
>  /* RVC instrutions. */
>  
>  static inline u16 rvc_addi4spn(u8 rd, u32 imm10)
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index bf4721a99a09..fa8b03c52463 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -8,6 +8,8 @@
>  #include <linux/bitfield.h>
>  #include <linux/bpf.h>
>  #include <linux/filter.h>
> +#include <linux/memory.h>
> +#include <linux/stop_machine.h>
>  #include "bpf_jit.h"
>  
>  #define RV_REG_TCC RV_REG_A6
> @@ -238,7 +240,7 @@ static void __build_epilogue(bool is_tail_call, struct rv_jit_context *ctx)
>  	if (!is_tail_call)
>  		emit_mv(RV_REG_A0, RV_REG_A5, ctx);
>  	emit_jalr(RV_REG_ZERO, is_tail_call ? RV_REG_T3 : RV_REG_RA,
> -		  is_tail_call ? 4 : 0, /* skip TCC init */
> +		  is_tail_call ? 20 : 0, /* skip reserved nops and TCC init */
>  		  ctx);
>  }
>  
> @@ -615,6 +617,127 @@ static int add_exception_handler(const struct bpf_insn *insn,
>  	return 0;
>  }
>  
> +struct text_poke_args {
> +	void *addr;
> +	const void *insns;
> +	size_t len;
> +	atomic_t cpu_count;
> +};
> +
> +static int do_text_poke(void *data)
> +{
> +	int ret = 0;
> +	struct text_poke_args *patch = data;
> +
> +	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
> +		ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
> +		atomic_inc(&patch->cpu_count);
> +	} else {
> +		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
> +			cpu_relax();
> +		smp_mb();
> +	}
> +
> +	return ret;
> +}
> +
> +static int bpf_text_poke_stop_machine(void *addr, const void *insns, size_t len)
> +{
> +	struct text_poke_args patch = {
> +		.addr = addr,
> +		.insns = insns,
> +		.len = len,
> +		.cpu_count = ATOMIC_INIT(0),
> +	};
> +
> +	return stop_machine(do_text_poke, &patch, cpu_online_mask);
> +}
> +
> +static int gen_call_or_nops(void *target, void *ip, u32 *insns)
> +{
> +	int i, ret;
> +	s64 rvoff;
> +	struct rv_jit_context ctx;
> +
> +	ctx.ninsns = 0;
> +	ctx.insns = (u16 *)insns;
> +
> +	if (!target) {
> +		for (i = 0; i < 4; i++)
> +			emit(rv_nop(), &ctx);
> +		return 0;
> +	}
> +
> +	rvoff = (s64)(target - ip);
> +	emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);
> +	ret = emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx);
> +	if (ret)
> +		return ret;
> +	emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);
> +
> +	return 0;
> +
> +}
> +
> +static int bpf_text_poke_call(void *ip, void *old_addr, void *new_addr)
> +{
> +	int ret;
> +	u32 old_insns[4], new_insns[4];
> +
> +	ret = gen_call_or_nops(old_addr, ip + 4, old_insns);
> +	if (ret)
> +		return ret;
> +
> +	ret = gen_call_or_nops(new_addr, ip + 4, new_insns);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&text_mutex);
> +	if (memcmp(ip, old_insns, sizeof(old_insns))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (memcmp(ip, new_insns, sizeof(new_insns)))
> +		ret = bpf_text_poke_stop_machine(ip, new_insns, sizeof(new_insns));

I'd rather see that you added a patch_text variant to
arch/riscv/kernel/patch.c (something like your
bpf_text_poke_stop_machine()), and use that here. Might be other users
of that as well -- Andy's ftrace patch maybe? :-)
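
Roughly this shape, just hoisted into arch/riscv/kernel/patch.c (sketch
only -- patch_text_batch() and patch_insns_cb() are made-up names, and a
real version would live next to the existing patch_text() there):

struct patch_insns {
	void *addr;
	const void *insns;
	size_t len;
	atomic_t cpu_count;
};

static int patch_insns_cb(void *data)
{
	struct patch_insns *patch = data;
	int ret = 0;

	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
		/* The last CPU to arrive does the patching... */
		ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
		/* ...and then releases the CPUs spinning below. */
		atomic_inc(&patch->cpu_count);
	} else {
		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
			cpu_relax();
		smp_mb();
	}

	return ret;
}

/* Patch a multi-instruction (non-atomic) sequence with all other CPUs
 * quiesced by stop_machine().
 */
int patch_text_batch(void *addr, const void *insns, size_t len)
{
	struct patch_insns patch = {
		.addr = addr,
		.insns = insns,
		.len = len,
		.cpu_count = ATOMIC_INIT(0),
	};

	return stop_machine(patch_insns_cb, &patch, cpu_online_mask);
}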

> +out:
> +	mutex_unlock(&text_mutex);
> +	return ret;
> +}
> +
> +static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
> +{
> +	int ret;
> +	u32 old_insn, new_insn;
> +
> +	old_insn = old_addr ? rv_jal(RV_REG_ZERO, (s64)(old_addr - ip) >> 1) : rv_nop();
> +	new_insn = new_addr ? rv_jal(RV_REG_ZERO, (s64)(new_addr - ip) >> 1) : rv_nop();
> +
> +	mutex_lock(&text_mutex);
> +	if (memcmp(ip, &old_insn, sizeof(old_insn))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (memcmp(ip, &new_insn, sizeof(new_insn)))
> +		ret = patch_text_nosync(ip, &new_insn, sizeof(new_insn));
> +out:
> +	mutex_unlock(&text_mutex);
> +	return ret;
> +}
> +
> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
> +		       void *old_addr, void *new_addr)

AFAIU there's nothing in the bpf_arch_text_poke() API that says
BPF_MOD_JUMP is only for jumps within the trampoline. That is one usage,
but not the only one. In general, the jal might not have enough reach.

I believe that this needs to be an auipc/jalr pair similar to
BPF_MOD_CALL (w/o linked register). 
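
A minimal sketch of that, reusing the emit helpers from this patch
(gen_jump_or_nops() is a made-up name, and a real version would also
range-check the offset, e.g. with in_auipc_jalr_range()):

/* Emit a fixed-length, 2-insn jump: auipc t1, hi20 ; jalr zero, lo12(t1).
 * rd = zero means no link register is written; t1 is clobbered. Only an
 * illustration of the shape, not part of the posted series.
 */
static int gen_jump_or_nops(void *target, void *ip, u32 *insns)
{
	s64 rvoff, upper, lower;
	struct rv_jit_context ctx;

	ctx.ninsns = 0;
	ctx.insns = (u16 *)insns;

	if (!target) {
		emit(rv_nop(), &ctx);
		emit(rv_nop(), &ctx);
		return 0;
	}

	rvoff = (s64)(target - ip);
	upper = (rvoff + (1 << 11)) >> 12;
	lower = rvoff & 0xfff;

	emit(rv_auipc(RV_REG_T1, upper), &ctx);
	emit(rv_jalr(RV_REG_ZERO, RV_REG_T1, lower), &ctx);

	return 0;
}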


And again, thanks for working on the RV trampoline!
Björn

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline support for RV64
  2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline " Pu Lehui
@ 2023-01-03 19:16   ` Björn Töpel
  2023-01-06  1:36     ` Pu Lehui
  0 siblings, 1 reply; 18+ messages in thread
From: Björn Töpel @ 2023-01-03 19:16 UTC (permalink / raw)
  To: Pu Lehui, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui, Pu Lehui

Pu Lehui <pulehui@huaweicloud.com> writes:

> From: Pu Lehui <pulehui@huawei.com>
>
> BPF trampoline is the critical infrastructure of the bpf
> subsystem, acting as a mediator between kernel functions
> and BPF programs. Numerous important features, such as
> using ebpf program for zero overhead kernel introspection,
> rely on this key component. We can't wait to support bpf
> trampoline on RV64. The implementation of bpf trampoline
> was closely to x86 and arm64 for future development. The
> related tests have passed, as well as the test_verifier
> with no new failure ceses.
>
> Signed-off-by: Pu Lehui <pulehui@huawei.com>
> ---
>  arch/riscv/net/bpf_jit_comp64.c | 322 ++++++++++++++++++++++++++++++++
>  1 file changed, 322 insertions(+)
>
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index fa8b03c52463..11c001782e7b 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -738,6 +738,328 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>  	       bpf_text_poke_jump(ip, old_addr, new_addr);
>  }
>  
> +static void store_args(int nregs, int args_off, struct rv_jit_context *ctx)
> +{
> +	int i;
> +
> +	for (i = 0; i < nregs; i++) {
> +		emit_sd(RV_REG_FP, -args_off, RV_REG_A0 + i, ctx);
> +		args_off -= 8;
> +	}
> +}
> +
> +static void restore_args(int nregs, int args_off, struct rv_jit_context *ctx)
> +{
> +	int i;
> +
> +	for (i = 0; i < nregs; i++) {
> +		emit_ld(RV_REG_A0 + i, -args_off, RV_REG_FP, ctx);
> +		args_off -= 8;
> +	}
> +}
> +
> +static int invoke_bpf_prog(struct bpf_tramp_link *l, int args_off, int retval_off,
> +			   int run_ctx_off, bool save_ret, struct rv_jit_context *ctx)
> +{
> +	u32 insn;
> +	int ret, branch_off, offset;
> +	struct bpf_prog *p = l->link.prog;
> +	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
> +
> +	if (l->cookie) {
> +		emit_imm(RV_REG_T1, l->cookie, ctx);
> +		emit_sd(RV_REG_FP, -run_ctx_off + cookie_off, RV_REG_T1, ctx);
> +	} else {
> +		emit_sd(RV_REG_FP, -run_ctx_off + cookie_off, RV_REG_ZERO, ctx);
> +	}
> +
> +	/* arg1: prog */
> +	emit_imm(RV_REG_A0, (const s64)p, ctx);
> +	/* arg2: &run_ctx */
> +	emit_addi(RV_REG_A1, RV_REG_FP, -run_ctx_off, ctx);
> +	ret = emit_call((const u64)bpf_trampoline_enter(p), true, ctx);
> +	if (ret)
> +		return ret;
> +
> +	/* if (__bpf_prog_enter(prog) == 0)
> +	 *	goto skip_exec_of_prog;
> +	 */
> +	branch_off = ctx->ninsns;
> +	/* nop reserved for conditional jump */
> +	emit(rv_nop(), ctx);
> +
> +	/* store prog start time */
> +	emit_mv(RV_REG_S1, RV_REG_A0, ctx);
> +
> +	/* arg1: &args_off */
> +	emit_addi(RV_REG_A0, RV_REG_FP, -args_off, ctx);
> +	if (!p->jited)
> +		/* arg2: progs[i]->insnsi for interpreter */
> +		emit_imm(RV_REG_A1, (const s64)p->insnsi, ctx);
> +	ret = emit_call((const u64)p->bpf_func, true, ctx);
> +	if (ret)
> +		return ret;
> +
> +	if (save_ret)
> +		emit_sd(RV_REG_FP, -retval_off, regmap[BPF_REG_0], ctx);
> +
> +	/* update branch with beqz */
> +	offset = ninsns_rvoff(ctx->ninsns - branch_off);
> +	insn = rv_beq(RV_REG_A0, RV_REG_ZERO, offset >> 1);
> +	*(u32 *)(ctx->insns + branch_off) = insn;
> +
> +	/* arg1: prog */
> +	emit_imm(RV_REG_A0, (const s64)p, ctx);
> +	/* arg2: prog start time */
> +	emit_mv(RV_REG_A1, RV_REG_S1, ctx);
> +	/* arg3: &run_ctx */
> +	emit_addi(RV_REG_A2, RV_REG_FP, -run_ctx_off, ctx);
> +	ret = emit_call((const u64)bpf_trampoline_exit(p), true, ctx);
> +
> +	return ret;
> +}
> +
> +static int invoke_bpf_mod_ret(struct bpf_tramp_links *tl, int args_off, int retval_off,
> +			      int run_ctx_off, int *branches_off, struct rv_jit_context *ctx)
> +{
> +	int i, ret;
> +
> +	/* cleanup to avoid garbage return value confusion */
> +	emit_sd(RV_REG_FP, -retval_off, RV_REG_ZERO, ctx);
> +	for (i = 0; i < tl->nr_links; i++) {
> +		ret = invoke_bpf_prog(tl->links[i], args_off, retval_off,
> +				run_ctx_off, true, ctx);
> +		if (ret)
> +			return ret;
> +		emit_ld(RV_REG_T1, -retval_off, RV_REG_FP, ctx);
> +		branches_off[i] = ctx->ninsns;
> +		/* nop reserved for conditional jump */
> +		emit(rv_nop(), ctx);
> +	}
> +
> +	return 0;
> +}
> +
> +static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im,
> +					 const struct btf_func_model *m,
> +					 struct bpf_tramp_links *tlinks,
> +					 void *func_addr, u32 flags,
> +					 struct rv_jit_context *ctx)
> +{
> +	int i, ret, offset;
> +	int *branches_off = NULL;
> +	int stack_size = 0, nregs = m->nr_args;
> +	int retaddr_off, fp_off, retval_off, args_off;
> +	int nregs_off, ip_off, run_ctx_off, sreg_off;
> +	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
> +	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
> +	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
> +	void *orig_call = func_addr;
> +	bool save_ret;
> +	u32 insn;
> +
> +	/* Generated trampoline stack layout:
> +	 *
> +	 * FP - 8	    [ RA of parent func	] return address of parent
> +	 *					  function
> +	 * FP - retaddr_off [ RA of traced func	] return address of traced
> +	 *					  function
> +	 * FP - fp_off	    [ FP of parent func ]
> +	 *
> +	 * FP - retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
> +	 *					  BPF_TRAMP_F_RET_FENTRY_RET
> +	 *                  [ argN              ]
> +	 *                  [ ...               ]
> +	 * FP - args_off    [ arg1              ]
> +	 *
> +	 * FP - nregs_off   [ regs count        ]
> +	 *
> +	 * FP - ip_off      [ traced func	] BPF_TRAMP_F_IP_ARG
> +	 *
> +	 * FP - run_ctx_off [ bpf_tramp_run_ctx ]
> +	 *
> +	 * FP - sreg_off    [ callee saved reg	]
> +	 *
> +	 *		    [ pads              ] pads for 16 bytes alignment
> +	 */
> +
> +	if (flags & (BPF_TRAMP_F_ORIG_STACK | BPF_TRAMP_F_SHARE_IPMODIFY))
> +		return -ENOTSUPP;
> +
> +	/* extra regiters for struct arguments */
> +	for (i = 0; i < m->nr_args; i++)
> +		if (m->arg_flags[i] & BTF_FMODEL_STRUCT_ARG)
> +			nregs += round_up(m->arg_size[i], 8) / 8 - 1;
> +
> +	/* 8 arguments passed by registers */
> +	if (nregs > 8)
> +		return -ENOTSUPP;
> +
> +	/* room for parent function return address */
> +	stack_size += 8;
> +
> +	stack_size += 8;
> +	retaddr_off = stack_size;
> +
> +	stack_size += 8;
> +	fp_off = stack_size;
> +
> +	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
> +	if (save_ret) {
> +		stack_size += 8;
> +		retval_off = stack_size;
> +	}
> +
> +	stack_size += nregs * 8;
> +	args_off = stack_size;
> +
> +	stack_size += 8;
> +	nregs_off = stack_size;
> +
> +	if (flags & BPF_TRAMP_F_IP_ARG) {
> +		stack_size += 8;
> +		ip_off = stack_size;
> +	}
> +
> +	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
> +	run_ctx_off = stack_size;
> +
> +	stack_size += 8;
> +	sreg_off = stack_size;
> +
> +	stack_size = round_up(stack_size, 16);
> +
> +	emit_addi(RV_REG_SP, RV_REG_SP, -stack_size, ctx);
> +
> +	emit_sd(RV_REG_SP, stack_size - retaddr_off, RV_REG_RA, ctx);
> +	emit_sd(RV_REG_SP, stack_size - fp_off, RV_REG_FP, ctx);
> +
> +	emit_addi(RV_REG_FP, RV_REG_SP, stack_size, ctx);
> +
> +	/* callee saved register S1 to pass start time */
> +	emit_sd(RV_REG_FP, -sreg_off, RV_REG_S1, ctx);
> +
> +	/* store ip address of the traced function */
> +	if (flags & BPF_TRAMP_F_IP_ARG) {
> +		emit_imm(RV_REG_T1, (const s64)func_addr, ctx);
> +		emit_sd(RV_REG_FP, -ip_off, RV_REG_T1, ctx);
> +	}
> +
> +	emit_li(RV_REG_T1, nregs, ctx);
> +	emit_sd(RV_REG_FP, -nregs_off, RV_REG_T1, ctx);
> +
> +	store_args(nregs, args_off, ctx);
> +
> +	/* skip to actual body of traced function */
> +	if (flags & BPF_TRAMP_F_SKIP_FRAME)
> +		orig_call += 16;
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		emit_imm(RV_REG_A0, (const s64)im, ctx);
> +		ret = emit_call((const u64)__bpf_tramp_enter, true, ctx);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	for (i = 0; i < fentry->nr_links; i++) {
> +		ret = invoke_bpf_prog(fentry->links[i], args_off, retval_off, run_ctx_off,
> +				      flags & BPF_TRAMP_F_RET_FENTRY_RET, ctx);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (fmod_ret->nr_links) {
> +		branches_off = kcalloc(fmod_ret->nr_links, sizeof(int), GFP_KERNEL);
> +		if (!branches_off)
> +			return -ENOMEM;
> +
> +		ret = invoke_bpf_mod_ret(fmod_ret, args_off, retval_off, run_ctx_off,
> +					 branches_off, ctx);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		restore_args(nregs, args_off, ctx);
> +		ret = emit_call((const u64)orig_call, true, ctx);
> +		if (ret)
> +			return ret;
> +		emit_sd(RV_REG_FP, -retval_off, RV_REG_A0, ctx);
> +		/* nop reserved for bpf_tramp_image_put */
> +		im->ip_after_call = ctx->insns + ctx->ninsns;
> +		emit(rv_nop(), ctx);
> +	}
> +
> +	/* update branches saved in invoke_bpf_mod_ret with bnez */
> +	for (i = 0; i < fmod_ret->nr_links; i++) {
> +		offset = ninsns_rvoff(ctx->ninsns - branches_off[i]);
> +		insn = rv_bne(RV_REG_T1, RV_REG_ZERO, offset >> 1);
> +		*(u32 *)(ctx->insns + branches_off[i]) = insn;
> +	}
> +
> +	for (i = 0; i < fexit->nr_links; i++) {
> +		ret = invoke_bpf_prog(fexit->links[i], args_off, retval_off,
> +				      run_ctx_off, false, ctx);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
> +		im->ip_epilogue = ctx->insns + ctx->ninsns;
> +		emit_imm(RV_REG_A0, (const s64)im, ctx);
> +		ret = emit_call((const u64)__bpf_tramp_exit, true, ctx);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (flags & BPF_TRAMP_F_RESTORE_REGS)
> +		restore_args(nregs, args_off, ctx);
> +
> +	if (save_ret)
> +		emit_ld(RV_REG_A0, -retval_off, RV_REG_FP, ctx);
> +
> +	emit_ld(RV_REG_S1, -sreg_off, RV_REG_FP, ctx);
> +
> +	if (flags & BPF_TRAMP_F_SKIP_FRAME)
> +		/* return address of parent function */
> +		emit_ld(RV_REG_RA, stack_size - 8, RV_REG_SP, ctx);
> +	else
> +		/* return address of traced function */
> +		emit_ld(RV_REG_RA, stack_size - retaddr_off, RV_REG_SP, ctx);
> +
> +	emit_ld(RV_REG_FP, stack_size - fp_off, RV_REG_SP, ctx);
> +	emit_addi(RV_REG_SP, RV_REG_SP, stack_size, ctx);
> +
> +	emit_jalr(RV_REG_ZERO, RV_REG_RA, 0, ctx);
> +
> +	bpf_flush_icache(ctx->insns, ctx->insns + ctx->ninsns);
> +
> +	kfree(branches_off);
> +
> +	return ctx->ninsns;
> +
> +}
> +
> +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
> +				void *image_end, const struct btf_func_model *m,
> +				u32 flags, struct bpf_tramp_links *tlinks,
> +				void *func_addr)
> +{
> +	int ret;
> +	struct rv_jit_context ctx;
> +
> +	ctx.ninsns = 0;
> +	ctx.insns = image;
> +	ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ninsns_rvoff(ret) > (long)image_end - (long)image)
> +		return -EFBIG;

This looks risky! First you generate the image, and here you realize
that you already wrote in all the wrong places?!

> +
> +	return ninsns_rvoff(ret);

Ok, this was a bit subtle to me. The return value of this function
is used in kernel/bpf/bpf_struct_ops.c. Now I know! :-)


Thanks!
Björn

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline support for RV64
  2023-01-03 19:16   ` Björn Töpel
@ 2023-01-06  1:36     ` Pu Lehui
  0 siblings, 0 replies; 18+ messages in thread
From: Pu Lehui @ 2023-01-06  1:36 UTC (permalink / raw)
  To: Björn Töpel, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui



On 2023/1/4 3:16, Björn Töpel wrote:
> Pu Lehui <pulehui@huaweicloud.com> writes:
> 
>> From: Pu Lehui <pulehui@huawei.com>
>>
>> BPF trampoline is the critical infrastructure of the bpf
>> subsystem, acting as a mediator between kernel functions
>> and BPF programs. Numerous important features, such as
>> using ebpf program for zero overhead kernel introspection,
>> rely on this key component. We can't wait to support bpf
>> trampoline on RV64. The implementation of bpf trampoline
>> was closely to x86 and arm64 for future development. The
>> related tests have passed, as well as the test_verifier
>> with no new failure ceses.
>>
>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>> ---
>>   arch/riscv/net/bpf_jit_comp64.c | 322 ++++++++++++++++++++++++++++++++
>>   1 file changed, 322 insertions(+)
>>
>> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
>> index fa8b03c52463..11c001782e7b 100644
>> --- a/arch/riscv/net/bpf_jit_comp64.c
>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>> @@ -738,6 +738,328 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>>   	       bpf_text_poke_jump(ip, old_addr, new_addr);
>>   }
>>   
>> +static void store_args(int nregs, int args_off, struct rv_jit_context *ctx)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nregs; i++) {
>> +		emit_sd(RV_REG_FP, -args_off, RV_REG_A0 + i, ctx);
>> +		args_off -= 8;
>> +	}
>> +}
>> +
>> +static void restore_args(int nregs, int args_off, struct rv_jit_context *ctx)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nregs; i++) {
>> +		emit_ld(RV_REG_A0 + i, -args_off, RV_REG_FP, ctx);
>> +		args_off -= 8;
>> +	}
>> +}
>> +
>> +static int invoke_bpf_prog(struct bpf_tramp_link *l, int args_off, int retval_off,
>> +			   int run_ctx_off, bool save_ret, struct rv_jit_context *ctx)
>> +{
>> +	u32 insn;
>> +	int ret, branch_off, offset;
>> +	struct bpf_prog *p = l->link.prog;
>> +	int cookie_off = offsetof(struct bpf_tramp_run_ctx, bpf_cookie);
>> +
>> +	if (l->cookie) {
>> +		emit_imm(RV_REG_T1, l->cookie, ctx);
>> +		emit_sd(RV_REG_FP, -run_ctx_off + cookie_off, RV_REG_T1, ctx);
>> +	} else {
>> +		emit_sd(RV_REG_FP, -run_ctx_off + cookie_off, RV_REG_ZERO, ctx);
>> +	}
>> +
>> +	/* arg1: prog */
>> +	emit_imm(RV_REG_A0, (const s64)p, ctx);
>> +	/* arg2: &run_ctx */
>> +	emit_addi(RV_REG_A1, RV_REG_FP, -run_ctx_off, ctx);
>> +	ret = emit_call((const u64)bpf_trampoline_enter(p), true, ctx);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* if (__bpf_prog_enter(prog) == 0)
>> +	 *	goto skip_exec_of_prog;
>> +	 */
>> +	branch_off = ctx->ninsns;
>> +	/* nop reserved for conditional jump */
>> +	emit(rv_nop(), ctx);
>> +
>> +	/* store prog start time */
>> +	emit_mv(RV_REG_S1, RV_REG_A0, ctx);
>> +
>> +	/* arg1: &args_off */
>> +	emit_addi(RV_REG_A0, RV_REG_FP, -args_off, ctx);
>> +	if (!p->jited)
>> +		/* arg2: progs[i]->insnsi for interpreter */
>> +		emit_imm(RV_REG_A1, (const s64)p->insnsi, ctx);
>> +	ret = emit_call((const u64)p->bpf_func, true, ctx);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (save_ret)
>> +		emit_sd(RV_REG_FP, -retval_off, regmap[BPF_REG_0], ctx);
>> +
>> +	/* update branch with beqz */
>> +	offset = ninsns_rvoff(ctx->ninsns - branch_off);
>> +	insn = rv_beq(RV_REG_A0, RV_REG_ZERO, offset >> 1);
>> +	*(u32 *)(ctx->insns + branch_off) = insn;
>> +
>> +	/* arg1: prog */
>> +	emit_imm(RV_REG_A0, (const s64)p, ctx);
>> +	/* arg2: prog start time */
>> +	emit_mv(RV_REG_A1, RV_REG_S1, ctx);
>> +	/* arg3: &run_ctx */
>> +	emit_addi(RV_REG_A2, RV_REG_FP, -run_ctx_off, ctx);
>> +	ret = emit_call((const u64)bpf_trampoline_exit(p), true, ctx);
>> +
>> +	return ret;
>> +}
>> +
>> +static int invoke_bpf_mod_ret(struct bpf_tramp_links *tl, int args_off, int retval_off,
>> +			      int run_ctx_off, int *branches_off, struct rv_jit_context *ctx)
>> +{
>> +	int i, ret;
>> +
>> +	/* cleanup to avoid garbage return value confusion */
>> +	emit_sd(RV_REG_FP, -retval_off, RV_REG_ZERO, ctx);
>> +	for (i = 0; i < tl->nr_links; i++) {
>> +		ret = invoke_bpf_prog(tl->links[i], args_off, retval_off,
>> +				run_ctx_off, true, ctx);
>> +		if (ret)
>> +			return ret;
>> +		emit_ld(RV_REG_T1, -retval_off, RV_REG_FP, ctx);
>> +		branches_off[i] = ctx->ninsns;
>> +		/* nop reserved for conditional jump */
>> +		emit(rv_nop(), ctx);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im,
>> +					 const struct btf_func_model *m,
>> +					 struct bpf_tramp_links *tlinks,
>> +					 void *func_addr, u32 flags,
>> +					 struct rv_jit_context *ctx)
>> +{
>> +	int i, ret, offset;
>> +	int *branches_off = NULL;
>> +	int stack_size = 0, nregs = m->nr_args;
>> +	int retaddr_off, fp_off, retval_off, args_off;
>> +	int nregs_off, ip_off, run_ctx_off, sreg_off;
>> +	struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
>> +	struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
>> +	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>> +	void *orig_call = func_addr;
>> +	bool save_ret;
>> +	u32 insn;
>> +
>> +	/* Generated trampoline stack layout:
>> +	 *
>> +	 * FP - 8	    [ RA of parent func	] return address of parent
>> +	 *					  function
>> +	 * FP - retaddr_off [ RA of traced func	] return address of traced
>> +	 *					  function
>> +	 * FP - fp_off	    [ FP of parent func ]
>> +	 *
>> +	 * FP - retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
>> +	 *					  BPF_TRAMP_F_RET_FENTRY_RET
>> +	 *                  [ argN              ]
>> +	 *                  [ ...               ]
>> +	 * FP - args_off    [ arg1              ]
>> +	 *
>> +	 * FP - nregs_off   [ regs count        ]
>> +	 *
>> +	 * FP - ip_off      [ traced func	] BPF_TRAMP_F_IP_ARG
>> +	 *
>> +	 * FP - run_ctx_off [ bpf_tramp_run_ctx ]
>> +	 *
>> +	 * FP - sreg_off    [ callee saved reg	]
>> +	 *
>> +	 *		    [ pads              ] pads for 16 bytes alignment
>> +	 */
>> +
>> +	if (flags & (BPF_TRAMP_F_ORIG_STACK | BPF_TRAMP_F_SHARE_IPMODIFY))
>> +		return -ENOTSUPP;
>> +
>> +	/* extra regiters for struct arguments */
>> +	for (i = 0; i < m->nr_args; i++)
>> +		if (m->arg_flags[i] & BTF_FMODEL_STRUCT_ARG)
>> +			nregs += round_up(m->arg_size[i], 8) / 8 - 1;
>> +
>> +	/* 8 arguments passed by registers */
>> +	if (nregs > 8)
>> +		return -ENOTSUPP;
>> +
>> +	/* room for parent function return address */
>> +	stack_size += 8;
>> +
>> +	stack_size += 8;
>> +	retaddr_off = stack_size;
>> +
>> +	stack_size += 8;
>> +	fp_off = stack_size;
>> +
>> +	save_ret = flags & (BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_RET_FENTRY_RET);
>> +	if (save_ret) {
>> +		stack_size += 8;
>> +		retval_off = stack_size;
>> +	}
>> +
>> +	stack_size += nregs * 8;
>> +	args_off = stack_size;
>> +
>> +	stack_size += 8;
>> +	nregs_off = stack_size;
>> +
>> +	if (flags & BPF_TRAMP_F_IP_ARG) {
>> +		stack_size += 8;
>> +		ip_off = stack_size;
>> +	}
>> +
>> +	stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
>> +	run_ctx_off = stack_size;
>> +
>> +	stack_size += 8;
>> +	sreg_off = stack_size;
>> +
>> +	stack_size = round_up(stack_size, 16);
>> +
>> +	emit_addi(RV_REG_SP, RV_REG_SP, -stack_size, ctx);
>> +
>> +	emit_sd(RV_REG_SP, stack_size - retaddr_off, RV_REG_RA, ctx);
>> +	emit_sd(RV_REG_SP, stack_size - fp_off, RV_REG_FP, ctx);
>> +
>> +	emit_addi(RV_REG_FP, RV_REG_SP, stack_size, ctx);
>> +
>> +	/* callee saved register S1 to pass start time */
>> +	emit_sd(RV_REG_FP, -sreg_off, RV_REG_S1, ctx);
>> +
>> +	/* store ip address of the traced function */
>> +	if (flags & BPF_TRAMP_F_IP_ARG) {
>> +		emit_imm(RV_REG_T1, (const s64)func_addr, ctx);
>> +		emit_sd(RV_REG_FP, -ip_off, RV_REG_T1, ctx);
>> +	}
>> +
>> +	emit_li(RV_REG_T1, nregs, ctx);
>> +	emit_sd(RV_REG_FP, -nregs_off, RV_REG_T1, ctx);
>> +
>> +	store_args(nregs, args_off, ctx);
>> +
>> +	/* skip to actual body of traced function */
>> +	if (flags & BPF_TRAMP_F_SKIP_FRAME)
>> +		orig_call += 16;
>> +
>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>> +		emit_imm(RV_REG_A0, (const s64)im, ctx);
>> +		ret = emit_call((const u64)__bpf_tramp_enter, true, ctx);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	for (i = 0; i < fentry->nr_links; i++) {
>> +		ret = invoke_bpf_prog(fentry->links[i], args_off, retval_off, run_ctx_off,
>> +				      flags & BPF_TRAMP_F_RET_FENTRY_RET, ctx);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	if (fmod_ret->nr_links) {
>> +		branches_off = kcalloc(fmod_ret->nr_links, sizeof(int), GFP_KERNEL);
>> +		if (!branches_off)
>> +			return -ENOMEM;
>> +
>> +		ret = invoke_bpf_mod_ret(fmod_ret, args_off, retval_off, run_ctx_off,
>> +					 branches_off, ctx);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>> +		restore_args(nregs, args_off, ctx);
>> +		ret = emit_call((const u64)orig_call, true, ctx);
>> +		if (ret)
>> +			return ret;
>> +		emit_sd(RV_REG_FP, -retval_off, RV_REG_A0, ctx);
>> +		/* nop reserved for bpf_tramp_image_put */
>> +		im->ip_after_call = ctx->insns + ctx->ninsns;
>> +		emit(rv_nop(), ctx);
>> +	}
>> +
>> +	/* update branches saved in invoke_bpf_mod_ret with bnez */
>> +	for (i = 0; i < fmod_ret->nr_links; i++) {
>> +		offset = ninsns_rvoff(ctx->ninsns - branches_off[i]);
>> +		insn = rv_bne(RV_REG_T1, RV_REG_ZERO, offset >> 1);
>> +		*(u32 *)(ctx->insns + branches_off[i]) = insn;
>> +	}
>> +
>> +	for (i = 0; i < fexit->nr_links; i++) {
>> +		ret = invoke_bpf_prog(fexit->links[i], args_off, retval_off,
>> +				      run_ctx_off, false, ctx);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	if (flags & BPF_TRAMP_F_CALL_ORIG) {
>> +		im->ip_epilogue = ctx->insns + ctx->ninsns;
>> +		emit_imm(RV_REG_A0, (const s64)im, ctx);
>> +		ret = emit_call((const u64)__bpf_tramp_exit, true, ctx);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	if (flags & BPF_TRAMP_F_RESTORE_REGS)
>> +		restore_args(nregs, args_off, ctx);
>> +
>> +	if (save_ret)
>> +		emit_ld(RV_REG_A0, -retval_off, RV_REG_FP, ctx);
>> +
>> +	emit_ld(RV_REG_S1, -sreg_off, RV_REG_FP, ctx);
>> +
>> +	if (flags & BPF_TRAMP_F_SKIP_FRAME)
>> +		/* return address of parent function */
>> +		emit_ld(RV_REG_RA, stack_size - 8, RV_REG_SP, ctx);
>> +	else
>> +		/* return address of traced function */
>> +		emit_ld(RV_REG_RA, stack_size - retaddr_off, RV_REG_SP, ctx);
>> +
>> +	emit_ld(RV_REG_FP, stack_size - fp_off, RV_REG_SP, ctx);
>> +	emit_addi(RV_REG_SP, RV_REG_SP, stack_size, ctx);
>> +
>> +	emit_jalr(RV_REG_ZERO, RV_REG_RA, 0, ctx);
>> +
>> +	bpf_flush_icache(ctx->insns, ctx->insns + ctx->ninsns);
>> +
>> +	kfree(branches_off);
>> +
>> +	return ctx->ninsns;
>> +
>> +}
>> +
>> +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
>> +				void *image_end, const struct btf_func_model *m,
>> +				u32 flags, struct bpf_tramp_links *tlinks,
>> +				void *func_addr)
>> +{
>> +	int ret;
>> +	struct rv_jit_context ctx;
>> +
>> +	ctx.ninsns = 0;
>> +	ctx.insns = image;
>> +	ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (ninsns_rvoff(ret) > (long)image_end - (long)image)
>> +		return -EFBIG;
> 
> This looks risky! First you generate the image, and here you realize
> that you already wrote in all the wrong places?!
> 

Oops, this will probably lead to an out-of-bounds write. Will fix in the next version.
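
One possible shape is a sizing pass before anything is written, along
these lines (just a sketch: it assumes the emit helpers, the branch
fixups and the icache flush in __arch_prepare_bpf_trampoline() are made
safe for a NULL ctx->insns, like the two-pass JIT):

int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image,
				void *image_end, const struct btf_func_model *m,
				u32 flags, struct bpf_tramp_links *tlinks,
				void *func_addr)
{
	int ret;
	struct rv_jit_context ctx;

	/* Pass 1: only count instructions, nothing is written yet. */
	ctx.ninsns = 0;
	ctx.insns = NULL;
	ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx);
	if (ret < 0)
		return ret;
	if (ninsns_rvoff(ret) > (long)image_end - (long)image)
		return -EFBIG;

	/* Pass 2: the image is known to be large enough, emit for real. */
	ctx.ninsns = 0;
	ctx.insns = image;
	ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx);
	if (ret < 0)
		return ret;

	return ninsns_rvoff(ret);
}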

>> +
>> +	return ninsns_rvoff(ret);
> 
> Ok, this was a bit subtle to me. The return value of this function
> is used in kernel/bpf/bpf_struct_ops.c. Now I know! :-)
> 
> 
> Thanks!
> Björn


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64
  2023-01-03 18:12   ` Björn Töpel
@ 2023-01-06  1:57     ` Pu Lehui
  2023-01-06  3:59       ` Pu Lehui
  0 siblings, 1 reply; 18+ messages in thread
From: Pu Lehui @ 2023-01-06  1:57 UTC (permalink / raw)
  To: Björn Töpel, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui



On 2023/1/4 2:12, Björn Töpel wrote:
> Pu Lehui <pulehui@huaweicloud.com> writes:
> 
>> From: Pu Lehui <pulehui@huawei.com>
>>
>> Implement bpf_arch_text_poke for RV64. For call scenario,
>> ftrace framework reserve 4 nops for RV64 kernel function
>> as function entry, and use auipc+jalr instructions to call
>> kernel or module functions. However, since the auipc+jalr
>> call instructions is non-atomic operation, we need to use
>> stop-machine to make sure instruction patching in atomic
>> context. As for jump scenario, since we only jump inside
>> the trampoline, a jal instruction is sufficient.
> 
> Hmm, is that really true? More below!
> 
>>
>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>> ---
>>   arch/riscv/net/bpf_jit.h        |   5 ++
>>   arch/riscv/net/bpf_jit_comp64.c | 131 +++++++++++++++++++++++++++++++-
>>   2 files changed, 134 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/riscv/net/bpf_jit.h b/arch/riscv/net/bpf_jit.h
>> index d926e0f7ef57..bf9802a63061 100644
>> --- a/arch/riscv/net/bpf_jit.h
>> +++ b/arch/riscv/net/bpf_jit.h
>> @@ -573,6 +573,11 @@ static inline u32 rv_fence(u8 pred, u8 succ)
>>   	return rv_i_insn(imm11_0, 0, 0, 0, 0xf);
>>   }
>>   
>> +static inline u32 rv_nop(void)
>> +{
>> +	return rv_i_insn(0, 0, 0, 0, 0x13);
>> +}
>> +
>>   /* RVC instrutions. */
>>   
>>   static inline u16 rvc_addi4spn(u8 rd, u32 imm10)
>> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
>> index bf4721a99a09..fa8b03c52463 100644
>> --- a/arch/riscv/net/bpf_jit_comp64.c
>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>> @@ -8,6 +8,8 @@
>>   #include <linux/bitfield.h>
>>   #include <linux/bpf.h>
>>   #include <linux/filter.h>
>> +#include <linux/memory.h>
>> +#include <linux/stop_machine.h>
>>   #include "bpf_jit.h"
>>   
>>   #define RV_REG_TCC RV_REG_A6
>> @@ -238,7 +240,7 @@ static void __build_epilogue(bool is_tail_call, struct rv_jit_context *ctx)
>>   	if (!is_tail_call)
>>   		emit_mv(RV_REG_A0, RV_REG_A5, ctx);
>>   	emit_jalr(RV_REG_ZERO, is_tail_call ? RV_REG_T3 : RV_REG_RA,
>> -		  is_tail_call ? 4 : 0, /* skip TCC init */
>> +		  is_tail_call ? 20 : 0, /* skip reserved nops and TCC init */
>>   		  ctx);
>>   }
>>   
>> @@ -615,6 +617,127 @@ static int add_exception_handler(const struct bpf_insn *insn,
>>   	return 0;
>>   }
>>   
>> +struct text_poke_args {
>> +	void *addr;
>> +	const void *insns;
>> +	size_t len;
>> +	atomic_t cpu_count;
>> +};
>> +
>> +static int do_text_poke(void *data)
>> +{
>> +	int ret = 0;
>> +	struct text_poke_args *patch = data;
>> +
>> +	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
>> +		ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
>> +		atomic_inc(&patch->cpu_count);
>> +	} else {
>> +		while (atomic_read(&patch->cpu_count) <= num_online_cpus())
>> +			cpu_relax();
>> +		smp_mb();
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static int bpf_text_poke_stop_machine(void *addr, const void *insns, size_t len)
>> +{
>> +	struct text_poke_args patch = {
>> +		.addr = addr,
>> +		.insns = insns,
>> +		.len = len,
>> +		.cpu_count = ATOMIC_INIT(0),
>> +	};
>> +
>> +	return stop_machine(do_text_poke, &patch, cpu_online_mask);
>> +}
>> +
>> +static int gen_call_or_nops(void *target, void *ip, u32 *insns)
>> +{
>> +	int i, ret;
>> +	s64 rvoff;
>> +	struct rv_jit_context ctx;
>> +
>> +	ctx.ninsns = 0;
>> +	ctx.insns = (u16 *)insns;
>> +
>> +	if (!target) {
>> +		for (i = 0; i < 4; i++)
>> +			emit(rv_nop(), &ctx);
>> +		return 0;
>> +	}
>> +
>> +	rvoff = (s64)(target - ip);
>> +	emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);
>> +	ret = emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx);
>> +	if (ret)
>> +		return ret;
>> +	emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);
>> +
>> +	return 0;
>> +
>> +}
>> +
>> +static int bpf_text_poke_call(void *ip, void *old_addr, void *new_addr)
>> +{
>> +	int ret;
>> +	u32 old_insns[4], new_insns[4];
>> +
>> +	ret = gen_call_or_nops(old_addr, ip + 4, old_insns);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = gen_call_or_nops(new_addr, ip + 4, new_insns);
>> +	if (ret)
>> +		return ret;
>> +
>> +	mutex_lock(&text_mutex);
>> +	if (memcmp(ip, old_insns, sizeof(old_insns))) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	if (memcmp(ip, new_insns, sizeof(new_insns)))
>> +		ret = bpf_text_poke_stop_machine(ip, new_insns, sizeof(new_insns));
> 
> I'd rather see that you added a patch_text variant to
> arch/riscv/kernel/patch.c (something like your
> bpf_text_poke_stop_machine()), and use that here. Might be other users
> of that as well -- Andy's ftrace patch maybe? :-)
> 

Good idea.

>> +out:
>> +	mutex_unlock(&text_mutex);
>> +	return ret;
>> +}
>> +
>> +static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
>> +{
>> +	int ret;
>> +	u32 old_insn, new_insn;
>> +
>> +	old_insn = old_addr ? rv_jal(RV_REG_ZERO, (s64)(old_addr - ip) >> 1) : rv_nop();
>> +	new_insn = new_addr ? rv_jal(RV_REG_ZERO, (s64)(new_addr - ip) >> 1) : rv_nop();
>> +
>> +	mutex_lock(&text_mutex);
>> +	if (memcmp(ip, &old_insn, sizeof(old_insn))) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
>> +
>> +	if (memcmp(ip, &new_insn, sizeof(new_insn)))
>> +		ret = patch_text_nosync(ip, &new_insn, sizeof(new_insn));
>> +out:
>> +	mutex_unlock(&text_mutex);
>> +	return ret;
>> +}
>> +
>> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>> +		       void *old_addr, void *new_addr)
> 
> AFAIU there's nothing in the bpf_arch_text_poke() API that says
> BPF_MOD_JUMP is only for jumps within the trampoline. That is one usage,
> but not the only one. In general, the jal might not have enough reach.
> 
> I believe that this needs to be an auipc/jalr pair similar to
> BPF_MOD_CALL (w/o linked register).
> 

The initial idea was that currently BPF_MOD_JUMP only serves for 
bpf_tramp_image_put, and jal, whose range is +/- 1MB, is sufficient for 
the distance between im->ip_after_call and im->ip_epilogue, which also 
avoids the non-atomic auipc/jalr pair. But on deeper consideration, this 
might be extended to other uses, such as tail call optimization. So I 
agree with your suggestion.
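
For the jump path that would end up looking roughly like this (sketch
only: gen_jump_or_nops() would be a hypothetical auipc/jalr twin of
gen_call_or_nops(), and since the 8-byte patch is no longer atomic it
has to go through the stop_machine helper as well):

static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
{
	int ret = 0;
	u32 old_insns[2], new_insns[2];

	ret = gen_jump_or_nops(old_addr, ip, old_insns);
	if (ret)
		return ret;

	ret = gen_jump_or_nops(new_addr, ip, new_insns);
	if (ret)
		return ret;

	mutex_lock(&text_mutex);
	if (memcmp(ip, old_insns, sizeof(old_insns))) {
		ret = -EFAULT;
		goto out;
	}

	if (memcmp(ip, new_insns, sizeof(new_insns)))
		ret = bpf_text_poke_stop_machine(ip, new_insns, sizeof(new_insns));
out:
	mutex_unlock(&text_mutex);
	return ret;
}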

> 
> And again, thanks for working on the RV trampoline!
> Björn


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call
  2023-01-03 12:05       ` Björn Töpel
@ 2023-01-06  2:35         ` Pu Lehui
  0 siblings, 0 replies; 18+ messages in thread
From: Pu Lehui @ 2023-01-06  2:35 UTC (permalink / raw)
  To: Björn Töpel, Xu Kuohai, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Pu Lehui



On 2023/1/3 20:05, Björn Töpel wrote:
> Pu Lehui <pulehui@huaweicloud.com> writes:
> 
>> On 2022/12/20 10:32, Xu Kuohai wrote:
>>> On 12/20/2022 10:13 AM, Pu Lehui wrote:
>>>> From: Pu Lehui <pulehui@huawei.com>
>>>>
>>>> The current bpf trampoline attach to kernel functions via ftrace direct
>>>> call API, while text_poke is applied for bpf2bpf attach and tail call
>>>> optimization. For architectures that do not support ftrace direct call,
>>>> text_poke is still able to attach bpf trampoline to kernel functions.
>>>> Let's relax it by rollback to text_poke when architecture not supported.
>>>>
>>>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>>>> ---
>>>>    kernel/bpf/trampoline.c | 8 ++------
>>>>    1 file changed, 2 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
>>>> index d6395215b849..386197a7952c 100644
>>>> --- a/kernel/bpf/trampoline.c
>>>> +++ b/kernel/bpf/trampoline.c
>>>> @@ -228,15 +228,11 @@ static int modify_fentry(struct bpf_trampoline
>>>> *tr, void *old_addr, void *new_ad
>>>>    static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
>>>>    {
>>>>        void *ip = tr->func.addr;
>>>> -    unsigned long faddr;
>>>>        int ret;
>>>> -    faddr = ftrace_location((unsigned long)ip);
>>>> -    if (faddr) {
>>>> -        if (!tr->fops)
>>>> -            return -ENOTSUPP;
>>>> +    if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS) &&
>>>> +        !!ftrace_location((unsigned long)ip))
>>>>            tr->func.ftrace_managed = true;
>>>> -    }
>>>>
>>>
>>> After this patch, a kernel function with true trace_location will be patched
>>> by bpf when CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is disabled, which means
>>> that a kernel function may be patched by both bpf and ftrace in a mutually
>>> unaware way. This will cause ftrace and bpf_arch_text_poke to fail in a
>>> somewhat random way if the function to be patched was already patched
>>> by the other.
>>
>> Thanks for your review. And yes, this is a backward compatible solution
>> for architectures that do not support DYNAMIC_FTRACE_WITH_DIRECT_CALLS.
> 
> It's not "backward compatible". Reiterating what Kuohai said: the BPF
> trampoline must be aware of ftrace's state -- with this patch, the
> trampoline can't blindly poke the text managed by ftrace.
> 
> I'd recommend to remove this patch from the series.
> 

After further consideration, Kuohai's catch is much more reasonable. 
Will remove this patch in the next version.

In the meantime, I found that Song and Guoren have been working on 
supporting riscv ftrace with direct calls [0], so we can concentrate on 
making bpf_arch_text_poke specific to the bpf context.

However, the riscv ftrace base framework will change because [0] uses 
t0 as the link register of the traced function, so we need to consider 
how generic the riscv bpf trampoline can be across kernel functions and 
the bpf context. It's not clear whether [0] will be upstreamed, so maybe 
we should wait for it?

[0] 
https://lore.kernel.org/linux-riscv/20221208091244.203407-7-guoren@kernel.org

Anyway, thanks both of you for the review.

> 
> Björn


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64
  2023-01-06  1:57     ` Pu Lehui
@ 2023-01-06  3:59       ` Pu Lehui
  0 siblings, 0 replies; 18+ messages in thread
From: Pu Lehui @ 2023-01-06  3:59 UTC (permalink / raw)
  To: Björn Töpel, Pu Lehui, bpf, linux-riscv, linux-kernel
  Cc: Luke Nelson, Xi Wang, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou



On 2023/1/6 9:57, Pu Lehui wrote:
> 
> 
> On 2023/1/4 2:12, Björn Töpel wrote:
>> Pu Lehui <pulehui@huaweicloud.com> writes:
>>
>>> From: Pu Lehui <pulehui@huawei.com>
>>>
>>> Implement bpf_arch_text_poke for RV64. For call scenario,
>>> ftrace framework reserve 4 nops for RV64 kernel function
>>> as function entry, and use auipc+jalr instructions to call
>>> kernel or module functions. However, since the auipc+jalr
>>> call instructions is non-atomic operation, we need to use
>>> stop-machine to make sure instruction patching in atomic
>>> context. As for jump scenario, since we only jump inside
>>> the trampoline, a jal instruction is sufficient.
>>
>> Hmm, is that really true? More below!
>>
>>>
>>> Signed-off-by: Pu Lehui <pulehui@huawei.com>
>>> ---
>>>   arch/riscv/net/bpf_jit.h        |   5 ++
>>>   arch/riscv/net/bpf_jit_comp64.c | 131 +++++++++++++++++++++++++++++++-
>>>   2 files changed, 134 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/riscv/net/bpf_jit.h b/arch/riscv/net/bpf_jit.h
>>> index d926e0f7ef57..bf9802a63061 100644
>>> --- a/arch/riscv/net/bpf_jit.h
>>> +++ b/arch/riscv/net/bpf_jit.h
>>> @@ -573,6 +573,11 @@ static inline u32 rv_fence(u8 pred, u8 succ)
>>>       return rv_i_insn(imm11_0, 0, 0, 0, 0xf);
>>>   }
>>> +static inline u32 rv_nop(void)
>>> +{
>>> +    return rv_i_insn(0, 0, 0, 0, 0x13);
>>> +}
>>> +
>>>   /* RVC instrutions. */
>>>   static inline u16 rvc_addi4spn(u8 rd, u32 imm10)
>>> diff --git a/arch/riscv/net/bpf_jit_comp64.c 
>>> b/arch/riscv/net/bpf_jit_comp64.c
>>> index bf4721a99a09..fa8b03c52463 100644
>>> --- a/arch/riscv/net/bpf_jit_comp64.c
>>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>>> @@ -8,6 +8,8 @@
>>>   #include <linux/bitfield.h>
>>>   #include <linux/bpf.h>
>>>   #include <linux/filter.h>
>>> +#include <linux/memory.h>
>>> +#include <linux/stop_machine.h>
>>>   #include "bpf_jit.h"
>>>   #define RV_REG_TCC RV_REG_A6
>>> @@ -238,7 +240,7 @@ static void __build_epilogue(bool is_tail_call, 
>>> struct rv_jit_context *ctx)
>>>       if (!is_tail_call)
>>>           emit_mv(RV_REG_A0, RV_REG_A5, ctx);
>>>       emit_jalr(RV_REG_ZERO, is_tail_call ? RV_REG_T3 : RV_REG_RA,
>>> -          is_tail_call ? 4 : 0, /* skip TCC init */
>>> +          is_tail_call ? 20 : 0, /* skip reserved nops and TCC init */
>>>             ctx);
>>>   }
>>> @@ -615,6 +617,127 @@ static int add_exception_handler(const struct 
>>> bpf_insn *insn,
>>>       return 0;
>>>   }
>>> +struct text_poke_args {
>>> +    void *addr;
>>> +    const void *insns;
>>> +    size_t len;
>>> +    atomic_t cpu_count;
>>> +};
>>> +
>>> +static int do_text_poke(void *data)
>>> +{
>>> +    int ret = 0;
>>> +    struct text_poke_args *patch = data;
>>> +
>>> +    if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
>>> +        ret = patch_text_nosync(patch->addr, patch->insns, patch->len);
>>> +        atomic_inc(&patch->cpu_count);
>>> +    } else {
>>> +        while (atomic_read(&patch->cpu_count) <= num_online_cpus())
>>> +            cpu_relax();
>>> +        smp_mb();
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int bpf_text_poke_stop_machine(void *addr, const void *insns, 
>>> size_t len)
>>> +{
>>> +    struct text_poke_args patch = {
>>> +        .addr = addr,
>>> +        .insns = insns,
>>> +        .len = len,
>>> +        .cpu_count = ATOMIC_INIT(0),
>>> +    };
>>> +
>>> +    return stop_machine(do_text_poke, &patch, cpu_online_mask);
>>> +}
>>> +
>>> +static int gen_call_or_nops(void *target, void *ip, u32 *insns)
>>> +{
>>> +    int i, ret;
>>> +    s64 rvoff;
>>> +    struct rv_jit_context ctx;
>>> +
>>> +    ctx.ninsns = 0;
>>> +    ctx.insns = (u16 *)insns;
>>> +
>>> +    if (!target) {
>>> +        for (i = 0; i < 4; i++)
>>> +            emit(rv_nop(), &ctx);
>>> +        return 0;
>>> +    }
>>> +
>>> +    rvoff = (s64)(target - ip);
>>> +    emit(rv_sd(RV_REG_SP, -8, RV_REG_RA), &ctx);
>>> +    ret = emit_jump_and_link(RV_REG_RA, rvoff, false, &ctx);
>>> +    if (ret)
>>> +        return ret;
>>> +    emit(rv_ld(RV_REG_RA, -8, RV_REG_SP), &ctx);
>>> +
>>> +    return 0;
>>> +
>>> +}
>>> +
>>> +static int bpf_text_poke_call(void *ip, void *old_addr, void *new_addr)
>>> +{
>>> +    int ret;
>>> +    u32 old_insns[4], new_insns[4];
>>> +
>>> +    ret = gen_call_or_nops(old_addr, ip + 4, old_insns);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    ret = gen_call_or_nops(new_addr, ip + 4, new_insns);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    mutex_lock(&text_mutex);
>>> +    if (memcmp(ip, old_insns, sizeof(old_insns))) {
>>> +        ret = -EFAULT;
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (memcmp(ip, new_insns, sizeof(new_insns)))
>>> +        ret = bpf_text_poke_stop_machine(ip, new_insns, sizeof(new_insns));
>>
>> I'd rather see that you added a patch_text variant to
>> arch/riscv/kernel/patch.c (something like your
>> bpf_text_poke_stop_machine()), and use that here. Might be other users
>> of that as well -- Andy's ftrace patch maybe? :-)
>>
> 
> Good idea.
> 
>>> +out:
>>> +    mutex_unlock(&text_mutex);
>>> +    return ret;
>>> +}
>>> +
>>> +static int bpf_text_poke_jump(void *ip, void *old_addr, void *new_addr)
>>> +{
>>> +    int ret;
>>> +    u32 old_insn, new_insn;
>>> +
>>> +    old_insn = old_addr ? rv_jal(RV_REG_ZERO, (s64)(old_addr - ip) >> 1) : rv_nop();
>>> +    new_insn = new_addr ? rv_jal(RV_REG_ZERO, (s64)(new_addr - ip) >> 1) : rv_nop();
>>> +
>>> +    mutex_lock(&text_mutex);
>>> +    if (memcmp(ip, &old_insn, sizeof(old_insn))) {
>>> +        ret = -EFAULT;
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (memcmp(ip, &new_insn, sizeof(new_insn)))
>>> +        ret = patch_text_nosync(ip, &new_insn, sizeof(new_insn));
>>> +out:
>>> +    mutex_unlock(&text_mutex);
>>> +    return ret;
>>> +}
>>> +
>>> +int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type poke_type,
>>> +               void *old_addr, void *new_addr)
>>
>> AFAIU there's nothing in the bpf_arch_text_poke() API that says
>> BPF_MOD_JUMP is only for jumps within the trampoline. That is one usage,
>> but not the only one. In general, the jal might not have enough reach.
>>
>> I believe that this needs to be an auipc/jalr pair similar to
>> BPF_MOD_CALL (w/o linked register).
>>
> 
> The initial idea was that currently BPF_MOD_JUMP only serves for

small nit: that should read "the current riscv BPF_MOD_JUMP"

> bpf_tramp_image_put, and jal, which range is +/- 1MB, is sufficient for 
> the distance between im->ip_after_call and im->ip_epilogue, and try to 
> not use not-atomic auipc/jalr pair. But take deep consideration, this 
> might be extended to other uses, such as tailcall optimization. So agree 
> with your suggestion.
> 
>>
>> And again, thanks for working on the RV trampoline!
>> Björn
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-01-06  3:59 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-20  2:13 [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline for RV64 Pu Lehui
2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 1/4] bpf: Rollback to text_poke when arch not supported ftrace direct call Pu Lehui
2022-12-20  2:32   ` Xu Kuohai
2022-12-21  2:36     ` Pu Lehui
2023-01-03 12:05       ` Björn Töpel
2023-01-06  2:35         ` Pu Lehui
2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 2/4] riscv, bpf: Factor out emit_call for kernel and bpf context Pu Lehui
2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 3/4] riscv, bpf: Add bpf_arch_text_poke support for RV64 Pu Lehui
2022-12-20  3:00   ` Xu Kuohai
2022-12-22  3:01     ` Pu Lehui
2023-01-03 18:12   ` Björn Töpel
2023-01-06  1:57     ` Pu Lehui
2023-01-06  3:59       ` Pu Lehui
2022-12-20  2:13 ` [RFC PATCH RESEND bpf-next 4/4] riscv, bpf: Add bpf trampoline " Pu Lehui
2023-01-03 19:16   ` Björn Töpel
2023-01-06  1:36     ` Pu Lehui
2022-12-22 13:00 ` [RFC PATCH RESEND bpf-next 0/4] Support bpf trampoline " Björn Töpel
2022-12-24  6:22   ` Pu Lehui
