* [PATCH bpf-next v1 00/10] Exceptions - 1/2
@ 2023-07-13  2:32 Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 01/10] bpf: Fix kfunc callback register type handling Kumar Kartikeya Dwivedi
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

This series implements the _first_ part of the runtime and verifier
support needed to enable BPF exceptions. Exceptions thrown from programs
are processed as an immediate exit from the program, which unwinds all
the active stack frames until the main stack frame, and returns to the
BPF program's caller. The ability to perform this unwinding safely
allows the program to test conditions that are always true at runtime
but which the verifier has no visibility into.

Thus, it also reduces verification effort by safely terminating
redundant paths that can be taken within a program.

The patches to perform runtime resource cleanup during the
frame-by-frame unwinding will be posted as a follow-up to this set.

It must be noted that exceptions are not an error handling mechanism for
unlikely runtime conditions, but a way to safely terminate the execution
of a program in presence of conditions that should never occur at
runtime. They are meant to serve higher-level primitives such as program
assertions.

As such, a program can only install an exception handler once for its
lifetime, and this handler cannot be changed at runtime. The purpose of
the handler is simply to interpret the cookie value supplied by the
bpf_throw call and execute user-defined logic
corresponding to it. The primary purpose of allowing a handler is to
control the return value of the program. The default handler returns 0
from the program when an exception is thrown.

Fixing the handler for the lifetime of the program eliminates tricky and
expensive handling in case of runtime changes of the handler callback
when programs begin to nest, where it becomes more complex to save and
restore the active handler at runtime.

The following kfuncs are introduced:

// Throw a BPF exception, terminating the execution of the program.
//
// @cookie: The cookie that is passed to the exception callback. Without
//          an exception callback set by the user, the program returns
//          0 when an exception is thrown.
void bpf_throw(u64 cookie);

// Set an exception handler globally for the entire program. The handler
// is invoked after the unwinding of the stack is finished. The return
// value of the handler will be the return value of the program. By
// default, without a supplied exception handler, the return value is 0.
//
// Note that this kfunc is *idempotent*, and can only be called once in
// a program. The exception callback is then set permanently for the
// lifetime of the BPF program, and cannot be changed.
//
// @cb: The exception callback, which receives the cookie parameter
//	passed to the bpf_throw call which invoked a BPF exception.
void bpf_set_exception_callback(int (*cb)(u64 cookie));
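
For illustration, a minimal sketch of how a program might combine the
two kfuncs (the section, handler name, and condition below are made up
for this example; the real usage lives in the selftests added later in
the set):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern void bpf_throw(u64 cookie) __ksym;
extern void bpf_set_exception_callback(int (*cb)(u64 cookie)) __ksym;

/* Interprets the cookie passed to bpf_throw() and chooses the
 * program's return value.
 */
static int exc_handler(u64 cookie)
{
	return cookie == 42 ? 1 : 0;
}

SEC("tc")
int skb_prog(struct __sk_buff *ctx)
{
	bpf_set_exception_callback(exc_handler);
	if (ctx->len > 1500)
		bpf_throw(42); /* never returns; all frames are unwound */
	return 0;
}

char _license[] SEC("license") = "GPL";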

This version of offline-unwinding-based BPF exceptions is truly zero
overhead, except for the generation of a default callback which
contains a few instructions to return a default return value (0) when
no exception callback is supplied by the user.

A limitation currently exists where all callee-saved registers have to
be saved on entry into the main BPF subprog. This will be fixed with a
follow-up or in the next revision.

Callbacks are disallowed from throwing BPF exceptions for now, since
such exceptions need to cross the callback helper boundary (and
therefore must care about unwinding kernel state); it is possible to
lift this restriction in a future follow-up.

Exceptions stop propagating at program boundaries, hence both
BPF_PROG_TYPE_EXT and tail call targets return the return value of the
exception callback to their caller context in the event that they throw
an exception. Thus, exceptions do not cross extension or tail call
boundaries.

However, this is mostly an implementation choice, and can be changed to
suit more user-friendly semantics.

PS: Patches 2 and 3 have been sent as [0] but are included to allow CI to
    build the set.

 [0]: https://lore.kernel.org/bpf/20230713003118.1327943-1-memxor@gmail.com

Notes
-----

A couple of questions to consider:

 * Should the default callback simply return the cookie value supplied
   to bpf_throw?

 * Should exceptions cross tail call and extension program boundaries?
   Currently they invoke the exception callback of tail call or
   extension program (if installed) and return to the caller, aborting
   propagation.

 * What should the assertion macro interface look like? It would be
   great to have more feedback from potential users (David?). A rough
   sketch of one possible shape follows below.
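
A rough sketch of one possible assertion macro shape, built directly on
bpf_throw (the names and cookie convention below are placeholders, not
necessarily what the selftests patch ends up with):

/* Illustrative only: assertion macros layered on top of bpf_throw(). */
#define bpf_assert(cond)				\
	do {						\
		if (!(cond))				\
			bpf_throw(0);			\
	} while (0)

#define bpf_assert_with(cond, cookie)			\
	do {						\
		if (!(cond))				\
			bpf_throw(cookie);		\
	} while (0)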

A few notes:

 * I'm working to address the callee-saved register spilling issue on
   entry into the main subprog as a follow-up. I wanted to send the
   current version out first.

 * The resource cleanup patches are based on top of this set, so once
   we converge on the implementation, they can either be appended to
   the set or sent as a follow up (whichever occurs first).

Known issues
------------

 * There is currently a splat when KASAN is enabled, which I believe to
   be a false positive occurring due to bad interplay between KASAN's stack
   memory accounting logic and my unwinding logic. I'm investigating it.

 * Since bpf_throw is marked noreturn, the compiler may sometimes determine
   that a function always throws and emit the final instruction as a call
   to it without emitting an exit in the caller. This leads to an error
   where the verifier complains about the final instruction not being a
   jump, exit, or bpf_throw call (which gets treated as an exit). This is
   unlikely to occur as bpf_throw wouldn't be used whenever the condition
   is already known at compile time, but I could see it when testing with
   always-throwing subprogs and calling into them.

 * A bare asm volatile ("call bpf_throw" :::) does not emit the DATASEC
   .ksyms entry for bpf_throw; there needs to be an explicit call in C for
   clang to emit the DATASEC info in BTF, otherwise compilation fails. A
   sketch of the workaround is shown below.
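
A sketch of the assumed workaround (based on the behaviour described
above, not a claim about clang internals): keep at least one direct C
call to the declared extern so that clang emits the .ksyms DATASEC
entry.

/* Assumes the usual vmlinux.h / bpf_helpers.h includes. */
extern void bpf_throw(u64 cookie) __ksym;

/* A direct C call is enough for clang to emit the .ksyms DATASEC entry
 * for bpf_throw; a bare asm volatile ("call bpf_throw" :::) alone is not.
 */
static __noinline void throw_with_cookie(u64 cookie)
{
	bpf_throw(cookie);
}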

Changelog:
----------
RFC v1 -> v1
RFC v1: https://lore.kernel.org/bpf/20230405004239.1375399-1-memxor@gmail.com

 * Completely rework the unwinding infrastructure to use offline
   unwinding support.
 * Remove the runtime exception state and program rewriting code.
 * Make bpf_set_exception_callback idempotent to avoid vexing
   synchronization and state clobbering issues in presence of program
   nesting.
 * Disable bpf_throw within callback functions, for now.
 * Allow bpf_throw in tail call programs and extension programs,
   removing limitations of rewrite based unwinding.
 * Expand selftests.

Kumar Kartikeya Dwivedi (10):
  bpf: Fix kfunc callback register type handling
  bpf: Fix subprog idx logic in check_max_stack_depth
  bpf: Repeat check_max_stack_depth for async callbacks
  bpf: Add support for inserting new subprogs
  arch/x86: Implement arch_bpf_stack_walk
  bpf: Implement bpf_throw kfunc
  bpf: Ensure IP is within prog->jited_length for bpf_throw calls
  bpf: Introduce bpf_set_exception_callback
  selftests/bpf: Add BPF assertion macros
  selftests/bpf: Add tests for BPF exceptions

 arch/x86/net/bpf_jit_comp.c                   | 105 +++-
 include/linux/bpf.h                           |   6 +
 include/linux/bpf_verifier.h                  |   9 +-
 include/linux/filter.h                        |   8 +
 kernel/bpf/core.c                             |  15 +-
 kernel/bpf/helpers.c                          |  44 ++
 kernel/bpf/syscall.c                          |  19 +-
 kernel/bpf/verifier.c                         | 284 +++++++++--
 .../testing/selftests/bpf/bpf_experimental.h  |  28 ++
 .../selftests/bpf/prog_tests/exceptions.c     | 272 +++++++++++
 .../testing/selftests/bpf/progs/exceptions.c  | 450 ++++++++++++++++++
 .../selftests/bpf/progs/exceptions_ext.c      |  42 ++
 .../selftests/bpf/progs/exceptions_fail.c     | 311 ++++++++++++
 13 files changed, 1537 insertions(+), 56 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/exceptions.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions_ext.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions_fail.c


base-commit: 0a5550b1165cd60ad6972791eda4a3eb7e347280
-- 
2.40.1



* [PATCH bpf-next v1 01/10] bpf: Fix kfunc callback register type handling
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 02/10] bpf: Fix subprog idx logic in check_max_stack_depth Kumar Kartikeya Dwivedi
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Dave Marchevsky, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, David Vernet

The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg
type before using reg->subprogno. This can accidentally permit invalid
pointers to be passed into callback helpers (e.g. silently from
different paths). Likewise, reg->subprogno from the per-register type
union may not be meaningful either. We need to reject any type other
than PTR_TO_FUNC.

Cc: Dave Marchevsky <davemarchevsky@fb.com>
Fixes: 5d92ddc3de1b ("bpf: Add callback validation to kfunc verifier logic")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/verifier.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 81a93eeac7a0..7a00bf69bff8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11023,6 +11023,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			break;
 		}
 		case KF_ARG_PTR_TO_CALLBACK:
+			if (reg->type != PTR_TO_FUNC) {
+				verbose(env, "arg%d expected pointer to func\n", i);
+				return -EINVAL;
+			}
 			meta->subprogno = reg->subprogno;
 			break;
 		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
-- 
2.40.1



* [PATCH bpf-next v1 02/10] bpf: Fix subprog idx logic in check_max_stack_depth
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 01/10] bpf: Fix kfunc callback register type handling Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 03/10] bpf: Repeat check_max_stack_depth for async callbacks Kumar Kartikeya Dwivedi
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

The assignment to idx in check_max_stack_depth happens once we see a
bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of
the code performs a few checks and then pushes the frame to the frame
stack, except in the case of async callbacks. If the async callback case
causes the loop iteration to be skipped, the idx assignment will be
incorrect on the next iteration of the loop, and the value stored in the
frame stack (as the subprogno of the current subprog) will be incorrect.

This leads to incorrect checks and incorrect tail_call_reachable
marking. Save the target subprog in a new variable and only assign to
idx once we are done with the is_async_cb check, which may skip pushing
the frame to the frame stack and the subsequent stack depth checks and
tail call markings.

Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/verifier.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7a00bf69bff8..66fee45d313d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5622,7 +5622,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 continue_func:
 	subprog_end = subprog[idx + 1].start;
 	for (; i < subprog_end; i++) {
-		int next_insn;
+		int next_insn, sidx;
 
 		if (!bpf_pseudo_call(insn + i) && !bpf_pseudo_func(insn + i))
 			continue;
@@ -5632,14 +5632,14 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 
 		/* find the callee */
 		next_insn = i + insn[i].imm + 1;
-		idx = find_subprog(env, next_insn);
-		if (idx < 0) {
+		sidx = find_subprog(env, next_insn);
+		if (sidx < 0) {
 			WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
 				  next_insn);
 			return -EFAULT;
 		}
-		if (subprog[idx].is_async_cb) {
-			if (subprog[idx].has_tail_call) {
+		if (subprog[sidx].is_async_cb) {
+			if (subprog[sidx].has_tail_call) {
 				verbose(env, "verifier bug. subprog has tail_call and async cb\n");
 				return -EFAULT;
 			}
@@ -5647,6 +5647,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 			continue;
 		}
 		i = next_insn;
+		idx = sidx;
 
 		if (subprog[idx].has_tail_call)
 			tail_call_reachable = true;
-- 
2.40.1



* [PATCH bpf-next v1 03/10] bpf: Repeat check_max_stack_depth for async callbacks
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 01/10] bpf: Fix kfunc callback register type handling Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 02/10] bpf: Fix subprog idx logic in check_max_stack_depth Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs Kumar Kartikeya Dwivedi
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

While the check_max_stack_depth function explores call chains emanating
from the main prog, which is typically enough to cover all possible call
chains, it doesn't explore those rooted at async callbacks unless the
async callback is also directly called. Unlike non-async callbacks,
their instructions are skipped during exploration, since they don't
contribute to the caller's stack depth.

It could be the case that the async callback leads to a call chain which
exceeds the allowed stack depth, but this is never detected while only
exploring the entry point from the main subprog. Hence, repeat the check
for the main subprog *and* all async callbacks marked by the symbolic
execution pass of the verifier, as execution of the program may begin at
any of them.

Consider a program with the following stack depths:
main : 256
async : 256
foo: 256

main:
	rX = async
	bpf_timer_set_callback(...)

async:
	foo()

Here, async is never seen to contribute to the stack depth of main, so
it is not descended into, but when async is invoked asynchronously as
the timer fires, it will end up breaching the MAX_BPF_STACK limit
imposed by the verifier.
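
Sketched as a (hypothetical) BPF program, where only the shape of the
call chain matters; the map, section, names, and buffer sizes are made
up, chosen so that the async -> foo chain alone exceeds the 512 byte
limit:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct elem {
	struct bpf_timer t;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct elem);
} timer_map SEC(".maps");

static __noinline int foo(void)
{
	volatile char buf[300] = {};	/* ~300 bytes of stack */

	buf[0] = 1;
	return 0;
}

/* Timer callback: never explored as part of main's call chain, but its
 * own chain (async -> foo) needs ~600 bytes once the timer fires.
 */
static int async(void *map, int *key, struct bpf_timer *timer)
{
	volatile char buf[300] = {};

	buf[0] = foo();
	return 0;
}

SEC("tc")
int main_prog(void *ctx)
{
	volatile char buf[256] = {};
	struct elem *e;
	int key = 0;

	e = bpf_map_lookup_elem(&timer_map, &key);
	if (!e)
		return 0;
	bpf_timer_init(&e->t, &timer_map, 1 /* CLOCK_MONOTONIC */);
	bpf_timer_set_callback(&e->t, async);
	buf[0] = 1;
	return 0;
}

char _license[] SEC("license") = "GPL";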

Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 kernel/bpf/verifier.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 66fee45d313d..5d432df5df06 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5574,16 +5574,17 @@ static int update_stack_depth(struct bpf_verifier_env *env,
  * Since recursion is prevented by check_cfg() this algorithm
  * only needs a local stack of MAX_CALL_FRAMES to remember callsites
  */
-static int check_max_stack_depth(struct bpf_verifier_env *env)
+static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 {
-	int depth = 0, frame = 0, idx = 0, i = 0, subprog_end;
 	struct bpf_subprog_info *subprog = env->subprog_info;
 	struct bpf_insn *insn = env->prog->insnsi;
+	int depth = 0, frame = 0, i, subprog_end;
 	bool tail_call_reachable = false;
 	int ret_insn[MAX_CALL_FRAMES];
 	int ret_prog[MAX_CALL_FRAMES];
 	int j;
 
+	i = subprog[idx].start;
 process_func:
 	/* protect against potential stack overflow that might happen when
 	 * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
@@ -5683,6 +5684,22 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	goto continue_func;
 }
 
+static int check_max_stack_depth(struct bpf_verifier_env *env)
+{
+	struct bpf_subprog_info *si = env->subprog_info;
+	int ret;
+
+	for (int i = 0; i < env->subprog_cnt; i++) {
+		if (!i || si[i].is_async_cb) {
+			ret = check_max_stack_depth_subprog(env, i);
+			if (ret < 0)
+				return ret;
+		}
+		continue;
+	}
+	return 0;
+}
+
 #ifndef CONFIG_BPF_JIT_ALWAYS_ON
 static int get_callee_stack_depth(struct bpf_verifier_env *env,
 				  const struct bpf_insn *insn, int idx)
-- 
2.40.1



* [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (2 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 03/10] bpf: Repeat check_max_stack_depth for async callbacks Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-14 21:58   ` Alexei Starovoitov
  2023-07-13  2:32 ` [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk Kumar Kartikeya Dwivedi
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

Introduce support in the verifier for generating a subprogram and
include it as part of a BPF program dynamically after the do_check
phase is complete. The appropriate place of invocation would be
do_misc_fixups.

Since they are always appended to the end of the instruction sequence of
the program, it becomes relatively inexpensive to do the related
adjustments to the subprog_info of the program. Only the fake exit
subprogram is shifted forward by 1, making room for our invented subprog.

This is useful to insert a new subprogram and obtain its function
pointer. The next patch will use this functionality to insert a default
exception callback which will be invoked after unwinding the stack.
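
For reference, a sketch of the intended call pattern (the wrapper name
here is made up; the real call site appears in the bpf_throw patch):

/* Sketch only: append a tiny "r0 = 0; exit" subprog after the last
 * instruction of the program, e.g. from do_misc_fixups().
 */
static int add_default_exception_cb(struct bpf_verifier_env *env)
{
	struct bpf_insn patch[] = {
		env->prog->insnsi[env->prog->len - 1], /* must repeat the last insn */
		BPF_MOV64_IMM(BPF_REG_0, 0),
		BPF_EXIT_INSN(),
	};

	return invent_subprog(env, patch, ARRAY_SIZE(patch));
}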

Note that these invented subprograms are invisible to userspace, and
never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
invented program is supported, but more can be easily supported in the
future.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h          |  1 +
 include/linux/bpf_verifier.h |  4 +++-
 kernel/bpf/core.c            |  4 ++--
 kernel/bpf/syscall.c         | 19 ++++++++++++++++++-
 kernel/bpf/verifier.c        | 29 ++++++++++++++++++++++++++++-
 5 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 360433f14496..70f212dddfbf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1385,6 +1385,7 @@ struct bpf_prog_aux {
 	bool sleepable;
 	bool tail_call_reachable;
 	bool xdp_has_frags;
+	bool invented_prog;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
 	/* function name for valid attach_btf_id */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index f70f9ac884d2..360aa304ec09 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -540,6 +540,7 @@ struct bpf_subprog_info {
 	bool has_tail_call;
 	bool tail_call_reachable;
 	bool has_ld_abs;
+	bool invented_prog;
 	bool is_async_cb;
 };
 
@@ -594,10 +595,11 @@ struct bpf_verifier_env {
 	bool bypass_spec_v1;
 	bool bypass_spec_v4;
 	bool seen_direct_write;
+	bool invented_prog;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	const struct bpf_line_info *prev_linfo;
 	struct bpf_verifier_log log;
-	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
+	struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 2]; /* max + 2 for the fake and exception subprogs */
 	union {
 		struct bpf_idmap idmap_scratch;
 		struct bpf_idset idset_scratch;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index dc85240a0134..5c484b2bc3d6 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -211,7 +211,7 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
 	const struct bpf_line_info *linfo;
 	void **jited_linfo;
 
-	if (!prog->aux->jited_linfo)
+	if (!prog->aux->jited_linfo || prog->aux->invented_prog)
 		/* Userspace did not provide linfo */
 		return;
 
@@ -579,7 +579,7 @@ bpf_prog_ksym_set_name(struct bpf_prog *prog)
 	sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
 
 	/* prog->aux->name will be ignored if full btf name is available */
-	if (prog->aux->func_info_cnt) {
+	if (prog->aux->func_info_cnt && !prog->aux->invented_prog) {
 		type = btf_type_by_id(prog->aux->btf,
 				      prog->aux->func_info[prog->aux->func_idx].type_id);
 		func_name = btf_name_by_offset(prog->aux->btf, type->name_off);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ee8cb1a174aa..769550287bed 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4291,8 +4291,11 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 		u32 i;
 
 		info.jited_prog_len = 0;
-		for (i = 0; i < prog->aux->func_cnt; i++)
+		for (i = 0; i < prog->aux->func_cnt; i++) {
+			if (prog->aux->func[i]->aux->invented_prog)
+				break;
 			info.jited_prog_len += prog->aux->func[i]->jited_len;
+		}
 	} else {
 		info.jited_prog_len = prog->jited_len;
 	}
@@ -4311,6 +4314,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 
 				free = ulen;
 				for (i = 0; i < prog->aux->func_cnt; i++) {
+					if (prog->aux->func[i]->aux->invented_prog)
+						break;
 					len = prog->aux->func[i]->jited_len;
 					len = min_t(u32, len, free);
 					img = (u8 *) prog->aux->func[i]->bpf_func;
@@ -4332,6 +4337,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 
 	ulen = info.nr_jited_ksyms;
 	info.nr_jited_ksyms = prog->aux->func_cnt ? : 1;
+	if (prog->aux->func_cnt && prog->aux->func[prog->aux->func_cnt - 1]->aux->invented_prog)
+		info.nr_jited_ksyms--;
 	if (ulen) {
 		if (bpf_dump_raw_ok(file->f_cred)) {
 			unsigned long ksym_addr;
@@ -4345,6 +4352,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 			user_ksyms = u64_to_user_ptr(info.jited_ksyms);
 			if (prog->aux->func_cnt) {
 				for (i = 0; i < ulen; i++) {
+					if (prog->aux->func[i]->aux->invented_prog)
+						break;
 					ksym_addr = (unsigned long)
 						prog->aux->func[i]->bpf_func;
 					if (put_user((u64) ksym_addr,
@@ -4363,6 +4372,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 
 	ulen = info.nr_jited_func_lens;
 	info.nr_jited_func_lens = prog->aux->func_cnt ? : 1;
+	if (prog->aux->func_cnt && prog->aux->func[prog->aux->func_cnt - 1]->aux->invented_prog)
+		info.nr_jited_func_lens--;
 	if (ulen) {
 		if (bpf_dump_raw_ok(file->f_cred)) {
 			u32 __user *user_lens;
@@ -4373,6 +4384,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 			user_lens = u64_to_user_ptr(info.jited_func_lens);
 			if (prog->aux->func_cnt) {
 				for (i = 0; i < ulen; i++) {
+					if (prog->aux->func[i]->aux->invented_prog)
+						break;
 					func_len =
 						prog->aux->func[i]->jited_len;
 					if (put_user(func_len, &user_lens[i]))
@@ -4443,6 +4456,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 
 	ulen = info.nr_prog_tags;
 	info.nr_prog_tags = prog->aux->func_cnt ? : 1;
+	if (prog->aux->func_cnt && prog->aux->func[prog->aux->func_cnt - 1]->aux->invented_prog)
+		info.nr_prog_tags--;
 	if (ulen) {
 		__u8 __user (*user_prog_tags)[BPF_TAG_SIZE];
 		u32 i;
@@ -4451,6 +4466,8 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 		ulen = min_t(u32, info.nr_prog_tags, ulen);
 		if (prog->aux->func_cnt) {
 			for (i = 0; i < ulen; i++) {
+				if (prog->aux->func[i]->aux->invented_prog)
+					break;
 				if (copy_to_user(user_prog_tags[i],
 						 prog->aux->func[i]->tag,
 						 BPF_TAG_SIZE))
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5d432df5df06..8ce72a7b4758 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14891,8 +14891,11 @@ static void adjust_btf_func(struct bpf_verifier_env *env)
 	if (!aux->func_info)
 		return;
 
-	for (i = 0; i < env->subprog_cnt; i++)
+	for (i = 0; i < env->subprog_cnt; i++) {
+		if (env->subprog_info[i].invented_prog)
+			break;
 		aux->func_info[i].insn_off = env->subprog_info[i].start;
+	}
 }
 
 #define MIN_BPF_LINEINFO_SIZE	offsetofend(struct bpf_line_info, line_col)
@@ -17778,6 +17781,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		}
 		func[i]->aux->num_exentries = num_exentries;
 		func[i]->aux->tail_call_reachable = env->subprog_info[i].tail_call_reachable;
+		func[i]->aux->invented_prog = env->subprog_info[i].invented_prog;
 		func[i] = bpf_int_jit_compile(func[i]);
 		if (!func[i]->jited) {
 			err = -ENOTSUPP;
@@ -18071,6 +18075,29 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	return 0;
 }
 
+/* The function requires that first instruction in 'patch' is insnsi[prog->len - 1] */
+static int invent_subprog(struct bpf_verifier_env *env, struct bpf_insn *patch, int len)
+{
+	struct bpf_subprog_info *info = env->subprog_info;
+	int cnt = env->subprog_cnt;
+	struct bpf_prog *prog;
+
+	if (env->invented_prog) {
+		verbose(env, "verifier internal error: only one invented prog supported\n");
+		return -EFAULT;
+	}
+	prog = bpf_patch_insn_data(env, env->prog->len - 1, patch, len);
+	if (!prog)
+		return -ENOMEM;
+	env->prog = prog;
+	info[cnt + 1].start = info[cnt].start;
+	info[cnt].start = prog->len - len + 1;
+	info[cnt].invented_prog = true;
+	env->subprog_cnt++;
+	env->invented_prog = true;
+	return 0;
+}
+
 /* Do various post-verification rewrites in a single program pass.
  * These rewrites simplify JIT and interpreter implementations.
  */
-- 
2.40.1



* [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (3 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-14 22:05   ` Alexei Starovoitov
  2023-07-13  2:32 ` [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc Kumar Kartikeya Dwivedi
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

The plumbing for offline unwinding when we throw an exception in
programs would require walking the stack, hence introduce a new
arch_bpf_stack_walk function. This is provided when the JIT supports
exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
code is really minimal, hence it should be straightforward to extend
this support to other architectures as well, as it reuses the logic of
arch_stack_walk but allows access to the unwind_state data.
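
As a rough illustration of the calling convention (the consumer below
is made up; the real consumer is added in the next patch as part of
bpf_throw):

/* Illustrative consumer only: walk the current stack, print each BPF
 * frame, and stop at the first non-BPF frame.
 */
static bool dump_bpf_frame(void *cookie, u64 ip, u64 sp, u64 bp)
{
	if (!is_bpf_text_address(ip))
		return false;	/* stop the walk */
	pr_info("bpf frame: ip=%llx sp=%llx bp=%llx\n", ip, sp, bp);
	return true;		/* continue to the next frame */
}

static void dump_bpf_frames(void)
{
	arch_bpf_stack_walk(dump_bpf_frame, NULL);
}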

Once the stack pointer and frame pointer are known for the main subprog
during the unwinding, we know the stack layout and location of any
callee-saved registers which must be restored before we return to the
kernel.

This handling will be added in the next patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 21 +++++++++++++++++++++
 include/linux/filter.h      |  2 ++
 kernel/bpf/core.c           |  9 +++++++++
 3 files changed, 32 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 438adb695daa..d326503ce242 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -16,6 +16,7 @@
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/text-patching.h>
+#include <asm/unwind.h>
 
 static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
 {
@@ -2660,3 +2661,23 @@ void bpf_jit_free(struct bpf_prog *prog)
 
 	bpf_prog_unlock_free(prog);
 }
+
+bool bpf_jit_supports_exceptions(void)
+{
+	return IS_ENABLED(CONFIG_UNWINDER_ORC) || IS_ENABLED(CONFIG_UNWINDER_FRAME_POINTER);
+}
+
+void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
+{
+#if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
+	struct unwind_state state;
+	unsigned long addr;
+
+	for (unwind_start(&state, current, NULL, NULL); !unwind_done(&state);
+	     unwind_next_frame(&state)) {
+		addr = unwind_get_return_address(&state);
+		if (!addr || !consume_fn(cookie, (u64)addr, (u64)state.sp, (u64)state.bp))
+			break;
+	}
+#endif
+}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f69114083ec7..21ac801330bb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -920,6 +920,8 @@ bool bpf_jit_needs_zext(void);
 bool bpf_jit_supports_subprog_tailcalls(void);
 bool bpf_jit_supports_kfunc_call(void);
 bool bpf_jit_supports_far_kfunc_call(void);
+bool bpf_jit_supports_exceptions(void);
+void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
 bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(const struct cred *cred)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5c484b2bc3d6..5e224cf0ec27 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2770,6 +2770,15 @@ int __weak bpf_arch_text_invalidate(void *dst, size_t len)
 	return -ENOTSUPP;
 }
 
+bool __weak bpf_jit_supports_exceptions(void)
+{
+	return false;
+}
+
+void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
+{
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 static int __init bpf_global_ma_init(void)
 {
-- 
2.40.1



* [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (4 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-14 22:34   ` Alexei Starovoitov
  2023-07-14 22:51   ` Alexei Starovoitov
  2023-07-13  2:32 ` [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls Kumar Kartikeya Dwivedi
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

This patch implements BPF exceptions, and introduces a bpf_throw kfunc
to allow programs to throw exceptions during their execution at runtime.
A bpf_throw invocation is treated as an immediate termination of the
program, returning to its caller within the kernel and unwinding all
stack frames.

This allows a program to simplify its implementation by testing for
runtime conditions which the verifier has no visibility into, and
asserting that they are true. In case they are not, the program can
simply throw an exception from the other branch.
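
As a small (hypothetical) example of the intended pattern, where
userspace guarantees an invariant the verifier cannot see on its own:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern void bpf_throw(u64 cookie) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, int);
} idx_map SEC(".maps");

int vals[16];

SEC("tc")
int use_idx(struct __sk_buff *ctx)
{
	int key = 0, idx, *p;

	p = bpf_map_lookup_elem(&idx_map, &key);
	if (!p)
		return 0;
	idx = *p;
	/* Userspace guarantees 0 <= *p < 16, but the verifier cannot
	 * know that; assert it instead of writing a fallback path.
	 */
	if (idx < 0 || idx >= 16)
		bpf_throw(0);
	return vals[idx];
}

char _license[] SEC("license") = "GPL";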

BPF exceptions are explicitly *NOT* an unlikely slowpath error handling
primitive, and this objective has guided the design choices of their
implementation within the kernel (with the bulk of the cost for
unwinding the stack offloaded to the bpf_throw kfunc).

The implementation of this mechanism requires use of the invent_subprog
mechanism introduced in the previous patch, which generates a couple of
instructions to zero R0 and exit. The JIT then rewrites the prologue of
this subprog to take the stack pointer and frame pointer as inputs and
reset the stack frame, popping all callee-saved registers saved by the
main subprog. The bpf_throw function then walks the stack at runtime,
and invokes this exception subprog with the stack and frame pointers as
parameters.

Reviewers must take note that currently the main program is made to save
all callee-saved registers on x86_64 during entry into the program. This
is because we must do an equivalent of a lightweight context switch when
unwinding the stack, therefore we need the callee-saved registers of the
caller of the BPF program to be able to return with a sane state.

Note that we have to additionally handle r12, even though it is not used
by the program, because when throwing the exception the program makes an
entry into the kernel which could clobber r12 after saving it on the
stack. To be able to preserve the value we received on program entry, we
push r12 and restore it from the generated subprogram when unwinding the
stack.

All of this can however be addressed by recording which callee-saved
registers are saved for each program, and then restoring them from the
corresponding stack frames (mapped to each program) when unwinding. This
would not require pushing all callee-saved registers on entry into a BPF
program. However, this optimization is deferred to a future patch to
keep the number of moving pieces within this set manageable.

For now, bpf_throw invocation fails when lingering resources or locks
exist in that path of the program. In a future followup, bpf_throw will
be extended to perform frame-by-frame unwinding to release lingering
resources for each stack frame, removing this limitation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c                   |  73 +++++++----
 include/linux/bpf.h                           |   3 +
 include/linux/bpf_verifier.h                  |   4 +
 include/linux/filter.h                        |   6 +
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/helpers.c                          |  38 ++++++
 kernel/bpf/verifier.c                         | 124 ++++++++++++++++--
 .../testing/selftests/bpf/bpf_experimental.h  |   6 +
 8 files changed, 219 insertions(+), 37 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d326503ce242..8d97c6a60f9a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -256,32 +256,36 @@ struct jit_context {
 /* Number of bytes that will be skipped on tailcall */
 #define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
 
-static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
+static void push_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
 {
 	u8 *prog = *pprog;
 
-	if (callee_regs_used[0])
+	if (callee_regs_used[0] || force)
 		EMIT1(0x53);         /* push rbx */
-	if (callee_regs_used[1])
+	if (force)
+		EMIT2(0x41, 0x54);   /* push r12 */
+	if (callee_regs_used[1] || force)
 		EMIT2(0x41, 0x55);   /* push r13 */
-	if (callee_regs_used[2])
+	if (callee_regs_used[2] || force)
 		EMIT2(0x41, 0x56);   /* push r14 */
-	if (callee_regs_used[3])
+	if (callee_regs_used[3] || force)
 		EMIT2(0x41, 0x57);   /* push r15 */
 	*pprog = prog;
 }
 
-static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
+static void pop_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
 {
 	u8 *prog = *pprog;
 
-	if (callee_regs_used[3])
+	if (callee_regs_used[3] || force)
 		EMIT2(0x41, 0x5F);   /* pop r15 */
-	if (callee_regs_used[2])
+	if (callee_regs_used[2] || force)
 		EMIT2(0x41, 0x5E);   /* pop r14 */
-	if (callee_regs_used[1])
+	if (callee_regs_used[1] || force)
 		EMIT2(0x41, 0x5D);   /* pop r13 */
-	if (callee_regs_used[0])
+	if (force)
+		EMIT2(0x41, 0x5C);   /* pop r12 */
+	if (callee_regs_used[0] || force)
 		EMIT1(0x5B);         /* pop rbx */
 	*pprog = prog;
 }
@@ -292,7 +296,8 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
  * while jumping to another program
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
-			  bool tail_call_reachable, bool is_subprog)
+			  bool tail_call_reachable, bool is_subprog,
+			  bool is_exception_cb)
 {
 	u8 *prog = *pprog;
 
@@ -308,8 +313,23 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
 		else
 			EMIT2(0x66, 0x90); /* nop2 */
 	}
-	EMIT1(0x55);             /* push rbp */
-	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
+	/* Exception callback receives FP as second parameter */
+	if (is_exception_cb) {
+		bool regs_used[4] = {};
+
+		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
+		EMIT3(0x48, 0x89, 0xD5); /* mov rbp, rdx */
+		/* The main frame must have seen_exception as true, so we first
+		 * restore those callee-saved regs from stack, before reusing
+		 * the stack frame.
+		 */
+		pop_callee_regs(&prog, regs_used, true);
+		/* Reset the stack frame. */
+		EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
+	} else {
+		EMIT1(0x55);             /* push rbp */
+		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
+	}
 
 	/* X86_TAIL_CALL_OFFSET is here */
 	EMIT_ENDBR();
@@ -468,10 +488,12 @@ static void emit_return(u8 **pprog, u8 *ip)
  *   goto *(prog->bpf_func + prologue_size);
  * out:
  */
-static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
+static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
+					u8 **pprog, bool *callee_regs_used,
 					u32 stack_depth, u8 *ip,
 					struct jit_context *ctx)
 {
+	bool force_pop_all = bpf_prog->aux->seen_exception;
 	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
@@ -518,7 +540,7 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
 	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
 	EMIT2(X86_JE, offset);                    /* je out */
 
-	pop_callee_regs(&prog, callee_regs_used);
+	pop_callee_regs(&prog, callee_regs_used, force_pop_all);
 
 	EMIT1(0x58);                              /* pop rax */
 	if (stack_depth)
@@ -542,11 +564,13 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
 	*pprog = prog;
 }
 
-static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
+static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
+				      struct bpf_jit_poke_descriptor *poke,
 				      u8 **pprog, u8 *ip,
 				      bool *callee_regs_used, u32 stack_depth,
 				      struct jit_context *ctx)
 {
+	bool force_pop_all = bpf_prog->aux->seen_exception;
 	int tcc_off = -4 - round_up(stack_depth, 8);
 	u8 *prog = *pprog, *start = *pprog;
 	int offset;
@@ -571,7 +595,7 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
 	emit_jump(&prog, (u8 *)poke->tailcall_target + X86_PATCH_SIZE,
 		  poke->tailcall_bypass);
 
-	pop_callee_regs(&prog, callee_regs_used);
+	pop_callee_regs(&prog, callee_regs_used, force_pop_all);
 	EMIT1(0x58);                                  /* pop rax */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
@@ -987,8 +1011,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
-		      bpf_prog->aux->func_idx != 0);
-	push_callee_regs(&prog, callee_regs_used);
+		      bpf_prog->aux->func_idx != 0, bpf_prog->aux->exception_cb);
+	/* Exception callback will clobber callee regs for its own use, and
+	 * restore the original callee regs from main prog's stack frame.
+	 */
+	push_callee_regs(&prog, callee_regs_used, bpf_prog->aux->seen_exception);
 
 	ilen = prog - temp;
 	if (rw_image)
@@ -1557,13 +1584,15 @@ st:			if (is_imm8(insn->off))
 
 		case BPF_JMP | BPF_TAIL_CALL:
 			if (imm32)
-				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
+				emit_bpf_tail_call_direct(bpf_prog,
+							  &bpf_prog->aux->poke_tab[imm32 - 1],
 							  &prog, image + addrs[i - 1],
 							  callee_regs_used,
 							  bpf_prog->aux->stack_depth,
 							  ctx);
 			else
-				emit_bpf_tail_call_indirect(&prog,
+				emit_bpf_tail_call_indirect(bpf_prog,
+							    &prog,
 							    callee_regs_used,
 							    bpf_prog->aux->stack_depth,
 							    image + addrs[i - 1],
@@ -1808,7 +1837,7 @@ st:			if (is_imm8(insn->off))
 			seen_exit = true;
 			/* Update cleanup_addr */
 			ctx->cleanup_addr = proglen;
-			pop_callee_regs(&prog, callee_regs_used);
+			pop_callee_regs(&prog, callee_regs_used, bpf_prog->aux->seen_exception);
 			EMIT1(0xC9);         /* leave */
 			emit_return(&prog, image + addrs[i - 1] + (prog - temp));
 			break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 70f212dddfbf..61cdb291311f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1386,6 +1386,8 @@ struct bpf_prog_aux {
 	bool tail_call_reachable;
 	bool xdp_has_frags;
 	bool invented_prog;
+	bool exception_cb;
+	bool seen_exception;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
 	/* function name for valid attach_btf_id */
@@ -1408,6 +1410,7 @@ struct bpf_prog_aux {
 	int cgroup_atype; /* enum cgroup_bpf_attach_type */
 	struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
 	char name[BPF_OBJ_NAME_LEN];
+	unsigned int (*bpf_exception_cb)(u64 cookie, u64 sp, u64 bp);
 #ifdef CONFIG_SECURITY
 	void *security;
 #endif
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 360aa304ec09..e28386fa462f 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -541,7 +541,9 @@ struct bpf_subprog_info {
 	bool tail_call_reachable;
 	bool has_ld_abs;
 	bool invented_prog;
+	bool is_cb;
 	bool is_async_cb;
+	bool is_exception_cb;
 };
 
 struct bpf_verifier_env;
@@ -588,6 +590,7 @@ struct bpf_verifier_env {
 	u32 used_map_cnt;		/* number of used maps */
 	u32 used_btf_cnt;		/* number of used BTF objects */
 	u32 id_gen;			/* used to generate unique reg IDs */
+	int exception_callback_subprog;
 	bool explore_alu_limits;
 	bool allow_ptr_leaks;
 	bool allow_uninit_stack;
@@ -596,6 +599,7 @@ struct bpf_verifier_env {
 	bool bypass_spec_v4;
 	bool seen_direct_write;
 	bool invented_prog;
+	bool seen_exception;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	const struct bpf_line_info *prev_linfo;
 	struct bpf_verifier_log log;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 21ac801330bb..f45a54f8dd7d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1137,6 +1137,7 @@ const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
 bool is_bpf_text_address(unsigned long addr);
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 		    char *sym);
+struct bpf_prog *bpf_prog_ksym_find(unsigned long addr);
 
 static inline const char *
 bpf_address_lookup(unsigned long addr, unsigned long *size,
@@ -1204,6 +1205,11 @@ static inline int bpf_get_kallsym(unsigned int symnum, unsigned long *value,
 	return -ERANGE;
 }
 
+static inline struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
+{
+	return NULL;
+}
+
 static inline const char *
 bpf_address_lookup(unsigned long addr, unsigned long *size,
 		   unsigned long *off, char **modname, char *sym)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5e224cf0ec27..efbc2f965226 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -723,7 +723,7 @@ bool is_bpf_text_address(unsigned long addr)
 	return ret;
 }
 
-static struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
+struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
 {
 	struct bpf_ksym *ksym = bpf_ksym_find(addr);
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9e80efa59a5d..da1493a1da25 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2402,6 +2402,43 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+struct bpf_throw_ctx {
+	struct bpf_prog_aux *aux;
+	u64 sp;
+	u64 bp;
+	int cnt;
+};
+
+static bool bpf_stack_walker(void *cookie, u64 ip, u64 sp, u64 bp)
+{
+	struct bpf_throw_ctx *ctx = cookie;
+	struct bpf_prog *prog;
+
+	if (!is_bpf_text_address(ip))
+		return !ctx->cnt;
+	prog = bpf_prog_ksym_find(ip);
+	ctx->cnt++;
+	if (!prog->aux->id)
+		return true;
+	ctx->aux = prog->aux;
+	ctx->sp = sp;
+	ctx->bp = bp;
+	return false;
+}
+
+__bpf_kfunc void bpf_throw(u64 cookie)
+{
+	struct bpf_throw_ctx ctx = {};
+
+	arch_bpf_stack_walk(bpf_stack_walker, &ctx);
+	WARN_ON_ONCE(!ctx.aux);
+	if (ctx.aux)
+		WARN_ON_ONCE(!ctx.aux->seen_exception);
+	WARN_ON_ONCE(!ctx.bp);
+	WARN_ON_ONCE(!ctx.cnt);
+	ctx.aux->bpf_exception_cb(cookie, ctx.sp, ctx.bp);
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2429,6 +2466,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_throw)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8ce72a7b4758..61101a87d96b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -542,6 +542,8 @@ static bool is_dynptr_ref_function(enum bpf_func_id func_id)
 }
 
 static bool is_callback_calling_kfunc(u32 btf_id);
+static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id);
+static bool is_bpf_throw_kfunc(struct bpf_insn *insn);
 
 static bool is_callback_calling_function(enum bpf_func_id func_id)
 {
@@ -2864,11 +2866,12 @@ static int check_subprogs(struct bpf_verifier_env *env)
 		if (i == subprog_end - 1) {
 			/* to avoid fall-through from one subprog into another
 			 * the last insn of the subprog should be either exit
-			 * or unconditional jump back
+			 * or unconditional jump back or bpf_throw call
 			 */
 			if (code != (BPF_JMP | BPF_EXIT) &&
-			    code != (BPF_JMP | BPF_JA)) {
-				verbose(env, "last insn is not an exit or jmp\n");
+			    code != (BPF_JMP | BPF_JA) &&
+			    !is_bpf_throw_kfunc(insn + i)) {
+				verbose(env, "last insn is not an exit or jmp or bpf_throw call\n");
 				return -EINVAL;
 			}
 			subprog_start = subprog_end;
@@ -5625,6 +5628,25 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 	for (; i < subprog_end; i++) {
 		int next_insn, sidx;
 
+		if (bpf_pseudo_kfunc_call(insn + i) && !insn[i].off) {
+			bool err = false;
+
+			if (!is_forbidden_exception_kfunc_in_cb(insn[i].imm))
+				continue;
+			if (subprog[idx].is_cb)
+				err = true;
+			for (int c = 0; c < frame && !err; c++) {
+				if (subprog[ret_prog[c]].is_cb) {
+					err = true;
+					break;
+				}
+			}
+			if (!err)
+				continue;
+			verbose(env, "exception kfunc insn %d cannot be called from callback\n", i);
+			return -EINVAL;
+		}
+
 		if (!bpf_pseudo_call(insn + i) && !bpf_pseudo_func(insn + i))
 			continue;
 		/* remember insn and function to return to */
@@ -8734,6 +8756,7 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 	 * callbacks
 	 */
 	if (set_callee_state_cb != set_callee_state) {
+		env->subprog_info[subprog].is_cb = true;
 		if (bpf_pseudo_kfunc_call(insn) &&
 		    !is_callback_calling_kfunc(insn->imm)) {
 			verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
@@ -9247,17 +9270,17 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	return 0;
 }
 
-static int check_reference_leak(struct bpf_verifier_env *env)
+static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
 {
 	struct bpf_func_state *state = cur_func(env);
 	bool refs_lingering = false;
 	int i;
 
-	if (state->frameno && !state->in_callback_fn)
+	if (!exception_exit && state->frameno && !state->in_callback_fn)
 		return 0;
 
 	for (i = 0; i < state->acquired_refs; i++) {
-		if (state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
+		if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
 			continue;
 		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
 			state->refs[i].id, state->refs[i].insn_idx);
@@ -9491,7 +9514,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 
 	switch (func_id) {
 	case BPF_FUNC_tail_call:
-		err = check_reference_leak(env);
+		err = check_reference_leak(env, false);
 		if (err) {
 			verbose(env, "tail_call would lead to reference leak\n");
 			return err;
@@ -10109,6 +10132,7 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
+	KF_bpf_throw,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -10129,6 +10153,7 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_throw)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -10151,6 +10176,7 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_throw)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10464,6 +10490,17 @@ static bool is_callback_calling_kfunc(u32 btf_id)
 	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
 }
 
+static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
+{
+	return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
+	       insn->imm == special_kfunc_list[KF_bpf_throw];
+}
+
+static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id)
+{
+	return btf_id == special_kfunc_list[KF_bpf_throw];
+}
+
 static bool is_rbtree_lock_required_kfunc(u32 btf_id)
 {
 	return is_bpf_rbtree_api_kfunc(btf_id);
@@ -11140,6 +11177,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	const struct btf_param *args;
 	const struct btf_type *ret_t;
 	struct btf *desc_btf;
+	bool throw = false;
 
 	/* skip for now, but return error when we find this in fixup_kfunc_call */
 	if (!insn->imm)
@@ -11242,6 +11280,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
+	if (meta.func_id == special_kfunc_list[KF_bpf_throw]) {
+		if (!bpf_jit_supports_exceptions()) {
+			verbose(env, "JIT does not support calling kfunc %s#%d\n",
+				func_name, meta.func_id);
+			return -EINVAL;
+		}
+		env->seen_exception = true;
+		throw = true;
+	}
+
 	for (i = 0; i < CALLER_SAVED_REGS; i++)
 		mark_reg_not_init(env, regs, caller_saved[i]);
 
@@ -11463,7 +11511,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			return err;
 	}
 
-	return 0;
+	return throw ? 1 : 0;
 }
 
 static bool signed_add_overflows(s64 a, s64 b)
@@ -14211,7 +14259,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	 * gen_ld_abs() may terminate the program at runtime, leading to
 	 * reference leak.
 	 */
-	err = check_reference_leak(env);
+	err = check_reference_leak(env, false);
 	if (err) {
 		verbose(env, "BPF_LD_[ABS|IND] cannot be mixed with socket references\n");
 		return err;
@@ -14619,6 +14667,9 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
 		if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
 			struct bpf_kfunc_call_arg_meta meta;
 
+			/* No fallthrough edge to walk, same as BPF_EXIT */
+			if (is_bpf_throw_kfunc(insn))
+				return DONE_EXPLORING;
 			ret = fetch_kfunc_meta(env, insn, &meta, NULL);
 			if (ret == 0 && is_iter_next_kfunc(&meta)) {
 				mark_prune_point(env, t);
@@ -16222,6 +16273,7 @@ static int do_check(struct bpf_verifier_env *env)
 	int prev_insn_idx = -1;
 
 	for (;;) {
+		bool exception_exit = false;
 		struct bpf_insn *insn;
 		u8 class;
 		int err;
@@ -16435,12 +16487,18 @@ static int do_check(struct bpf_verifier_env *env)
 						return -EINVAL;
 					}
 				}
-				if (insn->src_reg == BPF_PSEUDO_CALL)
+				if (insn->src_reg == BPF_PSEUDO_CALL) {
 					err = check_func_call(env, insn, &env->insn_idx);
-				else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL)
+				} else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
 					err = check_kfunc_call(env, insn, &env->insn_idx);
-				else
+					if (err == 1) {
+						err = 0;
+						exception_exit = true;
+						goto process_bpf_exit_full;
+					}
+				} else {
 					err = check_helper_call(env, insn, &env->insn_idx);
+				}
 				if (err)
 					return err;
 
@@ -16467,7 +16525,7 @@ static int do_check(struct bpf_verifier_env *env)
 					verbose(env, "BPF_EXIT uses reserved fields\n");
 					return -EINVAL;
 				}
-
+process_bpf_exit_full:
 				if (env->cur_state->active_lock.ptr &&
 				    !in_rbtree_lock_required_cb(env)) {
 					verbose(env, "bpf_spin_unlock is missing\n");
@@ -16485,10 +16543,23 @@ static int do_check(struct bpf_verifier_env *env)
 				 * function, for which reference_state must
 				 * match caller reference state when it exits.
 				 */
-				err = check_reference_leak(env);
+				err = check_reference_leak(env, exception_exit);
 				if (err)
 					return err;
 
+				/* The side effect of the prepare_func_exit
+				 * which is being skipped is that it frees
+				 * bpf_func_state. Typically, process_bpf_exit
+				 * will only be hit with outermost exit.
+				 * copy_verifier_state in pop_stack will handle
+				 * freeing of any extra bpf_func_state left over
+				 * from not processing all nested function
+				 * exits. We also skip return code checks as
+				 * they are not needed for exceptional exits.
+				 */
+				if (exception_exit)
+					goto process_bpf_exit;
+
 				if (state->curframe) {
 					/* exit from nested function */
 					err = prepare_func_exit(env, &env->insn_idx);
@@ -17782,6 +17853,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->aux->num_exentries = num_exentries;
 		func[i]->aux->tail_call_reachable = env->subprog_info[i].tail_call_reachable;
 		func[i]->aux->invented_prog = env->subprog_info[i].invented_prog;
+		func[i]->aux->exception_cb = env->subprog_info[i].is_exception_cb;
+		if (!i)
+			func[i]->aux->seen_exception = env->seen_exception;
 		func[i] = bpf_int_jit_compile(func[i]);
 		if (!func[i]->jited) {
 			err = -ENOTSUPP;
@@ -17868,6 +17942,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	prog->aux->num_exentries = func[0]->aux->num_exentries;
 	prog->aux->func = func;
 	prog->aux->func_cnt = env->subprog_cnt;
+	prog->aux->bpf_exception_cb = (void *)func[env->exception_callback_subprog]->bpf_func;
+	prog->aux->seen_exception = func[0]->aux->seen_exception;
 	bpf_prog_jit_attempt_done(prog);
 	return 0;
 out_free:
@@ -18116,6 +18192,26 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 	struct bpf_map *map_ptr;
 	int i, ret, cnt, delta = 0;
 
+	if (env->seen_exception && !env->exception_callback_subprog) {
+		struct bpf_insn patch[] = {
+			env->prog->insnsi[insn_cnt - 1],
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		};
+
+		ret = invent_subprog(env, patch, ARRAY_SIZE(patch));
+		if (ret < 0)
+			return ret;
+		prog = env->prog;
+		insn = prog->insnsi;
+
+		env->exception_callback_subprog = env->subprog_cnt - 1;
+		/* Don't update insn_cnt, as invent_subprog always appends insns */
+		env->subprog_info[env->exception_callback_subprog].is_cb = true;
+		env->subprog_info[env->exception_callback_subprog].is_async_cb = true;
+		env->subprog_info[env->exception_callback_subprog].is_exception_cb = true;
+	}
+
 	for (i = 0; i < insn_cnt; i++, insn++) {
 		/* Make divide-by-zero exceptions impossible. */
 		if (insn->code == (BPF_ALU64 | BPF_MOD | BPF_X) ||
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 209811b1993a..f1d7de1349bc 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -131,4 +131,10 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
  */
 extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
 
+__attribute__((noreturn))
+extern void bpf_throw(u64 cookie) __ksym;
+
+#define throw bpf_throw(0)
+#define throw_value(cookie) bpf_throw(cookie)
+
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (5 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-14 22:39   ` Alexei Starovoitov
  2023-07-13  2:32 ` [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback Kumar Kartikeya Dwivedi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

Now that we allow exception throwing using the bpf_throw kfunc, it can
appear as the final instruction in a prog. When this happens, and we
begin to unwind the stack using arch_bpf_stack_walk, the instruction
pointer (IP) may appear to lie outside the JITed instructions. This
happens because the return address points to the instruction following
the call, but bpf_throw never returns to the program, so the JIT
considers the instruction ending at the bpf_throw call to be the final
JITed instruction and the end of the jited_length for the program.

This becomes a problem when we search for the IP using
is_bpf_text_address and bpf_prog_ksym_find, both of which use
bpf_ksym_find under the hood, and the latter rightfully considers
addr == ksym.end to be outside the program's boundaries.
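
For illustration, the containment check that trips here is roughly
equivalent to the sketch below (simplified; the in-kernel lookup goes
through a latch tree rather than an open-coded comparison, and the
helper name is made up):

	/* A program's ksym covers the half-open range [start, end), so a
	 * return address that lands exactly at ksym.end is treated as
	 * lying outside the program.
	 */
	static bool ip_within_prog(unsigned long addr, unsigned long start,
				   unsigned long end)
	{
		return addr >= start && addr < end;
	}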

Insert a dummy 'int3' instruction, which will never be hit, to bump the
jited_length and allow us to handle programs whose final instruction is
a call to bpf_throw.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 arch/x86/net/bpf_jit_comp.c | 11 +++++++++++
 include/linux/bpf.h         |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 8d97c6a60f9a..052230cc7f50 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1579,6 +1579,17 @@ st:			if (is_imm8(insn->off))
 			}
 			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
 				return -EINVAL;
+			/* Similar to BPF_EXIT_INSN, call for bpf_throw may be
+			 * the final instruction in the program. Insert an int3
+			 * following the call instruction so that we can still
+			 * detect pc to be part of the bpf_prog in
+			 * bpf_ksym_find, otherwise when this is the last
+			 * instruction (as allowed by verifier, similar to exit
+			 * and jump instructions), pc will be == ksym.end,
+			 * leading to bpf_throw failing to unwind the stack.
+			 */
+			if (func == (u8 *)&bpf_throw)
+				EMIT1(0xCC); /* int3 */
 			break;
 		}
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 61cdb291311f..1652d184ee7f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3111,4 +3111,6 @@ static inline gfp_t bpf_memcg_flags(gfp_t flags)
 	return flags;
 }
 
+extern void bpf_throw(u64);
+
 #endif /* _LINUX_BPF_H */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (6 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-14 22:47   ` Alexei Starovoitov
  2023-07-13  2:32 ` [PATCH bpf-next v1 09/10] selftests/bpf: Add BPF assertion macros Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

By default, the subprog generated by the verifier to handle a thrown
exception hardcodes a return value of 0. To allow user-defined logic
and modification of the return value when an exception is thrown,
introduce the bpf_set_exception_callback kfunc, which installs a
callback as the default exception handler for the program.

Compared to runtime kfuncs, this kfunc acts as a built-in, i.e. it only
takes semantic effect during verification, and is erased from the
program at runtime.

This kfunc can only be called once within a program, and always sets the
global exception handler, regardless of whether it was invoked in all
paths of the program or not. The kfunc is idempotent, and the default
exception callback cannot be modified at runtime.

Allowing modification of the callback for the current program execution
at runtime leads to issues when the programs begin to nest, as any
per-CPU state maintaining this information will have to be saved and
restored. We don't want it to stay in bpf_prog_aux, as that would take
effect globally for all programs. An alternative solution is spilling
the callback pointer at a known location on the program stack on entry,
and then passing this location to bpf_throw as a parameter.

However, since exceptions are geared more towards a use case where they
are ideally never invoked, optimizing for this use case and adding to
the complexity has diminishing returns.

In the future, a variant of bpf_throw which allows supplying a callback
can also be introduced, to modify the callback for a certain throw
statement. For now, bpf_set_exception_callback is meant to serve as a
way to statically set a subprog as the exception handler of a BPF
program.
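
For reference, intended usage from a BPF program looks roughly like the
sketch below, mirroring the selftests added later in this series (the
program and callback names are illustrative):

	static __noinline int my_exception_cb(u64 cookie)
	{
		/* Becomes the return value of the program whenever an
		 * exception is thrown.
		 */
		return 16;
	}

	SEC("tc")
	int prog(struct __sk_buff *ctx)
	{
		bpf_set_exception_callback(my_exception_cb);
		if (!ctx->len)
			bpf_throw(0);
		return 1;
	}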

TODO: Should we change default behavior to just return whatever cookie
value was passed to bpf_throw? That might allow people to avoid
installing a callback in case they just want to manipulate the return
value.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf_verifier.h                  |  1 +
 kernel/bpf/helpers.c                          |  6 ++
 kernel/bpf/verifier.c                         | 97 +++++++++++++++++--
 .../testing/selftests/bpf/bpf_experimental.h  |  2 +
 4 files changed, 98 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index e28386fa462f..bd9d73b0647e 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -300,6 +300,7 @@ struct bpf_func_state {
 	bool in_callback_fn;
 	struct tnum callback_ret_range;
 	bool in_async_callback_fn;
+	bool in_exception_callback_fn;
 
 	/* The following fields should be last. See copy_func_state() */
 	int acquired_refs;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index da1493a1da25..a2cb7ebf3d99 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2439,6 +2439,11 @@ __bpf_kfunc void bpf_throw(u64 cookie)
 	ctx.aux->bpf_exception_cb(cookie, ctx.sp, ctx.bp);
 }
 
+__bpf_kfunc void bpf_set_exception_callback(int (*cb)(u64))
+{
+	WARN_ON_ONCE(1);
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2467,6 +2472,7 @@ BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_throw)
+BTF_ID_FLAGS(func, bpf_set_exception_callback)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 61101a87d96b..9bdb3c7d3926 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -544,6 +544,8 @@ static bool is_dynptr_ref_function(enum bpf_func_id func_id)
 static bool is_callback_calling_kfunc(u32 btf_id);
 static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id);
 static bool is_bpf_throw_kfunc(struct bpf_insn *insn);
+static bool is_async_callback_calling_kfunc(u32 btf_id);
+static bool is_exception_cb_kfunc(struct bpf_insn *insn);
 
 static bool is_callback_calling_function(enum bpf_func_id func_id)
 {
@@ -3555,7 +3557,8 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx,
 		} else if ((bpf_helper_call(insn) &&
 			    is_callback_calling_function(insn->imm) &&
 			    !is_async_callback_calling_function(insn->imm)) ||
-			   (bpf_pseudo_kfunc_call(insn) && is_callback_calling_kfunc(insn->imm))) {
+			   (bpf_pseudo_kfunc_call(insn) && is_callback_calling_kfunc(insn->imm) &&
+			    !is_async_callback_calling_kfunc(insn->imm))) {
 			/* callback-calling helper or kfunc call, which means
 			 * we are exiting from subprog, but unlike the subprog
 			 * call handling above, we shouldn't propagate
@@ -5665,6 +5668,14 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 			if (subprog[sidx].has_tail_call) {
 				verbose(env, "verifier bug. subprog has tail_call and async cb\n");
 				return -EFAULT;
+			}
+			if (subprog[sidx].is_exception_cb && bpf_pseudo_call(insn + i)) {
+				verbose(env, "insn %d cannot call exception cb directly\n", i);
+				return -EINVAL;
+			}
+			if (subprog[sidx].is_exception_cb && subprog[sidx].has_tail_call) {
+				verbose(env, "insn %d cannot tail call within exception cb\n", i);
+				return -EINVAL;
 			}
 			 /* async callbacks don't increase bpf prog stack size */
 			continue;
@@ -5689,8 +5700,13 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 	 * tail call counter throughout bpf2bpf calls combined with tailcalls
 	 */
 	if (tail_call_reachable)
-		for (j = 0; j < frame; j++)
+		for (j = 0; j < frame; j++) {
+			if (subprog[ret_prog[j]].is_exception_cb) {
+				verbose(env, "cannot tail call within exception cb\n");
+				return -EINVAL;
+			}
 			subprog[ret_prog[j]].tail_call_reachable = true;
+		}
 	if (subprog[0].tail_call_reachable)
 		env->prog->aux->tail_call_reachable = true;
 
@@ -8770,13 +8786,16 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		}
 	}
 
-	if (insn->code == (BPF_JMP | BPF_CALL) &&
+	if ((insn->code == (BPF_JMP | BPF_CALL) &&
 	    insn->src_reg == 0 &&
-	    insn->imm == BPF_FUNC_timer_set_callback) {
+	    insn->imm == BPF_FUNC_timer_set_callback) ||
+	    is_exception_cb_kfunc(insn)) {
 		struct bpf_verifier_state *async_cb;
 
 		/* there is no real recursion here. timer callbacks are async */
 		env->subprog_info[subprog].is_async_cb = true;
+		if (is_exception_cb_kfunc(insn))
+			env->subprog_info[subprog].is_exception_cb = true;
 		async_cb = push_async_cb(env, env->subprog_info[subprog].start,
 					 *insn_idx, subprog);
 		if (!async_cb)
@@ -9032,6 +9051,22 @@ static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_exception_callback_state(struct bpf_verifier_env *env,
+					struct bpf_func_state *caller,
+					struct bpf_func_state *callee,
+					int insn_idx)
+{
+	__mark_reg_unknown(env, &callee->regs[BPF_REG_1]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_2]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_exception_callback_fn = true;
+	callee->callback_ret_range = tnum_range(0, 0);
+	return 0;
+
+}
+
 static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
 					 struct bpf_func_state *caller,
 					 struct bpf_func_state *callee,
@@ -10133,6 +10168,7 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
 	KF_bpf_throw,
+	KF_bpf_set_exception_callback,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -10154,6 +10190,7 @@ BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
 BTF_ID(func, bpf_throw)
+BTF_ID(func, bpf_set_exception_callback)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -10177,6 +10214,7 @@ BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
 BTF_ID(func, bpf_throw)
+BTF_ID(func, bpf_set_exception_callback)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10487,7 +10525,19 @@ static bool is_bpf_graph_api_kfunc(u32 btf_id)
 
 static bool is_callback_calling_kfunc(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
+	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl] ||
+	       btf_id == special_kfunc_list[KF_bpf_set_exception_callback];
+}
+
+static bool is_async_callback_calling_kfunc(u32 btf_id)
+{
+	return btf_id == special_kfunc_list[KF_bpf_set_exception_callback];
+}
+
+static bool is_exception_cb_kfunc(struct bpf_insn *insn)
+{
+	return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
+	       insn->imm == special_kfunc_list[KF_bpf_set_exception_callback];
 }
 
 static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
@@ -10498,7 +10548,8 @@ static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
 
 static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id)
 {
-	return btf_id == special_kfunc_list[KF_bpf_throw];
+	return btf_id == special_kfunc_list[KF_bpf_throw] ||
+	       btf_id == special_kfunc_list[KF_bpf_set_exception_callback];
 }
 
 static bool is_rbtree_lock_required_kfunc(u32 btf_id)
@@ -11290,6 +11341,33 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		throw = true;
 	}
 
+	if (meta.func_id == special_kfunc_list[KF_bpf_set_exception_callback]) {
+		if (!bpf_jit_supports_exceptions()) {
+			verbose(env, "JIT does not support calling kfunc %s#%d\n",
+				func_name, meta.func_id);
+			return -EINVAL;
+		}
+
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_exception_callback_state);
+		if (err) {
+			verbose(env, "kfunc %s#%d failed callback verification\n",
+				func_name, meta.func_id);
+			return err;
+		}
+		if (env->cur_state->frame[0]->subprogno) {
+			verbose(env, "kfunc %s#%d can only be called from main prog\n",
+				func_name, meta.func_id);
+			return -EINVAL;
+		}
+		if (env->exception_callback_subprog) {
+			verbose(env, "kfunc %s#%d can only be called once to set exception callback\n",
+				func_name, meta.func_id);
+			return -EINVAL;
+		}
+		env->exception_callback_subprog = meta.subprogno;
+	}
+
 	for (i = 0; i < CALLER_SAVED_REGS; i++)
 		mark_reg_not_init(env, regs, caller_saved[i]);
 
@@ -14320,7 +14398,7 @@ static int check_return_code(struct bpf_verifier_env *env)
 	const bool is_subprog = frame->subprogno;
 
 	/* LSM and struct_ops func-ptr's return type could be "void" */
-	if (!is_subprog) {
+	if (!is_subprog || frame->in_exception_callback_fn) {
 		switch (prog_type) {
 		case BPF_PROG_TYPE_LSM:
 			if (prog->expected_attach_type == BPF_LSM_CGROUP)
@@ -14368,7 +14446,7 @@ static int check_return_code(struct bpf_verifier_env *env)
 		return 0;
 	}
 
-	if (is_subprog) {
+	if (is_subprog && !frame->in_exception_callback_fn) {
 		if (reg->type != SCALAR_VALUE) {
 			verbose(env, "At subprogram exit the register R0 is not a scalar value (%s)\n",
 				reg_type_str(env, reg->type));
@@ -18147,6 +18225,9 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
 		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
 		*cnt = 1;
+	} else if (desc->func_id == special_kfunc_list[KF_bpf_set_exception_callback]) {
+		insn_buf[0] = BPF_JMP_IMM(BPF_JA, 0, 0, 0);
+		*cnt = 1;
 	}
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index f1d7de1349bc..d27e694392a7 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -137,4 +137,6 @@ extern void bpf_throw(u64 cookie) __ksym;
 #define throw bpf_throw(0)
 #define throw_value(cookie) bpf_throw(cookie)
 
+extern void bpf_set_exception_callback(int (*cb)(u64)) __ksym;
+
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v1 09/10] selftests/bpf: Add BPF assertion macros
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (7 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-13  2:32 ` [PATCH bpf-next v1 10/10] selftests/bpf: Add tests for BPF exceptions Kumar Kartikeya Dwivedi
  2023-07-17 18:15 ` [PATCH bpf-next v1 00/10] Exceptions - 1/2 Daniel Xu
  10 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

Add macros implementing an 'assert' statement primitive, built on top
of the BPF exceptions support introduced in the previous patches.

The bpf_assert_*_value variants allow supplying a value which can then
be inspected within the exception handler to identify the assert
statement that led to the program being terminated abruptly.
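
For reference, usage looks roughly like the sketch below, in line with
the selftests added in the next patch (program and callback names are
illustrative; my_exception_cb is assumed to be a user-defined exception
callback):

	SEC("tc")
	int prog(struct __sk_buff *ctx)
	{
		bpf_set_exception_callback(my_exception_cb);
		/* Throws with cookie 0 if the condition does not hold. */
		bpf_assert_eq(ctx->protocol, __bpf_htons(ETH_P_IP));
		/* Throws with cookie 42 instead of 0 on failure. */
		bpf_assert_lt_value(ctx->len, 1024, 42);
		return 6;
	}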

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../testing/selftests/bpf/bpf_experimental.h  | 20 +++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index d27e694392a7..30774fef455c 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -139,4 +139,24 @@ extern void bpf_throw(u64 cookie) __ksym;
 
 extern void bpf_set_exception_callback(int (*cb)(u64)) __ksym;
 
+#define __bpf_assert_op(LHS, op, RHS, VAL)							   \
+	_Static_assert(sizeof(&(LHS)), "1st argument must be an lvalue expression");		   \
+	_Static_assert(__builtin_constant_p((RHS)), "2nd argument must be a constant expression"); \
+	asm volatile ("if %[lhs] " op " %[rhs] goto +2; r1 = %[value]; call bpf_throw"		   \
+		      : : [lhs] "r"(LHS), [rhs] "i"(RHS), [value] "ri"(VAL) : )
+
+#define bpf_assert_eq(LHS, RHS) __bpf_assert_op(LHS, "==", RHS, 0)
+#define bpf_assert_ne(LHS, RHS) __bpf_assert_op(LHS, "!=", RHS, 0)
+#define bpf_assert_lt(LHS, RHS) __bpf_assert_op(LHS, "<", RHS, 0)
+#define bpf_assert_gt(LHS, RHS) __bpf_assert_op(LHS, ">", RHS, 0)
+#define bpf_assert_le(LHS, RHS) __bpf_assert_op(LHS, "<=", RHS, 0)
+#define bpf_assert_ge(LHS, RHS) __bpf_assert_op(LHS, ">=", RHS, 0)
+
+#define bpf_assert_eq_value(LHS, RHS, value) __bpf_assert_op(LHS, "==", RHS, value)
+#define bpf_assert_ne_value(LHS, RHS, value) __bpf_assert_op(LHS, "!=", RHS, value)
+#define bpf_assert_lt_value(LHS, RHS, value) __bpf_assert_op(LHS, "<", RHS, value)
+#define bpf_assert_gt_value(LHS, RHS, value) __bpf_assert_op(LHS, ">", RHS, value)
+#define bpf_assert_le_value(LHS, RHS, value) __bpf_assert_op(LHS, "<=", RHS, value)
+#define bpf_assert_ge_value(LHS, RHS, value) __bpf_assert_op(LHS, ">=", RHS, value)
+
 #endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v1 10/10] selftests/bpf: Add tests for BPF exceptions
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (8 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 09/10] selftests/bpf: Add BPF assertion macros Kumar Kartikeya Dwivedi
@ 2023-07-13  2:32 ` Kumar Kartikeya Dwivedi
  2023-07-17 18:15 ` [PATCH bpf-next v1 00/10] Exceptions - 1/2 Daniel Xu
  10 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-13  2:32 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

Add selftests to cover success and failure cases of API usage, runtime
behavior and invariants that need to be maintained for implementation
correctness.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/exceptions.c     | 272 +++++++++++
 .../testing/selftests/bpf/progs/exceptions.c  | 450 ++++++++++++++++++
 .../selftests/bpf/progs/exceptions_ext.c      |  42 ++
 .../selftests/bpf/progs/exceptions_fail.c     | 311 ++++++++++++
 4 files changed, 1075 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/exceptions.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions_ext.c
 create mode 100644 tools/testing/selftests/bpf/progs/exceptions_fail.c

diff --git a/tools/testing/selftests/bpf/prog_tests/exceptions.c b/tools/testing/selftests/bpf/prog_tests/exceptions.c
new file mode 100644
index 000000000000..e6a906ef6852
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/exceptions.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+
+#include "exceptions.skel.h"
+#include "exceptions_ext.skel.h"
+#include "exceptions_fail.skel.h"
+
+static char log_buf[1024 * 1024];
+
+static void test_exceptions_failure(void)
+{
+	RUN_TESTS(exceptions_fail);
+}
+
+static void test_exceptions_success(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, ropts,
+		.data_in = &pkt_v4,
+		.data_size_in = sizeof(pkt_v4),
+		.repeat = 1,
+	);
+	struct exceptions_ext *eskel = NULL;
+	struct exceptions *skel;
+	int ret;
+
+	skel = exceptions__open();
+	if (!ASSERT_OK_PTR(skel, "exceptions__open"))
+		return;
+
+	ret = exceptions__load(skel);
+	if (!ASSERT_OK(ret, "exceptions__load"))
+		goto done;
+
+	if (!ASSERT_OK(bpf_map_update_elem(bpf_map__fd(skel->maps.jmp_table), &(int){0},
+					   &(int){bpf_program__fd(skel->progs.exception_tail_call_target)}, BPF_ANY),
+		       "bpf_map_update_elem jmp_table"))
+		goto done;
+
+#define RUN_SUCCESS(_prog, return_val)						  \
+	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs._prog), &ropts); \
+	ASSERT_OK(ret, #_prog " prog run ret");					  \
+	ASSERT_EQ(ropts.retval, return_val, #_prog " prog run retval");
+
+	RUN_SUCCESS(exception_throw_subprog, 16);
+	RUN_SUCCESS(exception_throw, 0);
+	RUN_SUCCESS(exception_throw_gfunc1, 1);
+	RUN_SUCCESS(exception_throw_gfunc2, 0);
+	RUN_SUCCESS(exception_throw_gfunc3, 1);
+	RUN_SUCCESS(exception_throw_gfunc4, 0);
+	RUN_SUCCESS(exception_throw_gfunc5, 1);
+	RUN_SUCCESS(exception_throw_gfunc6, 16);
+	RUN_SUCCESS(exception_throw_func1, 1);
+	RUN_SUCCESS(exception_throw_func2, 0);
+	RUN_SUCCESS(exception_throw_func3, 1);
+	RUN_SUCCESS(exception_throw_func4, 0);
+	RUN_SUCCESS(exception_throw_func5, 1);
+	RUN_SUCCESS(exception_throw_func6, 16);
+	RUN_SUCCESS(exception_tail_call, 50);
+	RUN_SUCCESS(exception_ext, 5);
+	RUN_SUCCESS(exception_throw_value, 60);
+	RUN_SUCCESS(exception_assert_eq, 16);
+	RUN_SUCCESS(exception_assert_ne, 16);
+	RUN_SUCCESS(exception_assert_lt, 16);
+	RUN_SUCCESS(exception_assert_gt, 16);
+	RUN_SUCCESS(exception_assert_le, 16);
+	RUN_SUCCESS(exception_assert_ge, 16);
+	RUN_SUCCESS(exception_assert_eq_ok, 6);
+	RUN_SUCCESS(exception_assert_ne_ok, 6);
+	RUN_SUCCESS(exception_assert_lt_ok, 6);
+	RUN_SUCCESS(exception_assert_gt_ok, 6);
+	RUN_SUCCESS(exception_assert_le_ok, 6);
+	RUN_SUCCESS(exception_assert_ge_ok, 6);
+	RUN_SUCCESS(exception_assert_eq_value, 42);
+	RUN_SUCCESS(exception_assert_ne_value, 42);
+	RUN_SUCCESS(exception_assert_lt_value, 42);
+	RUN_SUCCESS(exception_assert_gt_value, 42);
+	RUN_SUCCESS(exception_assert_le_value, 42);
+	RUN_SUCCESS(exception_assert_ge_value, 42);
+	RUN_SUCCESS(exception_assert_eq_ok_value, 5);
+	RUN_SUCCESS(exception_assert_ne_ok_value, 5);
+	RUN_SUCCESS(exception_assert_lt_ok_value, 5);
+	RUN_SUCCESS(exception_assert_gt_ok_value, 5);
+	RUN_SUCCESS(exception_assert_le_ok_value, 5);
+	RUN_SUCCESS(exception_assert_ge_ok_value, 5);
+
+#define RUN_EXT(load_ret, attach_err, expr, msg, after_link)			  \
+	{									  \
+		LIBBPF_OPTS(bpf_object_open_opts, o, .kernel_log_buf = log_buf,		 \
+						     .kernel_log_size = sizeof(log_buf), \
+						     .kernel_log_level = 2);		 \
+		exceptions_ext__destroy(eskel);					  \
+		eskel = exceptions_ext__open_opts(&o);				  \
+		struct bpf_program *prog = NULL;				  \
+		struct bpf_link *link = NULL;					  \
+		if (!ASSERT_OK_PTR(eskel, "exceptions_ext__open"))		  \
+			goto done;						  \
+		(expr);								  \
+		ASSERT_OK_PTR(bpf_program__name(prog), bpf_program__name(prog));  \
+		if (!ASSERT_EQ(exceptions_ext__load(eskel), load_ret,		  \
+			       "exceptions_ext__load"))	{			  \
+			printf("%s\n", log_buf);				  \
+			goto done;						  \
+		}								  \
+		if (load_ret != 0) {						  \
+			printf("%s\n", log_buf);				  \
+			if (!ASSERT_OK_PTR(strstr(log_buf, msg), "strstr"))	  \
+				goto done;					  \
+		}								  \
+		if (!load_ret && attach_err) {					  \
+			if (!ASSERT_ERR_PTR(link = bpf_program__attach(prog), "attach err")) \
+				goto done;					  \
+		} else if (!load_ret) {						  \
+			if (!ASSERT_OK_PTR(link = bpf_program__attach(prog), "attach ok"))  \
+				goto done;					  \
+			(void)(after_link);					  \
+			bpf_link__destroy(link);				  \
+		}								  \
+	}
+
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.throwing_extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_ext),
+			       "exception_ext_global"), "set_attach_target"))
+			goto done;
+	}), "", ({ RUN_SUCCESS(exception_ext, 0); }));
+
+	/* non-throwing fexit -> non-throwing subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.pfexit;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* throwing fexit -> non-throwing subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.throwing_fexit;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* non-throwing fexit -> throwing subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.pfexit;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "throwing_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* throwing fexit -> throwing subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.throwing_fexit;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "throwing_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* fmod_ret not allowed for subprog - Check so we remember to handle its
+	 * throwing specification compatibility with target when supported.
+	 */
+	RUN_EXT(-EINVAL, false, ({
+		prog = eskel->progs.pfmod_ret;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "subprog"), "set_attach_target"))
+			goto done;
+	}), "can't modify return codes of BPF program", (void)0);
+
+	/* fmod_ret not allowed for subprog - Check so we remember to handle its
+	 * throwing specification compatibility with target when supported.
+	 */
+	RUN_EXT(-EINVAL, false, ({
+		prog = eskel->progs.pfmod_ret;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "global_subprog"), "set_attach_target"))
+			goto done;
+	}), "can't modify return codes of BPF program", (void)0);
+
+	/* non-throwing extension -> non-throwing subprog : BAD (!global) */
+	RUN_EXT(-EINVAL, true, ({
+		prog = eskel->progs.extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "subprog"), "set_attach_target"))
+			goto done;
+	}), "subprog() is not a global function", (void)0);
+
+	/* non-throwing extension -> throwing subprog : BAD (!global) */
+	RUN_EXT(-EINVAL, true, ({
+		prog = eskel->progs.extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "throwing_subprog"), "set_attach_target"))
+			goto done;
+	}), "throwing_subprog() is not a global function", (void)0);
+
+	/* non-throwing extension -> non-throwing global subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "global_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* non-throwing extension -> throwing global subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "throwing_global_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* throwing extension -> throwing global subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.throwing_extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "throwing_global_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* throwing extension -> main subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.throwing_extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "exception_throw_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+
+	/* throwing extension -> non-throwing global subprog : OK */
+	RUN_EXT(0, false, ({
+		prog = eskel->progs.throwing_extension;
+		bpf_program__set_autoload(prog, true);
+		if (!ASSERT_OK(bpf_program__set_attach_target(prog,
+			       bpf_program__fd(skel->progs.exception_throw_subprog),
+			       "global_subprog"), "set_attach_target"))
+			goto done;
+	}), "", (void)0);
+done:
+	exceptions_ext__destroy(eskel);
+	exceptions__destroy(skel);
+}
+
+void test_exceptions(void)
+{
+	test_exceptions_success();
+	test_exceptions_failure();
+}
diff --git a/tools/testing/selftests/bpf/progs/exceptions.c b/tools/testing/selftests/bpf/progs/exceptions.c
new file mode 100644
index 000000000000..f8c2727f4584
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/exceptions.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+#ifndef ETH_P_IP
+#define ETH_P_IP	0x0800		/* Internet Protocol packet	*/
+#endif
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 4);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+SEC("tc")
+int exception_throw(struct __sk_buff *ctx)
+{
+	volatile int ret = 1;
+
+	if (ctx->protocol)
+		throw;
+	return ret;
+}
+
+
+static __noinline int subprog(struct __sk_buff *ctx)
+{
+	return ctx->len;
+}
+
+static __noinline int throwing_subprog(struct __sk_buff *ctx)
+{
+	volatile int ret = 0;
+
+	if (ctx->protocol)
+		throw;
+	return ret;
+}
+
+__noinline int global_subprog(struct __sk_buff *ctx)
+{
+	return subprog(ctx) + 1;
+}
+
+__noinline int throwing_global_subprog(struct __sk_buff *ctx)
+{
+	volatile int ret = 0;
+
+	if (ctx->protocol)
+		throw;
+	return ret;
+}
+
+__noinline int throwing_global_subprog_value(struct __sk_buff *ctx, u64 value)
+{
+	volatile int ret = 0;
+
+	if (ctx->protocol)
+		throw_value(value);
+	return ret;
+}
+
+static __noinline int exception_cb(u64 c)
+{
+	volatile int ret = 16;
+
+	return ret;
+}
+
+SEC("tc")
+int exception_throw_subprog(struct __sk_buff *ctx)
+{
+	volatile int i;
+
+	bpf_set_exception_callback(exception_cb);
+	i = subprog(ctx);
+	i += global_subprog(ctx) - 1;
+	if (!i)
+		return throwing_global_subprog(ctx);
+	else
+		return throwing_subprog(ctx);
+	throw;
+	return 0;
+}
+
+__noinline int throwing_gfunc(volatile int i)
+{
+	volatile int ret = 1;
+
+	bpf_assert_eq(i, 0);
+	return ret;
+}
+
+__noinline static int throwing_func(volatile int i)
+{
+	volatile int ret = 1;
+
+	bpf_assert_lt(i, 1);
+	return ret;
+}
+
+SEC("tc")
+int exception_throw_gfunc1(void *ctx)
+{
+	return throwing_gfunc(0);
+}
+
+SEC("tc")
+__noinline int exception_throw_gfunc2()
+{
+	return throwing_gfunc(1);
+}
+
+__noinline int throwing_gfunc_2(volatile int i)
+{
+	return throwing_gfunc(i);
+}
+
+SEC("tc")
+int exception_throw_gfunc3(void *ctx)
+{
+	return throwing_gfunc_2(0);
+}
+
+SEC("tc")
+int exception_throw_gfunc4(void *ctx)
+{
+	return throwing_gfunc_2(1);
+}
+
+SEC("tc")
+int exception_throw_gfunc5(void *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	return throwing_gfunc_2(0);
+}
+
+SEC("tc")
+int exception_throw_gfunc6(void *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	return throwing_gfunc_2(1);
+}
+
+
+SEC("tc")
+int exception_throw_func1(void *ctx)
+{
+	return throwing_func(0);
+}
+
+SEC("tc")
+int exception_throw_func2(void *ctx)
+{
+	return throwing_func(1);
+}
+
+__noinline static int throwing_func_2(volatile int i)
+{
+	return throwing_func(i);
+}
+
+SEC("tc")
+int exception_throw_func3(void *ctx)
+{
+	return throwing_func_2(0);
+}
+
+SEC("tc")
+int exception_throw_func4(void *ctx)
+{
+	return throwing_func_2(1);
+}
+
+SEC("tc")
+int exception_throw_func5(void *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	return throwing_func_2(0);
+}
+
+SEC("tc")
+int exception_throw_func6(void *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	return throwing_func_2(1);
+}
+
+static int exception_cb_nz(u64 cookie)
+{
+	volatile int ret = 42;
+
+	return ret;
+}
+
+SEC("tc")
+int exception_tail_call_target(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_nz);
+	throw;
+}
+
+static __noinline
+int exception_tail_call_subprog(struct __sk_buff *ctx)
+{
+	volatile int ret = 10;
+
+	bpf_tail_call_static(ctx, &jmp_table, 0);
+	return ret;
+}
+
+SEC("tc")
+int exception_tail_call(struct __sk_buff *ctx) {
+	volatile int ret = 0;
+
+	bpf_set_exception_callback(exception_cb);
+	ret = exception_tail_call_subprog(ctx);
+	return ret + 8;
+}
+
+__noinline int exception_ext_global(struct __sk_buff *ctx)
+{
+	volatile int ret = 5;
+
+	return ret;
+}
+
+static __noinline int exception_ext_static(struct __sk_buff *ctx)
+{
+	return exception_ext_global(ctx);
+}
+
+SEC("tc")
+int exception_ext(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_nz);
+	return exception_ext_static(ctx);
+}
+
+static __noinline int exception_cb_value(u64 cookie)
+{
+	return cookie - 4;
+}
+
+SEC("tc")
+int exception_throw_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	return throwing_global_subprog_value(ctx, 64);
+}
+
+SEC("tc")
+int exception_assert_eq(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_eq(ctx->protocol, IPPROTO_UDP);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_ne(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_ne(ctx->protocol, __bpf_htons(ETH_P_IP));
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_lt(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_lt(ctx->protocol, __bpf_htons(ETH_P_IP) - 1);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_gt(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_gt(ctx->protocol, __bpf_htons(ETH_P_IP) + 1);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_le(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_le(ctx->protocol, __bpf_htons(ETH_P_IP) - 1);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_ge(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_ge(ctx->protocol, __bpf_htons(ETH_P_IP) + 1);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_eq_ok(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_eq(ctx->protocol, __bpf_htons(ETH_P_IP));
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_ne_ok(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_ne(ctx->protocol, IPPROTO_UDP);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_lt_ok(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_lt(ctx->protocol, __bpf_htons(ETH_P_IP) + 1);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_gt_ok(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_gt(ctx->protocol, __bpf_htons(ETH_P_IP) - 1);
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_le_ok(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_le(ctx->protocol, __bpf_htons(ETH_P_IP));
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_ge_ok(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb);
+	bpf_assert_ge(ctx->protocol, __bpf_htons(ETH_P_IP));
+	return 6;
+}
+
+SEC("tc")
+int exception_assert_eq_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_eq_value(ctx->protocol, IPPROTO_UDP, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_ne_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_ne_value(ctx->protocol, __bpf_htons(ETH_P_IP), 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_lt_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_lt_value(ctx->protocol, __bpf_htons(ETH_P_IP) - 1, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_gt_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_gt_value(ctx->protocol, __bpf_htons(ETH_P_IP) + 1, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_le_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_le_value(ctx->protocol, __bpf_htons(ETH_P_IP) - 1, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_ge_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_ge_value(ctx->protocol, __bpf_htons(ETH_P_IP) + 1, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_eq_ok_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_eq_value(ctx->protocol, __bpf_htons(ETH_P_IP), 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_ne_ok_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_ne_value(ctx->protocol, IPPROTO_UDP, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_lt_ok_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_lt_value(ctx->protocol, __bpf_htons(ETH_P_IP) + 1, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_gt_ok_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_gt_value(ctx->protocol, __bpf_htons(ETH_P_IP) - 1, 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_le_ok_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_le_value(ctx->protocol, __bpf_htons(ETH_P_IP), 46);
+	return 5;
+}
+
+SEC("tc")
+int exception_assert_ge_ok_value(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_value);
+	bpf_assert_ge_value(ctx->protocol, __bpf_htons(ETH_P_IP), 46);
+	return 5;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/exceptions_ext.c b/tools/testing/selftests/bpf/progs/exceptions_ext.c
new file mode 100644
index 000000000000..9ce9752254bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/exceptions_ext.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+
+SEC("?freplace")
+int extension(struct __sk_buff *ctx)
+{
+	return 0;
+}
+
+SEC("?freplace")
+int throwing_extension(struct __sk_buff *ctx)
+{
+	throw;
+}
+
+SEC("?fexit")
+int pfexit(void *ctx)
+{
+	return 0;
+}
+
+SEC("?fexit")
+int throwing_fexit(void *ctx)
+{
+	throw;
+}
+
+SEC("?fmod_ret")
+int pfmod_ret(void *ctx)
+{
+	return 1;
+}
+
+SEC("?fmod_ret")
+int throwing_fmod_ret(void *ctx)
+{
+	throw;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/exceptions_fail.c b/tools/testing/selftests/bpf/progs/exceptions_fail.c
new file mode 100644
index 000000000000..94ee6ae452c8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/exceptions_fail.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+extern void bpf_rcu_read_lock(void) __ksym;
+
+#define private(name) SEC(".bss." #name) __hidden __attribute__((aligned(8)))
+
+struct foo {
+	struct bpf_rb_node node;
+};
+
+private(A) struct bpf_spin_lock lock;
+private(A) struct bpf_rb_root rbtree __contains(foo, node);
+
+__noinline static int subprog_lock(struct __sk_buff *ctx)
+{
+	volatile int ret = 0;
+
+	bpf_spin_lock(&lock);
+	if (ctx->len)
+		throw;
+	return ret;
+}
+
+SEC("?tc")
+__failure __msg("function calls are not allowed while holding a lock")
+int reject_with_lock(void *ctx)
+{
+	bpf_spin_lock(&lock);
+	throw;
+}
+
+SEC("?tc")
+__failure __msg("function calls are not allowed while holding a lock")
+int reject_subprog_with_lock(void *ctx)
+{
+	return subprog_lock(ctx);
+}
+
+SEC("?tc")
+__failure __msg("bpf_rcu_read_unlock is missing")
+int reject_with_rcu_read_lock(void *ctx)
+{
+	bpf_rcu_read_lock();
+	throw;
+}
+
+__noinline static int throwing_subprog(struct __sk_buff *ctx)
+{
+	if (ctx->len)
+		throw;
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("bpf_rcu_read_unlock is missing")
+int reject_subprog_with_rcu_read_lock(void *ctx)
+{
+	bpf_rcu_read_lock();
+	return throwing_subprog(ctx);
+}
+
+static bool rbless(struct bpf_rb_node *n1, const struct bpf_rb_node *n2)
+{
+	throw;
+}
+
+SEC("?tc")
+__failure __msg("function calls are not allowed while holding a lock")
+int reject_with_rbtree_add_throw(void *ctx)
+{
+	struct foo *f;
+
+	f = bpf_obj_new(typeof(*f));
+	if (!f)
+		return 0;
+	bpf_spin_lock(&lock);
+	bpf_rbtree_add(&rbtree, &f->node, rbless);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("Unreleased reference")
+int reject_with_reference(void *ctx)
+{
+	struct foo *f;
+
+	f = bpf_obj_new(typeof(*f));
+	if (!f)
+		return 0;
+	throw;
+}
+
+__noinline static int subprog_ref(struct __sk_buff *ctx)
+{
+	struct foo *f;
+
+	f = bpf_obj_new(typeof(*f));
+	if (!f)
+		return 0;
+	throw;
+}
+
+__noinline static int subprog_cb_ref(u32 i, void *ctx)
+{
+	throw;
+}
+
+SEC("?tc")
+__failure __msg("Unreleased reference")
+int reject_with_cb_reference(void *ctx)
+{
+	struct foo *f;
+
+	f = bpf_obj_new(typeof(*f));
+	if (!f)
+		return 0;
+	bpf_loop(5, subprog_cb_ref, NULL, 0);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("cannot be called from callback")
+int reject_with_cb(void *ctx)
+{
+	bpf_loop(5, subprog_cb_ref, NULL, 0);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("Unreleased reference")
+int reject_with_subprog_reference(void *ctx)
+{
+	return subprog_ref(ctx) + 1;
+}
+
+static __noinline int throwing_exception_cb(u64 c)
+{
+	if (!c)
+		throw;
+	return c;
+}
+
+static __noinline int exception_cb1(u64 c)
+{
+	volatile int i = 0;
+
+	bpf_assert_eq(i, 0);
+	return i;
+}
+
+static __noinline int exception_cb2(u64 c)
+{
+	volatile int i = 0;
+
+	bpf_assert_eq(i, 0);
+	return i;
+}
+
+__noinline int throwing_exception_gfunc(struct __sk_buff *ctx)
+{
+	return throwing_exception_cb(ctx->protocol);
+}
+
+SEC("?tc")
+__failure __msg("cannot be called from callback")
+int reject_throwing_exception_cb_1(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(throwing_exception_cb);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("cannot call exception cb directly")
+int reject_throwing_exception_cb_2(struct __sk_buff *ctx)
+{
+	throwing_exception_gfunc(ctx);
+	bpf_set_exception_callback(throwing_exception_cb);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("can only be called once to set exception callback")
+int reject_throwing_exception_cb_3(struct __sk_buff *ctx)
+{
+	if (ctx->protocol)
+		bpf_set_exception_callback(exception_cb1);
+	else
+		bpf_set_exception_callback(exception_cb2);
+	throw;
+}
+
+__noinline int gfunc_set_exception_cb(u64 c)
+{
+	bpf_set_exception_callback(exception_cb1);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("can only be called from main prog")
+int reject_set_exception_cb_gfunc(struct __sk_buff *ctx)
+{
+	gfunc_set_exception_cb(0);
+	return 0;
+}
+
+static __noinline int exception_cb_rec(u64 c)
+{
+	bpf_set_exception_callback(exception_cb_rec);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("can only be called from main prog")
+int reject_set_exception_cb_rec1(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_rec);
+	return 0;
+}
+
+static __noinline int exception_cb_rec2(u64 c);
+
+static __noinline int exception_cb_rec1(u64 c)
+{
+	bpf_set_exception_callback(exception_cb_rec2);
+	return 0;
+}
+
+static __noinline int exception_cb_rec2(u64 c)
+{
+	bpf_set_exception_callback(exception_cb_rec2);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("can only be called from main prog")
+int reject_set_exception_cb_rec2(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_rec1);
+	return 0;
+}
+
+static __noinline int exception_cb_rec3(u64 c)
+{
+	bpf_set_exception_callback(exception_cb1);
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("can only be called from main prog")
+int reject_set_exception_cb_rec3(struct __sk_buff *ctx)
+{
+	bpf_set_exception_callback(exception_cb_rec3);
+	return 0;
+}
+
+static __noinline int exception_cb_bad_ret(u64 c)
+{
+	return 4242;
+}
+
+SEC("?fentry/bpf_check")
+__failure __msg("At program exit the register R0 has value")
+int reject_set_exception_cb_bad_ret(void *ctx)
+{
+	bpf_set_exception_callback(exception_cb_bad_ret);
+	return 0;
+}
+
+__noinline static int loop_cb1(u32 index, int *ctx)
+{
+	throw;
+	return 0;
+}
+
+__noinline static int loop_cb2(u32 index, int *ctx)
+{
+	throw;
+	return 0;
+}
+
+SEC("?tc")
+__failure __msg("cannot be called from callback")
+int reject_exception_throw_cb(struct __sk_buff *ctx)
+{
+	volatile int ret = 1;
+
+	bpf_loop(5, loop_cb1, NULL, 0);
+	return ret;
+}
+
+SEC("?tc")
+__failure __msg("cannot be called from callback")
+int exception_throw_cb_diff(struct __sk_buff *ctx)
+{
+	volatile int ret = 1;
+
+	if (ctx->protocol)
+		bpf_loop(5, loop_cb1, NULL, 0);
+	else
+		bpf_loop(5, loop_cb2, NULL, 0);
+	return ret;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs
  2023-07-13  2:32 ` [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs Kumar Kartikeya Dwivedi
@ 2023-07-14 21:58   ` Alexei Starovoitov
  2023-07-17 16:21     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-14 21:58 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:26AM +0530, Kumar Kartikeya Dwivedi wrote:
> Introduce support in the verifier for generating a subprogram and
> include it as part of a BPF program dynamically after the do_check
> phase is complete. The appropriate place of invocation would be
> do_misc_fixups.
> 
> Since they are always appended to the end of the instruction sequence of
> the program, it becomes relatively inexpensive to do the related
> adjustments to the subprog_info of the program. Only the fake exit
> subprogram is shifted forward by 1, making room for our invented subprog.
> 
> This is useful to insert a new subprogram and obtain its function
> pointer. The next patch will use this functionality to insert a default
> exception callback which will be invoked after unwinding the stack.
> 
> Note that these invented subprograms are invisible to userspace, and
> never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
> invented program is supported, but more can be easily supported in the
> future.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h          |  1 +
>  include/linux/bpf_verifier.h |  4 +++-
>  kernel/bpf/core.c            |  4 ++--
>  kernel/bpf/syscall.c         | 19 ++++++++++++++++++-
>  kernel/bpf/verifier.c        | 29 ++++++++++++++++++++++++++++-
>  5 files changed, 52 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 360433f14496..70f212dddfbf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1385,6 +1385,7 @@ struct bpf_prog_aux {
>  	bool sleepable;
>  	bool tail_call_reachable;
>  	bool xdp_has_frags;
> +	bool invented_prog;
>  	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
>  	const struct btf_type *attach_func_proto;
>  	/* function name for valid attach_btf_id */
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index f70f9ac884d2..360aa304ec09 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -540,6 +540,7 @@ struct bpf_subprog_info {
>  	bool has_tail_call;
>  	bool tail_call_reachable;
>  	bool has_ld_abs;
> +	bool invented_prog;
>  	bool is_async_cb;
>  };
>  
> @@ -594,10 +595,11 @@ struct bpf_verifier_env {
>  	bool bypass_spec_v1;
>  	bool bypass_spec_v4;
>  	bool seen_direct_write;
> +	bool invented_prog;

Instead of a flag in two places, how about adding aux->func_cnt_real
and using it in JITing and freeing, while get_info*() keeps using aux->func_cnt?

> +/* The function requires that first instruction in 'patch' is insnsi[prog->len - 1] */
> +static int invent_subprog(struct bpf_verifier_env *env, struct bpf_insn *patch, int len)
> +{
> +	struct bpf_subprog_info *info = env->subprog_info;
> +	int cnt = env->subprog_cnt;
> +	struct bpf_prog *prog;
> +
> +	if (env->invented_prog) {
> +		verbose(env, "verifier internal error: only one invented prog supported\n");
> +		return -EFAULT;
> +	}
> +	prog = bpf_patch_insn_data(env, env->prog->len - 1, patch, len);

The actual patching is not necessary.
bpf_prog_realloc() and memcpy would be enough, no?
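
A rough sketch of that alternative (assuming bpf_prog_realloc()
semantics; subprog_info and aux bookkeeping omitted, untested) might
look like:

	/* With this approach, 'patch' would hold only the new instructions,
	 * unlike the bpf_patch_insn_data() variant which also needs the
	 * duplicated last insn as its first element.
	 */
	prog = bpf_prog_realloc(env->prog, bpf_prog_size(env->prog->len + len), GFP_USER);
	if (!prog)
		return -ENOMEM;
	memcpy(prog->insnsi + prog->len, patch, sizeof(*patch) * len);
	prog->len += len;
	env->prog = prog;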

> +	if (!prog)
> +		return -ENOMEM;
> +	env->prog = prog;
> +	info[cnt + 1].start = info[cnt].start;
> +	info[cnt].start = prog->len - len + 1;
> +	info[cnt].invented_prog = true;
> +	env->subprog_cnt++;
> +	env->invented_prog = true;
> +	return 0;
> +}
> +
>  /* Do various post-verification rewrites in a single program pass.
>   * These rewrites simplify JIT and interpreter implementations.
>   */
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk
  2023-07-13  2:32 ` [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk Kumar Kartikeya Dwivedi
@ 2023-07-14 22:05   ` Alexei Starovoitov
  2023-07-17 16:29     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-14 22:05 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:27AM +0530, Kumar Kartikeya Dwivedi wrote:
> The plumbing for offline unwinding when we throw an exception in
> programs would require walking the stack, hence introduce a new
> arch_bpf_stack_walk function. This is provided when the JIT supports
> exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
> code is really minimal, hence it should be straightforward to extend this
> support to other architectures as well, as it reuses the logic of
> arch_stack_walk, but allowing access to unwind_state data.
> 
> Once the stack pointer and frame pointer are known for the main subprog
> during the unwinding, we know the stack layout and location of any
> callee-saved registers which must be restored before we return back to
> the kernel.
> 
> This handling will be added in the next patch.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 21 +++++++++++++++++++++
>  include/linux/filter.h      |  2 ++
>  kernel/bpf/core.c           |  9 +++++++++
>  3 files changed, 32 insertions(+)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 438adb695daa..d326503ce242 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -16,6 +16,7 @@
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/unwind.h>
>  
>  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
>  {
> @@ -2660,3 +2661,23 @@ void bpf_jit_free(struct bpf_prog *prog)
>  
>  	bpf_prog_unlock_free(prog);
>  }
> +
> +bool bpf_jit_supports_exceptions(void)
> +{
> +	return IS_ENABLED(CONFIG_UNWINDER_ORC) || IS_ENABLED(CONFIG_UNWINDER_FRAME_POINTER);
> +}
> +
> +void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> +{
> +#if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
> +	struct unwind_state state;
> +	unsigned long addr;
> +
> +	for (unwind_start(&state, current, NULL, NULL); !unwind_done(&state);
> +	     unwind_next_frame(&state)) {
> +		addr = unwind_get_return_address(&state);

I think these steps will work even with UNWINDER_GUESS.
What is the reason for #ifdef ?

> +		if (!addr || !consume_fn(cookie, (u64)addr, (u64)state.sp, (u64)state.bp))
> +			break;
> +	}
> +#endif
> +}
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index f69114083ec7..21ac801330bb 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -920,6 +920,8 @@ bool bpf_jit_needs_zext(void);
>  bool bpf_jit_supports_subprog_tailcalls(void);
>  bool bpf_jit_supports_kfunc_call(void);
>  bool bpf_jit_supports_far_kfunc_call(void);
> +bool bpf_jit_supports_exceptions(void);
> +void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
>  bool bpf_helper_changes_pkt_data(void *func);
>  
>  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 5c484b2bc3d6..5e224cf0ec27 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2770,6 +2770,15 @@ int __weak bpf_arch_text_invalidate(void *dst, size_t len)
>  	return -ENOTSUPP;
>  }
>  
> +bool __weak bpf_jit_supports_exceptions(void)
> +{
> +	return false;
> +}
> +
> +void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> +{
> +}
> +
>  #ifdef CONFIG_BPF_SYSCALL
>  static int __init bpf_global_ma_init(void)
>  {
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc
  2023-07-13  2:32 ` [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc Kumar Kartikeya Dwivedi
@ 2023-07-14 22:34   ` Alexei Starovoitov
  2023-07-17 16:43     ` Kumar Kartikeya Dwivedi
  2023-07-14 22:51   ` Alexei Starovoitov
  1 sibling, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-14 22:34 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:28AM +0530, Kumar Kartikeya Dwivedi wrote:
> This patch implements BPF exceptions, and introduces a bpf_throw kfunc
> to allow programs to throw exceptions during their execution at runtime.
> A bpf_throw invocation is treated as an immediate termination of the
> program, returning back to its caller within the kernel, unwinding all
> stack frames.
> 
> This allows the program to simplify its implementation, by testing for
> runtime conditions which the verifier has no visibility into, and assert
> that they are true. In case they are not, the program can simply throw
> an exception from the other branch.
> 
> BPF exceptions are explicitly *NOT* an unlikely slowpath error handling
> primitive, and this objective has guided the design choices of their
> implementation within the kernel (with the bulk of the cost
> of unwinding the stack offloaded to the bpf_throw kfunc).
> 
> The implementation of this mechanism requires use of the invent_subprog

Let's come up with a different name than 'invent_subprog'.
How about 'add_hidden_subprog' ?

> mechanism introduced in the previous patch, which generates a couple of
> instructions to zero R0 and exit. The JIT then rewrites the prologue of
> this subprog to take the stack pointer and frame pointer as inputs and
> reset the stack frame, popping all callee-saved registers saved by the
> main subprog. The bpf_throw function then walks the stack at runtime,
> and invokes this exception subprog with the stack and frame pointers as
> parameters.
> 
> Reviewers must take note that currently the main program is made to save
> all callee-saved registers on x86_64 during entry into the program. This
> is because we must do the equivalent of a lightweight context switch when
> unwinding the stack; therefore, we need the callee-saved registers of the
> caller of the BPF program in order to return with a sane state.
> 
> Note that we have to additionally handle r12, even though it is not used
> by the program, because when throwing the exception the program makes an
> entry into the kernel which could clobber r12 after saving it on the
> stack. To be able to preserve the value we received on program entry, we
> push r12 and restore it from the generated subprogram when unwinding the
> stack.
> 
> All of this can however be addressed by recording which callee-saved
> registers are saved for each program, and then restore them from the
> corresponding stack frames (mapped to each program) when unwinding. This
> would not require pushing all callee-saved registers on entry into a BPF
> program. However, this optimization is deferred for a future patch to
> manage the number of moving pieces within this set.

Let's keep it permanent. Not worth optimizing further.
bpf progs are getting bigger, and they use all 4 regs more often than not.

> For now, bpf_throw invocation fails when lingering resources or locks
> exist in that path of the program. In a future followup, bpf_throw will
> be extended to perform frame-by-frame unwinding to release lingering
> resources for each stack frame, removing this limitation.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c                   |  73 +++++++----
>  include/linux/bpf.h                           |   3 +
>  include/linux/bpf_verifier.h                  |   4 +
>  include/linux/filter.h                        |   6 +
>  kernel/bpf/core.c                             |   2 +-
>  kernel/bpf/helpers.c                          |  38 ++++++
>  kernel/bpf/verifier.c                         | 124 ++++++++++++++++--
>  .../testing/selftests/bpf/bpf_experimental.h  |   6 +
>  8 files changed, 219 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d326503ce242..8d97c6a60f9a 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -256,32 +256,36 @@ struct jit_context {
>  /* Number of bytes that will be skipped on tailcall */
>  #define X86_TAIL_CALL_OFFSET	(11 + ENDBR_INSN_SIZE)
>  
> -static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
> +static void push_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
>  {
>  	u8 *prog = *pprog;
>  
> -	if (callee_regs_used[0])
> +	if (callee_regs_used[0] || force)
>  		EMIT1(0x53);         /* push rbx */
> -	if (callee_regs_used[1])
> +	if (force)
> +		EMIT2(0x41, 0x54);   /* push r12 */

let's make r12 push/pop explicit. In addition to push_callee_regs.

> +	if (callee_regs_used[1] || force)
>  		EMIT2(0x41, 0x55);   /* push r13 */
> -	if (callee_regs_used[2])
> +	if (callee_regs_used[2] || force)
>  		EMIT2(0x41, 0x56);   /* push r14 */
> -	if (callee_regs_used[3])
> +	if (callee_regs_used[3] || force)
>  		EMIT2(0x41, 0x57);   /* push r15 */

Instead of passing an extra bool through the call chain, how about doing:

if (bpf_prog->aux->seen_exception)
    memset(callee_regs_used, 1, sizeof(..));
else
    detect_reg_usage()
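
A minimal sketch at the detect_reg_usage() call site in do_jit()
(untested; assuming detect_reg_usage() keeps its current arguments):

	/* Unwinding must be able to restore all callee-saved regs from the
	 * main frame, so force-save all of them instead of tracking usage.
	 */
	if (bpf_prog->aux->seen_exception)
		memset(callee_regs_used, 1, sizeof(callee_regs_used));
	else
		detect_reg_usage(insn, insn_cnt, callee_regs_used,
				 &tail_call_seen);

push_callee_regs()/pop_callee_regs() could then keep their current
two-argument form.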

>  	*pprog = prog;
>  }
>  
> -static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
> +static void pop_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
>  {
>  	u8 *prog = *pprog;
>  
> -	if (callee_regs_used[3])
> +	if (callee_regs_used[3] || force)
>  		EMIT2(0x41, 0x5F);   /* pop r15 */
> -	if (callee_regs_used[2])
> +	if (callee_regs_used[2] || force)
>  		EMIT2(0x41, 0x5E);   /* pop r14 */
> -	if (callee_regs_used[1])
> +	if (callee_regs_used[1] || force)
>  		EMIT2(0x41, 0x5D);   /* pop r13 */
> -	if (callee_regs_used[0])
> +	if (force)
> +		EMIT2(0x41, 0x5C);   /* pop r12 */
> +	if (callee_regs_used[0] || force)
>  		EMIT1(0x5B);         /* pop rbx */
>  	*pprog = prog;
>  }
> @@ -292,7 +296,8 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
>   * while jumping to another program
>   */
>  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> -			  bool tail_call_reachable, bool is_subprog)
> +			  bool tail_call_reachable, bool is_subprog,
> +			  bool is_exception_cb)
>  {
>  	u8 *prog = *pprog;
>  
> @@ -308,8 +313,23 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
>  		else
>  			EMIT2(0x66, 0x90); /* nop2 */
>  	}
> -	EMIT1(0x55);             /* push rbp */
> -	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> +	/* Exception callback receives FP as second parameter */
> +	if (is_exception_cb) {
> +		bool regs_used[4] = {};
> +
> +		EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
> +		EMIT3(0x48, 0x89, 0xD5); /* mov rbp, rdx */
> +		/* The main frame must have seen_exception as true, so we first
> +		 * restore those callee-saved regs from stack, before reusing
> +		 * the stack frame.
> +		 */
> +		pop_callee_regs(&prog, regs_used, true);
> +		/* Reset the stack frame. */
> +		EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
> +	} else {
> +		EMIT1(0x55);             /* push rbp */
> +		EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> +	}
>  
>  	/* X86_TAIL_CALL_OFFSET is here */
>  	EMIT_ENDBR();
> @@ -468,10 +488,12 @@ static void emit_return(u8 **pprog, u8 *ip)
>   *   goto *(prog->bpf_func + prologue_size);
>   * out:
>   */
> -static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> +static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
> +					u8 **pprog, bool *callee_regs_used,
>  					u32 stack_depth, u8 *ip,
>  					struct jit_context *ctx)
>  {
> +	bool force_pop_all = bpf_prog->aux->seen_exception;
>  	int tcc_off = -4 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
> @@ -518,7 +540,7 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
>  	offset = ctx->tail_call_indirect_label - (prog + 2 - start);
>  	EMIT2(X86_JE, offset);                    /* je out */
>  
> -	pop_callee_regs(&prog, callee_regs_used);
> +	pop_callee_regs(&prog, callee_regs_used, force_pop_all);
>  
>  	EMIT1(0x58);                              /* pop rax */
>  	if (stack_depth)
> @@ -542,11 +564,13 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
>  	*pprog = prog;
>  }
>  
> -static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
> +static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
> +				      struct bpf_jit_poke_descriptor *poke,
>  				      u8 **pprog, u8 *ip,
>  				      bool *callee_regs_used, u32 stack_depth,
>  				      struct jit_context *ctx)
>  {
> +	bool force_pop_all = bpf_prog->aux->seen_exception;
>  	int tcc_off = -4 - round_up(stack_depth, 8);
>  	u8 *prog = *pprog, *start = *pprog;
>  	int offset;
> @@ -571,7 +595,7 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
>  	emit_jump(&prog, (u8 *)poke->tailcall_target + X86_PATCH_SIZE,
>  		  poke->tailcall_bypass);
>  
> -	pop_callee_regs(&prog, callee_regs_used);
> +	pop_callee_regs(&prog, callee_regs_used, force_pop_all);
>  	EMIT1(0x58);                                  /* pop rax */
>  	if (stack_depth)
>  		EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
> @@ -987,8 +1011,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>  
>  	emit_prologue(&prog, bpf_prog->aux->stack_depth,
>  		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
> -		      bpf_prog->aux->func_idx != 0);
> -	push_callee_regs(&prog, callee_regs_used);
> +		      bpf_prog->aux->func_idx != 0, bpf_prog->aux->exception_cb);
> +	/* Exception callback will clobber callee regs for its own use, and
> +	 * restore the original callee regs from main prog's stack frame.
> +	 */
> +	push_callee_regs(&prog, callee_regs_used, bpf_prog->aux->seen_exception);
>  
>  	ilen = prog - temp;
>  	if (rw_image)
> @@ -1557,13 +1584,15 @@ st:			if (is_imm8(insn->off))
>  
>  		case BPF_JMP | BPF_TAIL_CALL:
>  			if (imm32)
> -				emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
> +				emit_bpf_tail_call_direct(bpf_prog,
> +							  &bpf_prog->aux->poke_tab[imm32 - 1],
>  							  &prog, image + addrs[i - 1],
>  							  callee_regs_used,
>  							  bpf_prog->aux->stack_depth,
>  							  ctx);
>  			else
> -				emit_bpf_tail_call_indirect(&prog,
> +				emit_bpf_tail_call_indirect(bpf_prog,
> +							    &prog,
>  							    callee_regs_used,
>  							    bpf_prog->aux->stack_depth,
>  							    image + addrs[i - 1],
> @@ -1808,7 +1837,7 @@ st:			if (is_imm8(insn->off))
>  			seen_exit = true;
>  			/* Update cleanup_addr */
>  			ctx->cleanup_addr = proglen;
> -			pop_callee_regs(&prog, callee_regs_used);
> +			pop_callee_regs(&prog, callee_regs_used, bpf_prog->aux->seen_exception);
>  			EMIT1(0xC9);         /* leave */
>  			emit_return(&prog, image + addrs[i - 1] + (prog - temp));
>  			break;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 70f212dddfbf..61cdb291311f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1386,6 +1386,8 @@ struct bpf_prog_aux {
>  	bool tail_call_reachable;
>  	bool xdp_has_frags;
>  	bool invented_prog;
> +	bool exception_cb;
> +	bool seen_exception;
>  	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
>  	const struct btf_type *attach_func_proto;
>  	/* function name for valid attach_btf_id */
> @@ -1408,6 +1410,7 @@ struct bpf_prog_aux {
>  	int cgroup_atype; /* enum cgroup_bpf_attach_type */
>  	struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>  	char name[BPF_OBJ_NAME_LEN];
> +	unsigned int (*bpf_exception_cb)(u64 cookie, u64 sp, u64 bp);
>  #ifdef CONFIG_SECURITY
>  	void *security;
>  #endif
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 360aa304ec09..e28386fa462f 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -541,7 +541,9 @@ struct bpf_subprog_info {
>  	bool tail_call_reachable;
>  	bool has_ld_abs;
>  	bool invented_prog;
> +	bool is_cb;
>  	bool is_async_cb;
> +	bool is_exception_cb;
>  };
>  
>  struct bpf_verifier_env;
> @@ -588,6 +590,7 @@ struct bpf_verifier_env {
>  	u32 used_map_cnt;		/* number of used maps */
>  	u32 used_btf_cnt;		/* number of used BTF objects */
>  	u32 id_gen;			/* used to generate unique reg IDs */
> +	int exception_callback_subprog;
>  	bool explore_alu_limits;
>  	bool allow_ptr_leaks;
>  	bool allow_uninit_stack;
> @@ -596,6 +599,7 @@ struct bpf_verifier_env {
>  	bool bypass_spec_v4;
>  	bool seen_direct_write;
>  	bool invented_prog;
> +	bool seen_exception;
>  	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
>  	const struct bpf_line_info *prev_linfo;
>  	struct bpf_verifier_log log;
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 21ac801330bb..f45a54f8dd7d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1137,6 +1137,7 @@ const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
>  bool is_bpf_text_address(unsigned long addr);
>  int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
>  		    char *sym);
> +struct bpf_prog *bpf_prog_ksym_find(unsigned long addr);
>  
>  static inline const char *
>  bpf_address_lookup(unsigned long addr, unsigned long *size,
> @@ -1204,6 +1205,11 @@ static inline int bpf_get_kallsym(unsigned int symnum, unsigned long *value,
>  	return -ERANGE;
>  }
>  
> +static inline struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
> +{
> +	return NULL;
> +}
> +
>  static inline const char *
>  bpf_address_lookup(unsigned long addr, unsigned long *size,
>  		   unsigned long *off, char **modname, char *sym)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 5e224cf0ec27..efbc2f965226 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -723,7 +723,7 @@ bool is_bpf_text_address(unsigned long addr)
>  	return ret;
>  }
>  
> -static struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
> +struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
>  {
>  	struct bpf_ksym *ksym = bpf_ksym_find(addr);
>  
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 9e80efa59a5d..da1493a1da25 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2402,6 +2402,43 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
>  	rcu_read_unlock();
>  }
>  
> +struct bpf_throw_ctx {
> +	struct bpf_prog_aux *aux;
> +	u64 sp;
> +	u64 bp;
> +	int cnt;
> +};
> +
> +static bool bpf_stack_walker(void *cookie, u64 ip, u64 sp, u64 bp)
> +{
> +	struct bpf_throw_ctx *ctx = cookie;
> +	struct bpf_prog *prog;
> +
> +	if (!is_bpf_text_address(ip))
> +		return !ctx->cnt;
> +	prog = bpf_prog_ksym_find(ip);
> +	ctx->cnt++;
> +	if (!prog->aux->id)
> +		return true;
> +	ctx->aux = prog->aux;
> +	ctx->sp = sp;
> +	ctx->bp = bp;
> +	return false;
> +}
> +
> +__bpf_kfunc void bpf_throw(u64 cookie)
> +{
> +	struct bpf_throw_ctx ctx = {};
> +
> +	arch_bpf_stack_walk(bpf_stack_walker, &ctx);
> +	WARN_ON_ONCE(!ctx.aux);
> +	if (ctx.aux)
> +		WARN_ON_ONCE(!ctx.aux->seen_exception);
> +	WARN_ON_ONCE(!ctx.bp);
> +	WARN_ON_ONCE(!ctx.cnt);
> +	ctx.aux->bpf_exception_cb(cookie, ctx.sp, ctx.bp);
> +}
> +
>  __diag_pop();
>  
>  BTF_SET8_START(generic_btf_ids)
> @@ -2429,6 +2466,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
>  #endif
>  BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_throw)
>  BTF_SET8_END(generic_btf_ids)
>  
>  static const struct btf_kfunc_id_set generic_kfunc_set = {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 8ce72a7b4758..61101a87d96b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -542,6 +542,8 @@ static bool is_dynptr_ref_function(enum bpf_func_id func_id)
>  }
>  
>  static bool is_callback_calling_kfunc(u32 btf_id);
> +static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id);
> +static bool is_bpf_throw_kfunc(struct bpf_insn *insn);
>  
>  static bool is_callback_calling_function(enum bpf_func_id func_id)
>  {
> @@ -2864,11 +2866,12 @@ static int check_subprogs(struct bpf_verifier_env *env)
>  		if (i == subprog_end - 1) {
>  			/* to avoid fall-through from one subprog into another
>  			 * the last insn of the subprog should be either exit
> -			 * or unconditional jump back
> +			 * or unconditional jump back or bpf_throw call
>  			 */
>  			if (code != (BPF_JMP | BPF_EXIT) &&
> -			    code != (BPF_JMP | BPF_JA)) {
> -				verbose(env, "last insn is not an exit or jmp\n");
> +			    code != (BPF_JMP | BPF_JA) &&
> +			    !is_bpf_throw_kfunc(insn + i)) {
> +				verbose(env, "last insn is not an exit or jmp or bpf_throw call\n");
>  				return -EINVAL;
>  			}
>  			subprog_start = subprog_end;
> @@ -5625,6 +5628,25 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>  	for (; i < subprog_end; i++) {
>  		int next_insn, sidx;
>  
> +		if (bpf_pseudo_kfunc_call(insn + i) && !insn[i].off) {
> +			bool err = false;
> +
> +			if (!is_forbidden_exception_kfunc_in_cb(insn[i].imm))

Just use is_bpf_throw_kfunc(). No need for another helper.
The compiler will optimize the redundant checks.

> +				continue;
> +			if (subprog[idx].is_cb)
> +				err = true;
> +			for (int c = 0; c < frame && !err; c++) {
> +				if (subprog[ret_prog[c]].is_cb) {
> +					err = true;
> +					break;
> +				}
> +			}
> +			if (!err)
> +				continue;
> +			verbose(env, "exception kfunc insn %d cannot be called from callback\n", i);
> +			return -EINVAL;
> +		}
> +
>  		if (!bpf_pseudo_call(insn + i) && !bpf_pseudo_func(insn + i))
>  			continue;
>  		/* remember insn and function to return to */
> @@ -8734,6 +8756,7 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  	 * callbacks
>  	 */
>  	if (set_callee_state_cb != set_callee_state) {
> +		env->subprog_info[subprog].is_cb = true;
>  		if (bpf_pseudo_kfunc_call(insn) &&
>  		    !is_callback_calling_kfunc(insn->imm)) {
>  			verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
> @@ -9247,17 +9270,17 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>  	return 0;
>  }
>  
> -static int check_reference_leak(struct bpf_verifier_env *env)
> +static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
>  {
>  	struct bpf_func_state *state = cur_func(env);
>  	bool refs_lingering = false;
>  	int i;
>  
> -	if (state->frameno && !state->in_callback_fn)
> +	if (!exception_exit && state->frameno && !state->in_callback_fn)
>  		return 0;
>  
>  	for (i = 0; i < state->acquired_refs; i++) {
> -		if (state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> +		if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
>  			continue;
>  		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
>  			state->refs[i].id, state->refs[i].insn_idx);
> @@ -9491,7 +9514,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  
>  	switch (func_id) {
>  	case BPF_FUNC_tail_call:
> -		err = check_reference_leak(env);
> +		err = check_reference_leak(env, false);
>  		if (err) {
>  			verbose(env, "tail_call would lead to reference leak\n");
>  			return err;
> @@ -10109,6 +10132,7 @@ enum special_kfunc_type {
>  	KF_bpf_dynptr_slice,
>  	KF_bpf_dynptr_slice_rdwr,
>  	KF_bpf_dynptr_clone,
> +	KF_bpf_throw,
>  };
>  
>  BTF_SET_START(special_kfunc_set)
> @@ -10129,6 +10153,7 @@ BTF_ID(func, bpf_dynptr_from_xdp)
>  BTF_ID(func, bpf_dynptr_slice)
>  BTF_ID(func, bpf_dynptr_slice_rdwr)
>  BTF_ID(func, bpf_dynptr_clone)
> +BTF_ID(func, bpf_throw)
>  BTF_SET_END(special_kfunc_set)
>  
>  BTF_ID_LIST(special_kfunc_list)
> @@ -10151,6 +10176,7 @@ BTF_ID(func, bpf_dynptr_from_xdp)
>  BTF_ID(func, bpf_dynptr_slice)
>  BTF_ID(func, bpf_dynptr_slice_rdwr)
>  BTF_ID(func, bpf_dynptr_clone)
> +BTF_ID(func, bpf_throw)
>  
>  static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
>  {
> @@ -10464,6 +10490,17 @@ static bool is_callback_calling_kfunc(u32 btf_id)
>  	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
>  }
>  
> +static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
> +{
> +	return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
> +	       insn->imm == special_kfunc_list[KF_bpf_throw];
> +}
> +
> +static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id)
> +{
> +	return btf_id == special_kfunc_list[KF_bpf_throw];
> +}
> +
>  static bool is_rbtree_lock_required_kfunc(u32 btf_id)
>  {
>  	return is_bpf_rbtree_api_kfunc(btf_id);
> @@ -11140,6 +11177,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  	const struct btf_param *args;
>  	const struct btf_type *ret_t;
>  	struct btf *desc_btf;
> +	bool throw = false;
>  
>  	/* skip for now, but return error when we find this in fixup_kfunc_call */
>  	if (!insn->imm)
> @@ -11242,6 +11280,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  		}
>  	}
>  
> +	if (meta.func_id == special_kfunc_list[KF_bpf_throw]) {
> +		if (!bpf_jit_supports_exceptions()) {
> +			verbose(env, "JIT does not support calling kfunc %s#%d\n",
> +				func_name, meta.func_id);
> +			return -EINVAL;
> +		}
> +		env->seen_exception = true;
> +		throw = true;
> +	}
> +
>  	for (i = 0; i < CALLER_SAVED_REGS; i++)
>  		mark_reg_not_init(env, regs, caller_saved[i]);
>  
> @@ -11463,7 +11511,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  			return err;
>  	}
>  
> -	return 0;
> +	return throw ? 1 : 0;

I don't like this in-band signaling.
Just check for special_kfunc_list[KF_bpf_throw] when check_kfunc_call() returns zero.
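
Something along these lines in do_check() (untested sketch, reusing the
is_bpf_throw_kfunc() helper this patch adds):

	} else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
		err = check_kfunc_call(env, insn, &env->insn_idx);
		/* no special return value; inspect the insn instead */
		if (!err && is_bpf_throw_kfunc(insn)) {
			exception_exit = true;
			goto process_bpf_exit_full;
		}
	}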

>  }
>  
>  static bool signed_add_overflows(s64 a, s64 b)
> @@ -14211,7 +14259,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
>  	 * gen_ld_abs() may terminate the program at runtime, leading to
>  	 * reference leak.
>  	 */
> -	err = check_reference_leak(env);
> +	err = check_reference_leak(env, false);
>  	if (err) {
>  		verbose(env, "BPF_LD_[ABS|IND] cannot be mixed with socket references\n");
>  		return err;
> @@ -14619,6 +14667,9 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
>  		if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>  			struct bpf_kfunc_call_arg_meta meta;
>  
> +			/* No fallthrough edge to walk, same as BPF_EXIT */
> +			if (is_bpf_throw_kfunc(insn))
> +				return DONE_EXPLORING;
>  			ret = fetch_kfunc_meta(env, insn, &meta, NULL);
>  			if (ret == 0 && is_iter_next_kfunc(&meta)) {
>  				mark_prune_point(env, t);
> @@ -16222,6 +16273,7 @@ static int do_check(struct bpf_verifier_env *env)
>  	int prev_insn_idx = -1;
>  
>  	for (;;) {
> +		bool exception_exit = false;
>  		struct bpf_insn *insn;
>  		u8 class;
>  		int err;
> @@ -16435,12 +16487,18 @@ static int do_check(struct bpf_verifier_env *env)
>  						return -EINVAL;
>  					}
>  				}
> -				if (insn->src_reg == BPF_PSEUDO_CALL)
> +				if (insn->src_reg == BPF_PSEUDO_CALL) {
>  					err = check_func_call(env, insn, &env->insn_idx);
> -				else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL)
> +				} else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>  					err = check_kfunc_call(env, insn, &env->insn_idx);
> -				else
> +					if (err == 1) {
> +						err = 0;
> +						exception_exit = true;
> +						goto process_bpf_exit_full;
> +					}
> +				} else {
>  					err = check_helper_call(env, insn, &env->insn_idx);
> +				}
>  				if (err)
>  					return err;
>  
> @@ -16467,7 +16525,7 @@ static int do_check(struct bpf_verifier_env *env)
>  					verbose(env, "BPF_EXIT uses reserved fields\n");
>  					return -EINVAL;
>  				}
> -
> +process_bpf_exit_full:
>  				if (env->cur_state->active_lock.ptr &&
>  				    !in_rbtree_lock_required_cb(env)) {
>  					verbose(env, "bpf_spin_unlock is missing\n");
> @@ -16485,10 +16543,23 @@ static int do_check(struct bpf_verifier_env *env)
>  				 * function, for which reference_state must
>  				 * match caller reference state when it exits.
>  				 */
> -				err = check_reference_leak(env);
> +				err = check_reference_leak(env, exception_exit);
>  				if (err)
>  					return err;
>  
> +				/* The side effect of the prepare_func_exit
> +				 * which is being skipped is that it frees
> +				 * bpf_func_state. Typically, process_bpf_exit
> +				 * will only be hit with outermost exit.
> +				 * copy_verifier_state in pop_stack will handle
> +				 * freeing of any extra bpf_func_state left over
> +				 * from not processing all nested function
> +				 * exits. We also skip return code checks as
> +				 * they are not needed for exceptional exits.
> +				 */
> +				if (exception_exit)
> +					goto process_bpf_exit;
> +
>  				if (state->curframe) {
>  					/* exit from nested function */
>  					err = prepare_func_exit(env, &env->insn_idx);
> @@ -17782,6 +17853,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  		func[i]->aux->num_exentries = num_exentries;
>  		func[i]->aux->tail_call_reachable = env->subprog_info[i].tail_call_reachable;
>  		func[i]->aux->invented_prog = env->subprog_info[i].invented_prog;
> +		func[i]->aux->exception_cb = env->subprog_info[i].is_exception_cb;
> +		if (!i)
> +			func[i]->aux->seen_exception = env->seen_exception;

why conditional?

>  		func[i] = bpf_int_jit_compile(func[i]);
>  		if (!func[i]->jited) {
>  			err = -ENOTSUPP;
> @@ -17868,6 +17942,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  	prog->aux->num_exentries = func[0]->aux->num_exentries;
>  	prog->aux->func = func;
>  	prog->aux->func_cnt = env->subprog_cnt;
> +	prog->aux->bpf_exception_cb = (void *)func[env->exception_callback_subprog]->bpf_func;
> +	prog->aux->seen_exception = func[0]->aux->seen_exception;
>  	bpf_prog_jit_attempt_done(prog);
>  	return 0;
>  out_free:
> @@ -18116,6 +18192,26 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>  	struct bpf_map *map_ptr;
>  	int i, ret, cnt, delta = 0;
>  
> +	if (env->seen_exception && !env->exception_callback_subprog) {
> +		struct bpf_insn patch[] = {
> +			env->prog->insnsi[insn_cnt - 1],
> +			BPF_MOV64_IMM(BPF_REG_0, 0),
> +			BPF_EXIT_INSN(),
> +		};
> +
> +		ret = invent_subprog(env, patch, ARRAY_SIZE(patch));
> +		if (ret < 0)
> +			return ret;
> +		prog = env->prog;
> +		insn = prog->insnsi;
> +
> +		env->exception_callback_subprog = env->subprog_cnt - 1;
> +		/* Don't update insn_cnt, as invent_subprog always appends insns */
> +		env->subprog_info[env->exception_callback_subprog].is_cb = true;
> +		env->subprog_info[env->exception_callback_subprog].is_async_cb = true;
> +		env->subprog_info[env->exception_callback_subprog].is_exception_cb = true;
> +	}
> +
>  	for (i = 0; i < insn_cnt; i++, insn++) {
>  		/* Make divide-by-zero exceptions impossible. */
>  		if (insn->code == (BPF_ALU64 | BPF_MOD | BPF_X) ||
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index 209811b1993a..f1d7de1349bc 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -131,4 +131,10 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
>   */
>  extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
>  
> +__attribute__((noreturn))
> +extern void bpf_throw(u64 cookie) __ksym;
> +
> +#define throw bpf_throw(0)
> +#define throw_value(cookie) bpf_throw(cookie)
> +
>  #endif
> -- 
> 2.40.1
> 


* Re: [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls
  2023-07-13  2:32 ` [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls Kumar Kartikeya Dwivedi
@ 2023-07-14 22:39   ` Alexei Starovoitov
  2023-07-17 16:36     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-14 22:39 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:29AM +0530, Kumar Kartikeya Dwivedi wrote:
> Now that we allow exception throwing using the bpf_throw kfunc, it can
> appear as the final instruction in a prog. When this happens, and we
> begin to unwind the stack using arch_bpf_stack_walk, the instruction
> pointer (IP) may appear to lie outside the JITed instructions. This
> happens because the return address is the instruction following the
> call, but bpf_throw never returns to the program, so the JIT
> considers the instruction ending at the bpf_throw call to be the final
> JITed instruction and the end of the jited_length for the program.
> 
> This becomes a problem when we search the IP using is_bpf_text_address
> and bpf_prog_ksym_find, both of which use bpf_ksym_find under the hood,
> and it rightfully considers addr == ksym.end to be outside the program's
> boundaries.
> 
> Insert a dummy 'int3' instruction, which will never be hit, to bump the
> jited_length and allow us to handle programs whose final
> instruction is a call to bpf_throw.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  arch/x86/net/bpf_jit_comp.c | 11 +++++++++++
>  include/linux/bpf.h         |  2 ++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 8d97c6a60f9a..052230cc7f50 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1579,6 +1579,17 @@ st:			if (is_imm8(insn->off))
>  			}
>  			if (emit_call(&prog, func, image + addrs[i - 1] + offs))
>  				return -EINVAL;
> +			/* Similar to BPF_EXIT_INSN, call for bpf_throw may be
> +			 * the final instruction in the program. Insert an int3
> +			 * following the call instruction so that we can still
> +			 * detect pc to be part of the bpf_prog in
> +			 * bpf_ksym_find, otherwise when this is the last
> +			 * instruction (as allowed by verifier, similar to exit
> +			 * and jump instructions), pc will be == ksym.end,
> +			 * leading to bpf_throw failing to unwind the stack.
> +			 */
> +			if (func == (u8 *)&bpf_throw)
> +				EMIT1(0xCC); /* int3 */

Probably worth explaining that this happens because bpf_throw is marked
__attribute__((noreturn)) and the compiler can emit it last without a BPF_EXIT insn.
Meaning the program might not have a BPF_EXIT at all.

I wonder though whether this self-inflicted pain is worth it.
May be it shouldn't be marked as noreturn.
What do we gain by marking?


* Re: [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback
  2023-07-13  2:32 ` [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback Kumar Kartikeya Dwivedi
@ 2023-07-14 22:47   ` Alexei Starovoitov
  2023-07-17 16:46     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-14 22:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> By default, the subprog generated by the verifier to handle a thrown
> exception hardcodes a return value of 0. To allow user-defined logic
> and modification of the return value when an exception is thrown,
> introduce the bpf_set_exception_callback kfunc, which installs a
> callback as the default exception handler for the program.
> 
> Compared to runtime kfuncs, this kfunc acts as a built-in, i.e. it only
> takes effect semantically during verification, and is erased from the
> program at runtime.
> 
> This kfunc can only be called once within a program, and always sets the
> global exception handler, regardless of whether it was invoked in all
> paths of the program or not. The kfunc is idempotent, and the default
> exception callback cannot be modified at runtime.
> 
> Allowing modification of the callback for the current program execution
> at runtime leads to issues when the programs begin to nest, as any
> per-CPU state maintaining this information will have to be saved and
> restored. We don't want it to live in bpf_prog_aux, as that would take
> effect globally for all programs. An alternative solution is spilling
> the callback pointer at a known location on the program stack on entry,
> and then passing this location to bpf_throw as a parameter.
> 
> However, since exceptions are geared more towards a use case where they
> are ideally never invoked, optimizing for this use case and adding to
> the complexity has diminishing returns.

Right. No run-time changes pls.

Instead of bpf_set_exception_callback() how about adding a
btf_tag("exception_handler") or a better name,
and checking that such a global func is a single func in a program and
its argument is a single u64?
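
Roughly, the user-facing side could look like this (the tag name and the
exact loader/verifier plumbing are made up here, just to show the shape):

	/* one tagged global func per program, taking a single u64 */
	__attribute__((btf_decl_tag("exception_handler")))
	int exc_handler(u64 cookie)
	{
		return 0; /* becomes the program's return value */
	}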

> In the future, a variant of bpf_throw which allows supplying a callback
> can also be introduced, to modify the callback for a certain throw
> statement. For now, bpf_set_exception_callback is meant to serve as a
> way to statically set a subprog as the exception handler of a BPF
> program.
> 
> TODO: Should we change default behavior to just return whatever cookie
> value was passed to bpf_throw? That might allow people to avoid
> installing a callback in case they just want to manipulate the return
> value.

and the verifier would check that u64 matches allowed return values?
ex: call check_return_code() on argument of bpf_throw?
I guess that makes sense.


* Re: [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc
  2023-07-13  2:32 ` [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc Kumar Kartikeya Dwivedi
  2023-07-14 22:34   ` Alexei Starovoitov
@ 2023-07-14 22:51   ` Alexei Starovoitov
  2023-07-17 16:44     ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-14 22:51 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:28AM +0530, Kumar Kartikeya Dwivedi wrote:
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index 209811b1993a..f1d7de1349bc 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -131,4 +131,10 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
>   */
>  extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
>  
> +__attribute__((noreturn))
> +extern void bpf_throw(u64 cookie) __ksym;
> +
> +#define throw bpf_throw(0)
> +#define throw_value(cookie) bpf_throw(cookie)

Reading patch 10, I think the added value of these two macros is negative.
If it was open-coded as bpf_throw(0); everywhere, it would be easier to read imo.
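
i.e. at the call sites simply write:

	if (!some_cond)	/* some_cond is just a placeholder */
		bpf_throw(0);

rather than hiding the call behind 'throw' / 'throw_value(cookie)'.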


* Re: [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs
  2023-07-14 21:58   ` Alexei Starovoitov
@ 2023-07-17 16:21     ` Kumar Kartikeya Dwivedi
  2023-07-17 17:24       ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 16:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Sat, 15 Jul 2023 at 03:28, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:26AM +0530, Kumar Kartikeya Dwivedi wrote:
> > Introduce support in the verifier for generating a subprogram and
> > include it as part of a BPF program dynamically after the do_check
> > phase is complete. The appropriate place of invocation would be
> > do_misc_fixups.
> >
> > Since they are always appended to the end of the instruction sequence of
> > the program, it becomes relatively inexpensive to do the related
> > adjustments to the subprog_info of the program. Only the fake exit
> > subprogram is shifted forward by 1, making room for our invented subprog.
> >
> > This is useful to insert a new subprogram and obtain its function
> > pointer. The next patch will use this functionality to insert a default
> > exception callback which will be invoked after unwinding the stack.
> >
> > Note that these invented subprograms are invisible to userspace, and
> > never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
> > invented program is supported, but more can be easily supported in the
> > future.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h          |  1 +
> >  include/linux/bpf_verifier.h |  4 +++-
> >  kernel/bpf/core.c            |  4 ++--
> >  kernel/bpf/syscall.c         | 19 ++++++++++++++++++-
> >  kernel/bpf/verifier.c        | 29 ++++++++++++++++++++++++++++-
> >  5 files changed, 52 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 360433f14496..70f212dddfbf 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1385,6 +1385,7 @@ struct bpf_prog_aux {
> >       bool sleepable;
> >       bool tail_call_reachable;
> >       bool xdp_has_frags;
> > +     bool invented_prog;
> >       /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
> >       const struct btf_type *attach_func_proto;
> >       /* function name for valid attach_btf_id */
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index f70f9ac884d2..360aa304ec09 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -540,6 +540,7 @@ struct bpf_subprog_info {
> >       bool has_tail_call;
> >       bool tail_call_reachable;
> >       bool has_ld_abs;
> > +     bool invented_prog;
> >       bool is_async_cb;
> >  };
> >
> > @@ -594,10 +595,11 @@ struct bpf_verifier_env {
> >       bool bypass_spec_v1;
> >       bool bypass_spec_v4;
> >       bool seen_direct_write;
> > +     bool invented_prog;
>
> Instead of a flag in two places how about adding aux->func_cnt_real
> and use it in JITing and free-ing while get_info*() keep using aux->func_cnt.
>

That does seem better, thanks. I'll make the change in v2.

> > +/* The function requires that first instruction in 'patch' is insnsi[prog->len - 1] */
> > +static int invent_subprog(struct bpf_verifier_env *env, struct bpf_insn *patch, int len)
> > +{
> > +     struct bpf_subprog_info *info = env->subprog_info;
> > +     int cnt = env->subprog_cnt;
> > +     struct bpf_prog *prog;
> > +
> > +     if (env->invented_prog) {
> > +             verbose(env, "verifier internal error: only one invented prog supported\n");
> > +             return -EFAULT;
> > +     }
> > +     prog = bpf_patch_insn_data(env, env->prog->len - 1, patch, len);
>
> The actual patching is not necessary.
> bpf_prog_realloc() and memcpy would be enough, no?
>

Yes, it should be fine. But I didn't want to special-case things here,
so as to make sure assumptions elsewhere don't break.
E.g. the code readily assumes every insn has its own insn_aux_data, which
might be broken if we don't expand it.
I think bpf_patch_insn_single is already doing a realloc (and reusing
trailing space in the current allocation if available), so it didn't seem
worth it to me.
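
For reference, the bare realloc route would look roughly like this
(untested sketch; insn_aux_data and subprog_info would still need to be
grown separately, which is what I wanted to avoid open coding):

	prog = bpf_prog_realloc(env->prog,
				bpf_prog_size(env->prog->len + len - 1),
				GFP_USER);
	if (!prog)
		return -ENOMEM;
	/* patch[0] is the old last insn, so overwrite it and append the rest */
	memcpy(prog->insnsi + prog->len - 1, patch, sizeof(*patch) * len);
	prog->len += len - 1;
	env->prog = prog;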

If you still feel it's better I can analyze if anything might break
and make the change.

[...]


* Re: [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk
  2023-07-14 22:05   ` Alexei Starovoitov
@ 2023-07-17 16:29     ` Kumar Kartikeya Dwivedi
  2023-07-17 17:29       ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 16:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Sat, 15 Jul 2023 at 03:35, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:27AM +0530, Kumar Kartikeya Dwivedi wrote:
> > The plumbing for offline unwinding when we throw an exception in
> > programs would require walking the stack, hence introduce a new
> > arch_bpf_stack_walk function. This is provided when the JIT supports
> > exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
> > code is really minimal, hence it should straightforward to extend this
> > support to other architectures as well, as it reuses the logic of
> > arch_stack_walk, but allowing access to unwind_state data.
> >
> > Once the stack pointer and frame pointer are known for the main subprog
> > during the unwinding, we know the stack layout and location of any
> > callee-saved registers which must be restored before we return back to
> > the kernel.
> >
> > This handling will be added in the next patch.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 21 +++++++++++++++++++++
> >  include/linux/filter.h      |  2 ++
> >  kernel/bpf/core.c           |  9 +++++++++
> >  3 files changed, 32 insertions(+)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 438adb695daa..d326503ce242 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -16,6 +16,7 @@
> >  #include <asm/set_memory.h>
> >  #include <asm/nospec-branch.h>
> >  #include <asm/text-patching.h>
> > +#include <asm/unwind.h>
> >
> >  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
> >  {
> > @@ -2660,3 +2661,23 @@ void bpf_jit_free(struct bpf_prog *prog)
> >
> >       bpf_prog_unlock_free(prog);
> >  }
> > +
> > +bool bpf_jit_supports_exceptions(void)
> > +{
> > +     return IS_ENABLED(CONFIG_UNWINDER_ORC) || IS_ENABLED(CONFIG_UNWINDER_FRAME_POINTER);
> > +}
> > +
> > +void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> > +{
> > +#if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
> > +     struct unwind_state state;
> > +     unsigned long addr;
> > +
> > +     for (unwind_start(&state, current, NULL, NULL); !unwind_done(&state);
> > +          unwind_next_frame(&state)) {
> > +             addr = unwind_get_return_address(&state);
>
> I think these steps will work even with UNWINDER_GUESS.
> What is the reason for #ifdef ?

I think we require both unwind_state::sp and unwind_state::bp, but
arch/x86/include/asm/unwind.h does not include unwind_state::bp when
both UNWINDER_ORC and UNWINDER_FRAME_POINTER are unset.

It might be possible, though, to calculate and save the bp offset during
JIT in bpf_prog_aux (by adding roundup(stack_depth) + 8 (push rax if
tail call reachable) + callee_regs_saved) for the subprog
corresponding to a frame. Then we could make it work everywhere.
The JIT would abstract get_prog_bp(sp) behind an arch-specific helper.
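
Rough shape of what I mean (hypothetical helper; callee_regs_saved would
be a new per-prog field recorded at JIT time, the rest exists today):

	/* Reconstruct the frame pointer of a prog's frame from its sp */
	static u64 bpf_prog_get_bp(const struct bpf_prog *prog, u64 sp)
	{
		u32 off = round_up(prog->aux->stack_depth, 8);

		if (prog->aux->tail_call_reachable)
			off += 8;	/* pushed rax */
		off += 8 * prog->aux->callee_regs_saved; /* hypothetical field */
		return sp + off;
	}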

Let me know if I misunderstood something.

>
> > +             if (!addr || !consume_fn(cookie, (u64)addr, (u64)state.sp, (u64)state.bp))
> > +                     break;
> > +     }
> > +#endif
> > +}
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index f69114083ec7..21ac801330bb 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -920,6 +920,8 @@ bool bpf_jit_needs_zext(void);
> >  bool bpf_jit_supports_subprog_tailcalls(void);
> >  bool bpf_jit_supports_kfunc_call(void);
> >  bool bpf_jit_supports_far_kfunc_call(void);
> > +bool bpf_jit_supports_exceptions(void);
> > +void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> >  bool bpf_helper_changes_pkt_data(void *func);
> >
> >  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 5c484b2bc3d6..5e224cf0ec27 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -2770,6 +2770,15 @@ int __weak bpf_arch_text_invalidate(void *dst, size_t len)
> >       return -ENOTSUPP;
> >  }
> >
> > +bool __weak bpf_jit_supports_exceptions(void)
> > +{
> > +     return false;
> > +}
> > +
> > +void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> > +{
> > +}
> > +
> >  #ifdef CONFIG_BPF_SYSCALL
> >  static int __init bpf_global_ma_init(void)
> >  {
> > --
> > 2.40.1
> >


* Re: [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls
  2023-07-14 22:39   ` Alexei Starovoitov
@ 2023-07-17 16:36     ` Kumar Kartikeya Dwivedi
  2023-07-17 17:45       ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 16:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Sat, 15 Jul 2023 at 04:09, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:29AM +0530, Kumar Kartikeya Dwivedi wrote:
> > Now that we allow exception throwing using bpf_throw kfunc, it can
> > appear as the final instruction in a prog. When this happens, and we
> > begin to unwind the stack using arch_bpf_stack_walk, the instruction
> > pointer (IP) may appear to lie outside the JITed instructions. This
> > happens because the return address is the instruction following the
> > call, but the bpf_throw never returns to the program, so the JIT
> > considers instruction ending at the bpf_throw call as the final JITed
> > instruction and end of the jited_length for the program.
> >
> > This becomes a problem when we search the IP using is_bpf_text_address
> > and bpf_prog_ksym_find, both of which use bpf_ksym_find under the hood,
> > and it rightfully considers addr == ksym.end to be outside the program's
> > boundaries.
> >
> > Insert a dummy 'int3' instruction which will never be hit to bump the
> > jited_length and allow us to handle programs with their final
> > isntruction being a call to bpf_throw.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 11 +++++++++++
> >  include/linux/bpf.h         |  2 ++
> >  2 files changed, 13 insertions(+)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 8d97c6a60f9a..052230cc7f50 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -1579,6 +1579,17 @@ st:                    if (is_imm8(insn->off))
> >                       }
> >                       if (emit_call(&prog, func, image + addrs[i - 1] + offs))
> >                               return -EINVAL;
> > +                     /* Similar to BPF_EXIT_INSN, call for bpf_throw may be
> > +                      * the final instruction in the program. Insert an int3
> > +                      * following the call instruction so that we can still
> > +                      * detect pc to be part of the bpf_prog in
> > +                      * bpf_ksym_find, otherwise when this is the last
> > +                      * instruction (as allowed by verifier, similar to exit
> > +                      * and jump instructions), pc will be == ksym.end,
> > +                      * leading to bpf_throw failing to unwind the stack.
> > +                      */
> > +                     if (func == (u8 *)&bpf_throw)
> > +                             EMIT1(0xCC); /* int3 */
>
> Probably worth explaining that this happens because bpf_throw is marked
> __attribute__((noreturn)) and compiler can emit it last without BPF_EXIT insn.
> Meaing the program might not have BPF_EXIT at all.

Yes, sorry about omitting that. I will add it to the commit message in v2.

>
> I wonder though whether this self-inflicted pain is worth it.
> May be it shouldn't be marked as noreturn.
> What do we gain by marking?

It felt like the obvious thing to do to me. The cost on the kernel
side is negligible (at least in my opinion); we just have to allow it
as the final instruction in the program. If it's heavily used, it allows
the compiler to better optimize the code (marking anything after it
unreachable, no need to save registers, etc., although this may not be
a persuasive point for you).

Regardless of this noreturn attribute, I was thinking about whether we
should always emit an extra instruction so that any IP (say, one past
the last instruction) we get for a BPF prog can always be seen as
belonging to it. It is probably only a problem surfaced by this
bpf_throw call at the end, but I was wondering whether doing it
unconditionally makes sense.
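
To make the codegen effect concrete, a toy example (using the
bpf_experimental.h declaration from this series):

	SEC("tc")
	int always_throws(struct __sk_buff *ctx)
	{
		bpf_throw(0);
		/* bpf_throw is noreturn, so the compiler does not need to
		 * emit a BPF_EXIT here; the call can be the very last insn
		 * of the program.
		 */
	}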


* Re: [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc
  2023-07-14 22:34   ` Alexei Starovoitov
@ 2023-07-17 16:43     ` Kumar Kartikeya Dwivedi
  2023-07-17 17:32       ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 16:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Sat, 15 Jul 2023 at 04:04, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:28AM +0530, Kumar Kartikeya Dwivedi wrote:
> > This patch implements BPF exceptions, and introduces a bpf_throw kfunc
> > to allow programs to throw exceptions during their execution at runtime.
> > A bpf_throw invocation is treated as an immediate termination of the
> > program, returning back to its caller within the kernel, unwinding all
> > stack frames.
> >
> > This allows the program to simplify its implementation, by testing for
> > runtime conditions which the verifier has no visibility into, and assert
> > that they are true. In case they are not, the program can simply throw
> > an exception from the other branch.
> >
> > BPF exceptions are explicitly *NOT* an unlikely slowpath error handling
> > primitive, and this objective has guided design choices of the
> > implementation of the them within the kernel (with the bulk of the cost
> > for unwinding the stack offloaded to the bpf_throw kfunc).
> >
> > The implementation of this mechanism requires use of the invent_subprog
>
> Let's come up with a different name than 'invent_subprog'.
> How about 'add_hidden_subprog' ?
>

Sounds good, I will make the change.

> > mechanism introduced in the previous patch, which generates a couple of
> > instructions to zero R0 and exit. The JIT then rewrites the prologue of
> > this subprog to take the stack pointer and frame pointer as inputs and
> > reset the stack frame, popping all callee-saved registers saved by the
> > main subprog. The bpf_throw function then walks the stack at runtime,
> > and invokes this exception subprog with the stack and frame pointers as
> > parameters.
> >
> > Reviewers must take note that currently the main program is made to save
> > all callee-saved registers on x86_64 during entry into the program. This
> > is because we must do an equivalent of a lightweight context switch when
> > unwinding the stack, therefore we need the callee-saved registers of the
> > caller of the BPF program to be able to return with a sane state.
> >
> > Note that we have to additionally handle r12, even though it is not used
> > by the program, because when throwing the exception the program makes an
> > entry into the kernel which could clobber r12 after saving it on the
> > stack. To be able to preserve the value we received on program entry, we
> > push r12 and restore it from the generated subprogram when unwinding the
> > stack.
> >
> > All of this can however be addressed by recording which callee-saved
> > registers are saved for each program, and then restore them from the
> > corresponding stack frames (mapped to each program) when unwinding. This
> > would not require pushing all callee-saved registers on entry into a BPF
> > program. However, this optimization is deferred for a future patch to
> > manage the number of moving pieces within this set.
>
> Let's keep it permanent. Not worth to optimize further.
> bpf progs are getting bigger and they use all 4 regs more often than not.
>

Ack.

> > For now, bpf_throw invocation fails when lingering resources or locks
> > exist in that path of the program. In a future followup, bpf_throw will
> > be extended to perform frame-by-frame unwinding to release lingering
> > resources for each stack frame, removing this limitation.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  arch/x86/net/bpf_jit_comp.c                   |  73 +++++++----
> >  include/linux/bpf.h                           |   3 +
> >  include/linux/bpf_verifier.h                  |   4 +
> >  include/linux/filter.h                        |   6 +
> >  kernel/bpf/core.c                             |   2 +-
> >  kernel/bpf/helpers.c                          |  38 ++++++
> >  kernel/bpf/verifier.c                         | 124 ++++++++++++++++--
> >  .../testing/selftests/bpf/bpf_experimental.h  |   6 +
> >  8 files changed, 219 insertions(+), 37 deletions(-)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index d326503ce242..8d97c6a60f9a 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -256,32 +256,36 @@ struct jit_context {
> >  /* Number of bytes that will be skipped on tailcall */
> >  #define X86_TAIL_CALL_OFFSET (11 + ENDBR_INSN_SIZE)
> >
> > -static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
> > +static void push_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
> >  {
> >       u8 *prog = *pprog;
> >
> > -     if (callee_regs_used[0])
> > +     if (callee_regs_used[0] || force)
> >               EMIT1(0x53);         /* push rbx */
> > -     if (callee_regs_used[1])
> > +     if (force)
> > +             EMIT2(0x41, 0x54);   /* push r12 */
>
> let's make r12 push/pop explicit. In addition to push_callee_regs.
>

Ack (I take it you mean adding an explicit push/pop of r12, and using that).

> > +     if (callee_regs_used[1] || force)
> >               EMIT2(0x41, 0x55);   /* push r13 */
> > -     if (callee_regs_used[2])
> > +     if (callee_regs_used[2] || force)
> >               EMIT2(0x41, 0x56);   /* push r14 */
> > -     if (callee_regs_used[3])
> > +     if (callee_regs_used[3] || force)
> >               EMIT2(0x41, 0x57);   /* push r15 */
>
> Instead of passing extra bool through the call chain how about doing:
>
> if (bpf_prog->aux->seen_exception)
>     memset(callee_regs_used, 1, sizeof(..));
> else
>     detect_reg_usage()
>

Looks good, I will change it.

> >       *pprog = prog;
> >  }
> >
> > -static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
> > +static void pop_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
> >  {
> >       u8 *prog = *pprog;
> >
> > -     if (callee_regs_used[3])
> > +     if (callee_regs_used[3] || force)
> >               EMIT2(0x41, 0x5F);   /* pop r15 */
> > -     if (callee_regs_used[2])
> > +     if (callee_regs_used[2] || force)
> >               EMIT2(0x41, 0x5E);   /* pop r14 */
> > -     if (callee_regs_used[1])
> > +     if (callee_regs_used[1] || force)
> >               EMIT2(0x41, 0x5D);   /* pop r13 */
> > -     if (callee_regs_used[0])
> > +     if (force)
> > +             EMIT2(0x41, 0x5C);   /* pop r12 */
> > +     if (callee_regs_used[0] || force)
> >               EMIT1(0x5B);         /* pop rbx */
> >       *pprog = prog;
> >  }
> > @@ -292,7 +296,8 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used)
> >   * while jumping to another program
> >   */
> >  static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > -                       bool tail_call_reachable, bool is_subprog)
> > +                       bool tail_call_reachable, bool is_subprog,
> > +                       bool is_exception_cb)
> >  {
> >       u8 *prog = *pprog;
> >
> > @@ -308,8 +313,23 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> >               else
> >                       EMIT2(0x66, 0x90); /* nop2 */
> >       }
> > -     EMIT1(0x55);             /* push rbp */
> > -     EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> > +     /* Exception callback receives FP as second parameter */
> > +     if (is_exception_cb) {
> > +             bool regs_used[4] = {};
> > +
> > +             EMIT3(0x48, 0x89, 0xF4); /* mov rsp, rsi */
> > +             EMIT3(0x48, 0x89, 0xD5); /* mov rbp, rdx */
> > +             /* The main frame must have seen_exception as true, so we first
> > +              * restore those callee-saved regs from stack, before reusing
> > +              * the stack frame.
> > +              */
> > +             pop_callee_regs(&prog, regs_used, true);
> > +             /* Reset the stack frame. */
> > +             EMIT3(0x48, 0x89, 0xEC); /* mov rsp, rbp */
> > +     } else {
> > +             EMIT1(0x55);             /* push rbp */
> > +             EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> > +     }
> >
> >       /* X86_TAIL_CALL_OFFSET is here */
> >       EMIT_ENDBR();
> > @@ -468,10 +488,12 @@ static void emit_return(u8 **pprog, u8 *ip)
> >   *   goto *(prog->bpf_func + prologue_size);
> >   * out:
> >   */
> > -static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> > +static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
> > +                                     u8 **pprog, bool *callee_regs_used,
> >                                       u32 stack_depth, u8 *ip,
> >                                       struct jit_context *ctx)
> >  {
> > +     bool force_pop_all = bpf_prog->aux->seen_exception;
> >       int tcc_off = -4 - round_up(stack_depth, 8);
> >       u8 *prog = *pprog, *start = *pprog;
> >       int offset;
> > @@ -518,7 +540,7 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> >       offset = ctx->tail_call_indirect_label - (prog + 2 - start);
> >       EMIT2(X86_JE, offset);                    /* je out */
> >
> > -     pop_callee_regs(&prog, callee_regs_used);
> > +     pop_callee_regs(&prog, callee_regs_used, force_pop_all);
> >
> >       EMIT1(0x58);                              /* pop rax */
> >       if (stack_depth)
> > @@ -542,11 +564,13 @@ static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> >       *pprog = prog;
> >  }
> >
> > -static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
> > +static void emit_bpf_tail_call_direct(struct bpf_prog *bpf_prog,
> > +                                   struct bpf_jit_poke_descriptor *poke,
> >                                     u8 **pprog, u8 *ip,
> >                                     bool *callee_regs_used, u32 stack_depth,
> >                                     struct jit_context *ctx)
> >  {
> > +     bool force_pop_all = bpf_prog->aux->seen_exception;
> >       int tcc_off = -4 - round_up(stack_depth, 8);
> >       u8 *prog = *pprog, *start = *pprog;
> >       int offset;
> > @@ -571,7 +595,7 @@ static void emit_bpf_tail_call_direct(struct bpf_jit_poke_descriptor *poke,
> >       emit_jump(&prog, (u8 *)poke->tailcall_target + X86_PATCH_SIZE,
> >                 poke->tailcall_bypass);
> >
> > -     pop_callee_regs(&prog, callee_regs_used);
> > +     pop_callee_regs(&prog, callee_regs_used, force_pop_all);
> >       EMIT1(0x58);                                  /* pop rax */
> >       if (stack_depth)
> >               EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
> > @@ -987,8 +1011,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >
> >       emit_prologue(&prog, bpf_prog->aux->stack_depth,
> >                     bpf_prog_was_classic(bpf_prog), tail_call_reachable,
> > -                   bpf_prog->aux->func_idx != 0);
> > -     push_callee_regs(&prog, callee_regs_used);
> > +                   bpf_prog->aux->func_idx != 0, bpf_prog->aux->exception_cb);
> > +     /* Exception callback will clobber callee regs for its own use, and
> > +      * restore the original callee regs from main prog's stack frame.
> > +      */
> > +     push_callee_regs(&prog, callee_regs_used, bpf_prog->aux->seen_exception);
> >
> >       ilen = prog - temp;
> >       if (rw_image)
> > @@ -1557,13 +1584,15 @@ st:                   if (is_imm8(insn->off))
> >
> >               case BPF_JMP | BPF_TAIL_CALL:
> >                       if (imm32)
> > -                             emit_bpf_tail_call_direct(&bpf_prog->aux->poke_tab[imm32 - 1],
> > +                             emit_bpf_tail_call_direct(bpf_prog,
> > +                                                       &bpf_prog->aux->poke_tab[imm32 - 1],
> >                                                         &prog, image + addrs[i - 1],
> >                                                         callee_regs_used,
> >                                                         bpf_prog->aux->stack_depth,
> >                                                         ctx);
> >                       else
> > -                             emit_bpf_tail_call_indirect(&prog,
> > +                             emit_bpf_tail_call_indirect(bpf_prog,
> > +                                                         &prog,
> >                                                           callee_regs_used,
> >                                                           bpf_prog->aux->stack_depth,
> >                                                           image + addrs[i - 1],
> > @@ -1808,7 +1837,7 @@ st:                     if (is_imm8(insn->off))
> >                       seen_exit = true;
> >                       /* Update cleanup_addr */
> >                       ctx->cleanup_addr = proglen;
> > -                     pop_callee_regs(&prog, callee_regs_used);
> > +                     pop_callee_regs(&prog, callee_regs_used, bpf_prog->aux->seen_exception);
> >                       EMIT1(0xC9);         /* leave */
> >                       emit_return(&prog, image + addrs[i - 1] + (prog - temp));
> >                       break;
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 70f212dddfbf..61cdb291311f 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1386,6 +1386,8 @@ struct bpf_prog_aux {
> >       bool tail_call_reachable;
> >       bool xdp_has_frags;
> >       bool invented_prog;
> > +     bool exception_cb;
> > +     bool seen_exception;
> >       /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
> >       const struct btf_type *attach_func_proto;
> >       /* function name for valid attach_btf_id */
> > @@ -1408,6 +1410,7 @@ struct bpf_prog_aux {
> >       int cgroup_atype; /* enum cgroup_bpf_attach_type */
> >       struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
> >       char name[BPF_OBJ_NAME_LEN];
> > +     unsigned int (*bpf_exception_cb)(u64 cookie, u64 sp, u64 bp);
> >  #ifdef CONFIG_SECURITY
> >       void *security;
> >  #endif
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 360aa304ec09..e28386fa462f 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -541,7 +541,9 @@ struct bpf_subprog_info {
> >       bool tail_call_reachable;
> >       bool has_ld_abs;
> >       bool invented_prog;
> > +     bool is_cb;
> >       bool is_async_cb;
> > +     bool is_exception_cb;
> >  };
> >
> >  struct bpf_verifier_env;
> > @@ -588,6 +590,7 @@ struct bpf_verifier_env {
> >       u32 used_map_cnt;               /* number of used maps */
> >       u32 used_btf_cnt;               /* number of used BTF objects */
> >       u32 id_gen;                     /* used to generate unique reg IDs */
> > +     int exception_callback_subprog;
> >       bool explore_alu_limits;
> >       bool allow_ptr_leaks;
> >       bool allow_uninit_stack;
> > @@ -596,6 +599,7 @@ struct bpf_verifier_env {
> >       bool bypass_spec_v4;
> >       bool seen_direct_write;
> >       bool invented_prog;
> > +     bool seen_exception;
> >       struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
> >       const struct bpf_line_info *prev_linfo;
> >       struct bpf_verifier_log log;
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 21ac801330bb..f45a54f8dd7d 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1137,6 +1137,7 @@ const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
> >  bool is_bpf_text_address(unsigned long addr);
> >  int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
> >                   char *sym);
> > +struct bpf_prog *bpf_prog_ksym_find(unsigned long addr);
> >
> >  static inline const char *
> >  bpf_address_lookup(unsigned long addr, unsigned long *size,
> > @@ -1204,6 +1205,11 @@ static inline int bpf_get_kallsym(unsigned int symnum, unsigned long *value,
> >       return -ERANGE;
> >  }
> >
> > +static inline struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
> > +{
> > +     return NULL;
> > +}
> > +
> >  static inline const char *
> >  bpf_address_lookup(unsigned long addr, unsigned long *size,
> >                  unsigned long *off, char **modname, char *sym)
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 5e224cf0ec27..efbc2f965226 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -723,7 +723,7 @@ bool is_bpf_text_address(unsigned long addr)
> >       return ret;
> >  }
> >
> > -static struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
> > +struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
> >  {
> >       struct bpf_ksym *ksym = bpf_ksym_find(addr);
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 9e80efa59a5d..da1493a1da25 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -2402,6 +2402,43 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
> >       rcu_read_unlock();
> >  }
> >
> > +struct bpf_throw_ctx {
> > +     struct bpf_prog_aux *aux;
> > +     u64 sp;
> > +     u64 bp;
> > +     int cnt;
> > +};
> > +
> > +static bool bpf_stack_walker(void *cookie, u64 ip, u64 sp, u64 bp)
> > +{
> > +     struct bpf_throw_ctx *ctx = cookie;
> > +     struct bpf_prog *prog;
> > +
> > +     if (!is_bpf_text_address(ip))
> > +             return !ctx->cnt;
> > +     prog = bpf_prog_ksym_find(ip);
> > +     ctx->cnt++;
> > +     if (!prog->aux->id)
> > +             return true;
> > +     ctx->aux = prog->aux;
> > +     ctx->sp = sp;
> > +     ctx->bp = bp;
> > +     return false;
> > +}
> > +
> > +__bpf_kfunc void bpf_throw(u64 cookie)
> > +{
> > +     struct bpf_throw_ctx ctx = {};
> > +
> > +     arch_bpf_stack_walk(bpf_stack_walker, &ctx);
> > +     WARN_ON_ONCE(!ctx.aux);
> > +     if (ctx.aux)
> > +             WARN_ON_ONCE(!ctx.aux->seen_exception);
> > +     WARN_ON_ONCE(!ctx.bp);
> > +     WARN_ON_ONCE(!ctx.cnt);
> > +     ctx.aux->bpf_exception_cb(cookie, ctx.sp, ctx.bp);
> > +}
> > +
> >  __diag_pop();
> >
> >  BTF_SET8_START(generic_btf_ids)
> > @@ -2429,6 +2466,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
> >  BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
> >  #endif
> >  BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
> > +BTF_ID_FLAGS(func, bpf_throw)
> >  BTF_SET8_END(generic_btf_ids)
> >
> >  static const struct btf_kfunc_id_set generic_kfunc_set = {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 8ce72a7b4758..61101a87d96b 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -542,6 +542,8 @@ static bool is_dynptr_ref_function(enum bpf_func_id func_id)
> >  }
> >
> >  static bool is_callback_calling_kfunc(u32 btf_id);
> > +static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id);
> > +static bool is_bpf_throw_kfunc(struct bpf_insn *insn);
> >
> >  static bool is_callback_calling_function(enum bpf_func_id func_id)
> >  {
> > @@ -2864,11 +2866,12 @@ static int check_subprogs(struct bpf_verifier_env *env)
> >               if (i == subprog_end - 1) {
> >                       /* to avoid fall-through from one subprog into another
> >                        * the last insn of the subprog should be either exit
> > -                      * or unconditional jump back
> > +                      * or unconditional jump back or bpf_throw call
> >                        */
> >                       if (code != (BPF_JMP | BPF_EXIT) &&
> > -                         code != (BPF_JMP | BPF_JA)) {
> > -                             verbose(env, "last insn is not an exit or jmp\n");
> > +                         code != (BPF_JMP | BPF_JA) &&
> > +                         !is_bpf_throw_kfunc(insn + i)) {
> > +                             verbose(env, "last insn is not an exit or jmp or bpf_throw call\n");
> >                               return -EINVAL;
> >                       }
> >                       subprog_start = subprog_end;
> > @@ -5625,6 +5628,25 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> >       for (; i < subprog_end; i++) {
> >               int next_insn, sidx;
> >
> > +             if (bpf_pseudo_kfunc_call(insn + i) && !insn[i].off) {
> > +                     bool err = false;
> > +
> > +                     if (!is_forbidden_exception_kfunc_in_cb(insn[i].imm))
>
> Just use is_bpf_throw_kfunc(). No need for another helper.
> The compiler will optimize the redundant checks.

Actually, I added this because we also need to treat
bpf_set_exception_callback as forbidden (see the later patch
extending the forbidden list). But I can check both separately as
well; a rough sketch of that follows.
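
Something like this (a sketch; KF_bpf_set_exception_callback is assumed
to come from that later patch and may end up with a different name):

        /* Sketch only: open-code the two exception kfunc checks instead
         * of a combined is_forbidden_exception_kfunc_in_cb() helper.
         */
        if (!is_bpf_throw_kfunc(insn + i) &&
            insn[i].imm != special_kfunc_list[KF_bpf_set_exception_callback])
                continue;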

>
> > +                             continue;
> > +                     if (subprog[idx].is_cb)
> > +                             err = true;
> > +                     for (int c = 0; c < frame && !err; c++) {
> > +                             if (subprog[ret_prog[c]].is_cb) {
> > +                                     err = true;
> > +                                     break;
> > +                             }
> > +                     }
> > +                     if (!err)
> > +                             continue;
> > +                     verbose(env, "exception kfunc insn %d cannot be called from callback\n", i);
> > +                     return -EINVAL;
> > +             }
> > +
> >               if (!bpf_pseudo_call(insn + i) && !bpf_pseudo_func(insn + i))
> >                       continue;
> >               /* remember insn and function to return to */
> > @@ -8734,6 +8756,7 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >        * callbacks
> >        */
> >       if (set_callee_state_cb != set_callee_state) {
> > +             env->subprog_info[subprog].is_cb = true;
> >               if (bpf_pseudo_kfunc_call(insn) &&
> >                   !is_callback_calling_kfunc(insn->imm)) {
> >                       verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
> > @@ -9247,17 +9270,17 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
> >       return 0;
> >  }
> >
> > -static int check_reference_leak(struct bpf_verifier_env *env)
> > +static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> >  {
> >       struct bpf_func_state *state = cur_func(env);
> >       bool refs_lingering = false;
> >       int i;
> >
> > -     if (state->frameno && !state->in_callback_fn)
> > +     if (!exception_exit && state->frameno && !state->in_callback_fn)
> >               return 0;
> >
> >       for (i = 0; i < state->acquired_refs; i++) {
> > -             if (state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> > +             if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> >                       continue;
> >               verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> >                       state->refs[i].id, state->refs[i].insn_idx);
> > @@ -9491,7 +9514,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >
> >       switch (func_id) {
> >       case BPF_FUNC_tail_call:
> > -             err = check_reference_leak(env);
> > +             err = check_reference_leak(env, false);
> >               if (err) {
> >                       verbose(env, "tail_call would lead to reference leak\n");
> >                       return err;
> > @@ -10109,6 +10132,7 @@ enum special_kfunc_type {
> >       KF_bpf_dynptr_slice,
> >       KF_bpf_dynptr_slice_rdwr,
> >       KF_bpf_dynptr_clone,
> > +     KF_bpf_throw,
> >  };
> >
> >  BTF_SET_START(special_kfunc_set)
> > @@ -10129,6 +10153,7 @@ BTF_ID(func, bpf_dynptr_from_xdp)
> >  BTF_ID(func, bpf_dynptr_slice)
> >  BTF_ID(func, bpf_dynptr_slice_rdwr)
> >  BTF_ID(func, bpf_dynptr_clone)
> > +BTF_ID(func, bpf_throw)
> >  BTF_SET_END(special_kfunc_set)
> >
> >  BTF_ID_LIST(special_kfunc_list)
> > @@ -10151,6 +10176,7 @@ BTF_ID(func, bpf_dynptr_from_xdp)
> >  BTF_ID(func, bpf_dynptr_slice)
> >  BTF_ID(func, bpf_dynptr_slice_rdwr)
> >  BTF_ID(func, bpf_dynptr_clone)
> > +BTF_ID(func, bpf_throw)
> >
> >  static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
> >  {
> > @@ -10464,6 +10490,17 @@ static bool is_callback_calling_kfunc(u32 btf_id)
> >       return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl];
> >  }
> >
> > +static bool is_bpf_throw_kfunc(struct bpf_insn *insn)
> > +{
> > +     return bpf_pseudo_kfunc_call(insn) && insn->off == 0 &&
> > +            insn->imm == special_kfunc_list[KF_bpf_throw];
> > +}
> > +
> > +static bool is_forbidden_exception_kfunc_in_cb(u32 btf_id)
> > +{
> > +     return btf_id == special_kfunc_list[KF_bpf_throw];
> > +}
> > +
> >  static bool is_rbtree_lock_required_kfunc(u32 btf_id)
> >  {
> >       return is_bpf_rbtree_api_kfunc(btf_id);
> > @@ -11140,6 +11177,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >       const struct btf_param *args;
> >       const struct btf_type *ret_t;
> >       struct btf *desc_btf;
> > +     bool throw = false;
> >
> >       /* skip for now, but return error when we find this in fixup_kfunc_call */
> >       if (!insn->imm)
> > @@ -11242,6 +11280,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >               }
> >       }
> >
> > +     if (meta.func_id == special_kfunc_list[KF_bpf_throw]) {
> > +             if (!bpf_jit_supports_exceptions()) {
> > +                     verbose(env, "JIT does not support calling kfunc %s#%d\n",
> > +                             func_name, meta.func_id);
> > +                     return -EINVAL;
> > +             }
> > +             env->seen_exception = true;
> > +             throw = true;
> > +     }
> > +
> >       for (i = 0; i < CALLER_SAVED_REGS; i++)
> >               mark_reg_not_init(env, regs, caller_saved[i]);
> >
> > @@ -11463,7 +11511,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >                       return err;
> >       }
> >
> > -     return 0;
> > +     return throw ? 1 : 0;
>
> I don't like this inband signaling.
> Just check for special_kfunc_list[KF_bpf_throw] when check_func_call() returns with zero.
>

Ack.
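
A rough sketch of the caller-side alternative in do_check(), reusing
is_bpf_throw_kfunc() from this patch (exact placement is an assumption):

        err = check_kfunc_call(env, insn, &env->insn_idx);
        /* Sketch only: a bpf_throw call behaves like an exit, so there
         * is no fallthrough state to continue with.
         */
        if (!err && is_bpf_throw_kfunc(insn)) {
                exception_exit = true;
                goto process_bpf_exit_full;
        }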

> >  }
> >
> >  static bool signed_add_overflows(s64 a, s64 b)
> > @@ -14211,7 +14259,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
> >        * gen_ld_abs() may terminate the program at runtime, leading to
> >        * reference leak.
> >        */
> > -     err = check_reference_leak(env);
> > +     err = check_reference_leak(env, false);
> >       if (err) {
> >               verbose(env, "BPF_LD_[ABS|IND] cannot be mixed with socket references\n");
> >               return err;
> > @@ -14619,6 +14667,9 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
> >               if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
> >                       struct bpf_kfunc_call_arg_meta meta;
> >
> > +                     /* No fallthrough edge to walk, same as BPF_EXIT */
> > +                     if (is_bpf_throw_kfunc(insn))
> > +                             return DONE_EXPLORING;
> >                       ret = fetch_kfunc_meta(env, insn, &meta, NULL);
> >                       if (ret == 0 && is_iter_next_kfunc(&meta)) {
> >                               mark_prune_point(env, t);
> > @@ -16222,6 +16273,7 @@ static int do_check(struct bpf_verifier_env *env)
> >       int prev_insn_idx = -1;
> >
> >       for (;;) {
> > +             bool exception_exit = false;
> >               struct bpf_insn *insn;
> >               u8 class;
> >               int err;
> > @@ -16435,12 +16487,18 @@ static int do_check(struct bpf_verifier_env *env)
> >                                               return -EINVAL;
> >                                       }
> >                               }
> > -                             if (insn->src_reg == BPF_PSEUDO_CALL)
> > +                             if (insn->src_reg == BPF_PSEUDO_CALL) {
> >                                       err = check_func_call(env, insn, &env->insn_idx);
> > -                             else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL)
> > +                             } else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
> >                                       err = check_kfunc_call(env, insn, &env->insn_idx);
> > -                             else
> > +                                     if (err == 1) {
> > +                                             err = 0;
> > +                                             exception_exit = true;
> > +                                             goto process_bpf_exit_full;
> > +                                     }
> > +                             } else {
> >                                       err = check_helper_call(env, insn, &env->insn_idx);
> > +                             }
> >                               if (err)
> >                                       return err;
> >
> > @@ -16467,7 +16525,7 @@ static int do_check(struct bpf_verifier_env *env)
> >                                       verbose(env, "BPF_EXIT uses reserved fields\n");
> >                                       return -EINVAL;
> >                               }
> > -
> > +process_bpf_exit_full:
> >                               if (env->cur_state->active_lock.ptr &&
> >                                   !in_rbtree_lock_required_cb(env)) {
> >                                       verbose(env, "bpf_spin_unlock is missing\n");
> > @@ -16485,10 +16543,23 @@ static int do_check(struct bpf_verifier_env *env)
> >                                * function, for which reference_state must
> >                                * match caller reference state when it exits.
> >                                */
> > -                             err = check_reference_leak(env);
> > +                             err = check_reference_leak(env, exception_exit);
> >                               if (err)
> >                                       return err;
> >
> > +                             /* The side effect of the prepare_func_exit
> > +                              * which is being skipped is that it frees
> > +                              * bpf_func_state. Typically, process_bpf_exit
> > +                              * will only be hit with outermost exit.
> > +                              * copy_verifier_state in pop_stack will handle
> > +                              * freeing of any extra bpf_func_state left over
> > +                              * from not processing all nested function
> > +                              * exits. We also skip return code checks as
> > +                              * they are not needed for exceptional exits.
> > +                              */
> > +                             if (exception_exit)
> > +                                     goto process_bpf_exit;
> > +
> >                               if (state->curframe) {
> >                                       /* exit from nested function */
> >                                       err = prepare_func_exit(env, &env->insn_idx);
> > @@ -17782,6 +17853,9 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> >               func[i]->aux->num_exentries = num_exentries;
> >               func[i]->aux->tail_call_reachable = env->subprog_info[i].tail_call_reachable;
> >               func[i]->aux->invented_prog = env->subprog_info[i].invented_prog;
> > +             func[i]->aux->exception_cb = env->subprog_info[i].is_exception_cb;
> > +             if (!i)
> > +                     func[i]->aux->seen_exception = env->seen_exception;
>
> why conditional?
>

We only need to set it for the main subprog, since when we throw,
entry from the kernel happens into that frame (so only the main subprog
needs to push/pop all callee-saved registers). This is then later used
in the JIT as 'if (seen_exception)'. I should probably rename it to
better reflect its use.

[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc
  2023-07-14 22:51   ` Alexei Starovoitov
@ 2023-07-17 16:44     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 16:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Sat, 15 Jul 2023 at 04:21, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:28AM +0530, Kumar Kartikeya Dwivedi wrote:
> > diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> > index 209811b1993a..f1d7de1349bc 100644
> > --- a/tools/testing/selftests/bpf/bpf_experimental.h
> > +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> > @@ -131,4 +131,10 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
> >   */
> >  extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
> >
> > +__attribute__((noreturn))
> > +extern void bpf_throw(u64 cookie) __ksym;
> > +
> > +#define throw bpf_throw(0)
> > +#define throw_value(cookie) bpf_throw(cookie)
>
> Reading the patch 10 I think the add-on value of these two macros is negative.
> If it was open coded as bpf_throw(0); everywhere it would be easier to read imo.

Ack, I will drop the macros.
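
For illustration, the open-coded form in a selftest would then simply be
(the condition is a placeholder):

        if (!some_invariant)    /* placeholder condition */
                bpf_throw(0);   /* default callback: prog returns 0 */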

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback
  2023-07-14 22:47   ` Alexei Starovoitov
@ 2023-07-17 16:46     ` Kumar Kartikeya Dwivedi
  2023-07-17 17:47       ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 16:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Sat, 15 Jul 2023 at 04:17, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> > By default, the subprog generated by the verifier to handle a thrown
> > exception hardcodes a return value of 0. To allow user-defined logic
> > and modification of the return value when an exception is thrown,
> > introduce the bpf_set_exception_callback kfunc, which installs a
> > callback as the default exception handler for the program.
> >
> > Compared to runtime kfuncs, this kfunc acts as a built-in, i.e. it only
> > takes semantic effect during verification, and is erased from the
> > program at runtime.
> >
> > This kfunc can only be called once within a program, and always sets the
> > global exception handler, regardless of whether it was invoked in all
> > paths of the program or not. The kfunc is idempotent, and the default
> > exception callback cannot be modified at runtime.
> >
> > Allowing modification of the callback for the current program execution
> > at runtime leads to issues when the programs begin to nest, as any
> > per-CPU state maintaining this information will have to be saved and
> > restored. We don't want it to stay in bpf_prog_aux as this takes a
> > global effect for all programs. An alternative solution is spilling
> > the callback pointer at a known location on the program stack on entry,
> > and then passing this location to bpf_throw as a parameter.
> >
> > However, since exceptions are geared more towards a use case where they
> > are ideally never invoked, optimizing for this use case and adding to
> > the complexity has diminishing returns.
>
> Right. No run-time changes pls.
>

+1

> Instead of bpf_set_exception_callback() how about adding a
> btf_tag("exception_handler") or better name
> and check that such global func is a single func in a program and
> it's argument is a single u64.
>

That does seem better. Also, a conditional bpf_set_exception_callback
taking effect globally may be confusing for users.
I will switch to the BTF tag.

Any specific reason it has to be a global func and cannot have static linkage?
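
For illustration, a hypothetical sketch of the tag-based declaration
being discussed, assuming a decl tag spelled "exception_handler" as
suggested above (the final tag name and the verifier-side checks are
still open):

        /* Global func taking a single u64; its return value becomes the
         * program's return value when an exception is thrown.
         */
        __attribute__((btf_decl_tag("exception_handler")))
        int my_exception_handler(u64 cookie)
        {
                return cookie ? 1 : 0;
        }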

> > In the future, a variant of bpf_throw which allows supplying a callback
> > can also be introduced, to modify the callback for a certain throw
> > statement. For now, bpf_set_exception_callback is meant to serve as a
> > way to statically set a subprog as the exception handler of a BPF
> > program.
> >
> > TODO: Should we change default behavior to just return whatever cookie
> > value was passed to bpf_throw? That might allow people to avoid
> > installing a callback in case they just want to manipulate the return
> > value.
>
> and the verifier would check that u64 matches allowed return values?
> ex: call check_return_code() on argument of bpf_throw?
> I guess that makes sense.

Ack. I'll make the change.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs
  2023-07-17 16:21     ` Kumar Kartikeya Dwivedi
@ 2023-07-17 17:24       ` Alexei Starovoitov
  2023-07-17 19:05         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-17 17:24 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, Jul 17, 2023 at 9:22 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 15 Jul 2023 at 03:28, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 08:02:26AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > Introduce support in the verifier for generating a subprogram and
> > > include it as part of a BPF program dynamically after the do_check
> > > phase is complete. The appropriate place of invocation would be
> > > do_misc_fixups.
> > >
> > > Since they are always appended to the end of the instruction sequence of
> > > the program, it becomes relatively inexpensive to do the related
> > > adjustments to the subprog_info of the program. Only the fake exit
> > > subprogram is shifted forward by 1, making room for our invented subprog.
> > >
> > > This is useful to insert a new subprogram and obtain its function
> > > pointer. The next patch will use this functionality to insert a default
> > > exception callback which will be invoked after unwinding the stack.
> > >
> > > Note that these invented subprograms are invisible to userspace, and
> > > never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
> > > invented program is supported, but more can be easily supported in the
> > > future.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h          |  1 +
> > >  include/linux/bpf_verifier.h |  4 +++-
> > >  kernel/bpf/core.c            |  4 ++--
> > >  kernel/bpf/syscall.c         | 19 ++++++++++++++++++-
> > >  kernel/bpf/verifier.c        | 29 ++++++++++++++++++++++++++++-
> > >  5 files changed, 52 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 360433f14496..70f212dddfbf 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1385,6 +1385,7 @@ struct bpf_prog_aux {
> > >       bool sleepable;
> > >       bool tail_call_reachable;
> > >       bool xdp_has_frags;
> > > +     bool invented_prog;
> > >       /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
> > >       const struct btf_type *attach_func_proto;
> > >       /* function name for valid attach_btf_id */
> > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > index f70f9ac884d2..360aa304ec09 100644
> > > --- a/include/linux/bpf_verifier.h
> > > +++ b/include/linux/bpf_verifier.h
> > > @@ -540,6 +540,7 @@ struct bpf_subprog_info {
> > >       bool has_tail_call;
> > >       bool tail_call_reachable;
> > >       bool has_ld_abs;
> > > +     bool invented_prog;
> > >       bool is_async_cb;
> > >  };
> > >
> > > @@ -594,10 +595,11 @@ struct bpf_verifier_env {
> > >       bool bypass_spec_v1;
> > >       bool bypass_spec_v4;
> > >       bool seen_direct_write;
> > > +     bool invented_prog;
> >
> > Instead of a flag in two places how about adding aux->func_cnt_real
> > and use it in JITing and free-ing while get_info*() keep using aux->func_cnt.
> >
>
> That does seem better, thanks. I'll make the change in v2.
>
> > > +/* The function requires that first instruction in 'patch' is insnsi[prog->len - 1] */
> > > +static int invent_subprog(struct bpf_verifier_env *env, struct bpf_insn *patch, int len)
> > > +{
> > > +     struct bpf_subprog_info *info = env->subprog_info;
> > > +     int cnt = env->subprog_cnt;
> > > +     struct bpf_prog *prog;
> > > +
> > > +     if (env->invented_prog) {
> > > +             verbose(env, "verifier internal error: only one invented prog supported\n");
> > > +             return -EFAULT;
> > > +     }
> > > +     prog = bpf_patch_insn_data(env, env->prog->len - 1, patch, len);
> >
> > The actual patching is not necessary.
> > bpf_prog_realloc() and memcpy would be enough, no?
> >
>
> Yes, it should be fine. But I didn't want to special-case things here,
> so as to make sure assumptions elsewhere don't break.
> E.g. code readily assumes every insn has its own insn_aux_data which
> might be broken if we don't expand it.
> I think bpf_patch_insn_single is already doing a realloc (and reusing
> trailing space in current allocation if available), so it didn't seem
> worth it to me.
>
> If you still feel it's better I can analyze if anything might break
> and make the change.

bpf_patch_insn_data() is a known performance bottleneck.
Folks have been trying to optimize it in the past.
It's certainly delicate code.
I guess since this extra subprog will only be added once
we can live with unnecessary overhead of bpf_patch_insn_data().
Just add the comment that we're not patching existing insn and
all of adjust* ops are nop.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk
  2023-07-17 16:29     ` Kumar Kartikeya Dwivedi
@ 2023-07-17 17:29       ` Alexei Starovoitov
  2023-07-17 19:07         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-17 17:29 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, Jul 17, 2023 at 9:29 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 15 Jul 2023 at 03:35, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 08:02:27AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > The plumbing for offline unwinding when we throw an exception in
> > > programs would require walking the stack, hence introduce a new
> > > arch_bpf_stack_walk function. This is provided when the JIT supports
> > > exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
> > > code is really minimal, hence it should be straightforward to extend this
> > > support to other architectures as well, as it reuses the logic of
> > > arch_stack_walk, but allowing access to unwind_state data.
> > >
> > > Once the stack pointer and frame pointer are known for the main subprog
> > > during the unwinding, we know the stack layout and location of any
> > > callee-saved registers which must be restored before we return back to
> > > the kernel.
> > >
> > > This handling will be added in the next patch.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 21 +++++++++++++++++++++
> > >  include/linux/filter.h      |  2 ++
> > >  kernel/bpf/core.c           |  9 +++++++++
> > >  3 files changed, 32 insertions(+)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index 438adb695daa..d326503ce242 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -16,6 +16,7 @@
> > >  #include <asm/set_memory.h>
> > >  #include <asm/nospec-branch.h>
> > >  #include <asm/text-patching.h>
> > > +#include <asm/unwind.h>
> > >
> > >  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
> > >  {
> > > @@ -2660,3 +2661,23 @@ void bpf_jit_free(struct bpf_prog *prog)
> > >
> > >       bpf_prog_unlock_free(prog);
> > >  }
> > > +
> > > +bool bpf_jit_supports_exceptions(void)
> > > +{
> > > +     return IS_ENABLED(CONFIG_UNWINDER_ORC) || IS_ENABLED(CONFIG_UNWINDER_FRAME_POINTER);
> > > +}
> > > +
> > > +void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> > > +{
> > > +#if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
> > > +     struct unwind_state state;
> > > +     unsigned long addr;
> > > +
> > > +     for (unwind_start(&state, current, NULL, NULL); !unwind_done(&state);
> > > +          unwind_next_frame(&state)) {
> > > +             addr = unwind_get_return_address(&state);
> >
> > I think these steps will work even with UNWINDER_GUESS.
> > What is the reason for #ifdef ?
>
> I think we require both unwind_state::sp and unwind_state::bp, but
> arch/x86/include/asm/unwind.h does not include unwind_state::bp when
> both UNWINDER_ORC and UNWINDER_FRAME_POINTER are unset.
>
> Although it might be possible to calculate and save bp offset during
> JIT in bpf_prog_aux (by adding roundup(stack_depth) + 8 (push rax if
> tail call reachable) + callee_regs_saved) for the subprog
> corresponding to a frame. Then we can make it work everywhere.
> The JIT will abstract get_prog_bp(sp) using an arch specific helper.
>
> Let me know if I misunderstood something.

JITed progs always have frames. So we're effectively doing
unconditional UNWINDER_FRAME_POINTER.
I think the intended usage of arch_bpf_stack_walk() is to only walk
bpf frames _in this patch set_, if so the extra #ifdefs are misleading.
If in follow-ups we're going to unwind through JITed progs _and_
through kfunc/helpers then this ifdef will be necessary.
I suspect we might want something like this in the future.
So the ifdef is ok to have from the start, but the comment is necessary
to describe what it is for.
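
Something along these lines as that comment (a sketch; wording is an
assumption), sitting right above the #ifdef in arch_bpf_stack_walk():

        /* We only walk BPF frames here, and JITed progs always set up a
         * frame pointer, but unwind_state::bp is only exposed when ORC
         * or frame-pointer unwinding is configured, hence the #ifdef.
         */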

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc
  2023-07-17 16:43     ` Kumar Kartikeya Dwivedi
@ 2023-07-17 17:32       ` Alexei Starovoitov
  0 siblings, 0 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-17 17:32 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, Jul 17, 2023 at 9:44 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
> > >
> > > -static void push_callee_regs(u8 **pprog, bool *callee_regs_used)
> > > +static void push_callee_regs(u8 **pprog, bool *callee_regs_used, bool force)
> > >  {
> > >       u8 *prog = *pprog;
> > >
> > > -     if (callee_regs_used[0])
> > > +     if (callee_regs_used[0] || force)
> > >               EMIT1(0x53);         /* push rbx */
> > > -     if (callee_regs_used[1])
> > > +     if (force)
> > > +             EMIT2(0x41, 0x54);   /* push r12 */
> >
> > let's make r12 push/pop explicit. In addition to push_callee_regs.
> >
>
> Ack (I'm understanding you mean adding a push/pop_r12 and using that).

I mean keep push_callee_regs() as saving bpf callee regs only and
add other helpers to save/restore r12.
Or just open code push/pop of r12.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls
  2023-07-17 16:36     ` Kumar Kartikeya Dwivedi
@ 2023-07-17 17:45       ` Alexei Starovoitov
  2023-07-17 19:09         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-17 17:45 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, Jul 17, 2023 at 9:36 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 15 Jul 2023 at 04:09, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 08:02:29AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > Now that we allow exception throwing using bpf_throw kfunc, it can
> > > appear as the final instruction in a prog. When this happens, and we
> > > begin to unwind the stack using arch_bpf_stack_walk, the instruction
> > > pointer (IP) may appear to lie outside the JITed instructions. This
> > > happens because the return address is the instruction following the
> > > call, but the bpf_throw never returns to the program, so the JIT
> > > considers the instruction ending at the bpf_throw call as the final JITed
> > > instruction and end of the jited_length for the program.
> > >
> > > This becomes a problem when we search the IP using is_bpf_text_address
> > > and bpf_prog_ksym_find, both of which use bpf_ksym_find under the hood,
> > > and it rightfully considers addr == ksym.end to be outside the program's
> > > boundaries.
> > >
> > > Insert a dummy 'int3' instruction which will never be hit to bump the
> > > jited_length and allow us to handle programs with their final
> > > instruction being a call to bpf_throw.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 11 +++++++++++
> > >  include/linux/bpf.h         |  2 ++
> > >  2 files changed, 13 insertions(+)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index 8d97c6a60f9a..052230cc7f50 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -1579,6 +1579,17 @@ st:                    if (is_imm8(insn->off))
> > >                       }
> > >                       if (emit_call(&prog, func, image + addrs[i - 1] + offs))
> > >                               return -EINVAL;
> > > +                     /* Similar to BPF_EXIT_INSN, call for bpf_throw may be
> > > +                      * the final instruction in the program. Insert an int3
> > > +                      * following the call instruction so that we can still
> > > +                      * detect pc to be part of the bpf_prog in
> > > +                      * bpf_ksym_find, otherwise when this is the last
> > > +                      * instruction (as allowed by verifier, similar to exit
> > > +                      * and jump instructions), pc will be == ksym.end,
> > > +                      * leading to bpf_throw failing to unwind the stack.
> > > +                      */
> > > +                     if (func == (u8 *)&bpf_throw)
> > > +                             EMIT1(0xCC); /* int3 */
> >
> > Probably worth explaining that this happens because bpf_throw is marked
> > __attribute__((noreturn)) and compiler can emit it last without BPF_EXIT insn.
> > Meaing the program might not have BPF_EXIT at all.
>
> Yes, sorry about omitting that. I will add it to the commit message in v2.
>
> >
> > I wonder though whether this self-inflicted pain is worth it.
> > May be it shouldn't be marked as noreturn.
> > What do we gain by marking?
>
> It felt like the obvious thing to do to me. The cost on the kernel
> side is negligible (at least in my opinion); we just have to allow it
> as the final instruction in the program. If it's heavily used, it allows
> the compiler to better optimize the code (marking anything after it
> unreachable, no need to save registers, etc., although this may not be
> a persuasive point for you).

"no need to save registers"... "optimize"... that's the thing that worries me.
I think it's better to drop the noreturn attribute.
bpf has an implicit prologue/epilogue that only applies to the bpf_exit insn.
A bpf_call insn that doesn't return is exploiting undefined logic
in the compiler, since we never fully clarified our hidden prologue/epilogue
rules. Put differently, bpf_tail_call is also noreturn,
but if we mark it as such all kinds of things will break.
We still need to add alloca(). It doesn't play well with the current BPF ISA.
I think it's better to treat 'noreturn' as broken in the compiler,
since its behavior may change.

> Regardless of this noreturn attribute, I was thinking whether we
> should always emit an extra instruction so that any IP (say one past
> last instruction) we get for a BPF prog can always be seen as
> belonging to it. It probably is only a problem surfaced by this
> bpf_throw call at the end, but I was wondering whether doing it
> unconditionally makes sense.

I think it's a corner case of this 'noreturn' from bpf_call logic.
The bpf prog is padded with 0xcc before and after already.
What you're suggesting is to add one 0xcc inside the body of the prog.
That doesn't sound right.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback
  2023-07-17 16:46     ` Kumar Kartikeya Dwivedi
@ 2023-07-17 17:47       ` Alexei Starovoitov
  2023-07-17 19:10         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-17 17:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, Jul 17, 2023 at 9:47 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 15 Jul 2023 at 04:17, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 08:02:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > By default, the subprog generated by the verifier to handle a thrown
> > > exception hardcodes a return value of 0. To allow user-defined logic
> > > and modification of the return value when an exception is thrown,
> > > introduce the bpf_set_exception_callback kfunc, which installs a
> > > callback as the default exception handler for the program.
> > >
> > > Compared to runtime kfuncs, this kfunc acts as a built-in, i.e. it only
> > > takes semantic effect during verification, and is erased from the
> > > program at runtime.
> > >
> > > This kfunc can only be called once within a program, and always sets the
> > > global exception handler, regardless of whether it was invoked in all
> > > paths of the program or not. The kfunc is idempotent, and the default
> > > exception callback cannot be modified at runtime.
> > >
> > > Allowing modification of the callback for the current program execution
> > > at runtime leads to issues when the programs begin to nest, as any
> > > per-CPU state maintaining this information will have to be saved and
> > > restored. We don't want it to stay in bpf_prog_aux as this takes a
> > > global effect for all programs. An alternative solution is spilling
> > > the callback pointer at a known location on the program stack on entry,
> > > and then passing this location to bpf_throw as a parameter.
> > >
> > > However, since exceptions are geared more towards a use case where they
> > > are ideally never invoked, optimizing for this use case and adding to
> > > the complexity has diminishing returns.
> >
> > Right. No run-time changes pls.
> >
>
> +1
>
> > Instead of bpf_set_exception_callback() how about adding a
> > btf_tag("exception_handler") or better name
> > and check that such global func is a single func in a program and
> > its argument is a single u64.
> >
>
> That does seem better. Also, a conditional bpf_set_exception_callback
> taking effect globally may be confusing for users.
> I will switch to the BTF tag.
>
> Any specific reason it has to be a global func and cannot have static linkage?

The compiler will warn about the unused static function.
Even if we silence the warning somehow, the verifier will not verify that
unused static subprog.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 00/10] Exceptions - 1/2
  2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
                   ` (9 preceding siblings ...)
  2023-07-13  2:32 ` [PATCH bpf-next v1 10/10] selftests/bpf: Add tests for BPF exceptions Kumar Kartikeya Dwivedi
@ 2023-07-17 18:15 ` Daniel Xu
  2023-07-17 19:39   ` Kumar Kartikeya Dwivedi
  10 siblings, 1 reply; 34+ messages in thread
From: Daniel Xu @ 2023-07-17 18:15 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Thu, Jul 13, 2023 at 08:02:22AM +0530, Kumar Kartikeya Dwivedi wrote:
> This series implements the _first_ part of the runtime and verifier
> support needed to enable BPF exceptions. Exceptions thrown from programs
> are processed as an immediate exit from the program, which unwinds all
> the active stack frames until the main stack frame, and returns to the
> BPF program's caller. The ability to perform this unwinding safely
> allows the program to test conditions that are always true at runtime
> but which the verifier has no visibility into.
> 
> Thus, it also reduces verification effort by safely terminating
> redundant paths that can be taken within a program.
> 
> The patches to perform runtime resource cleanup during the
> frame-by-frame unwinding will be posted as a follow-up to this set.
> 
> It must be noted that exceptions are not an error handling mechanism for
> unlikely runtime conditions, but a way to safely terminate the execution
> of a program in presence of conditions that should never occur at
> runtime. They are meant to serve higher-level primitives such as program
> assertions.

Sure, that makes sense.

> 
> As such, a program can only install an exception handler once for the
> lifetime of a BPF program, and this handler cannot be changed at
> runtime. The purpose of the handler is to simply interpret the cookie
> value supplied by the bpf_throw call, and execute user-defined logic
> corresponding to it. The primary purpose of allowing a handler is to
> control the return value of the program. The default handler returns 0
> when from the program when an exception is thrown.
> 
> Fixing the handler for the lifetime of the program eliminates tricky and
> expensive handling in case of runtime changes of the handler callback
> when programs begin to nest, where it becomes more complex to save and
> restore the active handler at runtime.
> 
> The following kfuncs are introduced:
> 
> // Throw a BPF exception, terminating the execution of the program.
> //
> // @cookie: The cookie that is passed to the exception callback. Without
> //          an exception callback set by the user, the programs returns
> //          0 when an exception is thrown.
> void bpf_throw(u64 cookie);

If developers are only supposed to use higher level primitives, then why
expose a kfunc for it? The above description makes it sound like this
should be an implementation detail.

[...]

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs
  2023-07-17 17:24       ` Alexei Starovoitov
@ 2023-07-17 19:05         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 19:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, 17 Jul 2023 at 22:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 9:22 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 15 Jul 2023 at 03:28, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 08:02:26AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > Introduce support in the verifier for generating a subprogram and
> > > > include it as part of a BPF program dynamically after the do_check
> > > > phase is complete. The appropriate place of invocation would be
> > > > do_misc_fixups.
> > > >
> > > > Since they are always appended to the end of the instruction sequence of
> > > > the program, it becomes relatively inexpensive to do the related
> > > > adjustments to the subprog_info of the program. Only the fake exit
> > > > subprogram is shifted forward by 1, making room for our invented subprog.
> > > >
> > > > This is useful to insert a new subprogram and obtain its function
> > > > pointer. The next patch will use this functionality to insert a default
> > > > exception callback which will be invoked after unwinding the stack.
> > > >
> > > > Note that these invented subprograms are invisible to userspace, and
> > > > never reported in BPF_OBJ_GET_INFO_BY_ID etc. For now, only a single
> > > > invented program is supported, but more can be easily supported in the
> > > > future.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h          |  1 +
> > > >  include/linux/bpf_verifier.h |  4 +++-
> > > >  kernel/bpf/core.c            |  4 ++--
> > > >  kernel/bpf/syscall.c         | 19 ++++++++++++++++++-
> > > >  kernel/bpf/verifier.c        | 29 ++++++++++++++++++++++++++++-
> > > >  5 files changed, 52 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 360433f14496..70f212dddfbf 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1385,6 +1385,7 @@ struct bpf_prog_aux {
> > > >       bool sleepable;
> > > >       bool tail_call_reachable;
> > > >       bool xdp_has_frags;
> > > > +     bool invented_prog;
> > > >       /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
> > > >       const struct btf_type *attach_func_proto;
> > > >       /* function name for valid attach_btf_id */
> > > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > > index f70f9ac884d2..360aa304ec09 100644
> > > > --- a/include/linux/bpf_verifier.h
> > > > +++ b/include/linux/bpf_verifier.h
> > > > @@ -540,6 +540,7 @@ struct bpf_subprog_info {
> > > >       bool has_tail_call;
> > > >       bool tail_call_reachable;
> > > >       bool has_ld_abs;
> > > > +     bool invented_prog;
> > > >       bool is_async_cb;
> > > >  };
> > > >
> > > > @@ -594,10 +595,11 @@ struct bpf_verifier_env {
> > > >       bool bypass_spec_v1;
> > > >       bool bypass_spec_v4;
> > > >       bool seen_direct_write;
> > > > +     bool invented_prog;
> > >
> > > Instead of a flag in two places how about adding aux->func_cnt_real
> > > and use it in JITing and free-ing while get_info*() keep using aux->func_cnt.
> > >
> >
> > That does seem better, thanks. I'll make the change in v2.
> >
> > > > +/* The function requires that first instruction in 'patch' is insnsi[prog->len - 1] */
> > > > +static int invent_subprog(struct bpf_verifier_env *env, struct bpf_insn *patch, int len)
> > > > +{
> > > > +     struct bpf_subprog_info *info = env->subprog_info;
> > > > +     int cnt = env->subprog_cnt;
> > > > +     struct bpf_prog *prog;
> > > > +
> > > > +     if (env->invented_prog) {
> > > > +             verbose(env, "verifier internal error: only one invented prog supported\n");
> > > > +             return -EFAULT;
> > > > +     }
> > > > +     prog = bpf_patch_insn_data(env, env->prog->len - 1, patch, len);
> > >
> > > The actual patching is not necessary.
> > > bpf_prog_realloc() and memcpy would be enough, no?
> > >
> >
> > Yes, it should be fine. But I didn't want to special-case things here,
> > so as to make sure assumptions elsewhere don't break.
> > E.g. code readily assumes every insn has its own insn_aux_data which
> > might be broken if we don't expand it.
> > I think bpf_patch_insn_single is already doing a realloc (and reusing
> > trailing space in current allocation if available), so it didn't seem
> > worth it to me.
> >
> > If you still feel it's better I can analyze if anything might break
> > and make the change.
>
> bpf_patch_insn_data() is a known performance bottleneck.
> Folks have been trying to optimize it in the past.
> It's certainly delicate code.
> I guess since this extra subprog will only be added once
> we can live with unnecessary overhead of bpf_patch_insn_data().
> Just add the comment that we're not patching existing insn and
> all of adjust* ops are nop.

Ack.
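
A sketch of what that comment could look like at the call site quoted
above (wording is an assumption):

        /* We only ever append instructions after the last insn, so
         * nothing existing is rewritten and the adjust_*() fixups done
         * by bpf_patch_insn_data() are effectively no-ops; the overhead
         * is acceptable for a one-time insertion.
         */
        prog = bpf_patch_insn_data(env, env->prog->len - 1, patch, len);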

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk
  2023-07-17 17:29       ` Alexei Starovoitov
@ 2023-07-17 19:07         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 19:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, 17 Jul 2023 at 22:59, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 9:29 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 15 Jul 2023 at 03:35, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 08:02:27AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > The plumbing for offline unwinding when we throw an exception in
> > > > programs would require walking the stack, hence introduce a new
> > > > arch_bpf_stack_walk function. This is provided when the JIT supports
> > > > exceptions, i.e. bpf_jit_supports_exceptions is true. The arch-specific
> > > > code is really minimal, hence it should be straightforward to extend this
> > > > support to other architectures as well, as it reuses the logic of
> > > > arch_stack_walk, but allowing access to unwind_state data.
> > > >
> > > > Once the stack pointer and frame pointer are known for the main subprog
> > > > during the unwinding, we know the stack layout and location of any
> > > > callee-saved registers which must be restored before we return back to
> > > > the kernel.
> > > >
> > > > This handling will be added in the next patch.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  arch/x86/net/bpf_jit_comp.c | 21 +++++++++++++++++++++
> > > >  include/linux/filter.h      |  2 ++
> > > >  kernel/bpf/core.c           |  9 +++++++++
> > > >  3 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > index 438adb695daa..d326503ce242 100644
> > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > @@ -16,6 +16,7 @@
> > > >  #include <asm/set_memory.h>
> > > >  #include <asm/nospec-branch.h>
> > > >  #include <asm/text-patching.h>
> > > > +#include <asm/unwind.h>
> > > >
> > > >  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
> > > >  {
> > > > @@ -2660,3 +2661,23 @@ void bpf_jit_free(struct bpf_prog *prog)
> > > >
> > > >       bpf_prog_unlock_free(prog);
> > > >  }
> > > > +
> > > > +bool bpf_jit_supports_exceptions(void)
> > > > +{
> > > > +     return IS_ENABLED(CONFIG_UNWINDER_ORC) || IS_ENABLED(CONFIG_UNWINDER_FRAME_POINTER);
> > > > +}
> > > > +
> > > > +void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> > > > +{
> > > > +#if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
> > > > +     struct unwind_state state;
> > > > +     unsigned long addr;
> > > > +
> > > > +     for (unwind_start(&state, current, NULL, NULL); !unwind_done(&state);
> > > > +          unwind_next_frame(&state)) {
> > > > +             addr = unwind_get_return_address(&state);
> > >
> > > I think these steps will work even with UNWINDER_GUESS.
> > > What is the reason for #ifdef ?
> >
> > I think we require both unwind_state::sp and unwind_state::bp, but
> > arch/x86/include/asm/unwind.h does not include unwind_state::bp when
> > both UNWINDER_ORC and UNWINDER_FRAME_POINTER are unset.
> >
> > Although it might be possible to calculate and save bp offset during
> > JIT in bpf_prog_aux (by adding roundup(stack_depth) + 8 (push rax if
> > tail call reachable) + callee_regs_saved) for the subprog
> > corresponding to a frame. Then we can make it work everywhere.
> > The JIT will abstract get_prog_bp(sp) using an arch specific helper.
> >
> > Let me know if I misunderstood something.
>
> JITed progs always have frames. So we're effectively doing
> unconditional UNWINDER_FRAME_POINTER.
> I think the intended usage of arch_bpf_stack_walk() is to only walk
> bpf frames _in this patch set_, if so the extra #ifdefs are misleading.
> If in follow-ups we're going to unwind through JITed progs _and_
> through kfunc/helpers then this ifdef will be necessary.
> I suspect we might want something like this in the future.

I think we actually do unwind through bpf_throw at the very least, so
we are going through both kernel and BPF frames.

> So the ifdef is ok to have from the start, but the comment is necessary
> to describe what it is for.

I'll add the comment in v2.
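
Something along these lines (exact wording TBD):

/* bpf_throw() unwinding needs unwind_state::bp in addition to ::sp to
 * locate the BPF frame and its callee-saved registers, and x86's
 * struct unwind_state only carries bp for the ORC and frame-pointer
 * unwinders, hence the #ifdef (UNWINDER_GUESS is not supported).
 */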

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls
  2023-07-17 17:45       ` Alexei Starovoitov
@ 2023-07-17 19:09         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 19:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, 17 Jul 2023 at 23:15, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 9:36 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 15 Jul 2023 at 04:09, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 08:02:29AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > Now that we allow exception throwing using bpf_throw kfunc, it can
> > > > appear as the final instruction in a prog. When this happens, and we
> > > > begin to unwind the stack using arch_bpf_stack_walk, the instruction
> > > > pointer (IP) may appear to lie outside the JITed instructions. This
> > > > happens because the return address is the instruction following the
> > > > call, but bpf_throw never returns to the program, so the JIT
> > > > considers the instruction ending at the bpf_throw call to be the final
> > > > JITed instruction and the end of the jited_length for the program.
> > > >
> > > > This becomes a problem when we search the IP using is_bpf_text_address
> > > > and bpf_prog_ksym_find, both of which use bpf_ksym_find under the hood,
> > > > and it rightfully considers addr == ksym.end to be outside the program's
> > > > boundaries.
> > > >
> > > > Insert a dummy 'int3' instruction which will never be hit to bump the
> > > > jited_length and allow us to handle programs with their final
> > > > instruction being a call to bpf_throw.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >  arch/x86/net/bpf_jit_comp.c | 11 +++++++++++
> > > >  include/linux/bpf.h         |  2 ++
> > > >  2 files changed, 13 insertions(+)
> > > >
> > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > index 8d97c6a60f9a..052230cc7f50 100644
> > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > @@ -1579,6 +1579,17 @@ st:                    if (is_imm8(insn->off))
> > > >                       }
> > > >                       if (emit_call(&prog, func, image + addrs[i - 1] + offs))
> > > >                               return -EINVAL;
> > > > +                     /* Similar to BPF_EXIT_INSN, call for bpf_throw may be
> > > > +                      * the final instruction in the program. Insert an int3
> > > > +                      * following the call instruction so that we can still
> > > > +                      * detect pc to be part of the bpf_prog in
> > > > +                      * bpf_ksym_find, otherwise when this is the last
> > > > +                      * instruction (as allowed by verifier, similar to exit
> > > > +                      * and jump instructions), pc will be == ksym.end,
> > > > +                      * leading to bpf_throw failing to unwind the stack.
> > > > +                      */
> > > > +                     if (func == (u8 *)&bpf_throw)
> > > > +                             EMIT1(0xCC); /* int3 */
> > >
> > > Probably worth explaining that this happens because bpf_throw is marked
> > > __attribute__((noreturn)) and compiler can emit it last without BPF_EXIT insn.
> > > Meaning the program might not have BPF_EXIT at all.
> >
> > Yes, sorry about omitting that. I will add it to the commit message in v2.
> >
> > >
> > > I wonder though whether this self-inflicted pain is worth it.
> > > May be it shouldn't be marked as noreturn.
> > > What do we gain by marking?
> >
> > It felt like the obvious thing to do to me. The cost on the kernel
> > side is negligible (at least in my opinion), we just have to allow it
> > as the final instruction in the program. If it's heavily used it allows
> > the compiler to better optimize the code (marking anything after it
> > unreachable, no need to save registers etc., although this may not be
> > a persuasive point for you).
>
> "no need to save registers"... "optimize"... that's the thing that worries me.
> I think it's better to drop noreturn attribute.
> bpf has implicit prolog/epilogue that only apply to bpf_exit insn.
> bpf_call insn that doesn't return is exploiting undefined logic
> in the compiler, since we never fully clarified our hidden prologue/epilogue
> rules. Saying it differently, bpf_tail_call is also noreturn,
> but if we mark it as such all kinds of things will break.
> We still need to add alloca(). It doesn't play well with the current BPF ISA.
> I think it's better to treat 'noreturn' as broken in the compiler,
> since its behavior may change.
>

Ok, I think then let's drop this patch and the noreturn attribute on bpf_throw.

> > Regardless of this noreturn attribute, I was thinking whether we
> > should always emit an extra instruction so that any IP (say one past
> > last instruction) we get for a BPF prog can always be seen as
> > belonging to it. It probably is only a problem surfaced by this
> > bpf_throw call at the end, but I was wondering whether doing it
> > unconditionally makes sense.
>
> I think it's a corner case of this 'noreturn' from bpf_call logic.
> The bpf prog is padded with 0xcc before and after already.
> What you're suggesting is to add one of 0xcc to the body of the prog.
> That doesn't sound right.

Actually, this patch was added because I caught a case where the
reported ip during unwinding was lying outside jited_length (i.e. it
was == ksym.end).
Including an extra instruction would prevent that. But it's true that
this is a side effect of the noreturn attribute; it wouldn't occur
otherwise.
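
For reference, the boundary condition that was being tripped is
conceptually the following (simplified sketch; the real bpf_ksym_find()
walks a latch tree, and ip_within_prog() is only an illustration):

static bool ip_within_prog(const struct bpf_ksym *ksym, unsigned long addr)
{
	/* addr == ksym->end is rejected, which is exactly what happens
	 * when a call to a noreturn bpf_throw is the last insn and the
	 * return address points one past the JITed image.
	 */
	return addr >= ksym->start && addr < ksym->end;
}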

Let's drop this patch then.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback
  2023-07-17 17:47       ` Alexei Starovoitov
@ 2023-07-17 19:10         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 19:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, 17 Jul 2023 at 23:17, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 9:47 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 15 Jul 2023 at 04:17, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 08:02:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > By default, the subprog generated by the verifier to handle a thrown
> > > > exception hardcodes a return value of 0. To allow user-defined logic
> > > > and modification of the return value when an exception is thrown,
> > > > introduce the bpf_set_exception_callback kfunc, which installs a
> > > > callback as the default exception handler for the program.
> > > >
> > > > Compared to runtime kfuncs, this kfunc acts as a built-in, i.e. it only
> > > > takes semantic effect during verification, and is erased from the
> > > > program at runtime.
> > > >
> > > > This kfunc can only be called once within a program, and always sets the
> > > > global exception handler, regardless of whether it was invoked in all
> > > > paths of the program or not. The kfunc is idempotent, and the default
> > > > exception callback cannot be modified at runtime.
> > > >
> > > > Allowing modification of the callback for the current program execution
> > > > at runtime leads to issues when the programs begin to nest, as any
> > > > per-CPU state maintaining this information will have to be saved and
> > > > restored. We don't want it to stay in bpf_prog_aux as this takes a
> > > > global effect for all programs. An alternative solution is spilling
> > > > the callback pointer at a known location on the program stack on entry,
> > > > and then passing this location to bpf_throw as a parameter.
> > > >
> > > > However, since exceptions are geared more towards a use case where they
> > > > are ideally never invoked, optimizing for this use case and adding to
> > > > the complexity has diminishing returns.
> > >
> > > Right. No run-time changes pls.
> > >
> >
> > +1
> >
> > > Instead of bpf_set_exception_callback() how about adding a
> > > btf_tag("exception_handler") or better name
> > > and check that such global func is a single func in a program and
> > > it's argument is a single u64.
> > >
> >
> > That does seem better. Also, a conditional bpf_set_exception_callback
> > taking effect globally may be confusing for users.
> > I will switch to the BTF tag.
> >
> > Any specific reason it has to be a global func and cannot have static linkage?
>
> The compiler will warn about the unused static function.
> Even if we silence the warning somehow, the verifier will not verify
> that unused static subprog.

Ah, right. Makes sense. I will change this in v2.
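
For the program side, I expect the declaration to look roughly like the
sketch below (the tag name, macro, and handler name are illustrative
and may change in v2):

/* Global (non-static) function so the verifier sees and verifies it;
 * it takes the bpf_throw() cookie, and its return value becomes the
 * program's return value.
 */
#define __exception_cb __attribute__((btf_decl_tag("exception_handler")))

__exception_cb int my_exception_handler(u64 cookie)
{
	return cookie ? 1 : 0;
}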

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 00/10] Exceptions - 1/2
  2023-07-17 18:15 ` [PATCH bpf-next v1 00/10] Exceptions - 1/2 Daniel Xu
@ 2023-07-17 19:39   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2023-07-17 19:39 UTC (permalink / raw)
  To: Daniel Xu
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, David Vernet

On Mon, 17 Jul 2023 at 23:46, Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> On Thu, Jul 13, 2023 at 08:02:22AM +0530, Kumar Kartikeya Dwivedi wrote:
> > This series implements the _first_ part of the runtime and verifier
> > support needed to enable BPF exceptions. Exceptions thrown from programs
> > are processed as an immediate exit from the program, which unwinds all
> > the active stack frames until the main stack frame, and returns to the
> > BPF program's caller. The ability to perform this unwinding safely
> > allows the program to test conditions that are always true at runtime
> > but which the verifier has no visibility into.
> >
> > Thus, it also reduces verification effort by safely terminating
> > redundant paths that can be taken within a program.
> >
> > The patches to perform runtime resource cleanup during the
> > frame-by-frame unwinding will be posted as a follow-up to this set.
> >
> > It must be noted that exceptions are not an error handling mechanism for
> > unlikely runtime conditions, but a way to safely terminate the execution
> > of a program in presence of conditions that should never occur at
> > runtime. They are meant to serve higher-level primitives such as program
> > assertions.
>
> Sure, that makes sense.
>
> >
> > As such, a program can only install an exception handler once for the
> > lifetime of a BPF program, and this handler cannot be changed at
> > runtime. The purpose of the handler is to simply interpret the cookie
> > value supplied by the bpf_throw call, and execute user-defined logic
> > corresponding to it. The primary purpose of allowing a handler is to
> > control the return value of the program. The default handler returns 0
> > when from the program when an exception is thrown.
> >
> > Fixing the handler for the lifetime of the program eliminates tricky and
> > expensive handling in case of runtime changes of the handler callback
> > when programs begin to nest, where it becomes more complex to save and
> > restore the active handler at runtime.
> >
> > The following kfuncs are introduced:
> >
> > // Throw a BPF exception, terminating the execution of the program.
> > //
> > // @cookie: The cookie that is passed to the exception callback. Without
> > //          an exception callback set by the user, the programs returns
> > //          0 when an exception is thrown.
> > void bpf_throw(u64 cookie);
>
> If developers are only supposed to use higher level primitives, then why
> expose a kfunc for it? The above description makes it sound like this
> should be an implementation detail.
>

I can rephrase, but what I meant to say is that it's not an error
handling mechanism.
But you _can_ also call bpf_throw directly when a check of a condition
that you know always holds fails.
It's not necessary to always use the assert macros; that may not even
be possible, as they require an lvalue, rvalue pair.

If the condition is complicated, e.g. the one below, throwing directly
is totally acceptable when you know it's always going to be true, but
the verifier doesn't know that:

if (data + offset > data_end)
    bpf_throw(XDP_DROP);

This can be from a deeply nested callchain, and it eliminates the need
to handle this condition all the way back to the main prog.
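
Fleshed out a bit (illustrative sketch only; parse_pkt() and the
offsets are made up, and the bpf_throw() declaration just mirrors the
kfunc from this series):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

extern void bpf_throw(__u64 cookie) __ksym;

static __noinline int parse_pkt(struct xdp_md *ctx, __u64 offset)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	if (data + offset > data_end)
		bpf_throw(XDP_DROP);
	/* The verifier now knows data + offset is within bounds here. */
	return 0;
}

SEC("xdp")
int xdp_parser(struct xdp_md *ctx)
{
	/* No need to propagate an error back up: a throw in parse_pkt()
	 * unwinds straight out of the program; with an exception
	 * callback that returns the cookie, the prog returns XDP_DROP.
	 */
	parse_pkt(ctx, 64);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";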

The primary requirement was implementing assertions within a program
which, even when untrue, still ensure that the program terminates
safely. Without exceptions, the user would have to handle the failing
case themselves: freeing any resources and returning from a possibly
deep callchain back to the kernel. Testing a condition updates the
verifier's knowledge about a particular register, so throwing from the
path where the condition is untrue increases the verifier's knowledge
during its symbolic execution while at the same time preserving the
safety guarantees.
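
The assertion macros (patch 09) are essentially sugar over this
pattern; roughly (simplified sketch, the real macros differ in naming
and in how the cookie is chosen):

/* Comparing an lvalue against an rvalue lets the verifier refine its
 * knowledge of the lvalue on the fall-through (non-throwing) path.
 */
#define bpf_assert_lt(lhs, rhs, cookie)			\
	({						\
		if (!((lhs) < (rhs)))			\
			bpf_throw(cookie);		\
	})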

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-07-17 19:40 UTC | newest]

Thread overview: 34+ messages
2023-07-13  2:32 [PATCH bpf-next v1 00/10] Exceptions - 1/2 Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 01/10] bpf: Fix kfunc callback register type handling Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 02/10] bpf: Fix subprog idx logic in check_max_stack_depth Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 03/10] bpf: Repeat check_max_stack_depth for async callbacks Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 04/10] bpf: Add support for inserting new subprogs Kumar Kartikeya Dwivedi
2023-07-14 21:58   ` Alexei Starovoitov
2023-07-17 16:21     ` Kumar Kartikeya Dwivedi
2023-07-17 17:24       ` Alexei Starovoitov
2023-07-17 19:05         ` Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 05/10] arch/x86: Implement arch_bpf_stack_walk Kumar Kartikeya Dwivedi
2023-07-14 22:05   ` Alexei Starovoitov
2023-07-17 16:29     ` Kumar Kartikeya Dwivedi
2023-07-17 17:29       ` Alexei Starovoitov
2023-07-17 19:07         ` Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 06/10] bpf: Implement bpf_throw kfunc Kumar Kartikeya Dwivedi
2023-07-14 22:34   ` Alexei Starovoitov
2023-07-17 16:43     ` Kumar Kartikeya Dwivedi
2023-07-17 17:32       ` Alexei Starovoitov
2023-07-14 22:51   ` Alexei Starovoitov
2023-07-17 16:44     ` Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 07/10] bpf: Ensure IP is within prog->jited_length for bpf_throw calls Kumar Kartikeya Dwivedi
2023-07-14 22:39   ` Alexei Starovoitov
2023-07-17 16:36     ` Kumar Kartikeya Dwivedi
2023-07-17 17:45       ` Alexei Starovoitov
2023-07-17 19:09         ` Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 08/10] bpf: Introduce bpf_set_exception_callback Kumar Kartikeya Dwivedi
2023-07-14 22:47   ` Alexei Starovoitov
2023-07-17 16:46     ` Kumar Kartikeya Dwivedi
2023-07-17 17:47       ` Alexei Starovoitov
2023-07-17 19:10         ` Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 09/10] selftests/bpf: Add BPF assertion macros Kumar Kartikeya Dwivedi
2023-07-13  2:32 ` [PATCH bpf-next v1 10/10] selftests/bpf: Add tests for BPF exceptions Kumar Kartikeya Dwivedi
2023-07-17 18:15 ` [PATCH bpf-next v1 00/10] Exceptions - 1/2 Daniel Xu
2023-07-17 19:39   ` Kumar Kartikeya Dwivedi
