* [PATCH bpf-next 0/5] bpf: Remove recursion check for struct_ops prog
From: Martin KaFai Lau @ 2022-09-22 22:56 UTC
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, netdev

From: Martin KaFai Lau <martin.lau@kernel.org>

The struct_ops trampoline currently shares the tracing
trampoline's enter/exit functions, which track prog->active to
avoid recursion.  It turns out a struct_ops bpf prog can trip
this prog->active check and running the struct_ops prog is then
unnecessarily skipped.  eg. '.ssthresh' may run in_task() and
then be interrupted by a softirq that runs the same '.ssthresh'
for a different sk.

The kernel does not call a tcp-cc's ops recursively on the same
sk, so this set removes the recursion check for struct_ops progs.
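
For reference, the shared enter function used today looks roughly
like this (a simplified sketch; the prog->active check is what
makes the trampoline skip the prog on re-entry):

u64 notrace __bpf_prog_enter(struct bpf_prog *prog,
			     struct bpf_tramp_run_ctx *run_ctx)
{
	rcu_read_lock();
	migrate_disable();

	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		/* already running on this cpu: return 0 so the
		 * trampoline skips calling the prog
		 */
		return 0;
	}

	return bpf_prog_start_time();
}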

Martin KaFai Lau (5):
  bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
  bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
  bpf: Add bpf_run_ctx_type
  bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
  selftests/bpf: Check -EBUSY for the recurred
    bpf_setsockopt(TCP_CONGESTION)

 arch/x86/net/bpf_jit_comp.c                   |  3 ++
 include/linux/bpf.h                           | 21 ++++++--
 include/linux/filter.h                        |  3 ++
 kernel/bpf/bpf_iter.c                         |  2 +-
 kernel/bpf/cgroup.c                           |  2 +-
 kernel/bpf/trampoline.c                       | 27 ++++++++++
 kernel/trace/bpf_trace.c                      |  1 +
 net/bpf/test_run.c                            |  2 +-
 net/core/filter.c                             | 17 +++---
 net/ipv4/bpf_tcp_ca.c                         | 54 +++++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |  4 ++
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 23 +++++---
 12 files changed, 139 insertions(+), 20 deletions(-)

-- 
2.30.2



* [PATCH bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
From: Martin KaFai Lau @ 2022-09-22 22:56 UTC
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, netdev, Martin KaFai Lau

From: Martin KaFai Lau <martin.lau@kernel.org>

The struct_ops prog allows using bpf to implement the functions in
a struct (eg. the ops of a kernel module).  The current usage is to
implement the tcp_congestion_ops.  The kernel does not call a
tcp-cc's ops (ie. the bpf prog) recursively for the same sk.

The struct_ops trampoline currently shares the tracing trampoline's
enter/exit functions, which track prog->active to avoid recursion.
That tracking is needed for tracing progs.  However, it turns out
a struct_ops bpf prog can trip this prog->active check and running
the struct_ops prog is then unnecessarily skipped.  eg. '.ssthresh'
may run in_task() and then be interrupted by a softirq that runs
the same '.ssthresh' for a different sk.  Skipping the '.ssthresh'
run ends up returning a random value to the caller.

This patch adds __bpf_prog_{enter,exit}_struct_ops for the
struct_ops trampoline.  They do not track prog->active to
detect recursion.

One exception is when the tcp_congestion_ops' '.init' ops does
bpf_setsockopt(TCP_CONGESTION) and then recurs into the same
'.init' ops.  This will be addressed in the following patches.

Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c |  3 +++
 include/linux/bpf.h         |  4 ++++
 kernel/bpf/trampoline.c     | 23 +++++++++++++++++++++++
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ae89f4143eb4..58a131dec954 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1836,6 +1836,9 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
 	if (p->aux->sleepable) {
 		enter = __bpf_prog_enter_sleepable;
 		exit = __bpf_prog_exit_sleepable;
+	} else if (p->type == BPF_PROG_TYPE_STRUCT_OPS) {
+		enter = __bpf_prog_enter_struct_ops;
+		exit = __bpf_prog_exit_struct_ops;
 	} else if (p->expected_attach_type == BPF_LSM_CGROUP) {
 		enter = __bpf_prog_enter_lsm_cgroup;
 		exit = __bpf_prog_exit_lsm_cgroup;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edd43edb27d6..6fdbc1398b8a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -864,6 +864,10 @@ u64 notrace __bpf_prog_enter_lsm_cgroup(struct bpf_prog *prog,
 					struct bpf_tramp_run_ctx *run_ctx);
 void notrace __bpf_prog_exit_lsm_cgroup(struct bpf_prog *prog, u64 start,
 					struct bpf_tramp_run_ctx *run_ctx);
+u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
+					struct bpf_tramp_run_ctx *run_ctx);
+void notrace __bpf_prog_exit_struct_ops(struct bpf_prog *prog, u64 start,
+					struct bpf_tramp_run_ctx *run_ctx);
 void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr);
 void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr);
 
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 41b67eb83ab3..e6551e4a6064 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -976,6 +976,29 @@ void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start,
 	rcu_read_unlock_trace();
 }
 
+u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
+					struct bpf_tramp_run_ctx *run_ctx)
+	__acquires(RCU)
+{
+	rcu_read_lock();
+	migrate_disable();
+
+	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
+
+	return bpf_prog_start_time();
+}
+
+void notrace __bpf_prog_exit_struct_ops(struct bpf_prog *prog, u64 start,
+					struct bpf_tramp_run_ctx *run_ctx)
+	__releases(RCU)
+{
+	bpf_reset_run_ctx(run_ctx->saved_run_ctx);
+
+	update_prog_stats(prog, start);
+	migrate_enable();
+	rcu_read_unlock();
+}
+
 void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr)
 {
 	percpu_ref_get(&tr->pcref);
-- 
2.30.2



* [PATCH bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
From: Martin KaFai Lau @ 2022-09-22 22:56 UTC
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, netdev, Martin KaFai Lau

From: Martin KaFai Lau <martin.lau@kernel.org>

The check on the tcp-cc "cdg" is done in bpf_sk_setsockopt() which is
used by the bpf_tcp_ca, bpf_lsm, cg_sockopt, and tcp_iter hooks.
However, it is not done for cg sock_addr, cg sockops, and some of
the bpf_lsm_cgroup hooks.

The tcp-cc "cdg" should have very limited usage.  This patch moves
the "cdg" check to the common sol_tcp_sockopt() so that all
hooks have a consistent behavior.  The motivation for making
this check consistent now is that a later patch needs
to expose _bpf_setsockopt() for the bpf_tcp_ca to use, and that
also requires the "cdg" check.
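
For illustration, after this change a bpf prog gets the same result
from any of these hooks when it tries to switch to "cdg" (a fragment
modeled on the bpf_dctcp selftest later in this set; 'sk' is whatever
socket the hook provides):

	const char tcp_cdg[] = "cdg";
	int err;

	/* Expected to fail with -ENOTSUPP from every hook that
	 * goes through the common sol_tcp_sockopt()
	 */
	err = bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
			     (void *)tcp_cdg, sizeof(tcp_cdg));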

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 net/core/filter.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2fd9449026aa..f4cea3ff994a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5127,6 +5127,13 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
 	case TCP_CONGESTION:
 		if (*optlen < 2)
 			return -EINVAL;
+		/* "cdg" is the only cc that alloc a ptr
+		 * in inet_csk_ca area.  The bpf-tcp-cc may
+		 * overwrite this ptr after switching to cdg.
+		 */
+		if (!getopt && *optlen >= sizeof("cdg") - 1 &&
+		    !strncmp("cdg", optval, *optlen))
+			return -ENOTSUPP;
 		break;
 	case TCP_SAVED_SYN:
 		if (*optlen < 1)
@@ -5285,12 +5292,6 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
 BPF_CALL_5(bpf_sk_setsockopt, struct sock *, sk, int, level,
 	   int, optname, char *, optval, int, optlen)
 {
-	if (level == SOL_TCP && optname == TCP_CONGESTION) {
-		if (optlen >= sizeof("cdg") - 1 &&
-		    !strncmp("cdg", optval, optlen))
-			return -ENOTSUPP;
-	}
-
 	return _bpf_setsockopt(sk, level, optname, optval, optlen);
 }
 
-- 
2.30.2



* [PATCH bpf-next 3/5] bpf: Add bpf_run_ctx_type
From: Martin KaFai Lau @ 2022-09-22 22:56 UTC
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, netdev, Martin KaFai Lau

From: Martin KaFai Lau <martin.lau@kernel.org>

This patch adds a bpf_run_ctx_type to struct bpf_run_ctx.
The next patch needs to look at the previous run ctx saved in
tramp_run_ctx->saved_run_ctx and check whether it is also
changing the tcp-cc for the same sk (saved in bpf_cookie).
Thus, it needs to know whether the saved_run_ctx is the
bpf_run_ctx type it is looking for before looking into its members.
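
Condensed, the check that the next patch does with this type
looks like:

	run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
	if (run_ctx->saved_run_ctx &&
	    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS) {
		/* Only a bpf_tramp_run_ctx is tagged with this type,
		 * so the cast is safe.
		 */
		saved_run_ctx =
			(struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
		/* ... saved_run_ctx->bpf_cookie can now be compared
		 * against the sk being changed ...
		 */
	}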

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 include/linux/bpf.h      | 17 ++++++++++++++---
 kernel/bpf/bpf_iter.c    |  2 +-
 kernel/bpf/cgroup.c      |  2 +-
 kernel/bpf/trampoline.c  |  4 ++++
 kernel/trace/bpf_trace.c |  1 +
 net/bpf/test_run.c       |  2 +-
 6 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6fdbc1398b8a..902b1be047cf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1517,7 +1517,18 @@ int bpf_prog_array_copy(struct bpf_prog_array *old_array,
 			u64 bpf_cookie,
 			struct bpf_prog_array **new_array);
 
-struct bpf_run_ctx {};
+enum bpf_run_ctx_type {
+	BPF_RUN_CTX_TYPE_NONE,
+	BPF_RUN_CTX_TYPE_CG,
+	BPF_RUN_CTX_TYPE_TRACE,
+	BPF_RUN_CTX_TYPE_TRAMP,
+	BPF_RUN_CTX_TYPE_KPROBE_MULTI,
+	BPF_RUN_CTX_TYPE_STRUCT_OPS,
+};
+
+struct bpf_run_ctx {
+	enum bpf_run_ctx_type type;
+};
 
 struct bpf_cg_run_ctx {
 	struct bpf_run_ctx run_ctx;
@@ -1568,7 +1579,7 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
 	const struct bpf_prog_array_item *item;
 	const struct bpf_prog *prog;
 	struct bpf_run_ctx *old_run_ctx;
-	struct bpf_trace_run_ctx run_ctx;
+	struct bpf_trace_run_ctx run_ctx = { .run_ctx.type = BPF_RUN_CTX_TYPE_TRACE };
 	u32 ret = 1;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
@@ -1607,7 +1618,7 @@ bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
 	const struct bpf_prog *prog;
 	const struct bpf_prog_array *array;
 	struct bpf_run_ctx *old_run_ctx;
-	struct bpf_trace_run_ctx run_ctx;
+	struct bpf_trace_run_ctx run_ctx = { .run_ctx.type = BPF_RUN_CTX_TYPE_TRACE };
 	u32 ret = 1;
 
 	might_fault();
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 5dc307bdeaeb..65ff0c93b0ba 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -694,7 +694,7 @@ struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop)
 
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
 {
-	struct bpf_run_ctx run_ctx, *old_run_ctx;
+	struct bpf_run_ctx run_ctx = {}, *old_run_ctx;
 	int ret;
 
 	if (prog->aux->sleepable) {
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 00c7f864900e..850fd6983b9a 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -37,7 +37,7 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 	const struct bpf_prog *prog;
 	const struct bpf_prog_array *array;
 	struct bpf_run_ctx *old_run_ctx;
-	struct bpf_cg_run_ctx run_ctx;
+	struct bpf_cg_run_ctx run_ctx = { .run_ctx.type = BPF_RUN_CTX_TYPE_CG };
 	u32 func_ret;
 
 	run_ctx.retval = retval;
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index e6551e4a6064..313619012a59 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -882,6 +882,7 @@ u64 notrace __bpf_prog_enter(struct bpf_prog *prog, struct bpf_tramp_run_ctx *ru
 	rcu_read_lock();
 	migrate_disable();
 
+	run_ctx->run_ctx.type = BPF_RUN_CTX_TYPE_TRAMP;
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
 	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
@@ -934,6 +935,7 @@ u64 notrace __bpf_prog_enter_lsm_cgroup(struct bpf_prog *prog,
 	rcu_read_lock();
 	migrate_disable();
 
+	run_ctx->run_ctx.type = BPF_RUN_CTX_TYPE_TRAMP;
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
 	return NO_START_TIME;
@@ -960,6 +962,7 @@ u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog, struct bpf_tramp_r
 		return 0;
 	}
 
+	run_ctx->run_ctx.type = BPF_RUN_CTX_TYPE_TRAMP;
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
 	return bpf_prog_start_time();
@@ -983,6 +986,7 @@ u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
 	rcu_read_lock();
 	migrate_disable();
 
+	run_ctx->run_ctx.type = BPF_RUN_CTX_TYPE_STRUCT_OPS;
 	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
 
 	return bpf_prog_start_time();
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b05f0310dbd3..7670ca88b721 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2575,6 +2575,7 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
 			   unsigned long entry_ip, struct pt_regs *regs)
 {
 	struct bpf_kprobe_multi_run_ctx run_ctx = {
+		.run_ctx.type = BPF_RUN_CTX_TYPE_KPROBE_MULTI,
 		.link = link,
 		.entry_ip = entry_ip,
 	};
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 13d578ce2a09..1f2a745e8641 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -374,7 +374,7 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 {
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
-	struct bpf_cg_run_ctx run_ctx;
+	struct bpf_cg_run_ctx run_ctx = { .run_ctx.type = BPF_RUN_CTX_TYPE_CG };
 	struct bpf_test_timer t = { NO_MIGRATE };
 	enum bpf_cgroup_storage_type stype;
 	int ret;
-- 
2.30.2



* [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
From: Martin KaFai Lau @ 2022-09-22 22:56 UTC
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, netdev, Martin KaFai Lau

From: Martin KaFai Lau <martin.lau@kernel.org>

When a bad bpf prog '.init' calls
bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:

.init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
... => .init => bpf_setsockopt(tcp_cc).

It used to be prevented by the prog->active counter, but the
prog->active detection cannot be used in struct_ops as explained
in the earlier patch of this set.

In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
in order to break the loop.  This is done by checking whether the
previous bpf_run_ctx has saved the same sk pointer in its
bpf_cookie.

Note that this essentially means only the first '.init' can
call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. when
the peer does not support ECN) and the second '.init' cannot fall
back to another cc.  This applies even when the second
bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
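
For illustration, a sketch of the kind of bad '.init' this patch
guards against (a hypothetical bpf-tcp-cc registered under the name
"bpf_bad_cc"; with this patch the inner bpf_setsockopt() returns
-EBUSY instead of re-entering '.init'):

const char bad_cc[] = "bpf_bad_cc";	/* this prog's own tcp-cc name */

SEC("struct_ops/bad_cc_init")
void BPF_PROG(bad_cc_init, struct sock *sk)
{
	/* Re-selecting itself makes the kernel call this '.init'
	 * again, which calls bpf_setsockopt() again, and so on.
	 */
	bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
		       (void *)bad_cc, sizeof(bad_cc));
}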

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 include/linux/filter.h |  3 +++
 net/core/filter.c      |  4 ++--
 net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 98e28126c24b..9942ecc68a45 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
 bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
 void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
 
+int _bpf_setsockopt(struct sock *sk, int level, int optname,
+		    char *optval, int optlen);
+
 u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 #define __bpf_call_base_args \
 	((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
diff --git a/net/core/filter.c b/net/core/filter.c
index f4cea3ff994a..e56a1ebcf1bc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
 	return -EINVAL;
 }
 
-static int _bpf_setsockopt(struct sock *sk, int level, int optname,
-			   char *optval, int optlen)
+int _bpf_setsockopt(struct sock *sk, int level, int optname,
+		    char *optval, int optlen)
 {
 	if (sk_fullsock(sk))
 		sock_owned_by_me(sk);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 6da16ae6a962..a9f2cab5ffbc 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
+	int ret;
+
+	if (optname != TCP_CONGESTION)
+		return _bpf_setsockopt(sk, level, optname, optval, optlen);
+
+	run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
+	if (unlikely(run_ctx->saved_run_ctx &&
+		     run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
+		saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
+		/* It stops this looping
+		 *
+		 * .init => bpf_setsockopt(tcp_cc) => .init =>
+		 * bpf_setsockopt(tcp_cc)" => .init => ....
+		 *
+		 * The second bpf_setsockopt(tcp_cc) is not allowed
+		 * in order to break the loop when both .init
+		 * are the same bpf prog.
+		 *
+		 * This applies even the second bpf_setsockopt(tcp_cc)
+		 * does not cause a loop.  This limits only the first
+		 * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
+		 * pick a fallback cc (eg. peer does not support ECN)
+		 * and the second '.init' cannot fallback to
+		 * another cc.
+		 */
+		if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
+			return -EBUSY;
+	}
+
+	run_ctx->bpf_cookie = (uintptr_t)sk;
+	ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
+	run_ctx->bpf_cookie = 0;
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_init_ops_setsockopt_proto = {
+	.func		= bpf_init_ops_setsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM | MEM_RDONLY,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 static u32 prog_ops_moff(const struct bpf_prog *prog)
 {
 	const struct btf_member *m;
@@ -169,6 +220,9 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
 	case BPF_FUNC_sk_storage_delete:
 		return &bpf_sk_storage_delete_proto;
 	case BPF_FUNC_setsockopt:
+		if (prog_ops_moff(prog) ==
+		    offsetof(struct tcp_congestion_ops, init))
+			return &bpf_init_ops_setsockopt_proto;
 		/* Does not allow release() to call setsockopt.
 		 * release() is called when the current bpf-tcp-cc
 		 * is retiring.  It is not allowed to call
-- 
2.30.2



* [PATCH bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
From: Martin KaFai Lau @ 2022-09-22 22:56 UTC
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, netdev, Martin KaFai Lau

From: Martin KaFai Lau <martin.lau@kernel.org>

This patch changes the bpf_dctcp test to ensure the recurred
bpf_setsockopt(TCP_CONGESTION) returns -EBUSY.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |  4 ++++
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 23 ++++++++++++++-----
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index 2959a52ced06..930d2e6b3d5e 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -290,6 +290,10 @@ static void test_dctcp_fallback(void)
 		goto done;
 	ASSERT_STREQ(dctcp_skel->bss->cc_res, "cubic", "cc_res");
 	ASSERT_EQ(dctcp_skel->bss->tcp_cdg_res, -ENOTSUPP, "tcp_cdg_res");
+	/* All setsockopt(TCP_CONGESTION) in the recurred
+	 * bpf_dctcp->init() should fail with -EBUSY.
+	 */
+	ASSERT_EQ(dctcp_skel->bss->ebusy_cnt, 4, "ebusy_cnt");
 
 	err = getsockopt(srv_fd, SOL_TCP, TCP_CONGESTION, srv_cc, &cc_len);
 	if (!ASSERT_OK(err, "getsockopt(srv_fd, TCP_CONGESTION)"))
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
index 9573be6122be..0cab241c33b5 100644
--- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -11,6 +11,7 @@
 #include <linux/types.h>
 #include <linux/stddef.h>
 #include <linux/tcp.h>
+#include <errno.h>
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
 #include "bpf_tcp_helpers.h"
@@ -23,6 +24,7 @@ const char tcp_cdg[] = "cdg";
 char cc_res[TCP_CA_NAME_MAX];
 int tcp_cdg_res = 0;
 int stg_result = 0;
+int ebusy_cnt = 0;
 
 struct {
 	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
@@ -64,19 +66,28 @@ void BPF_PROG(dctcp_init, struct sock *sk)
 
 	if (!(tp->ecn_flags & TCP_ECN_OK) && fallback[0]) {
 		/* Switch to fallback */
-		bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
-			       (void *)fallback, sizeof(fallback));
+		if (bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+				   (void *)fallback, sizeof(fallback)) == -EBUSY)
+			ebusy_cnt++;
+
 		/* Switch back to myself which the bpf trampoline
 		 * stopped calling dctcp_init recursively.
 		 */
-		bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
-			       (void *)bpf_dctcp, sizeof(bpf_dctcp));
+		if (bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+				   (void *)bpf_dctcp, sizeof(bpf_dctcp)) == -EBUSY)
+			ebusy_cnt++;
+
 		/* Switch back to fallback */
-		bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
-			       (void *)fallback, sizeof(fallback));
+		if (bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+				   (void *)fallback, sizeof(fallback)) == -EBUSY)
+			ebusy_cnt++;
+
 		/* Expecting -ENOTSUPP for tcp_cdg_res */
 		tcp_cdg_res = bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
 					     (void *)tcp_cdg, sizeof(tcp_cdg));
+		if (tcp_cdg_res == -EBUSY)
+			ebusy_cnt++;
+
 		bpf_getsockopt(sk, SOL_TCP, TCP_CONGESTION,
 			       (void *)cc_res, sizeof(cc_res));
 		return;
-- 
2.30.2



* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
From: Alexei Starovoitov @ 2022-09-23  0:12 UTC
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Network Development, Martin KaFai Lau

On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> From: Martin KaFai Lau <martin.lau@kernel.org>
>
> When a bad bpf prog '.init' calls
> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>
> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> ... => .init => bpf_setsockopt(tcp_cc).
>
> It was prevented by the prog->active counter before but the prog->active
> detection cannot be used in struct_ops as explained in the earlier
> patch of the set.
>
> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> in order to break the loop.  This is done by checking the
> previous bpf_run_ctx has saved the same sk pointer in the
> bpf_cookie.
>
> Note that this essentially limits only the first '.init' can
> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> does not support ECN) and the second '.init' cannot fallback to
> another cc.  This applies even the second
> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> ---
>  include/linux/filter.h |  3 +++
>  net/core/filter.c      |  4 ++--
>  net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 59 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 98e28126c24b..9942ecc68a45 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
>  bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
>  void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
>
> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> +                   char *optval, int optlen);
> +
>  u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  #define __bpf_call_base_args \
>         ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
> diff --git a/net/core/filter.c b/net/core/filter.c
> index f4cea3ff994a..e56a1ebcf1bc 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>         return -EINVAL;
>  }
>
> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> -                          char *optval, int optlen)
> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> +                   char *optval, int optlen)
>  {
>         if (sk_fullsock(sk))
>                 sock_owned_by_me(sk);
> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> index 6da16ae6a962..a9f2cab5ffbc 100644
> --- a/net/ipv4/bpf_tcp_ca.c
> +++ b/net/ipv4/bpf_tcp_ca.c
> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
>         .arg2_type      = ARG_ANYTHING,
>  };
>
> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> +          int, optname, char *, optval, int, optlen)
> +{
> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
> +       int ret;
> +
> +       if (optname != TCP_CONGESTION)
> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
> +
> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
> +       if (unlikely(run_ctx->saved_run_ctx &&
> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
> +               /* It stops this looping
> +                *
> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
> +                * bpf_setsockopt(tcp_cc)" => .init => ....
> +                *
> +                * The second bpf_setsockopt(tcp_cc) is not allowed
> +                * in order to break the loop when both .init
> +                * are the same bpf prog.
> +                *
> +                * This applies even the second bpf_setsockopt(tcp_cc)
> +                * does not cause a loop.  This limits only the first
> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> +                * pick a fallback cc (eg. peer does not support ECN)
> +                * and the second '.init' cannot fallback to
> +                * another cc.
> +                */
> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
> +                       return -EBUSY;
> +       }
> +
> +       run_ctx->bpf_cookie = (uintptr_t)sk;
> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
> +       run_ctx->bpf_cookie = 0;

Instead of adding 4 bytes for the enum in patch 3
(which will be 8 bytes due to alignment)
and abusing bpf_cookie here
(a struct_ops bpf prog might eventually read it and be surprised
to find an sk pointer in there),
how about adding 'struct task_struct *saved_current' as another
field in bpf_tramp_run_ctx?
Always store the current task in there in __bpf_prog_enter_struct_ops
and then compare it here in this specialized bpf_init_ops_setsockopt?

Or maybe always check in __bpf_prog_enter_struct_ops:
if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
run_ctx)->saved_current == current) // goto out since recursion?
It would prevent issues in cases we don't know about and would
also handle the good recursion case explained in patch 1.
I'm assuming the 2nd ssthresh runs in a different task..
Or is it actually the same task?


* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
From: Martin KaFai Lau @ 2022-09-23  1:11 UTC
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Network Development

On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>
>> When a bad bpf prog '.init' calls
>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>
>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>> ... => .init => bpf_setsockopt(tcp_cc).
>>
>> It was prevented by the prog->active counter before but the prog->active
>> detection cannot be used in struct_ops as explained in the earlier
>> patch of the set.
>>
>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
>> in order to break the loop.  This is done by checking the
>> previous bpf_run_ctx has saved the same sk pointer in the
>> bpf_cookie.
>>
>> Note that this essentially limits only the first '.init' can
>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
>> does not support ECN) and the second '.init' cannot fallback to
>> another cc.  This applies even the second
>> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>>
>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>> ---
>>   include/linux/filter.h |  3 +++
>>   net/core/filter.c      |  4 ++--
>>   net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 59 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 98e28126c24b..9942ecc68a45 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
>>   bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
>>   void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
>>
>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>> +                   char *optval, int optlen);
>> +
>>   u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>   #define __bpf_call_base_args \
>>          ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index f4cea3ff994a..e56a1ebcf1bc 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>>          return -EINVAL;
>>   }
>>
>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>> -                          char *optval, int optlen)
>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>> +                   char *optval, int optlen)
>>   {
>>          if (sk_fullsock(sk))
>>                  sock_owned_by_me(sk);
>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>> index 6da16ae6a962..a9f2cab5ffbc 100644
>> --- a/net/ipv4/bpf_tcp_ca.c
>> +++ b/net/ipv4/bpf_tcp_ca.c
>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
>>          .arg2_type      = ARG_ANYTHING,
>>   };
>>
>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>> +          int, optname, char *, optval, int, optlen)
>> +{
>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
>> +       int ret;
>> +
>> +       if (optname != TCP_CONGESTION)
>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
>> +
>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
>> +       if (unlikely(run_ctx->saved_run_ctx &&
>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
>> +               /* It stops this looping
>> +                *
>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
>> +                *
>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
>> +                * in order to break the loop when both .init
>> +                * are the same bpf prog.
>> +                *
>> +                * This applies even the second bpf_setsockopt(tcp_cc)
>> +                * does not cause a loop.  This limits only the first
>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
>> +                * pick a fallback cc (eg. peer does not support ECN)
>> +                * and the second '.init' cannot fallback to
>> +                * another cc.
>> +                */
>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
>> +                       return -EBUSY;
>> +       }
>> +
>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
>> +       run_ctx->bpf_cookie = 0;
> 
> Instead of adding 4 bytes for enum in patch 3
> (which will be 8 bytes due to alignment)
> and abusing bpf_cookie here
> (which struct_ops bpf prog might eventually read and be surprised
> to find sk pointer in there)
> how about adding 'struct task_struct *saved_current' as another arg
> to bpf_tramp_run_ctx ?
> Always store the current task in there in prog_entry_struct_ops
> and then compare it here in this specialized bpf_init_ops_setsockopt?
> 
> Or maybe always check in enter_prog_struct_ops:
> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> run_ctx)->saved_current == current) // goto out since recursion?
> it will prevent issues in case we don't know about and will
> address the good recursion case as explained in patch 1?
> I'm assuming 2nd ssthresh runs in a different task..
> Or is it actually the same task?

The 2nd ssthresh() should run in the same task but for a different sk.
The first ssthresh(sk[1]) was run in process context (in_task()) and
then got interrupted.  The softirq then handles the rcv path, which
just happens to also call ssthresh(sk[2]) in the unlikely pkt-loss
case.  It is like ssthresh(sk[1]) => softirq => ssthresh(sk[2]).

The tcp-cc ops can recur but cannot recur on the same sk because it
requires holding the sk lock, so the patch remembers the previous sk
to ensure it does not recur on the same sk.  Then it needs to peek
into the previous run ctx, which may not always be a
bpf_tramp_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc"), which will then call
a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely
peek at the previous bpf_run_ctx.

Since struct_ops is the only one that needs to peek into the previous
run_ctx (through tramp_run_ctx->saved_run_ctx), instead of adding 4
bytes to the bpf_run_ctx, one idea that just came to my mind is to
use one bit in the tramp_run_ctx->saved_run_ctx pointer itself.
Something like this if it reuses the bpf_cookie (probably missing
some int/ptr type casting):

#define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL

u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
					struct bpf_tramp_run_ctx *run_ctx)
	__acquires(RCU)
{
	rcu_read_lock();
	migrate_disable();

	run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
					BPF_RUN_CTX_STRUCT_OPS_BIT);

	return bpf_prog_start_time();
}

BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
	   int, optname, char *, optval, int, optlen)
{
	/* ... */
	if (unlikely((run_ctx->saved_run_ctx &
		      BPF_RUN_CTX_STRUCT_OPS_BIT) && ...)) {
		/* ... */
		if (bpf_cookie == (uintptr_t)sk)
			return -EBUSY;
	}
}


* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
From: Hao Luo @ 2022-09-23  1:59 UTC
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Network Development

On Thu, Sep 22, 2022 at 6:12 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
[...]
> > Instead of adding 4 bytes for enum in patch 3
> > (which will be 8 bytes due to alignment)
> > and abusing bpf_cookie here
> > (which struct_ops bpf prog might eventually read and be surprised
> > to find sk pointer in there)
> > how about adding 'struct task_struct *saved_current' as another arg
> > to bpf_tramp_run_ctx ?
> > Always store the current task in there in prog_entry_struct_ops
> > and then compare it here in this specialized bpf_init_ops_setsockopt?
> >
> > Or maybe always check in enter_prog_struct_ops:
> > if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> > run_ctx)->saved_current == current) // goto out since recursion?
> > it will prevent issues in case we don't know about and will
> > address the good recursion case as explained in patch 1?
> > I'm assuming 2nd ssthresh runs in a different task..
> > Or is it actually the same task?
>
> The 2nd ssthresh() should run in the same task but different sk.  The
> first ssthresh(sk[1]) was run in_task() context and then got
> interrupted.  The softirq then handles the rcv path which just happens
> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>
> The tcp-cc ops can recur but cannot recur on the same sk because it
> requires to hold the sk lock, so the patch remembers what was the
> previous sk to ensure it does not recur on the same sk.  Then it needs
> to peek into the previous run ctx which may not always be
> bpf_trump_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
> the previous bpf_run_ctx.
>
> Since struct_ops is the only one that needs to peek into the previous
> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
> bytes to the bpf_run_ctx, one idea just came to my mind is to use one
> bit in the tramp_run_ctx->saved_run_ctx pointer itsef.  Something like
> this if it reuses the bpf_cookie (probably missed some int/ptr type
> casting):
>
> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>
> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
>                                      struct bpf_tramp_run_ctx *run_ctx)
>          __acquires(RCU)
> {
>         rcu_read_lock();
>         migrate_disable();
>
>         run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
>                                         BPF_RUN_CTX_STRUCT_OPS_BIT);
>
>          return bpf_prog_start_time();
> }
>
> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>             int, optname, char *, optval, int, optlen)
> {
>         /* ... */
>         if (unlikely((run_ctx->saved_run_ctx &
>                         BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
>                 /* ... */
>                 if (bpf_cookie == (uintptr_t)sk)
>                         return -EBUSY;
>         }
>
> }

If I understand correctly, the purpose of adding a field in run_ctx is
to be able to tell the enclosing type of a generic bpf_run_ctx.

In the following lines:

> >> +       if (unlikely(run_ctx->saved_run_ctx &&
> >> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> >> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;

the enclosing type of run_ctx->saved_run_ctx is a bpf_tramp_run_ctx,
so we can safely type cast it and further check sk.

The best way I can come up with is also what Martin suggests:
encoding the type information in the lower bits of the saved_run_ctx
field.

Hao


* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
From: Alexei Starovoitov @ 2022-09-23 15:26 UTC
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Network Development

On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
> > On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >>
> >> From: Martin KaFai Lau <martin.lau@kernel.org>
> >>
> >> When a bad bpf prog '.init' calls
> >> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >>
> >> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> >> ... => .init => bpf_setsockopt(tcp_cc).
> >>
> >> It was prevented by the prog->active counter before but the prog->active
> >> detection cannot be used in struct_ops as explained in the earlier
> >> patch of the set.
> >>
> >> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> >> in order to break the loop.  This is done by checking the
> >> previous bpf_run_ctx has saved the same sk pointer in the
> >> bpf_cookie.
> >>
> >> Note that this essentially limits only the first '.init' can
> >> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> >> does not support ECN) and the second '.init' cannot fallback to
> >> another cc.  This applies even the second
> >> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
> >>
> >> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> >> ---
> >>   include/linux/filter.h |  3 +++
> >>   net/core/filter.c      |  4 ++--
> >>   net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
> >>   3 files changed, 59 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >> index 98e28126c24b..9942ecc68a45 100644
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
> >>   bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
> >>   void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
> >>
> >> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >> +                   char *optval, int optlen);
> >> +
> >>   u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> >>   #define __bpf_call_base_args \
> >>          ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index f4cea3ff994a..e56a1ebcf1bc 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> >>          return -EINVAL;
> >>   }
> >>
> >> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >> -                          char *optval, int optlen)
> >> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >> +                   char *optval, int optlen)
> >>   {
> >>          if (sk_fullsock(sk))
> >>                  sock_owned_by_me(sk);
> >> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> >> index 6da16ae6a962..a9f2cab5ffbc 100644
> >> --- a/net/ipv4/bpf_tcp_ca.c
> >> +++ b/net/ipv4/bpf_tcp_ca.c
> >> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
> >>          .arg2_type      = ARG_ANYTHING,
> >>   };
> >>
> >> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> >> +          int, optname, char *, optval, int, optlen)
> >> +{
> >> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
> >> +       int ret;
> >> +
> >> +       if (optname != TCP_CONGESTION)
> >> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
> >> +
> >> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
> >> +       if (unlikely(run_ctx->saved_run_ctx &&
> >> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> >> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
> >> +               /* It stops this looping
> >> +                *
> >> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
> >> +                * bpf_setsockopt(tcp_cc)" => .init => ....
> >> +                *
> >> +                * The second bpf_setsockopt(tcp_cc) is not allowed
> >> +                * in order to break the loop when both .init
> >> +                * are the same bpf prog.
> >> +                *
> >> +                * This applies even the second bpf_setsockopt(tcp_cc)
> >> +                * does not cause a loop.  This limits only the first
> >> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> >> +                * pick a fallback cc (eg. peer does not support ECN)
> >> +                * and the second '.init' cannot fallback to
> >> +                * another cc.
> >> +                */
> >> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
> >> +                       return -EBUSY;
> >> +       }
> >> +
> >> +       run_ctx->bpf_cookie = (uintptr_t)sk;
> >> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
> >> +       run_ctx->bpf_cookie = 0;
> >
> > Instead of adding 4 bytes for enum in patch 3
> > (which will be 8 bytes due to alignment)
> > and abusing bpf_cookie here
> > (which struct_ops bpf prog might eventually read and be surprised
> > to find sk pointer in there)
> > how about adding 'struct task_struct *saved_current' as another arg
> > to bpf_tramp_run_ctx ?
> > Always store the current task in there in prog_entry_struct_ops
> > and then compare it here in this specialized bpf_init_ops_setsockopt?
> >
> > Or maybe always check in enter_prog_struct_ops:
> > if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> > run_ctx)->saved_current == current) // goto out since recursion?
> > it will prevent issues in case we don't know about and will
> > address the good recursion case as explained in patch 1?
> > I'm assuming 2nd ssthresh runs in a different task..
> > Or is it actually the same task?
>
> The 2nd ssthresh() should run in the same task but different sk.  The
> first ssthresh(sk[1]) was run in_task() context and then got
> interrupted.  The softirq then handles the rcv path which just happens
> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>
> The tcp-cc ops can recur but cannot recur on the same sk because it
> requires to hold the sk lock, so the patch remembers what was the
> previous sk to ensure it does not recur on the same sk.  Then it needs
> to peek into the previous run ctx which may not always be
> bpf_trump_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
> the previous bpf_run_ctx.

got it.

>
> Since struct_ops is the only one that needs to peek into the previous
> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
> bytes to the bpf_run_ctx, one idea just came to my mind is to use one
> bit in the tramp_run_ctx->saved_run_ctx pointer itsef.  Something like
> this if it reuses the bpf_cookie (probably missed some int/ptr type
> casting):
>
> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>
> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
>                                      struct bpf_tramp_run_ctx *run_ctx)
>          __acquires(RCU)
> {
>         rcu_read_lock();
>         migrate_disable();
>
>         run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
>                                         BPF_RUN_CTX_STRUCT_OPS_BIT);
>
>          return bpf_prog_start_time();
> }
>
> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>             int, optname, char *, optval, int, optlen)
> {
>         /* ... */
>         if (unlikely((run_ctx->saved_run_ctx &
>                         BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
>                 /* ... */
>                 if (bpf_cookie == (uintptr_t)sk)
>                         return -EBUSY;
>         }
>
> }

that should work, but don't you need to loop through all the previous
run_ctx-s and check every one that has the BPF_RUN_CTX_STRUCT_OPS_BIT
type?
Since the run_ctx is saved in the task and we have preemptible
progs, there could be a tracing prog in the chain:
struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
where the 1st and the last have the same 'sk'.


* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
From: Martin KaFai Lau @ 2022-09-23 17:46 UTC
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Network Development

On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
> On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
>>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>>
>>>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>>>
>>>> When a bad bpf prog '.init' calls
>>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>>>
>>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>>>> ... => .init => bpf_setsockopt(tcp_cc).
>>>>
>>>> It was prevented by the prog->active counter before but the prog->active
>>>> detection cannot be used in struct_ops as explained in the earlier
>>>> patch of the set.
>>>>
>>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
>>>> in order to break the loop.  This is done by checking the
>>>> previous bpf_run_ctx has saved the same sk pointer in the
>>>> bpf_cookie.
>>>>
>>>> Note that this essentially limits only the first '.init' can
>>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
>>>> does not support ECN) and the second '.init' cannot fallback to
>>>> another cc.  This applies even the second
>>>> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>>>>
>>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>>>> ---
>>>>    include/linux/filter.h |  3 +++
>>>>    net/core/filter.c      |  4 ++--
>>>>    net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
>>>>    3 files changed, 59 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>> index 98e28126c24b..9942ecc68a45 100644
>>>> --- a/include/linux/filter.h
>>>> +++ b/include/linux/filter.h
>>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
>>>>    bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
>>>>    void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
>>>>
>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> +                   char *optval, int optlen);
>>>> +
>>>>    u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>>>    #define __bpf_call_base_args \
>>>>           ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index f4cea3ff994a..e56a1ebcf1bc 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>           return -EINVAL;
>>>>    }
>>>>
>>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> -                          char *optval, int optlen)
>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>> +                   char *optval, int optlen)
>>>>    {
>>>>           if (sk_fullsock(sk))
>>>>                   sock_owned_by_me(sk);
>>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>>>> index 6da16ae6a962..a9f2cab5ffbc 100644
>>>> --- a/net/ipv4/bpf_tcp_ca.c
>>>> +++ b/net/ipv4/bpf_tcp_ca.c
>>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
>>>>           .arg2_type      = ARG_ANYTHING,
>>>>    };
>>>>
>>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>>> +          int, optname, char *, optval, int, optlen)
>>>> +{
>>>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
>>>> +       int ret;
>>>> +
>>>> +       if (optname != TCP_CONGESTION)
>>>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>> +
>>>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
>>>> +       if (unlikely(run_ctx->saved_run_ctx &&
>>>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
>>>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
>>>> +               /* It stops this looping
>>>> +                *
>>>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
>>>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
>>>> +                *
>>>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
>>>> +                * in order to break the loop when both .init
>>>> +                * are the same bpf prog.
>>>> +                *
>>>> +                * This applies even the second bpf_setsockopt(tcp_cc)
>>>> +                * does not cause a loop.  This limits only the first
>>>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
>>>> +                * pick a fallback cc (eg. peer does not support ECN)
>>>> +                * and the second '.init' cannot fallback to
>>>> +                * another cc.
>>>> +                */
>>>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
>>>> +                       return -EBUSY;
>>>> +       }
>>>> +
>>>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
>>>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>> +       run_ctx->bpf_cookie = 0;
>>>
>>> Instead of adding 4 bytes for enum in patch 3
>>> (which will be 8 bytes due to alignment)
>>> and abusing bpf_cookie here
>>> (which struct_ops bpf prog might eventually read and be surprised
>>> to find sk pointer in there)
>>> how about adding 'struct task_struct *saved_current' as another arg
>>> to bpf_tramp_run_ctx ?
>>> Always store the current task in there in prog_entry_struct_ops
>>> and then compare it here in this specialized bpf_init_ops_setsockopt?
>>>
>>> Or maybe always check in enter_prog_struct_ops:
>>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
>>> run_ctx)->saved_current == current) // goto out since recursion?
>>> it will prevent issues in cases we don't know about and will
>>> address the good recursion case as explained in patch 1?
>>> I'm assuming 2nd ssthresh runs in a different task..
>>> Or is it actually the same task?
>>
>> The 2nd ssthresh() should run in the same task but different sk.  The
>> first ssthresh(sk[1]) was run in_task() context and then got
>> interrupted.  The softirq then handles the rcv path which just happens
>> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
>> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>>
>> The tcp-cc ops can recur but cannot recur on the same sk because it
>> requires holding the sk lock, so the patch remembers what the
>> previous sk was to ensure it does not recur on the same sk.  Then it needs
>> to peek into the previous run ctx which may not always be
>> bpf_tramp_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
>> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
>> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
>> the previous bpf_run_ctx.
> 
> got it.
> 
>>
>> Since struct_ops is the only one that needs to peek into the previous
>> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
>> bytes to the bpf_run_ctx, one idea that just came to my mind is to use one
>> bit in the tramp_run_ctx->saved_run_ctx pointer itself.  Something like
>> this if it reuses the bpf_cookie (probably missed some int/ptr type
>> casting):
>>
>> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>>
>> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
>>                                       struct bpf_tramp_run_ctx *run_ctx)
>>           __acquires(RCU)
>> {
>>          rcu_read_lock();
>>          migrate_disable();
>>
>>          run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
>>                                          BPF_RUN_CTX_STRUCT_OPS_BIT);
>>
>>           return bpf_prog_start_time();
>> }
>>
>> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>              int, optname, char *, optval, int, optlen)
>> {
>>          /* ... */
>>          if (unlikely((run_ctx->saved_run_ctx &
>>                          BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
>>                  /* ... */
>>                  if (bpf_cookie == (uintptr_t)sk)
>>                          return -EBUSY;
>>          }
>>
>> }
> 
> that should work, but don't you need to loop through all previous
> run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
> Since run_ctx is saved in the task and we have preemptible
> progs, there could be a tracing prog in the chain:
> struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
> where 1st and last have the same 'sk'.


This interleaving of different run_ctx types could happen.  My
understanding is the 'struct_ops_run_ctx' can only be created when the
tcp stack is calling the 'bpf_tcp_cc->init()' (or other cc ops).  In
the above case, the first and second struct_ops_run_ctx are interleaved
with a tracing_run_ctx.  Each of these two struct_ops_run_ctx was
created from a different 'bpf_tcp_cc->init()' call by the kernel tcp
stack.  They cannot both be called with the same sk and be changing
that sk at the same time like this.  Otherwise, the kernel tcp stack
has a bug.
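
fwiw, the struct_ops enter/exit pair in patch 1 only does the usual
save/restore of the run_ctx chain, which is what lets the helper peek
one level back.  A rough sketch from memory (not the exact code in the
set; the exit side in particular may look slightly different there):

u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
					struct bpf_tramp_run_ctx *run_ctx)
	__acquires(RCU)
{
	rcu_read_lock();
	migrate_disable();

	/* remember whatever run_ctx was active before this prog */
	run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);

	return bpf_prog_start_time();
}

void notrace __bpf_prog_exit_struct_ops(struct bpf_prog *prog, u64 start,
					struct bpf_tramp_run_ctx *run_ctx)
	__releases(RCU)
{
	/* restore the outer context's run_ctx on the way out */
	bpf_reset_run_ctx(run_ctx->saved_run_ctx);

	update_prog_stats(prog, start);
	migrate_enable();
	rcu_read_unlock();
}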

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
  2022-09-23 17:46         ` Martin KaFai Lau
@ 2022-09-23 18:27           ` Andrii Nakryiko
  2022-09-23 18:30             ` Andrii Nakryiko
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-09-23 18:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Network Development

On Fri, Sep 23, 2022 at 10:46 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
> > On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
> >>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >>>>
> >>>> From: Martin KaFai Lau <martin.lau@kernel.org>
> >>>>
> >>>> When a bad bpf prog '.init' calls
> >>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >>>>
> >>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> >>>> ... => .init => bpf_setsockopt(tcp_cc).
> >>>>
> >>>> It was prevented by the prog->active counter before but the prog->active
> >>>> detection cannot be used in struct_ops as explained in the earlier
> >>>> patch of the set.
> >>>>
> >>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> >>>> in order to break the loop.  This is done by checking whether the
> >>>> previous bpf_run_ctx has saved the same sk pointer in the
> >>>> bpf_cookie.
> >>>>
> >>>> Note that this essentially means that only the first '.init' can
> >>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> >>>> does not support ECN) and the second '.init' cannot fall back to
> >>>> another cc.  This applies even if the second
> >>>> bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
> >>>>
> >>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> >>>> ---
> >>>>    include/linux/filter.h |  3 +++
> >>>>    net/core/filter.c      |  4 ++--
> >>>>    net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
> >>>>    3 files changed, 59 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >>>> index 98e28126c24b..9942ecc68a45 100644
> >>>> --- a/include/linux/filter.h
> >>>> +++ b/include/linux/filter.h
> >>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
> >>>>    bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
> >>>>    void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
> >>>>
> >>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>> +                   char *optval, int optlen);
> >>>> +
> >>>>    u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> >>>>    #define __bpf_call_base_args \
> >>>>           ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
> >>>> diff --git a/net/core/filter.c b/net/core/filter.c
> >>>> index f4cea3ff994a..e56a1ebcf1bc 100644
> >>>> --- a/net/core/filter.c
> >>>> +++ b/net/core/filter.c
> >>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>>           return -EINVAL;
> >>>>    }
> >>>>
> >>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>> -                          char *optval, int optlen)
> >>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>> +                   char *optval, int optlen)
> >>>>    {
> >>>>           if (sk_fullsock(sk))
> >>>>                   sock_owned_by_me(sk);
> >>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> >>>> index 6da16ae6a962..a9f2cab5ffbc 100644
> >>>> --- a/net/ipv4/bpf_tcp_ca.c
> >>>> +++ b/net/ipv4/bpf_tcp_ca.c
> >>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
> >>>>           .arg2_type      = ARG_ANYTHING,
> >>>>    };
> >>>>
> >>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> >>>> +          int, optname, char *, optval, int, optlen)
> >>>> +{
> >>>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
> >>>> +       int ret;
> >>>> +
> >>>> +       if (optname != TCP_CONGESTION)
> >>>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
> >>>> +
> >>>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
> >>>> +       if (unlikely(run_ctx->saved_run_ctx &&
> >>>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> >>>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
> >>>> +               /* It stops this looping
> >>>> +                *
> >>>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
> >>>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
> >>>> +                *
> >>>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
> >>>> +                * in order to break the loop when both .init
> >>>> +                * are the same bpf prog.
> >>>> +                *
> >>>> +                * This applies even the second bpf_setsockopt(tcp_cc)
> >>>> +                * does not cause a loop.  This limits only the first
> >>>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> >>>> +                * pick a fallback cc (eg. peer does not support ECN)
> >>>> +                * and the second '.init' cannot fallback to
> >>>> +                * another cc.
> >>>> +                */
> >>>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
> >>>> +                       return -EBUSY;
> >>>> +       }
> >>>> +
> >>>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
> >>>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
> >>>> +       run_ctx->bpf_cookie = 0;
> >>>
> >>> Instead of adding 4 bytes for enum in patch 3
> >>> (which will be 8 bytes due to alignment)
> >>> and abusing bpf_cookie here
> >>> (which struct_ops bpf prog might eventually read and be surprised
> >>> to find sk pointer in there)
> >>> how about adding 'struct task_struct *saved_current' as another arg
> >>> to bpf_tramp_run_ctx ?
> >>> Always store the current task in there in prog_entry_struct_ops
> >>> and then compare it here in this specialized bpf_init_ops_setsockopt?
> >>>
> >>> Or maybe always check in enter_prog_struct_ops:
> >>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> >>> run_ctx)->saved_current == current) // goto out since recursion?
> >>> it will prevent issues in cases we don't know about and will
> >>> address the good recursion case as explained in patch 1?
> >>> I'm assuming 2nd ssthresh runs in a different task..
> >>> Or is it actually the same task?
> >>
> >> The 2nd ssthresh() should run in the same task but different sk.  The
> >> first ssthresh(sk[1]) was run in_task() context and then got
> >> interrupted.  The softirq then handles the rcv path which just happens
> >> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
> >> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
> >>
> >> The tcp-cc ops can recur but cannot recur on the same sk because it
> >> requires holding the sk lock, so the patch remembers what the
> >> previous sk was to ensure it does not recur on the same sk.  Then it needs
> >> to peek into the previous run ctx which may not always be
> >> bpf_tramp_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> >> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
> >> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
> >> the previous bpf_run_ctx.
> >
> > got it.
> >
> >>
> >> Since struct_ops is the only one that needs to peek into the previous
> >> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
> >> bytes to the bpf_run_ctx, one idea that just came to my mind is to use one
> >> bit in the tramp_run_ctx->saved_run_ctx pointer itself.  Something like
> >> this if it reuses the bpf_cookie (probably missed some int/ptr type
> >> casting):
> >>
> >> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
> >>
> >> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
> >>                                       struct bpf_tramp_run_ctx *run_ctx)
> >>           __acquires(RCU)
> >> {
> >>          rcu_read_lock();
> >>          migrate_disable();
> >>
> >>          run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
> >>                                          BPF_RUN_CTX_STRUCT_OPS_BIT);
> >>
> >>           return bpf_prog_start_time();
> >> }
> >>
> >> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> >>              int, optname, char *, optval, int, optlen)
> >> {
> >>          /* ... */
> >>          if (unlikely((run_ctx->saved_run_ctx &
> >>                          BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
> >>                  /* ... */
> >>                  if (bpf_cookie == (uintptr_t)sk)
> >>                          return -EBUSY;
> >>          }
> >>
> >> }
> >
> > that should work, but don't you need to loop through all previous
> > run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
> > Since run_ctx is saved in the task and we have preemptible
> > progs, there could be a tracing prog in the chain:
> > struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
> > where 1st and last have the same 'sk'.
>
>
> This interleaving of different run_ctx types could happen.  My
> understanding is the 'struct_ops_run_ctx' can only be created when the
> tcp stack is calling the 'bpf_tcp_cc->init()' (or other cc ops).  In
> the above case, the first and second struct_ops_run_ctx are interleaved
> with a tracing_run_ctx.  Each of these two struct_ops_run_ctx was
> created from a different 'bpf_tcp_cc->init()' call by the kernel tcp
> stack.  They cannot both be called with the same sk and be changing
> that sk at the same time like this.  Otherwise, the kernel tcp stack
> has a bug.

There could also be a kprobe context in the chain, not necessarily a
trampoline-based context. You want to look at the previous struct_ops
run_ctx (if any), but it's not necessarily run_ctx->saved_run_ctx. It
could be one of the still earlier ones in the chain. And given that the
kprobe run_ctx doesn't have a saved_run_ctx field and doesn't preserve
the chain of run_ctxs, there is no reliable way to check the entire
chain of run_ctxs.
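
For reference, the run_ctx structs today look roughly like this
(simplified and from memory, so the exact fields may differ from
include/linux/bpf.h); only the trampoline one remembers its
predecessor:

struct bpf_run_ctx {};

struct bpf_cg_run_ctx {
	struct bpf_run_ctx run_ctx;
	const struct bpf_prog_array_item *prog_item;
	int retval;
};

/* used by kprobe/tp/perf_event progs via bpf_prog_run_array() */
struct bpf_trace_run_ctx {
	struct bpf_run_ctx run_ctx;
	u64 bpf_cookie;
	/* no saved_run_ctx here; the previous ctx only lives in a local
	 * variable around bpf_set_run_ctx()/bpf_reset_run_ctx()
	 */
};

struct bpf_tramp_run_ctx {
	struct bpf_run_ctx run_ctx;
	u64 bpf_cookie;
	struct bpf_run_ctx *saved_run_ctx;
};

So once a bpf_trace_run_ctx sits in the middle, the chain of
saved_run_ctx pointers is broken from the walker's point of view.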

BPF_RUN_CTX_STRUCT_OPS_BIT is a bit dangerous if we ever do a similar
bit trick for some other type of run_ctx (which honestly we should
avoid). Enum would be safer, but still, you need to check the entire
chain of run_ctxs, which we do not preserve.

It seems to me that run_ctx is not the right mechanism to use here,
tbh. Are there any other alternatives?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
  2022-09-23 18:27           ` Andrii Nakryiko
@ 2022-09-23 18:30             ` Andrii Nakryiko
  2022-09-23 18:48               ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2022-09-23 18:30 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Network Development

On Fri, Sep 23, 2022 at 11:27 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 10:46 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
> > > On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>
> > >> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
> > >>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >>>>
> > >>>> From: Martin KaFai Lau <martin.lau@kernel.org>
> > >>>>
> > >>>> When a bad bpf prog '.init' calls
> > >>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> > >>>>
> > >>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> > >>>> ... => .init => bpf_setsockopt(tcp_cc).
> > >>>>
> > >>>> It was prevented by the prog->active counter before but the prog->active
> > >>>> detection cannot be used in struct_ops as explained in the earlier
> > >>>> patch of the set.
> > >>>>
> > >>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> > >>>> in order to break the loop.  This is done by checking whether the
> > >>>> previous bpf_run_ctx has saved the same sk pointer in the
> > >>>> bpf_cookie.
> > >>>>
> > >>>> Note that this essentially means that only the first '.init' can
> > >>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> > >>>> does not support ECN) and the second '.init' cannot fall back to
> > >>>> another cc.  This applies even if the second
> > >>>> bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
> > >>>>
> > >>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> > >>>> ---
> > >>>>    include/linux/filter.h |  3 +++
> > >>>>    net/core/filter.c      |  4 ++--
> > >>>>    net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
> > >>>>    3 files changed, 59 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
> > >>>> index 98e28126c24b..9942ecc68a45 100644
> > >>>> --- a/include/linux/filter.h
> > >>>> +++ b/include/linux/filter.h
> > >>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
> > >>>>    bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
> > >>>>    void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
> > >>>>
> > >>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > >>>> +                   char *optval, int optlen);
> > >>>> +
> > >>>>    u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> > >>>>    #define __bpf_call_base_args \
> > >>>>           ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
> > >>>> diff --git a/net/core/filter.c b/net/core/filter.c
> > >>>> index f4cea3ff994a..e56a1ebcf1bc 100644
> > >>>> --- a/net/core/filter.c
> > >>>> +++ b/net/core/filter.c
> > >>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> > >>>>           return -EINVAL;
> > >>>>    }
> > >>>>
> > >>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > >>>> -                          char *optval, int optlen)
> > >>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > >>>> +                   char *optval, int optlen)
> > >>>>    {
> > >>>>           if (sk_fullsock(sk))
> > >>>>                   sock_owned_by_me(sk);
> > >>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> > >>>> index 6da16ae6a962..a9f2cab5ffbc 100644
> > >>>> --- a/net/ipv4/bpf_tcp_ca.c
> > >>>> +++ b/net/ipv4/bpf_tcp_ca.c
> > >>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
> > >>>>           .arg2_type      = ARG_ANYTHING,
> > >>>>    };
> > >>>>
> > >>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> > >>>> +          int, optname, char *, optval, int, optlen)
> > >>>> +{
> > >>>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
> > >>>> +       int ret;
> > >>>> +
> > >>>> +       if (optname != TCP_CONGESTION)
> > >>>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
> > >>>> +
> > >>>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
> > >>>> +       if (unlikely(run_ctx->saved_run_ctx &&
> > >>>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> > >>>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
> > >>>> +               /* It stops this looping
> > >>>> +                *
> > >>>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
> > >>>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
> > >>>> +                *
> > >>>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
> > >>>> +                * in order to break the loop when both .init
> > >>>> +                * are the same bpf prog.
> > >>>> +                *
> > >>>> +                * This applies even the second bpf_setsockopt(tcp_cc)
> > >>>> +                * does not cause a loop.  This limits only the first
> > >>>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> > >>>> +                * pick a fallback cc (eg. peer does not support ECN)
> > >>>> +                * and the second '.init' cannot fallback to
> > >>>> +                * another cc.
> > >>>> +                */
> > >>>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
> > >>>> +                       return -EBUSY;
> > >>>> +       }
> > >>>> +
> > >>>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
> > >>>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
> > >>>> +       run_ctx->bpf_cookie = 0;
> > >>>
> > >>> Instead of adding 4 bytes for enum in patch 3
> > >>> (which will be 8 bytes due to alignment)
> > >>> and abusing bpf_cookie here
> > >>> (which struct_ops bpf prog might eventually read and be surprised
> > >>> to find sk pointer in there)
> > >>> how about adding 'struct task_struct *saved_current' as another arg
> > >>> to bpf_tramp_run_ctx ?
> > >>> Always store the current task in there in prog_entry_struct_ops
> > >>> and then compare it here in this specialized bpf_init_ops_setsockopt?
> > >>>
> > >>> Or maybe always check in enter_prog_struct_ops:
> > >>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> > >>> run_ctx)->saved_current == current) // goto out since recursion?
> > >>> it will prevent issues in cases we don't know about and will
> > >>> address the good recursion case as explained in patch 1?
> > >>> I'm assuming 2nd ssthresh runs in a different task..
> > >>> Or is it actually the same task?
> > >>
> > >> The 2nd ssthresh() should run in the same task but different sk.  The
> > >> first ssthresh(sk[1]) was run in_task() context and then got
> > >> interrupted.  The softirq then handles the rcv path which just happens
> > >> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
> > >> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
> > >>
> > >> The tcp-cc ops can recur but cannot recur on the same sk because it
> > >> requires holding the sk lock, so the patch remembers what the
> > >> previous sk was to ensure it does not recur on the same sk.  Then it needs
> > >> to peek into the previous run ctx which may not always be
> > >> bpf_tramp_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> > >> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
> > >> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
> > >> the previous bpf_run_ctx.
> > >
> > > got it.
> > >
> > >>
> > >> Since struct_ops is the only one that needs to peek into the previous
> > >> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
> > >> bytes to the bpf_run_ctx, one idea that just came to my mind is to use one
> > >> bit in the tramp_run_ctx->saved_run_ctx pointer itself.  Something like
> > >> this if it reuses the bpf_cookie (probably missed some int/ptr type
> > >> casting):
> > >>
> > >> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
> > >>
> > >> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
> > >>                                       struct bpf_tramp_run_ctx *run_ctx)
> > >>           __acquires(RCU)
> > >> {
> > >>          rcu_read_lock();
> > >>          migrate_disable();
> > >>
> > >>          run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
> > >>                                          BPF_RUN_CTX_STRUCT_OPS_BIT);
> > >>
> > >>           return bpf_prog_start_time();
> > >> }
> > >>
> > >> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> > >>              int, optname, char *, optval, int, optlen)
> > >> {
> > >>          /* ... */
> > >>          if (unlikely((run_ctx->saved_run_ctx &
> > >>                          BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
> > >>                  /* ... */
> > >>                  if (bpf_cookie == (uintptr_t)sk)
> > >>                          return -EBUSY;
> > >>          }
> > >>
> > >> }
> > >
> > > that should work, but don't you need to loop through all previous
> > > run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
> > > Since run_ctx is saved in the task and we have preemptible
> > > progs, there could be a tracing prog in the chain:
> > > struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
> > > where 1st and last have the same 'sk'.
> >
> >
> > This interleaving of different run_ctx types could happen.  My
> > understanding is the 'struct_ops_run_ctx' can only be created when the
> > tcp stack is calling the 'bpf_tcp_cc->init()' (or other cc ops).  In
> > the above case, the first and second struct_ops_run_ctx are interleaved
> > with a tracing_run_ctx.  Each of these two struct_ops_run_ctx was
> > created from a different 'bpf_tcp_cc->init()' call by the kernel tcp
> > stack.  They cannot both be called with the same sk and be changing
> > that sk at the same time like this.  Otherwise, the kernel tcp stack
> > has a bug.
>
> There could also be a kprobe context in the chain, not necessarily a
> trampoline-based context. You want to look at the previous struct_ops
> run_ctx (if any), but it's not necessarily run_ctx->saved_run_ctx. It
> could be one of the still earlier ones in the chain. And given that the
> kprobe run_ctx doesn't have a saved_run_ctx field and doesn't preserve
> the chain of run_ctxs, there is no reliable way to check the entire
> chain of run_ctxs.
>
> BPF_RUN_CTX_STRUCT_OPS_BIT is a bit dangerous if we ever do a similar
> bit trick for some other type of run_ctx (which honestly we should
> avoid). Enum would be safer, but still, you need to check the entire
> chain of run_ctxs, which we do not preserve.
>
> It seems to me that run_ctx is not the right mechanism to use here,
> tbh. Are there any other alternatives?

E.g., we can't have more than one struct_ops program attached to any
given socket, right? So can we just use one bit on struct sock to mark
"it is being processed by struct_ops.init program" and just check
that?
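
Roughly something like this, purely as a sketch (the
bpf_cc_init_inprogress flag and where exactly it lives -- tcp_sock vs
inet_connection_sock -- are made up here, whatever spare bit works):

BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
	   int, optname, char *, optval, int, optlen)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int ret;

	if (optname != TCP_CONGESTION)
		return _bpf_setsockopt(sk, level, optname, optval, optlen);

	/* a nested .init on the same sk (lock already held) sees the
	 * flag set and bails out instead of looping
	 */
	if (tp->bpf_cc_init_inprogress)
		return -EBUSY;

	tp->bpf_cc_init_inprogress = 1;
	ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
	tp->bpf_cc_init_inprogress = 0;

	return ret;
}

That would avoid peeking into previous run_ctxs entirely and keeps the
check per-socket, so the in_task() vs softirq interleaving on a
different sk stays naturally allowed.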

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
  2022-09-23 18:30             ` Andrii Nakryiko
@ 2022-09-23 18:48               ` Martin KaFai Lau
  0 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 18:48 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Network Development

On 9/23/22 11:30 AM, Andrii Nakryiko wrote:
> On Fri, Sep 23, 2022 at 11:27 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>
>> On Fri, Sep 23, 2022 at 10:46 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
>>>> On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>
>>>>> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
>>>>>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>>>>>
>>>>>>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>>>>>>
>>>>>>> When a bad bpf prog '.init' calls
>>>>>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>>>>>>
>>>>>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>>>>>>> ... => .init => bpf_setsockopt(tcp_cc).
>>>>>>>
>>>>>>> It was prevented by the prog->active counter before but the prog->active
>>>>>>> detection cannot be used in struct_ops as explained in the earlier
>>>>>>> patch of the set.
>>>>>>>
>>>>>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
>>>>>>> in order to break the loop.  This is done by checking whether the
>>>>>>> previous bpf_run_ctx has saved the same sk pointer in the
>>>>>>> bpf_cookie.
>>>>>>>
>>>>>>> Note that this essentially means that only the first '.init' can
>>>>>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
>>>>>>> does not support ECN) and the second '.init' cannot fall back to
>>>>>>> another cc.  This applies even if the second
>>>>>>> bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
>>>>>>>
>>>>>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>>>>>>> ---
>>>>>>>     include/linux/filter.h |  3 +++
>>>>>>>     net/core/filter.c      |  4 ++--
>>>>>>>     net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>     3 files changed, 59 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>>>>> index 98e28126c24b..9942ecc68a45 100644
>>>>>>> --- a/include/linux/filter.h
>>>>>>> +++ b/include/linux/filter.h
>>>>>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
>>>>>>>     bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
>>>>>>>     void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
>>>>>>>
>>>>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>>>> +                   char *optval, int optlen);
>>>>>>> +
>>>>>>>     u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>>>>>>     #define __bpf_call_base_args \
>>>>>>>            ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
>>>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>>>> index f4cea3ff994a..e56a1ebcf1bc 100644
>>>>>>> --- a/net/core/filter.c
>>>>>>> +++ b/net/core/filter.c
>>>>>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>>>>            return -EINVAL;
>>>>>>>     }
>>>>>>>
>>>>>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>>>> -                          char *optval, int optlen)
>>>>>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
>>>>>>> +                   char *optval, int optlen)
>>>>>>>     {
>>>>>>>            if (sk_fullsock(sk))
>>>>>>>                    sock_owned_by_me(sk);
>>>>>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
>>>>>>> index 6da16ae6a962..a9f2cab5ffbc 100644
>>>>>>> --- a/net/ipv4/bpf_tcp_ca.c
>>>>>>> +++ b/net/ipv4/bpf_tcp_ca.c
>>>>>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
>>>>>>>            .arg2_type      = ARG_ANYTHING,
>>>>>>>     };
>>>>>>>
>>>>>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>>>>>> +          int, optname, char *, optval, int, optlen)
>>>>>>> +{
>>>>>>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
>>>>>>> +       int ret;
>>>>>>> +
>>>>>>> +       if (optname != TCP_CONGESTION)
>>>>>>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>>>>> +
>>>>>>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
>>>>>>> +       if (unlikely(run_ctx->saved_run_ctx &&
>>>>>>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
>>>>>>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
>>>>>>> +               /* It stops this looping
>>>>>>> +                *
>>>>>>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
>>>>>>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
>>>>>>> +                *
>>>>>>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
>>>>>>> +                * in order to break the loop when both .init
>>>>>>> +                * are the same bpf prog.
>>>>>>> +                *
>>>>>>> +                * This applies even the second bpf_setsockopt(tcp_cc)
>>>>>>> +                * does not cause a loop.  This limits only the first
>>>>>>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
>>>>>>> +                * pick a fallback cc (eg. peer does not support ECN)
>>>>>>> +                * and the second '.init' cannot fallback to
>>>>>>> +                * another cc.
>>>>>>> +                */
>>>>>>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
>>>>>>> +                       return -EBUSY;
>>>>>>> +       }
>>>>>>> +
>>>>>>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
>>>>>>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
>>>>>>> +       run_ctx->bpf_cookie = 0;
>>>>>>
>>>>>> Instead of adding 4 bytes for enum in patch 3
>>>>>> (which will be 8 bytes due to alignment)
>>>>>> and abusing bpf_cookie here
>>>>>> (which struct_ops bpf prog might eventually read and be surprised
>>>>>> to find sk pointer in there)
>>>>>> how about adding 'struct task_struct *saved_current' as another arg
>>>>>> to bpf_tramp_run_ctx ?
>>>>>> Always store the current task in there in prog_entry_struct_ops
>>>>>> and then compare it here in this specialized bpf_init_ops_setsockopt?
>>>>>>
>>>>>> Or maybe always check in enter_prog_struct_ops:
>>>>>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
>>>>>> run_ctx)->saved_current == current) // goto out since recursion?
>>>>>> it will prevent issues in cases we don't know about and will
>>>>>> address the good recursion case as explained in patch 1?
>>>>>> I'm assuming 2nd ssthresh runs in a different task..
>>>>>> Or is it actually the same task?
>>>>>
>>>>> The 2nd ssthresh() should run in the same task but different sk.  The
>>>>> first ssthresh(sk[1]) was run in_task() context and then got
>>>>> interrupted.  The softirq then handles the rcv path which just happens
>>>>> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
>>>>> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>>>>>
>>>>> The tcp-cc ops can recur but cannot recur on the same sk because it
>>>>> requires holding the sk lock, so the patch remembers what the
>>>>> previous sk was to ensure it does not recur on the same sk.  Then it needs
>>>>> to peek into the previous run ctx which may not always be
>>>>> bpf_tramp_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
>>>>> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
>>>>> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
>>>>> the previous bpf_run_ctx.
>>>>
>>>> got it.
>>>>
>>>>>
>>>>> Since struct_ops is the only one that needs to peek into the previous
>>>>> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
>>>>> bytes to the bpf_run_ctx, one idea that just came to my mind is to use one
>>>>> bit in the tramp_run_ctx->saved_run_ctx pointer itself.  Something like
>>>>> this if it reuses the bpf_cookie (probably missed some int/ptr type
>>>>> casting):
>>>>>
>>>>> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>>>>>
>>>>> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
>>>>>                                        struct bpf_tramp_run_ctx *run_ctx)
>>>>>            __acquires(RCU)
>>>>> {
>>>>>           rcu_read_lock();
>>>>>           migrate_disable();
>>>>>
>>>>>           run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
>>>>>                                           BPF_RUN_CTX_STRUCT_OPS_BIT);
>>>>>
>>>>>            return bpf_prog_start_time();
>>>>> }
>>>>>
>>>>> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
>>>>>               int, optname, char *, optval, int, optlen)
>>>>> {
>>>>>           /* ... */
>>>>>           if (unlikely((run_ctx->saved_run_ctx &
>>>>>                           BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
>>>>>                   /* ... */
>>>>>                   if (bpf_cookie == (uintptr_t)sk)
>>>>>                           return -EBUSY;
>>>>>           }
>>>>>
>>>>> }
>>>>
>>>> that should work, but don't you need to loop through all previous
>>>> run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
>>>> Since run_ctx is saved in the task and we have preemptible
>>>> progs, there could be a tracing prog in the chain:
>>>> struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
>>>> where 1st and last have the same 'sk'.
>>>
>>>
>>> This interleaving of different run_ctx types could happen.  My
>>> understanding is the 'struct_ops_run_ctx' can only be created when the
>>> tcp stack is calling the 'bpf_tcp_cc->init()' (or other cc ops).  In
>>> the above case, the first and second struct_ops_run_ctx are interleaved
>>> with a tracing_run_ctx.  Each of these two struct_ops_run_ctx was
>>> created from a different 'bpf_tcp_cc->init()' call by the kernel tcp
>>> stack.  They cannot both be called with the same sk and be changing
>>> that sk at the same time like this.  Otherwise, the kernel tcp stack
>>> has a bug.
>>
>> There could also be a kprobe context in the chain, not necessarily a
>> trampoline-based context. You want to look at the previous struct_ops
>> run_ctx (if any), but it's not necessarily run_ctx->saved_run_ctx. It
>> could be one of the still earlier ones in the chain. And given that the
>> kprobe run_ctx doesn't have a saved_run_ctx field and doesn't preserve
>> the chain of run_ctxs, there is no reliable way to check the entire
>> chain of run_ctxs.
>>
>> BPF_RUN_CTX_STRUCT_OPS_BIT is a bit dangerous if we ever do a similar
>> bit trick for some other type of run_ctx (which honestly we should
>> avoid). Enum would be safer, but still, you need to check the entire
>> chain of run_ctxs, which we do not preserve.

There is no need to check the entire chain.  Only the immediately
previous one is needed.  If the previous one is a kprobe run_ctx, it is
fine since there is no loop.  If there is an even earlier run ctx that
is a bpf_struct_ops_run_ctx, then it must be on a different sk.

>>
>> It seems to me that run_ctx is not the right mechanism to use here,
>> tbh. Are there any other alternatives?
> 
> E.g., we can't have more than one struct_ops program attached to any
> given socket, right? So can we just use one bit on struct sock to mark
> "it is being processed by struct_ops.init program" and just check
> that?

It was one of my earlier thoughts.  I went with this route to reuse the
existing run context instead of taking a hole or a bit from tcp_sock to
handle this corner case.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-09-23 18:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-22 22:56 [PATCH bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 3/5] bpf: Add bpf_run_ctx_type Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
2022-09-23  0:12   ` Alexei Starovoitov
2022-09-23  1:11     ` Martin KaFai Lau
2022-09-23  1:59       ` Hao Luo
2022-09-23 15:26       ` Alexei Starovoitov
2022-09-23 17:46         ` Martin KaFai Lau
2022-09-23 18:27           ` Andrii Nakryiko
2022-09-23 18:30             ` Andrii Nakryiko
2022-09-23 18:48               ` Martin KaFai Lau
2022-09-22 22:56 ` [PATCH bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).