* [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt
@ 2023-07-22  5:22 kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: kuifeng @ 2023-07-22  5:22 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: sinquersw, Kui-Feng Lee

From: Kui-Feng Lee <kuifeng@meta.com>

Make BPF programs attached to cgroup/{get,set}sockopt hooks sleepable
and able to call bpf_copy_from_user() and a new kfunc,
bpf_copy_to_user().

The Issue with CGroup {get,set}sockopt Hooks
============================================

When user space calls {get,set}sockopt, optval is a pointer to a
buffer. The format of the buffer depends on the level and optname, and
its size is specified by optlen. User space programs use the buffer to
pass values to setsockopt and to retrieve values from getsockopt.

The problem is that BPF programs running under the RCU read lock
cannot access buffers located in user space. These programs are
non-sleepable, and using copy_from_user() or copy_to_user() on user
space memory may fault and sleep while paging the memory in.

The kernel makes a copy of the buffer specified by optval and optlen
in kernel space before passing it to the cgroup {get,set}sockopt
hooks. After the hooks are executed, the content of the buffer in
kernel space is copied to user space if necessary.

User programs may pass a significant amount of data to the kernel in
the buffer pointed to by optval. One example is iptables, which can
send several megabytes through setsockopt. However, BPF programs on
these hooks can see at most the first PAGE_SIZE bytes of the buffer,
so the optlen they observe may read as PAGE_SIZE even though the real
buffer is larger. For getsockopt, optlen also carries the amount of
data retrieved by the kernel. On top of that, BPF programs are allowed
to modify both the buffer content and optlen.

The kernel may then wrongly report an optlen of PAGE_SIZE back to user
space, because it cannot distinguish whether that value was set by a
BPF program or is merely the truncated view it created itself.

To paper over this, various hacks have been applied; for example,
commit d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt
when optlen > PAGE_SIZE") and commit 29ebbba7d461 ("bpf: Don't EFAULT
for {g,s}setsockopt with wrong optlen").

Make CGroup {get,set}sockopt Hooks Sleepable
============================================

The long-term solution is to make these hooks sleepable so that BPF
programs can call copy_from_user() and copy_to_user(), in the form of
the bpf_copy_from_user() helper and the new bpf_copy_to_user() kfunc.
This removes the need to manipulate the optval and optlen values, and
allows BPF programs to access the complete contents of the buffer
referenced by optval.
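
As a rough illustration, a sleepable setsockopt program under this
series could look like the sketch below. It uses the new
"cgroup/setsockopt.s" section name, the BPF_SOCKOPT_FLAG_OPTVAL_USER
flag, and the copy helper from the later patches; the program name,
the 16-byte buffer, and the early returns are made up for the example
and are not part of the series.

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  SEC("cgroup/setsockopt.s")
  int example_setsockopt_s(struct bpf_sockopt *ctx)
  {
  	char buf[16];
  	long len;

  	if (!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
  		return 1;	/* kernel copy; same rules as today */

  	len = ctx->user_optval_end - ctx->user_optval;
  	if (len <= 0)
  		return 1;
  	if (len > sizeof(buf))
  		len = sizeof(buf);

  	/* The buffer lives in user space and may fault, so it is
  	 * read through the sleepable copy helper, never directly.
  	 */
  	if (bpf_copy_from_user(buf, len, ctx->user_optval))
  		return 0;

  	return 1;
  }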

Kui-Feng Lee (5):
  bpf: enable sleepable BPF programs attached to
    cgroup/{get,set}sockopt.
  bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  bpf: Add a new dynptr type for CGROUP_SOCKOPT.
  bpf: Prevent BPF programs from accessing the buffer pointed to by
    user_optval.
  bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT
    type.

 include/linux/bpf.h                           |   7 +-
 include/linux/filter.h                        |   3 +
 include/uapi/linux/bpf.h                      |  11 +
 kernel/bpf/btf.c                              |   3 +
 kernel/bpf/cgroup.c                           | 196 +++++++++---
 kernel/bpf/helpers.c                          | 104 ++++++
 kernel/bpf/verifier.c                         | 116 ++++---
 tools/include/uapi/linux/bpf.h                |  11 +
 tools/lib/bpf/libbpf.c                        |   2 +
 .../testing/selftests/bpf/bpf_experimental.h  |  27 ++
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  30 ++
 .../selftests/bpf/prog_tests/sockopt_sk.c     |  34 +-
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 299 ++++++++++++++++++
 .../selftests/bpf/verifier/sleepable.c        |   2 +-
 14 files changed, 763 insertions(+), 82 deletions(-)

-- 
2.34.1



* [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-22  5:22 [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt kuifeng
@ 2023-07-22  5:22 ` kuifeng
  2023-07-22 16:09   ` kernel test robot
                     ` (3 more replies)
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 20+ messages in thread
From: kuifeng @ 2023-07-22  5:22 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: sinquersw, Kui-Feng Lee

From: Kui-Feng Lee <kuifeng@meta.com>

Enable sleepable cgroup/{get,set}sockopt hooks.

The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
receive a pointer to the optval in user space instead of a kernel
copy. When a user space buffer is passed, ctx->user_optval and
ctx->user_optval_end point to the beginning and end of that buffer;
otherwise, ctx->optval and ctx->optval_end point to a kernel copy.

A program receives a user space buffer if ctx->flags &
BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel space
buffer.  BPF programs must not read from or write to a user space buffer
directly.  They should access it through the bpf_copy_from_user() and
bpf_copy_to_user() helpers provided in the following patches.
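
For example (a sketch mirroring the pattern used by the selftests in
patch 5), a program that handles both modes can pick the buffer based
on the flag:

	__u8 *optval = ctx->optval;
	__u8 *optval_end = ctx->optval_end;

	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
		optval = ctx->user_optval;
		optval_end = ctx->user_optval_end;
	}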

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/filter.h         |   3 +
 include/uapi/linux/bpf.h       |   9 ++
 kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c          |   7 +-
 tools/include/uapi/linux/bpf.h |   9 ++
 tools/lib/bpf/libbpf.c         |   2 +
 6 files changed, 176 insertions(+), 43 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index f69114083ec7..301dd1ba0de1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
 	s32		level;
 	s32		optname;
 	s32		optlen;
+	u32		flags;
+	u8		*user_optval;
+	u8		*user_optval_end;
 	/* for retval in struct bpf_cg_run_ctx */
 	struct task_struct *current_task;
 	/* Temporary "register" for indirect stores to ppos. */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 739c15906a65..b2f81193f97b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7135,6 +7135,15 @@ struct bpf_sockopt {
 	__s32	optname;
 	__s32	optlen;
 	__s32	retval;
+
+	__bpf_md_ptr(void *, user_optval);
+	__bpf_md_ptr(void *, user_optval_end);
+	__u32	flags;
+};
+
+enum bpf_sockopt_flags {
+	/* optval is a pointer to user space memory */
+	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
 };
 
 struct bpf_pidns_info {
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 5b2741aa0d9b..b268bbfa6c53 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -38,15 +38,26 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 	const struct bpf_prog_array *array;
 	struct bpf_run_ctx *old_run_ctx;
 	struct bpf_cg_run_ctx run_ctx;
+	bool do_sleepable;
 	u32 func_ret;
 
+	do_sleepable =
+		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
+
 	run_ctx.retval = retval;
 	migrate_disable();
-	rcu_read_lock();
+	if (do_sleepable) {
+		might_fault();
+		rcu_read_lock_trace();
+	} else
+		rcu_read_lock();
 	array = rcu_dereference(cgrp->effective[atype]);
 	item = &array->items[0];
 	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
 	while ((prog = READ_ONCE(item->prog))) {
+		if (do_sleepable && !prog->aux->sleepable)
+			rcu_read_lock();
+
 		run_ctx.prog_item = item;
 		func_ret = run_prog(prog, ctx);
 		if (ret_flags) {
@@ -56,13 +67,43 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 		if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
 			run_ctx.retval = -EPERM;
 		item++;
+
+		if (do_sleepable && !prog->aux->sleepable)
+			rcu_read_unlock();
 	}
 	bpf_reset_run_ctx(old_run_ctx);
-	rcu_read_unlock();
+	if (do_sleepable)
+		rcu_read_unlock_trace();
+	else
+		rcu_read_unlock();
 	migrate_enable();
 	return run_ctx.retval;
 }
 
+static __always_inline bool
+has_only_sleepable_prog_cg(const struct cgroup_bpf *cgrp,
+			 enum cgroup_bpf_attach_type atype)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	int cnt = 0;
+	bool ret = true;
+
+	rcu_read_lock();
+	item = &rcu_dereference(cgrp->effective[atype])->items[0];
+	while (ret && (prog = READ_ONCE(item->prog))) {
+		if (!prog->aux->sleepable)
+			ret = false;
+		item++;
+		cnt++;
+	}
+	rcu_read_unlock();
+	if (cnt == 0)
+		ret = false;
+
+	return ret;
+}
+
 unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
 				       const struct bpf_insn *insn)
 {
@@ -1773,7 +1814,8 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
 static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
 			     struct bpf_sockopt_buf *buf)
 {
-	if (ctx->optval == buf->data)
+	if (ctx->optval == buf->data ||
+	    ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
 		return;
 	kfree(ctx->optval);
 }
@@ -1781,7 +1823,8 @@ static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
 static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
 				  struct bpf_sockopt_buf *buf)
 {
-	return ctx->optval != buf->data;
+	return ctx->optval != buf->data &&
+		!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER);
 }
 
 int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
@@ -1796,21 +1839,31 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		.optname = *optname,
 	};
 	int ret, max_optlen;
+	bool alloc_mem;
+
+	alloc_mem = !has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_SETSOCKOPT);
+	if (!alloc_mem) {
+		max_optlen = *optlen;
+		ctx.optlen = *optlen;
+		ctx.user_optval = optval;
+		ctx.user_optval_end = optval + *optlen;
+		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	} else {
+		/* Allocate a bit more than the initial user buffer for
+		 * BPF program. The canonical use case is overriding
+		 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
+		 */
+		max_optlen = max_t(int, 16, *optlen);
+		max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
+		if (max_optlen < 0)
+			return max_optlen;
 
-	/* Allocate a bit more than the initial user buffer for
-	 * BPF program. The canonical use case is overriding
-	 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
-	 */
-	max_optlen = max_t(int, 16, *optlen);
-	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
-	if (max_optlen < 0)
-		return max_optlen;
-
-	ctx.optlen = *optlen;
+		ctx.optlen = *optlen;
 
-	if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
-		ret = -EFAULT;
-		goto out;
+		if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
+			ret = -EFAULT;
+			goto out;
+		}
 	}
 
 	lock_sock(sk);
@@ -1824,7 +1877,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	if (ctx.optlen == -1) {
 		/* optlen set to -1, bypass kernel */
 		ret = 1;
-	} else if (ctx.optlen > max_optlen || ctx.optlen < -1) {
+	} else if (alloc_mem &&
+		   (ctx.optlen > max_optlen || ctx.optlen < -1)) {
 		/* optlen is out of bounds */
 		if (*optlen > PAGE_SIZE && ctx.optlen >= 0) {
 			pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n",
@@ -1846,6 +1900,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		 */
 		if (ctx.optlen != 0) {
 			*optlen = ctx.optlen;
+			if (ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
+				return 0;
 			/* We've used bpf_sockopt_kern->buf as an intermediary
 			 * storage, but the BPF program indicates that we need
 			 * to pass this data to the kernel setsockopt handler.
@@ -1892,33 +1948,59 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 
 	orig_optlen = max_optlen;
 	ctx.optlen = max_optlen;
-	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
-	if (max_optlen < 0)
-		return max_optlen;
+	if (has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_GETSOCKOPT)) {
+		if (!retval) {
+			/* If kernel getsockopt finished successfully,
+			 * copy whatever was returned to the user back
+			 * into our temporary buffer. Set optlen to the
+			 * one that kernel returned as well to let
+			 * BPF programs inspect the value.
+			 */
 
-	if (!retval) {
-		/* If kernel getsockopt finished successfully,
-		 * copy whatever was returned to the user back
-		 * into our temporary buffer. Set optlen to the
-		 * one that kernel returned as well to let
-		 * BPF programs inspect the value.
-		 */
+			if (get_user(ctx.optlen, optlen)) {
+				ret = -EFAULT;
+				goto out;
+			}
 
-		if (get_user(ctx.optlen, optlen)) {
-			ret = -EFAULT;
-			goto out;
+			if (ctx.optlen < 0) {
+				ret = -EFAULT;
+				goto out;
+			}
+			orig_optlen = ctx.optlen;
 		}
 
-		if (ctx.optlen < 0) {
-			ret = -EFAULT;
-			goto out;
-		}
-		orig_optlen = ctx.optlen;
+		ctx.user_optval = optval;
+		ctx.user_optval_end = optval + *optlen;
+		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	} else {
+		max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
+		if (max_optlen < 0)
+			return max_optlen;
+
+		if (!retval) {
+			/* If kernel getsockopt finished successfully,
+			 * copy whatever was returned to the user back
+			 * into our temporary buffer. Set optlen to the
+			 * one that kernel returned as well to let
+			 * BPF programs inspect the value.
+			 */
 
-		if (copy_from_user(ctx.optval, optval,
-				   min(ctx.optlen, max_optlen)) != 0) {
-			ret = -EFAULT;
-			goto out;
+			if (get_user(ctx.optlen, optlen)) {
+				ret = -EFAULT;
+				goto out;
+			}
+
+			if (ctx.optlen < 0) {
+				ret = -EFAULT;
+				goto out;
+			}
+			orig_optlen = ctx.optlen;
+
+			if (copy_from_user(ctx.optval, optval,
+					   min(ctx.optlen, max_optlen)) != 0) {
+				ret = -EFAULT;
+				goto out;
+			}
 		}
 	}
 
@@ -1942,7 +2024,9 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 	}
 
 	if (ctx.optlen != 0) {
-		if (optval && copy_to_user(optval, ctx.optval, ctx.optlen)) {
+		if (optval &&
+		    !(ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) &&
+		    copy_to_user(optval, ctx.optval, ctx.optlen)) {
 			ret = -EFAULT;
 			goto out;
 		}
@@ -2388,6 +2472,20 @@ static bool cg_sockopt_is_valid_access(int off, int size,
 		if (size != size_default)
 			return false;
 		return prog->expected_attach_type == BPF_CGROUP_GETSOCKOPT;
+	case offsetof(struct bpf_sockopt, user_optval):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct bpf_sockopt, user_optval_end):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case offsetof(struct bpf_sockopt, flags):
+		if (size != sizeof(__u32))
+			return false;
+		break;
 	default:
 		if (size != size_default)
 			return false;
@@ -2481,6 +2579,15 @@ static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_sockopt, optval_end):
 		*insn++ = CG_SOCKOPT_READ_FIELD(optval_end);
 		break;
+	case offsetof(struct bpf_sockopt, user_optval):
+		*insn++ = CG_SOCKOPT_READ_FIELD(user_optval);
+		break;
+	case offsetof(struct bpf_sockopt, user_optval_end):
+		*insn++ = CG_SOCKOPT_READ_FIELD(user_optval_end);
+		break;
+	case offsetof(struct bpf_sockopt, flags):
+		*insn++ = CG_SOCKOPT_READ_FIELD(flags);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 803b91135ca0..8cb25a775119 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19299,9 +19299,11 @@ static bool can_be_sleepable(struct bpf_prog *prog)
 			return false;
 		}
 	}
+
 	return prog->type == BPF_PROG_TYPE_LSM ||
 	       prog->type == BPF_PROG_TYPE_KPROBE /* only for uprobes */ ||
-	       prog->type == BPF_PROG_TYPE_STRUCT_OPS;
+	       prog->type == BPF_PROG_TYPE_STRUCT_OPS ||
+	       prog->type == BPF_PROG_TYPE_CGROUP_SOCKOPT;
 }
 
 static int check_attach_btf_id(struct bpf_verifier_env *env)
@@ -19323,7 +19325,8 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	}
 
 	if (prog->aux->sleepable && !can_be_sleepable(prog)) {
-		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable\n");
+		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe,"
+			"cgroup, and struct_ops programs can be sleepable\n");
 		return -EINVAL;
 	}
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 739c15906a65..b2f81193f97b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7135,6 +7135,15 @@ struct bpf_sockopt {
 	__s32	optname;
 	__s32	optlen;
 	__s32	retval;
+
+	__bpf_md_ptr(void *, user_optval);
+	__bpf_md_ptr(void *, user_optval_end);
+	__u32	flags;
+};
+
+enum bpf_sockopt_flags {
+	/* optval is a pointer to user space memory */
+	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
 };
 
 struct bpf_pidns_info {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 17883f5a44b9..3be9270bbc33 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8766,7 +8766,9 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/getsockname6",	CGROUP_SOCK_ADDR, BPF_CGROUP_INET6_GETSOCKNAME, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/sysctl",	CGROUP_SYSCTL, BPF_CGROUP_SYSCTL, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/getsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/getsockopt.s",	CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT, SEC_ATTACHABLE | SEC_SLEEPABLE),
 	SEC_DEF("cgroup/setsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/setsockopt.s",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLEEPABLE),
 	SEC_DEF("cgroup/dev",		CGROUP_DEVICE, BPF_CGROUP_DEVICE, SEC_ATTACHABLE_OPT),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("struct_ops.s+",	STRUCT_OPS, 0, SEC_SLEEPABLE),
-- 
2.34.1



* [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  2023-07-22  5:22 [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
@ 2023-07-22  5:22 ` kuifeng
  2023-07-22  6:39   ` kernel test robot
                     ` (4 more replies)
  2023-07-22  5:22 ` [RFC bpf-next 3/5] bpf: Add a new dynptr type for CGROUP_SOCKOPT kuifeng
                   ` (2 subsequent siblings)
  4 siblings, 5 replies; 20+ messages in thread
From: kuifeng @ 2023-07-22  5:22 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: sinquersw, Kui-Feng Lee

From: Kui-Feng Lee <kuifeng@meta.com>

Provide bpf_copy_from_user() and bpf_copy_to_user() to the BPF programs
attached to cgroup/{get,set}sockopt. bpf_copy_to_user() is a new kfunc
that copies data from a kernel space buffer to a user space buffer. Both
are only available to sleepable BPF programs.
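
For example (a sketch; the argument order follows the extern
declaration added to the selftest headers in patch 5), writing one
byte back to a user space optval from a sleepable program:

	char one = 0x55;

	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
		if (ctx->user_optval + 1 > ctx->user_optval_end)
			return 0;	/* bounds check */
		if (bpf_copy_to_user(ctx->user_optval, 1, &one))
			return 0;
		ctx->optlen = 1;
	}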

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 kernel/bpf/cgroup.c  |  6 ++++++
 kernel/bpf/helpers.c | 31 +++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index b268bbfa6c53..8e3a615f3fc8 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2413,6 +2413,12 @@ cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
+
+	case BPF_FUNC_copy_from_user:
+		if (prog->aux->sleepable)
+			return &bpf_copy_from_user_proto;
+		return NULL;
+
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 56ce5008aedd..5b1a62c20bb8 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -669,6 +669,26 @@ const struct bpf_func_proto bpf_copy_from_user_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+/**
+ * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)
+ *     Description
+ *             Read *size* bytes from kernel space address *kern_ptr* and
+ *              store the data in user space address *dst*. This is a
+ *              wrapper of **copy_to_user**\ ().
+ *     Return
+ *             0 on success, or a negative error in case of failure.
+ */
+__bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
+				 const void *src__ign)
+{
+	int ret = copy_to_user(dst__uninit, src__ign, dst__sz);
+
+	if (unlikely(ret))
+		return -EFAULT;
+
+	return ret;
+}
+
 BPF_CALL_5(bpf_copy_from_user_task, void *, dst, u32, size,
 	   const void __user *, user_ptr, struct task_struct *, tsk, u64, flags)
 {
@@ -2456,6 +2476,7 @@ BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_copy_to_user, KF_SLEEPABLE)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
@@ -2494,6 +2515,15 @@ static const struct btf_kfunc_id_set common_kfunc_set = {
 	.set   = &common_btf_ids,
 };
 
+BTF_SET8_START(cgroup_common_btf_ids)
+BTF_ID_FLAGS(func, bpf_copy_to_user, KF_SLEEPABLE)
+BTF_SET8_END(cgroup_common_btf_ids)
+
+static const struct btf_kfunc_id_set cgroup_kfunc_set = {
+	.owner	= THIS_MODULE,
+	.set	= &cgroup_common_btf_ids,
+};
+
 static int __init kfunc_init(void)
 {
 	int ret;
@@ -2513,6 +2543,7 @@ static int __init kfunc_init(void)
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCKOPT, &cgroup_kfunc_set);
 	ret = ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
 						  ARRAY_SIZE(generic_dtors),
 						  THIS_MODULE);
-- 
2.34.1



* [RFC bpf-next 3/5] bpf: Add a new dynptr type for CGROUP_SOCKOPT.
  2023-07-22  5:22 [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
@ 2023-07-22  5:22 ` kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 4/5] bpf: Prevent BPF programs from accessing the buffer pointed to by user_optval kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 5/5] bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type kuifeng
  4 siblings, 0 replies; 20+ messages in thread
From: kuifeng @ 2023-07-22  5:22 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: sinquersw, Kui-Feng Lee

From: Kui-Feng Lee <kuifeng@meta.com>

The new dynptr type (BPF_DYNPTR_TYPE_CGROUP_SOCKOPT) will be used by BPF
programs to create a buffer that can be installed on the ctx to replace
the existing optval or user_optval.

This feature is only allowed if ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_ALLOC
is set. It is enabled only for sleepable programs on the cgroup/setsockopt
hook.  BPF programs can install a new buffer, held by a dynptr, to
increase the size of the optval passed to setsockopt().

It is not enabled for cgroup/getsockopt, since a buffer created by the
user program cannot be enlarged to return data from getsockopt().
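
The expected flow (mirroring the TCP_CONGESTION selftest in patch 5)
is to allocate a dynptr-backed buffer, fill it, and install it on the
ctx:

	struct bpf_dynptr optval_buf;
	int err;

	err = bpf_sockopt_alloc_optval(ctx, 5, &optval_buf);
	if (err < 0) {
		bpf_sockopt_release_optval(ctx, &optval_buf);
		return 0;
	}
	bpf_dynptr_write(&optval_buf, 0, "cubic", 5, 0);
	err = bpf_sockopt_install_optval(ctx, &optval_buf);
	if (err < 0)
		return 0;
	ctx->optlen = 5;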

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf.h            |  7 +++-
 include/uapi/linux/bpf.h       |  2 +
 kernel/bpf/btf.c               |  3 ++
 kernel/bpf/cgroup.c            |  3 +-
 kernel/bpf/helpers.c           | 73 ++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          | 41 +++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |  2 +
 7 files changed, 129 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ceaa8c23287f..40eede696ac5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -663,12 +663,15 @@ enum bpf_type_flag {
 	/* DYNPTR points to xdp_buff */
 	DYNPTR_TYPE_XDP		= BIT(16 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to optval buffer of bpf_sockopt */
+	DYNPTR_TYPE_CGROUP_SOCKOPT = BIT(17 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
 #define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
-				 | DYNPTR_TYPE_XDP)
+				 | DYNPTR_TYPE_XDP | DYNPTR_TYPE_CGROUP_SOCKOPT)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -1206,6 +1209,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_SKB,
 	/* Underlying data is a xdp_buff */
 	BPF_DYNPTR_TYPE_XDP,
+	/* Underlying data is for the optval of a cgroup sock */
+	BPF_DYNPTR_TYPE_CGROUP_SOCKOPT,
 };
 
 int bpf_dynptr_check_size(u32 size);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b2f81193f97b..13bdf8e78d01 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7144,6 +7144,8 @@ struct bpf_sockopt {
 enum bpf_sockopt_flags {
 	/* optval is a pointer to user space memory */
 	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
+	/* able to allocate and install new optval */
+	BPF_SOCKOPT_FLAG_OPTVAL_ALLOC	= (1U << 1),
 };
 
 struct bpf_pidns_info {
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index ef9581a580e2..718494edbfc6 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -216,6 +216,7 @@ enum btf_kfunc_hook {
 	BTF_KFUNC_HOOK_SOCKET_FILTER,
 	BTF_KFUNC_HOOK_LWT,
 	BTF_KFUNC_HOOK_NETFILTER,
+	BTF_KFUNC_HOOK_CGROUP_SOCKOPT,
 	BTF_KFUNC_HOOK_MAX,
 };
 
@@ -7845,6 +7846,8 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
 		return BTF_KFUNC_HOOK_LWT;
 	case BPF_PROG_TYPE_NETFILTER:
 		return BTF_KFUNC_HOOK_NETFILTER;
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		return BTF_KFUNC_HOOK_CGROUP_SOCKOPT;
 	default:
 		return BTF_KFUNC_HOOK_MAX;
 	}
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8e3a615f3fc8..f42e76501e1c 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1847,7 +1847,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		ctx.optlen = *optlen;
 		ctx.user_optval = optval;
 		ctx.user_optval_end = optval + *optlen;
-		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER |
+			BPF_SOCKOPT_FLAG_OPTVAL_ALLOC;
 	} else {
 		/* Allocate a bit more than the initial user buffer for
 		 * BPF program. The canonical use case is overriding
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5b1a62c20bb8..92ec80e3a5d3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1557,6 +1557,7 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		/* Source and destination may possibly overlap, hence use memmove to
 		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
 		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
@@ -1602,6 +1603,7 @@ BPF_CALL_5(bpf_dynptr_write, const struct bpf_dynptr_kern *, dst, u32, offset, v
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		if (flags)
 			return -EINVAL;
 		/* Source and destination may possibly overlap, hence use memmove to
@@ -1654,6 +1656,7 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return (unsigned long)(ptr->data + ptr->offset + offset);
 	case BPF_DYNPTR_TYPE_SKB:
 	case BPF_DYNPTR_TYPE_XDP:
@@ -2281,6 +2284,7 @@ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return ptr->data + ptr->offset + offset;
 	case BPF_DYNPTR_TYPE_SKB:
 		if (buffer__opt)
@@ -2449,6 +2453,72 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+__bpf_kfunc int bpf_sockopt_alloc_optval(struct bpf_sockopt *sopt, int size,
+					 struct bpf_dynptr_kern *ptr__uninit)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	void *optval;
+	int err;
+
+	if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_ALLOC))
+		return -EINVAL;
+
+	err = bpf_dynptr_check_size(size);
+	if (err)
+		return err;
+
+	optval = kzalloc(size, GFP_KERNEL);
+	if (!optval)
+		return -ENOMEM;
+
+	bpf_dynptr_init(ptr__uninit, optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0, size);
+
+	return size;
+}
+
+__bpf_kfunc int bpf_sockopt_install_optval(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_ALLOC) ||
+	    bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
+		kfree(sopt_kern->optval);
+
+	sopt_kern->optval = ptr->data;
+	sopt_kern->optval_end = ptr->data + __bpf_dynptr_size(ptr);
+	sopt_kern->user_optval = NULL;
+	sopt_kern->user_optval_end = NULL;
+	sopt_kern->optlen = __bpf_dynptr_size(ptr);
+	sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_USER;
+
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+__bpf_kfunc int bpf_sockopt_release_optval(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_ALLOC) ||
+	    bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	kfree(ptr->data);
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2517,6 +2587,9 @@ static const struct btf_kfunc_id_set common_kfunc_set = {
 
 BTF_SET8_START(cgroup_common_btf_ids)
 BTF_ID_FLAGS(func, bpf_copy_to_user, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_alloc_optval)
+BTF_ID_FLAGS(func, bpf_sockopt_install_optval)
+BTF_ID_FLAGS(func, bpf_sockopt_release_optval)
 BTF_SET8_END(cgroup_common_btf_ids)
 
 static const struct btf_kfunc_id_set cgroup_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8cb25a775119..53e133525dc1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -744,6 +744,8 @@ static const char *dynptr_type_str(enum bpf_dynptr_type type)
 		return "skb";
 	case BPF_DYNPTR_TYPE_XDP:
 		return "xdp";
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return "cgroup_sockopt";
 	case BPF_DYNPTR_TYPE_INVALID:
 		return "<invalid>";
 	default:
@@ -825,6 +827,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_SKB;
 	case DYNPTR_TYPE_XDP:
 		return BPF_DYNPTR_TYPE_XDP;
+	case DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -841,6 +845,8 @@ static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 		return DYNPTR_TYPE_SKB;
 	case BPF_DYNPTR_TYPE_XDP:
 		return DYNPTR_TYPE_XDP;
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return 0;
 	}
@@ -848,7 +854,8 @@ static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 
 static bool dynptr_type_refcounted(enum bpf_dynptr_type type)
 {
-	return type == BPF_DYNPTR_TYPE_RINGBUF;
+	return type == BPF_DYNPTR_TYPE_RINGBUF ||
+		type == BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 }
 
 static void __mark_dynptr_reg(struct bpf_reg_state *reg,
@@ -10101,6 +10108,9 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
+	KF_bpf_sockopt_alloc_optval,
+	KF_bpf_sockopt_install_optval,
+	KF_bpf_sockopt_release_optval,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -10121,6 +10131,9 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_alloc_optval)
+BTF_ID(func, bpf_sockopt_install_optval)
+BTF_ID(func, bpf_sockopt_release_optval)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -10143,6 +10156,9 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_alloc_optval)
+BTF_ID(func, bpf_sockopt_install_optval)
+BTF_ID(func, bpf_sockopt_release_optval)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10796,6 +10812,20 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			arg_type |= OBJ_RELEASE;
 			break;
 		case KF_ARG_PTR_TO_DYNPTR:
+			if (meta->func_id == special_kfunc_list[KF_bpf_sockopt_install_optval] ||
+			    meta->func_id == special_kfunc_list[KF_bpf_sockopt_release_optval]) {
+				int ref_obj_id = dynptr_ref_obj_id(env, reg);
+
+				if (ref_obj_id < 0) {
+					verbose(env, "R%d is not a valid dynptr\n", regno);
+					return -EINVAL;
+				}
+
+				/* Required by check_func_arg_reg_off() */
+				arg_type |= ARG_PTR_TO_DYNPTR | OBJ_RELEASE;
+				meta->release_regno = regno;
+			}
+			break;
 		case KF_ARG_PTR_TO_ITER:
 		case KF_ARG_PTR_TO_LIST_HEAD:
 		case KF_ARG_PTR_TO_LIST_NODE:
@@ -10883,6 +10913,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 					verbose(env, "verifier internal error: missing ref obj id for parent of clone\n");
 					return -EFAULT;
 				}
+			} else if (meta->func_id == special_kfunc_list[KF_bpf_sockopt_alloc_optval] &&
+				   (dynptr_arg_type & MEM_UNINIT)) {
+				dynptr_arg_type |= DYNPTR_TYPE_CGROUP_SOCKOPT;
 			}
 
 			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type, clone_ref_obj_id);
@@ -11191,7 +11224,11 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	 * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
 	 */
 	if (meta.release_regno) {
-		err = release_reference(env, regs[meta.release_regno].ref_obj_id);
+		verbose(env, "release refcounted PTR_TO_BTF_ID %s\n",
+			meta.func_name);
+		if (meta.func_id == special_kfunc_list[KF_bpf_sockopt_install_optval] ||
+		    meta.func_id == special_kfunc_list[KF_bpf_sockopt_release_optval])
+			err = unmark_stack_slots_dynptr(env, &regs[meta.release_regno]);
+		else
+			err = release_reference(env, regs[meta.release_regno].ref_obj_id);
 		if (err) {
 			verbose(env, "kfunc %s#%d reference has not been acquired before\n",
 				func_name, meta.func_id);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b2f81193f97b..13bdf8e78d01 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7144,6 +7144,8 @@ struct bpf_sockopt {
 enum bpf_sockopt_flags {
 	/* optval is a pointer to user space memory */
 	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
+	/* able to allocate and install new optval */
+	BPF_SOCKOPT_FLAG_OPTVAL_ALLOC	= (1U << 1),
 };
 
 struct bpf_pidns_info {
-- 
2.34.1



* [RFC bpf-next 4/5] bpf: Prevent BPF programs from accessing the buffer pointed to by user_optval.
  2023-07-22  5:22 [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt kuifeng
                   ` (2 preceding siblings ...)
  2023-07-22  5:22 ` [RFC bpf-next 3/5] bpf: Add a new dynptr type for CGROUP_SOCKOPT kuifeng
@ 2023-07-22  5:22 ` kuifeng
  2023-07-22  5:22 ` [RFC bpf-next 5/5] bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type kuifeng
  4 siblings, 0 replies; 20+ messages in thread
From: kuifeng @ 2023-07-22  5:22 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: sinquersw, Kui-Feng Lee

From: Kui-Feng Lee <kuifeng@meta.com>

Since the buffer pointed to by ctx->user_optval is in user space, BPF
programs in kernel space should not access it directly.  They should use
bpf_copy_from_user() and bpf_copy_to_user() to move data between user and
kernel space.
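
With user_optval and user_optval_end tagged MEM_USER, the intended
contract is roughly the following (a sketch, not taken from the
selftests):

	__u8 val;

	val = ctx->user_optval[0];	/* rejected: user space memory */

	if (ctx->user_optval + 1 > ctx->user_optval_end)
		return 0;		/* bounds check */
	if (bpf_copy_from_user(&val, 1, ctx->user_optval))
		return 0;		/* allowed: copy through helper */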

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 kernel/bpf/cgroup.c   |  4 +--
 kernel/bpf/verifier.c | 66 +++++++++++++++++++++----------------------
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index f42e76501e1c..88f3a48ca8d2 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -2482,12 +2482,12 @@ static bool cg_sockopt_is_valid_access(int off, int size,
 	case offsetof(struct bpf_sockopt, user_optval):
 		if (size != sizeof(__u64))
 			return false;
-		info->reg_type = PTR_TO_PACKET;
+		info->reg_type = PTR_TO_PACKET | MEM_USER;
 		break;
 	case offsetof(struct bpf_sockopt, user_optval_end):
 		if (size != sizeof(__u64))
 			return false;
-		info->reg_type = PTR_TO_PACKET_END;
+		info->reg_type = PTR_TO_PACKET_END | MEM_USER;
 		break;
 	case offsetof(struct bpf_sockopt, flags):
 		if (size != sizeof(__u32))
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 53e133525dc1..93463731ccc3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13188,7 +13188,7 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
 	 * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16.
 	 */
 	bpf_for_each_reg_in_vstate(vstate, state, reg, ({
-		if (reg->type == type && reg->id == dst_reg->id)
+		if (base_type(reg->type) == type && reg->id == dst_reg->id)
 			/* keep the maximum range already checked */
 			reg->range = max(reg->range, new_range);
 	}));
@@ -13741,84 +13741,84 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 
 	switch (BPF_OP(insn->code)) {
 	case BPF_JGT:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' > pkt_end, pkt_meta' > pkt_data */
 			find_good_pkt_pointers(this_branch, dst_reg,
-					       dst_reg->type, false);
+					       base_type(dst_reg->type), false);
 			mark_pkt_end(other_branch, insn->dst_reg, true);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end > pkt_data', pkt_data > pkt_meta' */
 			find_good_pkt_pointers(other_branch, src_reg,
-					       src_reg->type, true);
+					       base_type(src_reg->type), true);
 			mark_pkt_end(this_branch, insn->src_reg, false);
 		} else {
 			return false;
 		}
 		break;
 	case BPF_JLT:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' < pkt_end, pkt_meta' < pkt_data */
 			find_good_pkt_pointers(other_branch, dst_reg,
-					       dst_reg->type, true);
+					       base_type(dst_reg->type), true);
 			mark_pkt_end(this_branch, insn->dst_reg, false);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end < pkt_data', pkt_data > pkt_meta' */
 			find_good_pkt_pointers(this_branch, src_reg,
-					       src_reg->type, false);
+					       base_type(src_reg->type), false);
 			mark_pkt_end(other_branch, insn->src_reg, true);
 		} else {
 			return false;
 		}
 		break;
 	case BPF_JGE:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' >= pkt_end, pkt_meta' >= pkt_data */
 			find_good_pkt_pointers(this_branch, dst_reg,
-					       dst_reg->type, true);
+					       base_type(dst_reg->type), true);
 			mark_pkt_end(other_branch, insn->dst_reg, false);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end >= pkt_data', pkt_data >= pkt_meta' */
 			find_good_pkt_pointers(other_branch, src_reg,
-					       src_reg->type, false);
+					       base_type(src_reg->type), false);
 			mark_pkt_end(this_branch, insn->src_reg, true);
 		} else {
 			return false;
 		}
 		break;
 	case BPF_JLE:
-		if ((dst_reg->type == PTR_TO_PACKET &&
-		     src_reg->type == PTR_TO_PACKET_END) ||
-		    (dst_reg->type == PTR_TO_PACKET_META &&
+		if ((base_type(dst_reg->type) == PTR_TO_PACKET &&
+		     base_type(src_reg->type) == PTR_TO_PACKET_END) ||
+		    (base_type(dst_reg->type) == PTR_TO_PACKET_META &&
 		     reg_is_init_pkt_pointer(src_reg, PTR_TO_PACKET))) {
 			/* pkt_data' <= pkt_end, pkt_meta' <= pkt_data */
 			find_good_pkt_pointers(other_branch, dst_reg,
-					       dst_reg->type, false);
+					       base_type(dst_reg->type), false);
 			mark_pkt_end(this_branch, insn->dst_reg, true);
-		} else if ((dst_reg->type == PTR_TO_PACKET_END &&
-			    src_reg->type == PTR_TO_PACKET) ||
+		} else if ((base_type(dst_reg->type) == PTR_TO_PACKET_END &&
+			    base_type(src_reg->type) == PTR_TO_PACKET) ||
 			   (reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
-			    src_reg->type == PTR_TO_PACKET_META)) {
+			    base_type(src_reg->type) == PTR_TO_PACKET_META)) {
 			/* pkt_end <= pkt_data', pkt_data <= pkt_meta' */
 			find_good_pkt_pointers(this_branch, src_reg,
-					       src_reg->type, true);
+					       base_type(src_reg->type), true);
 			mark_pkt_end(other_branch, insn->src_reg, false);
 		} else {
 			return false;
-- 
2.34.1



* [RFC bpf-next 5/5] bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type.
  2023-07-22  5:22 [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt kuifeng
                   ` (3 preceding siblings ...)
  2023-07-22  5:22 ` [RFC bpf-next 4/5] bpf: Prevent BPF programs from accessing the buffer pointed to by user_optval kuifeng
@ 2023-07-22  5:22 ` kuifeng
  4 siblings, 0 replies; 20+ messages in thread
From: kuifeng @ 2023-07-22  5:22 UTC
  To: bpf, ast, martin.lau, song, kernel-team, andrii; +Cc: sinquersw, Kui-Feng Lee

From: Kui-Feng Lee <kuifeng@meta.com>

Run the same tests as for the non-sleepable programs, this time with
sleepable versions of the cgroup/{get,set}sockopt programs attached.
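
With the standard BPF selftest setup, the new subtests can be
exercised with, e.g.:

	./test_progs -t sockopt_sk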

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 .../testing/selftests/bpf/bpf_experimental.h  |  27 ++
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  30 ++
 .../selftests/bpf/prog_tests/sockopt_sk.c     |  34 +-
 .../testing/selftests/bpf/progs/sockopt_sk.c  | 299 ++++++++++++++++++
 .../selftests/bpf/verifier/sleepable.c        |   2 +-
 5 files changed, 389 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 209811b1993a..80b82858b50c 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -131,4 +131,31 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
  */
 extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
 
+/* Description
+ *	Allocate a buffer of 'size' bytes to be installed as optval.
+ * Returns
+ *	> 0 on success, the size of the allocated buffer
+ *	-ENOMEM or -EINVAL on failure
+ */
+extern int bpf_sockopt_alloc_optval(struct bpf_sockopt *sopt, int size,
+				    struct bpf_dynptr *ptr__uninit) __ksym;
+
+/* Description
+ *	Install the buffer pointed to by 'ptr' as optval.
+ * Returns
+ *	0 on success
+ *	-EINVAL if the buffer is too small
+ */
+extern int bpf_sockopt_install_optval(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Release the buffer allocated by bpf_sockopt_alloc_optval.
+ * Returns
+ *	0 on success
+ *	-EINVAL if the buffer was not allocated by bpf_sockopt_alloc_optval
+ */
+extern int bpf_sockopt_release_optval(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
 #endif
diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index 642dda0e758a..a77a1851865e 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -41,4 +41,34 @@ extern bool bpf_dynptr_is_rdonly(const struct bpf_dynptr *ptr) __ksym;
 extern __u32 bpf_dynptr_size(const struct bpf_dynptr *ptr) __ksym;
 extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clone__init) __ksym;
 
+extern int bpf_copy_to_user(void *dst__uninit, __u32 dst__sz,
+			    const void *src_ign) __ksym;
+
+/* Description
+ *	Allocate a buffer of 'size' bytes to be installed as optval.
+ * Returns
+ *	> 0 on success, the size of the allocated buffer
+ *	-ENOMEM or -EINVAL on failure
+ */
+extern int bpf_sockopt_alloc_optval(struct bpf_sockopt *sopt, int size,
+				    struct bpf_dynptr *ptr__uninit) __ksym;
+
+/* Description
+ *	Install the buffer pointed to by 'ptr' as optval.
+ * Returns
+ *	0 on success
+ *	-EINVAL if the buffer is too small
+ */
+extern int bpf_sockopt_install_optval(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
+/* Description
+ *	Release the buffer allocated by bpf_sockopt_alloc_optval.
+ * Returns
+ *	0 on success
+ *	-EINVAL if the buffer was not allocated by bpf_sockopt_alloc_optval
+ */
+extern int bpf_sockopt_release_optval(struct bpf_sockopt *sopt,
+				      struct bpf_dynptr *ptr) __ksym;
+
 #endif
diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
index 05d0e07da394..e18a40d89860 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
@@ -92,6 +92,7 @@ static int getsetsockopt(void)
 	}
 	if (buf.u8[0] != 0x01) {
 		log_err("Unexpected buf[0] 0x%02x != 0x01", buf.u8[0]);
+		log_err("optlen %d", optlen);
 		goto err;
 	}
 
@@ -220,7 +221,7 @@ static int getsetsockopt(void)
 	return -1;
 }
 
-static void run_test(int cgroup_fd)
+static void run_test_nonsleepable(int cgroup_fd)
 {
 	struct sockopt_sk *skel;
 
@@ -246,6 +247,32 @@ static void run_test(int cgroup_fd)
 	sockopt_sk__destroy(skel);
 }
 
+static void run_test_sleepable(int cgroup_fd)
+{
+	struct sockopt_sk *skel;
+
+	skel = sockopt_sk__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	skel->bss->page_size = getpagesize();
+
+	skel->links._setsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._setsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._setsockopt_s, "setsockopt_link"))
+		goto cleanup;
+
+	skel->links._getsockopt_s =
+		bpf_program__attach_cgroup(skel->progs._getsockopt_s, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links._getsockopt_s, "getsockopt_link"))
+		goto cleanup;
+
+	ASSERT_OK(getsetsockopt(), "getsetsockopt");
+
+cleanup:
+	sockopt_sk__destroy(skel);
+}
+
 void test_sockopt_sk(void)
 {
 	int cgroup_fd;
@@ -254,6 +281,9 @@ void test_sockopt_sk(void)
 	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /sockopt_sk"))
 		return;
 
-	run_test(cgroup_fd);
+	if (test__start_subtest("nonsleepable"))
+		run_test_nonsleepable(cgroup_fd);
+	if (test__start_subtest("sleepable"))
+		run_test_sleepable(cgroup_fd);
 	close(cgroup_fd);
 }
diff --git a/tools/testing/selftests/bpf/progs/sockopt_sk.c b/tools/testing/selftests/bpf/progs/sockopt_sk.c
index cb990a7d3d45..33ac89c562a2 100644
--- a/tools/testing/selftests/bpf/progs/sockopt_sk.c
+++ b/tools/testing/selftests/bpf/progs/sockopt_sk.c
@@ -5,6 +5,9 @@
 #include <netinet/in.h>
 #include <bpf/bpf_helpers.h>
 
+typedef int bool;
+#include "bpf_kfuncs.h"
+
 char _license[] SEC("license") = "GPL";
 
 int page_size = 0; /* userspace should set it */
@@ -26,6 +29,53 @@ struct {
 	__type(value, struct sockopt_sk);
 } socket_storage_map SEC(".maps");
 
+/* Copy optval data to the destination even if optval is in user space. */
+static inline int cp_from_optval(struct bpf_sockopt *ctx,
+			void *dst, int len)
+{
+	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		if (len < 0 ||
+		    ctx->user_optval + len > ctx->user_optval_end)
+			return -1;
+		return bpf_copy_from_user(dst, len, ctx->user_optval);
+	}
+
+	if (len < 0 ||
+	    ctx->optval + len > ctx->optval_end)
+		return -1;
+	memcpy(dst, ctx->optval, len);
+
+	return 0;
+}
+
+/* Copy source data to optval even if optval is in user space. */
+static inline int cp_to_optval(struct bpf_sockopt *ctx,
+			       const void *src, int len)
+{
+	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		if (len < 0 ||
+		    ctx->user_optval + len > ctx->user_optval_end)
+			return -1;
+		return bpf_copy_to_user(ctx->user_optval, len, src);
+	}
+
+	#if 0
+	/* Somehow, this doesn't work.
+	 *
+	 * clang version 17.0.0
+	 *
+	 * progs/sockopt_sk.c:65:2: error: A call to built-in function
+	 * 'memcpy' is not supported.
+	 */
+	if (len < 0 ||
+	    ctx->optval + len > ctx->optval_end)
+		return -1;
+	memcpy(ctx->optval, src, len);
+	#endif
+
+	return 0;
+}
+
 SEC("cgroup/getsockopt")
 int _getsockopt(struct bpf_sockopt *ctx)
 {
@@ -136,6 +186,133 @@ int _getsockopt(struct bpf_sockopt *ctx)
 	return 1;
 }
 
+SEC("cgroup/getsockopt.s")
+int _getsockopt_s(struct bpf_sockopt *ctx)
+{
+	__u8 *optval_end = ctx->optval_end;
+	__u8 *optval = ctx->optval;
+	struct sockopt_sk *storage;
+	struct bpf_sock *sk;
+	struct tcp_zerocopy_receive zcvr;
+	char buf[1];
+	int ret;
+
+	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		optval_end = ctx->user_optval_end;
+		optval = ctx->user_optval;
+	}
+
+	/* Bypass AF_NETLINK. */
+	sk = ctx->sk;
+	if (sk && sk->family == AF_NETLINK)
+		goto out;
+
+	/* Make sure bpf_get_netns_cookie is callable.
+	 */
+	if (bpf_get_netns_cookie(NULL) == 0)
+		return 0;
+
+	if (bpf_get_netns_cookie(ctx) == 0)
+		return 0;
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
+		/* Not interested in SOL_IP:IP_TOS;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		goto out;
+	}
+
+	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
+		/* Not interested in SOL_SOCKET:SO_SNDBUF;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		goto out;
+	}
+
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
+		/* Not interested in SOL_TCP:TCP_CONGESTION;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		goto out;
+	}
+
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_ZEROCOPY_RECEIVE) {
+		/* Verify that TCP_ZEROCOPY_RECEIVE triggers.
+		 * It has a custom implementation for performance
+		 * reasons.
+		 */
+
+		/* Check that optval contains address (__u64) */
+		if (optval + sizeof(zcvr) > optval_end)
+			return 0; /* bounds check */
+
+		ret = cp_from_optval(ctx, &zcvr, sizeof(zcvr));
+		if (ret < 0)
+			return 0;
+		if (zcvr.address != 0)
+			return 0; /* unexpected data */
+
+		goto out;
+	}
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
+		if (optval + 1 > optval_end)
+			return 0; /* bounds check */
+
+		ctx->retval = 0; /* Reset system call return value to zero */
+
+		/* Always export 0x55 */
+		buf[0] = 0x55;
+		ret = cp_to_optval(ctx, buf, 1);
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 1;
+
+		/* Userspace buffer is PAGE_SIZE * 2, but BPF
+		 * program can only see the first PAGE_SIZE
+		 * bytes of data.
+		 */
+		if (optval_end - optval != page_size && 0)
+			return 0; /* unexpected data size */
+
+		return 1;
+	}
+
+	if (ctx->level != SOL_CUSTOM)
+		return 0; /* deny everything except custom level */
+
+	if (optval + 1 > optval_end)
+		return 0; /* bounds check */
+
+	storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!storage)
+		return 0; /* couldn't get sk storage */
+
+	if (!ctx->retval)
+		return 0; /* kernel should not have handled
+			   * SOL_CUSTOM, something is wrong!
+			   */
+	ctx->retval = 0; /* Reset system call return value to zero */
+
+	buf[0] = storage->val;
+	ret = cp_to_optval(ctx, buf, 1);
+	if (ret < 0)
+		return 0;
+	ctx->optlen = 1;
+
+	return 1;
+
+out:
+	/* optval larger than PAGE_SIZE uses the kernel's buffer. */
+	if (ctx->optlen > page_size)
+		ctx->optlen = 0;
+	return 1;
+}
+
 SEC("cgroup/setsockopt")
 int _setsockopt(struct bpf_sockopt *ctx)
 {
@@ -236,3 +413,125 @@ int _setsockopt(struct bpf_sockopt *ctx)
 		ctx->optlen = 0;
 	return 1;
 }
+
+SEC("cgroup/setsockopt.s")
+int _setsockopt_s(struct bpf_sockopt *ctx)
+{
+	__u8 *optval_end = ctx->optval_end;
+	struct bpf_dynptr optval_buf;
+	__u8 *optval = ctx->optval;
+	struct sockopt_sk *storage;
+	struct bpf_sock *sk;
+	__u8 tmp_u8;
+	__u32 tmp;
+	int ret;
+
+	if (!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_ALLOC))
+		return 0;
+
+	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		optval_end = ctx->user_optval_end;
+		optval = ctx->user_optval;
+	}
+
+	/* Bypass AF_NETLINK. */
+	sk = ctx->sk;
+	if (sk && sk->family == AF_NETLINK)
+		goto out;
+
+	/* Make sure bpf_get_netns_cookie is callable.
+	 */
+	if (bpf_get_netns_cookie(NULL) == 0)
+		return 0;
+
+	if (bpf_get_netns_cookie(ctx) == 0)
+		return 0;
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_TOS) {
+		/* Not interested in SOL_IP:IP_TOS;
+		 * let next BPF program in the cgroup chain or kernel
+		 * handle it.
+		 */
+		ctx->optlen = 0; /* bypass optval>PAGE_SIZE */
+		return 1;
+	}
+
+	if (ctx->level == SOL_SOCKET && ctx->optname == SO_SNDBUF) {
+		/* Overwrite SO_SNDBUF value */
+
+		if (optval + sizeof(__u32) > optval_end)
+			return 0; /* bounds check */
+
+		tmp = 0x55AA;
+		ret = cp_to_optval(ctx, &tmp, sizeof(tmp));
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 4;
+
+		return 1;
+	}
+
+	if (ctx->level == SOL_TCP && ctx->optname == TCP_CONGESTION) {
+		/* Always use cubic */
+
+		ret = bpf_sockopt_alloc_optval(ctx, 5, &optval_buf);
+		if (ret < 0) {
+			bpf_sockopt_release_optval(ctx, &optval_buf);
+			return 0;
+		}
+		bpf_dynptr_write(&optval_buf, 0, "cubic", 5, 0);
+		ret = bpf_sockopt_install_optval(ctx, &optval_buf);
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 5;
+
+		return 1;
+	}
+
+	if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
+		/* Original optlen is larger than PAGE_SIZE. */
+		if (ctx->optlen != page_size * 2)
+			return 0; /* unexpected data size */
+
+		ret = bpf_sockopt_alloc_optval(ctx, 1, &optval_buf);
+		if (ret < 0) {
+			bpf_sockopt_release_optval(ctx, &optval_buf);
+			return 0;
+		}
+		tmp_u8 = 0;
+		bpf_dynptr_write(&optval_buf, 0, &tmp_u8, 1, 0);
+		ret = bpf_sockopt_install_optval(ctx, &optval_buf);
+		if (ret < 0)
+			return 0;
+		ctx->optlen = 1;
+
+		return 1;
+	}
+
+	if (ctx->level != SOL_CUSTOM)
+		return 0; /* deny everything except custom level */
+
+	if (optval + 1 > optval_end)
+		return 0; /* bounds check */
+
+	storage = bpf_sk_storage_get(&socket_storage_map, ctx->sk, 0,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!storage)
+		return 0; /* couldn't get sk storage */
+
+	ret = cp_from_optval(ctx, &storage->val, 1);
+	if (ret < 0)
+		return 0;
+	ctx->optlen = -1; /* BPF has consumed this option, don't call kernel
+			   * setsockopt handler.
+			   */
+
+	return 1;
+
+out:
+	/* For optval larger than PAGE_SIZE, use the kernel's buffer. */
+	if (ctx->optlen > page_size)
+		ctx->optlen = 0;
+	return 1;
+}
+
diff --git a/tools/testing/selftests/bpf/verifier/sleepable.c b/tools/testing/selftests/bpf/verifier/sleepable.c
index 1f0d2bdc673f..4b6c1117ec9f 100644
--- a/tools/testing/selftests/bpf/verifier/sleepable.c
+++ b/tools/testing/selftests/bpf/verifier/sleepable.c
@@ -85,7 +85,7 @@
 	.expected_attach_type = BPF_TRACE_RAW_TP,
 	.kfunc = "sched_switch",
 	.result = REJECT,
-	.errstr = "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable",
+	.errstr = "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, cgroup, and struct_ops programs can be sleepable",
 	.flags = BPF_F_SLEEPABLE,
 	.runs = -1,
 },
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
@ 2023-07-22  6:39   ` kernel test robot
  2023-07-22  6:49   ` kernel test robot
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-07-22  6:39 UTC (permalink / raw)
  To: kuifeng; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/kuifeng-meta-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230722-132551
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230722052248.1062582-3-kuifeng%40meta.com
patch subject: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
config: x86_64-buildonly-randconfig-r003-20230720 (https://download.01.org/0day-ci/archive/20230722/202307221430.DScut0GF-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230722/202307221430.DScut0GF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307221430.DScut0GF-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/bpf/helpers.c:681:17: warning: no previous prototype for 'bpf_copy_to_user' [-Wmissing-prototypes]
     681 | __bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
         |                 ^~~~~~~~~~~~~~~~
--
>> kernel/bpf/helpers.c:673: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)


vim +/bpf_copy_to_user +681 kernel/bpf/helpers.c

   671	
   672	/**
 > 673	 * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)
   674	 *     Description
   675	 *             Read *size* bytes from kernel space address *kern_ptr* and
   676	 *              store the data in user space address *dst*. This is a
   677	 *              wrapper of **copy_to_user**\ ().
   678	 *     Return
   679	 *             0 on success, or a negative error in case of failure.
   680	 */
 > 681	__bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
   682					 const void *src__ign)
   683	{
   684		int ret = copy_to_user(dst__uninit, src__ign, dst__sz);
   685	
   686		if (unlikely(ret))
   687			return -EFAULT;
   688	
   689		return ret;
   690	}
   691	
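
A note on this warning: kfunc definitions in kernel/bpf/helpers.c are
normally wrapped in a __diag block, since their only callers are BPF
programs and the prototypes live in BTF. A minimal sketch of that
pattern applied to this kfunc (exact placement depends on the
surrounding code):

__diag_push();
__diag_ignore_all("-Wmissing-prototypes",
		  "Global functions as their definitions will be in vmlinux BTF");

__bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
				 const void *src__ign)
{
	/* copy_to_user() returns the number of bytes not copied */
	return copy_to_user(dst__uninit, src__ign, dst__sz) ? -EFAULT : 0;
}

__diag_pop();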

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
  2023-07-22  6:39   ` kernel test robot
@ 2023-07-22  6:49   ` kernel test robot
  2023-07-22  7:41   ` kernel test robot
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-07-22  6:49 UTC (permalink / raw)
  To: kuifeng; +Cc: llvm, oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/kuifeng-meta-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230722-132551
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230722052248.1062582-3-kuifeng%40meta.com
patch subject: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
config: hexagon-randconfig-r021-20230720 (https://download.01.org/0day-ci/archive/20230722/202307221437.QVQzz3cx-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230722/202307221437.QVQzz3cx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307221437.QVQzz3cx-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/bpf/helpers.c:4:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from kernel/bpf/helpers.c:4:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from kernel/bpf/helpers.c:4:
   In file included from include/linux/bpf.h:31:
   In file included from include/linux/memcontrol.h:13:
   In file included from include/linux/cgroup.h:26:
   In file included from include/linux/kernel_stat.h:9:
   In file included from include/linux/interrupt.h:11:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
>> kernel/bpf/helpers.c:681:17: warning: no previous prototype for function 'bpf_copy_to_user' [-Wmissing-prototypes]
     681 | __bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
         |                 ^
   kernel/bpf/helpers.c:681:13: note: declare 'static' if the function is not intended to be used outside of this translation unit
     681 | __bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
         |             ^
         |             static 
   7 warnings generated.


vim +/bpf_copy_to_user +681 kernel/bpf/helpers.c

   671	
   672	/**
   673	 * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)
   674	 *     Description
   675	 *             Read *size* bytes from kernel space address *kern_ptr* and
   676	 *              store the data in user space address *dst*. This is a
   677	 *              wrapper of **copy_to_user**\ ().
   678	 *     Return
   679	 *             0 on success, or a negative error in case of failure.
   680	 */
 > 681	__bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
   682					 const void *src__ign)
   683	{
   684		int ret = copy_to_user(dst__uninit, src__ign, dst__sz);
   685	
   686		if (unlikely(ret))
   687			return -EFAULT;
   688	
   689		return ret;
   690	}
   691	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
  2023-07-22  6:39   ` kernel test robot
  2023-07-22  6:49   ` kernel test robot
@ 2023-07-22  7:41   ` kernel test robot
  2023-07-22 20:26   ` kernel test robot
  2023-08-02 19:59   ` Martin KaFai Lau
  4 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-07-22  7:41 UTC (permalink / raw)
  To: kuifeng; +Cc: llvm, oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/kuifeng-meta-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230722-132551
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230722052248.1062582-3-kuifeng%40meta.com
patch subject: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
config: powerpc-randconfig-r015-20230720 (https://download.01.org/0day-ci/archive/20230722/202307221559.mSQLTotj-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230722/202307221559.mSQLTotj-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307221559.mSQLTotj-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/bpf/helpers.c:681:17: warning: no previous prototype for function 'bpf_copy_to_user' [-Wmissing-prototypes]
     681 | __bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
         |                 ^
   kernel/bpf/helpers.c:681:13: note: declare 'static' if the function is not intended to be used outside of this translation unit
     681 | __bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
         |             ^
         |             static 
   1 warning generated.
--
>> kernel/bpf/helpers.c:673: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)


vim +/bpf_copy_to_user +681 kernel/bpf/helpers.c

   671	
   672	/**
 > 673	 * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)
   674	 *     Description
   675	 *             Read *size* bytes from kernel space address *kern_ptr* and
   676	 *              store the data in user space address *dst*. This is a
   677	 *              wrapper of **copy_to_user**\ ().
   678	 *     Return
   679	 *             0 on success, or a negative error in case of failure.
   680	 */
 > 681	__bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
   682					 const void *src__ign)
   683	{
   684		int ret = copy_to_user(dst__uninit, src__ign, dst__sz);
   685	
   686		if (unlikely(ret))
   687			return -EFAULT;
   688	
   689		return ret;
   690	}
   691	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
@ 2023-07-22 16:09   ` kernel test robot
  2023-07-22 18:22   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-07-22 16:09 UTC (permalink / raw)
  To: kuifeng; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/kuifeng-meta-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230722-132551
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230722052248.1062582-2-kuifeng%40meta.com
patch subject: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
config: sparc64-randconfig-r081-20230720 (https://download.01.org/0day-ci/archive/20230723/202307230006.fFpl0zeg-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230723/202307230006.fFpl0zeg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307230006.fFpl0zeg-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/bpf/cgroup.c:1848:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval @@     got char [noderef] __user *optval @@
   kernel/bpf/cgroup.c:1848:33: sparse:     expected unsigned char [usertype] *[assigned] user_optval
   kernel/bpf/cgroup.c:1848:33: sparse:     got char [noderef] __user *optval
>> kernel/bpf/cgroup.c:1849:37: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval_end @@     got char [noderef] __user * @@
   kernel/bpf/cgroup.c:1849:37: sparse:     expected unsigned char [usertype] *[assigned] user_optval_end
   kernel/bpf/cgroup.c:1849:37: sparse:     got char [noderef] __user *
   kernel/bpf/cgroup.c:1972:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval @@     got char [noderef] __user *optval @@
   kernel/bpf/cgroup.c:1972:33: sparse:     expected unsigned char [usertype] *[assigned] user_optval
   kernel/bpf/cgroup.c:1972:33: sparse:     got char [noderef] __user *optval
   kernel/bpf/cgroup.c:1973:37: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval_end @@     got char [noderef] __user * @@
   kernel/bpf/cgroup.c:1973:37: sparse:     expected unsigned char [usertype] *[assigned] user_optval_end
   kernel/bpf/cgroup.c:1973:37: sparse:     got char [noderef] __user *
   kernel/bpf/cgroup.c: note: in included file (through include/linux/rculist.h, include/linux/pid.h, include/linux/sched.h, ...):
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_sock' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_socket' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_current' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_skb' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sk' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sock_addr' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sock_ops' - unexpected unlock
   kernel/bpf/cgroup.c:57:24: sparse: sparse: context imbalance in '__cgroup_bpf_check_dev_permission' - different lock contexts for basic block
   kernel/bpf/cgroup.c:57:24: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sysctl' - different lock contexts for basic block
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_setsockopt' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_getsockopt' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_getsockopt_kern' - unexpected unlock

vim +1848 kernel/bpf/cgroup.c

  1829	
  1830	int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
  1831					       int *optname, char __user *optval,
  1832					       int *optlen, char **kernel_optval)
  1833	{
  1834		struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
  1835		struct bpf_sockopt_buf buf = {};
  1836		struct bpf_sockopt_kern ctx = {
  1837			.sk = sk,
  1838			.level = *level,
  1839			.optname = *optname,
  1840		};
  1841		int ret, max_optlen;
  1842		bool alloc_mem;
  1843	
  1844		alloc_mem = !has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_SETSOCKOPT);
  1845		if (!alloc_mem) {
  1846			max_optlen = *optlen;
  1847			ctx.optlen = *optlen;
> 1848			ctx.user_optval = optval;
> 1849			ctx.user_optval_end = optval + *optlen;
  1850			ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
  1851		} else {
  1852			/* Allocate a bit more than the initial user buffer for
  1853			 * BPF program. The canonical use case is overriding
  1854			 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
  1855			 */
  1856			max_optlen = max_t(int, 16, *optlen);
  1857			max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
  1858			if (max_optlen < 0)
  1859				return max_optlen;
  1860	
  1861			ctx.optlen = *optlen;
  1862	
  1863			if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
  1864				ret = -EFAULT;
  1865				goto out;
  1866			}
  1867		}
  1868	
  1869		lock_sock(sk);
  1870		ret = bpf_prog_run_array_cg(&cgrp->bpf, CGROUP_SETSOCKOPT,
  1871					    &ctx, bpf_prog_run, 0, NULL);
  1872		release_sock(sk);
  1873	
  1874		if (ret)
  1875			goto out;
  1876	
  1877		if (ctx.optlen == -1) {
  1878			/* optlen set to -1, bypass kernel */
  1879			ret = 1;
  1880		} else if (alloc_mem &&
  1881			   (ctx.optlen > max_optlen || ctx.optlen < -1)) {
  1882			/* optlen is out of bounds */
  1883			if (*optlen > PAGE_SIZE && ctx.optlen >= 0) {
  1884				pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n",
  1885					     ctx.optlen, max_optlen);
  1886				ret = 0;
  1887				goto out;
  1888			}
  1889			ret = -EFAULT;
  1890		} else {
  1891			/* optlen within bounds, run kernel handler */
  1892			ret = 0;
  1893	
  1894			/* export any potential modifications */
  1895			*level = ctx.level;
  1896			*optname = ctx.optname;
  1897	
  1898			/* optlen == 0 from BPF indicates that we should
  1899			 * use original userspace data.
  1900			 */
  1901			if (ctx.optlen != 0) {
  1902				*optlen = ctx.optlen;
  1903				if (ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
  1904					return 0;
  1905				/* We've used bpf_sockopt_kern->buf as an intermediary
  1906				 * storage, but the BPF program indicates that we need
  1907				 * to pass this data to the kernel setsockopt handler.
  1908				 * No way to export on-stack buf, have to allocate a
  1909				 * new buffer.
  1910				 */
  1911				if (!sockopt_buf_allocated(&ctx, &buf)) {
  1912					void *p = kmalloc(ctx.optlen, GFP_USER);
  1913	
  1914					if (!p) {
  1915						ret = -ENOMEM;
  1916						goto out;
  1917					}
  1918					memcpy(p, ctx.optval, ctx.optlen);
  1919					*kernel_optval = p;
  1920				} else {
  1921					*kernel_optval = ctx.optval;
  1922				}
  1923				/* export and don't free sockopt buf */
  1924				return 0;
  1925			}
  1926		}
  1927	
  1928	out:
  1929		sockopt_free_buf(&ctx, &buf);
  1930		return ret;
  1931	}
  1932	
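
Assuming the assignments above are intentional (the BPF program never
dereferences these pointers directly), one sketch of a fix for the
sparse address-space warnings is to either declare the fields as
u8 __user * in struct bpf_sockopt_kern, or cast with __force at the
assignment sites:

	ctx.user_optval = (__force u8 *)optval;
	ctx.user_optval_end = (__force u8 *)(optval + *optlen);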

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
  2023-07-22 16:09   ` kernel test robot
@ 2023-07-22 18:22   ` kernel test robot
  2023-07-24 18:36   ` Stanislav Fomichev
  2023-08-02 19:25   ` Martin KaFai Lau
  3 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-07-22 18:22 UTC (permalink / raw)
  To: kuifeng; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/kuifeng-meta-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230722-132551
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230722052248.1062582-2-kuifeng%40meta.com
patch subject: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
config: sparc64-randconfig-r081-20230720 (https://download.01.org/0day-ci/archive/20230723/202307230213.tnEcymlR-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230723/202307230213.tnEcymlR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307230213.tnEcymlR-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/bpf/cgroup.c:1848:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval @@     got char [noderef] __user *optval @@
   kernel/bpf/cgroup.c:1848:33: sparse:     expected unsigned char [usertype] *[assigned] user_optval
   kernel/bpf/cgroup.c:1848:33: sparse:     got char [noderef] __user *optval
>> kernel/bpf/cgroup.c:1849:37: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval_end @@     got char [noderef] __user * @@
   kernel/bpf/cgroup.c:1849:37: sparse:     expected unsigned char [usertype] *[assigned] user_optval_end
   kernel/bpf/cgroup.c:1849:37: sparse:     got char [noderef] __user *
   kernel/bpf/cgroup.c:1972:33: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval @@     got char [noderef] __user *optval @@
   kernel/bpf/cgroup.c:1972:33: sparse:     expected unsigned char [usertype] *[assigned] user_optval
   kernel/bpf/cgroup.c:1972:33: sparse:     got char [noderef] __user *optval
   kernel/bpf/cgroup.c:1973:37: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected unsigned char [usertype] *[assigned] user_optval_end @@     got char [noderef] __user * @@
   kernel/bpf/cgroup.c:1973:37: sparse:     expected unsigned char [usertype] *[assigned] user_optval_end
   kernel/bpf/cgroup.c:1973:37: sparse:     got char [noderef] __user *
   kernel/bpf/cgroup.c: note: in included file (through include/linux/rculist.h, include/linux/pid.h, include/linux/sched.h, ...):
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_sock' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_socket' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_lsm_current' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_skb' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sk' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sock_addr' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sock_ops' - unexpected unlock
   kernel/bpf/cgroup.c:57:24: sparse: sparse: context imbalance in '__cgroup_bpf_check_dev_permission' - different lock contexts for basic block
   kernel/bpf/cgroup.c:57:24: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_sysctl' - different lock contexts for basic block
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_setsockopt' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_getsockopt' - unexpected unlock
   include/linux/rcupdate.h:780:9: sparse: sparse: context imbalance in '__cgroup_bpf_run_filter_getsockopt_kern' - unexpected unlock

vim +1848 kernel/bpf/cgroup.c

  1829	
  1830	int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
  1831					       int *optname, char __user *optval,
  1832					       int *optlen, char **kernel_optval)
  1833	{
  1834		struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
  1835		struct bpf_sockopt_buf buf = {};
  1836		struct bpf_sockopt_kern ctx = {
  1837			.sk = sk,
  1838			.level = *level,
  1839			.optname = *optname,
  1840		};
  1841		int ret, max_optlen;
  1842		bool alloc_mem;
  1843	
  1844		alloc_mem = !has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_SETSOCKOPT);
  1845		if (!alloc_mem) {
  1846			max_optlen = *optlen;
  1847			ctx.optlen = *optlen;
> 1848			ctx.user_optval = optval;
> 1849			ctx.user_optval_end = optval + *optlen;
  1850			ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
  1851		} else {
  1852			/* Allocate a bit more than the initial user buffer for
  1853			 * BPF program. The canonical use case is overriding
  1854			 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
  1855			 */
  1856			max_optlen = max_t(int, 16, *optlen);
  1857			max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
  1858			if (max_optlen < 0)
  1859				return max_optlen;
  1860	
  1861			ctx.optlen = *optlen;
  1862	
  1863			if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
  1864				ret = -EFAULT;
  1865				goto out;
  1866			}
  1867		}
  1868	
  1869		lock_sock(sk);
  1870		ret = bpf_prog_run_array_cg(&cgrp->bpf, CGROUP_SETSOCKOPT,
  1871					    &ctx, bpf_prog_run, 0, NULL);
  1872		release_sock(sk);
  1873	
  1874		if (ret)
  1875			goto out;
  1876	
  1877		if (ctx.optlen == -1) {
  1878			/* optlen set to -1, bypass kernel */
  1879			ret = 1;
  1880		} else if (alloc_mem &&
  1881			   (ctx.optlen > max_optlen || ctx.optlen < -1)) {
  1882			/* optlen is out of bounds */
  1883			if (*optlen > PAGE_SIZE && ctx.optlen >= 0) {
  1884				pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n",
  1885					     ctx.optlen, max_optlen);
  1886				ret = 0;
  1887				goto out;
  1888			}
  1889			ret = -EFAULT;
  1890		} else {
  1891			/* optlen within bounds, run kernel handler */
  1892			ret = 0;
  1893	
  1894			/* export any potential modifications */
  1895			*level = ctx.level;
  1896			*optname = ctx.optname;
  1897	
  1898			/* optlen == 0 from BPF indicates that we should
  1899			 * use original userspace data.
  1900			 */
  1901			if (ctx.optlen != 0) {
  1902				*optlen = ctx.optlen;
  1903				if (ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
  1904					return 0;
  1905				/* We've used bpf_sockopt_kern->buf as an intermediary
  1906				 * storage, but the BPF program indicates that we need
  1907				 * to pass this data to the kernel setsockopt handler.
  1908				 * No way to export on-stack buf, have to allocate a
  1909				 * new buffer.
  1910				 */
  1911				if (!sockopt_buf_allocated(&ctx, &buf)) {
  1912					void *p = kmalloc(ctx.optlen, GFP_USER);
  1913	
  1914					if (!p) {
  1915						ret = -ENOMEM;
  1916						goto out;
  1917					}
  1918					memcpy(p, ctx.optval, ctx.optlen);
  1919					*kernel_optval = p;
  1920				} else {
  1921					*kernel_optval = ctx.optval;
  1922				}
  1923				/* export and don't free sockopt buf */
  1924				return 0;
  1925			}
  1926		}
  1927	
  1928	out:
  1929		sockopt_free_buf(&ctx, &buf);
  1930		return ret;
  1931	}
  1932	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
                     ` (2 preceding siblings ...)
  2023-07-22  7:41   ` kernel test robot
@ 2023-07-22 20:26   ` kernel test robot
  2023-08-02 19:59   ` Martin KaFai Lau
  4 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-07-22 20:26 UTC (permalink / raw)
  To: kuifeng; +Cc: oe-kbuild-all

Hi,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/kuifeng-meta-com/bpf-enable-sleepable-BPF-programs-attached-to-cgroup-get-set-sockopt/20230722-132551
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230722052248.1062582-3-kuifeng%40meta.com
patch subject: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
config: sparc64-randconfig-r081-20230720 (https://download.01.org/0day-ci/archive/20230723/202307230401.pDIhqDbi-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230723/202307230401.pDIhqDbi-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307230401.pDIhqDbi-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/bpf/helpers.c:684:32: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected void [noderef] __user *to @@     got void *dst__uninit @@
   kernel/bpf/helpers.c:684:32: sparse:     expected void [noderef] __user *to
   kernel/bpf/helpers.c:684:32: sparse:     got void *dst__uninit
   kernel/bpf/helpers.c: note: in included file (through arch/sparc/include/asm/cmpxchg.h, arch/sparc/include/asm/atomic_64.h, arch/sparc/include/asm/atomic.h, ...):
   arch/sparc/include/asm/cmpxchg_64.h:161:55: sparse: sparse: cast truncates bits from constant value (eb9f becomes 9f)
   arch/sparc/include/asm/cmpxchg_64.h:161:55: sparse: sparse: cast truncates bits from constant value (eb9f becomes 9f)
   kernel/bpf/helpers.c:2442:18: sparse: sparse: context imbalance in 'bpf_rcu_read_lock' - wrong count at exit
   kernel/bpf/helpers.c:2447:18: sparse: sparse: context imbalance in 'bpf_rcu_read_unlock' - unexpected unlock

vim +684 kernel/bpf/helpers.c

   671	
   672	/**
   673	 * long bpf_copy_to_user(void *dst, u32 size, const void *kern_ptr)
   674	 *     Description
   675	 *             Read *size* bytes from kernel space address *kern_ptr* and
   676	 *              store the data in user space address *dst*. This is a
   677	 *              wrapper of **copy_to_user**\ ().
   678	 *     Return
   679	 *             0 on success, or a negative error in case of failure.
   680	 */
   681	__bpf_kfunc int bpf_copy_to_user(void *dst__uninit, u32 dst__sz,
   682					 const void *src__ign)
   683	{
 > 684		int ret = copy_to_user(dst__uninit, src__ign, dst__sz);
   685	
   686		if (unlikely(ret))
   687			return -EFAULT;
   688	
   689		return ret;
   690	}
   691	
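
Assuming dst__uninit always refers to user memory when this kfunc runs,
a sketch of a fix for this sparse warning is to annotate the parameter
as __user (or add a __force cast at the copy_to_user() call):

__bpf_kfunc int bpf_copy_to_user(void __user *dst__uninit, u32 dst__sz,
				 const void *src__ign)
{
	if (copy_to_user(dst__uninit, src__ign, dst__sz))
		return -EFAULT;
	return 0;
}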

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
  2023-07-22 16:09   ` kernel test robot
  2023-07-22 18:22   ` kernel test robot
@ 2023-07-24 18:36   ` Stanislav Fomichev
  2023-07-31 22:02     ` Kui-Feng Lee
  2023-08-02 19:25   ` Martin KaFai Lau
  3 siblings, 1 reply; 20+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 18:36 UTC (permalink / raw)
  To: kuifeng; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sinquersw

On 07/21, kuifeng@meta.com wrote:
> From: Kui-Feng Lee <kuifeng@meta.com>
> 
> Enable sleepable cgroup/{get,set}sockopt hooks.
> 
> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> receive a pointer to the optval in user space instead of a kernel
> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
> beginning and end of the user space buffer if receiving a user space
> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
> a kernel space buffer.
> 
> A program receives a user space buffer if ctx->flags &
> BPF_SOCKOPT_FLAG_OPTVAL_USER is true, otherwise it receives a kernel space
> buffer.  The BPF programs should not read/write from/to a user space buffer
> directly.  They should access the buffer through bpf_copy_from_user() and
> bpf_copy_to_user() provided in the following patches.
> 
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>  include/linux/filter.h         |   3 +
>  include/uapi/linux/bpf.h       |   9 ++
>  kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>  kernel/bpf/verifier.c          |   7 +-
>  tools/include/uapi/linux/bpf.h |   9 ++
>  tools/lib/bpf/libbpf.c         |   2 +
>  6 files changed, 176 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index f69114083ec7..301dd1ba0de1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>  	s32		level;
>  	s32		optname;
>  	s32		optlen;
> +	u32		flags;
> +	u8		*user_optval;
> +	u8		*user_optval_end;
>  	/* for retval in struct bpf_cg_run_ctx */
>  	struct task_struct *current_task;
>  	/* Temporary "register" for indirect stores to ppos. */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 739c15906a65..b2f81193f97b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>  	__s32	optname;
>  	__s32	optlen;
>  	__s32	retval;
> +
> +	__bpf_md_ptr(void *, user_optval);
> +	__bpf_md_ptr(void *, user_optval_end);

Can we re-purpose existing optval/optval_end pointers
for the sleepable programs? IOW, when the prog is sleepable,
pass user pointers via optval/optval_end and require the programs
to do copy_to/from on this buffer (even if the backing pointer might be
in kernel memory - we can handle that in the kfuncs?).
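
As a rough illustration of that shape (bpf_sockopt_copy_from() below is
a hypothetical kfunc name, not something this series defines), a
sleepable program would then look roughly like:

SEC("cgroup/setsockopt.s")
int _setsockopt_s(struct bpf_sockopt *ctx)
{
	char cc[16] = {};

	if (ctx->level != SOL_TCP || ctx->optname != TCP_CONGESTION)
		return 1;
	/* ctx->optval may back onto user or kernel memory; the kfunc
	 * would pick the right copy primitive internally.
	 */
	if (bpf_sockopt_copy_from(ctx, cc, sizeof(cc) - 1) < 0)
		return 0;
	return 1;
}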

The fact that the program now needs to look at the flag
(BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
use makes the handling even more complicated; and we already have a
bunch of hairy stuff in these hooks. (or am I misreading the change?)

Also, regarding sleepable and non-sleepable co-existence: do we really need
that? Can we say that all the programs have to be sleepable
or non-sleepable? Mixing them complicates the sharing of that buffer.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-24 18:36   ` Stanislav Fomichev
@ 2023-07-31 22:02     ` Kui-Feng Lee
  2023-07-31 23:35       ` Stanislav Fomichev
  0 siblings, 1 reply; 20+ messages in thread
From: Kui-Feng Lee @ 2023-07-31 22:02 UTC (permalink / raw)
  To: Stanislav Fomichev, kuifeng
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii

Sorry for the late reply! I just got back from a vacation.


On 7/24/23 11:36, Stanislav Fomichev wrote:
> On 07/21, kuifeng@meta.com wrote:
>> From: Kui-Feng Lee <kuifeng@meta.com>
>>
>> Enable sleepable cgroup/{get,set}sockopt hooks.
>>
>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
>> receive a pointer to the optval in user space instead of a kernel
>> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
>> beginning and end of the user space buffer if receiving a user space
>> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
>> a kernel space buffer.
>>
>> A program receives a user space buffer if ctx->flags &
>> BPF_SOCKOPT_FLAG_OPTVAL_USER is true, otherwise it receives a kernel space
>> buffer.  The BPF programs should not read/write from/to a user space buffer
>> directly.  They should access the buffer through bpf_copy_from_user() and
>> bpf_copy_to_user() provided in the following patches.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/filter.h         |   3 +
>>   include/uapi/linux/bpf.h       |   9 ++
>>   kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>>   kernel/bpf/verifier.c          |   7 +-
>>   tools/include/uapi/linux/bpf.h |   9 ++
>>   tools/lib/bpf/libbpf.c         |   2 +
>>   6 files changed, 176 insertions(+), 43 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index f69114083ec7..301dd1ba0de1 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>>   	s32		level;
>>   	s32		optname;
>>   	s32		optlen;
>> +	u32		flags;
>> +	u8		*user_optval;
>> +	u8		*user_optval_end;
>>   	/* for retval in struct bpf_cg_run_ctx */
>>   	struct task_struct *current_task;
>>   	/* Temporary "register" for indirect stores to ppos. */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 739c15906a65..b2f81193f97b 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>>   	__s32	optname;
>>   	__s32	optlen;
>>   	__s32	retval;
>> +
>> +	__bpf_md_ptr(void *, user_optval);
>> +	__bpf_md_ptr(void *, user_optval_end);
> 
> Can we re-purpose existing optval/optval_end pointers
> for the sleepable programs? IOW, when the prog is sleepable,
> pass user pointers via optval/optval_end and require the programs
> to do copy_to/from on this buffer (even if the backing pointer might be
> in kernel memory - we can handle that in the kfuncs?).
> 
> The fact that the program now needs to look at the flag
> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
> use makes the handling even more complicated; and we already have a
> bunch of hairy stuff in these hooks. (or am I misreading the change?)
> 
> Also, regarding sleepable and non-sleepable co-existence: do we really need
> that? Can we say that all the programs have to be sleepable
> or non-sleepable? Mixing them complicates the sharing of that buffer.

I considered this approach as well. This is an open question for me.
If we go this way, it means we cannot attach a BPF program of one type
if any program of the other type has been installed.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-31 22:02     ` Kui-Feng Lee
@ 2023-07-31 23:35       ` Stanislav Fomichev
  2023-08-01 17:31         ` Kui-Feng Lee
  0 siblings, 1 reply; 20+ messages in thread
From: Stanislav Fomichev @ 2023-07-31 23:35 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: kuifeng, bpf, ast, martin.lau, song, kernel-team, andrii

On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
> Sorry for the late reply! I just got back from a vacation.
>
>
> On 7/24/23 11:36, Stanislav Fomichev wrote:
> > On 07/21, kuifeng@meta.com wrote:
> >> From: Kui-Feng Lee <kuifeng@meta.com>
> >>
> >> Enable sleepable cgroup/{get,set}sockopt hooks.
> >>
> >> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> >> receive a pointer to the optval in user space instead of a kernel
> >> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
> >> beginning and end of the user space buffer if receiving a user space
> >> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
> >> a kernel space buffer.
> >>
> >> A program receives a user space buffer if ctx->flags &
> >> BPF_SOCKOPT_FLAG_OPTVAL_USER is true, otherwise it receives a kernel space
> >> buffer.  The BPF programs should not read/write from/to a user space buffer
> >> directly.  They should access the buffer through bpf_copy_from_user() and
> >> bpf_copy_to_user() provided in the following patches.
> >>
> >> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >> ---
> >>   include/linux/filter.h         |   3 +
> >>   include/uapi/linux/bpf.h       |   9 ++
> >>   kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
> >>   kernel/bpf/verifier.c          |   7 +-
> >>   tools/include/uapi/linux/bpf.h |   9 ++
> >>   tools/lib/bpf/libbpf.c         |   2 +
> >>   6 files changed, 176 insertions(+), 43 deletions(-)
> >>
> >> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >> index f69114083ec7..301dd1ba0de1 100644
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
> >>      s32             level;
> >>      s32             optname;
> >>      s32             optlen;
> >> +    u32             flags;
> >> +    u8              *user_optval;
> >> +    u8              *user_optval_end;
> >>      /* for retval in struct bpf_cg_run_ctx */
> >>      struct task_struct *current_task;
> >>      /* Temporary "register" for indirect stores to ppos. */
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 739c15906a65..b2f81193f97b 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
> >>      __s32   optname;
> >>      __s32   optlen;
> >>      __s32   retval;
> >> +
> >> +    __bpf_md_ptr(void *, user_optval);
> >> +    __bpf_md_ptr(void *, user_optval_end);
> >
> > Can we re-purpose existing optval/optval_end pointers
> > for the sleepable programs? IOW, when the prog is sleepable,
> > pass user pointers via optval/optval_end and require the programs
> > to do copy_to/from on this buffer (even if the backing pointer might be
> > in kernel memory - we can handle that in the kfuncs?).
> >
> > The fact that the program now needs to look at the flag
> > (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
> > use makes the handling even more complicated; and we already have a
>>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
> >
> > Also, regarding sleepable and non-sleepable co-existence: do we really need
> > that? Can we say that all the programs have to be sleepable
> > or non-sleepable? Mixing them complicates the sharing of that buffer.
>
> I considered this approach as well. This is an open question for me.
> >> If we go this way, it means we cannot attach a BPF program of one type
> if any program of the other type has been installed.

If we pass two pointers (kernel copy buffer + real user mem) to the
sleepable program, we'll make it even more complicated by inheriting
all existing warts of the non-sleepable version :-(
IOW, feels like we should try to see if we can have some
copy_to/from_user kfuncs in the sleepable version that transparently
support either kernel or user memory (and prohibit direct access to
user_optval in the sleepable version).
And then, if we have one non-sleepable program in the chain, we can
fall back to the kernel buffer for everything (maybe).
This way seems like we can support both versions in the same chain and
have a more sane api?
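
A minimal sketch of such a transparent copy kfunc on the kernel side,
reusing the flags/pointers this series adds to struct bpf_sockopt_kern
(the kfunc name here is made up for illustration):

__bpf_kfunc int bpf_sockopt_copy_from(struct bpf_sockopt_kern *ctx,
				      void *dst, u32 len)
{
	if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
		/* backed by user memory: bounds-check, then fault in */
		if (ctx->user_optval + len > ctx->user_optval_end)
			return -EINVAL;
		return copy_from_user(dst, (__force void __user *)ctx->user_optval,
				      len) ? -EFAULT : 0;
	}
	/* backed by the kernel copy: plain memcpy after bounds check */
	if (ctx->optval + len > ctx->optval_end)
		return -EINVAL;
	memcpy(dst, ctx->optval, len);
	return 0;
}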

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-31 23:35       ` Stanislav Fomichev
@ 2023-08-01 17:31         ` Kui-Feng Lee
  2023-08-01 18:08           ` Stanislav Fomichev
  0 siblings, 1 reply; 20+ messages in thread
From: Kui-Feng Lee @ 2023-08-01 17:31 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: kuifeng, bpf, ast, martin.lau, song, kernel-team, andrii



On 7/31/23 16:35, Stanislav Fomichev wrote:
> On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>> Sorry for the late reply! I just got back from a vacation.
>>
>>
>> On 7/24/23 11:36, Stanislav Fomichev wrote:
>>> On 07/21, kuifeng@meta.com wrote:
>>>> From: Kui-Feng Lee <kuifeng@meta.com>
>>>>
>>>> Enable sleepable cgroup/{get,set}sockopt hooks.
>>>>
>>>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
>>>> receive a pointer to the optval in user space instead of a kernel
>>>> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
>>>> beginning and end of the user space buffer if receiving a user space
>>>> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
>>>> a kernel space buffer.
>>>>
>>>> A program receives a user space buffer if ctx->flags &
>>>> BPF_SOCKOPT_FLAG_OPTVAL_USER is true, otherwise it receives a kernel space
>>>> buffer.  The BPF programs should not read/write from/to a user space buffer
>>>> dirrectly.  It should access the buffer through bpf_copy_from_user() and
>>>> bpf_copy_to_user() provided in the following patches.
>>>>
>>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>>> ---
>>>>    include/linux/filter.h         |   3 +
>>>>    include/uapi/linux/bpf.h       |   9 ++
>>>>    kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>>>>    kernel/bpf/verifier.c          |   7 +-
>>>>    tools/include/uapi/linux/bpf.h |   9 ++
>>>>    tools/lib/bpf/libbpf.c         |   2 +
>>>>    6 files changed, 176 insertions(+), 43 deletions(-)
>>>>
>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>> index f69114083ec7..301dd1ba0de1 100644
>>>> --- a/include/linux/filter.h
>>>> +++ b/include/linux/filter.h
>>>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>>>>       s32             level;
>>>>       s32             optname;
>>>>       s32             optlen;
>>>> +    u32             flags;
>>>> +    u8              *user_optval;
>>>> +    u8              *user_optval_end;
>>>>       /* for retval in struct bpf_cg_run_ctx */
>>>>       struct task_struct *current_task;
>>>>       /* Temporary "register" for indirect stores to ppos. */
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 739c15906a65..b2f81193f97b 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>>>>       __s32   optname;
>>>>       __s32   optlen;
>>>>       __s32   retval;
>>>> +
>>>> +    __bpf_md_ptr(void *, user_optval);
>>>> +    __bpf_md_ptr(void *, user_optval_end);
>>>
>>> Can we re-purpose existing optval/optval_end pointers
>>> for the sleepable programs? IOW, when the prog is sleepable,
>>> pass user pointers via optval/optval_end and require the programs
>>> to do copy_to/from on this buffer (even if the backing pointer might be
>>> in kernel memory - we can handle that in the kfuncs?).
>>>
>>> The fact that the program now needs to look at the flag
>>> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
>>> use makes the handling even more complicated; and we already have a
>>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
>>>
>>> Also, regarding sleepable and non-sleepable co-existence: do we really need
>>> that? Can we say that all the programs have to be sleepable
>>> or non-sleepable? Mixing them complicates the sharing of that buffer.
>>
>> I considered this approach as well. This is an open question for me.
>> If we go this way, it means we cannot attach a BPF program of one type
>> if any program of the other type has been installed.
> 
> If we pass two pointers (kernel copy buffer + real user mem) to the
> sleepable program, we'll make it even more complicated by inheriting
> all existing warts of the non-sleepable version :-(
> IOW, feels like we should try to see if we can have some
> copy_to/from_user kfuncs in the sleepable version that transparently
> support either kernel or user memory (and prohibit direct access to
> user_optval in the sleepable version).
> And then, if we have one non-sleepable program in the chain, we can
> > fall back to the kernel buffer for everything (maybe).
> This way seems like we can support both versions in the same chain and
> have a more sane api?

Basically, you are saying to move cp_from_optval() and cp_to_optval() in
the testcase to kfuncs. This can cause an unnecessary copy. We can add
an API to make a dynptr from the ctx to avoid unnecessary copies.
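
For example (bpf_sockopt_dynptr() is a hypothetical name for such an
API), the program side might become:

	struct bpf_dynptr optval;
	char cc[16] = {};

	if (bpf_sockopt_dynptr(ctx, &optval))
		return 0;
	/* Reads go through the dynptr; no up-front copy of the whole
	 * buffer is needed.
	 */
	if (bpf_dynptr_read(cc, sizeof(cc) - 1, &optval, 0, 0))
		return 0;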

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-08-01 17:31         ` Kui-Feng Lee
@ 2023-08-01 18:08           ` Stanislav Fomichev
  2023-08-02 22:28             ` Martin KaFai Lau
  0 siblings, 1 reply; 20+ messages in thread
From: Stanislav Fomichev @ 2023-08-01 18:08 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: kuifeng, bpf, ast, martin.lau, song, kernel-team, andrii

On Tue, Aug 1, 2023 at 10:31 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 7/31/23 16:35, Stanislav Fomichev wrote:
> > On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
> >>
> >> Sorry for the late reply! I just got back from a vacation.
> >>
> >>
> >> On 7/24/23 11:36, Stanislav Fomichev wrote:
> >>> On 07/21, kuifeng@meta.com wrote:
> >>>> From: Kui-Feng Lee <kuifeng@meta.com>
> >>>>
> >>>> Enable sleepable cgroup/{get,set}sockopt hooks.
> >>>>
> >>>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> >>>> receive a pointer to the optval in user space instead of a kernel
> >>>> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
> >>>> beginning and end of the user space buffer if receiving a user space
> >>>> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
> >>>> a kernel space buffer.
> >>>>
> >>>> A program receives a user space buffer if ctx->flags &
> >>>> BPF_SOCKOPT_FLAG_OPTVAL_USER is true, otherwise it receives a kernel space
> >>>> buffer.  The BPF programs should not read/write from/to a user space buffer
> >>>> dirrectly.  It should access the buffer through bpf_copy_from_user() and
> >>>> bpf_copy_to_user() provided in the following patches.
> >>>>
> >>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >>>> ---
> >>>>    include/linux/filter.h         |   3 +
> >>>>    include/uapi/linux/bpf.h       |   9 ++
> >>>>    kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
> >>>>    kernel/bpf/verifier.c          |   7 +-
> >>>>    tools/include/uapi/linux/bpf.h |   9 ++
> >>>>    tools/lib/bpf/libbpf.c         |   2 +
> >>>>    6 files changed, 176 insertions(+), 43 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >>>> index f69114083ec7..301dd1ba0de1 100644
> >>>> --- a/include/linux/filter.h
> >>>> +++ b/include/linux/filter.h
> >>>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
> >>>>       s32             level;
> >>>>       s32             optname;
> >>>>       s32             optlen;
> >>>> +    u32             flags;
> >>>> +    u8              *user_optval;
> >>>> +    u8              *user_optval_end;
> >>>>       /* for retval in struct bpf_cg_run_ctx */
> >>>>       struct task_struct *current_task;
> >>>>       /* Temporary "register" for indirect stores to ppos. */
> >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>>> index 739c15906a65..b2f81193f97b 100644
> >>>> --- a/include/uapi/linux/bpf.h
> >>>> +++ b/include/uapi/linux/bpf.h
> >>>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
> >>>>       __s32   optname;
> >>>>       __s32   optlen;
> >>>>       __s32   retval;
> >>>> +
> >>>> +    __bpf_md_ptr(void *, user_optval);
> >>>> +    __bpf_md_ptr(void *, user_optval_end);
> >>>
> >>> Can we re-purpose existing optval/optval_end pointers
> >>> for the sleepable programs? IOW, when the prog is sleepable,
> >>> pass user pointers via optval/optval_end and require the programs
> >>> to do copy_to/from on this buffer (even if the backing pointer might be
> >>> in kernel memory - we can handle that in the kfuncs?).
> >>>
> >>> The fact that the program now needs to look at the flag
> >>> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
> >>> use makes the handling even more complicated; and we already have a
> >>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
> >>>
> >>> Also, regarding sleepable and non-sleepable co-existence: do we really need
> >>> that? Can we say that all the programs have to be sleepable
> >>> or non-sleepable? Mixing them complicates the sharing of that buffer.
> >>
> >> I considered this approach as well. This is an open question for me.
> >> If we go this way, it means we cannot attach a BPF program of one type
> >> if any program of the other type has been installed.
> >
> > If we pass two pointers (kernel copy buffer + real user mem) to the
> > sleepable program, we'll make it even more complicated by inheriting
> > all existing warts of the non-sleepable version :-(
> > IOW, feels like we should try to see if we can have some
> > copy_to/from_user kfuncs in the sleepable version that transparently
> > support either kernel or user memory (and prohibit direct access to
> > user_optval in the sleepable version).
> > And then, if we have one non-sleepable program in the chain, we can
> > fallback everything to the kernel buffer (maybe).
> > This way seems like we can support both versions in the same chain and
> > have a more sane api?
>
> Basically, you are saying to move cp_from_optval() and cp_to_optval() in
> the testcase to kfuncs. This can cause unnecessary copies. We can add
> an API to make a dynptr from the ctx to avoid unnecessary copies.

Yeah, handle this transparently in the kfunc or via dynptr, whatever works.
I'm not too worried about the extra copy tbh, this is a slow path; I'm
more concerned about improving the bpf program / user experience.
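
To make the transparent-kfunc idea concrete, a minimal kernel-side sketch
(bpf_sockopt_copy_from_optval and its exact signature are invented here for
illustration; they are not part of this RFC):

    /* Hypothetical kfunc: copy from wherever the optval buffer lives,
     * so the BPF program never has to check BPF_SOCKOPT_FLAG_OPTVAL_USER
     * itself.
     */
    __bpf_kfunc int bpf_sockopt_copy_from_optval(struct bpf_sockopt_kern *ctx,
                                                 void *dst, u32 len)
    {
            if (ctx->optlen < 0 || len > ctx->optlen)
                    return -EINVAL;
            if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
                    /* sleepable context, so faulting is fine */
                    return copy_from_user(dst, ctx->user_optval, len) ?
                           -EFAULT : 0;
            memcpy(dst, ctx->optval, len);
            return 0;
    }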


* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
                     ` (2 preceding siblings ...)
  2023-07-24 18:36   ` Stanislav Fomichev
@ 2023-08-02 19:25   ` Martin KaFai Lau
  3 siblings, 0 replies; 20+ messages in thread
From: Martin KaFai Lau @ 2023-08-02 19:25 UTC (permalink / raw)
  To: kuifeng; +Cc: sinquersw, bpf, ast, song, kernel-team, andrii

On 7/21/23 10:22 PM, kuifeng@meta.com wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 739c15906a65..b2f81193f97b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>   	__s32	optname;
>   	__s32	optlen;
>   	__s32	retval;
> +
> +	__bpf_md_ptr(void *, user_optval);
> +	__bpf_md_ptr(void *, user_optval_end);
> +	__u32	flags;

I will follow up on the UAPI discussion in the other existing thread.

> +};
> +
> +enum bpf_sockopt_flags {
> +	/* optval is a pointer to user space memory */
> +	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
>   };
>   
>   struct bpf_pidns_info {
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 5b2741aa0d9b..b268bbfa6c53 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -38,15 +38,26 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>   	const struct bpf_prog_array *array;
>   	struct bpf_run_ctx *old_run_ctx;
>   	struct bpf_cg_run_ctx run_ctx;
> +	bool do_sleepable;
>   	u32 func_ret;
>   
> +	do_sleepable =
> +		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
> +
>   	run_ctx.retval = retval;
>   	migrate_disable();
> -	rcu_read_lock();
> +	if (do_sleepable) {
> +		might_fault();
> +		rcu_read_lock_trace();
> +	} else
> +		rcu_read_lock();
>   	array = rcu_dereference(cgrp->effective[atype]);

array is now under rcu_read_lock_trace for the "do_sleepable" case.

The array free side should also be changed to wait for the tasks_trace gp.
Please check whether any bpf_prog_array_free() usages in cgroup.c should be
replaced with bpf_prog_array_free_sleepable().
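
Concretely, the suggested change at the array-update sites would be along
these lines (sketch only; the exact call sites in cgroup.c need auditing):

    /* Before: waits only for a regular RCU grace period. */
    bpf_prog_array_free(old_array);

    /* After: also waits for the tasks-trace RCU grace period, so a
     * sleepable prog still walking the array under rcu_read_lock_trace()
     * cannot observe freed memory.
     */
    bpf_prog_array_free_sleepable(old_array);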

Another high-level comment: other cgroup hooks may get sleepable support in
the future, in particular SEC("lsm_cgroup"), considering that other lsm hooks
in SEC("lsm.s") already have sleepable support. Just bringing this up for
awareness; this set can focus on get/setsockopt for now.

>   	item = &array->items[0];
>   	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
>   	while ((prog = READ_ONCE(item->prog))) {
> +		if (do_sleepable && !prog->aux->sleepable)
> +			rcu_read_lock();
> +
>   		run_ctx.prog_item = item;
>   		func_ret = run_prog(prog, ctx);
>   		if (ret_flags) {
> @@ -56,13 +67,43 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>   		if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
>   			run_ctx.retval = -EPERM;
>   		item++;
> +
> +		if (do_sleepable && !prog->aux->sleepable)
> +			rcu_read_unlock();
>   	}
>   	bpf_reset_run_ctx(old_run_ctx);
> -	rcu_read_unlock();
> +	if (do_sleepable)
> +		rcu_read_unlock_trace();
> +	else
> +		rcu_read_unlock();
>   	migrate_enable();
>   	return run_ctx.retval;
>   }
>   
> +static __always_inline bool
> +has_only_sleepable_prog_cg(const struct cgroup_bpf *cgrp,
> +			 enum cgroup_bpf_attach_type atype)
> +{
> +	const struct bpf_prog_array_item *item;
> +	const struct bpf_prog *prog;
> +	int cnt = 0;
> +	bool ret = true;
> +
> +	rcu_read_lock();
> +	item = &rcu_dereference(cgrp->effective[atype])->items[0];
> +	while (ret && (prog = READ_ONCE(item->prog))) {
> +		if (!prog->aux->sleepable)
> +			ret = false;
> +		item++;
> +		cnt++;
> +	}
> +	rcu_read_unlock();
> +	if (cnt == 0)
> +		ret = false;
> +
> +	return ret;
> +}
> +
>   unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
>   				       const struct bpf_insn *insn)
>   {
> @@ -1773,7 +1814,8 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
>   static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
>   			     struct bpf_sockopt_buf *buf)
>   {
> -	if (ctx->optval == buf->data)
> +	if (ctx->optval == buf->data ||
> +	    ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
>   		return;
>   	kfree(ctx->optval);
>   }
> @@ -1781,7 +1823,8 @@ static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
>   static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
>   				  struct bpf_sockopt_buf *buf)
>   {
> -	return ctx->optval != buf->data;
> +	return ctx->optval != buf->data &&
> +		!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER);
>   }
>   
>   int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
> @@ -1796,21 +1839,31 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
>   		.optname = *optname,
>   	};
>   	int ret, max_optlen;
> +	bool alloc_mem;
> +
> +	alloc_mem = !has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_SETSOCKOPT);

hmm... I am not sure this would work. The cgroup->effective[atype] could have
changed after this has_only_sleepable_prog_cg() check; for example, a
non-sleepable prog could be attached after the check. The later
bpf_prog_run_array_cg() may then end up with an incorrect ctx.

A quick thought: should this alloc decision be postponed to
bpf_prog_run_array_cg()? It may be better to have a different
bpf_prog_run_array_cg() for set/getsockopt here; not sure.
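
Spelling the race out (illustrative timeline):

    setsockopt path                      attach path
    ---------------                      -----------
    has_only_sleepable_prog_cg()
      -> true, so no kernel copy of
         optval is allocated
                                         non-sleepable prog attached;
                                         cgrp->effective[atype] replaced
    bpf_prog_run_array_cg() runs the
    new non-sleepable prog against a
    ctx that only holds user-space
    pointers it cannot safely use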

> +	if (!alloc_mem) {
> +		max_optlen = *optlen;
> +		ctx.optlen = *optlen;
> +		ctx.user_optval = optval;
> +		ctx.user_optval_end = optval + *optlen;
> +		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
> +	} else {
> +		/* Allocate a bit more than the initial user buffer for
> +		 * BPF program. The canonical use case is overriding
> +		 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
> +		 */
> +		max_optlen = max_t(int, 16, *optlen);
> +		max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
> +		if (max_optlen < 0)
> +			return max_optlen;
>   
> -	/* Allocate a bit more than the initial user buffer for
> -	 * BPF program. The canonical use case is overriding
> -	 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
> -	 */
> -	max_optlen = max_t(int, 16, *optlen);
> -	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
> -	if (max_optlen < 0)
> -		return max_optlen;
> -
> -	ctx.optlen = *optlen;
> +		ctx.optlen = *optlen;
>   
> -	if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
> -		ret = -EFAULT;
> -		goto out;
> +		if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
>   	}



* Re: [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user().
  2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
                     ` (3 preceding siblings ...)
  2023-07-22 20:26   ` kernel test robot
@ 2023-08-02 19:59   ` Martin KaFai Lau
  4 siblings, 0 replies; 20+ messages in thread
From: Martin KaFai Lau @ 2023-08-02 19:59 UTC (permalink / raw)
  To: kuifeng; +Cc: sinquersw, bpf, ast, song, kernel-team, andrii

On 7/21/23 10:22 PM, kuifeng@meta.com wrote:
> From: Kui-Feng Lee <kuifeng@meta.com>
> 
> Provide bpf_copy_from_user() and bpf_copy_to_user() to the BPF programs
> attached to cgroup/{set,get}sockopt. bpf_copy_to_user() is a new kfunc to

Allowing bpf_copy_to_user() in setsockopt would then change the "__user
*optval" buffer. I don't think userspace expects any change to the optval it
passed to setsockopt.
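
Seen from the caller's side, optval is even declared const in the userspace
API, so a write-back would be a real surprise (plain userspace sketch):

    /* int setsockopt(int fd, int level, int optname,
     *                const void *optval, socklen_t optlen);
     */
    const char cc[] = "cubic";

    setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, sizeof(cc));
    /* Callers assume cc is untouched here; a BPF prog calling
     * bpf_copy_to_user() on the setsockopt path would silently
     * rewrite it.
     */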


* Re: [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.
  2023-08-01 18:08           ` Stanislav Fomichev
@ 2023-08-02 22:28             ` Martin KaFai Lau
  0 siblings, 0 replies; 20+ messages in thread
From: Martin KaFai Lau @ 2023-08-02 22:28 UTC (permalink / raw)
  To: kuifeng
  Cc: bpf, ast, song, kernel-team, andrii, Stanislav Fomichev, Kui-Feng Lee

On 8/1/23 11:08 AM, Stanislav Fomichev wrote:
> On Tue, Aug 1, 2023 at 10:31 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 7/31/23 16:35, Stanislav Fomichev wrote:
>>> On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>>>
>>>> Sorry for the late reply! I just got back from a vacation.
>>>>
>>>>
>>>> On 7/24/23 11:36, Stanislav Fomichev wrote:
>>>>> On 07/21, kuifeng@meta.com wrote:
>>>>>> From: Kui-Feng Lee <kuifeng@meta.com>
>>>>>>
>>>>>> Enable sleepable cgroup/{get,set}sockopt hooks.
>>>>>>
>>>>>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> >>>>>> receive a pointer to the optval in user space instead of a kernel
>>>>>> copy. ctx->user_optval and ctx->user_optval_end are the pointers to the
>>>>>> begin and end of the user space buffer if receiving a user space
>>>>>> buffer. ctx->optval and ctx->optval_end will be a kernel copy if receiving
>>>>>> a kernel space buffer.
>>>>>>
>>>>>> A program receives a user space buffer if ctx->flags &
>>>>>> BPF_SOCKOPT_FLAG_OPTVAL_USER is true, otherwise it receives a kernel space
>>>>>> buffer.  The BPF programs should not read/write from/to a user space buffer
> >>>>>> directly.  They should access the buffer through bpf_copy_from_user() and
>>>>>> bpf_copy_to_user() provided in the following patches.
>>>>>>
>>>>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>>>>> ---
>>>>>>     include/linux/filter.h         |   3 +
>>>>>>     include/uapi/linux/bpf.h       |   9 ++
>>>>>>     kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>>>>>>     kernel/bpf/verifier.c          |   7 +-
>>>>>>     tools/include/uapi/linux/bpf.h |   9 ++
>>>>>>     tools/lib/bpf/libbpf.c         |   2 +
>>>>>>     6 files changed, 176 insertions(+), 43 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>>>> index f69114083ec7..301dd1ba0de1 100644
>>>>>> --- a/include/linux/filter.h
>>>>>> +++ b/include/linux/filter.h
>>>>>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>>>>>>        s32             level;
>>>>>>        s32             optname;
>>>>>>        s32             optlen;
>>>>>> +    u32             flags;
>>>>>> +    u8              *user_optval;
>>>>>> +    u8              *user_optval_end;
>>>>>>        /* for retval in struct bpf_cg_run_ctx */
>>>>>>        struct task_struct *current_task;
>>>>>>        /* Temporary "register" for indirect stores to ppos. */
>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>> index 739c15906a65..b2f81193f97b 100644
>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>>>>>>        __s32   optname;
>>>>>>        __s32   optlen;
>>>>>>        __s32   retval;
>>>>>> +
>>>>>> +    __bpf_md_ptr(void *, user_optval);
>>>>>> +    __bpf_md_ptr(void *, user_optval_end);
>>>>>
>>>>> Can we re-purpose existing optval/optval_end pointers
>>>>> for the sleepable programs? IOW, when the prog is sleepable,
>>>>> pass user pointers via optval/optval_end and require the programs
>>>>> to do copy_to/from on this buffer (even if the backing pointer might be
>>>>> in kernel memory - we can handle that in the kfuncs?).
>>>>>
>>>>> The fact that the program now needs to look at the flag
>>>>> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
>>>>> use makes the handling even more complicated; and we already have a
>>>>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
>>>>>
>>>>> Also, regarding sleepable and non-sleepable co-existence: do we really need
>>>>> that? Can we say that all the programs have to be sleepable
>>>>> or non-sleepable? Mixing them complicates the sharing of that buffer.
>>>>
>>>> I considered this approach as well. This is an open question for me.
>>>> If we go this way, it means we cannot attach a BPF program of one type
>>>> if any program of the other type has been installed.
>>>
>>> If we pass two pointers (kernel copy buffer + real user mem) to the
>>> sleepable program, we'll make it even more complicated by inheriting
>>> all existing warts of the non-sleepable version :-(
>>> IOW, feels like we should try to see if we can have some
>>> copy_to/from_user kfuncs in the sleepable version that transparently
>>> support either kernel or user memory (and prohibit direct access to
>>> user_optval in the sleepable version).

From looking at the patch 5 selftest, I also think exposing user_optval_* and
flags is not ideal. For example (correct me if I am wrong), in patch 3, dynptr
is only used for setsockopt to alloc. Intuitively, when developing a bpf prog,
I would expect to write a new sockopt with bpf_dynptr_write() and be done.
However, it still needs an "install" step (by calling
bpf_sockopt_install_optval). I think the "install" part is leaking too many
internal details.

Besides, adding both the new 'ctx->user_optval + len > ctx->user_optval_end'
pattern and the dynptr usage pattern together is counterproductive, considering
dynptr exists to avoid that length comparison. Saving an unnecessary
"copy_from_user(ctx.optval, optval, ...)" is more important than being able to
read directly from ctx->user_optval. A bpf prog is usually only interested in
a few optnames and returns directly, without even looking at the optval, for
the optnames it does not care about. The current
__cgroup_bpf_run_filter_{get,set}sockopt always does a
"copy_from_user(ctx.optval, optval, ...)".

>>> And then, if we have one non-sleepable program in the chain, we can
>>> fallback everything to the kernel buffer (maybe).
>>> This way seems like we can support both versions in the same chain and
>>> have a more sane api?
>>
>> Basically, you are saying to move cp_from_optval() and cp_to_optval() in
>> the testcase to kfuncs. This can cause unnecessary copies. We can add
>> an API to make a dynptr from the ctx to avoid unnecessary copies.
> 
> Yeah, handle this transparently in the kfunc or via dynptr, whatever works.
> I'm not too worried about the extra copy tbh, this is a slow path; I'm
> more concerned about improving the bpf program / user experience.

+1. It would be great if it could all be done with two kfuncs
(dynptr_{write,read}). I would disallow the sleepable prog from using optval
if that makes things simpler. If it goes with dynptr, bpf_dynptr_slice() needs
to be supported as well, which I think should be doable after a quick thought.
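
That end state could look something like this from the prog side
(bpf_dynptr_from_sockopt is a name invented here for illustration;
bpf_dynptr_read() is the existing dynptr helper):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    /* hypothetical kfunc: one dynptr over the optval buffer, whether
     * it is backed by user or kernel memory
     */
    extern int bpf_dynptr_from_sockopt(struct bpf_sockopt *ctx,
                                       struct bpf_dynptr *ptr) __ksym;

    SEC("cgroup/getsockopt.s")
    int getsockopt_sleepable(struct bpf_sockopt *ctx)
    {
            struct bpf_dynptr optval;
            char buf[4];

            if (bpf_dynptr_from_sockopt(ctx, &optval))
                    return 1;
            if (bpf_dynptr_read(buf, sizeof(buf), &optval, 0, 0))
                    return 1;
            /* ... inspect buf, bpf_dynptr_write() a new value ... */
            return 1;
    }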

The test needs to include a cgrp->effective array that has interleaved sleepable 
and non-sleepable bpf progs.


end of thread, other threads:[~2023-08-02 22:28 UTC | newest]

Thread overview: 20+ messages
2023-07-22  5:22 [RFC bpf-next 0/5] Sleepable BPF programs on cgroup {get,set}sockopt kuifeng
2023-07-22  5:22 ` [RFC bpf-next 1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt kuifeng
2023-07-22 16:09   ` kernel test robot
2023-07-22 18:22   ` kernel test robot
2023-07-24 18:36   ` Stanislav Fomichev
2023-07-31 22:02     ` Kui-Feng Lee
2023-07-31 23:35       ` Stanislav Fomichev
2023-08-01 17:31         ` Kui-Feng Lee
2023-08-01 18:08           ` Stanislav Fomichev
2023-08-02 22:28             ` Martin KaFai Lau
2023-08-02 19:25   ` Martin KaFai Lau
2023-07-22  5:22 ` [RFC bpf-next 2/5] bpf: Provide bpf_copy_from_user() and bpf_copy_to_user() kuifeng
2023-07-22  6:39   ` kernel test robot
2023-07-22  6:49   ` kernel test robot
2023-07-22  7:41   ` kernel test robot
2023-07-22 20:26   ` kernel test robot
2023-08-02 19:59   ` Martin KaFai Lau
2023-07-22  5:22 ` [RFC bpf-next 3/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT kuifeng
2023-07-22  5:22 ` [RFC bpf-next 4/5] bpf: Prevent BPF programs from access the buffer pointed by user_optval kuifeng
2023-07-22  5:22 ` [RFC bpf-next 5/5] bpf: Add test cases for sleepable BPF programs of the CGROUP_SOCKOPT type kuifeng
