* [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks
@ 2019-06-04 21:35 Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 1/7] bpf: implement " Stanislav Fomichev
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

This series implements two new per-cgroup hooks: getsockopt and
setsockopt, along with a new sockopt program type. The idea is pretty
similar to the recently introduced cgroup sysctl hooks, but the
implementation is simpler (no need to convert to/from strings).

What this can be applied to:
* move the business logic of which tos/priority/etc values containers
  may set into BPF (either pass or reject)
* handle existing options (or introduce new ones) differently by
  propagating some information in cgroup/socket local storage

Compared to a simple syscall/{g,s}etsockopt tracepoint, these hooks
are context aware: they can access the underlying socket and use
cgroup and socket local storage.
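
As a rough illustration, a cgroup/setsockopt program built on top of
this series could look like the sketch below (hypothetical example,
not part of the series; it assumes the struct bpf_sockopt context
from patch 1 and the section names from patch 3):

    /* Sketch: allow containers to set IP_TOS values up to 0x80,
     * reject anything higher, and pass all other options through
     * to the kernel unmodified.
     */
    #include <linux/bpf.h>
    #include <netinet/in.h>
    #include "bpf_helpers.h"

    SEC("cgroup/setsockopt")
    int enforce_tos(struct bpf_sockopt *ctx)
    {
            __u8 *optval = (__u8 *)(long)ctx->optval;
            __u8 *optval_end = (__u8 *)(long)ctx->optval_end;

            if (ctx->level != SOL_IP || ctx->optname != IP_TOS)
                    return 1; /* not ours, let the kernel handle it */

            if (optval + 1 > optval_end)
                    return 0; /* bounds check failed, reject (EPERM) */

            return optval[0] <= 0x80; /* 1 == allow, 0 == EPERM */
    }

    char _license[] SEC("license") = "GPL";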

Stanislav Fomichev (7):
  bpf: implement getsockopt and setsockopt hooks
  bpf: sync bpf.h to tools/
  libbpf: support sockopt hooks
  selftests/bpf: test sockopt section name
  selftests/bpf: add sockopt test
  bpf: add sockopt documentation
  bpftool: support cgroup sockopt

 Documentation/bpf/index.rst                   |   1 +
 Documentation/bpf/prog_cgroup_sockopt.rst     |  42 +
 include/linux/bpf-cgroup.h                    |  29 +
 include/linux/bpf.h                           |   2 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/filter.h                        |  19 +
 include/uapi/linux/bpf.h                      |  17 +-
 kernel/bpf/cgroup.c                           | 288 +++++++
 kernel/bpf/syscall.c                          |  19 +
 kernel/bpf/verifier.c                         |  12 +
 net/core/filter.c                             |   4 +-
 net/socket.c                                  |  18 +
 .../bpftool/Documentation/bpftool-cgroup.rst  |   7 +-
 .../bpftool/Documentation/bpftool-prog.rst    |   2 +-
 tools/bpf/bpftool/bash-completion/bpftool     |   8 +-
 tools/bpf/bpftool/cgroup.c                    |   5 +-
 tools/bpf/bpftool/main.h                      |   1 +
 tools/bpf/bpftool/prog.c                      |   3 +-
 tools/include/uapi/linux/bpf.h                |  17 +-
 tools/lib/bpf/libbpf.c                        |   5 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   3 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
 .../selftests/bpf/test_section_names.c        |  10 +
 tools/testing/selftests/bpf/test_sockopt.c    | 789 ++++++++++++++++++
 26 files changed, 1293 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/bpf/prog_cgroup_sockopt.rst
 create mode 100644 tools/testing/selftests/bpf/test_sockopt.c

-- 
2.22.0.rc1.311.g5d7573a151-goog


* [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-05 18:47   ` Martin Lau
  2019-06-05 19:32   ` Andrii Nakryiko
  2019-06-04 21:35 ` [PATCH bpf-next 2/7] bpf: sync bpf.h to tools/ Stanislav Fomichev
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.

BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
Both of them reuse the existing PTR_TO_PACKET{,_END} infrastructure.

The buffer memory is pre-allocated (because I don't think there is
a precedent for working with __user memory from bpf). This might be
slow to do for each {s,g}etsockopt call, which is why I've added
__cgroup_bpf_has_prog_array that exits early if there is nothing
attached to a cgroup. Note, however, that there is a race between
__cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where the cgroup
program layout might have changed; this should not be a problem
because in general there is a race between multiple calls to
{s,g}etsockopt and a user adding/removing bpf progs from a cgroup.

By default, the kernel code path is executed after the hook (to let
BPF handle only a subset of the options). There is a new
bpf_sockopt_handled helper that returns control to userspace
instead (bypassing the kernel handling).

The return code is either 1 (success) or 0 (EPERM).
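
For illustration, a getsockopt program that fully handles a (made-up)
option and bypasses the kernel could look roughly like this (sketch
only; MY_LEVEL and MY_OPT are hypothetical placeholders, not real
option values):

    SEC("cgroup/getsockopt")
    int handle_my_opt(struct bpf_sockopt *ctx)
    {
            __u8 *optval = (__u8 *)(long)ctx->optval;
            __u8 *optval_end = (__u8 *)(long)ctx->optval_end;

            /* MY_LEVEL/MY_OPT: made-up option, for illustration */
            if (ctx->level != MY_LEVEL || ctx->optname != MY_OPT)
                    return 1;         /* fall through to the kernel */

            if (optval + 1 > optval_end)
                    return 0;         /* EPERM */

            optval[0] = 42;           /* fill in the reply */
            ctx->optlen = 1;          /* shrink optlen accordingly */
            bpf_sockopt_handled(ctx); /* skip the kernel handler */
            return 1;
    }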

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf-cgroup.h |  29 ++++
 include/linux/bpf.h        |   2 +
 include/linux/bpf_types.h  |   1 +
 include/linux/filter.h     |  19 +++
 include/uapi/linux/bpf.h   |  17 ++-
 kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       |  19 +++
 kernel/bpf/verifier.c      |  12 ++
 net/core/filter.c          |   4 +-
 net/socket.c               |  18 +++
 10 files changed, 406 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index b631ee75762d..406f1ba82531 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
 				   loff_t *ppos, void **new_buf,
 				   enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
+				       int optname, char __user *optval,
+				       unsigned int optlen);
+int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
+				       int optname, char __user *optval,
+				       int __user *optlen);
+
 static inline enum bpf_cgroup_storage_type cgroup_storage_type(
 	struct bpf_map *map)
 {
@@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled)						       \
+		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
+							   optname, optval,    \
+							   optlen);	       \
+	__ret;								       \
+})
+
+#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled)						       \
+		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
+							   optname, optval,    \
+							   optlen);	       \
+	__ret;								       \
+})
+
 int cgroup_bpf_prog_attach(const union bpf_attr *attr,
 			   enum bpf_prog_type ptype, struct bpf_prog *prog);
 int cgroup_bpf_prog_detach(const union bpf_attr *attr,
@@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
 
 #define for_each_cgroup_storage_type(stype) for (; false; )
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e5a309e6a400..fb4e6ef5a971 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
 extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
+extern const struct bpf_func_proto bpf_sk_fullsock_proto;
+extern const struct bpf_func_proto bpf_tcp_sock_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5a9975678d6f..eec5aeeeaf92 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
 #endif
 #ifdef CONFIG_BPF_LIRC_MODE2
 BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 43b45d6db36d..7a07fd2e14d3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
 	u64 tmp_reg;
 };
 
+struct bpf_sockopt_kern {
+	struct sock	*sk;
+	s32		level;
+	s32		optname;
+	u32		optlen;
+	u8		*optval;
+	u8		*optval_end;
+
+	/* If true, BPF program has consumed the sockopt request.
+	 * Control is returned to the userspace (i.e. kernel doesn't
+	 * handle this option).
+	 */
+	bool		handled;
+
+	/* Small on-stack optval buffer to avoid small allocations.
+	 */
+	u8 buf[64];
+};
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7c6aef253173..b6c3891241ef 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 };
 
 enum bpf_attach_type {
@@ -192,6 +193,8 @@ enum bpf_attach_type {
 	BPF_LIRC_MODE2,
 	BPF_FLOW_DISSECTOR,
 	BPF_CGROUP_SYSCTL,
+	BPF_CGROUP_GETSOCKOPT,
+	BPF_CGROUP_SETSOCKOPT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2815,7 +2818,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(sockopt_handled),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3533,4 +3537,15 @@ struct bpf_sysctl {
 				 */
 };
 
+struct bpf_sockopt {
+	__bpf_md_ptr(struct bpf_sock *, sk);
+
+	__s32	level;
+	__s32	optname;
+
+	__u32	optlen;
+	__u32	optval;
+	__u32	optval_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 1b65ab0df457..4ec99ea97023 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -18,6 +18,7 @@
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
 #include <net/sock.h>
+#include <net/bpf_sk_storage.h>
 
 DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
@@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
 
+static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
+					enum bpf_attach_type attach_type)
+{
+	struct bpf_prog_array *prog_array;
+	int nr;
+
+	rcu_read_lock();
+	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
+	nr = bpf_prog_array_length(prog_array);
+	rcu_read_unlock();
+
+	return nr > 0;
+}
+
+static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
+{
+	if (unlikely(max_optlen > PAGE_SIZE))
+		return -EINVAL;
+
+	if (likely(max_optlen <= sizeof(ctx->buf))) {
+		ctx->optval = ctx->buf;
+	} else {
+		ctx->optval = kzalloc(max_optlen, GFP_USER);
+		if (!ctx->optval)
+			return -ENOMEM;
+	}
+
+	ctx->optval_end = ctx->optval + max_optlen;
+	ctx->optlen = max_optlen;
+
+	return 0;
+}
+
+static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
+{
+	if (unlikely(ctx->optval != ctx->buf))
+		kfree(ctx->optval);
+}
+
+int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
+				       int optname, char __user *optval,
+				       unsigned int optlen)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = level,
+		.optname = optname,
+	};
+	int ret;
+
+	/* Opportunistic check to see whether we have any BPF program
+	 * attached to the hook so we don't waste time allocating
+	 * memory and locking the socket.
+	 */
+	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
+		return 0;
+
+	ret = sockopt_alloc_buf(&ctx, optlen);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(ctx.optval, optval, optlen) != 0) {
+		sockopt_free_buf(&ctx);
+		return -EFAULT;
+	}
+
+	lock_sock(sk);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	release_sock(sk);
+
+	sockopt_free_buf(&ctx);
+
+	if (!ret)
+		return -EPERM;
+
+	return ctx.handled ? 1 : 0;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
+
+int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
+				       int optname, char __user *optval,
+				       int __user *optlen)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_sockopt_kern ctx = {
+		.sk = sk,
+		.level = level,
+		.optname = optname,
+	};
+	int max_optlen;
+	char buf[64];
+	int ret;
+
+	/* Opportunistic check to see whether we have any BPF program
+	 * attached to the hook so we don't waste time allocating
+	 * memory and locking the socket.
+	 */
+	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
+		return 0;
+
+	if (get_user(max_optlen, optlen))
+		return -EFAULT;
+
+	ret = sockopt_alloc_buf(&ctx, max_optlen);
+	if (ret)
+		return ret;
+
+	lock_sock(sk);
+	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
+				 &ctx, BPF_PROG_RUN);
+	release_sock(sk);
+
+	if (ctx.optlen > max_optlen) {
+		sockopt_free_buf(&ctx);
+		return -EFAULT;
+	}
+
+	if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
+		sockopt_free_buf(&ctx);
+		return -EFAULT;
+	}
+
+	sockopt_free_buf(&ctx);
+
+	if (put_user(ctx.optlen, optlen))
+		return -EFAULT;
+
+	if (!ret)
+		return -EPERM;
+
+	return ctx.handled ? 1 : 0;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
+
 static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
 			      size_t *lenp)
 {
@@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
 
 const struct bpf_prog_ops cg_sysctl_prog_ops = {
 };
+
+BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
+{
+	ctx->handled = true;
+	return 1;
+}
+
+static const struct bpf_func_proto bpf_sockopt_handled_proto = {
+	.func		= bpf_sockopt_handled,
+	.gpl_only	= false,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.ret_type	= RET_INTEGER,
+};
+
+static const struct bpf_func_proto *
+cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sockopt_handled:
+		return &bpf_sockopt_handled_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
+	case BPF_FUNC_sk_storage_get:
+		return &bpf_sk_storage_get_proto;
+	case BPF_FUNC_sk_storage_delete:
+		return &bpf_sk_storage_delete_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
+#endif
+	default:
+		return cgroup_base_func_proto(func_id, prog);
+	}
+}
+
+static bool cg_sockopt_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
+				       struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct bpf_sockopt))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sockopt, optlen):
+			if (size != size_default)
+				return false;
+			return prog->expected_attach_type ==
+				BPF_CGROUP_GETSOCKOPT;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case offsetof(struct bpf_sockopt, sk):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
+		break;
+	case bpf_ctx_range(struct bpf_sockopt, optval):
+		if (size != size_default)
+			return false;
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct bpf_sockopt, optval_end):
+		if (size != size_default)
+			return false;
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	default:
+		if (size != size_default)
+			return false;
+		break;
+	}
+	return true;
+}
+
+static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
+					 const struct bpf_insn *si,
+					 struct bpf_insn *insn_buf,
+					 struct bpf_prog *prog,
+					 u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sockopt, sk):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sockopt_kern, sk));
+		break;
+	case offsetof(struct bpf_sockopt, level):
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_sockopt_kern,
+						     level, 4, target_size));
+		break;
+	case offsetof(struct bpf_sockopt, optname):
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_sockopt_kern,
+						     optname, 4, target_size));
+		break;
+	case offsetof(struct bpf_sockopt, optlen):
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
+					      bpf_target_off(struct bpf_sockopt_kern,
+							     optlen, 4, target_size));
+		else
+			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+					      bpf_target_off(struct bpf_sockopt_kern,
+							     optlen, 4, target_size));
+		break;
+	case offsetof(struct bpf_sockopt, optval):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sockopt_kern, optval));
+		break;
+	case offsetof(struct bpf_sockopt, optval_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sockopt_kern, optval_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
+				   bool direct_write,
+				   const struct bpf_prog *prog)
+{
+	/* Nothing to do for sockopt argument. The data is kzalloc'ed.
+	 */
+	return 0;
+}
+
+const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
+	.get_func_proto		= cg_sockopt_func_proto,
+	.is_valid_access	= cg_sockopt_is_valid_access,
+	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
+	.gen_prologue		= cg_sockopt_get_prologue,
+};
+
+const struct bpf_prog_ops cg_sockopt_prog_ops = {
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4c53cbd3329d..4ad2b5f1905f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		switch (expected_attach_type) {
+		case BPF_CGROUP_SETSOCKOPT:
+		case BPF_CGROUP_GETSOCKOPT:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	default:
 		return 0;
 	}
@@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	switch (prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
 		return prog->enforce_expected_attach_type &&
@@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_SYSCTL:
 		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
 		break;
+	case BPF_CGROUP_GETSOCKOPT:
+	case BPF_CGROUP_SETSOCKOPT:
+		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_SYSCTL:
 		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
 		break;
+	case BPF_CGROUP_GETSOCKOPT:
+	case BPF_CGROUP_SETSOCKOPT:
+		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_CGROUP_SOCK_OPS:
 	case BPF_CGROUP_DEVICE:
 	case BPF_CGROUP_SYSCTL:
+	case BPF_CGROUP_GETSOCKOPT:
+	case BPF_CGROUP_SETSOCKOPT:
 		break;
 	case BPF_LIRC_MODE2:
 		return lirc_prog_query(attr, uattr);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5c2cb5bd84ce..b91fde10e721 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 
 		env->seen_direct_write = true;
 		return true;
+
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		if (t == BPF_WRITE) {
+			if (env->prog->expected_attach_type ==
+			    BPF_CGROUP_GETSOCKOPT) {
+				env->seen_direct_write = true;
+				return true;
+			}
+			return false;
+		}
+		return true;
+
 	default:
 		return false;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 55bfc941d17a..4652c0a005ca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
 	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
 }
 
-static const struct bpf_func_proto bpf_sk_fullsock_proto = {
+const struct bpf_func_proto bpf_sk_fullsock_proto = {
 	.func		= bpf_sk_fullsock,
 	.gpl_only	= false,
 	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
@@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
 	return (unsigned long)NULL;
 }
 
-static const struct bpf_func_proto bpf_tcp_sock_proto = {
+const struct bpf_func_proto bpf_tcp_sock_proto = {
 	.func		= bpf_tcp_sock,
 	.gpl_only	= false,
 	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
diff --git a/net/socket.c b/net/socket.c
index 72372dc5dd70..e8654f1f70e6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
 		if (err)
 			goto out_put;
 
+		err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
+						     optval, optlen);
+		if (err < 0) {
+			goto out_put;
+		} else if (err > 0) {
+			err = 0;
+			goto out_put;
+		}
+
 		if (level == SOL_SOCKET)
 			err =
 			    sock_setsockopt(sock, level, optname, optval,
@@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
 		if (err)
 			goto out_put;
 
+		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
+						     optval, optlen);
+		if (err < 0) {
+			goto out_put;
+		} else if (err > 0) {
+			err = 0;
+			goto out_put;
+		}
+
 		if (level == SOL_SOCKET)
 			err =
 			    sock_getsockopt(sock, level, optname, optval,
-- 
2.22.0.rc1.311.g5d7573a151-goog



* [PATCH bpf-next 2/7] bpf: sync bpf.h to tools/
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 1/7] bpf: implement " Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 3/7] libbpf: support sockopt hooks Stanislav Fomichev
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Export the new prog type and hook points to libbpf.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/include/uapi/linux/bpf.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7c6aef253173..b6c3891241ef 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 };
 
 enum bpf_attach_type {
@@ -192,6 +193,8 @@ enum bpf_attach_type {
 	BPF_LIRC_MODE2,
 	BPF_FLOW_DISSECTOR,
 	BPF_CGROUP_SYSCTL,
+	BPF_CGROUP_GETSOCKOPT,
+	BPF_CGROUP_SETSOCKOPT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2815,7 +2818,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(sockopt_handled),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3533,4 +3537,15 @@ struct bpf_sysctl {
 				 */
 };
 
+struct bpf_sockopt {
+	__bpf_md_ptr(struct bpf_sock *, sk);
+
+	__s32	level;
+	__s32	optname;
+
+	__u32	optlen;
+	__u32	optval;
+	__u32	optval_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.22.0.rc1.311.g5d7573a151-goog



* [PATCH bpf-next 3/7] libbpf: support sockopt hooks
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 1/7] bpf: implement " Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 2/7] bpf: sync bpf.h to tools/ Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 4/7] selftests/bpf: test sockopt section name Stanislav Fomichev
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Make libbpf aware of the new sockopt hooks so that it can derive the
prog type and attach point from the section names.
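
With this change, loading and attaching such a program can be done
roughly as follows (userspace sketch; includes and error handling
elided, "sockopt_prog.o" and the cgroup path are placeholders):

    struct bpf_object *obj = bpf_object__open("sockopt_prog.o");
    struct bpf_program *prog;
    int cg_fd, prog_fd;

    /* prog type and expected attach type are derived from the
     * "cgroup/setsockopt" section name
     */
    bpf_object__load(obj);
    prog = bpf_object__find_program_by_title(obj, "cgroup/setsockopt");
    prog_fd = bpf_program__fd(prog);

    cg_fd = open("/mnt/cgroup2/test", O_RDONLY);
    bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_SETSOCKOPT, 0);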

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/lib/bpf/libbpf.c        | 5 +++++
 tools/lib/bpf/libbpf_probes.c | 1 +
 2 files changed, 6 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ba89d9727137..cd3c692a8b5d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2243,6 +2243,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
 	case BPF_PROG_TYPE_PERF_EVENT:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 		return false;
 	case BPF_PROG_TYPE_KPROBE:
 	default:
@@ -3196,6 +3197,10 @@ static const struct {
 						BPF_CGROUP_UDP6_SENDMSG),
 	BPF_EAPROG_SEC("cgroup/sysctl",		BPF_PROG_TYPE_CGROUP_SYSCTL,
 						BPF_CGROUP_SYSCTL),
+	BPF_EAPROG_SEC("cgroup/getsockopt",	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+						BPF_CGROUP_GETSOCKOPT),
+	BPF_EAPROG_SEC("cgroup/setsockopt",	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+						BPF_CGROUP_SETSOCKOPT),
 };
 
 #undef BPF_PROG_SEC_IMPL
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 5e2aa83f637a..7e21db11dde8 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -101,6 +101,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_SK_REUSEPORT:
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 	default:
 		break;
 	}
-- 
2.22.0.rc1.311.g5d7573a151-goog



* [PATCH bpf-next 4/7] selftests/bpf: test sockopt section name
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2019-06-04 21:35 ` [PATCH bpf-next 3/7] libbpf: support sockopt hooks Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 5/7] selftests/bpf: add sockopt test Stanislav Fomichev
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Add tests that make sure libbpf section name detection works for the
new sockopt hooks.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_section_names.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_section_names.c b/tools/testing/selftests/bpf/test_section_names.c
index bebd4fbca1f4..5f84b3b8c90b 100644
--- a/tools/testing/selftests/bpf/test_section_names.c
+++ b/tools/testing/selftests/bpf/test_section_names.c
@@ -124,6 +124,16 @@ static struct sec_name_test tests[] = {
 		{0, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL},
 		{0, BPF_CGROUP_SYSCTL},
 	},
+	{
+		"cgroup/getsockopt",
+		{0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT},
+		{0, BPF_CGROUP_GETSOCKOPT},
+	},
+	{
+		"cgroup/setsockopt",
+		{0, BPF_PROG_TYPE_CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT},
+		{0, BPF_CGROUP_SETSOCKOPT},
+	},
 };
 
 static int test_prog_type_by_name(const struct sec_name_test *test)
-- 
2.22.0.rc1.311.g5d7573a151-goog



* [PATCH bpf-next 5/7] selftests/bpf: add sockopt test
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2019-06-04 21:35 ` [PATCH bpf-next 4/7] selftests/bpf: test sockopt section name Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 6/7] bpf: add sockopt documentation Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 7/7] bpftool: support cgroup sockopt Stanislav Fomichev
  6 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Add sockopt selftests:
* require proper expected_attach_type
* enforce context field read/write access
* test bpf_sockopt_handled handler
* test EPERM
* test limiting optlen from getsockopt
* test out-of-bounds access
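
The raw-insn tests below follow the usual direct packet access
pattern; in restricted C, the bounds check they exercise looks
roughly like this (sketch, not part of the patch):

    static __always_inline int write_one_byte(struct bpf_sockopt *ctx)
    {
            __u8 *optval = (__u8 *)(long)ctx->optval;
            __u8 *optval_end = (__u8 *)(long)ctx->optval_end;

            if (optval + 1 > optval_end)
                    return 0;       /* the verifier requires this check */
            optval[0] = 0xF0;       /* now a safe single-byte access */
            return 1;
    }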

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   3 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/test_sockopt.c | 789 +++++++++++++++++++++
 4 files changed, 794 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockopt.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 7470327edcfe..3fe92601223d 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -39,3 +39,4 @@ libbpf.so.*
 test_hashmap
 test_btf_dump
 xdping
+test_sockopt
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 2b426ae1cdc9..b982393b9181 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -26,7 +26,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
-	test_btf_dump test_cgroup_attach xdping
+	test_btf_dump test_cgroup_attach xdping test_sockopt
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
@@ -101,6 +101,7 @@ $(OUTPUT)/test_netcnt: cgroup_helpers.c
 $(OUTPUT)/test_sock_fields: cgroup_helpers.c
 $(OUTPUT)/test_sysctl: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_attach: cgroup_helpers.c
+$(OUTPUT)/test_sockopt: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e6d243b7cd74..87efde68a7a7 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -225,6 +225,8 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
 static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
+static int (*bpf_sockopt_handled)(void *ctx) =
+	(void *) BPF_FUNC_sockopt_handled;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_sockopt.c b/tools/testing/selftests/bpf/test_sockopt.c
new file mode 100644
index 000000000000..4b5897e6a9ec
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sockopt.c
@@ -0,0 +1,789 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+
+#include <linux/filter.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_rlimit.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#define CG_PATH				"/sockopt"
+
+static char bpf_log_buf[4096];
+static bool verbose;
+
+enum sockopt_test_error {
+	OK = 0,
+	DENY_LOAD,
+	DENY_ATTACH,
+	EPERM_GETSOCKOPT,
+	EFAULT_GETSOCKOPT,
+	EPERM_SETSOCKOPT,
+};
+
+static struct sockopt_test {
+	const char			*descr;
+	const struct bpf_insn		insns[64];
+	enum bpf_attach_type		attach_type;
+	enum bpf_attach_type		expected_attach_type;
+
+	int				level;
+	int				optname;
+
+	const char			set_optval[64];
+	socklen_t			set_optlen;
+
+	const char			get_optval[64];
+	socklen_t			get_optlen;
+	socklen_t			get_optlen_ret;
+
+	enum sockopt_test_error		error;
+} tests[] = {
+
+	/* ==================== getsockopt ====================  */
+
+	{
+		.descr = "getsockopt: no expected_attach_type",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = 0,
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "getsockopt: wrong expected_attach_type",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+		.error = DENY_ATTACH,
+	},
+	{
+		.descr = "getsockopt: bypass bpf hook",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.set_optval = { 1 << 3 },
+		.set_optlen = 1,
+
+		.get_optval = { 1 << 3 },
+		.get_optlen = 1,
+	},
+	{
+		.descr = "getsockopt: return EPERM from bpf hook",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.get_optlen = 1,
+		.error = EPERM_GETSOCKOPT,
+	},
+	{
+		.descr = "getsockopt: no optval bounds check, deny loading",
+		.insns = {
+			/* r6 = ctx->optval */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval)),
+
+			/* ctx->optval[0] = 0x80 */
+			BPF_MOV64_IMM(BPF_REG_0, 0x80),
+			BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_0, 0),
+
+			/* return 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "getsockopt: read ctx->level",
+		.insns = {
+			/* r6 = ctx->level */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, level)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+
+			/* if (ctx->level == 123) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.level = 123,
+
+		.get_optlen = 1,
+	},
+	{
+		.descr = "getsockopt: deny writing to ctx->level",
+		.insns = {
+			/* ctx->level = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, level)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "getsockopt: read ctx->optname",
+		.insns = {
+			/* r6 = ctx->optname */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optname)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+
+			/* if (ctx->optname == 123) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.optname = 123,
+
+		.get_optlen = 1,
+	},
+	{
+		.descr = "getsockopt: deny writing to ctx->optname",
+		.insns = {
+			/* ctx->optname = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optname)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "getsockopt: read ctx->optlen",
+		.insns = {
+			/* r6 = ctx->optlen */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optlen)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+
+			/* if (ctx->optlen == 64) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 64, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.get_optlen = 64,
+	},
+	{
+		.descr = "getsockopt: deny bigger ctx->optlen",
+		.insns = {
+			/* ctx->optlen = 65 */
+			BPF_MOV64_IMM(BPF_REG_0, 65),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optlen)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.get_optlen = 64,
+
+		.error = EFAULT_GETSOCKOPT,
+	},
+	{
+		.descr = "getsockopt: support smaller ctx->optlen",
+		.insns = {
+			/* ctx->optlen = 32 */
+			BPF_MOV64_IMM(BPF_REG_0, 32),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optlen)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.get_optlen = 64,
+		.get_optlen_ret = 32,
+	},
+	{
+		.descr = "getsockopt: deny writing to ctx->optval",
+		.insns = {
+			/* ctx->optval = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optval)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "getsockopt: deny writing to ctx->optval_end",
+		.insns = {
+			/* ctx->optval_end = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optval_end)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+
+	{
+		.descr = "getsockopt: rewrite value",
+		.insns = {
+			/* r6 = ctx->optval */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval)),
+			/* r2 = ctx->optval */
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_6),
+			/* r6 = ctx->optval + 1 */
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 1),
+
+			/* r7 = ctx->optval_end */
+			BPF_LDX_MEM(BPF_W, BPF_REG_7, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval_end)),
+
+			/* if (ctx->optval + 1 <= ctx->optval_end) { */
+			BPF_JMP_REG(BPF_JGT, BPF_REG_6, BPF_REG_7, 1),
+			/* ctx->optval[0] = 0xF0 */
+			BPF_ST_MEM(BPF_B, BPF_REG_2, 0, 0xF0),
+			/* } */
+
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_GETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.get_optval = { 0xF0 },
+		.get_optlen = 1,
+	},
+
+	/* ==================== setsockopt ====================  */
+
+	{
+		.descr = "setsockopt: no expected_attach_type",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = 0,
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: wrong expected_attach_type",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_GETSOCKOPT,
+		.error = DENY_ATTACH,
+	},
+	{
+		.descr = "setsockopt: bypass bpf hook",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.set_optval = { 1 << 3 },
+		.set_optlen = 1,
+
+		.get_optval = { 1 << 3 },
+		.get_optlen = 1,
+	},
+	{
+		.descr = "setsockopt: return EPERM from bpf hook",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.set_optlen = 1,
+		.error = EPERM_SETSOCKOPT,
+	},
+	{
+		.descr = "setsockopt: no optval bounds check, deny loading",
+		.insns = {
+			/* r6 = ctx->optval */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval)),
+
+			/* r0 = ctx->optval[0] */
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+
+			/* return 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: read ctx->level",
+		.insns = {
+			/* r6 = ctx->level */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, level)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+
+			/* if (ctx->level == 123) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.level = 123,
+
+		.set_optlen = 1,
+	},
+	{
+		.descr = "setsockopt: deny writing to ctx->level",
+		.insns = {
+			/* ctx->level = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, level)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: read ctx->optname",
+		.insns = {
+			/* r6 = ctx->optname */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optname)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+
+			/* if (ctx->optname == 123) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 123, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.optname = 123,
+
+		.set_optlen = 1,
+	},
+	{
+		.descr = "setsockopt: deny writing to ctx->optname",
+		.insns = {
+			/* ctx->optname = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optname)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: read ctx->optlen",
+		.insns = {
+			/* r6 = ctx->optlen */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optlen)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+
+			/* if (ctx->optlen == 64) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_6, 64, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.set_optlen = 64,
+	},
+	{
+		.descr = "setsockopt: deny writing to ctx->optlen",
+		.insns = {
+			/* ctx->optlen = 32 */
+			BPF_MOV64_IMM(BPF_REG_0, 32),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optlen)),
+			/* Don't let kernel handle this option. */
+			BPF_EMIT_CALL(BPF_FUNC_sockopt_handled),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.set_optlen = 64,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: deny writing to ctx->optval",
+		.insns = {
+			/* ctx->optval = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optval)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: deny writing to ctx->optval_end",
+		.insns = {
+			/* ctx->optval_end = 1 */
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+				    offsetof(struct bpf_sockopt, optval_end)),
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.error = DENY_LOAD,
+	},
+	{
+		.descr = "setsockopt: allow IP_TOS <= 128",
+		.insns = {
+			/* r6 = ctx->optval */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval)),
+			/* r7 = ctx->optval + 1 */
+			BPF_MOV64_REG(BPF_REG_7, BPF_REG_6),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1),
+
+			/* r8 = ctx->optval_end */
+			BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval_end)),
+
+			/* if (ctx->optval + 1 <= ctx->optval_end) { */
+			BPF_JMP_REG(BPF_JGT, BPF_REG_7, BPF_REG_8, 4),
+
+			/* r9 = ctx->optval[0] */
+			BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_6, 0),
+
+			/* if (ctx->optval[0] < 128) */
+			BPF_JMP_IMM(BPF_JGT, BPF_REG_9, 128, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } */
+
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.set_optval = { 0x80 },
+		.set_optlen = 1,
+		.get_optval = { 0x80 },
+		.get_optlen = 1,
+	},
+	{
+		.descr = "setsockopt: deny IP_TOS > 128",
+		.insns = {
+			/* r6 = ctx->optval */
+			BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval)),
+			/* r7 = ctx->optval + 1 */
+			BPF_MOV64_REG(BPF_REG_7, BPF_REG_6),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1),
+
+			/* r8 = ctx->optval_end */
+			BPF_LDX_MEM(BPF_W, BPF_REG_8, BPF_REG_1,
+				    offsetof(struct bpf_sockopt, optval_end)),
+
+			/* if (ctx->optval + 1 <= ctx->optval_end) { */
+			BPF_JMP_REG(BPF_JGT, BPF_REG_7, BPF_REG_8, 4),
+
+			/* r9 = ctx->optval[0] */
+			BPF_LDX_MEM(BPF_B, BPF_REG_9, BPF_REG_6, 0),
+
+			/* if (ctx->optval[0] < 128) */
+			BPF_JMP_IMM(BPF_JGT, BPF_REG_9, 128, 2),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_JMP_A(1),
+			/* } */
+
+			/* } else { */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			/* } */
+
+			BPF_EXIT_INSN(),
+		},
+		.attach_type = BPF_CGROUP_SETSOCKOPT,
+		.expected_attach_type = BPF_CGROUP_SETSOCKOPT,
+
+		.level = SOL_IP,
+		.optname = IP_TOS,
+
+		.set_optval = { 0x81 },
+		.set_optlen = 1,
+		.get_optval = { 0x00 },
+		.get_optlen = 1,
+
+		.error = EPERM_SETSOCKOPT,
+	},
+};
+
+static int load_prog(const struct bpf_insn *insns,
+		     enum bpf_attach_type expected_attach_type)
+{
+	struct bpf_load_program_attr attr = {
+		.prog_type = BPF_PROG_TYPE_CGROUP_SOCKOPT,
+		.expected_attach_type = expected_attach_type,
+		.insns = insns,
+		.license = "GPL",
+		.log_level = 2,
+	};
+	int fd;
+
+	for (;
+	     insns[attr.insns_cnt].code != (BPF_JMP | BPF_EXIT);
+	     attr.insns_cnt++) {
+	}
+	attr.insns_cnt++;
+
+	fd = bpf_load_program_xattr(&attr, bpf_log_buf, sizeof(bpf_log_buf));
+	if (verbose && fd < 0)
+		fprintf(stderr, "%s\n", bpf_log_buf);
+
+	return fd;
+}
+
+static int run_test(int cgroup_fd, struct sockopt_test *test)
+{
+	int sock_fd, err, prog_fd;
+	void *optval = NULL;
+	int ret = 0;
+
+	prog_fd = load_prog(test->insns, test->expected_attach_type);
+	if (prog_fd < 0) {
+		if (test->error == DENY_LOAD)
+			return 0;
+
+		perror("bpf_program__load");
+		return -1;
+	}
+
+	err = bpf_prog_attach(prog_fd, cgroup_fd, test->attach_type, 0);
+	if (err < 0) {
+		if (test->error == DENY_ATTACH)
+			goto close_prog_fd;
+
+		perror("bpf_prog_attach");
+		ret = -1;
+		goto close_prog_fd;
+	}
+
+	sock_fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (sock_fd < 0) {
+		perror("socket");
+		ret = -1;
+		goto detach_prog;
+	}
+
+	if (test->set_optlen) {
+		err = setsockopt(sock_fd, test->level, test->optname,
+				 test->set_optval, test->set_optlen);
+		if (err) {
+			if (errno == EPERM && test->error == EPERM_SETSOCKOPT)
+				goto close_sock_fd;
+
+			perror("setsockopt");
+			ret = -1;
+			goto close_sock_fd;
+		}
+	}
+
+	if (test->get_optlen) {
+		optval = malloc(test->get_optlen);
+		socklen_t optlen = test->get_optlen;
+		socklen_t expected_get_optlen = test->get_optlen_ret ?:
+			test->get_optlen;
+
+		err = getsockopt(sock_fd, test->level, test->optname,
+				 optval, &optlen);
+		if (err) {
+			if (errno == EPERM && test->error == EPERM_GETSOCKOPT)
+				goto free_optval;
+			if (errno == EFAULT && test->error == EFAULT_GETSOCKOPT)
+				goto free_optval;
+
+			perror("getsockopt");
+			ret = -1;
+			goto free_optval;
+		}
+
+		if (optlen != expected_get_optlen) {
+			perror("getsockopt optlen");
+			ret = -1;
+			goto free_optval;
+		}
+
+		if (memcmp(optval, test->get_optval, optlen) != 0) {
+			perror("getsockopt optval");
+			ret = -1;
+			goto free_optval;
+		}
+	}
+
+	ret = test->error != OK;
+
+free_optval:
+	free(optval);
+close_sock_fd:
+	close(sock_fd);
+detach_prog:
+	bpf_prog_detach2(prog_fd, cgroup_fd, test->attach_type);
+close_prog_fd:
+	close(prog_fd);
+	return ret;
+}
+
+int main(int args, char **argv)
+{
+	int err = EXIT_FAILURE, error_cnt = 0;
+	int cgroup_fd, i;
+
+	if (setup_cgroup_environment())
+		goto cleanup_obj;
+
+	cgroup_fd = create_and_get_cgroup(CG_PATH);
+	if (cgroup_fd < 0)
+		goto cleanup_cgroup_env;
+
+	if (join_cgroup(CG_PATH))
+		goto cleanup_cgroup;
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		int err = run_test(cgroup_fd, &tests[i]);
+
+		if (err)
+			error_cnt++;
+
+		printf("#%d %s: %s\n", i, err ? "FAIL" : "PASS",
+		       tests[i].descr);
+	}
+
+	printf("Summary: %ld PASSED, %d FAILED\n",
+	       ARRAY_SIZE(tests) - error_cnt, error_cnt);
+	err = error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
+
+cleanup_cgroup:
+	close(cgroup_fd);
+cleanup_cgroup_env:
+	cleanup_cgroup_environment();
+cleanup_obj:
+	return err;
+}
-- 
2.22.0.rc1.311.g5d7573a151-goog



* [PATCH bpf-next 6/7] bpf: add sockopt documentation
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2019-06-04 21:35 ` [PATCH bpf-next 5/7] selftests/bpf: add sockopt test Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-04 21:35 ` [PATCH bpf-next 7/7] bpftool: support cgroup sockopt Stanislav Fomichev
  6 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Provide user documentation for the sockopt prog type and cgroup hooks.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/index.rst               |  1 +
 Documentation/bpf/prog_cgroup_sockopt.rst | 42 +++++++++++++++++++++++
 2 files changed, 43 insertions(+)
 create mode 100644 Documentation/bpf/prog_cgroup_sockopt.rst

diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index d3fe4cac0c90..801a6ed3f2e5 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -42,6 +42,7 @@ Program types
 .. toctree::
    :maxdepth: 1
 
+   prog_cgroup_sockopt
    prog_cgroup_sysctl
    prog_flow_dissector
 
diff --git a/Documentation/bpf/prog_cgroup_sockopt.rst b/Documentation/bpf/prog_cgroup_sockopt.rst
new file mode 100644
index 000000000000..7e34ed0b59c0
--- /dev/null
+++ b/Documentation/bpf/prog_cgroup_sockopt.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+BPF_PROG_TYPE_CGROUP_SOCKOPT
+============================
+
+``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two cgroup hooks:
+
+* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
+  ``getsockopt`` system call.
+* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
+  ``setsockopt`` system call.
+
+The context (``struct bpf_sockopt``) has the associated socket (``sk``)
+and all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.
+
+By default, when the hook returns, the kernel code that handles
+``getsockopt`` or ``setsockopt`` is executed as well. That way BPF code
+can handle a subset of options and let the kernel handle the rest. To
+prevent the kernel handlers from being executed, there is a new helper
+called ``bpf_sockopt_handled()``. It tells the kernel that the BPF
+program has handled the socket option and control returns to userspace.
+
+BPF_CGROUP_SETSOCKOPT
+=====================
+
+``BPF_CGROUP_SETSOCKOPT`` has a read-only context; the hook also has
+access to cgroup and socket local storage.
+
+BPF_CGROUP_GETSOCKOPT
+=====================
+
+``BPF_CGROUP_GETSOCKOPT`` has to fill in ``optval`` and adjust
+``optlen`` accordingly. The input ``optlen`` contains the maximum
+length of data that can be returned to userspace. In other words, the
+BPF program can't increase ``optlen``; it can only decrease it.
+
+Return Type
+===========
+
+* ``0`` - reject the syscall; ``EPERM`` will be returned to userspace.
+* ``1`` - success.
-- 
2.22.0.rc1.311.g5d7573a151-goog



* [PATCH bpf-next 7/7] bpftool: support cgroup sockopt
  2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2019-06-04 21:35 ` [PATCH bpf-next 6/7] bpf: add sockopt documentation Stanislav Fomichev
@ 2019-06-04 21:35 ` Stanislav Fomichev
  2019-06-04 21:55   ` Jakub Kicinski
  6 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-04 21:35 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev

Support the sockopt prog type and cgroup hooks in bpftool.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst | 7 +++++--
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   | 2 +-
 tools/bpf/bpftool/bash-completion/bpftool          | 8 +++++---
 tools/bpf/bpftool/cgroup.c                         | 5 ++++-
 tools/bpf/bpftool/main.h                           | 1 +
 tools/bpf/bpftool/prog.c                           | 3 ++-
 6 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
index 36807735e2a5..cac088a320a6 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
@@ -29,7 +29,8 @@ CGROUP COMMANDS
 |	*PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
 |	*ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** |
 |		**bind4** | **bind6** | **post_bind4** | **post_bind6** | **connect4** | **connect6** |
-|		**sendmsg4** | **sendmsg6** | **sysctl** }
+|		**sendmsg4** | **sendmsg6** | **sysctl** | **getsockopt** |
+|		**setsockopt** }
 |	*ATTACH_FLAGS* := { **multi** | **override** }
 
 DESCRIPTION
@@ -86,7 +87,9 @@ DESCRIPTION
 		  unconnected udp4 socket (since 4.18);
 		  **sendmsg6** call to sendto(2), sendmsg(2), sendmmsg(2) for an
 		  unconnected udp6 socket (since 4.18);
-		  **sysctl** sysctl access (since 5.2).
+		  **sysctl** sysctl access (since 5.2);
+		  **getsockopt** call to getsockopt (since 5.3);
+		  **setsockopt** call to setsockopt (since 5.3).
 
 	**bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
 		  Detach *PROG* from the cgroup *CGROUP* and attach type
diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 228a5c863cc7..c6bade35032c 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -40,7 +40,7 @@ PROG COMMANDS
 |		**lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** |
 |		**cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
 |		**cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** |
-|		**cgroup/sysctl**
+|		**cgroup/sysctl** | **cgroup/getsockopt** | **cgroup/setsockopt**
 |	}
 |       *ATTACH_TYPE* := {
 |		**msg_verdict** | **stream_verdict** | **stream_parser** | **flow_dissector**
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 2725e27dfa42..7afb8b6fbaaa 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -378,7 +378,8 @@ _bpftool()
                                 cgroup/connect4 cgroup/connect6 \
                                 cgroup/sendmsg4 cgroup/sendmsg6 \
                                 cgroup/post_bind4 cgroup/post_bind6 \
-                                cgroup/sysctl" -- \
+                                cgroup/sysctl cgroup/getsockopt \
+                                cgroup/setsockopt" -- \
                                                    "$cur" ) )
                             return 0
                             ;;
@@ -688,7 +689,8 @@ _bpftool()
                 attach|detach)
                     local ATTACH_TYPES='ingress egress sock_create sock_ops \
                         device bind4 bind6 post_bind4 post_bind6 connect4 \
-                        connect6 sendmsg4 sendmsg6 sysctl'
+                        connect6 sendmsg4 sendmsg6 sysctl getsockopt \
+                        setsockopt'
                     local ATTACH_FLAGS='multi override'
                     local PROG_TYPE='id pinned tag'
                     case $prev in
@@ -698,7 +700,7 @@ _bpftool()
                             ;;
                         ingress|egress|sock_create|sock_ops|device|bind4|bind6|\
                         post_bind4|post_bind6|connect4|connect6|sendmsg4|\
-                        sendmsg6|sysctl)
+                        sendmsg6|sysctl|getsockopt|setsockopt)
                             COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \
                                 "$cur" ) )
                             return 0
diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
index 7e22f115c8c1..3083f2e4886e 100644
--- a/tools/bpf/bpftool/cgroup.c
+++ b/tools/bpf/bpftool/cgroup.c
@@ -25,7 +25,8 @@
 	"       ATTACH_TYPE := { ingress | egress | sock_create |\n"	       \
 	"                        sock_ops | device | bind4 | bind6 |\n"	       \
 	"                        post_bind4 | post_bind6 | connect4 |\n"       \
-	"                        connect6 | sendmsg4 | sendmsg6 | sysctl }"
+	"                        connect6 | sendmsg4 | sendmsg6 | sysctl |\n"  \
+	"                        getsockopt | setsockopt }"
 
 static const char * const attach_type_strings[] = {
 	[BPF_CGROUP_INET_INGRESS] = "ingress",
@@ -42,6 +43,8 @@ static const char * const attach_type_strings[] = {
 	[BPF_CGROUP_UDP4_SENDMSG] = "sendmsg4",
 	[BPF_CGROUP_UDP6_SENDMSG] = "sendmsg6",
 	[BPF_CGROUP_SYSCTL] = "sysctl",
+	[BPF_CGROUP_GETSOCKOPT] = "getsockopt",
+	[BPF_CGROUP_SETSOCKOPT] = "setsockopt",
 	[__MAX_BPF_ATTACH_TYPE] = NULL,
 };
 
diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index 28a2a5857e14..9c5d9c80f71e 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_SK_REUSEPORT]		= "sk_reuseport",
 	[BPF_PROG_TYPE_FLOW_DISSECTOR]		= "flow_dissector",
 	[BPF_PROG_TYPE_CGROUP_SYSCTL]		= "cgroup_sysctl",
+	[BPF_PROG_TYPE_CGROUP_SOCKOPT]		= "cgroup_sockopt",
 };
 
 extern const char * const map_type_name[];
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 1f209c80d906..a201e1c83346 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -1070,7 +1070,8 @@ static int do_help(int argc, char **argv)
 		"                 sk_reuseport | flow_dissector | cgroup/sysctl |\n"
 		"                 cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n"
 		"                 cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n"
-		"                 cgroup/sendmsg4 | cgroup/sendmsg6 }\n"
+		"                 cgroup/sendmsg4 | cgroup/sendmsg6 | cgroup/getsockopt |\n"
+		"                 cgroup/setsockopt }\n"
 		"       ATTACH_TYPE := { msg_verdict | stream_verdict | stream_parser |\n"
 		"                        flow_dissector }\n"
 		"       " HELP_SPEC_OPTIONS "\n"
-- 
2.22.0.rc1.311.g5d7573a151-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 7/7] bpftool: support cgroup sockopt
  2019-06-04 21:35 ` [PATCH bpf-next 7/7] bpftool: support cgroup sockopt Stanislav Fomichev
@ 2019-06-04 21:55   ` Jakub Kicinski
  0 siblings, 0 replies; 18+ messages in thread
From: Jakub Kicinski @ 2019-06-04 21:55 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: netdev, bpf, davem, ast, daniel

On Tue,  4 Jun 2019 14:35:24 -0700, Stanislav Fomichev wrote:
> Support the sockopt program type and cgroup hooks in bpftool.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-04 21:35 ` [PATCH bpf-next 1/7] bpf: implement " Stanislav Fomichev
@ 2019-06-05 18:47   ` Martin Lau
  2019-06-05 19:17     ` Stanislav Fomichev
  2019-06-05 19:32   ` Andrii Nakryiko
  1 sibling, 1 reply; 18+ messages in thread
From: Martin Lau @ 2019-06-05 18:47 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: netdev, bpf, davem, ast, daniel

On Tue, Jun 04, 2019 at 02:35:18PM -0700, Stanislav Fomichev wrote:
> Implement a new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> 
> BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> 
> The buffer memory is pre-allocated (because I don't think there is
> a precedent for working with __user memory from bpf). This might be
> slow to do for each {s,g}etsockopt call; that's why I've added
> __cgroup_bpf_has_prog_array that exits early if there is nothing
> attached to a cgroup. Note, however, that there is a race between
> __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> program layout might have changed; this should not be a problem
> because in general there is a race between multiple calls to
> {s,g}etsockopt and users adding/removing bpf progs from a cgroup.
> 
> By default, kernel code path is executed after the hook (to let
> BPF handle only a subset of the options). There is a new
> bpf_sockopt_handled handler that returns control to userspace
> instead (bypassing the kernel handling).
> 
> The return code is either 1 (success) or 0 (EPERM).
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/bpf-cgroup.h |  29 ++++
>  include/linux/bpf.h        |   2 +
>  include/linux/bpf_types.h  |   1 +
>  include/linux/filter.h     |  19 +++
>  include/uapi/linux/bpf.h   |  17 ++-
>  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c       |  19 +++
>  kernel/bpf/verifier.c      |  12 ++
>  net/core/filter.c          |   4 +-
>  net/socket.c               |  18 +++
>  10 files changed, 406 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index b631ee75762d..406f1ba82531 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
>  				   loff_t *ppos, void **new_buf,
>  				   enum bpf_attach_type type);
>  
> +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> +				       int optname, char __user *optval,
> +				       unsigned int optlen);
> +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> +				       int optname, char __user *optval,
> +				       int __user *optlen);
> +
>  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
>  	struct bpf_map *map)
>  {
> @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
>  	__ret;								       \
>  })
>  
> +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> +({									       \
> +	int __ret = 0;							       \
> +	if (cgroup_bpf_enabled)						       \
> +		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
> +							   optname, optval,    \
> +							   optlen);	       \
> +	__ret;								       \
> +})
> +
> +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> +({									       \
> +	int __ret = 0;							       \
> +	if (cgroup_bpf_enabled)						       \
> +		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
> +							   optname, optval,    \
> +							   optlen);	       \
> +	__ret;								       \
> +})
> +
>  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
>  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
>  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
>  
>  #define for_each_cgroup_storage_type(stype) for (; false; )
>  
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e5a309e6a400..fb4e6ef5a971 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
>  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>  extern const struct bpf_func_proto bpf_strtol_proto;
>  extern const struct bpf_func_proto bpf_strtoul_proto;
> +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> +extern const struct bpf_func_proto bpf_tcp_sock_proto;
>  
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 5a9975678d6f..eec5aeeeaf92 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
>  #ifdef CONFIG_CGROUP_BPF
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
>  #endif
>  #ifdef CONFIG_BPF_LIRC_MODE2
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 43b45d6db36d..7a07fd2e14d3 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
>  	u64 tmp_reg;
>  };
>  
> +struct bpf_sockopt_kern {
> +	struct sock	*sk;
> +	s32		level;
> +	s32		optname;
> +	u32		optlen;
It seems there is a hole.

> +	u8		*optval;
> +	u8		*optval_end;
> +
> +	/* If true, BPF program had consumed the sockopt request.
> +	 * Control is returned to the userspace (i.e. kernel doesn't
> +	 * handle this option).
> +	 */
> +	bool		handled;
> +
> +	/* Small on-stack optval buffer to avoid small allocations.
> +	 */
> +	u8 buf[64];
Is it better to align to 8 bytes?

> +};
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7c6aef253173..b6c3891241ef 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -170,6 +170,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_FLOW_DISSECTOR,
>  	BPF_PROG_TYPE_CGROUP_SYSCTL,
>  	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> +	BPF_PROG_TYPE_CGROUP_SOCKOPT,
>  };
>  
>  enum bpf_attach_type {
> @@ -192,6 +193,8 @@ enum bpf_attach_type {
>  	BPF_LIRC_MODE2,
>  	BPF_FLOW_DISSECTOR,
>  	BPF_CGROUP_SYSCTL,
> +	BPF_CGROUP_GETSOCKOPT,
> +	BPF_CGROUP_SETSOCKOPT,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -2815,7 +2818,8 @@ union bpf_attr {
>  	FN(strtoul),			\
>  	FN(sk_storage_get),		\
>  	FN(sk_storage_delete),		\
> -	FN(send_signal),
> +	FN(send_signal),		\
> +	FN(sockopt_handled),
Document.

>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
>  				 */
>  };
>  
> +struct bpf_sockopt {
> +	__bpf_md_ptr(struct bpf_sock *, sk);
> +
> +	__s32	level;
> +	__s32	optname;
> +
> +	__u32	optlen;
> +	__u32	optval;
> +	__u32	optval_end;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 1b65ab0df457..4ec99ea97023 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -18,6 +18,7 @@
>  #include <linux/bpf.h>
>  #include <linux/bpf-cgroup.h>
>  #include <net/sock.h>
> +#include <net/bpf_sk_storage.h>
>  
>  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
>  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
>  }
>  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
>  
> +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> +					enum bpf_attach_type attach_type)
> +{
> +	struct bpf_prog_array *prog_array;
> +	int nr;
> +
> +	rcu_read_lock();
> +	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> +	nr = bpf_prog_array_length(prog_array);
Nit. It seems unnecessary to loop through the whole
array if the only signal needed is non-zero.

> +	rcu_read_unlock();
> +
> +	return nr > 0;
> +}
> +
> +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> +{
> +	if (unlikely(max_optlen > PAGE_SIZE))
> +		return -EINVAL;
> +
> +	if (likely(max_optlen <= sizeof(ctx->buf))) {
> +		ctx->optval = ctx->buf;
> +	} else {
> +		ctx->optval = kzalloc(max_optlen, GFP_USER);
> +		if (!ctx->optval)
> +			return -ENOMEM;
> +	}
> +
> +	ctx->optval_end = ctx->optval + max_optlen;
> +	ctx->optlen = max_optlen;
> +
> +	return 0;
> +}
> +
> +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> +{
> +	if (unlikely(ctx->optval != ctx->buf))
> +		kfree(ctx->optval);
> +}
> +
> +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> +				       int optname, char __user *optval,
> +				       unsigned int optlen)
> +{
> +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	struct bpf_sockopt_kern ctx = {
> +		.sk = sk,
> +		.level = level,
> +		.optname = optname,
> +	};
> +	int ret;
> +
> +	/* Opportunistic check to see whether we have any BPF program
> +	 * attached to the hook so we don't waste time allocating
> +	 * memory and locking the socket.
> +	 */
> +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> +		return 0;
> +
> +	ret = sockopt_alloc_buf(&ctx, optlen);
> +	if (ret)
> +		return ret;
> +
> +	if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> +		sockopt_free_buf(&ctx);
> +		return -EFAULT;
> +	}
> +
> +	lock_sock(sk);
> +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> +				 &ctx, BPF_PROG_RUN);
I think the check_return_code() in verifier.c has to be
adjusted also.

> +	release_sock(sk);
> +
> +	sockopt_free_buf(&ctx);
> +
> +	if (!ret)
> +		return -EPERM;
> +
> +	return ctx.handled ? 1 : 0;
> +}
> +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> +
> +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> +				       int optname, char __user *optval,
> +				       int __user *optlen)
> +{
> +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	struct bpf_sockopt_kern ctx = {
> +		.sk = sk,
> +		.level = level,
> +		.optname = optname,
> +	};
> +	int max_optlen;
> +	char buf[64];
hmm... where is it used?

> +	int ret;
> +
> +	/* Opportunistic check to see whether we have any BPF program
> +	 * attached to the hook so we don't waste time allocating
> +	 * memory and locking the socket.
> +	 */
> +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> +		return 0;
> +
> +	if (get_user(max_optlen, optlen))
> +		return -EFAULT;
> +
> +	ret = sockopt_alloc_buf(&ctx, max_optlen);
> +	if (ret)
> +		return ret;
> +
> +	lock_sock(sk);
> +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> +				 &ctx, BPF_PROG_RUN);
> +	release_sock(sk);
> +
> +	if (ctx.optlen > max_optlen) {
> +		sockopt_free_buf(&ctx);
> +		return -EFAULT;
> +	}
> +
> +	if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> +		sockopt_free_buf(&ctx);
> +		return -EFAULT;
> +	}
> +
> +	sockopt_free_buf(&ctx);
> +
> +	if (put_user(ctx.optlen, optlen))
> +		return -EFAULT;
> +
> +	if (!ret)
> +		return -EPERM;
> +
> +	return ctx.handled ? 1 : 0;
> +}
> +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> +
>  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
>  			      size_t *lenp)
>  {
> @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
>  
>  const struct bpf_prog_ops cg_sysctl_prog_ops = {
>  };
> +
> +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> +{
> +	ctx->handled = true;
> +	return 1;
RET_VOID?

> +}
> +
> +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> +	.func		= bpf_sockopt_handled,
> +	.gpl_only	= false,
> +	.arg1_type      = ARG_PTR_TO_CTX,
> +	.ret_type	= RET_INTEGER,
> +};
> +
> +static const struct bpf_func_proto *
> +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_sockopt_handled:
> +		return &bpf_sockopt_handled_proto;
> +	case BPF_FUNC_sk_fullsock:
> +		return &bpf_sk_fullsock_proto;
> +	case BPF_FUNC_sk_storage_get:
> +		return &bpf_sk_storage_get_proto;
> +	case BPF_FUNC_sk_storage_delete:
> +		return &bpf_sk_storage_delete_proto;
> +#ifdef CONFIG_INET
> +	case BPF_FUNC_tcp_sock:
> +		return &bpf_tcp_sock_proto;
> +#endif
> +	default:
> +		return cgroup_base_func_proto(func_id, prog);
> +	}
> +}
> +
> +static bool cg_sockopt_is_valid_access(int off, int size,
> +				       enum bpf_access_type type,
> +				       const struct bpf_prog *prog,
> +				       struct bpf_insn_access_aux *info)
> +{
> +	const int size_default = sizeof(__u32);
> +
> +	if (off < 0 || off >= sizeof(struct bpf_sockopt))
> +		return false;
> +
> +	if (off % size != 0)
> +		return false;
> +
> +	if (type == BPF_WRITE) {
> +		switch (off) {
> +		case offsetof(struct bpf_sockopt, optlen):
> +			if (size != size_default)
> +				return false;
> +			return prog->expected_attach_type ==
> +				BPF_CGROUP_GETSOCKOPT;
> +		default:
> +			return false;
> +		}
> +	}
> +
> +	switch (off) {
> +	case offsetof(struct bpf_sockopt, sk):
> +		if (size != sizeof(__u64))
> +			return false;
> +		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
sk cannot be NULL, so the OR_NULL part is not needed.

I think it should also be PTR_TO_SOCKET instead.

> +		break;
> +	case bpf_ctx_range(struct bpf_sockopt, optval):
> +		if (size != size_default)
> +			return false;
> +		info->reg_type = PTR_TO_PACKET;
> +		break;
> +	case bpf_ctx_range(struct bpf_sockopt, optval_end):
> +		if (size != size_default)
> +			return false;
> +		info->reg_type = PTR_TO_PACKET_END;
> +		break;
> +	default:
> +		if (size != size_default)
> +			return false;
> +		break;
> +	}
> +	return true;
> +}
> +
> +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> +					 const struct bpf_insn *si,
> +					 struct bpf_insn *insn_buf,
> +					 struct bpf_prog *prog,
> +					 u32 *target_size)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +
> +	switch (si->off) {
> +	case offsetof(struct bpf_sockopt, sk):
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sockopt_kern, sk));
> +		break;
> +	case offsetof(struct bpf_sockopt, level):
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      bpf_target_off(struct bpf_sockopt_kern,
> +						     level, 4, target_size));
bpf_target_off() is not needed since there is no narrow load.

> +		break;
> +	case offsetof(struct bpf_sockopt, optname):
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      bpf_target_off(struct bpf_sockopt_kern,
> +						     optname, 4, target_size));
> +		break;
> +	case offsetof(struct bpf_sockopt, optlen):
> +		if (type == BPF_WRITE)
> +			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +					      bpf_target_off(struct bpf_sockopt_kern,
> +							     optlen, 4, target_size));
> +		else
> +			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +					      bpf_target_off(struct bpf_sockopt_kern,
> +							     optlen, 4, target_size));
> +		break;
> +	case offsetof(struct bpf_sockopt, optval):
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sockopt_kern, optval));
> +		break;
> +	case offsetof(struct bpf_sockopt, optval_end):
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sockopt_kern, optval_end));
> +		break;
> +	}
> +
> +	return insn - insn_buf;
> +}
> +
> +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> +				   bool direct_write,
> +				   const struct bpf_prog *prog)
> +{
> +	/* Nothing to do for sockopt argument. The data is kzalloc'ated.
> +	 */
> +	return 0;
> +}
> +
> +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> +	.get_func_proto		= cg_sockopt_func_proto,
> +	.is_valid_access	= cg_sockopt_is_valid_access,
> +	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
> +	.gen_prologue		= cg_sockopt_get_prologue,
> +};
> +
> +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 4c53cbd3329d..4ad2b5f1905f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
>  		default:
>  			return -EINVAL;
>  		}
> +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> +		switch (expected_attach_type) {
> +		case BPF_CGROUP_SETSOCKOPT:
> +		case BPF_CGROUP_GETSOCKOPT:
> +			return 0;
> +		default:
> +			return -EINVAL;
> +		}
>  	default:
>  		return 0;
>  	}
> @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
>  	switch (prog->type) {
>  	case BPF_PROG_TYPE_CGROUP_SOCK:
>  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
>  	case BPF_PROG_TYPE_CGROUP_SKB:
>  		return prog->enforce_expected_attach_type &&
> @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	case BPF_CGROUP_SYSCTL:
>  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
>  		break;
> +	case BPF_CGROUP_GETSOCKOPT:
> +	case BPF_CGROUP_SETSOCKOPT:
> +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>  	case BPF_CGROUP_SYSCTL:
>  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
>  		break;
> +	case BPF_CGROUP_GETSOCKOPT:
> +	case BPF_CGROUP_SETSOCKOPT:
> +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
>  	case BPF_CGROUP_SOCK_OPS:
>  	case BPF_CGROUP_DEVICE:
>  	case BPF_CGROUP_SYSCTL:
> +	case BPF_CGROUP_GETSOCKOPT:
> +	case BPF_CGROUP_SETSOCKOPT:
>  		break;
>  	case BPF_LIRC_MODE2:
>  		return lirc_prog_query(attr, uattr);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5c2cb5bd84ce..b91fde10e721 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>  
>  		env->seen_direct_write = true;
>  		return true;
> +
> +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> +		if (t == BPF_WRITE) {
> +			if (env->prog->expected_attach_type ==
> +			    BPF_CGROUP_GETSOCKOPT) {
> +				env->seen_direct_write = true;
> +				return true;
> +			}
> +			return false;
> +		}
> +		return true;
> +
>  	default:
>  		return false;
>  	}
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 55bfc941d17a..4652c0a005ca 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
>  	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
>  }
>  
> -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> +const struct bpf_func_proto bpf_sk_fullsock_proto = {
>  	.func		= bpf_sk_fullsock,
>  	.gpl_only	= false,
>  	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
> @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
>  	return (unsigned long)NULL;
>  }
>  
> -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> +const struct bpf_func_proto bpf_tcp_sock_proto = {
>  	.func		= bpf_tcp_sock,
>  	.gpl_only	= false,
>  	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
> diff --git a/net/socket.c b/net/socket.c
> index 72372dc5dd70..e8654f1f70e6 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
>  		if (err)
>  			goto out_put;
>  
> +		err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> +						     optval, optlen);
> +		if (err < 0) {
> +			goto out_put;
> +		} else if (err > 0) {
> +			err = 0;
> +			goto out_put;
> +		}
> +
>  		if (level == SOL_SOCKET)
>  			err =
>  			    sock_setsockopt(sock, level, optname, optval,
> @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
>  		if (err)
>  			goto out_put;
>  
> +		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> +						     optval, optlen);
> +		if (err < 0) {
> +			goto out_put;
> +		} else if (err > 0) {
> +			err = 0;
> +			goto out_put;
> +		}
> +
>  		if (level == SOL_SOCKET)
>  			err =
>  			    sock_getsockopt(sock, level, optname, optval,
> -- 
> 2.22.0.rc1.311.g5d7573a151-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 18:47   ` Martin Lau
@ 2019-06-05 19:17     ` Stanislav Fomichev
  2019-06-05 20:50       ` Martin Lau
  0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-05 19:17 UTC (permalink / raw)
  To: Martin Lau; +Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel

On 06/05, Martin Lau wrote:
> On Tue, Jun 04, 2019 at 02:35:18PM -0700, Stanislav Fomichev wrote:
> > Implement a new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > 
> > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > 
> > The buffer memory is pre-allocated (because I don't think there is
> > a precedent for working with __user memory from bpf). This might be
> > slow to do for each {s,g}etsockopt call; that's why I've added
> > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > attached to a cgroup. Note, however, that there is a race between
> > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > program layout might have changed; this should not be a problem
> > because in general there is a race between multiple calls to
> > {s,g}etsockopt and users adding/removing bpf progs from a cgroup.
> > 
> > By default, kernel code path is executed after the hook (to let
> > BPF handle only a subset of the options). There is a new
> > bpf_sockopt_handled handler that returns control to userspace
> > instead (bypassing the kernel handling).
> > 
> > The return code is either 1 (success) or 0 (EPERM).
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/bpf-cgroup.h |  29 ++++
> >  include/linux/bpf.h        |   2 +
> >  include/linux/bpf_types.h  |   1 +
> >  include/linux/filter.h     |  19 +++
> >  include/uapi/linux/bpf.h   |  17 ++-
> >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> >  kernel/bpf/syscall.c       |  19 +++
> >  kernel/bpf/verifier.c      |  12 ++
> >  net/core/filter.c          |   4 +-
> >  net/socket.c               |  18 +++
> >  10 files changed, 406 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > index b631ee75762d..406f1ba82531 100644
> > --- a/include/linux/bpf-cgroup.h
> > +++ b/include/linux/bpf-cgroup.h
> > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> >  				   loff_t *ppos, void **new_buf,
> >  				   enum bpf_attach_type type);
> >  
> > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > +				       int optname, char __user *optval,
> > +				       unsigned int optlen);
> > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > +				       int optname, char __user *optval,
> > +				       int __user *optlen);
> > +
> >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> >  	struct bpf_map *map)
> >  {
> > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> >  	__ret;								       \
> >  })
> >  
> > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > +({									       \
> > +	int __ret = 0;							       \
> > +	if (cgroup_bpf_enabled)						       \
> > +		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
> > +							   optname, optval,    \
> > +							   optlen);	       \
> > +	__ret;								       \
> > +})
> > +
> > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > +({									       \
> > +	int __ret = 0;							       \
> > +	if (cgroup_bpf_enabled)						       \
> > +		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
> > +							   optname, optval,    \
> > +							   optlen);	       \
> > +	__ret;								       \
> > +})
> > +
> >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> >  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
> >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> >  
> >  #define for_each_cgroup_storage_type(stype) for (; false; )
> >  
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index e5a309e6a400..fb4e6ef5a971 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> >  extern const struct bpf_func_proto bpf_strtol_proto;
> >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> >  
> >  /* Shared helpers among cBPF and eBPF. */
> >  void bpf_user_rnd_init_once(void);
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 5a9975678d6f..eec5aeeeaf92 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> >  #ifdef CONFIG_CGROUP_BPF
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> >  #endif
> >  #ifdef CONFIG_BPF_LIRC_MODE2
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 43b45d6db36d..7a07fd2e14d3 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> >  	u64 tmp_reg;
> >  };
> >  
> > +struct bpf_sockopt_kern {
> > +	struct sock	*sk;
> > +	s32		level;
> > +	s32		optname;
> > +	u32		optlen;
> It seems there is a hole.
Ack, will move the pointers up.

> > +	u8		*optval;
> > +	u8		*optval_end;
> > +
> > +	/* If true, BPF program had consumed the sockopt request.
> > +	 * Control is returned to the userspace (i.e. kernel doesn't
> > +	 * handle this option).
> > +	 */
> > +	bool		handled;
> > +
> > +	/* Small on-stack optval buffer to avoid small allocations.
> > +	 */
> > +	u8 buf[64];
> Is it better to align to 8 bytes?
Do you mean manually setting the size to 64 + x, where x is the
remainder needed to align to 8 bytes? Is there some macro to help with
that, maybe?
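
Or is just forcing the alignment with an attribute enough? E.g. (sketch,
assuming the generic __aligned() helper is fine to use here):

	/* Small on-stack optval buffer to avoid small allocations. */
	u8	buf[64] __aligned(8);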

> > +};
> > +
> >  #endif /* __LINUX_FILTER_H__ */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 7c6aef253173..b6c3891241ef 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> >  	BPF_PROG_TYPE_FLOW_DISSECTOR,
> >  	BPF_PROG_TYPE_CGROUP_SYSCTL,
> >  	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > +	BPF_PROG_TYPE_CGROUP_SOCKOPT,
> >  };
> >  
> >  enum bpf_attach_type {
> > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> >  	BPF_LIRC_MODE2,
> >  	BPF_FLOW_DISSECTOR,
> >  	BPF_CGROUP_SYSCTL,
> > +	BPF_CGROUP_GETSOCKOPT,
> > +	BPF_CGROUP_SETSOCKOPT,
> >  	__MAX_BPF_ATTACH_TYPE
> >  };
> >  
> > @@ -2815,7 +2818,8 @@ union bpf_attr {
> >  	FN(strtoul),			\
> >  	FN(sk_storage_get),		\
> >  	FN(sk_storage_delete),		\
> > -	FN(send_signal),
> > +	FN(send_signal),		\
> > +	FN(sockopt_handled),
> Document.
Ah, totally forgot about that, sure, will do!

> >  
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> >  				 */
> >  };
> >  
> > +struct bpf_sockopt {
> > +	__bpf_md_ptr(struct bpf_sock *, sk);
> > +
> > +	__s32	level;
> > +	__s32	optname;
> > +
> > +	__u32	optlen;
> > +	__u32	optval;
> > +	__u32	optval_end;
> > +};
> > +
> >  #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index 1b65ab0df457..4ec99ea97023 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/bpf.h>
> >  #include <linux/bpf-cgroup.h>
> >  #include <net/sock.h>
> > +#include <net/bpf_sk_storage.h>
> >  
> >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> >  }
> >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> >  
> > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > +					enum bpf_attach_type attach_type)
> > +{
> > +	struct bpf_prog_array *prog_array;
> > +	int nr;
> > +
> > +	rcu_read_lock();
> > +	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > +	nr = bpf_prog_array_length(prog_array);
> Nit. It seems unnecessary to loop through the whole
> array if the only signal needed is non-zero.
Oh, good point. I guess I'd have to add another helper like
bpf_prog_array_is_empty() and return early. Any other suggestions?
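
Roughly along these lines, I suppose (just a sketch; it mirrors how
bpf_prog_array_length() skips the dummy prog, and would live next to it
in kernel/bpf/core.c):

	bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
	{
		struct bpf_prog_array_item *item;

		for (item = array->items; item->prog; item++)
			if (item->prog != &dummy_bpf_prog.prog)
				return false;

		return true;
	}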

> > +	rcu_read_unlock();
> > +
> > +	return nr > 0;
> > +}
> > +
> > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > +{
> > +	if (unlikely(max_optlen > PAGE_SIZE))
> > +		return -EINVAL;
> > +
> > +	if (likely(max_optlen <= sizeof(ctx->buf))) {
> > +		ctx->optval = ctx->buf;
> > +	} else {
> > +		ctx->optval = kzalloc(max_optlen, GFP_USER);
> > +		if (!ctx->optval)
> > +			return -ENOMEM;
> > +	}
> > +
> > +	ctx->optval_end = ctx->optval + max_optlen;
> > +	ctx->optlen = max_optlen;
> > +
> > +	return 0;
> > +}
> > +
> > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > +{
> > +	if (unlikely(ctx->optval != ctx->buf))
> > +		kfree(ctx->optval);
> > +}
> > +
> > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > +				       int optname, char __user *optval,
> > +				       unsigned int optlen)
> > +{
> > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +	struct bpf_sockopt_kern ctx = {
> > +		.sk = sk,
> > +		.level = level,
> > +		.optname = optname,
> > +	};
> > +	int ret;
> > +
> > +	/* Opportunistic check to see whether we have any BPF program
> > +	 * attached to the hook so we don't waste time allocating
> > +	 * memory and locking the socket.
> > +	 */
> > +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> > +		return 0;
> > +
> > +	ret = sockopt_alloc_buf(&ctx, optlen);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > +		sockopt_free_buf(&ctx);
> > +		return -EFAULT;
> > +	}
> > +
> > +	lock_sock(sk);
> > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > +				 &ctx, BPF_PROG_RUN);
> I think the check_return_code() in verifier.c has to be
> adjusted also.
Good catch! I thought it did the [0,1] check by default.
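
So presumably it's just a matter of adding the new type to the switch
in check_return_code() (sketch):

	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
		break;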

> > +	release_sock(sk);
> > +
> > +	sockopt_free_buf(&ctx);
> > +
> > +	if (!ret)
> > +		return -EPERM;
> > +
> > +	return ctx.handled ? 1 : 0;
> > +}
> > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > +
> > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > +				       int optname, char __user *optval,
> > +				       int __user *optlen)
> > +{
> > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +	struct bpf_sockopt_kern ctx = {
> > +		.sk = sk,
> > +		.level = level,
> > +		.optname = optname,
> > +	};
> > +	int max_optlen;
> > +	char buf[64];
> hmm... where is it used?
It's a leftover from my initial attempt to have a small buffer on the
stack. I've since moved it into struct bpf_sockopt_kern. Will remove.
GCC even complains about the unused var; not sure how I missed that...

> > +	int ret;
> > +
> > +	/* Opportunistic check to see whether we have any BPF program
> > +	 * attached to the hook so we don't waste time allocating
> > +	 * memory and locking the socket.
> > +	 */
> > +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> > +		return 0;
> > +
> > +	if (get_user(max_optlen, optlen))
> > +		return -EFAULT;
> > +
> > +	ret = sockopt_alloc_buf(&ctx, max_optlen);
> > +	if (ret)
> > +		return ret;
> > +
> > +	lock_sock(sk);
> > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > +				 &ctx, BPF_PROG_RUN);
> > +	release_sock(sk);
> > +
> > +	if (ctx.optlen > max_optlen) {
> > +		sockopt_free_buf(&ctx);
> > +		return -EFAULT;
> > +	}
> > +
> > +	if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > +		sockopt_free_buf(&ctx);
> > +		return -EFAULT;
> > +	}
> > +
> > +	sockopt_free_buf(&ctx);
> > +
> > +	if (put_user(ctx.optlen, optlen))
> > +		return -EFAULT;
> > +
> > +	if (!ret)
> > +		return -EPERM;
> > +
> > +	return ctx.handled ? 1 : 0;
> > +}
> > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > +
> >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
> >  			      size_t *lenp)
> >  {
> > @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
> >  
> >  const struct bpf_prog_ops cg_sysctl_prog_ops = {
> >  };
> > +
> > +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> > +{
> > +	ctx->handled = true;
> > +	return 1;
> RET_VOID?
I was thinking that in the C code the pattern can be:
{
	...
	return bpf_sockopt_handled();
}

That's why I'm returning 1 from the helper. But I can change it to VOID
so that users have to return 1 manually. That's probably cleaner, will
change.
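
I.e. the pattern would then become (sketch):

{
	...
	bpf_sockopt_handled(ctx);
	return 1;
}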

> > +}
> > +
> > +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> > +	.func		= bpf_sockopt_handled,
> > +	.gpl_only	= false,
> > +	.arg1_type      = ARG_PTR_TO_CTX,
> > +	.ret_type	= RET_INTEGER,
> > +};
> > +
> > +static const struct bpf_func_proto *
> > +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +	switch (func_id) {
> > +	case BPF_FUNC_sockopt_handled:
> > +		return &bpf_sockopt_handled_proto;
> > +	case BPF_FUNC_sk_fullsock:
> > +		return &bpf_sk_fullsock_proto;
> > +	case BPF_FUNC_sk_storage_get:
> > +		return &bpf_sk_storage_get_proto;
> > +	case BPF_FUNC_sk_storage_delete:
> > +		return &bpf_sk_storage_delete_proto;
> > +#ifdef CONFIG_INET
> > +	case BPF_FUNC_tcp_sock:
> > +		return &bpf_tcp_sock_proto;
> > +#endif
> > +	default:
> > +		return cgroup_base_func_proto(func_id, prog);
> > +	}
> > +}
> > +
> > +static bool cg_sockopt_is_valid_access(int off, int size,
> > +				       enum bpf_access_type type,
> > +				       const struct bpf_prog *prog,
> > +				       struct bpf_insn_access_aux *info)
> > +{
> > +	const int size_default = sizeof(__u32);
> > +
> > +	if (off < 0 || off >= sizeof(struct bpf_sockopt))
> > +		return false;
> > +
> > +	if (off % size != 0)
> > +		return false;
> > +
> > +	if (type == BPF_WRITE) {
> > +		switch (off) {
> > +		case offsetof(struct bpf_sockopt, optlen):
> > +			if (size != size_default)
> > +				return false;
> > +			return prog->expected_attach_type ==
> > +				BPF_CGROUP_GETSOCKOPT;
> > +		default:
> > +			return false;
> > +		}
> > +	}
> > +
> > +	switch (off) {
> > +	case offsetof(struct bpf_sockopt, sk):
> > +		if (size != sizeof(__u64))
> > +			return false;
> > +		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> sk cannot be NULL, so the OR_NULL part is not needed.
> 
> I think it should also be PTR_TO_SOCKET instead.
I think you're correct. That reminds me that I haven't properly tested
it. Let me add a small C selftest that exercises this codepath.
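
Something minimal along these lines, maybe (sketch only):

	SEC("cgroup/getsockopt")
	int sk_access(struct bpf_sockopt *ctx)
	{
		struct bpf_sock *sk = ctx->sk;

		/* NULL check is required as long as sk is _OR_NULL. */
		if (!sk)
			return 0;

		if (sk->family != AF_INET)
			return 0;

		return 1;
	}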

> > +		break;
> > +	case bpf_ctx_range(struct bpf_sockopt, optval):
> > +		if (size != size_default)
> > +			return false;
> > +		info->reg_type = PTR_TO_PACKET;
> > +		break;
> > +	case bpf_ctx_range(struct bpf_sockopt, optval_end):
> > +		if (size != size_default)
> > +			return false;
> > +		info->reg_type = PTR_TO_PACKET_END;
> > +		break;
> > +	default:
> > +		if (size != size_default)
> > +			return false;
> > +		break;
> > +	}
> > +	return true;
> > +}
> > +
> > +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> > +					 const struct bpf_insn *si,
> > +					 struct bpf_insn *insn_buf,
> > +					 struct bpf_prog *prog,
> > +					 u32 *target_size)
> > +{
> > +	struct bpf_insn *insn = insn_buf;
> > +
> > +	switch (si->off) {
> > +	case offsetof(struct bpf_sockopt, sk):
> > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> > +				      si->dst_reg, si->src_reg,
> > +				      offsetof(struct bpf_sockopt_kern, sk));
> > +		break;
> > +	case offsetof(struct bpf_sockopt, level):
> > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +				      bpf_target_off(struct bpf_sockopt_kern,
> > +						     level, 4, target_size));
> bpf_target_off() is not needed since there is no narrow load.
Good point, will drop it.
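
I.e. (sketch):

	case offsetof(struct bpf_sockopt, level):
		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
				      offsetof(struct bpf_sockopt_kern, level));
		break;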

Thank you for the review!

> > +		break;
> > +	case offsetof(struct bpf_sockopt, optname):
> > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +				      bpf_target_off(struct bpf_sockopt_kern,
> > +						     optname, 4, target_size));
> > +		break;
> > +	case offsetof(struct bpf_sockopt, optlen):
> > +		if (type == BPF_WRITE)
> > +			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +					      bpf_target_off(struct bpf_sockopt_kern,
> > +							     optlen, 4, target_size));
> > +		else
> > +			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +					      bpf_target_off(struct bpf_sockopt_kern,
> > +							     optlen, 4, target_size));
> > +		break;
> > +	case offsetof(struct bpf_sockopt, optval):
> > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> > +				      si->dst_reg, si->src_reg,
> > +				      offsetof(struct bpf_sockopt_kern, optval));
> > +		break;
> > +	case offsetof(struct bpf_sockopt, optval_end):
> > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> > +				      si->dst_reg, si->src_reg,
> > +				      offsetof(struct bpf_sockopt_kern, optval_end));
> > +		break;
> > +	}
> > +
> > +	return insn - insn_buf;
> > +}
> > +
> > +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> > +				   bool direct_write,
> > +				   const struct bpf_prog *prog)
> > +{
> > +	/* Nothing to do for sockopt argument. The data is kzalloc'ated.
> > +	 */
> > +	return 0;
> > +}
> > +
> > +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> > +	.get_func_proto		= cg_sockopt_func_proto,
> > +	.is_valid_access	= cg_sockopt_is_valid_access,
> > +	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
> > +	.gen_prologue		= cg_sockopt_get_prologue,
> > +};
> > +
> > +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> > +};
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 4c53cbd3329d..4ad2b5f1905f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> >  		default:
> >  			return -EINVAL;
> >  		}
> > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > +		switch (expected_attach_type) {
> > +		case BPF_CGROUP_SETSOCKOPT:
> > +		case BPF_CGROUP_GETSOCKOPT:
> > +			return 0;
> > +		default:
> > +			return -EINVAL;
> > +		}
> >  	default:
> >  		return 0;
> >  	}
> > @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> >  	switch (prog->type) {
> >  	case BPF_PROG_TYPE_CGROUP_SOCK:
> >  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> >  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> >  	case BPF_PROG_TYPE_CGROUP_SKB:
> >  		return prog->enforce_expected_attach_type &&
> > @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >  	case BPF_CGROUP_SYSCTL:
> >  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> >  		break;
> > +	case BPF_CGROUP_GETSOCKOPT:
> > +	case BPF_CGROUP_SETSOCKOPT:
> > +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > +		break;
> >  	default:
> >  		return -EINVAL;
> >  	}
> > @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> >  	case BPF_CGROUP_SYSCTL:
> >  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> >  		break;
> > +	case BPF_CGROUP_GETSOCKOPT:
> > +	case BPF_CGROUP_SETSOCKOPT:
> > +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > +		break;
> >  	default:
> >  		return -EINVAL;
> >  	}
> > @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> >  	case BPF_CGROUP_SOCK_OPS:
> >  	case BPF_CGROUP_DEVICE:
> >  	case BPF_CGROUP_SYSCTL:
> > +	case BPF_CGROUP_GETSOCKOPT:
> > +	case BPF_CGROUP_SETSOCKOPT:
> >  		break;
> >  	case BPF_LIRC_MODE2:
> >  		return lirc_prog_query(attr, uattr);
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 5c2cb5bd84ce..b91fde10e721 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> >  
> >  		env->seen_direct_write = true;
> >  		return true;
> > +
> > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > +		if (t == BPF_WRITE) {
> > +			if (env->prog->expected_attach_type ==
> > +			    BPF_CGROUP_GETSOCKOPT) {
> > +				env->seen_direct_write = true;
> > +				return true;
> > +			}
> > +			return false;
> > +		}
> > +		return true;
> > +
> >  	default:
> >  		return false;
> >  	}
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 55bfc941d17a..4652c0a005ca 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> >  	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> >  }
> >  
> > -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > +const struct bpf_func_proto bpf_sk_fullsock_proto = {
> >  	.func		= bpf_sk_fullsock,
> >  	.gpl_only	= false,
> >  	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
> > @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
> >  	return (unsigned long)NULL;
> >  }
> >  
> > -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > +const struct bpf_func_proto bpf_tcp_sock_proto = {
> >  	.func		= bpf_tcp_sock,
> >  	.gpl_only	= false,
> >  	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
> > diff --git a/net/socket.c b/net/socket.c
> > index 72372dc5dd70..e8654f1f70e6 100644
> > --- a/net/socket.c
> > +++ b/net/socket.c
> > @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
> >  		if (err)
> >  			goto out_put;
> >  
> > +		err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> > +						     optval, optlen);
> > +		if (err < 0) {
> > +			goto out_put;
> > +		} else if (err > 0) {
> > +			err = 0;
> > +			goto out_put;
> > +		}
> > +
> >  		if (level == SOL_SOCKET)
> >  			err =
> >  			    sock_setsockopt(sock, level, optname, optval,
> > @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
> >  		if (err)
> >  			goto out_put;
> >  
> > +		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> > +						     optval, optlen);
> > +		if (err < 0) {
> > +			goto out_put;
> > +		} else if (err > 0) {
> > +			err = 0;
> > +			goto out_put;
> > +		}
> > +
> >  		if (level == SOL_SOCKET)
> >  			err =
> >  			    sock_getsockopt(sock, level, optname, optval,
> > -- 
> > 2.22.0.rc1.311.g5d7573a151-goog
> > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-04 21:35 ` [PATCH bpf-next 1/7] bpf: implement " Stanislav Fomichev
  2019-06-05 18:47   ` Martin Lau
@ 2019-06-05 19:32   ` Andrii Nakryiko
  2019-06-05 20:54     ` Stanislav Fomichev
  1 sibling, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2019-06-05 19:32 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, bpf, davem, Alexei Starovoitov, Daniel Borkmann

On Tue, Jun 4, 2019 at 2:35 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Implement a new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
>
> BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
>
> The buffer memory is pre-allocated (because I don't think there is
> a precedent for working with __user memory from bpf). This might be

Is there any harm or technical complication in allowing BPF to read
user memory directly? Or is it just uncharted territory, so there is
no "guideline"? If it's the latter, it could be a good time to discuss
that :)

> slow to do for each {s,g}etsockopt call; that's why I've added
> __cgroup_bpf_has_prog_array that exits early if there is nothing
> attached to a cgroup. Note, however, that there is a race between
> __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> program layout might have changed; this should not be a problem
> because in general there is a race between multiple calls to
> {s,g}etsockopt and users adding/removing bpf progs from a cgroup.
>
> By default, kernel code path is executed after the hook (to let
> BPF handle only a subset of the options). There is a new
> bpf_sockopt_handled handler that returns control to userspace
> instead (bypassing the kernel handling).
>
> The return code is either 1 (success) or 0 (EPERM).

Why not have 3 return values: success, disallow, consumed/bypass the
kernel logic? Instead of having an extra side-effecting helper?
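
Something like this, say (sketch, names made up):

	/* Return values for BPF_CGROUP_{GET,SET}SOCKOPT programs. */
	enum {
		SOCKOPT_EPERM   = 0,	/* reject the syscall with EPERM */
		SOCKOPT_PASS    = 1,	/* continue to the kernel handler */
		SOCKOPT_HANDLED = 2,	/* skip the kernel handler */
	};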

>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/bpf-cgroup.h |  29 ++++
>  include/linux/bpf.h        |   2 +
>  include/linux/bpf_types.h  |   1 +
>  include/linux/filter.h     |  19 +++
>  include/uapi/linux/bpf.h   |  17 ++-
>  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c       |  19 +++
>  kernel/bpf/verifier.c      |  12 ++
>  net/core/filter.c          |   4 +-
>  net/socket.c               |  18 +++
>  10 files changed, 406 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index b631ee75762d..406f1ba82531 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
>                                    loff_t *ppos, void **new_buf,
>                                    enum bpf_attach_type type);
>
> +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> +                                      int optname, char __user *optval,
> +                                      unsigned int optlen);
> +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> +                                      int optname, char __user *optval,
> +                                      int __user *optlen);
> +
>  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
>         struct bpf_map *map)
>  {
> @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
>         __ret;                                                                 \
>  })
>
> +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> +({                                                                            \
> +       int __ret = 0;                                                         \
> +       if (cgroup_bpf_enabled)                                                \
> +               __ret = __cgroup_bpf_run_filter_setsockopt(sock, level,        \
> +                                                          optname, optval,    \
> +                                                          optlen);            \
> +       __ret;                                                                 \
> +})
> +
> +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> +({                                                                            \
> +       int __ret = 0;                                                         \
> +       if (cgroup_bpf_enabled)                                                \
> +               __ret = __cgroup_bpf_run_filter_getsockopt(sock, level,        \
> +                                                          optname, optval,    \
> +                                                          optlen);            \
> +       __ret;                                                                 \
> +})
> +
>  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
>                            enum bpf_prog_type ptype, struct bpf_prog *prog);
>  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
>
>  #define for_each_cgroup_storage_type(stype) for (; false; )
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e5a309e6a400..fb4e6ef5a971 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
>  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>  extern const struct bpf_func_proto bpf_strtol_proto;
>  extern const struct bpf_func_proto bpf_strtoul_proto;
> +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> +extern const struct bpf_func_proto bpf_tcp_sock_proto;
>
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 5a9975678d6f..eec5aeeeaf92 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
>  #ifdef CONFIG_CGROUP_BPF
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
>  #endif
>  #ifdef CONFIG_BPF_LIRC_MODE2
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 43b45d6db36d..7a07fd2e14d3 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
>         u64 tmp_reg;
>  };
>
> +struct bpf_sockopt_kern {
> +       struct sock     *sk;
> +       s32             level;
> +       s32             optname;
> +       u32             optlen;
> +       u8              *optval;
> +       u8              *optval_end;
> +
> +       /* If true, BPF program has consumed the sockopt request.
> +        * Control is returned to userspace (i.e. the kernel doesn't
> +        * handle this option).
> +        */
> +       bool            handled;
> +
> +       /* Small on-stack optval buffer to avoid small allocations.
> +        */
> +       u8 buf[64];
> +};
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7c6aef253173..b6c3891241ef 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -170,6 +170,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_FLOW_DISSECTOR,
>         BPF_PROG_TYPE_CGROUP_SYSCTL,
>         BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> +       BPF_PROG_TYPE_CGROUP_SOCKOPT,
>  };
>
>  enum bpf_attach_type {
> @@ -192,6 +193,8 @@ enum bpf_attach_type {
>         BPF_LIRC_MODE2,
>         BPF_FLOW_DISSECTOR,
>         BPF_CGROUP_SYSCTL,
> +       BPF_CGROUP_GETSOCKOPT,
> +       BPF_CGROUP_SETSOCKOPT,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -2815,7 +2818,8 @@ union bpf_attr {
>         FN(strtoul),                    \
>         FN(sk_storage_get),             \
>         FN(sk_storage_delete),          \
> -       FN(send_signal),
> +       FN(send_signal),                \
> +       FN(sockopt_handled),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
>                                  */
>  };
>
> +struct bpf_sockopt {
> +       __bpf_md_ptr(struct bpf_sock *, sk);
> +
> +       __s32   level;
> +       __s32   optname;
> +
> +       __u32   optlen;
> +       __u32   optval;
> +       __u32   optval_end;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 1b65ab0df457..4ec99ea97023 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -18,6 +18,7 @@
>  #include <linux/bpf.h>
>  #include <linux/bpf-cgroup.h>
>  #include <net/sock.h>
> +#include <net/bpf_sk_storage.h>
>
>  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
>  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
>  }
>  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
>
> +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> +                                       enum bpf_attach_type attach_type)
> +{
> +       struct bpf_prog_array *prog_array;
> +       int nr;
> +
> +       rcu_read_lock();
> +       prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> +       nr = bpf_prog_array_length(prog_array);
> +       rcu_read_unlock();
> +
> +       return nr > 0;
> +}
> +
> +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> +{
> +       if (unlikely(max_optlen > PAGE_SIZE))
> +               return -EINVAL;
> +
> +       if (likely(max_optlen <= sizeof(ctx->buf))) {
> +               ctx->optval = ctx->buf;
> +       } else {
> +               ctx->optval = kzalloc(max_optlen, GFP_USER);
> +               if (!ctx->optval)
> +                       return -ENOMEM;
> +       }
> +
> +       ctx->optval_end = ctx->optval + max_optlen;
> +       ctx->optlen = max_optlen;
> +
> +       return 0;
> +}
> +
> +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> +{
> +       if (unlikely(ctx->optval != ctx->buf))
> +               kfree(ctx->optval);
> +}
> +
> +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> +                                      int optname, char __user *optval,
> +                                      unsigned int optlen)
> +{
> +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +       struct bpf_sockopt_kern ctx = {
> +               .sk = sk,
> +               .level = level,
> +               .optname = optname,
> +       };
> +       int ret;
> +
> +       /* Opportunistic check to see whether we have any BPF program
> +        * attached to the hook so we don't waste time allocating
> +        * memory and locking the socket.
> +        */
> +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> +               return 0;
> +
> +       ret = sockopt_alloc_buf(&ctx, optlen);
> +       if (ret)
> +               return ret;
> +
> +       if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> +               sockopt_free_buf(&ctx);
> +               return -EFAULT;
> +       }
> +
> +       lock_sock(sk);
> +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> +                                &ctx, BPF_PROG_RUN);
> +       release_sock(sk);
> +
> +       sockopt_free_buf(&ctx);
> +
> +       if (!ret)
> +               return -EPERM;
> +
> +       return ctx.handled ? 1 : 0;
> +}
> +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> +
> +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> +                                      int optname, char __user *optval,
> +                                      int __user *optlen)
> +{
> +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +       struct bpf_sockopt_kern ctx = {
> +               .sk = sk,
> +               .level = level,
> +               .optname = optname,
> +       };
> +       int max_optlen;
> +       char buf[64];
> +       int ret;
> +
> +       /* Opportunistic check to see whether we have any BPF program
> +        * attached to the hook so we don't waste time allocating
> +        * memory and locking the socket.
> +        */
> +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> +               return 0;
> +
> +       if (get_user(max_optlen, optlen))
> +               return -EFAULT;
> +
> +       ret = sockopt_alloc_buf(&ctx, max_optlen);
> +       if (ret)
> +               return ret;
> +
> +       lock_sock(sk);
> +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> +                                &ctx, BPF_PROG_RUN);
> +       release_sock(sk);
> +
> +       if (ctx.optlen > max_optlen) {
> +               sockopt_free_buf(&ctx);
> +               return -EFAULT;
> +       }
> +
> +       if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> +               sockopt_free_buf(&ctx);
> +               return -EFAULT;
> +       }
> +
> +       sockopt_free_buf(&ctx);
> +
> +       if (put_user(ctx.optlen, optlen))
> +               return -EFAULT;
> +
> +       if (!ret)
> +               return -EPERM;
> +
> +       return ctx.handled ? 1 : 0;
> +}
> +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> +
>  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
>                               size_t *lenp)
>  {
> @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
>
>  const struct bpf_prog_ops cg_sysctl_prog_ops = {
>  };
> +
> +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> +{
> +       ctx->handled = true;
> +       return 1;
> +}
> +
> +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> +       .func           = bpf_sockopt_handled,
> +       .gpl_only       = false,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .ret_type       = RET_INTEGER,
> +};
> +
> +static const struct bpf_func_proto *
> +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_sockopt_handled:
> +               return &bpf_sockopt_handled_proto;
> +       case BPF_FUNC_sk_fullsock:
> +               return &bpf_sk_fullsock_proto;
> +       case BPF_FUNC_sk_storage_get:
> +               return &bpf_sk_storage_get_proto;
> +       case BPF_FUNC_sk_storage_delete:
> +               return &bpf_sk_storage_delete_proto;
> +#ifdef CONFIG_INET
> +       case BPF_FUNC_tcp_sock:
> +               return &bpf_tcp_sock_proto;
> +#endif
> +       default:
> +               return cgroup_base_func_proto(func_id, prog);
> +       }
> +}
> +
> +static bool cg_sockopt_is_valid_access(int off, int size,
> +                                      enum bpf_access_type type,
> +                                      const struct bpf_prog *prog,
> +                                      struct bpf_insn_access_aux *info)
> +{
> +       const int size_default = sizeof(__u32);
> +
> +       if (off < 0 || off >= sizeof(struct bpf_sockopt))
> +               return false;
> +
> +       if (off % size != 0)
> +               return false;
> +
> +       if (type == BPF_WRITE) {
> +               switch (off) {
> +               case offsetof(struct bpf_sockopt, optlen):
> +                       if (size != size_default)
> +                               return false;
> +                       return prog->expected_attach_type ==
> +                               BPF_CGROUP_GETSOCKOPT;
> +               default:
> +                       return false;
> +               }
> +       }
> +
> +       switch (off) {
> +       case offsetof(struct bpf_sockopt, sk):
> +               if (size != sizeof(__u64))
> +                       return false;
> +               info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> +               break;
> +       case bpf_ctx_range(struct bpf_sockopt, optval):
> +               if (size != size_default)
> +                       return false;
> +               info->reg_type = PTR_TO_PACKET;
> +               break;
> +       case bpf_ctx_range(struct bpf_sockopt, optval_end):
> +               if (size != size_default)
> +                       return false;
> +               info->reg_type = PTR_TO_PACKET_END;
> +               break;
> +       default:
> +               if (size != size_default)
> +                       return false;
> +               break;

nit, just:

return size == size_default

?

> +       }
> +       return true;
> +}
> +
> +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> +                                        const struct bpf_insn *si,
> +                                        struct bpf_insn *insn_buf,
> +                                        struct bpf_prog *prog,
> +                                        u32 *target_size)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (si->off) {
> +       case offsetof(struct bpf_sockopt, sk):
> +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> +                                     si->dst_reg, si->src_reg,
> +                                     offsetof(struct bpf_sockopt_kern, sk));
> +               break;
> +       case offsetof(struct bpf_sockopt, level):
> +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +                                     bpf_target_off(struct bpf_sockopt_kern,
> +                                                    level, 4, target_size));
> +               break;
> +       case offsetof(struct bpf_sockopt, optname):
> +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +                                     bpf_target_off(struct bpf_sockopt_kern,
> +                                                    optname, 4, target_size));
> +               break;
> +       case offsetof(struct bpf_sockopt, optlen):
> +               if (type == BPF_WRITE)
> +                       *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +                                             bpf_target_off(struct bpf_sockopt_kern,
> +                                                            optlen, 4, target_size));
> +               else
> +                       *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +                                             bpf_target_off(struct bpf_sockopt_kern,
> +                                                            optlen, 4, target_size));
> +               break;
> +       case offsetof(struct bpf_sockopt, optval):
> +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> +                                     si->dst_reg, si->src_reg,
> +                                     offsetof(struct bpf_sockopt_kern, optval));
> +               break;
> +       case offsetof(struct bpf_sockopt, optval_end):
> +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> +                                     si->dst_reg, si->src_reg,
> +                                     offsetof(struct bpf_sockopt_kern, optval_end));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
> +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> +                                  bool direct_write,
> +                                  const struct bpf_prog *prog)
> +{
> +       /* Nothing to do for the sockopt argument. The data is kzalloc'ed.
> +        */
> +       return 0;
> +}
> +
> +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> +       .get_func_proto         = cg_sockopt_func_proto,
> +       .is_valid_access        = cg_sockopt_is_valid_access,
> +       .convert_ctx_access     = cg_sockopt_convert_ctx_access,
> +       .gen_prologue           = cg_sockopt_get_prologue,
> +};
> +
> +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 4c53cbd3329d..4ad2b5f1905f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
>                 default:
>                         return -EINVAL;
>                 }
> +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> +               switch (expected_attach_type) {
> +               case BPF_CGROUP_SETSOCKOPT:
> +               case BPF_CGROUP_GETSOCKOPT:
> +                       return 0;
> +               default:
> +                       return -EINVAL;
> +               }
>         default:
>                 return 0;
>         }
> @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
>         switch (prog->type) {
>         case BPF_PROG_TYPE_CGROUP_SOCK:
>         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>                 return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
>         case BPF_PROG_TYPE_CGROUP_SKB:
>                 return prog->enforce_expected_attach_type &&
> @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         case BPF_CGROUP_SYSCTL:
>                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
>                 break;
> +       case BPF_CGROUP_GETSOCKOPT:
> +       case BPF_CGROUP_SETSOCKOPT:
> +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> +               break;
>         default:
>                 return -EINVAL;
>         }
> @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>         case BPF_CGROUP_SYSCTL:
>                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
>                 break;
> +       case BPF_CGROUP_GETSOCKOPT:
> +       case BPF_CGROUP_SETSOCKOPT:
> +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> +               break;
>         default:
>                 return -EINVAL;
>         }
> @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
>         case BPF_CGROUP_SOCK_OPS:
>         case BPF_CGROUP_DEVICE:
>         case BPF_CGROUP_SYSCTL:
> +       case BPF_CGROUP_GETSOCKOPT:
> +       case BPF_CGROUP_SETSOCKOPT:
>                 break;
>         case BPF_LIRC_MODE2:
>                 return lirc_prog_query(attr, uattr);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5c2cb5bd84ce..b91fde10e721 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>
>                 env->seen_direct_write = true;
>                 return true;
> +
> +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> +               if (t == BPF_WRITE) {
> +                       if (env->prog->expected_attach_type ==
> +                           BPF_CGROUP_GETSOCKOPT) {
> +                               env->seen_direct_write = true;
> +                               return true;
> +                       }
> +                       return false;
> +               }
> +               return true;
> +
>         default:
>                 return false;
>         }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 55bfc941d17a..4652c0a005ca 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
>         return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
>  }
>
> -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> +const struct bpf_func_proto bpf_sk_fullsock_proto = {
>         .func           = bpf_sk_fullsock,
>         .gpl_only       = false,
>         .ret_type       = RET_PTR_TO_SOCKET_OR_NULL,
> @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
>         return (unsigned long)NULL;
>  }
>
> -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> +const struct bpf_func_proto bpf_tcp_sock_proto = {
>         .func           = bpf_tcp_sock,
>         .gpl_only       = false,
>         .ret_type       = RET_PTR_TO_TCP_SOCK_OR_NULL,
> diff --git a/net/socket.c b/net/socket.c
> index 72372dc5dd70..e8654f1f70e6 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
>                 if (err)
>                         goto out_put;
>
> +               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> +                                                    optval, optlen);
> +               if (err < 0) {
> +                       goto out_put;
> +               } else if (err > 0) {
> +                       err = 0;
> +                       goto out_put;
> +               }
> +
>                 if (level == SOL_SOCKET)
>                         err =
>                             sock_setsockopt(sock, level, optname, optval,
> @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
>                 if (err)
>                         goto out_put;
>
> +               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> +                                                    optval, optlen);
> +               if (err < 0) {
> +                       goto out_put;
> +               } else if (err > 0) {
> +                       err = 0;
> +                       goto out_put;
> +               }
> +
>                 if (level == SOL_SOCKET)
>                         err =
>                             sock_getsockopt(sock, level, optname, optval,
> --
> 2.22.0.rc1.311.g5d7573a151-goog
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 19:17     ` Stanislav Fomichev
@ 2019-06-05 20:50       ` Martin Lau
  2019-06-05 21:16         ` Stanislav Fomichev
  0 siblings, 1 reply; 18+ messages in thread
From: Martin Lau @ 2019-06-05 20:50 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel

On Wed, Jun 05, 2019 at 12:17:24PM -0700, Stanislav Fomichev wrote:
> On 06/05, Martin Lau wrote:
> > On Tue, Jun 04, 2019 at 02:35:18PM -0700, Stanislav Fomichev wrote:
> > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > 
> > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > 
> > > The buffer memory is pre-allocated (because I don't think there is
> > > a precedent for working with __user memory from bpf). This might be
> > > slow to do for each {s,g}etsockopt call; that's why I've added
> > > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > > attached to a cgroup. Note, however, that there is a race between
> > > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > > program layout might have changed; this should not be a problem
> > > because in general there is a race between multiple calls to
> > > {s,g}etsockopt and a user adding/removing bpf progs from a cgroup.
> > > 
> > > By default, the kernel code path is executed after the hook (to let
> > > BPF handle only a subset of the options). There is a new
> > > bpf_sockopt_handled helper that returns control to userspace
> > > instead (bypassing the kernel handling).
> > > 
> > > The return code is either 1 (success) or 0 (EPERM).
> > > 
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  include/linux/bpf-cgroup.h |  29 ++++
> > >  include/linux/bpf.h        |   2 +
> > >  include/linux/bpf_types.h  |   1 +
> > >  include/linux/filter.h     |  19 +++
> > >  include/uapi/linux/bpf.h   |  17 ++-
> > >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> > >  kernel/bpf/syscall.c       |  19 +++
> > >  kernel/bpf/verifier.c      |  12 ++
> > >  net/core/filter.c          |   4 +-
> > >  net/socket.c               |  18 +++
> > >  10 files changed, 406 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > index b631ee75762d..406f1ba82531 100644
> > > --- a/include/linux/bpf-cgroup.h
> > > +++ b/include/linux/bpf-cgroup.h
> > > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > >  				   loff_t *ppos, void **new_buf,
> > >  				   enum bpf_attach_type type);
> > >  
> > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > > +				       int optname, char __user *optval,
> > > +				       unsigned int optlen);
> > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > > +				       int optname, char __user *optval,
> > > +				       int __user *optlen);
> > > +
> > >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> > >  	struct bpf_map *map)
> > >  {
> > > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> > >  	__ret;								       \
> > >  })
> > >  
> > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > +({									       \
> > > +	int __ret = 0;							       \
> > > +	if (cgroup_bpf_enabled)						       \
> > > +		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
> > > +							   optname, optval,    \
> > > +							   optlen);	       \
> > > +	__ret;								       \
> > > +})
> > > +
> > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > +({									       \
> > > +	int __ret = 0;							       \
> > > +	if (cgroup_bpf_enabled)						       \
> > > +		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
> > > +							   optname, optval,    \
> > > +							   optlen);	       \
> > > +	__ret;								       \
> > > +})
> > > +
> > >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> > >  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
> > >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> > >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> > >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> > >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > >  
> > >  #define for_each_cgroup_storage_type(stype) for (; false; )
> > >  
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index e5a309e6a400..fb4e6ef5a971 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> > >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > >  extern const struct bpf_func_proto bpf_strtol_proto;
> > >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > >  
> > >  /* Shared helpers among cBPF and eBPF. */
> > >  void bpf_user_rnd_init_once(void);
> > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > index 5a9975678d6f..eec5aeeeaf92 100644
> > > --- a/include/linux/bpf_types.h
> > > +++ b/include/linux/bpf_types.h
> > > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> > >  #ifdef CONFIG_CGROUP_BPF
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> > >  #endif
> > >  #ifdef CONFIG_BPF_LIRC_MODE2
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 43b45d6db36d..7a07fd2e14d3 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> > >  	u64 tmp_reg;
> > >  };
> > >  
> > > +struct bpf_sockopt_kern {
> > > +	struct sock	*sk;
> > > +	s32		level;
> > > +	s32		optname;
> > > +	u32		optlen;
It seems there is a hole.
> Ack, will move the pointers up.
> 
> > > +	u8		*optval;
> > > +	u8		*optval_end;
> > > +
> > > +	/* If true, BPF program has consumed the sockopt request.
> > > +	 * Control is returned to userspace (i.e. the kernel doesn't
> > > +	 * handle this option).
> > > +	 */
> > > +	bool		handled;
> > > +
> > > +	/* Small on-stack optval buffer to avoid small allocations.
> > > +	 */
> > > +	u8 buf[64];
> > Is it better to align to 8 bytes?
> Do you mean manually setting the size to 64 + x, where x is the remainder
> needed to align to 8 bytes? Is there maybe some macro to help with that?
I think __attribute__((aligned(8))) should do.
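
Something along these lines (untested sketch of just the field):

	/* Small on-stack optval buffer to avoid small allocations.
	 * Align to 8 bytes so the optval pointer handed to the prog
	 * stays 8-byte aligned regardless of struct layout.
	 */
	u8	buf[64] __attribute__((aligned(8)));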

> 
> > > +};
> > > +
> > >  #endif /* __LINUX_FILTER_H__ */
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 7c6aef253173..b6c3891241ef 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> > >  	BPF_PROG_TYPE_FLOW_DISSECTOR,
> > >  	BPF_PROG_TYPE_CGROUP_SYSCTL,
> > >  	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > > +	BPF_PROG_TYPE_CGROUP_SOCKOPT,
> > >  };
> > >  
> > >  enum bpf_attach_type {
> > > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> > >  	BPF_LIRC_MODE2,
> > >  	BPF_FLOW_DISSECTOR,
> > >  	BPF_CGROUP_SYSCTL,
> > > +	BPF_CGROUP_GETSOCKOPT,
> > > +	BPF_CGROUP_SETSOCKOPT,
> > >  	__MAX_BPF_ATTACH_TYPE
> > >  };
> > >  
> > > @@ -2815,7 +2818,8 @@ union bpf_attr {
> > >  	FN(strtoul),			\
> > >  	FN(sk_storage_get),		\
> > >  	FN(sk_storage_delete),		\
> > > -	FN(send_signal),
> > > +	FN(send_signal),		\
> > > +	FN(sockopt_handled),
> > Document.
> Ah, totally forgot about that, sure, will do!
> 
> > >  
> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > >   * function eBPF program intends to call
> > > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> > >  				 */
> > >  };
> > >  
> > > +struct bpf_sockopt {
> > > +	__bpf_md_ptr(struct bpf_sock *, sk);
> > > +
> > > +	__s32	level;
> > > +	__s32	optname;
> > > +
> > > +	__u32	optlen;
> > > +	__u32	optval;
> > > +	__u32	optval_end;
> > > +};
> > > +
> > >  #endif /* _UAPI__LINUX_BPF_H__ */
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index 1b65ab0df457..4ec99ea97023 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/bpf.h>
> > >  #include <linux/bpf-cgroup.h>
> > >  #include <net/sock.h>
> > > +#include <net/bpf_sk_storage.h>
> > >  
> > >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> > >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > >  }
> > >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> > >  
> > > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > > +					enum bpf_attach_type attach_type)
> > > +{
> > > +	struct bpf_prog_array *prog_array;
> > > +	int nr;
> > > +
> > > +	rcu_read_lock();
> > > +	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > > +	nr = bpf_prog_array_length(prog_array);
> > Nit. It seems unnecessary to loop through the whole
> > array if the only signal needed is non-zero.
> Oh, good point. I guess I'd have to add another helper like
> bpf_prog_array_is_empty() and return early. Any other suggestions?
I was thinking of checking against empty_prog_array on top, but that
is overkill, so I didn't mention it.  I think just returning early is
good enough.

I think this non-zero check is good to have before doing lock_sock().
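
For the early return, something like this could work (untested sketch;
assuming it lives in kernel/bpf/core.c next to bpf_prog_array_length(),
where dummy_bpf_prog is visible):

	bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
	{
		struct bpf_prog_array_item *item;

		/* ignore dummy progs left behind by detach */
		for (item = array->items; item->prog; item++)
			if (item->prog != &dummy_bpf_prog.prog)
				return false;

		return true;
	}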

> 
> > > +	rcu_read_unlock();
> > > +
> > > +	return nr > 0;
> > > +}
> > > +
> > > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > > +{
> > > +	if (unlikely(max_optlen > PAGE_SIZE))
> > > +		return -EINVAL;
> > > +
> > > +	if (likely(max_optlen <= sizeof(ctx->buf))) {
> > > +		ctx->optval = ctx->buf;
> > > +	} else {
> > > +		ctx->optval = kzalloc(max_optlen, GFP_USER);
> > > +		if (!ctx->optval)
> > > +			return -ENOMEM;
> > > +	}
> > > +
> > > +	ctx->optval_end = ctx->optval + max_optlen;
> > > +	ctx->optlen = max_optlen;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > > +{
> > > +	if (unlikely(ctx->optval != ctx->buf))
> > > +		kfree(ctx->optval);
> > > +}
> > > +
> > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > > +				       int optname, char __user *optval,
> > > +				       unsigned int optlen)
> > > +{
> > > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > +	struct bpf_sockopt_kern ctx = {
> > > +		.sk = sk,
> > > +		.level = level,
> > > +		.optname = optname,
> > > +	};
> > > +	int ret;
> > > +
> > > +	/* Opportunistic check to see whether we have any BPF program
> > > +	 * attached to the hook so we don't waste time allocating
> > > +	 * memory and locking the socket.
> > > +	 */
> > > +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> > > +		return 0;
> > > +
> > > +	ret = sockopt_alloc_buf(&ctx, optlen);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > > +		sockopt_free_buf(&ctx);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	lock_sock(sk);
> > > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > > +				 &ctx, BPF_PROG_RUN);
> > I think the check_return_code() in verifier.c has to be
> > adjusted also.
> Good catch! I thought it did the [0,1] check by default.
btw, it just came to my mind: did you have a chance to
look at how 'ret' is handled in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY()?
It can take return values other than 0 or 1.  I am thinking
ctx.handled could also be folded into the 'ret' itself,
but off the top of my head I think your current way, "bpf_sockopt_handled()",
may be cleaner.

> 
> > > +	release_sock(sk);
> > > +
> > > +	sockopt_free_buf(&ctx);
> > > +
> > > +	if (!ret)
> > > +		return -EPERM;
> > > +
> > > +	return ctx.handled ? 1 : 0;
> > > +}
> > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > > +
> > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > > +				       int optname, char __user *optval,
> > > +				       int __user *optlen)
> > > +{
> > > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > +	struct bpf_sockopt_kern ctx = {
> > > +		.sk = sk,
> > > +		.level = level,
> > > +		.optname = optname,
> > > +	};
> > > +	int max_optlen;
> > > +	char buf[64];
> > hmm... where is it used?
> It's a leftover from my initial attempt to have a small buffer on the stack.
> I've since moved it into struct bpf_sockopt_kern. Will remove. GCC even
> complains about the unused var; not sure how I missed that...
> 
> > > +	int ret;
> > > +
> > > +	/* Opportunistic check to see whether we have any BPF program
> > > +	 * attached to the hook so we don't waste time allocating
> > > +	 * memory and locking the socket.
> > > +	 */
> > > +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> > > +		return 0;
> > > +
> > > +	if (get_user(max_optlen, optlen))
> > > +		return -EFAULT;
> > > +
> > > +	ret = sockopt_alloc_buf(&ctx, max_optlen);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	lock_sock(sk);
> > > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > > +				 &ctx, BPF_PROG_RUN);
> > > +	release_sock(sk);
> > > +
> > > +	if (ctx.optlen > max_optlen) {
> > > +		sockopt_free_buf(&ctx);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > > +		sockopt_free_buf(&ctx);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	sockopt_free_buf(&ctx);
> > > +
> > > +	if (put_user(ctx.optlen, optlen))
> > > +		return -EFAULT;
> > > +
> > > +	if (!ret)
> > > +		return -EPERM;
> > > +
> > > +	return ctx.handled ? 1 : 0;
> > > +}
> > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > > +
> > >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
> > >  			      size_t *lenp)
> > >  {
> > > @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
> > >  
> > >  const struct bpf_prog_ops cg_sysctl_prog_ops = {
> > >  };
> > > +
> > > +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> > > +{
> > > +	ctx->handled = true;
> > > +	return 1;
> > RET_VOID?
> I was thinking that in the C code the pattern can be:
> {
> 	...
> 	return bpf_sockopt_handled();
> }
> 
> That's why I'm returning 1 from the helper. But I can change it to VOID
> so that users have to return 1 manually. That's probably cleaner, will
> change.
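
FWIW, the full pattern I had in mind looks something like this (sketch;
section name per the libbpf patch, option names are just an example):

	SEC("cgroup/setsockopt")
	int handle_setsockopt(struct bpf_sockopt *ctx)
	{
		if (ctx->level == SOL_IP && ctx->optname == IP_TOS)
			/* consume the option, skip kernel handling */
			return bpf_sockopt_handled(ctx);

		return 1;	/* fall through to the kernel code path */
	}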
> 
> > > +}
> > > +
> > > +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> > > +	.func		= bpf_sockopt_handled,
> > > +	.gpl_only	= false,
> > > +	.arg1_type      = ARG_PTR_TO_CTX,
> > > +	.ret_type	= RET_INTEGER,
> > > +};
> > > +
> > > +static const struct bpf_func_proto *
> > > +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > +{
> > > +	switch (func_id) {
> > > +	case BPF_FUNC_sockopt_handled:
> > > +		return &bpf_sockopt_handled_proto;
> > > +	case BPF_FUNC_sk_fullsock:
> > > +		return &bpf_sk_fullsock_proto;
> > > +	case BPF_FUNC_sk_storage_get:
> > > +		return &bpf_sk_storage_get_proto;
> > > +	case BPF_FUNC_sk_storage_delete:
> > > +		return &bpf_sk_storage_delete_proto;
> > > +#ifdef CONFIG_INET
> > > +	case BPF_FUNC_tcp_sock:
> > > +		return &bpf_tcp_sock_proto;
> > > +#endif
> > > +	default:
> > > +		return cgroup_base_func_proto(func_id, prog);
> > > +	}
> > > +}
> > > +
> > > +static bool cg_sockopt_is_valid_access(int off, int size,
> > > +				       enum bpf_access_type type,
> > > +				       const struct bpf_prog *prog,
> > > +				       struct bpf_insn_access_aux *info)
> > > +{
> > > +	const int size_default = sizeof(__u32);
> > > +
> > > +	if (off < 0 || off >= sizeof(struct bpf_sockopt))
> > > +		return false;
> > > +
> > > +	if (off % size != 0)
> > > +		return false;
> > > +
> > > +	if (type == BPF_WRITE) {
> > > +		switch (off) {
> > > +		case offsetof(struct bpf_sockopt, optlen):
> > > +			if (size != size_default)
> > > +				return false;
> > > +			return prog->expected_attach_type ==
> > > +				BPF_CGROUP_GETSOCKOPT;
> > > +		default:
> > > +			return false;
> > > +		}
> > > +	}
> > > +
> > > +	switch (off) {
> > > +	case offsetof(struct bpf_sockopt, sk):
> > > +		if (size != sizeof(__u64))
> > > +			return false;
> > > +		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> > sk cannot be NULL, so the OR_NULL part is not needed.
> > 
> > I think it should also be PTR_TO_SOCKET instead.
> I think you're correct. That reminds me that I haven't
> properly tested it. Let me add a small C
> selftest where I test this codepath.
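
Something minimal like this should exercise it (sketch):

	SEC("cgroup/getsockopt")
	int getsockopt_sk(struct bpf_sockopt *ctx)
	{
		struct bpf_sock *sk = ctx->sk;

		/* dereference ctx->sk so the verifier has to apply
		 * whatever reg_type we pick for it
		 */
		if (sk && sk->family != AF_INET)
			return 0;

		return 1;
	}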
> 
> > > +		break;
> > > +	case bpf_ctx_range(struct bpf_sockopt, optval):
> > > +		if (size != size_default)
> > > +			return false;
> > > +		info->reg_type = PTR_TO_PACKET;
> > > +		break;
> > > +	case bpf_ctx_range(struct bpf_sockopt, optval_end):
> > > +		if (size != size_default)
> > > +			return false;
> > > +		info->reg_type = PTR_TO_PACKET_END;
> > > +		break;
> > > +	default:
> > > +		if (size != size_default)
> > > +			return false;
> > > +		break;
> > > +	}
> > > +	return true;
> > > +}
> > > +
> > > +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> > > +					 const struct bpf_insn *si,
> > > +					 struct bpf_insn *insn_buf,
> > > +					 struct bpf_prog *prog,
> > > +					 u32 *target_size)
> > > +{
> > > +	struct bpf_insn *insn = insn_buf;
> > > +
> > > +	switch (si->off) {
> > > +	case offsetof(struct bpf_sockopt, sk):
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> > > +				      si->dst_reg, si->src_reg,
> > > +				      offsetof(struct bpf_sockopt_kern, sk));
> > > +		break;
> > > +	case offsetof(struct bpf_sockopt, level):
> > > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +				      bpf_target_off(struct bpf_sockopt_kern,
> > > +						     level, 4, target_size));
> > bpf_target_off() is not needed since there is no narrow load.
> Good point, will drop it.
> 
> Thank you for a review!
> 
> > > +		break;
> > > +	case offsetof(struct bpf_sockopt, optname):
> > > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +				      bpf_target_off(struct bpf_sockopt_kern,
> > > +						     optname, 4, target_size));
> > > +		break;
> > > +	case offsetof(struct bpf_sockopt, optlen):
> > > +		if (type == BPF_WRITE)
> > > +			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +					      bpf_target_off(struct bpf_sockopt_kern,
> > > +							     optlen, 4, target_size));
> > > +		else
> > > +			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +					      bpf_target_off(struct bpf_sockopt_kern,
> > > +							     optlen, 4, target_size));
> > > +		break;
> > > +	case offsetof(struct bpf_sockopt, optval):
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> > > +				      si->dst_reg, si->src_reg,
> > > +				      offsetof(struct bpf_sockopt_kern, optval));
> > > +		break;
> > > +	case offsetof(struct bpf_sockopt, optval_end):
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> > > +				      si->dst_reg, si->src_reg,
> > > +				      offsetof(struct bpf_sockopt_kern, optval_end));
> > > +		break;
> > > +	}
> > > +
> > > +	return insn - insn_buf;
> > > +}
> > > +
> > > +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> > > +				   bool direct_write,
> > > +				   const struct bpf_prog *prog)
> > > +{
> > > +	/* Nothing to do for the sockopt argument. The data is kzalloc'ed.
> > > +	 */
> > > +	return 0;
> > > +}
> > > +
> > > +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> > > +	.get_func_proto		= cg_sockopt_func_proto,
> > > +	.is_valid_access	= cg_sockopt_is_valid_access,
> > > +	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
> > > +	.gen_prologue		= cg_sockopt_get_prologue,
> > > +};
> > > +
> > > +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> > > +};
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 4c53cbd3329d..4ad2b5f1905f 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> > >  		default:
> > >  			return -EINVAL;
> > >  		}
> > > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > +		switch (expected_attach_type) {
> > > +		case BPF_CGROUP_SETSOCKOPT:
> > > +		case BPF_CGROUP_GETSOCKOPT:
> > > +			return 0;
> > > +		default:
> > > +			return -EINVAL;
> > > +		}
> > >  	default:
> > >  		return 0;
> > >  	}
> > > @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> > >  	switch (prog->type) {
> > >  	case BPF_PROG_TYPE_CGROUP_SOCK:
> > >  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > >  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> > >  	case BPF_PROG_TYPE_CGROUP_SKB:
> > >  		return prog->enforce_expected_attach_type &&
> > > @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> > >  	case BPF_CGROUP_SYSCTL:
> > >  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > >  		break;
> > > +	case BPF_CGROUP_GETSOCKOPT:
> > > +	case BPF_CGROUP_SETSOCKOPT:
> > > +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > +		break;
> > >  	default:
> > >  		return -EINVAL;
> > >  	}
> > > @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> > >  	case BPF_CGROUP_SYSCTL:
> > >  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > >  		break;
> > > +	case BPF_CGROUP_GETSOCKOPT:
> > > +	case BPF_CGROUP_SETSOCKOPT:
> > > +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > +		break;
> > >  	default:
> > >  		return -EINVAL;
> > >  	}
> > > @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> > >  	case BPF_CGROUP_SOCK_OPS:
> > >  	case BPF_CGROUP_DEVICE:
> > >  	case BPF_CGROUP_SYSCTL:
> > > +	case BPF_CGROUP_GETSOCKOPT:
> > > +	case BPF_CGROUP_SETSOCKOPT:
> > >  		break;
> > >  	case BPF_LIRC_MODE2:
> > >  		return lirc_prog_query(attr, uattr);
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 5c2cb5bd84ce..b91fde10e721 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> > >  
> > >  		env->seen_direct_write = true;
> > >  		return true;
> > > +
> > > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > +		if (t == BPF_WRITE) {
> > > +			if (env->prog->expected_attach_type ==
> > > +			    BPF_CGROUP_GETSOCKOPT) {
> > > +				env->seen_direct_write = true;
> > > +				return true;
> > > +			}
> > > +			return false;
> > > +		}
> > > +		return true;
> > > +
> > >  	default:
> > >  		return false;
> > >  	}
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 55bfc941d17a..4652c0a005ca 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> > >  	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> > >  }
> > >  
> > > -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > > +const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > >  	.func		= bpf_sk_fullsock,
> > >  	.gpl_only	= false,
> > >  	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
> > > @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
> > >  	return (unsigned long)NULL;
> > >  }
> > >  
> > > -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > > +const struct bpf_func_proto bpf_tcp_sock_proto = {
> > >  	.func		= bpf_tcp_sock,
> > >  	.gpl_only	= false,
> > >  	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
> > > diff --git a/net/socket.c b/net/socket.c
> > > index 72372dc5dd70..e8654f1f70e6 100644
> > > --- a/net/socket.c
> > > +++ b/net/socket.c
> > > @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
> > >  		if (err)
> > >  			goto out_put;
> > >  
> > > +		err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> > > +						     optval, optlen);
> > > +		if (err < 0) {
> > > +			goto out_put;
> > > +		} else if (err > 0) {
> > > +			err = 0;
> > > +			goto out_put;
> > > +		}
> > > +
> > >  		if (level == SOL_SOCKET)
> > >  			err =
> > >  			    sock_setsockopt(sock, level, optname, optval,
> > > @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
> > >  		if (err)
> > >  			goto out_put;
> > >  
> > > +		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> > > +						     optval, optlen);
> > > +		if (err < 0) {
> > > +			goto out_put;
> > > +		} else if (err > 0) {
> > > +			err = 0;
> > > +			goto out_put;
> > > +		}
> > > +
> > >  		if (level == SOL_SOCKET)
> > >  			err =
> > >  			    sock_getsockopt(sock, level, optname, optval,
> > > -- 
> > > 2.22.0.rc1.311.g5d7573a151-goog
> > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 19:32   ` Andrii Nakryiko
@ 2019-06-05 20:54     ` Stanislav Fomichev
  2019-06-05 21:12       ` Andrii Nakryiko
  0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-05 20:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Stanislav Fomichev, Networking, bpf, davem, Alexei Starovoitov,
	Daniel Borkmann

On 06/05, Andrii Nakryiko wrote:
> On Tue, Jun 4, 2019 at 2:35 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> >
> > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> >
> > The buffer memory is pre-allocated (because I don't think there is
> > a precedent for working with __user memory from bpf). This might be
> 
> Is there any harm or technical complication in allowing BPF to read
> user memory directly? Or is it just uncharted territory, so there is
> no "guideline"? If it's the latter, it could be a good time to discuss
> that :)
The naive implementation would have two helpers: one to copy from user,
another to copy back to user; both of them would use something like
get_user/put_user, which can fault. Since we are running bpf progs with
preemption disabled and inside an RCU read section, we would need to do
something like we currently do in bpf_probe_read, where we disable pagefaults.

To me it felt a bit excessive for a socket options hook: a simple data
buffer is easier to work with from a BPF program, and we have all the
machinery in place in the verifier. But I'm open to suggestions :-)
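
Roughly, the copy-from helper would look like this (completely untested,
name made up, mirroring how bpf_probe_read deals with faulting accesses):

	BPF_CALL_3(bpf_sockopt_copy_from_user, void *, dst, u32, size,
		   const void __user *, unsafe_ptr)
	{
		int ret;

		/* we may run with preemption/RCU constraints, so the
		 * copy must not sleep on a fault
		 */
		pagefault_disable();
		ret = __copy_from_user_inatomic(dst, unsafe_ptr, size);
		pagefault_enable();

		if (unlikely(ret))
			memset(dst, 0, size);

		return ret;
	}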

> > slow to do for each {s,g}etsockopt call; that's why I've added
> > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > attached to a cgroup. Note, however, that there is a race between
> > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > program layout might have changed; this should not be a problem
> > because in general there is a race between multiple calls to
> > {s,g}etsockopt and a user adding/removing bpf progs from a cgroup.
> >
> > By default, the kernel code path is executed after the hook (to let
> > BPF handle only a subset of the options). There is a new
> > bpf_sockopt_handled helper that returns control to userspace
> > instead (bypassing the kernel handling).
> >
> > The return code is either 1 (success) or 0 (EPERM).
> 
> Why not have 3 return values: success, disallow, consumed/bypass
> kernel logic, instead of having an extra side-effecting helper?
That is an option. I didn't go that route because I wanted to
reuse BPF_PROG_RUN_ARRAY which has the following inside:

	ret = 1;
	while (prog)
		ret &= bpf_prog..()

But given that we now have the BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
precedent, maybe that's worth it (essentially, have a
BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY that handles 0, 1 and 2)?
I don't have a strong opinion here to be honest.
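
If we go the 0/1/2 route, the folding could look roughly like this
(untested sketch, names made up; RCU/locking elided):

	/* 0 = reject (-EPERM), 1 = run the kernel path,
	 * 2 = consumed, skip the kernel path
	 */
	static int sockopt_run_progs(struct bpf_prog_array_item *item,
				     struct bpf_sockopt_kern *ctx)
	{
		bool consumed = false;
		u32 ret;

		for (; item->prog; item++) {
			ret = BPF_PROG_RUN(item->prog, ctx);
			if (!ret)
				return -EPERM;
			if (ret == 2)
				consumed = true;
		}

		return consumed ? 1 : 0;
	}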

> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/bpf-cgroup.h |  29 ++++
> >  include/linux/bpf.h        |   2 +
> >  include/linux/bpf_types.h  |   1 +
> >  include/linux/filter.h     |  19 +++
> >  include/uapi/linux/bpf.h   |  17 ++-
> >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> >  kernel/bpf/syscall.c       |  19 +++
> >  kernel/bpf/verifier.c      |  12 ++
> >  net/core/filter.c          |   4 +-
> >  net/socket.c               |  18 +++
> >  10 files changed, 406 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > index b631ee75762d..406f1ba82531 100644
> > --- a/include/linux/bpf-cgroup.h
> > +++ b/include/linux/bpf-cgroup.h
> > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> >                                    loff_t *ppos, void **new_buf,
> >                                    enum bpf_attach_type type);
> >
> > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > +                                      int optname, char __user *optval,
> > +                                      unsigned int optlen);
> > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > +                                      int optname, char __user *optval,
> > +                                      int __user *optlen);
> > +
> >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> >         struct bpf_map *map)
> >  {
> > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> >         __ret;                                                                 \
> >  })
> >
> > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > +({                                                                            \
> > +       int __ret = 0;                                                         \
> > +       if (cgroup_bpf_enabled)                                                \
> > +               __ret = __cgroup_bpf_run_filter_setsockopt(sock, level,        \
> > +                                                          optname, optval,    \
> > +                                                          optlen);            \
> > +       __ret;                                                                 \
> > +})
> > +
> > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > +({                                                                            \
> > +       int __ret = 0;                                                         \
> > +       if (cgroup_bpf_enabled)                                                \
> > +               __ret = __cgroup_bpf_run_filter_getsockopt(sock, level,        \
> > +                                                          optname, optval,    \
> > +                                                          optlen);            \
> > +       __ret;                                                                 \
> > +})
> > +
> >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> >                            enum bpf_prog_type ptype, struct bpf_prog *prog);
> >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> >
> >  #define for_each_cgroup_storage_type(stype) for (; false; )
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index e5a309e6a400..fb4e6ef5a971 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> >  extern const struct bpf_func_proto bpf_strtol_proto;
> >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> >
> >  /* Shared helpers among cBPF and eBPF. */
> >  void bpf_user_rnd_init_once(void);
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 5a9975678d6f..eec5aeeeaf92 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> >  #ifdef CONFIG_CGROUP_BPF
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> >  #endif
> >  #ifdef CONFIG_BPF_LIRC_MODE2
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 43b45d6db36d..7a07fd2e14d3 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> >         u64 tmp_reg;
> >  };
> >
> > +struct bpf_sockopt_kern {
> > +       struct sock     *sk;
> > +       s32             level;
> > +       s32             optname;
> > +       u32             optlen;
> > +       u8              *optval;
> > +       u8              *optval_end;
> > +
> > +       /* If true, BPF program has consumed the sockopt request.
> > +        * Control is returned to userspace (i.e. the kernel doesn't
> > +        * handle this option).
> > +        */
> > +       bool            handled;
> > +
> > +       /* Small on-stack optval buffer to avoid small allocations.
> > +        */
> > +       u8 buf[64];
> > +};
> > +
> >  #endif /* __LINUX_FILTER_H__ */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 7c6aef253173..b6c3891241ef 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_FLOW_DISSECTOR,
> >         BPF_PROG_TYPE_CGROUP_SYSCTL,
> >         BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > +       BPF_PROG_TYPE_CGROUP_SOCKOPT,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> >         BPF_LIRC_MODE2,
> >         BPF_FLOW_DISSECTOR,
> >         BPF_CGROUP_SYSCTL,
> > +       BPF_CGROUP_GETSOCKOPT,
> > +       BPF_CGROUP_SETSOCKOPT,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -2815,7 +2818,8 @@ union bpf_attr {
> >         FN(strtoul),                    \
> >         FN(sk_storage_get),             \
> >         FN(sk_storage_delete),          \
> > -       FN(send_signal),
> > +       FN(send_signal),                \
> > +       FN(sockopt_handled),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> >                                  */
> >  };
> >
> > +struct bpf_sockopt {
> > +       __bpf_md_ptr(struct bpf_sock *, sk);
> > +
> > +       __s32   level;
> > +       __s32   optname;
> > +
> > +       __u32   optlen;
> > +       __u32   optval;
> > +       __u32   optval_end;
> > +};
> > +
> >  #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index 1b65ab0df457..4ec99ea97023 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/bpf.h>
> >  #include <linux/bpf-cgroup.h>
> >  #include <net/sock.h>
> > +#include <net/bpf_sk_storage.h>
> >
> >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> >  }
> >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> >
> > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > +                                       enum bpf_attach_type attach_type)
> > +{
> > +       struct bpf_prog_array *prog_array;
> > +       int nr;
> > +
> > +       rcu_read_lock();
> > +       prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > +       nr = bpf_prog_array_length(prog_array);
> > +       rcu_read_unlock();
> > +
> > +       return nr > 0;
> > +}
> > +
> > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > +{
> > +       if (unlikely(max_optlen > PAGE_SIZE))
> > +               return -EINVAL;
> > +
> > +       if (likely(max_optlen <= sizeof(ctx->buf))) {
> > +               ctx->optval = ctx->buf;
> > +       } else {
> > +               ctx->optval = kzalloc(max_optlen, GFP_USER);
> > +               if (!ctx->optval)
> > +                       return -ENOMEM;
> > +       }
> > +
> > +       ctx->optval_end = ctx->optval + max_optlen;
> > +       ctx->optlen = max_optlen;
> > +
> > +       return 0;
> > +}
> > +
> > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > +{
> > +       if (unlikely(ctx->optval != ctx->buf))
> > +               kfree(ctx->optval);
> > +}
> > +
> > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > +                                      int optname, char __user *optval,
> > +                                      unsigned int optlen)
> > +{
> > +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +       struct bpf_sockopt_kern ctx = {
> > +               .sk = sk,
> > +               .level = level,
> > +               .optname = optname,
> > +       };
> > +       int ret;
> > +
> > +       /* Opportunistic check to see whether we have any BPF program
> > +        * attached to the hook so we don't waste time allocating
> > +        * memory and locking the socket.
> > +        */
> > +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> > +               return 0;
> > +
> > +       ret = sockopt_alloc_buf(&ctx, optlen);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > +               sockopt_free_buf(&ctx);
> > +               return -EFAULT;
> > +       }
> > +
> > +       lock_sock(sk);
> > +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > +                                &ctx, BPF_PROG_RUN);
> > +       release_sock(sk);
> > +
> > +       sockopt_free_buf(&ctx);
> > +
> > +       if (!ret)
> > +               return -EPERM;
> > +
> > +       return ctx.handled ? 1 : 0;
> > +}
> > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > +
> > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > +                                      int optname, char __user *optval,
> > +                                      int __user *optlen)
> > +{
> > +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +       struct bpf_sockopt_kern ctx = {
> > +               .sk = sk,
> > +               .level = level,
> > +               .optname = optname,
> > +       };
> > +       int max_optlen;
> > +       char buf[64];
> > +       int ret;
> > +
> > +       /* Opportunistic check to see whether we have any BPF program
> > +        * attached to the hook so we don't waste time allocating
> > +        * memory and locking the socket.
> > +        */
> > +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> > +               return 0;
> > +
> > +       if (get_user(max_optlen, optlen))
> > +               return -EFAULT;
> > +
> > +       ret = sockopt_alloc_buf(&ctx, max_optlen);
> > +       if (ret)
> > +               return ret;
> > +
> > +       lock_sock(sk);
> > +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > +                                &ctx, BPF_PROG_RUN);
> > +       release_sock(sk);
> > +
> > +       if (ctx.optlen > max_optlen) {
> > +               sockopt_free_buf(&ctx);
> > +               return -EFAULT;
> > +       }
> > +
> > +       if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > +               sockopt_free_buf(&ctx);
> > +               return -EFAULT;
> > +       }
> > +
> > +       sockopt_free_buf(&ctx);
> > +
> > +       if (put_user(ctx.optlen, optlen))
> > +               return -EFAULT;
> > +
> > +       if (!ret)
> > +               return -EPERM;
> > +
> > +       return ctx.handled ? 1 : 0;
> > +}
> > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > +
> >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
> >                               size_t *lenp)
> >  {
> > @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
> >
> >  const struct bpf_prog_ops cg_sysctl_prog_ops = {
> >  };
> > +
> > +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> > +{
> > +       ctx->handled = true;
> > +       return 1;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> > +       .func           = bpf_sockopt_handled,
> > +       .gpl_only       = false,
> > +       .arg1_type      = ARG_PTR_TO_CTX,
> > +       .ret_type       = RET_INTEGER,
> > +};
> > +
> > +static const struct bpf_func_proto *
> > +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +       switch (func_id) {
> > +       case BPF_FUNC_sockopt_handled:
> > +               return &bpf_sockopt_handled_proto;
> > +       case BPF_FUNC_sk_fullsock:
> > +               return &bpf_sk_fullsock_proto;
> > +       case BPF_FUNC_sk_storage_get:
> > +               return &bpf_sk_storage_get_proto;
> > +       case BPF_FUNC_sk_storage_delete:
> > +               return &bpf_sk_storage_delete_proto;
> > +#ifdef CONFIG_INET
> > +       case BPF_FUNC_tcp_sock:
> > +               return &bpf_tcp_sock_proto;
> > +#endif
> > +       default:
> > +               return cgroup_base_func_proto(func_id, prog);
> > +       }
> > +}
> > +
> > +static bool cg_sockopt_is_valid_access(int off, int size,
> > +                                      enum bpf_access_type type,
> > +                                      const struct bpf_prog *prog,
> > +                                      struct bpf_insn_access_aux *info)
> > +{
> > +       const int size_default = sizeof(__u32);
> > +
> > +       if (off < 0 || off >= sizeof(struct bpf_sockopt))
> > +               return false;
> > +
> > +       if (off % size != 0)
> > +               return false;
> > +
> > +       if (type == BPF_WRITE) {
> > +               switch (off) {
> > +               case offsetof(struct bpf_sockopt, optlen):
> > +                       if (size != size_default)
> > +                               return false;
> > +                       return prog->expected_attach_type ==
> > +                               BPF_CGROUP_GETSOCKOPT;
> > +               default:
> > +                       return false;
> > +               }
> > +       }
> > +
> > +       switch (off) {
> > +       case offsetof(struct bpf_sockopt, sk):
> > +               if (size != sizeof(__u64))
> > +                       return false;
> > +               info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> > +               break;
> > +       case bpf_ctx_range(struct bpf_sockopt, optval):
> > +               if (size != size_default)
> > +                       return false;
> > +               info->reg_type = PTR_TO_PACKET;
> > +               break;
> > +       case bpf_ctx_range(struct bpf_sockopt, optval_end):
> > +               if (size != size_default)
> > +                       return false;
> > +               info->reg_type = PTR_TO_PACKET_END;
> > +               break;
> > +       default:
> > +               if (size != size_default)
> > +                       return false;
> > +               break;
> 
> nit, just:
> 
> return size == size_default
> 
> ?
> 
> > +       }
> > +       return true;
> > +}
> > +
> > +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> > +                                        const struct bpf_insn *si,
> > +                                        struct bpf_insn *insn_buf,
> > +                                        struct bpf_prog *prog,
> > +                                        u32 *target_size)
> > +{
> > +       struct bpf_insn *insn = insn_buf;
> > +
> > +       switch (si->off) {
> > +       case offsetof(struct bpf_sockopt, sk):
> > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> > +                                     si->dst_reg, si->src_reg,
> > +                                     offsetof(struct bpf_sockopt_kern, sk));
> > +               break;
> > +       case offsetof(struct bpf_sockopt, level):
> > +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +                                     bpf_target_off(struct bpf_sockopt_kern,
> > +                                                    level, 4, target_size));
> > +               break;
> > +       case offsetof(struct bpf_sockopt, optname):
> > +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +                                     bpf_target_off(struct bpf_sockopt_kern,
> > +                                                    optname, 4, target_size));
> > +               break;
> > +       case offsetof(struct bpf_sockopt, optlen):
> > +               if (type == BPF_WRITE)
> > +                       *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +                                             bpf_target_off(struct bpf_sockopt_kern,
> > +                                                            optlen, 4, target_size));
> > +               else
> > +                       *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > +                                             bpf_target_off(struct bpf_sockopt_kern,
> > +                                                            optlen, 4, target_size));
> > +               break;
> > +       case offsetof(struct bpf_sockopt, optval):
> > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> > +                                     si->dst_reg, si->src_reg,
> > +                                     offsetof(struct bpf_sockopt_kern, optval));
> > +               break;
> > +       case offsetof(struct bpf_sockopt, optval_end):
> > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> > +                                     si->dst_reg, si->src_reg,
> > +                                     offsetof(struct bpf_sockopt_kern, optval_end));
> > +               break;
> > +       }
> > +
> > +       return insn - insn_buf;
> > +}
> > +
> > +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> > +                                  bool direct_write,
> > +                                  const struct bpf_prog *prog)
> > +{
> > +       /* Nothing to do for the sockopt argument. The data is kzalloc'ed.
> > +        */
> > +       return 0;
> > +}
> > +
> > +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> > +       .get_func_proto         = cg_sockopt_func_proto,
> > +       .is_valid_access        = cg_sockopt_is_valid_access,
> > +       .convert_ctx_access     = cg_sockopt_convert_ctx_access,
> > +       .gen_prologue           = cg_sockopt_get_prologue,
> > +};
> > +
> > +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> > +};
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 4c53cbd3329d..4ad2b5f1905f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> >                 default:
> >                         return -EINVAL;
> >                 }
> > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > +               switch (expected_attach_type) {
> > +               case BPF_CGROUP_SETSOCKOPT:
> > +               case BPF_CGROUP_GETSOCKOPT:
> > +                       return 0;
> > +               default:
> > +                       return -EINVAL;
> > +               }
> >         default:
> >                 return 0;
> >         }
> > @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> >         switch (prog->type) {
> >         case BPF_PROG_TYPE_CGROUP_SOCK:
> >         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> >                 return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> >         case BPF_PROG_TYPE_CGROUP_SKB:
> >                 return prog->enforce_expected_attach_type &&
> > @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >         case BPF_CGROUP_SYSCTL:
> >                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> >                 break;
> > +       case BPF_CGROUP_GETSOCKOPT:
> > +       case BPF_CGROUP_SETSOCKOPT:
> > +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > +               break;
> >         default:
> >                 return -EINVAL;
> >         }
> > @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> >         case BPF_CGROUP_SYSCTL:
> >                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> >                 break;
> > +       case BPF_CGROUP_GETSOCKOPT:
> > +       case BPF_CGROUP_SETSOCKOPT:
> > +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > +               break;
> >         default:
> >                 return -EINVAL;
> >         }
> > @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> >         case BPF_CGROUP_SOCK_OPS:
> >         case BPF_CGROUP_DEVICE:
> >         case BPF_CGROUP_SYSCTL:
> > +       case BPF_CGROUP_GETSOCKOPT:
> > +       case BPF_CGROUP_SETSOCKOPT:
> >                 break;
> >         case BPF_LIRC_MODE2:
> >                 return lirc_prog_query(attr, uattr);
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 5c2cb5bd84ce..b91fde10e721 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> >
> >                 env->seen_direct_write = true;
> >                 return true;
> > +
> > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > +               if (t == BPF_WRITE) {
> > +                       if (env->prog->expected_attach_type ==
> > +                           BPF_CGROUP_GETSOCKOPT) {
> > +                               env->seen_direct_write = true;
> > +                               return true;
> > +                       }
> > +                       return false;
> > +               }
> > +               return true;
> > +
> >         default:
> >                 return false;
> >         }
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 55bfc941d17a..4652c0a005ca 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> >         return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> >  }
> >
> > -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > +const struct bpf_func_proto bpf_sk_fullsock_proto = {
> >         .func           = bpf_sk_fullsock,
> >         .gpl_only       = false,
> >         .ret_type       = RET_PTR_TO_SOCKET_OR_NULL,
> > @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
> >         return (unsigned long)NULL;
> >  }
> >
> > -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > +const struct bpf_func_proto bpf_tcp_sock_proto = {
> >         .func           = bpf_tcp_sock,
> >         .gpl_only       = false,
> >         .ret_type       = RET_PTR_TO_TCP_SOCK_OR_NULL,
> > diff --git a/net/socket.c b/net/socket.c
> > index 72372dc5dd70..e8654f1f70e6 100644
> > --- a/net/socket.c
> > +++ b/net/socket.c
> > @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
> >                 if (err)
> >                         goto out_put;
> >
> > +               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> > +                                                    optval, optlen);
> > +               if (err < 0) {
> > +                       goto out_put;
> > +               } else if (err > 0) {
> > +                       err = 0;
> > +                       goto out_put;
> > +               }
> > +
> >                 if (level == SOL_SOCKET)
> >                         err =
> >                             sock_setsockopt(sock, level, optname, optval,
> > @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
> >                 if (err)
> >                         goto out_put;
> >
> > +               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> > +                                                    optval, optlen);
> > +               if (err < 0) {
> > +                       goto out_put;
> > +               } else if (err > 0) {
> > +                       err = 0;
> > +                       goto out_put;
> > +               }
> > +
> >                 if (level == SOL_SOCKET)
> >                         err =
> >                             sock_getsockopt(sock, level, optname, optval,
> > --
> > 2.22.0.rc1.311.g5d7573a151-goog
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 20:54     ` Stanislav Fomichev
@ 2019-06-05 21:12       ` Andrii Nakryiko
  2019-06-05 21:30         ` Stanislav Fomichev
  0 siblings, 1 reply; 18+ messages in thread
From: Andrii Nakryiko @ 2019-06-05 21:12 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, Networking, bpf, davem, Alexei Starovoitov,
	Daniel Borkmann

On Wed, Jun 5, 2019 at 1:54 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 06/05, Andrii Nakryiko wrote:
> > On Tue, Jun 4, 2019 at 2:35 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > >
> > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > >
> > > The buffer memory is pre-allocated (because I don't think there is
> > > a precedent for working with __user memory from bpf). This might be
> >
> > Is there any harm or technical complication in allowing BPF to read
> > user memory directly? Or is it just uncharted territory, so there is
> > no "guideline"? If it's the latter, it could be a good time to discuss
> > that :)
> The naive implementation would have two helpers: one to copy from user,
> another to copy back to user; both of them would use something like
> get_user/put_user which can fault. Since we are running bpf progs with
> preempt disabled and in the rcu read section, we would need to do
> something like we currently do in bpf_probe_read where we disable pagefaults.
>
> To me it felt a bit excessive for the socket options hook; a simple
> data buffer is easier to work with from a BPF program, and we have all
> the machinery in place in the verifier. But I'm open to suggestions :-)
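>
> Just for the record, a completely untested sketch of what such a
> helper could look like (bpf_sockopt_copy_from_user is a made-up name):
>
> BPF_CALL_3(bpf_sockopt_copy_from_user, void *, dst, u32, size,
> 	   const void __user *, user_ptr)
> {
> 	int ret;
>
> 	/* Progs run with preempt disabled and under rcu_read_lock,
> 	 * so the copy must not sleep; disable pagefaults the same
> 	 * way bpf_probe_read does.
> 	 */
> 	pagefault_disable();
> 	ret = __copy_from_user_inatomic(dst, user_ptr, size);
> 	pagefault_enable();
>
> 	if (unlikely(ret))
> 		memset(dst, 0, size);
> 	return ret ? -EFAULT : 0;
> }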

It's more like I'm discovering what the implications are :) I don't
have suggestions, I was just curious. It seems like reading/writing to
user memory is a whole can of worms, so yeah, I'd stick to the buffer.

>
> > > slow to do for each {s,g}etsockopt call; that's why I've added
> > > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > > attached to a cgroup. Note, however, that there is a race between
> > > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > > program layout might have changed; this should not be a problem
> > > because in general there is a race between multiple calls to
> > > {s,g}etsockopt and a user adding/removing bpf progs from a cgroup.
> > >
> > > By default, the kernel code path is executed after the hook (to let
> > > BPF handle only a subset of the options). There is a new
> > > bpf_sockopt_handled handler that returns control to userspace
> > > instead (bypassing the kernel handling).
> > >
> > > The return code is either 1 (success) or 0 (EPERM).
> >
> > Why not having 3 return values: success, disallow, consumed/bypass
> > kernel logic? Instead of having extra side-effecting helper?
> That is an option. I didn't go that route because I wanted to
> reuse BPF_PROG_RUN_ARRAY which has the following inside:
>
>         ret = 1;
>         while (prog)
>                 ret &= bpf_prog..()
>
> But given that we now have the BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
> precedent, maybe that's worth it (essentially, have a
> BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY that handles 0, 1 and 2)?
> I don't have a strong opinion here to be honest.
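>
> Roughly, in the same pseudo-code, the sockopt flavor would become:
>
>         ret = 1;
>         while (prog) {
>                 run = bpf_prog..();
>                 if (run == 0) {                 /* reject, -EPERM */
>                         ret = 0;
>                         break;
>                 }
>                 if (run == 2)                   /* consumed */
>                         ret = 2;
>         }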

We are getting more types of BPF programs that are of the "controlling
type", which communicate some decision back to the kernel
(allow/deny/default handling, etc). In all of those cases
communicating this "decision" through the return code feels much
cleaner and more straightforward than through some custom one-off
helper. So yeah, I'm voting for using the return code for that. I'd
say a helper would make sense only if the BPF program has to provide
some complex information back to the kernel (e.g., a default string or
whatever).
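
Just to illustrate (the return value semantics are hypothetical here:
0 -- deny, 1 -- allow and fall through to the kernel, 2 -- consumed),
from the BPF side it would look something like:

SEC("cgroup/setsockopt")
int handle_setsockopt(struct bpf_sockopt *ctx)
{
	if (ctx->level == SOL_IP && ctx->optname == IP_TOS)
		return 2; /* we handled it, bypass kernel logic */

	return 1; /* let the kernel handle it as usual */
}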


>
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  include/linux/bpf-cgroup.h |  29 ++++
> > >  include/linux/bpf.h        |   2 +
> > >  include/linux/bpf_types.h  |   1 +
> > >  include/linux/filter.h     |  19 +++
> > >  include/uapi/linux/bpf.h   |  17 ++-
> > >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> > >  kernel/bpf/syscall.c       |  19 +++
> > >  kernel/bpf/verifier.c      |  12 ++
> > >  net/core/filter.c          |   4 +-
> > >  net/socket.c               |  18 +++
> > >  10 files changed, 406 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > index b631ee75762d..406f1ba82531 100644
> > > --- a/include/linux/bpf-cgroup.h
> > > +++ b/include/linux/bpf-cgroup.h
> > > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > >                                    loff_t *ppos, void **new_buf,
> > >                                    enum bpf_attach_type type);
> > >
> > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > > +                                      int optname, char __user *optval,
> > > +                                      unsigned int optlen);
> > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > > +                                      int optname, char __user *optval,
> > > +                                      int __user *optlen);
> > > +
> > >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> > >         struct bpf_map *map)
> > >  {
> > > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> > >         __ret;                                                                 \
> > >  })
> > >
> > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > +({                                                                            \
> > > +       int __ret = 0;                                                         \
> > > +       if (cgroup_bpf_enabled)                                                \
> > > +               __ret = __cgroup_bpf_run_filter_setsockopt(sock, level,        \
> > > +                                                          optname, optval,    \
> > > +                                                          optlen);            \
> > > +       __ret;                                                                 \
> > > +})
> > > +
> > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > +({                                                                            \
> > > +       int __ret = 0;                                                         \
> > > +       if (cgroup_bpf_enabled)                                                \
> > > +               __ret = __cgroup_bpf_run_filter_getsockopt(sock, level,        \
> > > +                                                          optname, optval,    \
> > > +                                                          optlen);            \
> > > +       __ret;                                                                 \
> > > +})
> > > +
> > >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> > >                            enum bpf_prog_type ptype, struct bpf_prog *prog);
> > >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> > >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> > >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> > >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > >
> > >  #define for_each_cgroup_storage_type(stype) for (; false; )
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index e5a309e6a400..fb4e6ef5a971 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> > >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > >  extern const struct bpf_func_proto bpf_strtol_proto;
> > >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > >
> > >  /* Shared helpers among cBPF and eBPF. */
> > >  void bpf_user_rnd_init_once(void);
> > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > index 5a9975678d6f..eec5aeeeaf92 100644
> > > --- a/include/linux/bpf_types.h
> > > +++ b/include/linux/bpf_types.h
> > > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> > >  #ifdef CONFIG_CGROUP_BPF
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> > >  #endif
> > >  #ifdef CONFIG_BPF_LIRC_MODE2
> > >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 43b45d6db36d..7a07fd2e14d3 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> > >         u64 tmp_reg;
> > >  };
> > >
> > > +struct bpf_sockopt_kern {
> > > +       struct sock     *sk;
> > > +       s32             level;
> > > +       s32             optname;
> > > +       u32             optlen;
> > > +       u8              *optval;
> > > +       u8              *optval_end;
> > > +
> > > +       /* If true, the BPF program has consumed the sockopt request.
> > > +        * Control is returned to userspace (i.e. the kernel doesn't
> > > +        * handle this option).
> > > +        */
> > > +       bool            handled;
> > > +
> > > +       /* Small on-stack optval buffer to avoid small allocations.
> > > +        */
> > > +       u8 buf[64];
> > > +};
> > > +
> > >  #endif /* __LINUX_FILTER_H__ */
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 7c6aef253173..b6c3891241ef 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> > >         BPF_PROG_TYPE_FLOW_DISSECTOR,
> > >         BPF_PROG_TYPE_CGROUP_SYSCTL,
> > >         BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > > +       BPF_PROG_TYPE_CGROUP_SOCKOPT,
> > >  };
> > >
> > >  enum bpf_attach_type {
> > > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> > >         BPF_LIRC_MODE2,
> > >         BPF_FLOW_DISSECTOR,
> > >         BPF_CGROUP_SYSCTL,
> > > +       BPF_CGROUP_GETSOCKOPT,
> > > +       BPF_CGROUP_SETSOCKOPT,
> > >         __MAX_BPF_ATTACH_TYPE
> > >  };
> > >
> > > @@ -2815,7 +2818,8 @@ union bpf_attr {
> > >         FN(strtoul),                    \
> > >         FN(sk_storage_get),             \
> > >         FN(sk_storage_delete),          \
> > > -       FN(send_signal),
> > > +       FN(send_signal),                \
> > > +       FN(sockopt_handled),
> > >
> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > >   * function eBPF program intends to call
> > > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> > >                                  */
> > >  };
> > >
> > > +struct bpf_sockopt {
> > > +       __bpf_md_ptr(struct bpf_sock *, sk);
> > > +
> > > +       __s32   level;
> > > +       __s32   optname;
> > > +
> > > +       __u32   optlen;
> > > +       __u32   optval;
> > > +       __u32   optval_end;
> > > +};
> > > +
> > >  #endif /* _UAPI__LINUX_BPF_H__ */
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index 1b65ab0df457..4ec99ea97023 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/bpf.h>
> > >  #include <linux/bpf-cgroup.h>
> > >  #include <net/sock.h>
> > > +#include <net/bpf_sk_storage.h>
> > >
> > >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> > >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > >  }
> > >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> > >
> > > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > > +                                       enum bpf_attach_type attach_type)
> > > +{
> > > +       struct bpf_prog_array *prog_array;
> > > +       int nr;
> > > +
> > > +       rcu_read_lock();
> > > +       prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > > +       nr = bpf_prog_array_length(prog_array);
> > > +       rcu_read_unlock();
> > > +
> > > +       return nr > 0;
> > > +}
> > > +
> > > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > > +{
> > > +       if (unlikely(max_optlen > PAGE_SIZE))
> > > +               return -EINVAL;
> > > +
> > > +       if (likely(max_optlen <= sizeof(ctx->buf))) {
> > > +               ctx->optval = ctx->buf;
> > > +       } else {
> > > +               ctx->optval = kzalloc(max_optlen, GFP_USER);
> > > +               if (!ctx->optval)
> > > +                       return -ENOMEM;
> > > +       }
> > > +
> > > +       ctx->optval_end = ctx->optval + max_optlen;
> > > +       ctx->optlen = max_optlen;
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > > +{
> > > +       if (unlikely(ctx->optval != ctx->buf))
> > > +               kfree(ctx->optval);
> > > +}
> > > +
> > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > > +                                      int optname, char __user *optval,
> > > +                                      unsigned int optlen)
> > > +{
> > > +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > +       struct bpf_sockopt_kern ctx = {
> > > +               .sk = sk,
> > > +               .level = level,
> > > +               .optname = optname,
> > > +       };
> > > +       int ret;
> > > +
> > > +       /* Opportunistic check to see whether we have any BPF program
> > > +        * attached to the hook so we don't waste time allocating
> > > +        * memory and locking the socket.
> > > +        */
> > > +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> > > +               return 0;
> > > +
> > > +       ret = sockopt_alloc_buf(&ctx, optlen);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > > +               sockopt_free_buf(&ctx);
> > > +               return -EFAULT;
> > > +       }
> > > +
> > > +       lock_sock(sk);
> > > +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > > +                                &ctx, BPF_PROG_RUN);
> > > +       release_sock(sk);
> > > +
> > > +       sockopt_free_buf(&ctx);
> > > +
> > > +       if (!ret)
> > > +               return -EPERM;
> > > +
> > > +       return ctx.handled ? 1 : 0;
> > > +}
> > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > > +
> > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > > +                                      int optname, char __user *optval,
> > > +                                      int __user *optlen)
> > > +{
> > > +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > +       struct bpf_sockopt_kern ctx = {
> > > +               .sk = sk,
> > > +               .level = level,
> > > +               .optname = optname,
> > > +       };
> > > +       int max_optlen;
> > > +       char buf[64];
> > > +       int ret;
> > > +
> > > +       /* Opportunistic check to see whether we have any BPF program
> > > +        * attached to the hook so we don't waste time allocating
> > > +        * memory and locking the socket.
> > > +        */
> > > +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> > > +               return 0;
> > > +
> > > +       if (get_user(max_optlen, optlen))
> > > +               return -EFAULT;
> > > +
> > > +       ret = sockopt_alloc_buf(&ctx, max_optlen);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       lock_sock(sk);
> > > +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > > +                                &ctx, BPF_PROG_RUN);
> > > +       release_sock(sk);
> > > +
> > > +       if (ctx.optlen > max_optlen) {
> > > +               sockopt_free_buf(&ctx);
> > > +               return -EFAULT;
> > > +       }
> > > +
> > > +       if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > > +               sockopt_free_buf(&ctx);
> > > +               return -EFAULT;
> > > +       }
> > > +
> > > +       sockopt_free_buf(&ctx);
> > > +
> > > +       if (put_user(ctx.optlen, optlen))
> > > +               return -EFAULT;
> > > +
> > > +       if (!ret)
> > > +               return -EPERM;
> > > +
> > > +       return ctx.handled ? 1 : 0;
> > > +}
> > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > > +
> > >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
> > >                               size_t *lenp)
> > >  {
> > > @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
> > >
> > >  const struct bpf_prog_ops cg_sysctl_prog_ops = {
> > >  };
> > > +
> > > +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> > > +{
> > > +       ctx->handled = true;
> > > +       return 1;
> > > +}
> > > +
> > > +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> > > +       .func           = bpf_sockopt_handled,
> > > +       .gpl_only       = false,
> > > +       .arg1_type      = ARG_PTR_TO_CTX,
> > > +       .ret_type       = RET_INTEGER,
> > > +};
> > > +
> > > +static const struct bpf_func_proto *
> > > +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > +{
> > > +       switch (func_id) {
> > > +       case BPF_FUNC_sockopt_handled:
> > > +               return &bpf_sockopt_handled_proto;
> > > +       case BPF_FUNC_sk_fullsock:
> > > +               return &bpf_sk_fullsock_proto;
> > > +       case BPF_FUNC_sk_storage_get:
> > > +               return &bpf_sk_storage_get_proto;
> > > +       case BPF_FUNC_sk_storage_delete:
> > > +               return &bpf_sk_storage_delete_proto;
> > > +#ifdef CONFIG_INET
> > > +       case BPF_FUNC_tcp_sock:
> > > +               return &bpf_tcp_sock_proto;
> > > +#endif
> > > +       default:
> > > +               return cgroup_base_func_proto(func_id, prog);
> > > +       }
> > > +}
> > > +
> > > +static bool cg_sockopt_is_valid_access(int off, int size,
> > > +                                      enum bpf_access_type type,
> > > +                                      const struct bpf_prog *prog,
> > > +                                      struct bpf_insn_access_aux *info)
> > > +{
> > > +       const int size_default = sizeof(__u32);
> > > +
> > > +       if (off < 0 || off >= sizeof(struct bpf_sockopt))
> > > +               return false;
> > > +
> > > +       if (off % size != 0)
> > > +               return false;
> > > +
> > > +       if (type == BPF_WRITE) {
> > > +               switch (off) {
> > > +               case offsetof(struct bpf_sockopt, optlen):
> > > +                       if (size != size_default)
> > > +                               return false;
> > > +                       return prog->expected_attach_type ==
> > > +                               BPF_CGROUP_GETSOCKOPT;
> > > +               default:
> > > +                       return false;
> > > +               }
> > > +       }
> > > +
> > > +       switch (off) {
> > > +       case offsetof(struct bpf_sockopt, sk):
> > > +               if (size != sizeof(__u64))
> > > +                       return false;
> > > +               info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> > > +               break;
> > > +       case bpf_ctx_range(struct bpf_sockopt, optval):
> > > +               if (size != size_default)
> > > +                       return false;
> > > +               info->reg_type = PTR_TO_PACKET;
> > > +               break;
> > > +       case bpf_ctx_range(struct bpf_sockopt, optval_end):
> > > +               if (size != size_default)
> > > +                       return false;
> > > +               info->reg_type = PTR_TO_PACKET_END;
> > > +               break;
> > > +       default:
> > > +               if (size != size_default)
> > > +                       return false;
> > > +               break;
> >
> > nit, just:
> >
> > return size == size_default
> >
> > ?
> >
> > > +       }
> > > +       return true;
> > > +}
> > > +
> > > +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> > > +                                        const struct bpf_insn *si,
> > > +                                        struct bpf_insn *insn_buf,
> > > +                                        struct bpf_prog *prog,
> > > +                                        u32 *target_size)
> > > +{
> > > +       struct bpf_insn *insn = insn_buf;
> > > +
> > > +       switch (si->off) {
> > > +       case offsetof(struct bpf_sockopt, sk):
> > > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> > > +                                     si->dst_reg, si->src_reg,
> > > +                                     offsetof(struct bpf_sockopt_kern, sk));
> > > +               break;
> > > +       case offsetof(struct bpf_sockopt, level):
> > > +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +                                     bpf_target_off(struct bpf_sockopt_kern,
> > > +                                                    level, 4, target_size));
> > > +               break;
> > > +       case offsetof(struct bpf_sockopt, optname):
> > > +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +                                     bpf_target_off(struct bpf_sockopt_kern,
> > > +                                                    optname, 4, target_size));
> > > +               break;
> > > +       case offsetof(struct bpf_sockopt, optlen):
> > > +               if (type == BPF_WRITE)
> > > +                       *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +                                             bpf_target_off(struct bpf_sockopt_kern,
> > > +                                                            optlen, 4, target_size));
> > > +               else
> > > +                       *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > +                                             bpf_target_off(struct bpf_sockopt_kern,
> > > +                                                            optlen, 4, target_size));
> > > +               break;
> > > +       case offsetof(struct bpf_sockopt, optval):
> > > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> > > +                                     si->dst_reg, si->src_reg,
> > > +                                     offsetof(struct bpf_sockopt_kern, optval));
> > > +               break;
> > > +       case offsetof(struct bpf_sockopt, optval_end):
> > > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> > > +                                     si->dst_reg, si->src_reg,
> > > +                                     offsetof(struct bpf_sockopt_kern, optval_end));
> > > +               break;
> > > +       }
> > > +
> > > +       return insn - insn_buf;
> > > +}
> > > +
> > > +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> > > +                                  bool direct_write,
> > > +                                  const struct bpf_prog *prog)
> > > +{
> > > +       /* Nothing to do for the sockopt argument. The data is kzalloc'ed.
> > > +        */
> > > +       return 0;
> > > +}
> > > +
> > > +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> > > +       .get_func_proto         = cg_sockopt_func_proto,
> > > +       .is_valid_access        = cg_sockopt_is_valid_access,
> > > +       .convert_ctx_access     = cg_sockopt_convert_ctx_access,
> > > +       .gen_prologue           = cg_sockopt_get_prologue,
> > > +};
> > > +
> > > +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> > > +};
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 4c53cbd3329d..4ad2b5f1905f 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> > >                 default:
> > >                         return -EINVAL;
> > >                 }
> > > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > +               switch (expected_attach_type) {
> > > +               case BPF_CGROUP_SETSOCKOPT:
> > > +               case BPF_CGROUP_GETSOCKOPT:
> > > +                       return 0;
> > > +               default:
> > > +                       return -EINVAL;
> > > +               }
> > >         default:
> > >                 return 0;
> > >         }
> > > @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> > >         switch (prog->type) {
> > >         case BPF_PROG_TYPE_CGROUP_SOCK:
> > >         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > >                 return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> > >         case BPF_PROG_TYPE_CGROUP_SKB:
> > >                 return prog->enforce_expected_attach_type &&
> > > @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> > >         case BPF_CGROUP_SYSCTL:
> > >                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > >                 break;
> > > +       case BPF_CGROUP_GETSOCKOPT:
> > > +       case BPF_CGROUP_SETSOCKOPT:
> > > +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > +               break;
> > >         default:
> > >                 return -EINVAL;
> > >         }
> > > @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> > >         case BPF_CGROUP_SYSCTL:
> > >                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > >                 break;
> > > +       case BPF_CGROUP_GETSOCKOPT:
> > > +       case BPF_CGROUP_SETSOCKOPT:
> > > +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > +               break;
> > >         default:
> > >                 return -EINVAL;
> > >         }
> > > @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> > >         case BPF_CGROUP_SOCK_OPS:
> > >         case BPF_CGROUP_DEVICE:
> > >         case BPF_CGROUP_SYSCTL:
> > > +       case BPF_CGROUP_GETSOCKOPT:
> > > +       case BPF_CGROUP_SETSOCKOPT:
> > >                 break;
> > >         case BPF_LIRC_MODE2:
> > >                 return lirc_prog_query(attr, uattr);
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 5c2cb5bd84ce..b91fde10e721 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> > >
> > >                 env->seen_direct_write = true;
> > >                 return true;
> > > +
> > > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > +               if (t == BPF_WRITE) {
> > > +                       if (env->prog->expected_attach_type ==
> > > +                           BPF_CGROUP_GETSOCKOPT) {
> > > +                               env->seen_direct_write = true;
> > > +                               return true;
> > > +                       }
> > > +                       return false;
> > > +               }
> > > +               return true;
> > > +
> > >         default:
> > >                 return false;
> > >         }
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 55bfc941d17a..4652c0a005ca 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> > >         return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> > >  }
> > >
> > > -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > > +const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > >         .func           = bpf_sk_fullsock,
> > >         .gpl_only       = false,
> > >         .ret_type       = RET_PTR_TO_SOCKET_OR_NULL,
> > > @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
> > >         return (unsigned long)NULL;
> > >  }
> > >
> > > -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > > +const struct bpf_func_proto bpf_tcp_sock_proto = {
> > >         .func           = bpf_tcp_sock,
> > >         .gpl_only       = false,
> > >         .ret_type       = RET_PTR_TO_TCP_SOCK_OR_NULL,
> > > diff --git a/net/socket.c b/net/socket.c
> > > index 72372dc5dd70..e8654f1f70e6 100644
> > > --- a/net/socket.c
> > > +++ b/net/socket.c
> > > @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
> > >                 if (err)
> > >                         goto out_put;
> > >
> > > +               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> > > +                                                    optval, optlen);
> > > +               if (err < 0) {
> > > +                       goto out_put;
> > > +               } else if (err > 0) {
> > > +                       err = 0;
> > > +                       goto out_put;
> > > +               }
> > > +
> > >                 if (level == SOL_SOCKET)
> > >                         err =
> > >                             sock_setsockopt(sock, level, optname, optval,
> > > @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
> > >                 if (err)
> > >                         goto out_put;
> > >
> > > +               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> > > +                                                    optval, optlen);
> > > +               if (err < 0) {
> > > +                       goto out_put;
> > > +               } else if (err > 0) {
> > > +                       err = 0;
> > > +                       goto out_put;
> > > +               }
> > > +
> > >                 if (level == SOL_SOCKET)
> > >                         err =
> > >                             sock_getsockopt(sock, level, optname, optval,
> > > --
> > > 2.22.0.rc1.311.g5d7573a151-goog
> > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 20:50       ` Martin Lau
@ 2019-06-05 21:16         ` Stanislav Fomichev
  2019-06-05 21:41           ` Martin Lau
  0 siblings, 1 reply; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-05 21:16 UTC (permalink / raw)
  To: Martin Lau; +Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel

On 06/05, Martin Lau wrote:
> On Wed, Jun 05, 2019 at 12:17:24PM -0700, Stanislav Fomichev wrote:
> > On 06/05, Martin Lau wrote:
> > > On Tue, Jun 04, 2019 at 02:35:18PM -0700, Stanislav Fomichev wrote:
> > > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > > 
> > > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > > 
> > > > The buffer memory is pre-allocated (because I don't think there is
> > > > a precedent for working with __user memory from bpf). This might be
> > > > slow to do for each {s,g}etsockopt call; that's why I've added
> > > > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > > > attached to a cgroup. Note, however, that there is a race between
> > > > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > > > program layout might have changed; this should not be a problem
> > > > because in general there is a race between multiple calls to
> > > > {s,g}etsockopt and a user adding/removing bpf progs from a cgroup.
> > > > 
> > > > By default, the kernel code path is executed after the hook (to let
> > > > BPF handle only a subset of the options). There is a new
> > > > bpf_sockopt_handled handler that returns control to userspace
> > > > instead (bypassing the kernel handling).
> > > > 
> > > > The return code is either 1 (success) or 0 (EPERM).
> > > > 
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >  include/linux/bpf-cgroup.h |  29 ++++
> > > >  include/linux/bpf.h        |   2 +
> > > >  include/linux/bpf_types.h  |   1 +
> > > >  include/linux/filter.h     |  19 +++
> > > >  include/uapi/linux/bpf.h   |  17 ++-
> > > >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> > > >  kernel/bpf/syscall.c       |  19 +++
> > > >  kernel/bpf/verifier.c      |  12 ++
> > > >  net/core/filter.c          |   4 +-
> > > >  net/socket.c               |  18 +++
> > > >  10 files changed, 406 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > > index b631ee75762d..406f1ba82531 100644
> > > > --- a/include/linux/bpf-cgroup.h
> > > > +++ b/include/linux/bpf-cgroup.h
> > > > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > > >  				   loff_t *ppos, void **new_buf,
> > > >  				   enum bpf_attach_type type);
> > > >  
> > > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > > > +				       int optname, char __user *optval,
> > > > +				       unsigned int optlen);
> > > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > > > +				       int optname, char __user *optval,
> > > > +				       int __user *optlen);
> > > > +
> > > >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> > > >  	struct bpf_map *map)
> > > >  {
> > > > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> > > >  	__ret;								       \
> > > >  })
> > > >  
> > > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > > +({									       \
> > > > +	int __ret = 0;							       \
> > > > +	if (cgroup_bpf_enabled)						       \
> > > > +		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
> > > > +							   optname, optval,    \
> > > > +							   optlen);	       \
> > > > +	__ret;								       \
> > > > +})
> > > > +
> > > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > > +({									       \
> > > > +	int __ret = 0;							       \
> > > > +	if (cgroup_bpf_enabled)						       \
> > > > +		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
> > > > +							   optname, optval,    \
> > > > +							   optlen);	       \
> > > > +	__ret;								       \
> > > > +})
> > > > +
> > > >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> > > >  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
> > > >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > > > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> > > >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> > > >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> > > >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > >  
> > > >  #define for_each_cgroup_storage_type(stype) for (; false; )
> > > >  
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index e5a309e6a400..fb4e6ef5a971 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> > > >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > > >  extern const struct bpf_func_proto bpf_strtol_proto;
> > > >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > > > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > > > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > > >  
> > > >  /* Shared helpers among cBPF and eBPF. */
> > > >  void bpf_user_rnd_init_once(void);
> > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > index 5a9975678d6f..eec5aeeeaf92 100644
> > > > --- a/include/linux/bpf_types.h
> > > > +++ b/include/linux/bpf_types.h
> > > > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> > > >  #ifdef CONFIG_CGROUP_BPF
> > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > > > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> > > >  #endif
> > > >  #ifdef CONFIG_BPF_LIRC_MODE2
> > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > index 43b45d6db36d..7a07fd2e14d3 100644
> > > > --- a/include/linux/filter.h
> > > > +++ b/include/linux/filter.h
> > > > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> > > >  	u64 tmp_reg;
> > > >  };
> > > >  
> > > > +struct bpf_sockopt_kern {
> > > > +	struct sock	*sk;
> > > > +	s32		level;
> > > > +	s32		optname;
> > > > +	u32		optlen;
It seems there is a hole.
> > Ack, will move the pointers up.
> > 
> > > > +	u8		*optval;
> > > > +	u8		*optval_end;
> > > > +
> > > > +	/* If true, the BPF program has consumed the sockopt request.
> > > > +	 * Control is returned to userspace (i.e. the kernel doesn't
> > > > +	 * handle this option).
> > > > +	 */
> > > > +	bool		handled;
> > > > +
> > > > +	/* Small on-stack optval buffer to avoid small allocations.
> > > > +	 */
> > > > +	u8 buf[64];
> > > Is it better to align to 8 bytes?
> > Do you mean manually setting the size to 64 + x, where x is the
> > remainder needed to align to 8 bytes? Is there maybe some macro to help with that?
> I think __attribute__((aligned(8))) should do.
Ah, you meant to align the buffer itself to avoid unaligned
access from the bpf progs. Got it, will do.
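
For the archive, here is roughly what I'm planning (sketch only,
untested): pointers first to close the hole, and the scratch buffer
8-byte aligned so that direct loads from bpf progs are aligned:

struct bpf_sockopt_kern {
	struct sock	*sk;
	u8		*optval;
	u8		*optval_end;
	s32		level;
	s32		optname;
	u32		optlen;

	/* If true, BPF program had consumed the sockopt request. */
	bool		handled;

	/* Small on-stack optval buffer to avoid small allocations. */
	u8		buf[64] __attribute__((aligned(8)));
};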

> > 
> > > > +};
> > > > +
> > > >  #endif /* __LINUX_FILTER_H__ */
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 7c6aef253173..b6c3891241ef 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> > > >  	BPF_PROG_TYPE_FLOW_DISSECTOR,
> > > >  	BPF_PROG_TYPE_CGROUP_SYSCTL,
> > > >  	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > > > +	BPF_PROG_TYPE_CGROUP_SOCKOPT,
> > > >  };
> > > >  
> > > >  enum bpf_attach_type {
> > > > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> > > >  	BPF_LIRC_MODE2,
> > > >  	BPF_FLOW_DISSECTOR,
> > > >  	BPF_CGROUP_SYSCTL,
> > > > +	BPF_CGROUP_GETSOCKOPT,
> > > > +	BPF_CGROUP_SETSOCKOPT,
> > > >  	__MAX_BPF_ATTACH_TYPE
> > > >  };
> > > >  
> > > > @@ -2815,7 +2818,8 @@ union bpf_attr {
> > > >  	FN(strtoul),			\
> > > >  	FN(sk_storage_get),		\
> > > >  	FN(sk_storage_delete),		\
> > > > -	FN(send_signal),
> > > > +	FN(send_signal),		\
> > > > +	FN(sockopt_handled),
> > > Document.
> > Ah, totally forgot about that, sure, will do!
> > 
> > > >  
> > > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > >   * function eBPF program intends to call
> > > > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> > > >  				 */
> > > >  };
> > > >  
> > > > +struct bpf_sockopt {
> > > > +	__bpf_md_ptr(struct bpf_sock *, sk);
> > > > +
> > > > +	__s32	level;
> > > > +	__s32	optname;
> > > > +
> > > > +	__u32	optlen;
> > > > +	__u32	optval;
> > > > +	__u32	optval_end;
> > > > +};
> > > > +
> > > >  #endif /* _UAPI__LINUX_BPF_H__ */
> > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > > index 1b65ab0df457..4ec99ea97023 100644
> > > > --- a/kernel/bpf/cgroup.c
> > > > +++ b/kernel/bpf/cgroup.c
> > > > @@ -18,6 +18,7 @@
> > > >  #include <linux/bpf.h>
> > > >  #include <linux/bpf-cgroup.h>
> > > >  #include <net/sock.h>
> > > > +#include <net/bpf_sk_storage.h>
> > > >  
> > > >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> > > >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > > > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > > >  }
> > > >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> > > >  
> > > > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > > > +					enum bpf_attach_type attach_type)
> > > > +{
> > > > +	struct bpf_prog_array *prog_array;
> > > > +	int nr;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > > > +	nr = bpf_prog_array_length(prog_array);
> > > Nit. It seems unnecessary to loop through the whole
> > > array if the only signal needed is non-zero.
> > Oh, good point. I guess I'd have to add another helper like
> > bpf_prog_array_is_empty() and return early. Any other suggestions?
> I was thinking of checking empty_prog_array on top, but that felt
> like overkill, so I didn't mention it.  I think just returning
> early is good enough.
[..]
> I think this non-zero check is good to have before doing lock_sock().
And not before the allocation? I was trying to optimize for both kmalloc
and lock_sock (since, I guess, the majority of cgroups would not
have any sockopt progs, so there is no point in paying the kmalloc
cost either).
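
Concretely, something like this is what I had in mind (sketch only;
the helper name is mine, and it would have to live next to
bpf_prog_array_length() in kernel/bpf/core.c so it can see
dummy_bpf_prog):

bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
{
	struct bpf_prog_array_item *item;

	/* Stop at the first real program instead of counting the
	 * whole array like bpf_prog_array_length() does.
	 */
	for (item = array->items; item->prog; item++)
		if (item->prog != &dummy_bpf_prog.prog)
			return false;

	return true;
}

__cgroup_bpf_has_prog_array() would then keep the rcu_read_lock() /
rcu_dereference() part and just return !bpf_prog_array_is_empty(...).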

> 
> > 
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	return nr > 0;
> > > > +}
> > > > +
> > > > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > > > +{
> > > > +	if (unlikely(max_optlen > PAGE_SIZE))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (likely(max_optlen <= sizeof(ctx->buf))) {
> > > > +		ctx->optval = ctx->buf;
> > > > +	} else {
> > > > +		ctx->optval = kzalloc(max_optlen, GFP_USER);
> > > > +		if (!ctx->optval)
> > > > +			return -ENOMEM;
> > > > +	}
> > > > +
> > > > +	ctx->optval_end = ctx->optval + max_optlen;
> > > > +	ctx->optlen = max_optlen;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > > > +{
> > > > +	if (unlikely(ctx->optval != ctx->buf))
> > > > +		kfree(ctx->optval);
> > > > +}
> > > > +
> > > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > > > +				       int optname, char __user *optval,
> > > > +				       unsigned int optlen)
> > > > +{
> > > > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > > +	struct bpf_sockopt_kern ctx = {
> > > > +		.sk = sk,
> > > > +		.level = level,
> > > > +		.optname = optname,
> > > > +	};
> > > > +	int ret;
> > > > +
> > > > +	/* Opportunistic check to see whether we have any BPF program
> > > > +	 * attached to the hook so we don't waste time allocating
> > > > +	 * memory and locking the socket.
> > > > +	 */
> > > > +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> > > > +		return 0;
> > > > +
> > > > +	ret = sockopt_alloc_buf(&ctx, optlen);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > > > +		sockopt_free_buf(&ctx);
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	lock_sock(sk);
> > > > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > > > +				 &ctx, BPF_PROG_RUN);
> > > I think the check_return_code() in verifier.c has to be
> > > adjusted also.
> > Good catch! I thought it did the [0,1] check by default.
> btw, it just came to my mind: did you have a chance to
> look at how 'ret' is handled in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY()?
> It can take return values other than 0 or 1.  I am thinking
> ctx.handled could also be folded into 'ret' itself, but off the
> top of my head I think your current way, "bpf_sockopt_handled()",
> may be cleaner.
Andrii had the same suggestion. Let me spend some time looking into
whether it's easier to use a return code.
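
To make it concrete, here is roughly how I imagine the prog side would
look if we go the return-code route (pure sketch, the exact convention
is TBD; SOL_CUSTOM is a made-up level, and 0 would keep meaning EPERM):

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define SOL_CUSTOM 0xdeadbeef /* made-up level, for illustration only */

SEC("cgroup/getsockopt")
int handle_getsockopt(struct bpf_sockopt *ctx)
{
	if (ctx->level != SOL_CUSTOM)
		return 1; /* fall through to the kernel code path */

	/* ... serve the option from ctx->optval / ctx->optlen ... */

	return 2; /* consumed: kernel handling is bypassed */
}

char _license[] SEC("license") = "GPL";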

> > > > +	release_sock(sk);
> > > > +
> > > > +	sockopt_free_buf(&ctx);
> > > > +
> > > > +	if (!ret)
> > > > +		return -EPERM;
> > > > +
> > > > +	return ctx.handled ? 1 : 0;
> > > > +}
> > > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > > > +
> > > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > > > +				       int optname, char __user *optval,
> > > > +				       int __user *optlen)
> > > > +{
> > > > +	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > > +	struct bpf_sockopt_kern ctx = {
> > > > +		.sk = sk,
> > > > +		.level = level,
> > > > +		.optname = optname,
> > > > +	};
> > > > +	int max_optlen;
> > > > +	char buf[64];
> > > hmm... where is it used?
> > It's a leftover from my initial attempt to have a small buffer on the stack.
> > I've since moved it into struct bpf_sockopt_kern. Will remove. GCC even
> > complains about the unused var, not sure how I missed that...
> > 
> > > > +	int ret;
> > > > +
> > > > +	/* Opportunistic check to see whether we have any BPF program
> > > > +	 * attached to the hook so we don't waste time allocating
> > > > +	 * memory and locking the socket.
> > > > +	 */
> > > > +	if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> > > > +		return 0;
> > > > +
> > > > +	if (get_user(max_optlen, optlen))
> > > > +		return -EFAULT;
> > > > +
> > > > +	ret = sockopt_alloc_buf(&ctx, max_optlen);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	lock_sock(sk);
> > > > +	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > > > +				 &ctx, BPF_PROG_RUN);
> > > > +	release_sock(sk);
> > > > +
> > > > +	if (ctx.optlen > max_optlen) {
> > > > +		sockopt_free_buf(&ctx);
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > > > +		sockopt_free_buf(&ctx);
> > > > +		return -EFAULT;
> > > > +	}
> > > > +
> > > > +	sockopt_free_buf(&ctx);
> > > > +
> > > > +	if (put_user(ctx.optlen, optlen))
> > > > +		return -EFAULT;
> > > > +
> > > > +	if (!ret)
> > > > +		return -EPERM;
> > > > +
> > > > +	return ctx.handled ? 1 : 0;
> > > > +}
> > > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > > > +
> > > >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
> > > >  			      size_t *lenp)
> > > >  {
> > > > @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
> > > >  
> > > >  const struct bpf_prog_ops cg_sysctl_prog_ops = {
> > > >  };
> > > > +
> > > > +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> > > > +{
> > > > +	ctx->handled = true;
> > > > +	return 1;
> > > RET_VOID?
> > I was thinking that in the C code the pattern can be:
> > {
> > 	...
> > 	return bpf_sockopt_handled();
> > }
> > 
> > That's why I'm returning 1 from the helper. But I can change it to VOID
> > so that users have to return 1 manually. That's probably cleaner, will
> > change.
> > 
> > > > +}
> > > > +
> > > > +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> > > > +	.func		= bpf_sockopt_handled,
> > > > +	.gpl_only	= false,
> > > > +	.arg1_type      = ARG_PTR_TO_CTX,
> > > > +	.ret_type	= RET_INTEGER,
> > > > +};
> > > > +
> > > > +static const struct bpf_func_proto *
> > > > +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > +{
> > > > +	switch (func_id) {
> > > > +	case BPF_FUNC_sockopt_handled:
> > > > +		return &bpf_sockopt_handled_proto;
> > > > +	case BPF_FUNC_sk_fullsock:
> > > > +		return &bpf_sk_fullsock_proto;
> > > > +	case BPF_FUNC_sk_storage_get:
> > > > +		return &bpf_sk_storage_get_proto;
> > > > +	case BPF_FUNC_sk_storage_delete:
> > > > +		return &bpf_sk_storage_delete_proto;
> > > > +#ifdef CONFIG_INET
> > > > +	case BPF_FUNC_tcp_sock:
> > > > +		return &bpf_tcp_sock_proto;
> > > > +#endif
> > > > +	default:
> > > > +		return cgroup_base_func_proto(func_id, prog);
> > > > +	}
> > > > +}
> > > > +
> > > > +static bool cg_sockopt_is_valid_access(int off, int size,
> > > > +				       enum bpf_access_type type,
> > > > +				       const struct bpf_prog *prog,
> > > > +				       struct bpf_insn_access_aux *info)
> > > > +{
> > > > +	const int size_default = sizeof(__u32);
> > > > +
> > > > +	if (off < 0 || off >= sizeof(struct bpf_sockopt))
> > > > +		return false;
> > > > +
> > > > +	if (off % size != 0)
> > > > +		return false;
> > > > +
> > > > +	if (type == BPF_WRITE) {
> > > > +		switch (off) {
> > > > +		case offsetof(struct bpf_sockopt, optlen):
> > > > +			if (size != size_default)
> > > > +				return false;
> > > > +			return prog->expected_attach_type ==
> > > > +				BPF_CGROUP_GETSOCKOPT;
> > > > +		default:
> > > > +			return false;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	switch (off) {
> > > > +	case offsetof(struct bpf_sockopt, sk):
> > > > +		if (size != sizeof(__u64))
> > > > +			return false;
> > > > +		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> > > sk cannot be NULL, so the OR_NULL part is not needed.
> > > 
> > > I think it should also be PTR_TO_SOCKET instead.
> > I think you're correct. That reminds me that I haven't properly
> > tested it. Let me add a small C selftest that exercises this codepath.
> > 
> > > > +		break;
> > > > +	case bpf_ctx_range(struct bpf_sockopt, optval):
> > > > +		if (size != size_default)
> > > > +			return false;
> > > > +		info->reg_type = PTR_TO_PACKET;
> > > > +		break;
> > > > +	case bpf_ctx_range(struct bpf_sockopt, optval_end):
> > > > +		if (size != size_default)
> > > > +			return false;
> > > > +		info->reg_type = PTR_TO_PACKET_END;
> > > > +		break;
> > > > +	default:
> > > > +		if (size != size_default)
> > > > +			return false;
> > > > +		break;
> > > > +	}
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> > > > +					 const struct bpf_insn *si,
> > > > +					 struct bpf_insn *insn_buf,
> > > > +					 struct bpf_prog *prog,
> > > > +					 u32 *target_size)
> > > > +{
> > > > +	struct bpf_insn *insn = insn_buf;
> > > > +
> > > > +	switch (si->off) {
> > > > +	case offsetof(struct bpf_sockopt, sk):
> > > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> > > > +				      si->dst_reg, si->src_reg,
> > > > +				      offsetof(struct bpf_sockopt_kern, sk));
> > > > +		break;
> > > > +	case offsetof(struct bpf_sockopt, level):
> > > > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +				      bpf_target_off(struct bpf_sockopt_kern,
> > > > +						     level, 4, target_size));
> > > bpf_target_off() is not needed since there is no narrow load.
> > Good point, will drop it.
> > 
> > Thank you for a review!
> > 
> > > > +		break;
> > > > +	case offsetof(struct bpf_sockopt, optname):
> > > > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +				      bpf_target_off(struct bpf_sockopt_kern,
> > > > +						     optname, 4, target_size));
> > > > +		break;
> > > > +	case offsetof(struct bpf_sockopt, optlen):
> > > > +		if (type == BPF_WRITE)
> > > > +			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +					      bpf_target_off(struct bpf_sockopt_kern,
> > > > +							     optlen, 4, target_size));
> > > > +		else
> > > > +			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +					      bpf_target_off(struct bpf_sockopt_kern,
> > > > +							     optlen, 4, target_size));
> > > > +		break;
> > > > +	case offsetof(struct bpf_sockopt, optval):
> > > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> > > > +				      si->dst_reg, si->src_reg,
> > > > +				      offsetof(struct bpf_sockopt_kern, optval));
> > > > +		break;
> > > > +	case offsetof(struct bpf_sockopt, optval_end):
> > > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> > > > +				      si->dst_reg, si->src_reg,
> > > > +				      offsetof(struct bpf_sockopt_kern, optval_end));
> > > > +		break;
> > > > +	}
> > > > +
> > > > +	return insn - insn_buf;
> > > > +}
> > > > +
> > > > +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> > > > +				   bool direct_write,
> > > > +				   const struct bpf_prog *prog)
> > > > +{
> > > > +	/* Nothing to do for sockopt argument. The data is kzalloc'ated.
> > > > +	 */
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> > > > +	.get_func_proto		= cg_sockopt_func_proto,
> > > > +	.is_valid_access	= cg_sockopt_is_valid_access,
> > > > +	.convert_ctx_access	= cg_sockopt_convert_ctx_access,
> > > > +	.gen_prologue		= cg_sockopt_get_prologue,
> > > > +};
> > > > +
> > > > +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> > > > +};
> > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > index 4c53cbd3329d..4ad2b5f1905f 100644
> > > > --- a/kernel/bpf/syscall.c
> > > > +++ b/kernel/bpf/syscall.c
> > > > @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> > > >  		default:
> > > >  			return -EINVAL;
> > > >  		}
> > > > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > > +		switch (expected_attach_type) {
> > > > +		case BPF_CGROUP_SETSOCKOPT:
> > > > +		case BPF_CGROUP_GETSOCKOPT:
> > > > +			return 0;
> > > > +		default:
> > > > +			return -EINVAL;
> > > > +		}
> > > >  	default:
> > > >  		return 0;
> > > >  	}
> > > > @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> > > >  	switch (prog->type) {
> > > >  	case BPF_PROG_TYPE_CGROUP_SOCK:
> > > >  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > > > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > >  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> > > >  	case BPF_PROG_TYPE_CGROUP_SKB:
> > > >  		return prog->enforce_expected_attach_type &&
> > > > @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> > > >  	case BPF_CGROUP_SYSCTL:
> > > >  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > > >  		break;
> > > > +	case BPF_CGROUP_GETSOCKOPT:
> > > > +	case BPF_CGROUP_SETSOCKOPT:
> > > > +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > > +		break;
> > > >  	default:
> > > >  		return -EINVAL;
> > > >  	}
> > > > @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> > > >  	case BPF_CGROUP_SYSCTL:
> > > >  		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > > >  		break;
> > > > +	case BPF_CGROUP_GETSOCKOPT:
> > > > +	case BPF_CGROUP_SETSOCKOPT:
> > > > +		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > > +		break;
> > > >  	default:
> > > >  		return -EINVAL;
> > > >  	}
> > > > @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> > > >  	case BPF_CGROUP_SOCK_OPS:
> > > >  	case BPF_CGROUP_DEVICE:
> > > >  	case BPF_CGROUP_SYSCTL:
> > > > +	case BPF_CGROUP_GETSOCKOPT:
> > > > +	case BPF_CGROUP_SETSOCKOPT:
> > > >  		break;
> > > >  	case BPF_LIRC_MODE2:
> > > >  		return lirc_prog_query(attr, uattr);
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 5c2cb5bd84ce..b91fde10e721 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> > > >  
> > > >  		env->seen_direct_write = true;
> > > >  		return true;
> > > > +
> > > > +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > > +		if (t == BPF_WRITE) {
> > > > +			if (env->prog->expected_attach_type ==
> > > > +			    BPF_CGROUP_GETSOCKOPT) {
> > > > +				env->seen_direct_write = true;
> > > > +				return true;
> > > > +			}
> > > > +			return false;
> > > > +		}
> > > > +		return true;
> > > > +
> > > >  	default:
> > > >  		return false;
> > > >  	}
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 55bfc941d17a..4652c0a005ca 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> > > >  	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> > > >  }
> > > >  
> > > > -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > > > +const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > > >  	.func		= bpf_sk_fullsock,
> > > >  	.gpl_only	= false,
> > > >  	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
> > > > @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
> > > >  	return (unsigned long)NULL;
> > > >  }
> > > >  
> > > > -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > > > +const struct bpf_func_proto bpf_tcp_sock_proto = {
> > > >  	.func		= bpf_tcp_sock,
> > > >  	.gpl_only	= false,
> > > >  	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
> > > > diff --git a/net/socket.c b/net/socket.c
> > > > index 72372dc5dd70..e8654f1f70e6 100644
> > > > --- a/net/socket.c
> > > > +++ b/net/socket.c
> > > > @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
> > > >  		if (err)
> > > >  			goto out_put;
> > > >  
> > > > +		err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> > > > +						     optval, optlen);
> > > > +		if (err < 0) {
> > > > +			goto out_put;
> > > > +		} else if (err > 0) {
> > > > +			err = 0;
> > > > +			goto out_put;
> > > > +		}
> > > > +
> > > >  		if (level == SOL_SOCKET)
> > > >  			err =
> > > >  			    sock_setsockopt(sock, level, optname, optval,
> > > > @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
> > > >  		if (err)
> > > >  			goto out_put;
> > > >  
> > > > +		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> > > > +						     optval, optlen);
> > > > +		if (err < 0) {
> > > > +			goto out_put;
> > > > +		} else if (err > 0) {
> > > > +			err = 0;
> > > > +			goto out_put;
> > > > +		}
> > > > +
> > > >  		if (level == SOL_SOCKET)
> > > >  			err =
> > > >  			    sock_getsockopt(sock, level, optname, optval,
> > > > -- 
> > > > 2.22.0.rc1.311.g5d7573a151-goog
> > > > 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 21:12       ` Andrii Nakryiko
@ 2019-06-05 21:30         ` Stanislav Fomichev
  0 siblings, 0 replies; 18+ messages in thread
From: Stanislav Fomichev @ 2019-06-05 21:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Stanislav Fomichev, Networking, bpf, davem, Alexei Starovoitov,
	Daniel Borkmann

On 06/05, Andrii Nakryiko wrote:
> On Wed, Jun 5, 2019 at 1:54 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >
> > On 06/05, Andrii Nakryiko wrote:
> > > On Tue, Jun 4, 2019 at 2:35 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > >
> > > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > >
> > > > The buffer memory is pre-allocated (because I don't think there is
> > > > a precedent for working with __user memory from bpf). This might be
> > >
> > > Is there any harm or technical complication in allowing BPF to read
> > > user memory directly? Or is it just uncharted territory, so there is
> > > no "guideline"? If it's the latter, it could be a good time to discuss
> > > that :)
> > The naive implementation would have two helpers: one to copy from user,
> > another to copy back to user; both of them would use something like
> > get_user/put_user, which can fault. Since we are running bpf progs with
> > preempt disabled and inside an RCU read section, we would need to do
> > something like we currently do in bpf_probe_read, where we disable pagefaults.
> >
> > To me it felt a bit excessive for a socket-options hook; a simple data buffer is
> > easier to work with from a BPF program, and we already have all the machinery
> > in place in the verifier. But I'm open to suggestions :-)
> 
> It's more like I'm discovering what the implications are :) I don't
> have suggestions, I was just curious. It seems like reading/writing
> user memory is a whole can of worms, so yeah, I'd stick to the buffer.
> 
> >
> > > > slow to do for each {s,g}etsockopt call, that's why I've added
> > > > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > > > attached to a cgroup. Note, however, that there is a race between
> > > > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > > > program layout might have changed; this should not be a problem
> > > > because in general there is a race between multiple calls to
> > > > {s,g}etsockopt and user adding/removing bpf progs from a cgroup.
> > > >
> > > > By default, the kernel code path is executed after the hook (to let
> > > > BPF handle only a subset of the options). There is a new
> > > > bpf_sockopt_handled handler that returns control to userspace
> > > > instead (bypassing the kernel handling).
> > > >
> > > > The return code is either 1 (success) or 0 (EPERM).
> > >
> > > Why not have 3 return values (success, disallow, consumed/bypass
> > > kernel logic) instead of an extra side-effecting helper?
> > That is an option. I didn't go that route because I wanted to
> > reuse BPF_PROG_RUN_ARRAY which has the following inside:
> >
> >         ret = 1;
> >         while (prog)
> >                 ret &= bpf_prog..()
> >
> > But given that we now have the BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
> > precedent, maybe that's worth it (essentially, have
> > BPF_PROG_CGROUP_SOCKOPS_RUN_ARRAY that handles 0, 1 and 2)?
> > I don't have a strong opinion here to be honest.
> 
> We are getting more types of BPF programs that are of a "controlling
> type", which communicate some decision back to the kernel
> (allow/deny/default handling, etc.). In all of those cases,
> communicating this "decision" via the return code feels much cleaner
> and more straightforward than through custom one-off helpers. So yeah,
> I'm voting for using the return code for that. I'd say using a helper
> would make sense only if the BPF program has to provide some complex
> information back to the kernel (e.g., a default string or whatever).
Agreed, since both you and Martin prefer a return code, I'll try
to switch to that (unless I hit some roadblock). Thanks!
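
For the archive, the kernel side would then boil down to something
like this (again, only a sketch; SOCKOPT_RUN_ARRAY is a hypothetical
run-array flavor that propagates the lowest prog verdict instead of
and-ing the results together):

	lock_sock(sk);
	ret = SOCKOPT_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
				&ctx);
	release_sock(sk);

	switch (ret) {
	case 0:
		return -EPERM;	/* rejected by a prog */
	case 1:
		return 0;	/* proceed with the kernel code path */
	default:
		return 1;	/* consumed, __sys_setsockopt returns 0 */
	}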

> >
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >  include/linux/bpf-cgroup.h |  29 ++++
> > > >  include/linux/bpf.h        |   2 +
> > > >  include/linux/bpf_types.h  |   1 +
> > > >  include/linux/filter.h     |  19 +++
> > > >  include/uapi/linux/bpf.h   |  17 ++-
> > > >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> > > >  kernel/bpf/syscall.c       |  19 +++
> > > >  kernel/bpf/verifier.c      |  12 ++
> > > >  net/core/filter.c          |   4 +-
> > > >  net/socket.c               |  18 +++
> > > >  10 files changed, 406 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > > index b631ee75762d..406f1ba82531 100644
> > > > --- a/include/linux/bpf-cgroup.h
> > > > +++ b/include/linux/bpf-cgroup.h
> > > > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > > >                                    loff_t *ppos, void **new_buf,
> > > >                                    enum bpf_attach_type type);
> > > >
> > > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > > > +                                      int optname, char __user *optval,
> > > > +                                      unsigned int optlen);
> > > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > > > +                                      int optname, char __user *optval,
> > > > +                                      int __user *optlen);
> > > > +
> > > >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> > > >         struct bpf_map *map)
> > > >  {
> > > > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> > > >         __ret;                                                                 \
> > > >  })
> > > >
> > > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > > +({                                                                            \
> > > > +       int __ret = 0;                                                         \
> > > > +       if (cgroup_bpf_enabled)                                                \
> > > > +               __ret = __cgroup_bpf_run_filter_setsockopt(sock, level,        \
> > > > +                                                          optname, optval,    \
> > > > +                                                          optlen);            \
> > > > +       __ret;                                                                 \
> > > > +})
> > > > +
> > > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > > +({                                                                            \
> > > > +       int __ret = 0;                                                         \
> > > > +       if (cgroup_bpf_enabled)                                                \
> > > > +               __ret = __cgroup_bpf_run_filter_getsockopt(sock, level,        \
> > > > +                                                          optname, optval,    \
> > > > +                                                          optlen);            \
> > > > +       __ret;                                                                 \
> > > > +})
> > > > +
> > > >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> > > >                            enum bpf_prog_type ptype, struct bpf_prog *prog);
> > > >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > > > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> > > >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> > > >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> > > >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > >
> > > >  #define for_each_cgroup_storage_type(stype) for (; false; )
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index e5a309e6a400..fb4e6ef5a971 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> > > >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > > >  extern const struct bpf_func_proto bpf_strtol_proto;
> > > >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > > > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > > > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > > >
> > > >  /* Shared helpers among cBPF and eBPF. */
> > > >  void bpf_user_rnd_init_once(void);
> > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > index 5a9975678d6f..eec5aeeeaf92 100644
> > > > --- a/include/linux/bpf_types.h
> > > > +++ b/include/linux/bpf_types.h
> > > > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> > > >  #ifdef CONFIG_CGROUP_BPF
> > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > > > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> > > >  #endif
> > > >  #ifdef CONFIG_BPF_LIRC_MODE2
> > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > index 43b45d6db36d..7a07fd2e14d3 100644
> > > > --- a/include/linux/filter.h
> > > > +++ b/include/linux/filter.h
> > > > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> > > >         u64 tmp_reg;
> > > >  };
> > > >
> > > > +struct bpf_sockopt_kern {
> > > > +       struct sock     *sk;
> > > > +       s32             level;
> > > > +       s32             optname;
> > > > +       u32             optlen;
> > > > +       u8              *optval;
> > > > +       u8              *optval_end;
> > > > +
> > > > +       /* If true, BPF program had consumed the sockopt request.
> > > > +        * Control is returned to the userspace (i.e. kernel doesn't
> > > > +        * handle this option).
> > > > +        */
> > > > +       bool            handled;
> > > > +
> > > > +       /* Small on-stack optval buffer to avoid small allocations.
> > > > +        */
> > > > +       u8 buf[64];
> > > > +};
> > > > +
> > > >  #endif /* __LINUX_FILTER_H__ */
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 7c6aef253173..b6c3891241ef 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> > > >         BPF_PROG_TYPE_FLOW_DISSECTOR,
> > > >         BPF_PROG_TYPE_CGROUP_SYSCTL,
> > > >         BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > > > +       BPF_PROG_TYPE_CGROUP_SOCKOPT,
> > > >  };
> > > >
> > > >  enum bpf_attach_type {
> > > > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> > > >         BPF_LIRC_MODE2,
> > > >         BPF_FLOW_DISSECTOR,
> > > >         BPF_CGROUP_SYSCTL,
> > > > +       BPF_CGROUP_GETSOCKOPT,
> > > > +       BPF_CGROUP_SETSOCKOPT,
> > > >         __MAX_BPF_ATTACH_TYPE
> > > >  };
> > > >
> > > > @@ -2815,7 +2818,8 @@ union bpf_attr {
> > > >         FN(strtoul),                    \
> > > >         FN(sk_storage_get),             \
> > > >         FN(sk_storage_delete),          \
> > > > -       FN(send_signal),
> > > > +       FN(send_signal),                \
> > > > +       FN(sockopt_handled),
> > > >
> > > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > >   * function eBPF program intends to call
> > > > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> > > >                                  */
> > > >  };
> > > >
> > > > +struct bpf_sockopt {
> > > > +       __bpf_md_ptr(struct bpf_sock *, sk);
> > > > +
> > > > +       __s32   level;
> > > > +       __s32   optname;
> > > > +
> > > > +       __u32   optlen;
> > > > +       __u32   optval;
> > > > +       __u32   optval_end;
> > > > +};
> > > > +
> > > >  #endif /* _UAPI__LINUX_BPF_H__ */
> > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > > index 1b65ab0df457..4ec99ea97023 100644
> > > > --- a/kernel/bpf/cgroup.c
> > > > +++ b/kernel/bpf/cgroup.c
> > > > @@ -18,6 +18,7 @@
> > > >  #include <linux/bpf.h>
> > > >  #include <linux/bpf-cgroup.h>
> > > >  #include <net/sock.h>
> > > > +#include <net/bpf_sk_storage.h>
> > > >
> > > >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> > > >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > > > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > > >  }
> > > >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> > > >
> > > > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > > > +                                       enum bpf_attach_type attach_type)
> > > > +{
> > > > +       struct bpf_prog_array *prog_array;
> > > > +       int nr;
> > > > +
> > > > +       rcu_read_lock();
> > > > +       prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > > > +       nr = bpf_prog_array_length(prog_array);
> > > > +       rcu_read_unlock();
> > > > +
> > > > +       return nr > 0;
> > > > +}
> > > > +
> > > > +static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen)
> > > > +{
> > > > +       if (unlikely(max_optlen > PAGE_SIZE))
> > > > +               return -EINVAL;
> > > > +
> > > > +       if (likely(max_optlen <= sizeof(ctx->buf))) {
> > > > +               ctx->optval = ctx->buf;
> > > > +       } else {
> > > > +               ctx->optval = kzalloc(max_optlen, GFP_USER);
> > > > +               if (!ctx->optval)
> > > > +                       return -ENOMEM;
> > > > +       }
> > > > +
> > > > +       ctx->optval_end = ctx->optval + max_optlen;
> > > > +       ctx->optlen = max_optlen;
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static void sockopt_free_buf(struct bpf_sockopt_kern *ctx)
> > > > +{
> > > > +       if (unlikely(ctx->optval != ctx->buf))
> > > > +               kfree(ctx->optval);
> > > > +}
> > > > +
> > > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int level,
> > > > +                                      int optname, char __user *optval,
> > > > +                                      unsigned int optlen)
> > > > +{
> > > > +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > > +       struct bpf_sockopt_kern ctx = {
> > > > +               .sk = sk,
> > > > +               .level = level,
> > > > +               .optname = optname,
> > > > +       };
> > > > +       int ret;
> > > > +
> > > > +       /* Opportunistic check to see whether we have any BPF program
> > > > +        * attached to the hook so we don't waste time allocating
> > > > +        * memory and locking the socket.
> > > > +        */
> > > > +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_SETSOCKOPT))
> > > > +               return 0;
> > > > +
> > > > +       ret = sockopt_alloc_buf(&ctx, optlen);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       if (copy_from_user(ctx.optval, optval, optlen) != 0) {
> > > > +               sockopt_free_buf(&ctx);
> > > > +               return -EFAULT;
> > > > +       }
> > > > +
> > > > +       lock_sock(sk);
> > > > +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT],
> > > > +                                &ctx, BPF_PROG_RUN);
> > > > +       release_sock(sk);
> > > > +
> > > > +       sockopt_free_buf(&ctx);
> > > > +
> > > > +       if (!ret)
> > > > +               return -EPERM;
> > > > +
> > > > +       return ctx.handled ? 1 : 0;
> > > > +}
> > > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt);
> > > > +
> > > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> > > > +                                      int optname, char __user *optval,
> > > > +                                      int __user *optlen)
> > > > +{
> > > > +       struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > > +       struct bpf_sockopt_kern ctx = {
> > > > +               .sk = sk,
> > > > +               .level = level,
> > > > +               .optname = optname,
> > > > +       };
> > > > +       int max_optlen;
> > > > +       char buf[64];
> > > > +       int ret;
> > > > +
> > > > +       /* Opportunistic check to see whether we have any BPF program
> > > > +        * attached to the hook so we don't waste time allocating
> > > > +        * memory and locking the socket.
> > > > +        */
> > > > +       if (!__cgroup_bpf_has_prog_array(cgrp, BPF_CGROUP_GETSOCKOPT))
> > > > +               return 0;
> > > > +
> > > > +       if (get_user(max_optlen, optlen))
> > > > +               return -EFAULT;
> > > > +
> > > > +       ret = sockopt_alloc_buf(&ctx, max_optlen);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       lock_sock(sk);
> > > > +       ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > > > +                                &ctx, BPF_PROG_RUN);
> > > > +       release_sock(sk);
> > > > +
> > > > +       if (ctx.optlen > max_optlen) {
> > > > +               sockopt_free_buf(&ctx);
> > > > +               return -EFAULT;
> > > > +       }
> > > > +
> > > > +       if (copy_to_user(optval, ctx.optval, ctx.optlen) != 0) {
> > > > +               sockopt_free_buf(&ctx);
> > > > +               return -EFAULT;
> > > > +       }
> > > > +
> > > > +       sockopt_free_buf(&ctx);
> > > > +
> > > > +       if (put_user(ctx.optlen, optlen))
> > > > +               return -EFAULT;
> > > > +
> > > > +       if (!ret)
> > > > +               return -EPERM;
> > > > +
> > > > +       return ctx.handled ? 1 : 0;
> > > > +}
> > > > +EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt);
> > > > +
> > > >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
> > > >                               size_t *lenp)
> > > >  {
> > > > @@ -1184,3 +1321,154 @@ const struct bpf_verifier_ops cg_sysctl_verifier_ops = {
> > > >
> > > >  const struct bpf_prog_ops cg_sysctl_prog_ops = {
> > > >  };
> > > > +
> > > > +BPF_CALL_1(bpf_sockopt_handled, struct bpf_sockopt_kern *, ctx)
> > > > +{
> > > > +       ctx->handled = true;
> > > > +       return 1;
> > > > +}
> > > > +
> > > > +static const struct bpf_func_proto bpf_sockopt_handled_proto = {
> > > > +       .func           = bpf_sockopt_handled,
> > > > +       .gpl_only       = false,
> > > > +       .arg1_type      = ARG_PTR_TO_CTX,
> > > > +       .ret_type       = RET_INTEGER,
> > > > +};
> > > > +
> > > > +static const struct bpf_func_proto *
> > > > +cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > +{
> > > > +       switch (func_id) {
> > > > +       case BPF_FUNC_sockopt_handled:
> > > > +               return &bpf_sockopt_handled_proto;
> > > > +       case BPF_FUNC_sk_fullsock:
> > > > +               return &bpf_sk_fullsock_proto;
> > > > +       case BPF_FUNC_sk_storage_get:
> > > > +               return &bpf_sk_storage_get_proto;
> > > > +       case BPF_FUNC_sk_storage_delete:
> > > > +               return &bpf_sk_storage_delete_proto;
> > > > +#ifdef CONFIG_INET
> > > > +       case BPF_FUNC_tcp_sock:
> > > > +               return &bpf_tcp_sock_proto;
> > > > +#endif
> > > > +       default:
> > > > +               return cgroup_base_func_proto(func_id, prog);
> > > > +       }
> > > > +}
> > > > +
> > > > +static bool cg_sockopt_is_valid_access(int off, int size,
> > > > +                                      enum bpf_access_type type,
> > > > +                                      const struct bpf_prog *prog,
> > > > +                                      struct bpf_insn_access_aux *info)
> > > > +{
> > > > +       const int size_default = sizeof(__u32);
> > > > +
> > > > +       if (off < 0 || off >= sizeof(struct bpf_sockopt))
> > > > +               return false;
> > > > +
> > > > +       if (off % size != 0)
> > > > +               return false;
> > > > +
> > > > +       if (type == BPF_WRITE) {
> > > > +               switch (off) {
> > > > +               case offsetof(struct bpf_sockopt, optlen):
> > > > +                       if (size != size_default)
> > > > +                               return false;
> > > > +                       return prog->expected_attach_type ==
> > > > +                               BPF_CGROUP_GETSOCKOPT;
> > > > +               default:
> > > > +                       return false;
> > > > +               }
> > > > +       }
> > > > +
> > > > +       switch (off) {
> > > > +       case offsetof(struct bpf_sockopt, sk):
> > > > +               if (size != sizeof(__u64))
> > > > +                       return false;
> > > > +               info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
> > > > +               break;
> > > > +       case bpf_ctx_range(struct bpf_sockopt, optval):
> > > > +               if (size != size_default)
> > > > +                       return false;
> > > > +               info->reg_type = PTR_TO_PACKET;
> > > > +               break;
> > > > +       case bpf_ctx_range(struct bpf_sockopt, optval_end):
> > > > +               if (size != size_default)
> > > > +                       return false;
> > > > +               info->reg_type = PTR_TO_PACKET_END;
> > > > +               break;
> > > > +       default:
> > > > +               if (size != size_default)
> > > > +                       return false;
> > > > +               break;
> > >
> > > nit, just:
> > >
> > > return size == size_default
> > >
> > > ?
> > >
> > > > +       }
> > > > +       return true;
> > > > +}
> > > > +
> > > > +static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
> > > > +                                        const struct bpf_insn *si,
> > > > +                                        struct bpf_insn *insn_buf,
> > > > +                                        struct bpf_prog *prog,
> > > > +                                        u32 *target_size)
> > > > +{
> > > > +       struct bpf_insn *insn = insn_buf;
> > > > +
> > > > +       switch (si->off) {
> > > > +       case offsetof(struct bpf_sockopt, sk):
> > > > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, sk),
> > > > +                                     si->dst_reg, si->src_reg,
> > > > +                                     offsetof(struct bpf_sockopt_kern, sk));
> > > > +               break;
> > > > +       case offsetof(struct bpf_sockopt, level):
> > > > +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +                                     bpf_target_off(struct bpf_sockopt_kern,
> > > > +                                                    level, 4, target_size));
> > > > +               break;
> > > > +       case offsetof(struct bpf_sockopt, optname):
> > > > +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +                                     bpf_target_off(struct bpf_sockopt_kern,
> > > > +                                                    optname, 4, target_size));
> > > > +               break;
> > > > +       case offsetof(struct bpf_sockopt, optlen):
> > > > +               if (type == BPF_WRITE)
> > > > +                       *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +                                             bpf_target_off(struct bpf_sockopt_kern,
> > > > +                                                            optlen, 4, target_size));
> > > > +               else
> > > > +                       *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> > > > +                                             bpf_target_off(struct bpf_sockopt_kern,
> > > > +                                                            optlen, 4, target_size));
> > > > +               break;
> > > > +       case offsetof(struct bpf_sockopt, optval):
> > > > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval),
> > > > +                                     si->dst_reg, si->src_reg,
> > > > +                                     offsetof(struct bpf_sockopt_kern, optval));
> > > > +               break;
> > > > +       case offsetof(struct bpf_sockopt, optval_end):
> > > > +               *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, optval_end),
> > > > +                                     si->dst_reg, si->src_reg,
> > > > +                                     offsetof(struct bpf_sockopt_kern, optval_end));
> > > > +               break;
> > > > +       }
> > > > +
> > > > +       return insn - insn_buf;
> > > > +}
> > > > +
> > > > +static int cg_sockopt_get_prologue(struct bpf_insn *insn_buf,
> > > > +                                  bool direct_write,
> > > > +                                  const struct bpf_prog *prog)
> > > > +{
> > > > +       /* Nothing to do for sockopt argument. The data is kzalloc'ated.
> > > > +        */
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +const struct bpf_verifier_ops cg_sockopt_verifier_ops = {
> > > > +       .get_func_proto         = cg_sockopt_func_proto,
> > > > +       .is_valid_access        = cg_sockopt_is_valid_access,
> > > > +       .convert_ctx_access     = cg_sockopt_convert_ctx_access,
> > > > +       .gen_prologue           = cg_sockopt_get_prologue,
> > > > +};
> > > > +
> > > > +const struct bpf_prog_ops cg_sockopt_prog_ops = {
> > > > +};
> > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > index 4c53cbd3329d..4ad2b5f1905f 100644
> > > > --- a/kernel/bpf/syscall.c
> > > > +++ b/kernel/bpf/syscall.c
> > > > @@ -1596,6 +1596,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> > > >                 default:
> > > >                         return -EINVAL;
> > > >                 }
> > > > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > > +               switch (expected_attach_type) {
> > > > +               case BPF_CGROUP_SETSOCKOPT:
> > > > +               case BPF_CGROUP_GETSOCKOPT:
> > > > +                       return 0;
> > > > +               default:
> > > > +                       return -EINVAL;
> > > > +               }
> > > >         default:
> > > >                 return 0;
> > > >         }
> > > > @@ -1846,6 +1854,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> > > >         switch (prog->type) {
> > > >         case BPF_PROG_TYPE_CGROUP_SOCK:
> > > >         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > > > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > >                 return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> > > >         case BPF_PROG_TYPE_CGROUP_SKB:
> > > >                 return prog->enforce_expected_attach_type &&
> > > > @@ -1916,6 +1925,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> > > >         case BPF_CGROUP_SYSCTL:
> > > >                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > > >                 break;
> > > > +       case BPF_CGROUP_GETSOCKOPT:
> > > > +       case BPF_CGROUP_SETSOCKOPT:
> > > > +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > > +               break;
> > > >         default:
> > > >                 return -EINVAL;
> > > >         }
> > > > @@ -1997,6 +2010,10 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> > > >         case BPF_CGROUP_SYSCTL:
> > > >                 ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
> > > >                 break;
> > > > +       case BPF_CGROUP_GETSOCKOPT:
> > > > +       case BPF_CGROUP_SETSOCKOPT:
> > > > +               ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
> > > > +               break;
> > > >         default:
> > > >                 return -EINVAL;
> > > >         }
> > > > @@ -2031,6 +2048,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> > > >         case BPF_CGROUP_SOCK_OPS:
> > > >         case BPF_CGROUP_DEVICE:
> > > >         case BPF_CGROUP_SYSCTL:
> > > > +       case BPF_CGROUP_GETSOCKOPT:
> > > > +       case BPF_CGROUP_SETSOCKOPT:
> > > >                 break;
> > > >         case BPF_LIRC_MODE2:
> > > >                 return lirc_prog_query(attr, uattr);
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 5c2cb5bd84ce..b91fde10e721 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -1717,6 +1717,18 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> > > >
> > > >                 env->seen_direct_write = true;
> > > >                 return true;
> > > > +
> > > > +       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> > > > +               if (t == BPF_WRITE) {
> > > > +                       if (env->prog->expected_attach_type ==
> > > > +                           BPF_CGROUP_GETSOCKOPT) {
> > > > +                               env->seen_direct_write = true;
> > > > +                               return true;
> > > > +                       }
> > > > +                       return false;
> > > > +               }
> > > > +               return true;
> > > > +
> > > >         default:
> > > >                 return false;
> > > >         }
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 55bfc941d17a..4652c0a005ca 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -1835,7 +1835,7 @@ BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> > > >         return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> > > >  }
> > > >
> > > > -static const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > > > +const struct bpf_func_proto bpf_sk_fullsock_proto = {
> > > >         .func           = bpf_sk_fullsock,
> > > >         .gpl_only       = false,
> > > >         .ret_type       = RET_PTR_TO_SOCKET_OR_NULL,
> > > > @@ -5636,7 +5636,7 @@ BPF_CALL_1(bpf_tcp_sock, struct sock *, sk)
> > > >         return (unsigned long)NULL;
> > > >  }
> > > >
> > > > -static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > > > +const struct bpf_func_proto bpf_tcp_sock_proto = {
> > > >         .func           = bpf_tcp_sock,
> > > >         .gpl_only       = false,
> > > >         .ret_type       = RET_PTR_TO_TCP_SOCK_OR_NULL,
> > > > diff --git a/net/socket.c b/net/socket.c
> > > > index 72372dc5dd70..e8654f1f70e6 100644
> > > > --- a/net/socket.c
> > > > +++ b/net/socket.c
> > > > @@ -2069,6 +2069,15 @@ static int __sys_setsockopt(int fd, int level, int optname,
> > > >                 if (err)
> > > >                         goto out_put;
> > > >
> > > > +               err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, level, optname,
> > > > +                                                    optval, optlen);
> > > > +               if (err < 0) {
> > > > +                       goto out_put;
> > > > +               } else if (err > 0) {
> > > > +                       err = 0;
> > > > +                       goto out_put;
> > > > +               }
> > > > +
> > > >                 if (level == SOL_SOCKET)
> > > >                         err =
> > > >                             sock_setsockopt(sock, level, optname, optval,
> > > > @@ -2106,6 +2115,15 @@ static int __sys_getsockopt(int fd, int level, int optname,
> > > >                 if (err)
> > > >                         goto out_put;
> > > >
> > > > +               err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname,
> > > > +                                                    optval, optlen);
> > > > +               if (err < 0) {
> > > > +                       goto out_put;
> > > > +               } else if (err > 0) {
> > > > +                       err = 0;
> > > > +                       goto out_put;
> > > > +               }
> > > > +
> > > >                 if (level == SOL_SOCKET)
> > > >                         err =
> > > >                             sock_getsockopt(sock, level, optname, optval,
> > > > --
> > > > 2.22.0.rc1.311.g5d7573a151-goog
> > > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH bpf-next 1/7] bpf: implement getsockopt and setsockopt hooks
  2019-06-05 21:16         ` Stanislav Fomichev
@ 2019-06-05 21:41           ` Martin Lau
  0 siblings, 0 replies; 18+ messages in thread
From: Martin Lau @ 2019-06-05 21:41 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel

On Wed, Jun 05, 2019 at 02:16:30PM -0700, Stanislav Fomichev wrote:
> On 06/05, Martin Lau wrote:
> > On Wed, Jun 05, 2019 at 12:17:24PM -0700, Stanislav Fomichev wrote:
> > > On 06/05, Martin Lau wrote:
> > > > On Tue, Jun 04, 2019 at 02:35:18PM -0700, Stanislav Fomichev wrote:
> > > > > Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
> > > > > BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
> > > > > 
> > > > > BPF_CGROUP_SETSOCKOPT gets a read-only view of the setsockopt arguments.
> > > > > BPF_CGROUP_GETSOCKOPT can modify the supplied buffer.
> > > > > Both of them reuse existing PTR_TO_PACKET{,_END} infrastructure.
> > > > > 
> > > > > The buffer memory is pre-allocated (because I don't think there is
> > > > > a precedent for working with __user memory from bpf). This might be
> > > > > slow to do for each {s,g}etsockopt call, that's why I've added
> > > > > __cgroup_bpf_has_prog_array that exits early if there is nothing
> > > > > attached to a cgroup. Note, however, that there is a race between
> > > > > __cgroup_bpf_has_prog_array and BPF_PROG_RUN_ARRAY where cgroup
> > > > > program layout might have changed; this should not be a problem
> > > > > because in general there is a race between multiple calls to
> > > > > {s,g}etsockopt and user adding/removing bpf progs from a cgroup.
> > > > > 
> > > > > By default, the kernel code path is executed after the hook (to let
> > > > > BPF handle only a subset of the options). There is a new
> > > > > bpf_sockopt_handled handler that returns control to userspace
> > > > > instead (bypassing the kernel handling).
> > > > > 
> > > > > The return code is either 1 (success) or 0 (EPERM).
> > > > > 
> > > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > > ---
> > > > >  include/linux/bpf-cgroup.h |  29 ++++
> > > > >  include/linux/bpf.h        |   2 +
> > > > >  include/linux/bpf_types.h  |   1 +
> > > > >  include/linux/filter.h     |  19 +++
> > > > >  include/uapi/linux/bpf.h   |  17 ++-
> > > > >  kernel/bpf/cgroup.c        | 288 +++++++++++++++++++++++++++++++++++++
> > > > >  kernel/bpf/syscall.c       |  19 +++
> > > > >  kernel/bpf/verifier.c      |  12 ++
> > > > >  net/core/filter.c          |   4 +-
> > > > >  net/socket.c               |  18 +++
> > > > >  10 files changed, 406 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > > > > index b631ee75762d..406f1ba82531 100644
> > > > > --- a/include/linux/bpf-cgroup.h
> > > > > +++ b/include/linux/bpf-cgroup.h
> > > > > @@ -124,6 +124,13 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > > > >  				   loff_t *ppos, void **new_buf,
> > > > >  				   enum bpf_attach_type type);
> > > > >  
> > > > > +int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int level,
> > > > > +				       int optname, char __user *optval,
> > > > > +				       unsigned int optlen);
> > > > > +int __cgroup_bpf_run_filter_getsockopt(struct sock *sock, int level,
> > > > > +				       int optname, char __user *optval,
> > > > > +				       int __user *optlen);
> > > > > +
> > > > >  static inline enum bpf_cgroup_storage_type cgroup_storage_type(
> > > > >  	struct bpf_map *map)
> > > > >  {
> > > > > @@ -280,6 +287,26 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
> > > > >  	__ret;								       \
> > > > >  })
> > > > >  
> > > > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > > > +({									       \
> > > > > +	int __ret = 0;							       \
> > > > > +	if (cgroup_bpf_enabled)						       \
> > > > > +		__ret = __cgroup_bpf_run_filter_setsockopt(sock, level,	       \
> > > > > +							   optname, optval,    \
> > > > > +							   optlen);	       \
> > > > > +	__ret;								       \
> > > > > +})
> > > > > +
> > > > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen)   \
> > > > > +({									       \
> > > > > +	int __ret = 0;							       \
> > > > > +	if (cgroup_bpf_enabled)						       \
> > > > > +		__ret = __cgroup_bpf_run_filter_getsockopt(sock, level,	       \
> > > > > +							   optname, optval,    \
> > > > > +							   optlen);	       \
> > > > > +	__ret;								       \
> > > > > +})
> > > > > +
> > > > >  int cgroup_bpf_prog_attach(const union bpf_attr *attr,
> > > > >  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
> > > > >  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
> > > > > @@ -349,6 +376,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
> > > > >  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> > > > >  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
> > > > >  #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
> > > > > +#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > > > +#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen) ({ 0; })
> > > > >  
> > > > >  #define for_each_cgroup_storage_type(stype) for (; false; )
> > > > >  
> > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > index e5a309e6a400..fb4e6ef5a971 100644
> > > > > --- a/include/linux/bpf.h
> > > > > +++ b/include/linux/bpf.h
> > > > > @@ -1054,6 +1054,8 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
> > > > >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> > > > >  extern const struct bpf_func_proto bpf_strtol_proto;
> > > > >  extern const struct bpf_func_proto bpf_strtoul_proto;
> > > > > +extern const struct bpf_func_proto bpf_sk_fullsock_proto;
> > > > > +extern const struct bpf_func_proto bpf_tcp_sock_proto;
> > > > >  
> > > > >  /* Shared helpers among cBPF and eBPF. */
> > > > >  void bpf_user_rnd_init_once(void);
> > > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > > index 5a9975678d6f..eec5aeeeaf92 100644
> > > > > --- a/include/linux/bpf_types.h
> > > > > +++ b/include/linux/bpf_types.h
> > > > > @@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
> > > > >  #ifdef CONFIG_CGROUP_BPF
> > > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> > > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
> > > > > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
> > > > >  #endif
> > > > >  #ifdef CONFIG_BPF_LIRC_MODE2
> > > > >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> > > > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > > > index 43b45d6db36d..7a07fd2e14d3 100644
> > > > > --- a/include/linux/filter.h
> > > > > +++ b/include/linux/filter.h
> > > > > @@ -1199,4 +1199,23 @@ struct bpf_sysctl_kern {
> > > > >  	u64 tmp_reg;
> > > > >  };
> > > > >  
> > > > > +struct bpf_sockopt_kern {
> > > > > +	struct sock	*sk;
> > > > > +	s32		level;
> > > > > +	s32		optname;
> > > > > +	u32		optlen;
> > > > It seems there is a hole.
> > > Ack, will move the pointers up.
> > > 
> > > > > +	u8		*optval;
> > > > > +	u8		*optval_end;
> > > > > +
> > > > > +	/* If true, BPF program had consumed the sockopt request.
> > > > > +	 * Control is returned to the userspace (i.e. kernel doesn't
> > > > > +	 * handle this option).
> > > > > +	 */
> > > > > +	bool		handled;
> > > > > +
> > > > > +	/* Small on-stack optval buffer to avoid small allocations.
> > > > > +	 */
> > > > > +	u8 buf[64];
> > > > Is it better to align to 8 bytes?
> > > Do you mean manually setting the size to 64 + x, where x is the
> > > remainder needed to align to 8 bytes? Is there a macro to help with that, maybe?
> > I think __attribute__((aligned(8))) should do.
> Ah, you meant to align the buffer itself to avoid unaligned
> access from the bpf progs. Got it, will do.
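
A sketch of where the struct seems headed after both review points
(pointers hoisted above the 32-bit members to close the hole, buffer
explicitly aligned); the final layout is of course up to the next
revision of the patch:

	struct bpf_sockopt_kern {
		struct sock	*sk;
		u8		*optval;
		u8		*optval_end;
		s32		level;
		s32		optname;
		u32		optlen;

		/* If true, BPF program has consumed the sockopt request. */
		bool		handled;

		/* Small on-stack optval buffer, 8-byte aligned so bpf
		 * progs don't end up doing unaligned loads through it.
		 */
		u8		buf[64] __aligned(8);
	};
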
> 
> > > 
> > > > > +};
> > > > > +
> > > > >  #endif /* __LINUX_FILTER_H__ */
> > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > index 7c6aef253173..b6c3891241ef 100644
> > > > > --- a/include/uapi/linux/bpf.h
> > > > > +++ b/include/uapi/linux/bpf.h
> > > > > @@ -170,6 +170,7 @@ enum bpf_prog_type {
> > > > >  	BPF_PROG_TYPE_FLOW_DISSECTOR,
> > > > >  	BPF_PROG_TYPE_CGROUP_SYSCTL,
> > > > >  	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
> > > > > +	BPF_PROG_TYPE_CGROUP_SOCKOPT,
> > > > >  };
> > > > >  
> > > > >  enum bpf_attach_type {
> > > > > @@ -192,6 +193,8 @@ enum bpf_attach_type {
> > > > >  	BPF_LIRC_MODE2,
> > > > >  	BPF_FLOW_DISSECTOR,
> > > > >  	BPF_CGROUP_SYSCTL,
> > > > > +	BPF_CGROUP_GETSOCKOPT,
> > > > > +	BPF_CGROUP_SETSOCKOPT,
> > > > >  	__MAX_BPF_ATTACH_TYPE
> > > > >  };
> > > > >  
> > > > > @@ -2815,7 +2818,8 @@ union bpf_attr {
> > > > >  	FN(strtoul),			\
> > > > >  	FN(sk_storage_get),		\
> > > > >  	FN(sk_storage_delete),		\
> > > > > -	FN(send_signal),
> > > > > +	FN(send_signal),		\
> > > > > +	FN(sockopt_handled),
> > > > Document.
> > > Ah, totally forgot about that, sure, will do!
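
Since the hunk above only adds the FN() entry, a sketch of how the
documentation comment and a call site might look; the one-argument
signature here is an assumption based on the commit message, not the
patch's actual definition:

	/* int bpf_sockopt_handled(struct bpf_sockopt *ctx)
	 *	Description
	 *		Mark the current {g,s}etsockopt request as fully
	 *		handled by BPF: control returns to userspace and
	 *		the kernel's own option handling is skipped.
	 *	Return
	 *		0 on success.
	 */

	/* In a program: consume a custom option entirely in BPF. */
	if (ctx->level == MY_SOL_CUSTOM) {	/* hypothetical level */
		bpf_sockopt_handled(ctx);
		return 1;
	}
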
> > > 
> > > > >  
> > > > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > > >   * function eBPF program intends to call
> > > > > @@ -3533,4 +3537,15 @@ struct bpf_sysctl {
> > > > >  				 */
> > > > >  };
> > > > >  
> > > > > +struct bpf_sockopt {
> > > > > +	__bpf_md_ptr(struct bpf_sock *, sk);
> > > > > +
> > > > > +	__s32	level;
> > > > > +	__s32	optname;
> > > > > +
> > > > > +	__u32	optlen;
> > > > > +	__u32	optval;
> > > > > +	__u32	optval_end;
> > > > > +};
> > > > > +
> > > > >  #endif /* _UAPI__LINUX_BPF_H__ */
> > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > > > index 1b65ab0df457..4ec99ea97023 100644
> > > > > --- a/kernel/bpf/cgroup.c
> > > > > +++ b/kernel/bpf/cgroup.c
> > > > > @@ -18,6 +18,7 @@
> > > > >  #include <linux/bpf.h>
> > > > >  #include <linux/bpf-cgroup.h>
> > > > >  #include <net/sock.h>
> > > > > +#include <net/bpf_sk_storage.h>
> > > > >  
> > > > >  DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
> > > > >  EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> > > > > @@ -924,6 +925,142 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
> > > > >  }
> > > > >  EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl);
> > > > >  
> > > > > +static bool __cgroup_bpf_has_prog_array(struct cgroup *cgrp,
> > > > > +					enum bpf_attach_type attach_type)
> > > > > +{
> > > > > +	struct bpf_prog_array *prog_array;
> > > > > +	int nr;
> > > > > +
> > > > > +	rcu_read_lock();
> > > > > +	prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]);
> > > > > +	nr = bpf_prog_array_length(prog_array);
> > > > Nit. It seems unnecessary to loop through the whole
> > > > array when the only signal needed is whether it is non-empty.
> > > Oh, good point. I guess I'd have to add another helper like
> > > bpf_prog_array_is_empty() and return early. Any other suggestions?
> > I was thinking of checking empty_prog_array up front, but that felt
> > like overkill, so I didn't mention it.  I think just returning
> > early is good enough.
> [..]
> > I think this non-zero check is good to have before doing lock_sock().
> And not before the allocation? I was trying to optimize for both kmalloc
> and lock_sock (since, I guess, the majority of cgroups would not
> have any sockopt progs, so there is no point in paying the kmalloc
> cost either).
+1
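
For reference, a sketch of the early-exit helper being discussed,
assuming the existing bpf_prog_array_item layout where unused slots
hold the dummy prog (bpf_prog_array_length() walks the same items but
counts every entry):

	bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
	{
		struct bpf_prog_array_item *item;

		/* Stop at the first real program instead of counting. */
		for (item = array->items; item->prog; item++)
			if (item->prog != &dummy_bpf_prog.prog)
				return false;
		return true;
	}

Per the exchange above, the check would then sit at the very top of
__cgroup_bpf_run_filter_{s,g}etsockopt(), before both the kmalloc and
lock_sock().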

Thread overview: 18+ messages
2019-06-04 21:35 [PATCH bpf-next 0/7] bpf: getsockopt and setsockopt hooks Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 1/7] bpf: implement " Stanislav Fomichev
2019-06-05 18:47   ` Martin Lau
2019-06-05 19:17     ` Stanislav Fomichev
2019-06-05 20:50       ` Martin Lau
2019-06-05 21:16         ` Stanislav Fomichev
2019-06-05 21:41           ` Martin Lau
2019-06-05 19:32   ` Andrii Nakryiko
2019-06-05 20:54     ` Stanislav Fomichev
2019-06-05 21:12       ` Andrii Nakryiko
2019-06-05 21:30         ` Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 2/7] bpf: sync bpf.h to tools/ Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 3/7] libbpf: support sockopt hooks Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 4/7] selftests/bpf: test sockopt section name Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 5/7] selftests/bpf: add sockopt test Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 6/7] bpf: add sockopt documentation Stanislav Fomichev
2019-06-04 21:35 ` [PATCH bpf-next 7/7] bpftool: support cgroup sockopt Stanislav Fomichev
2019-06-04 21:55   ` Jakub Kicinski
