* [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS
@ 2019-12-21  6:25 Martin KaFai Lau
  2019-12-21  6:25 ` [PATCH bpf-next v2 01/11] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
                   ` (10 more replies)
  0 siblings, 11 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:25 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This series introduces BPF STRUCT_OPS.  It is an infrastructure that
allows implementing some specific kernel function pointers in BPF.
The first use case included in this series is to implement a
TCP congestion control algorithm in BPF (i.e. implement
struct tcp_congestion_ops in BPF).

There have been attempts to move TCP CC to user space
(e.g. CCP in TCP).  The common arguments are a faster turnaround and
getting away from long-tail kernel versions in production, etc.,
which are legitimate points.

BPF has been a continuous effort to combine the upsides of both
kernel and user space (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular the BTF-aware verifier, BPF trampoline,
and BPF CO-RE...) make implementing kernel struct ops (e.g. tcp cc)
possible in BPF.

The idea is to allow implementing tcp_congestion_ops in bpf.
It allows a faster turnaround for testing algorithms in
production while leveraging the existing (and continually growing)
BPF features/framework instead of building one specifically for
userspace TCP CC.
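
To illustrate, a bpf TCP CC could roughly look like the sketch below.
This is an illustrative sketch only (the real example in this series
is bpf_dctcp.c in patch 11); it assumes the helpers and struct mirrors
from the series' bpf_tcp_helpers.h and is not a complete CC (a
registerable one also needs cong_avoid or cong_control):

#include "bpf_tcp_helpers.h"

char _license[] SEC("license") = "GPL";

SEC("struct_ops/sample_ssthresh")
__u32 sample_ssthresh(struct sock *sk)
{
	/* no-op: keep the current ssthresh */
	return tcp_sk(sk)->snd_ssthresh;
}

SEC("struct_ops/sample_undo_cwnd")
__u32 sample_undo_cwnd(struct sock *sk)
{
	return tcp_sk(sk)->snd_cwnd;
}

SEC(".struct_ops")
struct tcp_congestion_ops sample_cc = {
	.ssthresh	= (void *)sample_ssthresh,
	.undo_cwnd	= (void *)sample_undo_cwnd,
	.name		= "bpf_sample_cc",
};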

Please see the individual patches for details.

The bpftool support will be posted in follow-up patches.

v2:
- Dropped cubic for now.  It will be reposted
  once there is more clarity on "jiffies" on both the
  bpf side (about the helper) and the
  tcp_cubic side (some of the jiffies usages are being replaced
  by tp->tcp_mstamp)
- Remove unnecessary check on bitfield support from btf_struct_access()
  (Yonghong)
- BTF_TYPE_EMIT macro (Yonghong, Andrii)
- Check value_name's length to avoid an unlikely
  type match in the truncation case (Yonghong)
- BUILD_BUG_ON to ensure no trampoline-image overrun
  in the future (Yonghong)
- Simplify get_next_key() (Yonghong)
- Added comment to explain how to check mandatory
  func ptr in net/ipv4/bpf_tcp_ca.c (Yonghong)
- Rename "__bpf_" to "bpf_struct_ops_" for value prefix (Andrii)
- Add comment to highlight that bpf_dctcp.c is not necessarily
  the same as tcp_dctcp.c (Alexei, Eric)
- libbpf: Rename "struct_ops" to ".struct_ops" for the elf sec (Andrii)
- libbpf: Expose struct_ops as a bpf_map (Andrii)
- libbpf: Support multiple struct_ops in SEC(".struct_ops") (Andrii)
- libbpf: Add bpf_map__attach_struct_ops()  (Andrii)

Martin KaFai Lau (11):
  bpf: Save PTR_TO_BTF_ID register state when spilling to stack
  bpf: Avoid storing modifier to info->btf_id
  bpf: Add enum support to btf_ctx_access()
  bpf: Support bitfield read access in btf_struct_access
  bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  bpf: tcp: Support tcp_congestion_ops in bpf
  bpf: Add BPF_FUNC_tcp_send_ack helper
  bpf: Synch uapi bpf.h to tools/
  bpf: libbpf: Add STRUCT_OPS support
  bpf: Add bpf_dctcp example

 arch/x86/net/bpf_jit_comp.c                   |  11 +-
 include/linux/bpf.h                           |  79 ++-
 include/linux/bpf_types.h                     |   7 +
 include/linux/btf.h                           |  47 ++
 include/linux/filter.h                        |   2 +
 include/net/tcp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  19 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_struct_ops.c                   | 586 ++++++++++++++++
 kernel/bpf/bpf_struct_ops_types.h             |   9 +
 kernel/bpf/btf.c                              | 129 ++--
 kernel/bpf/map_in_map.c                       |   3 +-
 kernel/bpf/syscall.c                          |  66 +-
 kernel/bpf/trampoline.c                       |   5 +-
 kernel/bpf/verifier.c                         | 140 +++-
 net/core/filter.c                             |   2 +-
 net/ipv4/Makefile                             |   4 +
 net/ipv4/bpf_tcp_ca.c                         | 248 +++++++
 net/ipv4/tcp_cong.c                           |  14 +-
 net/ipv4/tcp_ipv4.c                           |   6 +-
 net/ipv4/tcp_minisocks.c                      |   4 +-
 net/ipv4/tcp_output.c                         |   4 +-
 tools/include/uapi/linux/bpf.h                |  19 +-
 tools/lib/bpf/bpf.c                           |  10 +-
 tools/lib/bpf/bpf.h                           |   5 +-
 tools/lib/bpf/libbpf.c                        | 639 +++++++++++++++++-
 tools/lib/bpf/libbpf.h                        |   1 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/lib/bpf/libbpf_probes.c                 |   2 +
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 +++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 ++++++
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++
 32 files changed, 2582 insertions(+), 139 deletions(-)
 create mode 100644 kernel/bpf/bpf_struct_ops.c
 create mode 100644 kernel/bpf/bpf_struct_ops_types.h
 create mode 100644 net/ipv4/bpf_tcp_ca.c
 create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c

-- 
2.17.1



* [PATCH bpf-next v2 01/11] bpf: Save PTR_TO_BTF_ID register state when spilling to stack
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
@ 2019-12-21  6:25 ` Martin KaFai Lau
  2019-12-21  6:25 ` [PATCH bpf-next v2 02/11] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:25 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch makes the verifier save the PTR_TO_BTF_ID register state when
spilling to the stack.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 034ef81f935b..408264c1d55b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1915,6 +1915,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_BTF_ID:
 		return true;
 	default:
 		return false;
-- 
2.17.1



* [PATCH bpf-next v2 02/11] bpf: Avoid storing modifier to info->btf_id
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
  2019-12-21  6:25 ` [PATCH bpf-next v2 01/11] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
@ 2019-12-21  6:25 ` Martin KaFai Lau
  2019-12-21  6:26 ` [PATCH bpf-next v2 03/11] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:25 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

info->btf_id expects the btf_id of a struct, so it should
store the final result after skipping modifiers (if any).

It also takes this chance to add a missing newline in one of the
bpf_log() messages.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 7d40da240891..88359a4bccb0 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3696,7 +3696,6 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 
 	/* this is a pointer to another type */
 	info->reg_type = PTR_TO_BTF_ID;
-	info->btf_id = t->type;
 
 	if (tgt_prog) {
 		ret = btf_translate_to_vmlinux(log, btf, t, tgt_prog->type);
@@ -3707,10 +3706,14 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 			return false;
 		}
 	}
+
+	info->btf_id = t->type;
 	t = btf_type_by_id(btf, t->type);
 	/* skip modifiers */
-	while (btf_type_is_modifier(t))
+	while (btf_type_is_modifier(t)) {
+		info->btf_id = t->type;
 		t = btf_type_by_id(btf, t->type);
+	}
 	if (!btf_type_is_struct(t)) {
 		bpf_log(log,
 			"func '%s' arg%d type %s is not a struct\n",
@@ -3736,7 +3739,7 @@ int btf_struct_access(struct bpf_verifier_log *log,
 again:
 	tname = __btf_name_by_offset(btf_vmlinux, t->name_off);
 	if (!btf_type_is_struct(t)) {
-		bpf_log(log, "Type '%s' is not a struct", tname);
+		bpf_log(log, "Type '%s' is not a struct\n", tname);
 		return -EINVAL;
 	}
 
-- 
2.17.1



* [PATCH bpf-next v2 03/11] bpf: Add enum support to btf_ctx_access()
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
  2019-12-21  6:25 ` [PATCH bpf-next v2 01/11] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
  2019-12-21  6:25 ` [PATCH bpf-next v2 02/11] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-21  6:26 ` [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

It allows a bpf prog (e.g. tracing) to attach
to a kernel function that takes an enum argument.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 88359a4bccb0..6e652643849b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3676,7 +3676,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 	/* skip modifiers */
 	while (btf_type_is_modifier(t))
 		t = btf_type_by_id(btf, t->type);
-	if (btf_type_is_int(t))
+	if (btf_type_is_int(t) || btf_type_is_enum(t))
 		/* accessing a scalar */
 		return true;
 	if (!btf_type_is_ptr(t)) {
-- 
2.17.1



* [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 03/11] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-23  7:49   ` Yonghong Song
  2019-12-23 20:05   ` Andrii Nakryiko
  2019-12-21  6:26 ` [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch allows bitfield access as a scalar.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 6e652643849b..da73b63acfc5 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3744,10 +3744,6 @@ int btf_struct_access(struct bpf_verifier_log *log,
 	}
 
 	for_each_member(i, t, member) {
-		if (btf_member_bitfield_size(t, member))
-			/* bitfields are not supported yet */
-			continue;
-
 		/* offset of the field in bytes */
 		moff = btf_member_bit_offset(t, member) / 8;
 		if (off + size <= moff)
@@ -3757,6 +3753,12 @@ int btf_struct_access(struct bpf_verifier_log *log,
 		if (off < moff)
 			continue;
 
+		if (btf_member_bitfield_size(t, member)) {
+			if (off == moff && off + size <= t->size)
+				return SCALAR_VALUE;
+			continue;
+		}
+
 		/* type of the field */
 		mtype = btf_type_by_id(btf_vmlinux, member->type);
 		mname = __btf_name_by_offset(btf_vmlinux, member->name_off);
-- 
2.17.1



* [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-23 19:33   ` Yonghong Song
                     ` (2 more replies)
  2019-12-21  6:26 ` [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch allows the kernel's struct ops (i.e. func ptr) to be
implemented in BPF.  The first use case in this series is the
"struct tcp_congestion_ops" which will be introduced in a
later patch.

This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
of a kernel struct.  The attr->expected_attach_type is the member
"index" of that kernel struct.  The first member of a struct starts
with member index 0.  That will avoid ambiguity when a kernel struct
has multiple func ptrs with the same func signature.

For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
to implement the "init" func ptr of the "struct tcp_congestion_ops".
The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
of the _running_ kernel.  The attr->expected_attach_type is 3.
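
For illustration, loading such a prog with the raw bpf(2) syscall could
look roughly like the sketch below (using <linux/bpf.h>).  This is
hypothetical glue code, not part of this patch: "insns", "insn_cnt"
and "tcp_congestion_ops_btf_id" are placeholders, and the libbpf
support comes in a later patch of this series.

	union bpf_attr attr = {};
	int prog_fd;

	attr.prog_type = BPF_PROG_TYPE_STRUCT_OPS;
	/* member index of "init" in struct tcp_congestion_ops */
	attr.expected_attach_type = 3;
	/* btf id of struct tcp_congestion_ops in the running kernel */
	attr.attach_btf_id = tcp_congestion_ops_btf_id;
	attr.insns = (__u64)(unsigned long)insns;
	attr.insn_cnt = insn_cnt;
	attr.license = (__u64)(unsigned long)"GPL";

	prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));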

The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
by arch_prepare_bpf_trampoline that will be done in the next
patch when introducing BPF_MAP_TYPE_STRUCT_OPS.

"struct bpf_struct_ops" is introduced as a common interface for the kernel
struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
struct will need to implement an instance of the "struct bpf_struct_ops".

The supporting kernel struct also needs to implement a bpf_verifier_ops.
During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
bpf_verifier_ops by searching the attr->attach_btf_id.

A new "btf_struct_access" is also added to the bpf_verifier_ops such
that the supporting kernel struct can optionally provide its own specific
check on accessing the func arg (e.g. provide limited write access).

After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
to initialize some values (e.g. the btf id of the supporting kernel
struct) and it can only be done once the btf_vmlinux is available.
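
For illustration (a hypothetical sketch, not part of this patch; the
real tcp_congestion_ops support comes in a later patch of this series),
a subsystem exposing a kernel "struct sample_ops" would add
BPF_STRUCT_OPS_TYPE(sample_ops) to bpf_struct_ops_types.h and provide
something like:

	static int bpf_sample_ops_init(struct btf *btf)
	{
		/* e.g. look up btf ids needed by the verifier_ops */
		return 0;
	}

	static int bpf_sample_ops_check_member(const struct btf_type *t,
					       const struct btf_member *member)
	{
		/* reject members that cannot be implemented in bpf yet */
		return 0;
	}

	static const struct bpf_verifier_ops bpf_sample_ops_verifier_ops = {
		/* func arg accesses are checked against BTF */
		.is_valid_access = btf_ctx_access,
	};

	/* found by the name "sample_ops" during bpf_struct_ops_init() */
	struct bpf_struct_ops bpf_sample_ops = {
		.verifier_ops	= &bpf_sample_ops_verifier_ops,
		.init		= bpf_sample_ops_init,
		.check_member	= bpf_sample_ops_check_member,
		.name		= "sample_ops",
	};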

The R0 checks at BPF_EXIT are excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
if the return type of the prog->aux->attach_func_proto is "void".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h               |  30 +++++++
 include/linux/bpf_types.h         |   4 +
 include/linux/btf.h               |  34 ++++++++
 include/uapi/linux/bpf.h          |   1 +
 kernel/bpf/Makefile               |   2 +-
 kernel/bpf/bpf_struct_ops.c       | 122 +++++++++++++++++++++++++++
 kernel/bpf/bpf_struct_ops_types.h |   4 +
 kernel/bpf/btf.c                  |  88 ++++++++++++++------
 kernel/bpf/syscall.c              |  17 ++--
 kernel/bpf/verifier.c             | 134 +++++++++++++++++++++++-------
 10 files changed, 372 insertions(+), 64 deletions(-)
 create mode 100644 kernel/bpf/bpf_struct_ops.c
 create mode 100644 kernel/bpf/bpf_struct_ops_types.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8f3e00c84f39..b8f087eb4bdf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -349,6 +349,10 @@ struct bpf_verifier_ops {
 				  const struct bpf_insn *src,
 				  struct bpf_insn *dst,
 				  struct bpf_prog *prog, u32 *target_size);
+	int (*btf_struct_access)(struct bpf_verifier_log *log,
+				 const struct btf_type *t, int off, int size,
+				 enum bpf_access_type atype,
+				 u32 *next_btf_id);
 };
 
 struct bpf_prog_offload_ops {
@@ -667,6 +671,32 @@ struct bpf_array_aux {
 	struct work_struct work;
 };
 
+struct btf_type;
+struct btf_member;
+
+#define BPF_STRUCT_OPS_MAX_NR_MEMBERS 64
+struct bpf_struct_ops {
+	const struct bpf_verifier_ops *verifier_ops;
+	int (*init)(struct btf *_btf_vmlinux);
+	int (*check_member)(const struct btf_type *t,
+			    const struct btf_member *member);
+	const struct btf_type *type;
+	const char *name;
+	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
+	u32 type_id;
+};
+
+#if defined(CONFIG_BPF_JIT)
+const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
+void bpf_struct_ops_init(struct btf *_btf_vmlinux);
+#else
+static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
+{
+	return NULL;
+}
+static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
+#endif
+
 struct bpf_array {
 	struct bpf_map map;
 	u32 elem_size;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 93740b3614d7..fadd243ffa2d 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -65,6 +65,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
 	      struct sk_reuseport_md, struct sk_reuseport_kern)
 #endif
+#if defined(CONFIG_BPF_JIT)
+BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
+	      void *, void *)
+#endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 79d4abc2556a..f74a09a7120b 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -53,6 +53,18 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
+s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
+const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
+					       u32 id, u32 *res_id);
+const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
+					    u32 id, u32 *res_id);
+const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
+						 u32 id, u32 *res_id);
+
+#define for_each_member(i, struct_type, member)			\
+	for (i = 0, member = btf_type_member(struct_type);	\
+	     i < btf_type_vlen(struct_type);			\
+	     i++, member++)
 
 static inline bool btf_type_is_ptr(const struct btf_type *t)
 {
@@ -84,6 +96,28 @@ static inline bool btf_type_is_func_proto(const struct btf_type *t)
 	return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC_PROTO;
 }
 
+static inline u16 btf_type_vlen(const struct btf_type *t)
+{
+	return BTF_INFO_VLEN(t->info);
+}
+
+static inline bool btf_type_kflag(const struct btf_type *t)
+{
+	return BTF_INFO_KFLAG(t->info);
+}
+
+static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
+					   const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset)
+					   : 0;
+}
+
+static inline const struct btf_member *btf_type_member(const struct btf_type *t)
+{
+	return (const struct btf_member *)(t + 1);
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
 const char *btf_name_by_offset(const struct btf *btf, u32 offset);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7df436da542d..c1eeb3e0e116 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -174,6 +174,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 	BPF_PROG_TYPE_TRACING,
+	BPF_PROG_TYPE_STRUCT_OPS,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index d4f330351f87..0e636387db6f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
-obj-$(CONFIG_BPF_JIT) += trampoline.o
+obj-$(CONFIG_BPF_JIT) += trampoline.o bpf_struct_ops.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
new file mode 100644
index 000000000000..c9f81bd1df83
--- /dev/null
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2019 Facebook */
+
+#include <linux/bpf.h>
+#include <linux/bpf_verifier.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <linux/slab.h>
+#include <linux/numa.h>
+#include <linux/seq_file.h>
+#include <linux/refcount.h>
+
+#define BPF_STRUCT_OPS_TYPE(_name)				\
+extern struct bpf_struct_ops bpf_##_name;
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+enum {
+#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name,
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+	__NR_BPF_STRUCT_OPS_TYPE,
+};
+
+static struct bpf_struct_ops * const bpf_struct_ops[] = {
+#define BPF_STRUCT_OPS_TYPE(_name)				\
+	[BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name,
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+};
+
+const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
+};
+
+const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
+};
+
+void bpf_struct_ops_init(struct btf *_btf_vmlinux)
+{
+	const struct btf_member *member;
+	struct bpf_struct_ops *st_ops;
+	struct bpf_verifier_log log = {};
+	const struct btf_type *t;
+	const char *mname;
+	s32 type_id;
+	u32 i, j;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		st_ops = bpf_struct_ops[i];
+
+		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
+						BTF_KIND_STRUCT);
+		if (type_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				st_ops->name);
+			continue;
+		}
+		t = btf_type_by_id(_btf_vmlinux, type_id);
+		if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
+			pr_warn("Cannot support #%u members in struct %s\n",
+				btf_type_vlen(t), st_ops->name);
+			continue;
+		}
+
+		for_each_member(j, t, member) {
+			const struct btf_type *func_proto;
+
+			mname = btf_name_by_offset(_btf_vmlinux,
+						   member->name_off);
+			if (!*mname) {
+				pr_warn("anon member in struct %s is not supported\n",
+					st_ops->name);
+				break;
+			}
+
+			if (btf_member_bitfield_size(t, member)) {
+				pr_warn("bit field member %s in struct %s is not supported\n",
+					mname, st_ops->name);
+				break;
+			}
+
+			func_proto = btf_type_resolve_func_ptr(_btf_vmlinux,
+							       member->type,
+							       NULL);
+			if (func_proto &&
+			    btf_distill_func_proto(&log, _btf_vmlinux,
+						   func_proto, mname,
+						   &st_ops->func_models[j])) {
+				pr_warn("Error in parsing func ptr %s in struct %s\n",
+					mname, st_ops->name);
+				break;
+			}
+		}
+
+		if (j == btf_type_vlen(t)) {
+			if (st_ops->init(_btf_vmlinux)) {
+				pr_warn("Error in init bpf_struct_ops %s\n",
+					st_ops->name);
+			} else {
+				st_ops->type_id = type_id;
+				st_ops->type = t;
+			}
+		}
+	}
+}
+
+extern struct btf *btf_vmlinux;
+
+const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
+{
+	unsigned int i;
+
+	if (!type_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->type_id == type_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
new file mode 100644
index 000000000000..7bb13ff49ec2
--- /dev/null
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -0,0 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* internal file - do not include directly */
+
+/* To be filled in a later patch */
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index da73b63acfc5..0e879a512cf4 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -180,11 +180,6 @@
  */
 #define BTF_MAX_SIZE (16 * 1024 * 1024)
 
-#define for_each_member(i, struct_type, member)			\
-	for (i = 0, member = btf_type_member(struct_type);	\
-	     i < btf_type_vlen(struct_type);			\
-	     i++, member++)
-
 #define for_each_member_from(i, from, struct_type, member)		\
 	for (i = from, member = btf_type_member(struct_type) + from;	\
 	     i < btf_type_vlen(struct_type);				\
@@ -382,6 +377,65 @@ static bool btf_type_is_datasec(const struct btf_type *t)
 	return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
 }
 
+s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
+{
+	const struct btf_type *t;
+	const char *tname;
+	u32 i;
+
+	for (i = 1; i <= btf->nr_types; i++) {
+		t = btf->types[i];
+		if (BTF_INFO_KIND(t->info) != kind)
+			continue;
+
+		tname = btf_name_by_offset(btf, t->name_off);
+		if (!strcmp(tname, name))
+			return i;
+	}
+
+	return -ENOENT;
+}
+
+const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
+					       u32 id, u32 *res_id)
+{
+	const struct btf_type *t = btf_type_by_id(btf, id);
+
+	while (btf_type_is_modifier(t)) {
+		id = t->type;
+		t = btf_type_by_id(btf, t->type);
+	}
+
+	if (res_id)
+		*res_id = id;
+
+	return t;
+}
+
+const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
+					    u32 id, u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = btf_type_skip_modifiers(btf, id, NULL);
+	if (!btf_type_is_ptr(t))
+		return NULL;
+
+	return btf_type_skip_modifiers(btf, t->type, res_id);
+}
+
+const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
+						 u32 id, u32 *res_id)
+{
+	const struct btf_type *ptype;
+
+	ptype = btf_type_resolve_ptr(btf, id, res_id);
+	if (ptype && btf_type_is_func_proto(ptype))
+		return ptype;
+
+	return NULL;
+}
+
 /* Types that act only as a source, not sink or intermediate
  * type when resolving.
  */
@@ -446,16 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding)
 		return "UNKN";
 }
 
-static u16 btf_type_vlen(const struct btf_type *t)
-{
-	return BTF_INFO_VLEN(t->info);
-}
-
-static bool btf_type_kflag(const struct btf_type *t)
-{
-	return BTF_INFO_KFLAG(t->info);
-}
-
 static u32 btf_member_bit_offset(const struct btf_type *struct_type,
 			     const struct btf_member *member)
 {
@@ -463,13 +507,6 @@ static u32 btf_member_bit_offset(const struct btf_type *struct_type,
 					   : member->offset;
 }
 
-static u32 btf_member_bitfield_size(const struct btf_type *struct_type,
-				    const struct btf_member *member)
-{
-	return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset)
-					   : 0;
-}
-
 static u32 btf_type_int(const struct btf_type *t)
 {
 	return *(u32 *)(t + 1);
@@ -480,11 +517,6 @@ static const struct btf_array *btf_type_array(const struct btf_type *t)
 	return (const struct btf_array *)(t + 1);
 }
 
-static const struct btf_member *btf_type_member(const struct btf_type *t)
-{
-	return (const struct btf_member *)(t + 1);
-}
-
 static const struct btf_enum *btf_type_enum(const struct btf_type *t)
 {
 	return (const struct btf_enum *)(t + 1);
@@ -3604,6 +3636,8 @@ struct btf *btf_parse_vmlinux(void)
 		goto errout;
 	}
 
+	bpf_struct_ops_init(btf);
+
 	btf_verifier_env_free(env);
 	refcount_set(&btf->refcnt, 1);
 	return btf;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 81ee8595dfee..03a02ef4c496 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1672,17 +1672,22 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 			   enum bpf_attach_type expected_attach_type,
 			   u32 btf_id, u32 prog_fd)
 {
-	switch (prog_type) {
-	case BPF_PROG_TYPE_TRACING:
+	if (btf_id) {
 		if (btf_id > BTF_MAX_TYPE)
 			return -EINVAL;
-		break;
-	default:
-		if (btf_id || prog_fd)
+
+		switch (prog_type) {
+		case BPF_PROG_TYPE_TRACING:
+		case BPF_PROG_TYPE_STRUCT_OPS:
+			break;
+		default:
 			return -EINVAL;
-		break;
+		}
 	}
 
+	if (prog_fd && prog_type != BPF_PROG_TYPE_TRACING)
+		return -EINVAL;
+
 	switch (prog_type) {
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 		switch (expected_attach_type) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 408264c1d55b..4c1eaa1a2965 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2858,11 +2858,6 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	u32 btf_id;
 	int ret;
 
-	if (atype != BPF_READ) {
-		verbose(env, "only read is supported\n");
-		return -EACCES;
-	}
-
 	if (off < 0) {
 		verbose(env,
 			"R%d is ptr_%s invalid negative access: off=%d\n",
@@ -2879,17 +2874,32 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 		return -EACCES;
 	}
 
-	ret = btf_struct_access(&env->log, t, off, size, atype, &btf_id);
+	if (env->ops->btf_struct_access) {
+		ret = env->ops->btf_struct_access(&env->log, t, off, size,
+						  atype, &btf_id);
+	} else {
+		if (atype != BPF_READ) {
+			verbose(env, "only read is supported\n");
+			return -EACCES;
+		}
+
+		ret = btf_struct_access(&env->log, t, off, size, atype,
+					&btf_id);
+	}
+
 	if (ret < 0)
 		return ret;
 
-	if (ret == SCALAR_VALUE) {
-		mark_reg_unknown(env, regs, value_regno);
-		return 0;
+	if (atype == BPF_READ) {
+		if (ret == SCALAR_VALUE) {
+			mark_reg_unknown(env, regs, value_regno);
+			return 0;
+		}
+		mark_reg_known_zero(env, regs, value_regno);
+		regs[value_regno].type = PTR_TO_BTF_ID;
+		regs[value_regno].btf_id = btf_id;
 	}
-	mark_reg_known_zero(env, regs, value_regno);
-	regs[value_regno].type = PTR_TO_BTF_ID;
-	regs[value_regno].btf_id = btf_id;
+
 	return 0;
 }
 
@@ -6343,8 +6353,30 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 static int check_return_code(struct bpf_verifier_env *env)
 {
 	struct tnum enforce_attach_type_range = tnum_unknown;
+	const struct bpf_prog *prog = env->prog;
 	struct bpf_reg_state *reg;
 	struct tnum range = tnum_range(0, 1);
+	int err;
+
+	/* The struct_ops func-ptr's return type could be "void" */
+	if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
+	    !prog->aux->attach_func_proto->type)
+		return 0;
+
+	/* eBPF calling convetion is such that R0 is used
+	 * to return the value from eBPF program.
+	 * Make sure that it's readable at this time
+	 * of bpf_exit, which means that program wrote
+	 * something into it earlier
+	 */
+	err = check_reg_arg(env, BPF_REG_0, SRC_OP);
+	if (err)
+		return err;
+
+	if (is_pointer_value(env, BPF_REG_0)) {
+		verbose(env, "R0 leaks addr as return value\n");
+		return -EACCES;
+	}
 
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
@@ -8010,21 +8042,6 @@ static int do_check(struct bpf_verifier_env *env)
 				if (err)
 					return err;
 
-				/* eBPF calling convetion is such that R0 is used
-				 * to return the value from eBPF program.
-				 * Make sure that it's readable at this time
-				 * of bpf_exit, which means that program wrote
-				 * something into it earlier
-				 */
-				err = check_reg_arg(env, BPF_REG_0, SRC_OP);
-				if (err)
-					return err;
-
-				if (is_pointer_value(env, BPF_REG_0)) {
-					verbose(env, "R0 leaks addr as return value\n");
-					return -EACCES;
-				}
-
 				err = check_return_code(env);
 				if (err)
 					return err;
@@ -8833,12 +8850,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
 			break;
 		case PTR_TO_BTF_ID:
-			if (type == BPF_WRITE) {
+			if (type == BPF_READ) {
+				insn->code = BPF_LDX | BPF_PROBE_MEM |
+					BPF_SIZE((insn)->code);
+				env->prog->aux->num_exentries++;
+			} else if (env->prog->type != BPF_PROG_TYPE_STRUCT_OPS) {
 				verbose(env, "Writes through BTF pointers are not allowed\n");
 				return -EINVAL;
 			}
-			insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code);
-			env->prog->aux->num_exentries++;
 			continue;
 		default:
 			continue;
@@ -9505,6 +9524,58 @@ static void print_verification_stats(struct bpf_verifier_env *env)
 		env->peak_states, env->longest_mark_read_walk);
 }
 
+static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
+{
+	const struct btf_type *t, *func_proto;
+	const struct bpf_struct_ops *st_ops;
+	const struct btf_member *member;
+	struct bpf_prog *prog = env->prog;
+	u32 btf_id, member_idx;
+	const char *mname;
+
+	btf_id = prog->aux->attach_btf_id;
+	st_ops = bpf_struct_ops_find(btf_id);
+	if (!st_ops) {
+		verbose(env, "attach_btf_id %u is not a supported struct\n",
+			btf_id);
+		return -ENOTSUPP;
+	}
+
+	t = st_ops->type;
+	member_idx = prog->expected_attach_type;
+	if (member_idx >= btf_type_vlen(t)) {
+		verbose(env, "attach to invalid member idx %u of struct %s\n",
+			member_idx, st_ops->name);
+		return -EINVAL;
+	}
+
+	member = &btf_type_member(t)[member_idx];
+	mname = btf_name_by_offset(btf_vmlinux, member->name_off);
+	func_proto = btf_type_resolve_func_ptr(btf_vmlinux, member->type,
+					       NULL);
+	if (!func_proto) {
+		verbose(env, "attach to invalid member %s(@idx %u) of struct %s\n",
+			mname, member_idx, st_ops->name);
+		return -EINVAL;
+	}
+
+	if (st_ops->check_member) {
+		int err = st_ops->check_member(t, member);
+
+		if (err) {
+			verbose(env, "attach to unsupported member %s of struct %s\n",
+				mname, st_ops->name);
+			return err;
+		}
+	}
+
+	prog->aux->attach_func_proto = func_proto;
+	prog->aux->attach_func_name = mname;
+	env->ops = st_ops->verifier_ops;
+
+	return 0;
+}
+
 static int check_attach_btf_id(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog;
@@ -9520,6 +9591,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	long addr;
 	u64 key;
 
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
+		return check_struct_ops_btf_id(env);
+
 	if (prog->type != BPF_PROG_TYPE_TRACING)
 		return 0;
 
-- 
2.17.1



* [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-23 19:57   ` Yonghong Song
                     ` (2 more replies)
  2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
                   ` (4 subsequent siblings)
  10 siblings, 3 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct with its func ptrs implemented in bpf progs.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(also called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embedded in:
struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
};
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
is created automatically by a macro.  Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization works before registering the struct_ops to the kernel
subsystem).  The libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
   set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
   running kernel.
   Instead of reusing the attr->btf_value_type_id,
   btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
   used as the "user" btf, which could store other useful sysadmin/debug
   info that may be introduced in the future,
   e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
   in the running kernel btf.  Populate the value of this object.
   The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
   the map value.  The key is always "0".  A minimal userspace
   sketch of these steps is shown right below.
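
A minimal userspace sketch of steps 2 and 4 with the raw bpf(2) syscall
(hypothetical glue code using <linux/bpf.h>; "value_size",
"value_type_id" and the populated "value" buffer are placeholders, and
libbpf automates all of this in a later patch of this series):

	union bpf_attr attr = {};
	__u32 key = 0;
	int map_fd;

	/* step 2: create the struct_ops map */
	attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
	attr.key_size = sizeof(__u32);
	attr.value_size = value_size;	/* size of the "value" struct */
	attr.max_entries = 1;
	attr.btf_vmlinux_value_type_id = value_type_id;
	map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	/* step 4: the update registers the struct to the subsystem.
	 * "value" has its func ptr members populated with prog fds
	 * from step 3.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&key;
	attr.value = (__u64)(unsigned long)value;
	syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));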

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem.  The map will not allow further update
from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
for the purpose of tracking the subsystem usage.  Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.
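
For illustration (a hypothetical sketch; the actual user is the tcp
side later in this series, and "struct foo" is a made-up subsystem
object), the subsystem goes through the bpf_try_module_get() and
bpf_module_put() wrappers added below, which work for both module-
and bpf-implemented ops:

	static int foo_assign_ops(struct foo *f, struct tcp_congestion_ops *ca)
	{
		/* ca->owner is either a real module or BPF_MODULE_OWNER */
		if (!bpf_try_module_get(ca, ca->owner))
			return -EBUSY;
		f->ops = ca;
		return 0;
	}

	static void foo_release_ops(struct foo *f)
	{
		bpf_module_put(f->ops, f->ops->owner);
		f->ops = NULL;
	}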

The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt.  When the very last subsystem's refcnt
is gone, it will also release the map->refcnt.  All bpf_prog will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).

Here is how the bpftool map command will look like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
	key 4B  value 256B  max_entries 1  memlock 4096B
	btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an in-place update on "*value" instead of returning a pointer
  to syscall.c.  Otherwise, it would need a separate copy of the "zero" value
  for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

* The bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  It is because
  the "->unreg()" may requires sleepable context, e.g.
  the "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function arg to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 arch/x86/net/bpf_jit_comp.c |  11 +-
 include/linux/bpf.h         |  49 +++-
 include/linux/bpf_types.h   |   3 +
 include/linux/btf.h         |  13 +
 include/uapi/linux/bpf.h    |   7 +-
 kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
 kernel/bpf/btf.c            |  20 +-
 kernel/bpf/map_in_map.c     |   3 +-
 kernel/bpf/syscall.c        |  49 ++--
 kernel/bpf/trampoline.c     |   5 +-
 kernel/bpf/verifier.c       |   5 +
 11 files changed, 593 insertions(+), 40 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4c8a2d1f8470..775347c78947 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1328,7 +1328,7 @@ xadd:			if (is_imm8(insn->off))
 	return proglen;
 }
 
-static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 		      int stack_size)
 {
 	int i;
@@ -1344,7 +1344,7 @@ static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 			 int stack_size)
 {
 	int i;
@@ -1361,7 +1361,7 @@ static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 		      struct bpf_prog **progs, int prog_cnt, int stack_size)
 {
 	u8 *prog = *pprog;
@@ -1456,7 +1456,8 @@ static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
  * add rsp, 8                      // skip eth_type_trans's frame
  * ret                             // return to its caller
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call)
@@ -1529,7 +1530,7 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	 */
 	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
 		return -EFAULT;
-	return 0;
+	return (void *)prog - image;
 }
 
 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b8f087eb4bdf..4e6bdf15d33f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -17,6 +17,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
 #include <linux/mutex.h>
+#include <linux/module.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -106,6 +107,7 @@ struct bpf_map {
 	struct btf *btf;
 	struct bpf_map_memory memory;
 	char name[BPF_OBJ_NAME_LEN];
+	u32 btf_vmlinux_value_type_id;
 	bool unpriv_array;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 22 bytes hole */
@@ -183,7 +185,8 @@ static inline bool bpf_map_offload_neutral(const struct bpf_map *map)
 
 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
-	return map->btf && map->ops->map_seq_show_elem;
+	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
+		map->ops->map_seq_show_elem;
 }
 
 int map_check_no_btf(const struct bpf_map *map,
@@ -441,7 +444,8 @@ struct btf_func_model {
  *      fentry = a set of program to run before calling original function
  *      fexit = a set of program to run after original function
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call);
@@ -671,6 +675,7 @@ struct bpf_array_aux {
 	struct work_struct work;
 };
 
+struct bpf_struct_ops_value;
 struct btf_type;
 struct btf_member;
 
@@ -680,21 +685,61 @@ struct bpf_struct_ops {
 	int (*init)(struct btf *_btf_vmlinux);
 	int (*check_member)(const struct btf_type *t,
 			    const struct btf_member *member);
+	int (*init_member)(const struct btf_type *t,
+			   const struct btf_member *member,
+			   void *kdata, const void *udata);
+	int (*reg)(void *kdata);
+	void (*unreg)(void *kdata);
 	const struct btf_type *type;
+	const struct btf_type *value_type;
 	const char *name;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
 	u32 type_id;
+	u32 value_id;
 };
 
 #if defined(CONFIG_BPF_JIT)
+#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *_btf_vmlinux);
+bool bpf_struct_ops_get(const void *kdata);
+void bpf_struct_ops_put(const void *kdata);
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		return bpf_struct_ops_get(data);
+	else
+		return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		bpf_struct_ops_put(data);
+	else
+		module_put(owner);
+}
 #else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	return NULL;
 }
 static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	module_put(owner);
+}
+static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
+						     void *key,
+						     void *value)
+{
+	return -EINVAL;
+}
 #endif
 
 struct bpf_array {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fadd243ffa2d..9f326e6ef885 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#if defined(CONFIG_BPF_JIT)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
+#endif
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f74a09a7120b..881e9b76ef49 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -7,6 +7,8 @@
 #include <linux/types.h>
 #include <uapi/linux/btf.h>
 
+#define BTF_TYPE_EMIT(type) ((void)(type *)0)
+
 struct btf;
 struct btf_member;
 struct btf_type;
@@ -60,6 +62,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
 					    u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
 						 u32 id, u32 *res_id);
+const struct btf_type *
+btf_resolve_size(const struct btf *btf, const struct btf_type *type,
+		 u32 *type_size, const struct btf_type **elem_type,
+		 u32 *total_nelems);
 
 #define for_each_member(i, struct_type, member)			\
 	for (i = 0, member = btf_type_member(struct_type);	\
@@ -106,6 +112,13 @@ static inline bool btf_type_kflag(const struct btf_type *t)
 	return BTF_INFO_KFLAG(t->info);
 }
 
+static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
+					const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
+					   : member->offset;
+}
+
 static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
 					   const struct btf_member *member)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1eeb3e0e116..38059880963e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -136,6 +136,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };
 
 /* Note that tracing related programs such as
@@ -398,6 +399,10 @@ union bpf_attr {
 		__u32	btf_fd;		/* fd pointing to a BTF type data */
 		__u32	btf_key_type_id;	/* BTF type_id of the key */
 		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -3350,7 +3355,7 @@ struct bpf_map_info {
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index c9f81bd1df83..fb9a0b3e4580 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -10,8 +10,66 @@
 #include <linux/seq_file.h>
 #include <linux/refcount.h>
 
+enum bpf_struct_ops_state {
+	BPF_STRUCT_OPS_STATE_INIT,
+	BPF_STRUCT_OPS_STATE_INUSE,
+	BPF_STRUCT_OPS_STATE_TOBEFREE,
+};
+
+#define BPF_STRUCT_OPS_COMMON_VALUE			\
+	refcount_t refcnt;				\
+	enum bpf_struct_ops_state state
+
+struct bpf_struct_ops_value {
+	BPF_STRUCT_OPS_COMMON_VALUE;
+	char data[0] ____cacheline_aligned_in_smp;
+};
+
+struct bpf_struct_ops_map {
+	struct bpf_map map;
+	const struct bpf_struct_ops *st_ops;
+	/* protect map_update */
+	spinlock_t lock;
+	/* progs has all the bpf_prog that is populated
+	 * to the func ptr of the kernel's struct
+	 * (in kvalue.data).
+	 */
+	struct bpf_prog **progs;
+	/* image is a page that has all the trampolines
+	 * that stores the func args before calling the bpf_prog.
+	 * A PAGE_SIZE "image" is enough to store all trampoline for
+	 * "progs[]".
+	 */
+	void *image;
+	/* uvalue->data stores the kernel struct
+	 * (e.g. tcp_congestion_ops) that is more useful
+	 * to userspace than the kvalue.  For example,
+	 * the bpf_prog's id is stored instead of the kernel
+	 * address of a func ptr.
+	 */
+	struct bpf_struct_ops_value *uvalue;
+	/* kvalue.data stores the actual kernel's struct
+	 * (e.g. tcp_congestion_ops) that will be
+	 * registered to the kernel subsystem.
+	 */
+	struct bpf_struct_ops_value kvalue;
+};
+
+#define VALUE_PREFIX "bpf_struct_ops_"
+#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
+
+/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
+ * the map's value exposed to the userspace and its btf-type-id is
+ * stored at the map->btf_vmlinux_value_type_id.
+ *
+ */
 #define BPF_STRUCT_OPS_TYPE(_name)				\
-extern struct bpf_struct_ops bpf_##_name;
+extern struct bpf_struct_ops bpf_##_name;			\
+								\
+struct bpf_struct_ops_##_name {						\
+	BPF_STRUCT_OPS_COMMON_VALUE;				\
+	struct _name data ____cacheline_aligned_in_smp;		\
+};
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
 
@@ -35,19 +93,51 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
 const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
 };
 
+static const struct btf_type *module_type;
+
 void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 {
+	s32 type_id, value_id, module_id;
 	const struct btf_member *member;
 	struct bpf_struct_ops *st_ops;
 	struct bpf_verifier_log log = {};
 	const struct btf_type *t;
+	char value_name[128];
 	const char *mname;
-	s32 type_id;
 	u32 i, j;
 
+	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
+#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
+					  BTF_KIND_STRUCT);
+	if (module_id < 0) {
+		pr_warn("Cannot find struct module in btf_vmlinux\n");
+		return;
+	}
+	module_type = btf_type_by_id(_btf_vmlinux, module_id);
+
 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
 		st_ops = bpf_struct_ops[i];
 
+		if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
+		    sizeof(value_name)) {
+			pr_warn("struct_ops name %s is too long\n",
+				st_ops->name);
+			continue;
+		}
+		sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
+
+		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
+						 BTF_KIND_STRUCT);
+		if (value_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				value_name);
+			continue;
+		}
+
 		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
 						BTF_KIND_STRUCT);
 		if (type_id < 0) {
@@ -99,6 +189,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 			} else {
 				st_ops->type_id = type_id;
 				st_ops->type = t;
+				st_ops->value_id = value_id;
+				st_ops->value_type =
+					btf_type_by_id(_btf_vmlinux, value_id);
 			}
 		}
 	}
@@ -106,6 +199,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 
 extern struct btf *btf_vmlinux;
 
+static const struct bpf_struct_ops *
+bpf_struct_ops_find_value(u32 value_id)
+{
+	unsigned int i;
+
+	if (!value_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->value_id == value_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}
+
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	unsigned int i;
@@ -120,3 +229,358 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 
 	return NULL;
 }
+
+static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
+					   void *next_key)
+{
+	if (key && *(u32 *)key == 0)
+		return -ENOENT;
+
+	*(u32 *)next_key = 0;
+	return 0;
+}
+
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	enum bpf_struct_ops_state state;
+
+	if (unlikely(*(u32 *)key != 0))
+		return -ENOENT;
+
+	kvalue = &st_map->kvalue;
+	/* Pair with smp_store_release() during map_update */
+	state = smp_load_acquire(&kvalue->state);
+	if (state == BPF_STRUCT_OPS_STATE_INIT) {
+		memset(value, 0, map->value_size);
+		return 0;
+	}
+
+	/* No lock is needed.  state and refcnt do not need
+	 * to be updated together under atomic context.
+	 */
+	uvalue = (struct bpf_struct_ops_value *)value;
+	memcpy(uvalue, st_map->uvalue, map->value_size);
+	uvalue->state = state;
+	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
+
+	return 0;
+}
+
+static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
+{
+	const struct btf_type *t = st_map->st_ops->type;
+	u32 i;
+
+	for (i = 0; i < btf_type_vlen(t); i++) {
+		if (st_map->progs[i]) {
+			bpf_prog_put(st_map->progs[i]);
+			st_map->progs[i] = NULL;
+		}
+	}
+}
+
+static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
+					  void *value, u64 flags)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	const struct bpf_struct_ops *st_ops = st_map->st_ops;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	const struct btf_member *member;
+	const struct btf_type *t = st_ops->type;
+	void *udata, *kdata;
+	int prog_fd, err = 0;
+	void *image;
+	u32 i;
+
+	if (flags)
+		return -EINVAL;
+
+	if (*(u32 *)key != 0)
+		return -E2BIG;
+
+	uvalue = (struct bpf_struct_ops_value *)value;
+	if (uvalue->state || refcount_read(&uvalue->refcnt))
+		return -EINVAL;
+
+	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
+	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
+
+	spin_lock(&st_map->lock);
+
+	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
+		err = -EBUSY;
+		goto unlock;
+	}
+
+	memcpy(uvalue, value, map->value_size);
+
+	udata = &uvalue->data;
+	kdata = &kvalue->data;
+	image = st_map->image;
+
+	for_each_member(i, t, member) {
+		const struct btf_type *mtype, *ptype;
+		struct bpf_prog *prog;
+		u32 moff;
+
+		moff = btf_member_bit_offset(t, member) / 8;
+		mtype = btf_type_by_id(btf_vmlinux, member->type);
+		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
+		if (ptype == module_type) {
+			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
+			continue;
+		}
+
+		err = st_ops->init_member(t, member, kdata, udata);
+		if (err < 0)
+			goto reset_unlock;
+
+		/* The ->init_member() has handled this member */
+		if (err > 0)
+			continue;
+
+		/* If st_ops->init_member does not handle it,
+		 * we will only handle func ptrs and zero-ed members
+		 * here.  Reject everything else.
+		 */
+
+		/* All non func ptr member must be 0 */
+		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
+					       NULL)) {
+			u32 msize;
+
+			mtype = btf_resolve_size(btf_vmlinux, mtype,
+						 &msize, NULL, NULL);
+			if (IS_ERR(mtype)) {
+				err = PTR_ERR(mtype);
+				goto reset_unlock;
+			}
+
+			if (memchr_inv(udata + moff, 0, msize)) {
+				err = -EINVAL;
+				goto reset_unlock;
+			}
+
+			continue;
+		}
+
+		prog_fd = (int)(*(unsigned long *)(udata + moff));
+		/* Similar check as the attr->attach_prog_fd */
+		if (!prog_fd)
+			continue;
+
+		prog = bpf_prog_get(prog_fd);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto reset_unlock;
+		}
+		st_map->progs[i] = prog;
+
+		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
+		    prog->aux->attach_btf_id != st_ops->type_id ||
+		    prog->expected_attach_type != i) {
+			err = -EINVAL;
+			goto reset_unlock;
+		}
+
+		err = arch_prepare_bpf_trampoline(image,
+						  &st_ops->func_models[i], 0,
+						  &prog, 1, NULL, 0, NULL);
+		if (err < 0)
+			goto reset_unlock;
+
+		*(void **)(kdata + moff) = image;
+		image += err;
+
+		/* put prog_id to udata */
+		*(unsigned long *)(udata + moff) = prog->aux->id;
+	}
+
+	refcount_set(&kvalue->refcnt, 1);
+	bpf_map_inc(map);
+
+	err = st_ops->reg(kdata);
+	if (!err) {
+		/* Pair with smp_load_acquire() during lookup */
+		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
+		goto unlock;
+	}
+
+	/* Error during st_ops->reg() */
+	bpf_map_put(map);
+
+reset_unlock:
+	bpf_struct_ops_map_put_progs(st_map);
+	memset(uvalue, 0, map->value_size);
+	memset(kvalue, 0, map->value_size);
+
+unlock:
+	spin_unlock(&st_map->lock);
+	return err;
+}
+
+static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
+{
+	enum bpf_struct_ops_state prev_state;
+	struct bpf_struct_ops_map *st_map;
+
+	st_map = (struct bpf_struct_ops_map *)map;
+	prev_state = cmpxchg(&st_map->kvalue.state,
+			     BPF_STRUCT_OPS_STATE_INUSE,
+			     BPF_STRUCT_OPS_STATE_TOBEFREE);
+	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
+		st_map->st_ops->unreg(&st_map->kvalue.data);
+		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
+			bpf_map_put(map);
+	}
+
+	return 0;
+}
+
+static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
+					     struct seq_file *m)
+{
+	void *value;
+
+	value = bpf_struct_ops_map_lookup_elem(map, key);
+	if (!value)
+		return;
+
+	btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
+			  value, m);
+	seq_puts(m, "\n");
+}
+
+static void bpf_struct_ops_map_free(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	if (st_map->progs)
+		bpf_struct_ops_map_put_progs(st_map);
+	bpf_map_area_free(st_map->progs);
+	bpf_jit_free_exec(st_map->image);
+	bpf_map_area_free(st_map->uvalue);
+	bpf_map_area_free(st_map);
+}
+
+static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
+	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
+		return -EINVAL;
+	return 0;
+}
+
+static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
+{
+	const struct bpf_struct_ops *st_ops;
+	size_t map_total_size, st_map_size;
+	struct bpf_struct_ops_map *st_map;
+	const struct btf_type *t, *vt;
+	struct bpf_map_memory mem;
+	struct bpf_map *map;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
+	if (!st_ops)
+		return ERR_PTR(-ENOTSUPP);
+
+	vt = st_ops->value_type;
+	if (attr->value_size != vt->size)
+		return ERR_PTR(-EINVAL);
+
+	t = st_ops->type;
+
+	st_map_size = sizeof(*st_map) +
+		/* kvalue stores the
+		 * struct bpf_struct_ops_tcp_congestion_ops
+		 */
+		(vt->size - sizeof(struct bpf_struct_ops_value));
+	map_total_size = st_map_size +
+		/* uvalue */
+		sizeof(vt->size) +
+		/* struct bpf_progs **progs */
+		 btf_type_vlen(t) * sizeof(struct bpf_prog *);
+	err = bpf_map_charge_init(&mem, map_total_size);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	if (!st_map) {
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+	st_map->st_ops = st_ops;
+	map = &st_map->map;
+
+	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->progs =
+		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
+				   NUMA_NO_NODE);
+	/* Each trampoline costs < 64 bytes.  Ensure one page
+	 * is enough for max number of func ptrs.
+	 */
+	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
+	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
+		bpf_struct_ops_map_free(map);
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	spin_lock_init(&st_map->lock);
+	set_vm_flush_reset_perms(st_map->image);
+	set_memory_x((long)st_map->image, 1);
+	bpf_map_init_from_attr(map, attr);
+	bpf_map_charge_move(&map->memory, &mem);
+
+	return map;
+}
+
+const struct bpf_map_ops bpf_struct_ops_map_ops = {
+	.map_alloc_check = bpf_struct_ops_map_alloc_check,
+	.map_alloc = bpf_struct_ops_map_alloc,
+	.map_free = bpf_struct_ops_map_free,
+	.map_get_next_key = bpf_struct_ops_map_get_next_key,
+	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
+	.map_delete_elem = bpf_struct_ops_map_delete_elem,
+	.map_update_elem = bpf_struct_ops_map_update_elem,
+	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
+};
+
+/* "const void *" because some subsystem is
+ * passing a const (e.g. const struct tcp_congestion_ops *)
+ */
+bool bpf_struct_ops_get(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+
+	return refcount_inc_not_zero(&kvalue->refcnt);
+}
+
+void bpf_struct_ops_put(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+	if (refcount_dec_and_test(&kvalue->refcnt)) {
+		struct bpf_struct_ops_map *st_map;
+
+		st_map = container_of(kvalue, struct bpf_struct_ops_map,
+				      kvalue);
+		bpf_map_put(&st_map->map);
+	}
+}
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 0e879a512cf4..0ef265170e1e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -500,13 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding)
 		return "UNKN";
 }
 
-static u32 btf_member_bit_offset(const struct btf_type *struct_type,
-			     const struct btf_member *member)
-{
-	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
-					   : member->offset;
-}
-
 static u32 btf_type_int(const struct btf_type *t)
 {
 	return *(u32 *)(t + 1);
@@ -1089,7 +1082,7 @@ static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env)
  * *elem_type: same as return type ("struct X")
  * *total_nelems: 1
  */
-static const struct btf_type *
+const struct btf_type *
 btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		 u32 *type_size, const struct btf_type **elem_type,
 		 u32 *total_nelems)
@@ -1143,8 +1136,10 @@ btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		return ERR_PTR(-EINVAL);
 
 	*type_size = nelems * size;
-	*total_nelems = nelems;
-	*elem_type = type;
+	if (total_nelems)
+		*total_nelems = nelems;
+	if (elem_type)
+		*elem_type = type;
 
 	return array_type ? : type;
 }
@@ -1858,7 +1853,10 @@ static void btf_modifier_seq_show(const struct btf *btf,
 				  u32 type_id, void *data,
 				  u8 bits_offset, struct seq_file *m)
 {
-	t = btf_type_id_resolve(btf, &type_id);
+	if (btf->resolved_ids)
+		t = btf_type_id_resolve(btf, &type_id);
+	else
+		t = btf_type_skip_modifiers(btf, type_id, NULL);
 
 	btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m);
 }
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 5e9366b33f0f..b3c48d1533cb 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -22,7 +22,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	 */
 	if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
 	    inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
-	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ||
+	    inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		fdput(f);
 		return ERR_PTR(-ENOTSUPP);
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 03a02ef4c496..a07800ec5023 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -628,7 +628,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD btf_value_type_id
+#define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
@@ -642,6 +642,14 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
+	if (attr->btf_vmlinux_value_type_id) {
+		if (attr->map_type != BPF_MAP_TYPE_STRUCT_OPS ||
+		    attr->btf_key_type_id || attr->btf_value_type_id)
+			return -EINVAL;
+	} else if (attr->btf_key_type_id && !attr->btf_value_type_id) {
+		return -EINVAL;
+	}
+
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
 		return f_flags;
@@ -664,32 +672,35 @@ static int map_create(union bpf_attr *attr)
 	atomic64_set(&map->usercnt, 1);
 	mutex_init(&map->freeze_mutex);
 
-	if (attr->btf_key_type_id || attr->btf_value_type_id) {
+	map->spin_lock_off = -EINVAL;
+	if (attr->btf_key_type_id || attr->btf_value_type_id ||
+	    /* Even if the map's value is a kernel struct,
+	     * the bpf_prog.o must have BTF to begin with
+	     * to figure out the corresponding kernel
+	     * counterpart.  Thus, attr->btf_fd has
+	     * to be valid also.
+	     */
+	    attr->btf_vmlinux_value_type_id) {
 		struct btf *btf;
 
-		if (!attr->btf_value_type_id) {
-			err = -EINVAL;
-			goto free_map;
-		}
-
 		btf = btf_get_by_fd(attr->btf_fd);
 		if (IS_ERR(btf)) {
 			err = PTR_ERR(btf);
 			goto free_map;
 		}
+		map->btf = btf;
 
-		err = map_check_btf(map, btf, attr->btf_key_type_id,
-				    attr->btf_value_type_id);
-		if (err) {
-			btf_put(btf);
-			goto free_map;
+		if (attr->btf_value_type_id) {
+			err = map_check_btf(map, btf, attr->btf_key_type_id,
+					    attr->btf_value_type_id);
+			if (err)
+				goto free_map;
 		}
 
-		map->btf = btf;
 		map->btf_key_type_id = attr->btf_key_type_id;
 		map->btf_value_type_id = attr->btf_value_type_id;
-	} else {
-		map->spin_lock_off = -EINVAL;
+		map->btf_vmlinux_value_type_id =
+			attr->btf_vmlinux_value_type_id;
 	}
 
 	err = security_bpf_map_alloc(map);
@@ -888,6 +899,9 @@ static int map_lookup_elem(union bpf_attr *attr)
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
 		   map->map_type == BPF_MAP_TYPE_STACK) {
 		err = map->ops->map_peek_elem(map, value);
+	} else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		/* struct_ops map requires directly updating "value" */
+		err = bpf_struct_ops_map_sys_lookup_elem(map, key, value);
 	} else {
 		rcu_read_lock();
 		if (map->ops->map_lookup_elem_sys_only)
@@ -1092,7 +1106,9 @@ static int map_delete_elem(union bpf_attr *attr)
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
-	} else if (IS_FD_PROG_ARRAY(map)) {
+	} else if (IS_FD_PROG_ARRAY(map) ||
+		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		/* These maps require sleepable context */
 		err = map->ops->map_delete_elem(map, key);
 		goto out;
 	}
@@ -2822,6 +2838,7 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
 		info.btf_key_type_id = map->btf_key_type_id;
 		info.btf_value_type_id = map->btf_value_type_id;
 	}
+	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 5ee301ddbd00..610109cfc7a8 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -110,7 +110,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
 					  fentry, fentry_cnt,
 					  fexit, fexit_cnt,
 					  tr->func.addr);
-	if (err)
+	if (err < 0)
 		goto out;
 
 	if (tr->selector)
@@ -244,7 +244,8 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
 }
 
 int __weak
-arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+arch_prepare_bpf_trampoline(void *image,
+			    const struct btf_func_model *m, u32 flags,
 			    struct bpf_prog **fentry_progs, int fentry_cnt,
 			    struct bpf_prog **fexit_progs, int fexit_cnt,
 			    void *orig_call)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4c1eaa1a2965..990f13165c52 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8149,6 +8149,11 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		verbose(env, "bpf_struct_ops map cannot be used in prog\n");
+		return -EINVAL;
+	}
+
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-23 20:18   ` Yonghong Song
                     ` (3 more replies)
  2019-12-21  6:26 ` [PATCH bpf-next v2 08/11] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
                   ` (3 subsequent siblings)
  10 siblings, 4 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch makes "struct tcp_congestion_ops" the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

A BPF-implemented tcp_congestion_ops can be used like a
regular kernel tcp-cc through sysctl and setsockopt, e.g.:
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic

There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP).   The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing algorithm
in the production while leveraging the existing (and continue growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).
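
For illustration only (not part of this patch; the callback name is
hypothetical and the ctx-unpacking boilerplate of a struct_ops prog is
elided), a cong_avoid-like callback can then update those fields
directly:

void bpf_sample_cong_avoid(struct sock *sk, __u32 ack, __u32 acked)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;

	/* sk is promoted to tcp_sock by bpf_tcp_ca_is_valid_access(),
	 * and snd_cwnd/snd_ssthresh are on the writable list in
	 * bpf_tcp_ca_btf_struct_access().
	 */
	if (tp->snd_cwnd < tp->snd_ssthresh)
		tp->snd_cwnd += acked;
}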

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h            |   2 +
 include/net/tcp.h                 |   1 +
 kernel/bpf/bpf_struct_ops_types.h |   7 +-
 net/core/filter.c                 |   2 +-
 net/ipv4/Makefile                 |   4 +
 net/ipv4/bpf_tcp_ca.c             | 226 ++++++++++++++++++++++++++++++
 net/ipv4/tcp_cong.c               |  14 +-
 net/ipv4/tcp_ipv4.c               |   6 +-
 net/ipv4/tcp_minisocks.c          |   4 +-
 net/ipv4/tcp_output.c             |   4 +-
 10 files changed, 255 insertions(+), 15 deletions(-)
 create mode 100644 net/ipv4/bpf_tcp_ca.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 69d6706fc889..6256511bbd6d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -843,6 +843,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 			      bpf_aux_classic_check_t trans, bool save_orig);
 void bpf_prog_destroy(struct bpf_prog *fp);
+const struct bpf_func_proto *
+bpf_base_func_proto(enum bpf_func_id func_id);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
 int sk_attach_bpf(u32 ufd, struct sock *sk);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 86b9a8766648..fd87fa1df603 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1007,6 +1007,7 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
 #define TCP_CONG_NEEDS_ECN	0x2
+#define TCP_CONG_MASK	(TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
 
 union tcp_cc_info;
 
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
index 7bb13ff49ec2..066d83ea1c99 100644
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -1,4 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /* internal file - do not include directly */
 
-/* To be filled in a later patch */
+#ifdef CONFIG_BPF_JIT
+#ifdef CONFIG_INET
+#include <net/tcp.h>
+BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
+#endif
+#endif
diff --git a/net/core/filter.c b/net/core/filter.c
index 217af9974c86..63dcd0bdcb1f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5934,7 +5934,7 @@ bool bpf_helper_changes_pkt_data(void *func)
 	return false;
 }
 
-static const struct bpf_func_proto *
+const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index d57ecfaf89d4..7360d9b3eaad 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -65,3 +65,7 @@ obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
+
+ifeq ($(CONFIG_BPF_SYSCALL),y)
+obj-$(CONFIG_BPF_JIT) += bpf_tcp_ca.o
+endif
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
new file mode 100644
index 000000000000..1114339ee57d
--- /dev/null
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -0,0 +1,226 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook  */
+
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/tcp.h>
+
+static u32 optional_ops[] = {
+	offsetof(struct tcp_congestion_ops, init),
+	offsetof(struct tcp_congestion_ops, release),
+	offsetof(struct tcp_congestion_ops, set_state),
+	offsetof(struct tcp_congestion_ops, cwnd_event),
+	offsetof(struct tcp_congestion_ops, in_ack_event),
+	offsetof(struct tcp_congestion_ops, pkts_acked),
+	offsetof(struct tcp_congestion_ops, min_tso_segs),
+	offsetof(struct tcp_congestion_ops, sndbuf_expand),
+	offsetof(struct tcp_congestion_ops, cong_control),
+};
+
+static u32 unsupported_ops[] = {
+	offsetof(struct tcp_congestion_ops, get_info),
+};
+
+static const struct btf_type *tcp_sock_type;
+static u32 tcp_sock_id, sock_id;
+
+static int bpf_tcp_ca_init(struct btf *_btf_vmlinux)
+{
+	s32 type_id;
+
+	type_id = btf_find_by_name_kind(_btf_vmlinux, "sock", BTF_KIND_STRUCT);
+	if (type_id < 0)
+		return -EINVAL;
+	sock_id = type_id;
+
+	type_id = btf_find_by_name_kind(_btf_vmlinux, "tcp_sock",
+					BTF_KIND_STRUCT);
+	if (type_id < 0)
+		return -EINVAL;
+	tcp_sock_id = type_id;
+	tcp_sock_type = btf_type_by_id(_btf_vmlinux, tcp_sock_id);
+
+	return 0;
+}
+
+static bool check_optional(u32 member_offset)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(optional_ops); i++) {
+		if (member_offset == optional_ops[i])
+			return true;
+	}
+
+	return false;
+}
+
+static bool check_unsupported(u32 member_offset)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
+		if (member_offset == unsupported_ops[i])
+			return true;
+	}
+
+	return false;
+}
+
+extern struct btf *btf_vmlinux;
+
+static bool bpf_tcp_ca_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
+				       struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+
+	if (!btf_ctx_access(off, size, type, prog, info))
+		return false;
+
+	if (info->reg_type == PTR_TO_BTF_ID && info->btf_id == sock_id)
+		/* promote it to tcp_sock */
+		info->btf_id = tcp_sock_id;
+
+	return true;
+}
+
+static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
+					const struct btf_type *t, int off,
+					int size, enum bpf_access_type atype,
+					u32 *next_btf_id)
+{
+	size_t end;
+
+	if (atype == BPF_READ)
+		return btf_struct_access(log, t, off, size, atype, next_btf_id);
+
+	if (t != tcp_sock_type) {
+		bpf_log(log, "only read is supported\n");
+		return -EACCES;
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct inet_connection_sock, icsk_ca_priv):
+		end = offsetofend(struct inet_connection_sock, icsk_ca_priv);
+		break;
+	case offsetof(struct inet_connection_sock, icsk_ack.pending):
+		end = offsetofend(struct inet_connection_sock,
+				  icsk_ack.pending);
+		break;
+	case offsetof(struct tcp_sock, snd_cwnd):
+		end = offsetofend(struct tcp_sock, snd_cwnd);
+		break;
+	case offsetof(struct tcp_sock, snd_cwnd_cnt):
+		end = offsetofend(struct tcp_sock, snd_cwnd_cnt);
+		break;
+	case offsetof(struct tcp_sock, snd_ssthresh):
+		end = offsetofend(struct tcp_sock, snd_ssthresh);
+		break;
+	case offsetof(struct tcp_sock, ecn_flags):
+		end = offsetofend(struct tcp_sock, ecn_flags);
+		break;
+	default:
+		bpf_log(log, "no write support to tcp_sock at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of tcp_sock ended at %zu\n",
+			off, size, end);
+		return -EACCES;
+	}
+
+	return NOT_INIT;
+}
+
+static const struct bpf_func_proto *
+bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
+			  const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
+	.get_func_proto		= bpf_tcp_ca_get_func_proto,
+	.is_valid_access	= bpf_tcp_ca_is_valid_access,
+	.btf_struct_access	= bpf_tcp_ca_btf_struct_access,
+};
+
+static int bpf_tcp_ca_init_member(const struct btf_type *t,
+				  const struct btf_member *member,
+				  void *kdata, const void *udata)
+{
+	const struct tcp_congestion_ops *utcp_ca;
+	struct tcp_congestion_ops *tcp_ca;
+	size_t tcp_ca_name_len;
+	int prog_fd;
+	u32 moff;
+
+	utcp_ca = (const struct tcp_congestion_ops *)udata;
+	tcp_ca = (struct tcp_congestion_ops *)kdata;
+
+	moff = btf_member_bit_offset(t, member) / 8;
+	switch (moff) {
+	case offsetof(struct tcp_congestion_ops, flags):
+		if (utcp_ca->flags & ~TCP_CONG_MASK)
+			return -EINVAL;
+		tcp_ca->flags = utcp_ca->flags;
+		return 1;
+	case offsetof(struct tcp_congestion_ops, name):
+		tcp_ca_name_len = strnlen(utcp_ca->name, sizeof(utcp_ca->name));
+		if (!tcp_ca_name_len ||
+		    tcp_ca_name_len == sizeof(utcp_ca->name))
+			return -EINVAL;
+		memcpy(tcp_ca->name, utcp_ca->name, sizeof(tcp_ca->name));
+		return 1;
+	}
+
+	if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, NULL))
+		return 0;
+
+	/* Ensure bpf_prog is provided for compulsory func ptr */
+	prog_fd = (int)(*(unsigned long *)(udata + moff));
+	if (!prog_fd && !check_optional(moff) && !check_unsupported(moff))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bpf_tcp_ca_check_member(const struct btf_type *t,
+				   const struct btf_member *member)
+{
+	if (check_unsupported(btf_member_bit_offset(t, member) / 8))
+		return -ENOTSUPP;
+	return 0;
+}
+
+static int bpf_tcp_ca_reg(void *kdata)
+{
+	return tcp_register_congestion_control(kdata);
+}
+
+static void bpf_tcp_ca_unreg(void *kdata)
+{
+	tcp_unregister_congestion_control(kdata);
+}
+
+struct bpf_struct_ops bpf_tcp_congestion_ops = {
+	.verifier_ops = &bpf_tcp_ca_verifier_ops,
+	.reg = bpf_tcp_ca_reg,
+	.unreg = bpf_tcp_ca_unreg,
+	.check_member = bpf_tcp_ca_check_member,
+	.init_member = bpf_tcp_ca_init_member,
+	.init = bpf_tcp_ca_init,
+	.name = "tcp_congestion_ops",
+};
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 3737ec096650..dc27f21bd815 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -162,7 +162,7 @@ void tcp_assign_congestion_control(struct sock *sk)
 
 	rcu_read_lock();
 	ca = rcu_dereference(net->ipv4.tcp_congestion_control);
-	if (unlikely(!try_module_get(ca->owner)))
+	if (unlikely(!bpf_try_module_get(ca, ca->owner)))
 		ca = &tcp_reno;
 	icsk->icsk_ca_ops = ca;
 	rcu_read_unlock();
@@ -208,7 +208,7 @@ void tcp_cleanup_congestion_control(struct sock *sk)
 
 	if (icsk->icsk_ca_ops->release)
 		icsk->icsk_ca_ops->release(sk);
-	module_put(icsk->icsk_ca_ops->owner);
+	bpf_module_put(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner);
 }
 
 /* Used by sysctl to change default congestion control */
@@ -222,12 +222,12 @@ int tcp_set_default_congestion_control(struct net *net, const char *name)
 	ca = tcp_ca_find_autoload(net, name);
 	if (!ca) {
 		ret = -ENOENT;
-	} else if (!try_module_get(ca->owner)) {
+	} else if (!bpf_try_module_get(ca, ca->owner)) {
 		ret = -EBUSY;
 	} else {
 		prev = xchg(&net->ipv4.tcp_congestion_control, ca);
 		if (prev)
-			module_put(prev->owner);
+			bpf_module_put(prev, prev->owner);
 
 		ca->flags |= TCP_CONG_NON_RESTRICTED;
 		ret = 0;
@@ -366,19 +366,19 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
 	} else if (!load) {
 		const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
 
-		if (try_module_get(ca->owner)) {
+		if (bpf_try_module_get(ca, ca->owner)) {
 			if (reinit) {
 				tcp_reinit_congestion_control(sk, ca);
 			} else {
 				icsk->icsk_ca_ops = ca;
-				module_put(old_ca->owner);
+				bpf_module_put(old_ca, old_ca->owner);
 			}
 		} else {
 			err = -EBUSY;
 		}
 	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin)) {
 		err = -EPERM;
-	} else if (!try_module_get(ca->owner)) {
+	} else if (!bpf_try_module_get(ca, ca->owner)) {
 		err = -EBUSY;
 	} else {
 		tcp_reinit_congestion_control(sk, ca);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 26637fce324d..45a88358168a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2619,7 +2619,8 @@ static void __net_exit tcp_sk_exit(struct net *net)
 	int cpu;
 
 	if (net->ipv4.tcp_congestion_control)
-		module_put(net->ipv4.tcp_congestion_control->owner);
+		bpf_module_put(net->ipv4.tcp_congestion_control,
+			       net->ipv4.tcp_congestion_control->owner);
 
 	for_each_possible_cpu(cpu)
 		inet_ctl_sock_destroy(*per_cpu_ptr(net->ipv4.tcp_sk, cpu));
@@ -2726,7 +2727,8 @@ static int __net_init tcp_sk_init(struct net *net)
 
 	/* Reno is always built in */
 	if (!net_eq(net, &init_net) &&
-	    try_module_get(init_net.ipv4.tcp_congestion_control->owner))
+	    bpf_try_module_get(init_net.ipv4.tcp_congestion_control,
+			       init_net.ipv4.tcp_congestion_control->owner))
 		net->ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control;
 	else
 		net->ipv4.tcp_congestion_control = &tcp_reno;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c802bc80c400..ad3b56d9fa71 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -414,7 +414,7 @@ void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
 
 		rcu_read_lock();
 		ca = tcp_ca_find_key(ca_key);
-		if (likely(ca && try_module_get(ca->owner))) {
+		if (likely(ca && bpf_try_module_get(ca, ca->owner))) {
 			icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
 			icsk->icsk_ca_ops = ca;
 			ca_got_dst = true;
@@ -425,7 +425,7 @@ void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
 	/* If no valid choice made yet, assign current system default ca. */
 	if (!ca_got_dst &&
 	    (!icsk->icsk_ca_setsockopt ||
-	     !try_module_get(icsk->icsk_ca_ops->owner)))
+	     !bpf_try_module_get(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner)))
 		tcp_assign_congestion_control(sk);
 
 	tcp_set_ca_state(sk, TCP_CA_Open);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b184f03d7437..8e7187732ac1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3356,8 +3356,8 @@ static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst)
 
 	rcu_read_lock();
 	ca = tcp_ca_find_key(ca_key);
-	if (likely(ca && try_module_get(ca->owner))) {
-		module_put(icsk->icsk_ca_ops->owner);
+	if (likely(ca && bpf_try_module_get(ca, ca->owner))) {
+		bpf_module_put(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner);
 		icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
 		icsk->icsk_ca_ops = ca;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 08/11] bpf: Add BPF_FUNC_tcp_send_ack helper
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-21  6:26 ` [PATCH bpf-next v2 09/11] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

Add a helper to send out a tcp-ack.  It will be used in the later
bpf_dctcp implementation, which needs to send out an ack
when the CE state changes.
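
A rough sketch of the intended usage (hypothetical snippet; the real
user is the bpf_dctcp example later in this series):

/* called from a BPF tcp-cc when it notices a CE state flip */
static void hypothetical_ce_state_change(struct tcp_sock *tp,
					 __u32 prior_rcv_nxt)
{
	/* ack the data received before the flip so the sender
	 * learns about the old CE state right away
	 */
	bpf_tcp_send_ack(tp, prior_rcv_nxt);
}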

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/uapi/linux/bpf.h | 11 ++++++++++-
 net/ipv4/bpf_tcp_ca.c    | 24 +++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 38059880963e..2d6a2e572f56 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2837,6 +2837,14 @@ union bpf_attr {
  * 	Return
  * 		On success, the strictly positive length of the string,	including
  * 		the trailing NUL character. On error, a negative value.
+ *
+ * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
+ *	Description
+ *		Send out a tcp-ack. *tp* is the in-kernel struct tcp_sock.
+ *		*rcv_nxt* is the ack_seq to be sent out.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2954,7 +2962,8 @@ union bpf_attr {
 	FN(probe_read_user),		\
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
-	FN(probe_read_kernel_str),
+	FN(probe_read_kernel_str),	\
+	FN(tcp_send_ack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 1114339ee57d..b90e2ec2ee2b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -144,11 +144,33 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
 	return NOT_INIT;
 }
 
+BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
+{
+	/* bpf_tcp_ca prog cannot have NULL tp */
+	__tcp_send_ack((struct sock *)tp, rcv_nxt);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
+	.func		= bpf_tcp_send_ack,
+	.gpl_only	= false,
+	/* In case we want to report error later */
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_ANYTHING,
+	.btf_id		= &tcp_sock_id,
+};
+
 static const struct bpf_func_proto *
 bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
 			  const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	switch (func_id) {
+	case BPF_FUNC_tcp_send_ack:
+		return &bpf_tcp_send_ack_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
 }
 
 static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 09/11] bpf: Synch uapi bpf.h to tools/
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 08/11] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-21  6:26 ` [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
  2019-12-21  6:26 ` [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example Martin KaFai Lau
  10 siblings, 0 replies; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch syncs uapi bpf.h to tools/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/include/uapi/linux/bpf.h | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7df436da542d..2d6a2e572f56 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -136,6 +136,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };
 
 /* Note that tracing related programs such as
@@ -174,6 +175,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 	BPF_PROG_TYPE_TRACING,
+	BPF_PROG_TYPE_STRUCT_OPS,
 };
 
 enum bpf_attach_type {
@@ -397,6 +399,10 @@ union bpf_attr {
 		__u32	btf_fd;		/* fd pointing to a BTF type data */
 		__u32	btf_key_type_id;	/* BTF type_id of the key */
 		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -2831,6 +2837,14 @@ union bpf_attr {
  * 	Return
  * 		On success, the strictly positive length of the string,	including
  * 		the trailing NUL character. On error, a negative value.
+ *
+ * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
+ *	Description
+ *		Send out a tcp-ack. *tp* is the in-kernel struct tcp_sock.
+ *		*rcv_nxt* is the ack_seq to be sent out.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2948,7 +2962,8 @@ union bpf_attr {
 	FN(probe_read_user),		\
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
-	FN(probe_read_kernel_str),
+	FN(probe_read_kernel_str),	\
+	FN(tcp_send_ack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3349,7 +3364,7 @@ struct bpf_map_info {
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 09/11] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-23 19:54   ` Andrii Nakryiko
  2019-12-21  6:26 ` [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example Martin KaFai Lau
  10 siblings, 1 reply; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch adds BPF STRUCT_OPS support to libbpf.

The only sec_name convention is SEC(".struct_ops"), which identifies the
struct_ops implemented in BPF.
e.g. to implement a tcp_congestion_ops:

SEC(".struct_ops")
struct tcp_congestion_ops dctcp = {
	.init           = (void *)dctcp_init,  /* <-- a bpf_prog */
	/* ... some more func ptrs ... */
	.name           = "bpf_dctcp",
};

Each struct_ops is defined as a global variable under SEC(".struct_ops")
as above.  libbpf creates a map for each variable, and the variable name
is the map's name.  Multiple struct_ops are supported under
SEC(".struct_ops").

In the bpf_object__open phase, libbpf will look for the SEC(".struct_ops")
section and find out which btf-type the struct_ops is
implementing.  Note that the btf-type here refers to
a type in the bpf_prog.o's btf.  A "struct bpf_map" is added
by bpf_object__add_map() as is done for other maps.  libbpf then
collects (through SHT_REL) the bpf progs that the
func ptrs refer to.  No btf_vmlinux is needed in
the open phase.

In the bpf_object__load phase, the map fields that depend
on the btf_vmlinux are initialized (in bpf_map__init_kern_struct_ops()).
This phase also sets prog->type, prog->attach_btf_id, and
prog->expected_attach_type.  Thus, the prog's properties do
not rely on its section name.
[ Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
  process is as simple as: member-name match + btf-kind match + size match.
  If these matching conditions fail, libbpf will reject the load.
  The current target of this support is "struct tcp_congestion_ops",
  most of whose members are function pointers.
  The member ordering of the bpf_prog's btf-type can be different from
  the btf_vmlinux's btf-type. ]
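
For example, a bpf_prog.o-side declaration like the following is
expected to match (hypothetical, trimmed declaration): only the
members the tcp-cc actually uses need to be present, and they may
appear in a different order than in the vmlinux BTF:

/* bpf_prog.o side, e.g. in a local header used by the tcp-cc */
struct tcp_congestion_ops {
	__u32 (*ssthresh)(struct sock *sk);
	void (*cong_avoid)(struct sock *sk, __u32 ack, __u32 acked);
	void (*init)(struct sock *sk);
	char name[16];
	__u32 flags;
	/* other kernel members are simply left out */
};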

Then, all obj->maps are created as usual (in bpf_object__create_maps()).

Once the maps are created and the progs' properties are all set,
libbpf will proceed to load all the progs.

bpf_map__attach_struct_ops() is added to register a struct_ops
map to a kernel subsystem.
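
Typical usage from an application would then look roughly like this
(sketch; the object and map names are taken from the dctcp example
and error handling is omitted):

	struct bpf_object *obj;
	struct bpf_map *map;
	struct bpf_link *link;

	obj = bpf_object__open("bpf_dctcp.o");
	bpf_object__load(obj);

	map = bpf_object__find_map_by_name(obj, "dctcp");
	link = bpf_map__attach_struct_ops(map);
	/* "bpf_dctcp" is now registered as a tcp-cc until the
	 * link (and map) is released
	 */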

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/lib/bpf/bpf.c           |  10 +-
 tools/lib/bpf/bpf.h           |   5 +-
 tools/lib/bpf/libbpf.c        | 639 +++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h        |   1 +
 tools/lib/bpf/libbpf.map      |   1 +
 tools/lib/bpf/libbpf_probes.c |   2 +
 6 files changed, 646 insertions(+), 12 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index a787d53699c8..b0ecbe9ef2d4 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -95,7 +95,11 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
 	attr.btf_key_type_id = create_attr->btf_key_type_id;
 	attr.btf_value_type_id = create_attr->btf_value_type_id;
 	attr.map_ifindex = create_attr->map_ifindex;
-	attr.inner_map_fd = create_attr->inner_map_fd;
+	if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS)
+		attr.btf_vmlinux_value_type_id =
+			create_attr->btf_vmlinux_value_type_id;
+	else
+		attr.inner_map_fd = create_attr->inner_map_fd;
 
 	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
@@ -228,7 +232,9 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 	memset(&attr, 0, sizeof(attr));
 	attr.prog_type = load_attr->prog_type;
 	attr.expected_attach_type = load_attr->expected_attach_type;
-	if (attr.prog_type == BPF_PROG_TYPE_TRACING) {
+	if (attr.prog_type == BPF_PROG_TYPE_STRUCT_OPS) {
+		attr.attach_btf_id = load_attr->attach_btf_id;
+	} else if (attr.prog_type == BPF_PROG_TYPE_TRACING) {
 		attr.attach_btf_id = load_attr->attach_btf_id;
 		attr.attach_prog_fd = load_attr->attach_prog_fd;
 	} else {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index f0ab8519986e..56341d117e5b 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -46,7 +46,10 @@ struct bpf_create_map_attr {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__u32 map_ifindex;
-	__u32 inner_map_fd;
+	union {
+		__u32 inner_map_fd;
+		__u32 btf_vmlinux_value_type_id;
+	};
 };
 
 LIBBPF_API int
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 9576a90c5a1c..dbd3244018fd 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -69,6 +69,11 @@
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
+static struct btf *bpf_find_kernel_btf(void);
+static struct bpf_map *bpf_object__add_map(struct bpf_object *obj);
+static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj,
+							int idx);
+
 static int __base_pr(enum libbpf_print_level level, const char *format,
 		     va_list args)
 {
@@ -228,10 +233,32 @@ struct bpf_program {
 	__u32 prog_flags;
 };
 
+struct bpf_struct_ops {
+	const char *tname;
+	const struct btf_type *type;
+	struct bpf_program **progs;
+	__u32 *kern_func_off;
+	/* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
+	void *data;
+	/* e.g. struct bpf_struct_ops_tcp_congestion_ops in
+	 *      btf_vmlinux's format.
+	 * struct bpf_struct_ops_tcp_congestion_ops {
+	 *	[... some other kernel fields ...]
+	 *	struct tcp_congestion_ops data;
+	 * }
+	 * kern_vdata-size == sizeof(struct bpf_struct_ops_tcp_congestion_ops)
+	 * bpf_map__init_kern_struct_ops() will populate the "kern_vdata"
+	 * from "data".
+	 */
+	void *kern_vdata;
+	__u32 type_id;
+};
+
 #define DATA_SEC ".data"
 #define BSS_SEC ".bss"
 #define RODATA_SEC ".rodata"
 #define KCONFIG_SEC ".kconfig"
+#define STRUCT_OPS_SEC ".struct_ops"
 
 enum libbpf_map_type {
 	LIBBPF_MAP_UNSPEC,
@@ -258,10 +285,12 @@ struct bpf_map {
 	struct bpf_map_def def;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
+	__u32 btf_vmlinux_value_type_id;
 	void *priv;
 	bpf_map_clear_priv_t clear_priv;
 	enum libbpf_map_type libbpf_type;
 	void *mmaped;
+	struct bpf_struct_ops *st_ops;
 	char *pin_path;
 	bool pinned;
 	bool reused;
@@ -325,6 +354,7 @@ struct bpf_object {
 		Elf_Data *data;
 		Elf_Data *rodata;
 		Elf_Data *bss;
+		Elf_Data *st_ops_data;
 		size_t strtabidx;
 		struct {
 			GElf_Shdr shdr;
@@ -338,6 +368,7 @@ struct bpf_object {
 		int data_shndx;
 		int rodata_shndx;
 		int bss_shndx;
+		int st_ops_shndx;
 	} efile;
 	/*
 	 * All loaded bpf_object is linked in a list, which is
@@ -565,6 +596,480 @@ static __u32 get_kernel_version(void)
 	return KERNEL_VERSION(major, minor, patch);
 }
 
+static const struct btf_type *
+resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
+static const struct btf_type *
+resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
+
+static const struct btf_member *
+find_member_by_offset(const struct btf_type *t, __u32 bit_offset)
+{
+	struct btf_member *m;
+	int i;
+
+	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
+		if (btf_member_bit_offset(t, i) == bit_offset)
+			return m;
+	}
+
+	return NULL;
+}
+
+static const struct btf_member *
+find_member_by_name(const struct btf *btf, const struct btf_type *t,
+		    const char *name)
+{
+	struct btf_member *m;
+	int i;
+
+	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
+		if (!strcmp(btf__name_by_offset(btf, m->name_off), name))
+			return m;
+	}
+
+	return NULL;
+}
+
+#define STRUCT_OPS_VALUE_PREFIX "bpf_struct_ops_"
+#define STRUCT_OPS_VALUE_PREFIX_LEN (sizeof(STRUCT_OPS_VALUE_PREFIX) - 1)
+
+static int
+find_struct_ops_kern_types(const struct btf *btf, const char *tname,
+			   const struct btf_type **type, __u32 *type_id,
+			   const struct btf_type **vtype, __u32 *vtype_id,
+			   const struct btf_member **data_member)
+{
+	const struct btf_type *kern_type, *kern_vtype;
+	const struct btf_member *kern_data_member;
+	__s32 kern_vtype_id, kern_type_id;
+	char vtname[128] = STRUCT_OPS_VALUE_PREFIX;
+	__u32 i;
+
+	kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
+	if (kern_type_id < 0) {
+		pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
+			tname);
+		return kern_type_id;
+	}
+	kern_type = btf__type_by_id(btf, kern_type_id);
+
+	/* Find the corresponding "map_value" type that will be used
+	 * in map_update(BPF_MAP_TYPE_STRUCT_OPS).  For example,
+	 * find "struct bpf_struct_ops_tcp_congestion_ops" from the
+	 * btf_vmlinux.
+	 */
+	strncat(vtname + STRUCT_OPS_VALUE_PREFIX_LEN, tname,
+		sizeof(vtname) - STRUCT_OPS_VALUE_PREFIX_LEN - 1);
+	kern_vtype_id = btf__find_by_name_kind(btf, vtname,
+					       BTF_KIND_STRUCT);
+	if (kern_vtype_id < 0) {
+		pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n",
+			vtname);
+		return kern_vtype_id;
+	}
+	kern_vtype = btf__type_by_id(btf, kern_vtype_id);
+
+	/* Find "struct tcp_congestion_ops" from
+	 * struct bpf_struct_ops_tcp_congestion_ops {
+	 *	[ ... ]
+	 *	struct tcp_congestion_ops data;
+	 * }
+	 */
+	kern_data_member = btf_members(kern_vtype);
+	for (i = 0; i < btf_vlen(kern_vtype); i++, kern_data_member++) {
+		if (kern_data_member->type == kern_type_id)
+			break;
+	}
+	if (i == btf_vlen(kern_vtype)) {
+		pr_warn("struct_ops init_kern: struct %s data is not found in struct %s\n",
+			tname, vtname);
+		return -EINVAL;
+	}
+
+	*type = kern_type;
+	*type_id = kern_type_id;
+	*vtype = kern_vtype;
+	*vtype_id = kern_vtype_id;
+	*data_member = kern_data_member;
+
+	return 0;
+}
+
+static bool bpf_map__is_struct_ops(const struct bpf_map *map)
+{
+	return map->def.type == BPF_MAP_TYPE_STRUCT_OPS;
+}
+
+/* Init the map's fields that depend on kern_btf */
+static int bpf_map__init_kern_struct_ops(struct bpf_map *map,
+					 const struct btf *btf,
+					 const struct btf *kern_btf)
+{
+	const struct btf_member *member, *kern_member, *kern_data_member;
+	const struct btf_type *type, *kern_type, *kern_vtype;
+	__u32 i, kern_type_id, kern_vtype_id, kern_data_off;
+	struct bpf_struct_ops *st_ops;
+	void *data, *kern_data;
+	const char *tname;
+	int err;
+
+	st_ops = map->st_ops;
+	type = st_ops->type;
+	tname = st_ops->tname;
+	err = find_struct_ops_kern_types(kern_btf, tname,
+					 &kern_type, &kern_type_id,
+					 &kern_vtype, &kern_vtype_id,
+					 &kern_data_member);
+	if (err)
+		return err;
+
+	pr_debug("struct_ops map %s init_kern %s: type_id:%u kern_type_id:%u kern_vtype_id:%u\n",
+		 map->name, tname, st_ops->type_id, kern_type_id,
+		 kern_vtype_id);
+
+	map->def.value_size = kern_vtype->size;
+	map->btf_vmlinux_value_type_id = kern_vtype_id;
+
+	st_ops->kern_vdata = calloc(1, kern_vtype->size);
+	if (!st_ops->kern_vdata)
+		return -ENOMEM;
+
+	data = st_ops->data;
+	kern_data_off = kern_data_member->offset / 8;
+	kern_data = st_ops->kern_vdata + kern_data_off;
+
+	member = btf_members(type);
+	for (i = 0; i < btf_vlen(type); i++, member++) {
+		const struct btf_type *mtype, *kern_mtype;
+		__u32 mtype_id, kern_mtype_id;
+		void *mdata, *kern_mdata;
+		__s64 msize, kern_msize;
+		__u32 moff, kern_moff;
+		__u32 kern_member_idx;
+		const char *mname;
+
+		mname = btf__name_by_offset(btf, member->name_off);
+		kern_member = find_member_by_name(kern_btf, kern_type, mname);
+		if (!kern_member) {
+			pr_warn("struct_ops map %s init_kern %s: Cannot find member %s in kernel BTF\n",
+				map->name, tname, mname);
+			return -ENOTSUP;
+		}
+
+		kern_member_idx = kern_member - btf_members(kern_type);
+		if (btf_member_bitfield_size(type, i) ||
+		    btf_member_bitfield_size(kern_type, kern_member_idx)) {
+			pr_warn("struct_ops map %s init_kern %s: bitfield %s is not supported\n",
+				map->name, tname, mname);
+			return -ENOTSUP;
+		}
+
+		moff = member->offset / 8;
+		kern_moff = kern_member->offset / 8;
+
+		mdata = data + moff;
+		kern_mdata = kern_data + kern_moff;
+
+		mtype_id = member->type;
+		kern_mtype_id = kern_member->type;
+
+		mtype = resolve_ptr(btf, mtype_id, NULL);
+		kern_mtype = resolve_ptr(kern_btf, kern_mtype_id, NULL);
+		if (mtype && kern_mtype) {
+			struct bpf_program *prog;
+
+			if (!btf_is_func_proto(mtype) ||
+			    !btf_is_func_proto(kern_mtype)) {
+				pr_warn("struct_ops map %s init_kern %s: non func ptr %s is not supported\n",
+					map->name, tname, mname);
+				return -ENOTSUP;
+			}
+
+			prog = st_ops->progs[i];
+			if (!prog) {
+				pr_debug("struct_ops map %s init_kern %s: func ptr %s is not set\n",
+					 map->name, tname, mname);
+				continue;
+			}
+
+			if (prog->type != BPF_PROG_TYPE_UNSPEC &&
+			    (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
+			     prog->attach_btf_id != kern_type_id ||
+			     prog->expected_attach_type != kern_member_idx)) {
+				pr_warn("struct_ops map %s init_kern %s: Cannot use prog %s in type %u attach_btf_id %u expected_attach_type %u for func ptr %s\n",
+					map->name, tname, prog->name,
+					prog->type, prog->attach_btf_id,
+					prog->expected_attach_type, mname);
+				return -ENOTSUP;
+			}
+
+			prog->type = BPF_PROG_TYPE_STRUCT_OPS;
+			prog->attach_btf_id = kern_type_id;
+			prog->expected_attach_type = kern_member_idx;
+
+			st_ops->kern_func_off[i] = kern_data_off + kern_moff;
+
+			pr_debug("struct_ops map %s init_kern %s: func ptr %s is set to prog %s from data(+%u) to kern_data(+%u)\n",
+				 map->name, tname, mname, prog->name, moff,
+				 kern_moff);
+
+			continue;
+		}
+
+		mtype_id = btf__resolve_type(btf, mtype_id);
+		kern_mtype_id = btf__resolve_type(kern_btf, kern_mtype_id);
+		if (mtype_id < 0 || kern_mtype_id < 0) {
+			pr_warn("struct_ops map %s init_kern %s: Cannot resolve the type for %s\n",
+				map->name, tname, mname);
+			return -ENOTSUP;
+		}
+
+		mtype = btf__type_by_id(btf, mtype_id);
+		kern_mtype = btf__type_by_id(kern_btf, kern_mtype_id);
+		if (BTF_INFO_KIND(mtype->info) !=
+		    BTF_INFO_KIND(kern_mtype->info)) {
+			pr_warn("struct_ops map %s init_kern %s: Unmatched member type %s %u != %u(kernel)\n",
+				map->name, tname, mname,
+				BTF_INFO_KIND(mtype->info),
+				BTF_INFO_KIND(kern_mtype->info));
+			return -ENOTSUP;
+		}
+
+		msize = btf__resolve_size(btf, mtype_id);
+		kern_msize = btf__resolve_size(kern_btf, kern_mtype_id);
+		if (msize < 0 || kern_msize < 0 || msize != kern_msize) {
+			pr_warn("struct_ops map %s init_kern %s: Error in size of member %s: %zd != %zd(kernel)\n",
+				map->name, tname, mname,
+				(ssize_t)msize, (ssize_t)kern_msize);
+			return -ENOTSUP;
+		}
+
+		pr_debug("struct_ops map %s init_kern %s: copy %s %u bytes from data(+%u) to kern_data(+%u)\n",
+			 map->name, tname, mname, (unsigned int)msize,
+			 moff, kern_moff);
+		memcpy(kern_mdata, mdata, msize);
+	}
+
+	return 0;
+}
+
+static int bpf_object__init_kern_struct_ops_maps(struct bpf_object *obj)
+{
+	struct btf *kern_btf = NULL;
+	struct bpf_map *map;
+	size_t i;
+	int err;
+
+	for (i = 0; i < obj->nr_maps; i++) {
+		map = &obj->maps[i];
+
+		if (!bpf_map__is_struct_ops(map))
+			continue;
+
+		if (!kern_btf) {
+			kern_btf = bpf_find_kernel_btf();
+			if (IS_ERR(kern_btf))
+				return PTR_ERR(kern_btf);
+		}
+
+		err = bpf_map__init_kern_struct_ops(map, obj->btf, kern_btf);
+		if (err) {
+			btf__free(kern_btf);
+			return err;
+		}
+	}
+
+	btf__free(kern_btf);
+	return 0;
+}
+
+static struct bpf_map *find_struct_ops_map_by_offset(struct bpf_object *obj,
+						     size_t offset)
+{
+	struct bpf_map *map;
+	size_t i;
+
+	for (i = 0; i < obj->nr_maps; i++) {
+		map = &obj->maps[i];
+		if (!bpf_map__is_struct_ops(map))
+			continue;
+		if (map->sec_offset <= offset &&
+		    offset - map->sec_offset < map->def.value_size)
+			return map;
+	}
+
+	return NULL;
+}
+
+/* Collect the reloc from ELF and populate the st_ops->progs[] */
+static int bpf_object__collect_struct_ops_map_reloc(struct bpf_object *obj,
+						    GElf_Shdr *shdr,
+						    Elf_Data *data)
+{
+	const struct btf_member *member;
+	struct bpf_struct_ops *st_ops;
+	struct bpf_program *prog;
+	const char *name, *tname;
+	unsigned int shdr_idx;
+	const struct btf *btf;
+	struct bpf_map *map;
+	Elf_Data *symbols;
+	unsigned int moff;
+	GElf_Sym sym;
+	GElf_Rel rel;
+	int i, nrels;
+
+	symbols = obj->efile.symbols;
+	btf = obj->btf;
+	nrels = shdr->sh_size / shdr->sh_entsize;
+	for (i = 0; i < nrels; i++) {
+		if (!gelf_getrel(data, i, &rel)) {
+			pr_warn("struct_ops map reloc: failed to get %d reloc\n", i);
+			return -LIBBPF_ERRNO__FORMAT;
+		}
+
+		if (!gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym)) {
+			pr_warn("struct_ops map reloc: symbol %" PRIx64 " not found\n",
+				GELF_R_SYM(rel.r_info));
+			return -LIBBPF_ERRNO__FORMAT;
+		}
+
+		name = elf_strptr(obj->efile.elf, obj->efile.strtabidx,
+				  sym.st_name) ? : "<?>";
+		map = find_struct_ops_map_by_offset(obj, rel.r_offset);
+		if (!map) {
+			pr_warn("struct_ops map reloc: cannot find map at rel.r_offset %zu\n",
+				(size_t)rel.r_offset);
+			return -EINVAL;
+		}
+
+		moff = rel.r_offset -  map->sec_offset;
+		shdr_idx = sym.st_shndx;
+		st_ops = map->st_ops;
+		tname = st_ops->tname;
+		pr_debug("struct_ops map %s reloc %s: for %lld value %lld shdr_idx %u rel.r_offset %zu map->sec_offset %zu name %d (\'%s\')\n",
+			 map->name, tname,
+			 (long long)(rel.r_info >> 32),
+			 (long long)sym.st_value,
+			 shdr_idx, (size_t)rel.r_offset,
+			 map->sec_offset, sym.st_name, name);
+
+		if (shdr_idx >= SHN_LORESERVE) {
+			pr_warn("struct_ops map %s reloc %s: rel.r_offset %zu shdr_idx %u unsupported non-static function\n",
+				map->name, tname, (size_t)rel.r_offset,
+				shdr_idx);
+			return -LIBBPF_ERRNO__RELOC;
+		}
+
+		member = find_member_by_offset(st_ops->type, moff * 8);
+		if (!member) {
+			pr_warn("struct_ops map %s reloc %s: cannot find member at moff %u\n",
+				map->name, tname, moff);
+			return -EINVAL;
+		}
+		name = btf__name_by_offset(btf, member->name_off);
+
+		if (!resolve_func_ptr(btf, member->type, NULL)) {
+			pr_warn("struct_ops map %s reloc %s: cannot relocate non func ptr %s\n",
+				map->name, tname, name);
+			return -EINVAL;
+		}
+
+		prog = bpf_object__find_prog_by_idx(obj, shdr_idx);
+		if (!prog) {
+			pr_warn("struct_ops map %s reloc %s: cannot find prog at shdr_idx %u to relocate func ptr %s\n",
+				map->name, tname, shdr_idx, name);
+			return -EINVAL;
+		}
+		st_ops->progs[member - btf_members(st_ops->type)] = prog;
+	}
+
+	return 0;
+}
+
+static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
+{
+	const struct btf_type *type, *datasec;
+	const struct btf_var_secinfo *vsi;
+	struct bpf_struct_ops *st_ops;
+	const char *tname, *var_name;
+	__s32 type_id, datasec_id;
+	const struct btf *btf;
+	struct bpf_map *map;
+	__u32 i;
+
+	if (obj->efile.st_ops_shndx == -1)
+		return 0;
+
+	btf = obj->btf;
+	datasec_id = btf__find_by_name_kind(btf, STRUCT_OPS_SEC,
+					    BTF_KIND_DATASEC);
+	if (datasec_id < 0) {
+		pr_warn("struct_ops init: DATASEC %s not found\n",
+			STRUCT_OPS_SEC);
+		return -EINVAL;
+	}
+
+	datasec = btf__type_by_id(btf, datasec_id);
+	vsi = btf_var_secinfos(datasec);
+	for (i = 0; i < btf_vlen(datasec); i++, vsi++) {
+		type = btf__type_by_id(obj->btf, vsi->type);
+		var_name = btf__name_by_offset(obj->btf, type->name_off);
+
+		type_id = btf__resolve_type(obj->btf, vsi->type);
+		if (type_id < 0) {
+			pr_warn("struct_ops init: Cannot resolve var type_id %u in DATASEC %s\n",
+				vsi->type, STRUCT_OPS_SEC);
+			return -EINVAL;
+		}
+
+		type = btf__type_by_id(obj->btf, type_id);
+		tname = btf__name_by_offset(obj->btf, type->name_off);
+		if (!btf_is_struct(type)) {
+			pr_warn("struct_ops init: %s is not a struct\n", tname);
+			return -EINVAL;
+		}
+
+		map = bpf_object__add_map(obj);
+		if (IS_ERR(map))
+			return PTR_ERR(map);
+
+		map->sec_idx = obj->efile.st_ops_shndx;
+		map->sec_offset = vsi->offset;
+		map->name = strdup(var_name);
+		if (!map->name)
+			return -ENOMEM;
+
+		map->def.type = BPF_MAP_TYPE_STRUCT_OPS;
+		map->def.key_size = sizeof(int);
+		map->def.value_size = type->size;
+		map->def.max_entries = 1;
+
+		map->st_ops = calloc(1, sizeof(*map->st_ops));
+		if (!map->st_ops)
+			return -ENOMEM;
+		st_ops = map->st_ops;
+		st_ops->data = malloc(type->size);
+		st_ops->progs = calloc(btf_vlen(type), sizeof(*st_ops->progs));
+		st_ops->kern_func_off = malloc(btf_vlen(type) *
+					       sizeof(*st_ops->kern_func_off));
+		if (!st_ops->data || !st_ops->progs || !st_ops->kern_func_off)
+			return -ENOMEM;
+
+		memcpy(st_ops->data,
+		       obj->efile.st_ops_data->d_buf + vsi->offset,
+		       type->size);
+		st_ops->tname = tname;
+		st_ops->type = type;
+		st_ops->type_id = type_id;
+
+		pr_debug("struct_ops init: %s found. type_id:%u var_name:%s offset:%u\n",
+			 tname, type_id, var_name, vsi->offset);
+	}
+
+	return 0;
+}
+
 static struct bpf_object *bpf_object__new(const char *path,
 					  const void *obj_buf,
 					  size_t obj_buf_sz,
@@ -606,6 +1111,7 @@ static struct bpf_object *bpf_object__new(const char *path,
 	obj->efile.data_shndx = -1;
 	obj->efile.rodata_shndx = -1;
 	obj->efile.bss_shndx = -1;
+	obj->efile.st_ops_shndx = -1;
 	obj->kconfig_map_idx = -1;
 
 	obj->kern_version = get_kernel_version();
@@ -629,6 +1135,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
 	obj->efile.data = NULL;
 	obj->efile.rodata = NULL;
 	obj->efile.bss = NULL;
+	obj->efile.st_ops_data = NULL;
 
 	zfree(&obj->efile.reloc_sects);
 	obj->efile.nr_reloc_sects = 0;
@@ -814,6 +1321,9 @@ int bpf_object__section_size(const struct bpf_object *obj, const char *name,
 	} else if (!strcmp(name, RODATA_SEC)) {
 		if (obj->efile.rodata)
 			*size = obj->efile.rodata->d_size;
+	} else if (!strcmp(name, STRUCT_OPS_SEC)) {
+		if (obj->efile.st_ops_data)
+			*size = obj->efile.st_ops_data->d_size;
 	} else {
 		ret = bpf_object_search_section_size(obj, name, &d_size);
 		if (!ret)
@@ -1439,6 +1949,30 @@ skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
 	return t;
 }
 
+static const struct btf_type *
+resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = skip_mods_and_typedefs(btf, id, NULL);
+	if (!btf_is_ptr(t))
+		return NULL;
+
+	return skip_mods_and_typedefs(btf, t->type, res_id);
+}
+
+static const struct btf_type *
+resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = resolve_ptr(btf, id, res_id);
+	if (t && btf_is_func_proto(t))
+		return t;
+
+	return NULL;
+}
+
 /*
  * Fetch integer attribute of BTF map definition. Such attributes are
  * represented using a pointer to an array, in which dimensionality of array
@@ -1786,6 +2320,7 @@ static int bpf_object__init_maps(struct bpf_object *obj,
 	err = err ?: bpf_object__init_user_btf_maps(obj, strict, pin_root_path);
 	err = err ?: bpf_object__init_global_data_maps(obj);
 	err = err ?: bpf_object__init_kconfig_map(obj);
+	err = err ?: bpf_object__init_struct_ops_maps(obj);
 	if (err)
 		return err;
 
@@ -1888,7 +2423,8 @@ static void bpf_object__sanitize_btf_ext(struct bpf_object *obj)
 static bool bpf_object__is_btf_mandatory(const struct bpf_object *obj)
 {
 	return obj->efile.btf_maps_shndx >= 0 ||
-	       obj->nr_extern > 0;
+		obj->efile.st_ops_shndx >= 0 ||
+		obj->nr_extern > 0;
 }
 
 static int bpf_object__init_btf(struct bpf_object *obj,
@@ -2087,6 +2623,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 			} else if (strcmp(name, RODATA_SEC) == 0) {
 				obj->efile.rodata = data;
 				obj->efile.rodata_shndx = idx;
+			} else if (strcmp(name, STRUCT_OPS_SEC) == 0) {
+				obj->efile.st_ops_data = data;
+				obj->efile.st_ops_shndx = idx;
 			} else {
 				pr_debug("skip section(%d) %s\n", idx, name);
 			}
@@ -2096,7 +2635,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 			int sec = sh.sh_info; /* points to other section */
 
 			/* Only do relo for section with exec instructions */
-			if (!section_have_execinstr(obj, sec)) {
+			if (!section_have_execinstr(obj, sec) &&
+			    strcmp(name, ".rel" STRUCT_OPS_SEC)) {
 				pr_debug("skip relo %s(%d) for section(%d)\n",
 					 name, idx, sec);
 				continue;
@@ -2598,8 +3138,12 @@ static int bpf_map_find_btf_info(struct bpf_object *obj, struct bpf_map *map)
 	__u32 key_type_id = 0, value_type_id = 0;
 	int ret;
 
-	/* if it's BTF-defined map, we don't need to search for type IDs */
-	if (map->sec_idx == obj->efile.btf_maps_shndx)
+	/* if it's BTF-defined map, we don't need to search for type IDs.
+	 * For struct_ops map, it does not need btf_key_type_id and
+	 * btf_value_type_id.
+	 */
+	if (map->sec_idx == obj->efile.btf_maps_shndx ||
+	    bpf_map__is_struct_ops(map))
 		return 0;
 
 	if (!bpf_map__is_internal(map)) {
@@ -3024,6 +3568,9 @@ bpf_object__create_maps(struct bpf_object *obj)
 		if (bpf_map_type__is_map_in_map(def->type) &&
 		    map->inner_map_fd >= 0)
 			create_attr.inner_map_fd = map->inner_map_fd;
+		if (bpf_map__is_struct_ops(map))
+			create_attr.btf_vmlinux_value_type_id =
+				map->btf_vmlinux_value_type_id;
 
 		if (obj->btf && !bpf_map_find_btf_info(obj, map)) {
 			create_attr.btf_fd = btf__fd(obj->btf);
@@ -3874,7 +4421,7 @@ static struct btf *btf_load_raw(const char *path)
  * Probe few well-known locations for vmlinux kernel image and try to load BTF
  * data out of it to use for target BTF.
  */
-static struct btf *bpf_core_find_kernel_btf(void)
+static struct btf *bpf_find_kernel_btf(void)
 {
 	struct {
 		const char *path_fmt;
@@ -4155,7 +4702,7 @@ bpf_core_reloc_fields(struct bpf_object *obj, const char *targ_btf_path)
 	if (targ_btf_path)
 		targ_btf = btf__parse_elf(targ_btf_path, NULL);
 	else
-		targ_btf = bpf_core_find_kernel_btf();
+		targ_btf = bpf_find_kernel_btf();
 	if (IS_ERR(targ_btf)) {
 		pr_warn("failed to get target BTF: %ld\n", PTR_ERR(targ_btf));
 		return PTR_ERR(targ_btf);
@@ -4374,6 +4921,15 @@ static int bpf_object__collect_reloc(struct bpf_object *obj)
 			return -LIBBPF_ERRNO__INTERNAL;
 		}
 
+		if (idx == obj->efile.st_ops_shndx) {
+			err = bpf_object__collect_struct_ops_map_reloc(obj,
+								       shdr,
+								       data);
+			if (err)
+				return err;
+			continue;
+		}
+
 		prog = bpf_object__find_prog_by_idx(obj, idx);
 		if (!prog) {
 			pr_warn("relocation failed: no section(%d)\n", idx);
@@ -4408,7 +4964,9 @@ load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt,
 	load_attr.insns = insns;
 	load_attr.insns_cnt = insns_cnt;
 	load_attr.license = license;
-	if (prog->type == BPF_PROG_TYPE_TRACING) {
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+		load_attr.attach_btf_id = prog->attach_btf_id;
+	} else if (prog->type == BPF_PROG_TYPE_TRACING) {
 		load_attr.attach_prog_fd = prog->attach_prog_fd;
 		load_attr.attach_btf_id = prog->attach_btf_id;
 	} else {
@@ -4749,8 +5307,11 @@ int bpf_object__unload(struct bpf_object *obj)
 	if (!obj)
 		return -EINVAL;
 
-	for (i = 0; i < obj->nr_maps; i++)
+	for (i = 0; i < obj->nr_maps; i++) {
 		zclose(obj->maps[i].fd);
+		if (obj->maps[i].st_ops)
+			zfree(&obj->maps[i].st_ops->kern_vdata);
+	}
 
 	for (i = 0; i < obj->nr_programs; i++)
 		bpf_program__unload(&obj->programs[i]);
@@ -4866,6 +5427,7 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
 	err = err ? : bpf_object__resolve_externs(obj, obj->kconfig);
 	err = err ? : bpf_object__sanitize_and_load_btf(obj);
 	err = err ? : bpf_object__sanitize_maps(obj);
+	err = err ? : bpf_object__init_kern_struct_ops_maps(obj);
 	err = err ? : bpf_object__create_maps(obj);
 	err = err ? : bpf_object__relocate(obj, attr->target_btf_path);
 	err = err ? : bpf_object__load_progs(obj, attr->log_level);
@@ -5453,6 +6015,13 @@ void bpf_object__close(struct bpf_object *obj)
 			map->mmaped = NULL;
 		}
 
+		if (map->st_ops) {
+			zfree(&map->st_ops->data);
+			zfree(&map->st_ops->progs);
+			zfree(&map->st_ops->kern_func_off);
+			zfree(&map->st_ops);
+		}
+
 		zfree(&map->name);
 		zfree(&map->pin_path);
 	}
@@ -5954,7 +6523,7 @@ int libbpf_prog_type_by_name(const char *name, enum bpf_prog_type *prog_type,
 int libbpf_find_vmlinux_btf_id(const char *name,
 			       enum bpf_attach_type attach_type)
 {
-	struct btf *btf = bpf_core_find_kernel_btf();
+	struct btf *btf = bpf_find_kernel_btf();
 	char raw_tp_btf[128] = BTF_PREFIX;
 	char *dst = raw_tp_btf + sizeof(BTF_PREFIX) - 1;
 	const char *btf_name;
@@ -6780,6 +7349,58 @@ struct bpf_link *bpf_program__attach(struct bpf_program *prog)
 	return sec_def->attach_fn(sec_def, prog);
 }
 
+static int bpf_link__detach_struct_ops(struct bpf_link *link)
+{
+	struct bpf_link_fd *l = (void *)link;
+	__u32 zero = 0;
+
+	if (bpf_map_delete_elem(l->fd, &zero))
+		return -errno;
+
+	return 0;
+}
+
+struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map)
+{
+	struct bpf_struct_ops *st_ops;
+	struct bpf_link_fd *link;
+	__u32 i, zero = 0;
+	int err;
+
+	if (!bpf_map__is_struct_ops(map) || map->fd == -1)
+		return ERR_PTR(-EINVAL);
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return ERR_PTR(-EINVAL);
+
+	st_ops = map->st_ops;
+	for (i = 0; i < btf_vlen(st_ops->type); i++) {
+		struct bpf_program *prog = st_ops->progs[i];
+		void *kern_data;
+		int prog_fd;
+
+		if (!prog)
+			continue;
+
+		prog_fd = bpf_program__fd(prog);
+		kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
+		*(unsigned long *)kern_data = prog_fd;
+	}
+
+	err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
+	if (err) {
+		err = -errno;
+		free(link);
+		return ERR_PTR(err);
+	}
+
+	link->link.detach = bpf_link__detach_struct_ops;
+	link->fd = map->fd;
+
+	return (struct bpf_link *)link;
+}
+
 enum bpf_perf_event_ret
 bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size,
 			   void **copy_mem, size_t *copy_size,
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index fe592ef48f1b..9c54f252f90f 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -355,6 +355,7 @@ struct bpf_map_def {
  * so no need to worry about a name clash.
  */
 struct bpf_map;
+LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map);
 LIBBPF_API struct bpf_map *
 bpf_object__find_map_by_name(const struct bpf_object *obj, const char *name);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index e9713a574243..55912a165caf 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -213,6 +213,7 @@ LIBBPF_0.0.7 {
 	global:
 		btf_dump__emit_type_decl;
 		bpf_link__disconnect;
+		bpf_map__attach_struct_ops;
 		bpf_object__find_program_by_name;
 		bpf_object__attach_skeleton;
 		bpf_object__destroy_skeleton;
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index a9eb8b322671..7f06942e9574 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -103,6 +103,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 	case BPF_PROG_TYPE_TRACING:
+	case BPF_PROG_TYPE_STRUCT_OPS:
 	default:
 		break;
 	}
@@ -251,6 +252,7 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex)
 	case BPF_MAP_TYPE_XSKMAP:
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+	case BPF_MAP_TYPE_STRUCT_OPS:
 	default:
 		break;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (9 preceding siblings ...)
  2019-12-21  6:26 ` [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
@ 2019-12-21  6:26 ` Martin KaFai Lau
  2019-12-23 23:26   ` Andrii Nakryiko
  10 siblings, 1 reply; 45+ messages in thread
From: Martin KaFai Lau @ 2019-12-21  6:26 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch adds a bpf_dctcp example.  It currently does not implement
the no-ECN fallback, but the same could be done through cgrp2-bpf.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
 3 files changed, 656 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c

diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
new file mode 100644
index 000000000000..7ba8c1b4157a
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -0,0 +1,228 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __BPF_TCP_HELPERS_H
+#define __BPF_TCP_HELPERS_H
+
+#include <stdbool.h>
+#include <linux/types.h>
+#include <bpf_helpers.h>
+#include <bpf_core_read.h>
+#include "bpf_trace_helpers.h"
+
+#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
+
+struct sock_common {
+	unsigned char	skc_state;
+} __attribute__((preserve_access_index));
+
+struct sock {
+	struct sock_common	__sk_common;
+} __attribute__((preserve_access_index));
+
+struct inet_sock {
+	struct sock		sk;
+} __attribute__((preserve_access_index));
+
+struct inet_connection_sock {
+	struct inet_sock	  icsk_inet;
+	__u8			  icsk_ca_state:6,
+				  icsk_ca_setsockopt:1,
+				  icsk_ca_dst_locked:1;
+	struct {
+		__u8		  pending;
+	} icsk_ack;
+	__u64			  icsk_ca_priv[104 / sizeof(__u64)];
+} __attribute__((preserve_access_index));
+
+struct tcp_sock {
+	struct inet_connection_sock	inet_conn;
+
+	__u32	rcv_nxt;
+	__u32	snd_nxt;
+	__u32	snd_una;
+	__u8	ecn_flags;
+	__u32	delivered;
+	__u32	delivered_ce;
+	__u32	snd_cwnd;
+	__u32	snd_cwnd_cnt;
+	__u32	snd_cwnd_clamp;
+	__u32	snd_ssthresh;
+	__u8	syn_data:1,	/* SYN includes data */
+		syn_fastopen:1,	/* SYN includes Fast Open option */
+		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
+		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
+		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
+		save_syn:1,	/* Save headers of SYN packet */
+		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
+		syn_smc:1;	/* SYN includes SMC */
+	__u32	max_packets_out;
+	__u32	lsndtime;
+	__u32	prior_cwnd;
+} __attribute__((preserve_access_index));
+
+static __always_inline struct inet_connection_sock *inet_csk(const struct sock *sk)
+{
+	return (struct inet_connection_sock *)sk;
+}
+
+static __always_inline void *inet_csk_ca(const struct sock *sk)
+{
+	return (void *)inet_csk(sk)->icsk_ca_priv;
+}
+
+static __always_inline struct tcp_sock *tcp_sk(const struct sock *sk)
+{
+	return (struct tcp_sock *)sk;
+}
+
+static __always_inline bool before(__u32 seq1, __u32 seq2)
+{
+	return (__s32)(seq1-seq2) < 0;
+}
+#define after(seq2, seq1) 	before(seq1, seq2)
+
+#define	TCP_ECN_OK		1
+#define	TCP_ECN_QUEUE_CWR	2
+#define	TCP_ECN_DEMAND_CWR	4
+#define	TCP_ECN_SEEN		8
+
+enum inet_csk_ack_state_t {
+	ICSK_ACK_SCHED	= 1,
+	ICSK_ACK_TIMER  = 2,
+	ICSK_ACK_PUSHED = 4,
+	ICSK_ACK_PUSHED2 = 8,
+	ICSK_ACK_NOW = 16	/* Send the next ACK immediately (once) */
+};
+
+enum tcp_ca_event {
+	CA_EVENT_TX_START = 0,
+	CA_EVENT_CWND_RESTART = 1,
+	CA_EVENT_COMPLETE_CWR = 2,
+	CA_EVENT_LOSS = 3,
+	CA_EVENT_ECN_NO_CE = 4,
+	CA_EVENT_ECN_IS_CE = 5,
+};
+
+enum tcp_ca_state {
+	TCP_CA_Open = 0,
+	TCP_CA_Disorder = 1,
+	TCP_CA_CWR = 2,
+	TCP_CA_Recovery = 3,
+	TCP_CA_Loss = 4
+};
+
+struct ack_sample {
+	__u32 pkts_acked;
+	__s32 rtt_us;
+	__u32 in_flight;
+} __attribute__((preserve_access_index));
+
+struct rate_sample {
+	__u64  prior_mstamp; /* starting timestamp for interval */
+	__u32  prior_delivered;	/* tp->delivered at "prior_mstamp" */
+	__s32  delivered;		/* number of packets delivered over interval */
+	long interval_us;	/* time for tp->delivered to incr "delivered" */
+	__u32 snd_interval_us;	/* snd interval for delivered packets */
+	__u32 rcv_interval_us;	/* rcv interval for delivered packets */
+	long rtt_us;		/* RTT of last (S)ACKed packet (or -1) */
+	int  losses;		/* number of packets marked lost upon ACK */
+	__u32  acked_sacked;	/* number of packets newly (S)ACKed upon ACK */
+	__u32  prior_in_flight;	/* in flight before this ACK */
+	bool is_app_limited;	/* is sample from packet with bubble in pipe? */
+	bool is_retrans;	/* is sample from retransmission? */
+	bool is_ack_delayed;	/* is this (likely) a delayed ACK? */
+} __attribute__((preserve_access_index));
+
+#define TCP_CA_NAME_MAX		16
+#define TCP_CONG_NEEDS_ECN	0x2
+
+struct tcp_congestion_ops {
+	__u32 flags;
+
+	/* initialize private data (optional) */
+	void (*init)(struct sock *sk);
+	/* cleanup private data  (optional) */
+	void (*release)(struct sock *sk);
+
+	/* return slow start threshold (required) */
+	__u32 (*ssthresh)(struct sock *sk);
+	/* do new cwnd calculation (required) */
+	void (*cong_avoid)(struct sock *sk, __u32 ack, __u32 acked);
+	/* call before changing ca_state (optional) */
+	void (*set_state)(struct sock *sk, __u8 new_state);
+	/* call when cwnd event occurs (optional) */
+	void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
+	/* call when ack arrives (optional) */
+	void (*in_ack_event)(struct sock *sk, __u32 flags);
+	/* new value of cwnd after loss (required) */
+	__u32  (*undo_cwnd)(struct sock *sk);
+	/* hook for packet ack accounting (optional) */
+	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
+	/* override sysctl_tcp_min_tso_segs */
+	__u32 (*min_tso_segs)(struct sock *sk);
+	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
+	__u32 (*sndbuf_expand)(struct sock *sk);
+	/* call when packets are delivered to update cwnd and pacing rate,
+	 * after all the ca_state processing. (optional)
+	 */
+	void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
+
+	char 		name[TCP_CA_NAME_MAX];
+};
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+#define max(a, b) ((a) > (b) ? (a) : (b))
+#define min_not_zero(x, y) ({			\
+	typeof(x) __x = (x);			\
+	typeof(y) __y = (y);			\
+	__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
+
+static __always_inline __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked)
+{
+	__u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);
+
+	acked -= cwnd - tp->snd_cwnd;
+	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
+
+	return acked;
+}
+
+static __always_inline bool tcp_in_slow_start(const struct tcp_sock *tp)
+{
+	return tp->snd_cwnd < tp->snd_ssthresh;
+}
+
+static __always_inline bool tcp_is_cwnd_limited(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	/* If in slow start, ensure cwnd grows to twice what was ACKed. */
+	if (tcp_in_slow_start(tp))
+		return tp->snd_cwnd < 2 * tp->max_packets_out;
+
+	return !!BPF_CORE_READ_BITFIELD(tp, is_cwnd_limited);
+}
+
+static __always_inline void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, __u32 acked)
+{
+	/* If credits accumulated at a higher w, apply them gently now. */
+	if (tp->snd_cwnd_cnt >= w) {
+		tp->snd_cwnd_cnt = 0;
+		tp->snd_cwnd++;
+	}
+
+	tp->snd_cwnd_cnt += acked;
+	if (tp->snd_cwnd_cnt >= w) {
+		__u32 delta = tp->snd_cwnd_cnt / w;
+
+		tp->snd_cwnd_cnt -= delta * w;
+		tp->snd_cwnd += delta;
+	}
+	tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
new file mode 100644
index 000000000000..7fc05d990f4d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <linux/err.h>
+#include <test_progs.h>
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+static const struct timeval timeo_sec = { .tv_sec = 10 };
+static const size_t timeo_optlen = sizeof(timeo_sec);
+static int stop, duration;
+
+static int settimeo(int fd)
+{
+	int err;
+
+	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo_sec,
+			 timeo_optlen);
+	if (CHECK(err == -1, "setsockopt(fd, SO_RCVTIMEO)", "errno:%d\n",
+		  errno))
+		return -1;
+
+	err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo_sec,
+			 timeo_optlen);
+	if (CHECK(err == -1, "setsockopt(fd, SO_SNDTIMEO)", "errno:%d\n",
+		  errno))
+		return -1;
+
+	return 0;
+}
+
+static int settcpca(int fd, const char *tcp_ca)
+{
+	int err;
+
+	err = setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, tcp_ca, strlen(tcp_ca));
+	if (CHECK(err == -1, "setsockopt(fd, TCP_CONGESTION)", "errno:%d\n",
+		  errno))
+		return -1;
+
+	return 0;
+}
+
+static void *server(void *arg)
+{
+	int lfd = (int)(long)arg, err = 0, fd;
+	ssize_t nr_sent = 0, bytes = 0;
+	char batch[1500];
+
+	fd = accept(lfd, NULL, NULL);
+	while (fd == -1) {
+		if (errno == EINTR)
+			continue;
+		err = -errno;
+		goto done;
+	}
+
+	if (settimeo(fd)) {
+		err = -errno;
+		goto done;
+	}
+
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_sent = send(fd, &batch,
+			       min(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_sent == -1 && errno == EINTR)
+			continue;
+		if (nr_sent == -1) {
+			err = -errno;
+			break;
+		}
+		bytes += nr_sent;
+	}
+
+	CHECK(bytes != total_bytes, "send", "%zd != %u nr_sent:%zd errno:%d\n",
+	      bytes, total_bytes, nr_sent, errno);
+
+done:
+	if (fd != -1)
+		close(fd);
+	if (err) {
+		WRITE_ONCE(stop, 1);
+		return ERR_PTR(err);
+	}
+	return NULL;
+}
+
+static void do_test(const char *tcp_ca)
+{
+	struct sockaddr_in6 sa6 = {};
+	ssize_t nr_recv = 0, bytes = 0;
+	int lfd = -1, fd = -1;
+	pthread_t srv_thread;
+	socklen_t addrlen = sizeof(sa6);
+	void *thread_ret;
+	char batch[1500];
+	int err;
+
+	WRITE_ONCE(stop, 0);
+
+	lfd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (CHECK(lfd == -1, "socket", "errno:%d\n", errno))
+		return;
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (CHECK(fd == -1, "socket", "errno:%d\n", errno)) {
+		close(lfd);
+		return;
+	}
+
+	if (settcpca(lfd, tcp_ca) || settcpca(fd, tcp_ca) ||
+	    settimeo(lfd) || settimeo(fd))
+		goto done;
+
+	/* bind, listen and start server thread to accept */
+	sa6.sin6_family = AF_INET6;
+	sa6.sin6_addr = in6addr_loopback;
+	err = bind(lfd, (struct sockaddr *)&sa6, addrlen);
+	if (CHECK(err == -1, "bind", "errno:%d\n", errno))
+		goto done;
+	err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
+	if (CHECK(err == -1, "getsockname", "errno:%d\n", errno))
+		goto done;
+	err = listen(lfd, 1);
+	if (CHECK(err == -1, "listen", "errno:%d\n", errno))
+		goto done;
+	err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
+	if (CHECK(err != 0, "pthread_create", "err:%d\n", err))
+		goto done;
+
+	/* connect to server */
+	err = connect(fd, (struct sockaddr *)&sa6, addrlen);
+	if (CHECK(err == -1, "connect", "errno:%d\n", errno))
+		goto wait_thread;
+
+	/* recv total_bytes */
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_recv = recv(fd, &batch,
+			       min(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_recv == -1 && errno == EINTR)
+			continue;
+		if (nr_recv == -1)
+			break;
+		bytes += nr_recv;
+	}
+
+	CHECK(bytes != total_bytes, "recv", "%zd != %u nr_recv:%zd errno:%d\n",
+	      bytes, total_bytes, nr_recv, errno);
+
+wait_thread:
+	WRITE_ONCE(stop, 1);
+	pthread_join(srv_thread, &thread_ret);
+	CHECK(IS_ERR(thread_ret), "pthread_join", "thread_ret:%ld",
+	      PTR_ERR(thread_ret));
+done:
+	close(lfd);
+	close(fd);
+}
+
+static struct bpf_object *load(const char *filename, const char *map_name,
+			       struct bpf_link **link)
+{
+	struct bpf_object *obj;
+	struct bpf_map *map;
+	struct bpf_link *l;
+	int err;
+
+	obj = bpf_object__open(filename);
+	if (CHECK(IS_ERR(obj), "bpf_obj__open_file", "obj:%ld\n",
+		  PTR_ERR(obj)))
+		return obj;
+
+	err = bpf_object__load(obj);
+	if (CHECK(err, "bpf_object__load", "err:%d\n", err)) {
+		bpf_object__close(obj);
+		return ERR_PTR(err);
+	}
+
+	map = bpf_object__find_map_by_name(obj, map_name);
+	if (CHECK(!map, "bpf_object__find_map_by_name", "%s not found\n",
+		    map_name)) {
+		bpf_object__close(obj);
+		return ERR_PTR(-ENOENT);
+	}
+
+	l = bpf_map__attach_struct_ops(map);
+	if (CHECK(IS_ERR(l), "bpf_struct_ops_map__attach", "err:%ld\n",
+		  PTR_ERR(l))) {
+		bpf_object__close(obj);
+		return (void *)l;
+	}
+
+	*link = l;
+
+	return obj;
+}
+
+static void test_dctcp(void)
+{
+	struct bpf_object *obj;
+	/* compiler warning... */
+	struct bpf_link *link = NULL;
+
+	obj = load("bpf_dctcp.o", "dctcp", &link);
+	if (IS_ERR(obj))
+		return;
+
+	do_test("bpf_dctcp");
+
+	bpf_link__destroy(link);
+	bpf_object__close(obj);
+}
+
+void test_bpf_tcp_ca(void)
+{
+	if (test__start_subtest("dctcp"))
+		test_dctcp();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
new file mode 100644
index 000000000000..5f9b613663e5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+/* WARNING: This implementation is not necessarily the same
+ * as tcp_dctcp.c.  The purpose is mainly for testing
+ * the kernel BPF logic.
+ */
+
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define DCTCP_MAX_ALPHA	1024U
+
+struct dctcp {
+	__u32 old_delivered;
+	__u32 old_delivered_ce;
+	__u32 prior_rcv_nxt;
+	__u32 dctcp_alpha;
+	__u32 next_seq;
+	__u32 ce_state;
+	__u32 loss_cwnd;
+};
+
+static unsigned int dctcp_shift_g = 4; /* g = 1/2^4 */
+static unsigned int dctcp_alpha_on_init = DCTCP_MAX_ALPHA;
+
+static __always_inline void dctcp_reset(const struct tcp_sock *tp,
+					struct dctcp *ca)
+{
+	ca->next_seq = tp->snd_nxt;
+
+	ca->old_delivered = tp->delivered;
+	ca->old_delivered_ce = tp->delivered_ce;
+}
+
+BPF_TCP_OPS_1(dctcp_init, void, struct sock *, sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct dctcp *ca = inet_csk_ca(sk);
+
+	ca->prior_rcv_nxt = tp->rcv_nxt;
+	ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
+	ca->loss_cwnd = 0;
+	ca->ce_state = 0;
+
+	dctcp_reset(tp, ca);
+}
+
+BPF_TCP_OPS_1(dctcp_ssthresh, __u32, struct sock *, sk)
+{
+	struct dctcp *ca = inet_csk_ca(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	ca->loss_cwnd = tp->snd_cwnd;
+	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
+}
+
+BPF_TCP_OPS_2(dctcp_update_alpha, void,
+	      struct sock *, sk, __u32, flags)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct dctcp *ca = inet_csk_ca(sk);
+
+	/* Expired RTT */
+	if (!before(tp->snd_una, ca->next_seq)) {
+		__u32 delivered_ce = tp->delivered_ce - ca->old_delivered_ce;
+		__u32 alpha = ca->dctcp_alpha;
+
+		/* alpha = (1 - g) * alpha + g * F */
+
+		alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g);
+		if (delivered_ce) {
+			__u32 delivered = tp->delivered - ca->old_delivered;
+
+			/* If dctcp_shift_g == 1, a 32bit value would overflow
+			 * after 8 M packets.
+			 */
+			delivered_ce <<= (10 - dctcp_shift_g);
+			delivered_ce /= max(1U, delivered);
+
+			alpha = min(alpha + delivered_ce, DCTCP_MAX_ALPHA);
+		}
+		ca->dctcp_alpha = alpha;
+		dctcp_reset(tp, ca);
+	}
+}
+
+static __always_inline void dctcp_react_to_loss(struct sock *sk)
+{
+	struct dctcp *ca = inet_csk_ca(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	ca->loss_cwnd = tp->snd_cwnd;
+	tp->snd_ssthresh = max(tp->snd_cwnd >> 1U, 2U);
+}
+
+BPF_TCP_OPS_2(dctcp_state, void, struct sock *, sk, __u8, new_state)
+{
+	if (new_state == TCP_CA_Recovery &&
+	    new_state != BPF_CORE_READ_BITFIELD(inet_csk(sk), icsk_ca_state))
+		dctcp_react_to_loss(sk);
+	/* We handle RTO in dctcp_cwnd_event to ensure that we perform only
+	 * one loss-adjustment per RTT.
+	 */
+}
+
+static __always_inline void dctcp_ece_ack_cwr(struct sock *sk, __u32 ce_state)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (ce_state == 1)
+		tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+	else
+		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+}
+
+/* Minimal DCTCP CE state machine:
+ *
+ * S:	0 <- last pkt was non-CE
+ *	1 <- last pkt was CE
+ */
+static __always_inline
+void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
+			  __u32 *prior_rcv_nxt, __u32 *ce_state)
+{
+	__u32 new_ce_state = (evt == CA_EVENT_ECN_IS_CE) ? 1 : 0;
+
+	if (*ce_state != new_ce_state) {
+		/* CE state has changed, force an immediate ACK to
+		 * reflect the new CE state. If an ACK was delayed,
+		 * send that first to reflect the prior CE state.
+		 */
+		if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
+			dctcp_ece_ack_cwr(sk, *ce_state);
+			bpf_tcp_send_ack(sk, *prior_rcv_nxt);
+		}
+		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+	}
+	*prior_rcv_nxt = tcp_sk(sk)->rcv_nxt;
+	*ce_state = new_ce_state;
+	dctcp_ece_ack_cwr(sk, new_ce_state);
+}
+
+BPF_TCP_OPS_2(dctcp_cwnd_event, void,
+	      struct sock *, sk, enum tcp_ca_event, ev)
+{
+	struct dctcp *ca = inet_csk_ca(sk);
+
+	switch (ev) {
+	case CA_EVENT_ECN_IS_CE:
+	case CA_EVENT_ECN_NO_CE:
+		dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state);
+		break;
+	case CA_EVENT_LOSS:
+		dctcp_react_to_loss(sk);
+		break;
+	default:
+		/* Don't care for the rest. */
+		break;
+	}
+}
+
+BPF_TCP_OPS_1(dctcp_cwnd_undo, __u32, struct sock *, sk)
+{
+	const struct dctcp *ca = inet_csk_ca(sk);
+
+	return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd);
+}
+
+BPF_TCP_OPS_3(tcp_reno_cong_avoid, void,
+	      struct sock *, sk, __u32, ack, __u32, acked)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tcp_is_cwnd_limited(sk))
+		return;
+
+	/* In "safe" area, increase. */
+	if (tcp_in_slow_start(tp)) {
+		acked = tcp_slow_start(tp, acked);
+		if (!acked)
+			return;
+	}
+	/* In dangerous area, increase slowly. */
+	tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
+}
+
+SEC(".struct_ops")
+struct tcp_congestion_ops dctcp_nouse = {
+	.init		= (void *)dctcp_init,
+	.set_state	= (void *)dctcp_state,
+	.flags		= TCP_CONG_NEEDS_ECN,
+	.name		= "bpf_dctcp_nouse",
+};
+
+SEC(".struct_ops")
+struct tcp_congestion_ops dctcp = {
+	.init		= (void *)dctcp_init,
+	.in_ack_event   = (void *)dctcp_update_alpha,
+	.cwnd_event	= (void *)dctcp_cwnd_event,
+	.ssthresh	= (void *)dctcp_ssthresh,
+	.cong_avoid	= (void *)tcp_reno_cong_avoid,
+	.undo_cwnd	= (void *)dctcp_cwnd_undo,
+	.set_state	= (void *)dctcp_state,
+	.flags		= TCP_CONG_NEEDS_ECN,
+	.name		= "bpf_dctcp",
+};
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access
  2019-12-21  6:26 ` [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
@ 2019-12-23  7:49   ` Yonghong Song
  2019-12-23 20:05   ` Andrii Nakryiko
  1 sibling, 0 replies; 45+ messages in thread
From: Yonghong Song @ 2019-12-23  7:49 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> This patch allows bitfield access as a scalar.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-21  6:26 ` [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-23 19:33   ` Yonghong Song
  2019-12-23 20:29   ` Andrii Nakryiko
  2019-12-24 11:46   ` kbuild test robot
  2 siblings, 0 replies; 45+ messages in thread
From: Yonghong Song @ 2019-12-23 19:33 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> This patch allows the kernel's struct ops (i.e. func ptr) to be
> implemented in BPF.  The first use case in this series is the
> "struct tcp_congestion_ops" which will be introduced in a
> later patch.
> 
> This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
> The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
> func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
> of a kernel struct.  The attr->expected_attach_type is the member
> "index" of that kernel struct.  The first member of a struct starts
> with member index 0.  That will avoid ambiguity when a kernel struct
> has multiple func ptrs with the same func signature.
> 
> For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
> to implement the "init" func ptr of the "struct tcp_congestion_ops".
> The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
> of the _running_ kernel.  The attr->expected_attach_type is 3.
> 
> The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
> by arch_prepare_bpf_trampoline that will be done in the next
> patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
> 
> "struct bpf_struct_ops" is introduced as a common interface for the kernel
> struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
> struct will need to implement an instance of the "struct bpf_struct_ops".
> 
> The supporting kernel struct also needs to implement a bpf_verifier_ops.
> During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
> bpf_verifier_ops by searching the attr->attach_btf_id.
> 
> A new "btf_struct_access" is also added to the bpf_verifier_ops such
> that the supporting kernel struct can optionally provide its own specific
> check on accessing the func arg (e.g. provide limited write access).
> 
> After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
> to initialize some values (e.g. the btf id of the supporting kernel
> struct) and it can only be done once the btf_vmlinux is available.
> 
> The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
> if the return type of the prog->aux->attach_func_proto is "void".
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support
  2019-12-21  6:26 ` [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
@ 2019-12-23 19:54   ` Andrii Nakryiko
  2019-12-26 22:47     ` Martin Lau
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 19:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch adds BPF STRUCT_OPS support to libbpf.
>
> The only sec_name convention is SEC(".struct_ops") to identify the
> struct_ops implemented in BPF,
> e.g. To implement a tcp_congestion_ops:
>
> SEC(".struct_ops")
> struct tcp_congestion_ops dctcp = {
>         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
>         /* ... some more func ptrs ... */
>         .name           = "bpf_dctcp",
> };
>
> Each struct_ops is defined as a global variable under SEC(".struct_ops")
> as above.  libbpf creates a map for each variable and the variable name
> is the map's name.  Multiple struct_ops are supported under
> SEC(".struct_ops").
>
> In the bpf_object__open phase, libbpf will look for the SEC(".struct_ops")
> section and find out what is the btf-type the struct_ops is
> implementing.  Note that the btf-type here is referring to
> a type in the bpf_prog.o's btf.  A "struct bpf_map" is added
> by bpf_object__add_map() as other maps do.  It will then
> collect (through SHT_REL) where are the bpf progs that the
> func ptrs are referring to.  No btf_vmlinux is needed in
> the open phase.
>
> In the bpf_object__load phase, the map-fields, which depend
> on the btf_vmlinux, are initialized (in bpf_map__init_kern_struct_ops()).
> It will also set the prog->type, prog->attach_btf_id, and
> prog->expected_attach_type.  Thus, the prog's properties do
> not rely on its section name.
> [ Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
>   process is as simple as: member-name match + btf-kind match + size match.
>   If these matching conditions fail, libbpf will reject.
>   The current targeting support is "struct tcp_congestion_ops", most of
>   whose members are function pointers.
>   The member ordering of the bpf_prog's btf-type can be different from
>   the btf_vmlinux's btf-type. ]
>
> Then, all obj->maps are created as usual (in bpf_object__create_maps()).
>
> Once the maps are created and prog's properties are all set,
> the libbpf will proceed to load all the progs.
>
> bpf_map__attach_struct_ops() is added to register a struct_ops
> map to a kernel subsystem.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

This looks great, Martin! I just have a few nits/suggestions, but
overall I think this approach is much better.

After this lands, please follow up with adding support for struct_ops
maps into BPF skeleton, so that it's attached automatically on
skeleton load.
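
To make that follow-up concrete, here is a rough sketch of how a user
could attach a struct_ops map through a skeleton once that support
exists (the bpf_dctcp__* names and skel->maps.dctcp are assumed
skeleton-generated identifiers, not part of this series):

	/* assumes the generated bpf_dctcp.skel.h and <linux/err.h> */
	struct bpf_dctcp *skel;
	struct bpf_link *link;

	skel = bpf_dctcp__open_and_load();
	if (!skel)
		return -1;

	link = bpf_map__attach_struct_ops(skel->maps.dctcp);
	if (IS_ERR(link)) {
		bpf_dctcp__destroy(skel);
		return -1;
	}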

>  tools/lib/bpf/bpf.c           |  10 +-
>  tools/lib/bpf/bpf.h           |   5 +-
>  tools/lib/bpf/libbpf.c        | 639 +++++++++++++++++++++++++++++++++-
>  tools/lib/bpf/libbpf.h        |   1 +
>  tools/lib/bpf/libbpf.map      |   1 +
>  tools/lib/bpf/libbpf_probes.c |   2 +
>  6 files changed, 646 insertions(+), 12 deletions(-)
>

[...]

> +       member = btf_members(type);
> +       for (i = 0; i < btf_vlen(type); i++, member++) {
> +               const struct btf_type *mtype, *kern_mtype;
> +               __u32 mtype_id, kern_mtype_id;
> +               void *mdata, *kern_mdata;
> +               __s64 msize, kern_msize;
> +               __u32 moff, kern_moff;
> +               __u32 kern_member_idx;
> +               const char *mname;
> +
> +               mname = btf__name_by_offset(btf, member->name_off);
> +               kern_member = find_member_by_name(kern_btf, kern_type, mname);
> +               if (!kern_member) {
> +                       pr_warn("struct_ops map %s init_kern %s: Cannot find member %s in kernel BTF\n",
> +                               map->name, tname, mname);
> +                       return -ENOTSUP;
> +               }
> +
> +               kern_member_idx = kern_member - btf_members(kern_type);
> +               if (btf_member_bitfield_size(type, i) ||
> +                   btf_member_bitfield_size(kern_type, kern_member_idx)) {
> +                       pr_warn("struct_ops map %s init_kern %s: bitfield %s is not supported\n",
> +                               map->name, tname, mname);
> +                       return -ENOTSUP;
> +               }
> +
> +               moff = member->offset / 8;
> +               kern_moff = kern_member->offset / 8;
> +
> +               mdata = data + moff;
> +               kern_mdata = kern_data + kern_moff;
> +
> +               mtype_id = member->type;
> +               kern_mtype_id = kern_member->type;
> +
> +               mtype = resolve_ptr(btf, mtype_id, NULL);
> +               kern_mtype = resolve_ptr(kern_btf, kern_mtype_id, NULL);
> +               if (mtype && kern_mtype) {

This check seems more logical after you resolve mtype_id and
kern_mtype_id below and check that they have the same BTF_INFO_KIND. After
that you can check for the pointer case and handle it; if they are not
pointers, proceed to determining the size.

That way you might also get rid of the resolve_ptr and resolve_func_ptr
functions, because here you have already resolved the pointer, so plain
btf__resolve_type will give you what it's pointing to. Then the only
use of resolve_func_ptr/resolve_ptr will be in
bpf_object__collect_struct_ops_map_reloc, which you can just inline at
that point and not really lose any readability.
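
For illustration, one possible shape of that reordering (a sketch
against the names in this hunk, not a drop-in patch):

	mtype = skip_mods_and_typedefs(btf, member->type, &mtype_id);
	kern_mtype = skip_mods_and_typedefs(kern_btf, kern_member->type,
					    &kern_mtype_id);
	if (BTF_INFO_KIND(mtype->info) != BTF_INFO_KIND(kern_mtype->info)) {
		/* warn about the kind mismatch and bail out */
		return -ENOTSUP;
	}

	if (btf_is_ptr(mtype)) {
		/* resolve what both pointers point to, require func protos,
		 * then set prog->type, attach_btf_id and expected_attach_type
		 * exactly as the current code does
		 */
		mtype = skip_mods_and_typedefs(btf, mtype->type, &mtype_id);
		kern_mtype = skip_mods_and_typedefs(kern_btf, kern_mtype->type,
						    &kern_mtype_id);
		if (!btf_is_func_proto(mtype) || !btf_is_func_proto(kern_mtype))
			return -ENOTSUP;
		/* ... hook up st_ops->progs[i] as before ... */
		continue;
	}

	/* non-pointer member: fall through to the size check and memcpy */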

> +                       struct bpf_program *prog;
> +
> +                       if (!btf_is_func_proto(mtype) ||
> +                           !btf_is_func_proto(kern_mtype)) {
> +                               pr_warn("struct_ops map %s init_kern %s: non func ptr %s is not supported\n",
> +                                       map->name, tname, mname);
> +                               return -ENOTSUP;
> +                       }
> +
> +                       prog = st_ops->progs[i];
> +                       if (!prog) {
> +                               pr_debug("struct_ops map %s init_kern %s: func ptr %s is not set\n",
> +                                        map->name, tname, mname);
> +                               continue;
> +                       }
> +
> +                       if (prog->type != BPF_PROG_TYPE_UNSPEC &&
> +                           (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +                            prog->attach_btf_id != kern_type_id ||
> +                            prog->expected_attach_type != kern_member_idx)) {
> +                               pr_warn("struct_ops map %s init_kern %s: Cannot use prog %s in type %u attach_btf_id %u expected_attach_type %u for func ptr %s\n",
> +                                       map->name, tname, prog->name,
> +                                       prog->type, prog->attach_btf_id,
> +                                       prog->expected_attach_type, mname);
> +                               return -ENOTSUP;
> +                       }
> +
> +                       prog->type = BPF_PROG_TYPE_STRUCT_OPS;
> +                       prog->attach_btf_id = kern_type_id;
> +                       prog->expected_attach_type = kern_member_idx;
> +
> +                       st_ops->kern_func_off[i] = kern_data_off + kern_moff;
> +
> +                       pr_debug("struct_ops map %s init_kern %s: func ptr %s is set to prog %s from data(+%u) to kern_data(+%u)\n",
> +                                map->name, tname, mname, prog->name, moff,
> +                                kern_moff);
> +
> +                       continue;
> +               }
> +
> +               mtype_id = btf__resolve_type(btf, mtype_id);
> +               kern_mtype_id = btf__resolve_type(kern_btf, kern_mtype_id);
> +               if (mtype_id < 0 || kern_mtype_id < 0) {
> +                       pr_warn("struct_ops map %s init_kern %s: Cannot resolve the type for %s\n",
> +                               map->name, tname, mname);
> +                       return -ENOTSUP;
> +               }
> +
> +               mtype = btf__type_by_id(btf, mtype_id);
> +               kern_mtype = btf__type_by_id(kern_btf, kern_mtype_id);
> +               if (BTF_INFO_KIND(mtype->info) !=
> +                   BTF_INFO_KIND(kern_mtype->info)) {
> +                       pr_warn("struct_ops map %s init_kern %s: Unmatched member type %s %u != %u(kernel)\n",
> +                               map->name, tname, mname,
> +                               BTF_INFO_KIND(mtype->info),
> +                               BTF_INFO_KIND(kern_mtype->info));
> +                       return -ENOTSUP;
> +               }
> +
> +               msize = btf__resolve_size(btf, mtype_id);
> +               kern_msize = btf__resolve_size(kern_btf, kern_mtype_id);
> +               if (msize < 0 || kern_msize < 0 || msize != kern_msize) {
> +                       pr_warn("struct_ops map %s init_kern %s: Error in size of member %s: %zd != %zd(kernel)\n",
> +                               map->name, tname, mname,
> +                               (ssize_t)msize, (ssize_t)kern_msize);
> +                       return -ENOTSUP;
> +               }
> +
> +               pr_debug("struct_ops map %s init_kern %s: copy %s %u bytes from data(+%u) to kern_data(+%u)\n",
> +                        map->name, tname, mname, (unsigned int)msize,
> +                        moff, kern_moff);
> +               memcpy(kern_mdata, mdata, msize);
> +       }
> +
> +       return 0;
> +}
> +

[...]

> +       symbols = obj->efile.symbols;
> +       btf = obj->btf;
> +       nrels = shdr->sh_size / shdr->sh_entsize;
> +       for (i = 0; i < nrels; i++) {
> +               if (!gelf_getrel(data, i, &rel)) {
> +                       pr_warn("struct_ops map reloc: failed to get %d reloc\n", i);
> +                       return -LIBBPF_ERRNO__FORMAT;
> +               }
> +
> +               if (!gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym)) {
> +                       pr_warn("struct_ops map reloc: symbol %" PRIx64 " not found\n",

please use %zx and explicit cast to size_t instead of PRIx64
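
For example, something along these lines:

	pr_warn("struct_ops map reloc: symbol %zx not found\n",
		(size_t)GELF_R_SYM(rel.r_info));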

> +                               GELF_R_SYM(rel.r_info));
> +                       return -LIBBPF_ERRNO__FORMAT;
> +               }
> +
> +               name = elf_strptr(obj->efile.elf, obj->efile.strtabidx,
> +                                 sym.st_name) ? : "<?>";
> +               map = find_struct_ops_map_by_offset(obj, rel.r_offset);
> +               if (!map) {
> +                       pr_warn("struct_ops map reloc: cannot find map at rel.r_offset %zu\n",
> +                               (size_t)rel.r_offset);
> +                       return -EINVAL;
> +               }
> +
> +               moff = rel.r_offset -  map->sec_offset;

nit: double space

> +               shdr_idx = sym.st_shndx;
> +               st_ops = map->st_ops;
> +               tname = st_ops->tname;

[...]

> +       datasec = btf__type_by_id(btf, datasec_id);
> +       vsi = btf_var_secinfos(datasec);
> +       for (i = 0; i < btf_vlen(datasec); i++, vsi++) {
> +               type = btf__type_by_id(obj->btf, vsi->type);
> +               var_name = btf__name_by_offset(obj->btf, type->name_off);
> +
> +               type_id = btf__resolve_type(obj->btf, vsi->type);
> +               if (type_id < 0) {
> +                       pr_warn("struct_ops init: Cannot resolve var type_id %u in DATASEC %s\n",
> +                               vsi->type, STRUCT_OPS_SEC);
> +                       return -EINVAL;
> +               }
> +
> +               type = btf__type_by_id(obj->btf, type_id);
> +               tname = btf__name_by_offset(obj->btf, type->name_off);
> +               if (!btf_is_struct(type)) {

if tname is empty, it's also not of much use, so it might be a good idea
to error out here with some context?
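
A possible shape for that check (just a sketch, the exact message is up
to you):

	if (!tname[0]) {
		pr_warn("struct_ops init: anonymous struct type for var %s in %s\n",
			var_name, STRUCT_OPS_SEC);
		return -ENOTSUP;
	}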

> +                       pr_warn("struct_ops init: %s is not a struct\n", tname);
> +                       return -EINVAL;
> +               }
> +
> +               map = bpf_object__add_map(obj);
> +               if (IS_ERR(map))
> +                       return PTR_ERR(map);
> +
> +               map->sec_idx = obj->efile.st_ops_shndx;
> +               map->sec_offset = vsi->offset;
> +               map->name = strdup(var_name);
> +               if (!map->name)
> +                       return -ENOMEM;
> +
> +               map->def.type = BPF_MAP_TYPE_STRUCT_OPS;
> +               map->def.key_size = sizeof(int);
> +               map->def.value_size = type->size;
> +               map->def.max_entries = 1;
> +
> +               map->st_ops = calloc(1, sizeof(*map->st_ops));
> +               if (!map->st_ops)
> +                       return -ENOMEM;
> +               st_ops = map->st_ops;
> +               st_ops->data = malloc(type->size);
> +               st_ops->progs = calloc(btf_vlen(type), sizeof(*st_ops->progs));
> +               st_ops->kern_func_off = malloc(btf_vlen(type) *
> +                                              sizeof(*st_ops->kern_func_off));
> +               if (!st_ops->data || !st_ops->progs || !st_ops->kern_func_off)
> +                       return -ENOMEM;
> +
> +               memcpy(st_ops->data,
> +                      obj->efile.st_ops_data->d_buf + vsi->offset,
> +                      type->size);

maybe also check that d_size is big enough to read data from?
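
For example (a sketch, assuming st_ops_data->d_size is the section's
data size):

	if (vsi->offset + type->size > obj->efile.st_ops_data->d_size) {
		pr_warn("struct_ops init: var %s is beyond the end of %s\n",
			var_name, STRUCT_OPS_SEC);
		return -EINVAL;
	}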

> +               st_ops->tname = tname;
> +               st_ops->type = type;
> +               st_ops->type_id = type_id;
> +
> +               pr_debug("struct_ops init: %s found. type_id:%u var_name:%s offset:%u\n",
> +                        tname, type_id, var_name, vsi->offset);
> +       }
> +
> +       return 0;
> +}
> +

[...]

>
> -       for (i = 0; i < obj->nr_maps; i++)
> +       for (i = 0; i < obj->nr_maps; i++) {
>                 zclose(obj->maps[i].fd);
> +               if (obj->maps[i].st_ops)
> +                       zfree(&obj->maps[i].st_ops->kern_vdata);

any specific reason to deallocate only kern_vdata? maybe just
consolidate all the cleanup in bpf_object__close instead?

> +       }
>
>         for (i = 0; i < obj->nr_programs; i++)
>                 bpf_program__unload(&obj->programs[i]);
> @@ -4866,6 +5427,7 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
>         err = err ? : bpf_object__resolve_externs(obj, obj->kconfig);
>         err = err ? : bpf_object__sanitize_and_load_btf(obj);
>         err = err ? : bpf_object__sanitize_maps(obj);
> +       err = err ? : bpf_object__init_kern_struct_ops_maps(obj);
>         err = err ? : bpf_object__create_maps(obj);
>         err = err ? : bpf_object__relocate(obj, attr->target_btf_path);
>         err = err ? : bpf_object__load_progs(obj, attr->log_level);
> @@ -5453,6 +6015,13 @@ void bpf_object__close(struct bpf_object *obj)
>                         map->mmaped = NULL;
>                 }
>
> +               if (map->st_ops) {
> +                       zfree(&map->st_ops->data);
> +                       zfree(&map->st_ops->progs);
> +                       zfree(&map->st_ops->kern_func_off);
> +                       zfree(&map->st_ops);
> +               }
> +
>                 zfree(&map->name);
>                 zfree(&map->pin_path);
>         }
> @@ -5954,7 +6523,7 @@ int libbpf_prog_type_by_name(const char *name, enum bpf_prog_type *prog_type,
>  int libbpf_find_vmlinux_btf_id(const char *name,
>                                enum bpf_attach_type attach_type)
>  {
> -       struct btf *btf = bpf_core_find_kernel_btf();
> +       struct btf *btf = bpf_find_kernel_btf();
>         char raw_tp_btf[128] = BTF_PREFIX;
>         char *dst = raw_tp_btf + sizeof(BTF_PREFIX) - 1;
>         const char *btf_name;
> @@ -6780,6 +7349,58 @@ struct bpf_link *bpf_program__attach(struct bpf_program *prog)
>         return sec_def->attach_fn(sec_def, prog);
>  }
>
> +static int bpf_link__detach_struct_ops(struct bpf_link *link)
> +{
> +       struct bpf_link_fd *l = (void *)link;
> +       __u32 zero = 0;
> +
> +       if (bpf_map_delete_elem(l->fd, &zero))
> +               return -errno;
> +
> +       return 0;
> +}
> +
> +struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map)
> +{

This looks great!

[...]

>  enum bpf_perf_event_ret
>  bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size,
>                            void **copy_mem, size_t *copy_size,
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index fe592ef48f1b..9c54f252f90f 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -355,6 +355,7 @@ struct bpf_map_def {
>   * so no need to worry about a name clash.
>   */
>  struct bpf_map;
> +LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map);

nit: can you please move it into the section with all the bpf_link and
attach APIs?

>  LIBBPF_API struct bpf_map *
>  bpf_object__find_map_by_name(const struct bpf_object *obj, const char *name);
>

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-21  6:26 ` [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-23 19:57   ` Yonghong Song
  2019-12-23 21:44     ` Andrii Nakryiko
  2019-12-27  6:16     ` Martin Lau
  2019-12-23 23:05   ` Andrii Nakryiko
  2019-12-24 12:28   ` kbuild test robot
  2 siblings, 2 replies; 45+ messages in thread
From: Yonghong Song @ 2019-12-23 19:57 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> is a kernel struct with its func ptr implemented in bpf prog.
> This new map is the interface to register/unregister/introspect
> a bpf implemented kernel struct.
> 
> The kernel struct is actually embedded inside another new struct
> (or called the "value" struct in the code).  For example,
> "struct tcp_congestion_ops" is embbeded in:
> struct bpf_struct_ops_tcp_congestion_ops {
> 	refcount_t refcnt;
> 	enum bpf_struct_ops_state state;
> 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> }
> The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> The "bpftool map dump" will then be able to show the
> state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> is created automatically by a macro.  Having a separate "value" struct
> will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> initialization works before registering the struct_ops to the kernel
> subsystem).  The libbpf will take care of finding and populating the
> "struct bpf_struct_ops_XYZ" from "struct XYZ".
> 
> Register a struct_ops to a kernel subsystem:
> 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
>     set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
>     running kernel.
>     Instead of reusing the attr->btf_value_type_id,
>     btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
>     used as the "user" btf which could store other useful sysadmin/debug
>     info that may be introduced in the future,
>     e.g. creation-date/compiler-details/map-creator...etc.
> 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
>     in the running kernel btf.  Populate the value of this object.
>     The function ptr should be populated with the prog fds.
> 4. Call BPF_MAP_UPDATE with the object created in (3) as
>     the map value.  The key is always "0".
> 
> During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> the specific struct_ops to do some final checks in "st_ops->init_member()"
> (e.g. ensure all mandatory func ptrs are implemented).
> If everything looks good, it will register this kernel struct
> to the kernel subsystem.  The map will not allow further update
> from this point.
> 
> Unregister a struct_ops from the kernel subsystem:
> BPF_MAP_DELETE with key "0".
> 
> Introspect a struct_ops:
> BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> have the prog _id_ populated as the func ptr.
> 
> The map value state (enum bpf_struct_ops_state) will transit from:
> INIT (map created) =>
> INUSE (map updated, i.e. reg) =>
> TOBEFREE (map value deleted, i.e. unreg)
> 
> The kernel subsystem needs to call bpf_struct_ops_get() and
> bpf_struct_ops_put() to manage the "refcnt" in the
> "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
>    for the purpose of tracking the subsystem usage.  Another approach
> is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> the map-fd/pinned-map usage.  However, that will also tie down the
> future semantics of map->refcnt and map->usercnt.
> 
> The very first subsystem's refcnt (during reg()) holds one
> count to map->refcnt.  When the very last subsystem's refcnt
> is gone, it will also release the map->refcnt.  All bpf_prog will be
> freed when the map->refcnt reaches 0 (i.e. during map_free()).
> 
> Here is how the bpftool map command will look like:
> [root@arch-fb-vm1 bpf]# bpftool map show
> 6: struct_ops  name dctcp  flags 0x0
> 	key 4B  value 256B  max_entries 1  memlock 4096B
> 	btf_id 6
> [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> [{
>          "value": {
>              "refcnt": {
>                  "refs": {
>                      "counter": 1
>                  }
>              },
>              "state": 1,

The bpftool dump with "state" 1 is a little bit cryptic.
Since this is common for all struct_ops maps, can we
make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE?
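
For instance, bpftool (or whatever does the pretty-printing) could map
the numeric state to a name instead of printing the raw integer; a
rough sketch, with the names taken from the enum introduced in this
patch (the enum is kernel-internal here, so the names would have to be
exposed to the tool somehow, e.g. via BTF):

	static const char * const struct_ops_state_str[] = {
		[BPF_STRUCT_OPS_STATE_INIT]	= "INIT",
		[BPF_STRUCT_OPS_STATE_INUSE]	= "INUSE",
		[BPF_STRUCT_OPS_STATE_TOBEFREE]	= "TOBEFREE",
	};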

>              "data": {
>                  "list": {
>                      "next": 0,
>                      "prev": 0
>                  },
>                  "key": 0,
>                  "flags": 2,
>                  "init": 24,
>                  "release": 0,
>                  "ssthresh": 25,
>                  "cong_avoid": 30,
>                  "set_state": 27,
>                  "cwnd_event": 28,
>                  "in_ack_event": 26,
>                  "undo_cwnd": 29,
>                  "pkts_acked": 0,
>                  "min_tso_segs": 0,
>                  "sndbuf_expand": 0,
>                  "cong_control": 0,
>                  "get_info": 0,
>                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
>                  ],
>                  "owner": 0
>              }
>          }
>      }
> ]
> 
> Misc Notes:
> * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
>    It does an in-place update on "*value" instead of returning a pointer
>    to syscall.c.  Otherwise, it needs a separate copy of "zero" value
>    for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
> 
> * The bpf_struct_ops_map_delete_elem() is also called without
>    preempt_disable() from map_delete_elem().  It is because
>    the "->unreg()" may require a sleepable context, e.g.
>    the "tcp_unregister_congestion_control()".
> 
> * "const" is added to some of the existing "struct btf_func_model *"
>    function arg to avoid a compiler warning caused by this patch.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   arch/x86/net/bpf_jit_comp.c |  11 +-
>   include/linux/bpf.h         |  49 +++-
>   include/linux/bpf_types.h   |   3 +
>   include/linux/btf.h         |  13 +
>   include/uapi/linux/bpf.h    |   7 +-
>   kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
>   kernel/bpf/btf.c            |  20 +-
>   kernel/bpf/map_in_map.c     |   3 +-
>   kernel/bpf/syscall.c        |  49 ++--
>   kernel/bpf/trampoline.c     |   5 +-
>   kernel/bpf/verifier.c       |   5 +
>   11 files changed, 593 insertions(+), 40 deletions(-)
> 
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c1eeb3e0e116..38059880963e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -136,6 +136,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_STACK,
>   	BPF_MAP_TYPE_SK_STORAGE,
>   	BPF_MAP_TYPE_DEVMAP_HASH,
> +	BPF_MAP_TYPE_STRUCT_OPS,
>   };
>   
>   /* Note that tracing related programs such as
> @@ -398,6 +399,10 @@ union bpf_attr {
>   		__u32	btf_fd;		/* fd pointing to a BTF type data */
>   		__u32	btf_key_type_id;	/* BTF type_id of the key */
>   		__u32	btf_value_type_id;	/* BTF type_id of the value */
> +		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
> +						   * struct stored as the
> +						   * map value
> +						   */
>   	};
>   
>   	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> @@ -3350,7 +3355,7 @@ struct bpf_map_info {
>   	__u32 map_flags;
>   	char  name[BPF_OBJ_NAME_LEN];
>   	__u32 ifindex;
> -	__u32 :32;
> +	__u32 btf_vmlinux_value_type_id;
>   	__u64 netns_dev;
>   	__u64 netns_ino;
>   	__u32 btf_id;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index c9f81bd1df83..fb9a0b3e4580 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -10,8 +10,66 @@
>   #include <linux/seq_file.h>
>   #include <linux/refcount.h>
>   
> +enum bpf_struct_ops_state {
> +	BPF_STRUCT_OPS_STATE_INIT,
> +	BPF_STRUCT_OPS_STATE_INUSE,
> +	BPF_STRUCT_OPS_STATE_TOBEFREE,
> +};
> +
> +#define BPF_STRUCT_OPS_COMMON_VALUE			\
> +	refcount_t refcnt;				\
> +	enum bpf_struct_ops_state state
> +
> +struct bpf_struct_ops_value {
> +	BPF_STRUCT_OPS_COMMON_VALUE;
> +	char data[0] ____cacheline_aligned_in_smp;
> +};
> +
> +struct bpf_struct_ops_map {
> +	struct bpf_map map;
> +	const struct bpf_struct_ops *st_ops;
> +	/* protect map_update */
> +	spinlock_t lock;
> +	/* progs has all the bpf_prog that is populated
> +	 * to the func ptr of the kernel's struct
> +	 * (in kvalue.data).
> +	 */
> +	struct bpf_prog **progs;
> +	/* image is a page that has all the trampolines
> +	 * that stores the func args before calling the bpf_prog.
> +	 * A PAGE_SIZE "image" is enough to store all trampoline for
> +	 * "progs[]".
> +	 */
> +	void *image;
> +	/* uvalue->data stores the kernel struct
> +	 * (e.g. tcp_congestion_ops) that is more useful
> +	 * to userspace than the kvalue.  For example,
> +	 * the bpf_prog's id is stored instead of the kernel
> +	 * address of a func ptr.
> +	 */
> +	struct bpf_struct_ops_value *uvalue;
> +	/* kvalue.data stores the actual kernel's struct
> +	 * (e.g. tcp_congestion_ops) that will be
> +	 * registered to the kernel subsystem.
> +	 */
> +	struct bpf_struct_ops_value kvalue;
> +};
> +
> +#define VALUE_PREFIX "bpf_struct_ops_"
> +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
> +
> +/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
> + * the map's value exposed to the userspace and its btf-type-id is
> + * stored at the map->btf_vmlinux_value_type_id.
> + *
> + */
>   #define BPF_STRUCT_OPS_TYPE(_name)				\
> -extern struct bpf_struct_ops bpf_##_name;
> +extern struct bpf_struct_ops bpf_##_name;			\
> +								\
> +struct bpf_struct_ops_##_name {						\
> +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> +	struct _name data ____cacheline_aligned_in_smp;		\
> +};
>   #include "bpf_struct_ops_types.h"
>   #undef BPF_STRUCT_OPS_TYPE
>   
> @@ -35,19 +93,51 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
>   const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
>   };
>   
> +static const struct btf_type *module_type;
> +
>   void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   {
> +	s32 type_id, value_id, module_id;
>   	const struct btf_member *member;
>   	struct bpf_struct_ops *st_ops;
>   	struct bpf_verifier_log log = {};
>   	const struct btf_type *t;
> +	char value_name[128];
>   	const char *mname;
> -	s32 type_id;
>   	u32 i, j;
>   
> +	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
> +#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
> +#include "bpf_struct_ops_types.h"
> +#undef BPF_STRUCT_OPS_TYPE

This looks great!
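
For readers following along: the trick here is just to reference the type so
that the compiler keeps its debug info (which pahole then converts to BTF);
no code or data is generated.  Conceptually the macro boils down to something
like this sketch (see the actual definition elsewhere in this series):

  #define BTF_TYPE_EMIT(type) ((void)(type *)0)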

> +
> +	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> +					  BTF_KIND_STRUCT);
> +	if (module_id < 0) {
> +		pr_warn("Cannot find struct module in btf_vmlinux\n");
> +		return;
> +	}
> +	module_type = btf_type_by_id(_btf_vmlinux, module_id);
> +
>   	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
>   		st_ops = bpf_struct_ops[i];
>   
> +		if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
> +		    sizeof(value_name)) {
> +			pr_warn("struct_ops name %s is too long\n",
> +				st_ops->name);
> +			continue;
> +		}
> +		sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
> +
> +		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> +						 BTF_KIND_STRUCT);
> +		if (value_id < 0) {
> +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> +				value_name);
> +			continue;
> +		}
> +
>   		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
>   						BTF_KIND_STRUCT);
>   		if (type_id < 0) {
> @@ -99,6 +189,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   			} else {
>   				st_ops->type_id = type_id;
>   				st_ops->type = t;
> +				st_ops->value_id = value_id;
> +				st_ops->value_type =
> +					btf_type_by_id(_btf_vmlinux, value_id);
>   			}
>   		}
>   	}
> @@ -106,6 +199,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   
>   extern struct btf *btf_vmlinux;
>   
[...]
> +
> +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> +					  void *value, u64 flags)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> +	struct bpf_struct_ops_value *uvalue, *kvalue;
> +	const struct btf_member *member;
> +	const struct btf_type *t = st_ops->type;
> +	void *udata, *kdata;
> +	int prog_fd, err = 0;
> +	void *image;
> +	u32 i;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (*(u32 *)key != 0)
> +		return -E2BIG;
> +
> +	uvalue = (struct bpf_struct_ops_value *)value;
> +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> +		return -EINVAL;
> +
> +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> +	spin_lock(&st_map->lock);
> +
> +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> +		err = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	memcpy(uvalue, value, map->value_size);
> +
> +	udata = &uvalue->data;
> +	kdata = &kvalue->data;
> +	image = st_map->image;
> +
> +	for_each_member(i, t, member) {
> +		const struct btf_type *mtype, *ptype;
> +		struct bpf_prog *prog;
> +		u32 moff;
> +
> +		moff = btf_member_bit_offset(t, member) / 8;
> +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> +		if (ptype == module_type) {
> +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> +			continue;
> +		}
> +
> +		err = st_ops->init_member(t, member, kdata, udata);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		/* The ->init_member() has handled this member */
> +		if (err > 0)
> +			continue;
> +
> +		/* If st_ops->init_member does not handle it,
> +		 * we will only handle func ptrs and zero-ed members
> +		 * here.  Reject everything else.
> +		 */
> +
> +		/* All non func ptr member must be 0 */
> +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> +					       NULL)) {
> +			u32 msize;
> +
> +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> +						 &msize, NULL, NULL);
> +			if (IS_ERR(mtype)) {
> +				err = PTR_ERR(mtype);
> +				goto reset_unlock;
> +			}
> +
> +			if (memchr_inv(udata + moff, 0, msize)) {
> +				err = -EINVAL;
> +				goto reset_unlock;
> +			}
> +
> +			continue;
> +		}
> +
> +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> +		/* Similar check as the attr->attach_prog_fd */
> +		if (!prog_fd)
> +			continue;
> +
> +		prog = bpf_prog_get(prog_fd);
> +		if (IS_ERR(prog)) {
> +			err = PTR_ERR(prog);
> +			goto reset_unlock;
> +		}
> +		st_map->progs[i] = prog;
> +
> +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +		    prog->aux->attach_btf_id != st_ops->type_id ||
> +		    prog->expected_attach_type != i) {
> +			err = -EINVAL;
> +			goto reset_unlock;
> +		}
> +
> +		err = arch_prepare_bpf_trampoline(image,
> +						  &st_ops->func_models[i], 0,
> +						  &prog, 1, NULL, 0, NULL);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		*(void **)(kdata + moff) = image;
> +		image += err;
> +
> +		/* put prog_id to udata */
> +		*(unsigned long *)(udata + moff) = prog->aux->id;
> +	}

Should we check whether the user indeed provided the `module` member
before declaring success?
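
If we want that, a small flag in this loop would be enough, e.g.
(illustrative sketch only):

  bool has_module_owner = false;
  ...
  if (ptype == module_type) {
          *(void **)(kdata + moff) = BPF_MODULE_OWNER;
          has_module_owner = true;
          continue;
  }
  ...
  /* after the for_each_member() loop */
  if (!has_module_owner) {
          err = -EINVAL;
          goto reset_unlock;
  }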

> +
> +	refcount_set(&kvalue->refcnt, 1);
> +	bpf_map_inc(map);
> +
> +	err = st_ops->reg(kdata);
> +	if (!err) {
> +		/* Pair with smp_load_acquire() during lookup */
> +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> +		goto unlock;
> +	}
> +
> +	/* Error during st_ops->reg() */
> +	bpf_map_put(map);
> +
> +reset_unlock:
> +	bpf_struct_ops_map_put_progs(st_map);
> +	memset(uvalue, 0, map->value_size);
> +	memset(kvalue, 0, map->value_size);
> +
> +unlock:
> +	spin_unlock(&st_map->lock);
> +	return err;
> +}
> +
[...]
> +
> +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> +{
> +	const struct bpf_struct_ops *st_ops;
> +	size_t map_total_size, st_map_size;
> +	struct bpf_struct_ops_map *st_map;
> +	const struct btf_type *t, *vt;
> +	struct bpf_map_memory mem;
> +	struct bpf_map *map;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return ERR_PTR(-EPERM);
> +
> +	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
> +	if (!st_ops)
> +		return ERR_PTR(-ENOTSUPP);
> +
> +	vt = st_ops->value_type;
> +	if (attr->value_size != vt->size)
> +		return ERR_PTR(-EINVAL);
> +
> +	t = st_ops->type;
> +
> +	st_map_size = sizeof(*st_map) +
> +		/* kvalue stores the
> +		 * struct bpf_struct_ops_tcp_congestions_ops
> +		 */
> +		(vt->size - sizeof(struct bpf_struct_ops_value));
> +	map_total_size = st_map_size +
> +		/* uvalue */
> +		sizeof(vt->size) +
> +		/* struct bpf_progs **progs */
> +		 btf_type_vlen(t) * sizeof(struct bpf_prog *);
> +	err = bpf_map_charge_init(&mem, map_total_size);
> +	if (err < 0)
> +		return ERR_PTR(err);
> +
> +	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
> +	if (!st_map) {
> +		bpf_map_charge_finish(&mem);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +	st_map->st_ops = st_ops;
> +	map = &st_map->map;
> +
> +	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
> +	st_map->progs =
> +		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
> +				   NUMA_NO_NODE);
> +	/* Each trampoline costs < 64 bytes.  Ensure one page
> +	 * is enough for max number of func ptrs.
> +	 */
> +	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);

This may be true for x86 now, but it may not hold up for other future
architectures. Not sure whether we should get the value from arch
callbacks, or just bail out during map update if we ever grow to exceed
one page.
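
A cheap runtime bail-out in the map_update loop would also work, e.g.
something like this sketch (BPF_TRAMP_MAX_IMAGE_SIZE is a made-up,
hypothetical per-arch upper bound used only for illustration):

  /* before emitting the next trampoline: make sure the page can still
   * hold a worst-case trampoline for this architecture
   */
  if (image + BPF_TRAMP_MAX_IMAGE_SIZE > st_map->image + PAGE_SIZE) {
          err = -E2BIG;
          goto reset_unlock;
  }

  err = arch_prepare_bpf_trampoline(image, &st_ops->func_models[i], 0,
                                    &prog, 1, NULL, 0, NULL);
  ...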

> +	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
> +	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
> +		bpf_struct_ops_map_free(map);
> +		bpf_map_charge_finish(&mem);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	spin_lock_init(&st_map->lock);
> +	set_vm_flush_reset_perms(st_map->image);
> +	set_memory_x((long)st_map->image, 1);
> +	bpf_map_init_from_attr(map, attr);
> +	bpf_map_charge_move(&map->memory, &mem);
> +
> +	return map;
> +}
> +
> +const struct bpf_map_ops bpf_struct_ops_map_ops = {
> +	.map_alloc_check = bpf_struct_ops_map_alloc_check,
> +	.map_alloc = bpf_struct_ops_map_alloc,
> +	.map_free = bpf_struct_ops_map_free,
> +	.map_get_next_key = bpf_struct_ops_map_get_next_key,
> +	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
> +	.map_delete_elem = bpf_struct_ops_map_delete_elem,
> +	.map_update_elem = bpf_struct_ops_map_update_elem,
> +	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
> +};
[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access
  2019-12-21  6:26 ` [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
  2019-12-23  7:49   ` Yonghong Song
@ 2019-12-23 20:05   ` Andrii Nakryiko
  2019-12-23 21:21     ` Yonghong Song
  1 sibling, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 20:05 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch allows bitfield access as a scalar.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  kernel/bpf/btf.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 6e652643849b..da73b63acfc5 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3744,10 +3744,6 @@ int btf_struct_access(struct bpf_verifier_log *log,
>         }
>
>         for_each_member(i, t, member) {
> -               if (btf_member_bitfield_size(t, member))
> -                       /* bitfields are not supported yet */
> -                       continue;
> -
>                 /* offset of the field in bytes */
>                 moff = btf_member_bit_offset(t, member) / 8;
>                 if (off + size <= moff)
> @@ -3757,6 +3753,12 @@ int btf_struct_access(struct bpf_verifier_log *log,
>                 if (off < moff)
>                         continue;
>
> +               if (btf_member_bitfield_size(t, member)) {
> +                       if (off == moff && off + size <= t->size)
> +                               return SCALAR_VALUE;
> +                       continue;
> +               }

Shouldn't this check be done before (off < moff) above?

Imagine this situation:

struct {
  int :16;
  int x:8;
};

The compiler will generate a 4-byte load at offset 0, and then bit shifts
to extract the third byte. From the kernel's perspective, you'll see that
off=0 but moff=2, so the member will get skipped.

So there are two problems, I think:
1. if the member is a bitfield, handle it specially before the (off < moff) case.
2. off == moff is too precise; I think it should be `off <= moff`, but
also check that it covers the entire bitfield, e.g.:

  (off + size) * 8 >= btf_member_bit_offset(t, member) +
btf_member_bitfield_size(t, member)

Make sense or am I missing anything?
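
i.e., roughly like this (sketch of the suggested ordering and condition,
untested):

  for_each_member(i, t, member) {
          u32 bitoff = btf_member_bit_offset(t, member);
          u32 bitsz = btf_member_bitfield_size(t, member);

          /* offset of the field in bytes */
          moff = bitoff / 8;
          if (off + size <= moff)
                  /* won't find a matching member further on */
                  break;

          if (bitsz) {
                  /* handle bitfields before the (off < moff) skip and
                   * accept any load that fully covers the bitfield
                   */
                  if (off <= moff && (off + size) * 8 >= bitoff + bitsz)
                          return SCALAR_VALUE;
                  continue;
          }

          if (off < moff)
                  continue;

          /* ... rest as before ... */
  }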

> +
>                 /* type of the field */
>                 mtype = btf_type_by_id(btf_vmlinux, member->type);
>                 mname = __btf_name_by_offset(btf_vmlinux, member->name_off);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
@ 2019-12-23 20:18   ` Yonghong Song
  2019-12-23 23:20   ` Andrii Nakryiko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Yonghong Song @ 2019-12-23 20:18 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> This patch makes "struct tcp_congestion_ops" the first user
> of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
> in bpf.
> 
> The BPF implemented tcp_congestion_ops can be used like
> regular kernel tcp-cc through sysctl and setsockopt.  e.g.
> [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
> net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
> net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
> net.ipv4.tcp_congestion_control = bpf_cubic
> 
> There has been attempt to move the TCP CC to the user space
> (e.g. CCP in TCP).   The common arguments are faster turn around,
> get away from long-tail kernel versions in production...etc,
> which are legit points.
> 
> BPF has been the continuous effort to join both kernel and
> userspace upsides together (e.g. XDP to gain the performance
> advantage without bypassing the kernel).  The recent BPF
> advancements (in particular BTF-aware verifier, BPF trampoline,
> BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
> possible in BPF.  It allows a faster turnaround for testing algorithm
> in the production while leveraging the existing (and continue growing)
> BPF feature/framework instead of building one specifically for
> userspace TCP CC.
> 
> This patch allows write access to a few fields in tcp-sock
> (in bpf_tcp_ca_btf_struct_access()).
> 
> The optional "get_info" is unsupported now.  It can be added
> later.  One possible way is to output the info with a btf-id
> to describe the content.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Ack from bpf/btf perspective.

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-21  6:26 ` [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
  2019-12-23 19:33   ` Yonghong Song
@ 2019-12-23 20:29   ` Andrii Nakryiko
  2019-12-23 22:29     ` Martin Lau
  2019-12-24 11:46   ` kbuild test robot
  2 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 20:29 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch allows the kernel's struct ops (i.e. func ptr) to be
> implemented in BPF.  The first use case in this series is the
> "struct tcp_congestion_ops" which will be introduced in a
> later patch.
>
> This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
> The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
> func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
> of a kernel struct.  The attr->expected_attach_type is the member
> "index" of that kernel struct.  The first member of a struct starts
> with member index 0.  That will avoid ambiguity when a kernel struct
> has multiple func ptrs with the same func signature.
>
> For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
> to implement the "init" func ptr of the "struct tcp_congestion_ops".
> The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
> of the _running_ kernel.  The attr->expected_attach_type is 3.
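
As a concrete illustration of those two attrs, the load request for such a
prog would look roughly like this (sketch only; sys_bpf() is an assumed
syscall wrapper and the btf id / insns variables are placeholders):

  union bpf_attr attr = {};

  attr.prog_type = BPF_PROG_TYPE_STRUCT_OPS;
  /* btf id of "struct tcp_congestion_ops" in the running kernel */
  attr.attach_btf_id = tcp_congestion_ops_btf_id;
  /* member index of ".init" within that struct, per the example above */
  attr.expected_attach_type = 3;
  attr.insns = (__u64)(unsigned long)insns;
  attr.insn_cnt = insn_cnt;
  attr.license = (__u64)(unsigned long)"GPL";
  prog_fd = sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
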
>
> The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
> by arch_prepare_bpf_trampoline that will be done in the next
> patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
>
> "struct bpf_struct_ops" is introduced as a common interface for the kernel
> struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
> struct will need to implement an instance of the "struct bpf_struct_ops".
>
> The supporting kernel struct also needs to implement a bpf_verifier_ops.
> During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
> bpf_verifier_ops by searching the attr->attach_btf_id.
>
> A new "btf_struct_access" is also added to the bpf_verifier_ops such
> that the supporting kernel struct can optionally provide its own specific
> check on accessing the func arg (e.g. provide limited write access).
>
> After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
> to initialize some values (e.g. the btf id of the supporting kernel
> struct) and it can only be done once the btf_vmlinux is available.
>
> The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
> if the return type of the prog->aux->attach_func_proto is "void".
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/bpf.h               |  30 +++++++
>  include/linux/bpf_types.h         |   4 +
>  include/linux/btf.h               |  34 ++++++++
>  include/uapi/linux/bpf.h          |   1 +
>  kernel/bpf/Makefile               |   2 +-
>  kernel/bpf/bpf_struct_ops.c       | 122 +++++++++++++++++++++++++++
>  kernel/bpf/bpf_struct_ops_types.h |   4 +
>  kernel/bpf/btf.c                  |  88 ++++++++++++++------
>  kernel/bpf/syscall.c              |  17 ++--
>  kernel/bpf/verifier.c             | 134 +++++++++++++++++++++++-------
>  10 files changed, 372 insertions(+), 64 deletions(-)
>  create mode 100644 kernel/bpf/bpf_struct_ops.c
>  create mode 100644 kernel/bpf/bpf_struct_ops_types.h
>

All looks good, apart from the concern with partially-initialized
bpf_struct_ops.

[...]

> +const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> +};
> +
> +void bpf_struct_ops_init(struct btf *_btf_vmlinux)

this always gets passed vmlinux's btf, so why not call it something short
and sweet like "btf"? _btf_vmlinux is kind of ugly and verbose.

> +{
> +       const struct btf_member *member;
> +       struct bpf_struct_ops *st_ops;
> +       struct bpf_verifier_log log = {};
> +       const struct btf_type *t;
> +       const char *mname;
> +       s32 type_id;
> +       u32 i, j;
> +

[...]

> +static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
> +{
> +       const struct btf_type *t, *func_proto;
> +       const struct bpf_struct_ops *st_ops;
> +       const struct btf_member *member;
> +       struct bpf_prog *prog = env->prog;
> +       u32 btf_id, member_idx;
> +       const char *mname;
> +
> +       btf_id = prog->aux->attach_btf_id;
> +       st_ops = bpf_struct_ops_find(btf_id);

if struct_ops initialization fails, type will be NULL and type_id will
be 0, which we rely on here to not get partially-initialized
bpf_struct_ops, right? Small comment mentioning this would be helpful.


> +       if (!st_ops) {
> +               verbose(env, "attach_btf_id %u is not a supported struct\n",
> +                       btf_id);
> +               return -ENOTSUPP;
> +       }
> +

[...]

>  static int check_attach_btf_id(struct bpf_verifier_env *env)
>  {
>         struct bpf_prog *prog = env->prog;
> @@ -9520,6 +9591,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>         long addr;
>         u64 key;
>
> +       if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> +               return check_struct_ops_btf_id(env);
> +

There is a btf_id == 0 check below, you need to check that for
STRUCT_OPS as well, otherwise you can get partially-initialized
bpf_struct_ops struct in check_struct_ops_btf_id.

>         if (prog->type != BPF_PROG_TYPE_TRACING)
>                 return 0;
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access
  2019-12-23 20:05   ` Andrii Nakryiko
@ 2019-12-23 21:21     ` Yonghong Song
  0 siblings, 0 replies; 45+ messages in thread
From: Yonghong Song @ 2019-12-23 21:21 UTC (permalink / raw)
  To: Andrii Nakryiko, Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking



On 12/23/19 12:05 PM, Andrii Nakryiko wrote:
> On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> This patch allows bitfield access as a scalar.
>>
>> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> ---
>>   kernel/bpf/btf.c | 10 ++++++----
>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>> index 6e652643849b..da73b63acfc5 100644
>> --- a/kernel/bpf/btf.c
>> +++ b/kernel/bpf/btf.c
>> @@ -3744,10 +3744,6 @@ int btf_struct_access(struct bpf_verifier_log *log,
>>          }
>>
>>          for_each_member(i, t, member) {
>> -               if (btf_member_bitfield_size(t, member))
>> -                       /* bitfields are not supported yet */
>> -                       continue;
>> -
>>                  /* offset of the field in bytes */
>>                  moff = btf_member_bit_offset(t, member) / 8;
>>                  if (off + size <= moff)
>> @@ -3757,6 +3753,12 @@ int btf_struct_access(struct bpf_verifier_log *log,
>>                  if (off < moff)
>>                          continue;
>>
>> +               if (btf_member_bitfield_size(t, member)) {
>> +                       if (off == moff && off + size <= t->size)
>> +                               return SCALAR_VALUE;
>> +                       continue;
>> +               }
> 
> Shouldn't this check be done before (off < moff) above?
> 
> Imagine this situation:
> 
> struct {
>    int :16;
>    int x:8;
> };

Oh, yes, I forgot the case where the first bitfield member may have no
name, in which case `off` will not match any `moff`.

btf_struct_access() is used to check vmlinux btf types. I think in
vmlinux we may not have such scenarios, so the above code should
handle the vmlinux use cases properly.

But I agree with Andrii that we probably want to handle
anonymous bitfield members (which are ignored in debuginfo and BTF) properly.

> 
> The compiler will generate a 4-byte load at offset 0, and then bit shifts
> to extract the third byte. From the kernel's perspective, you'll see that
> off=0 but moff=2, so the member will get skipped.
> 
> So there are two problems, I think:
> 1. if the member is a bitfield, handle it specially before the (off < moff) case.
> 2. off == moff is too precise; I think it should be `off <= moff`, but
> also check that it covers the entire bitfield, e.g.:
> 
>    (off + size) * 8 >= btf_member_bit_offset(t, member) +
> btf_member_bitfield_size(t, member)
> 
> Make sense or am I missing anything?
> 
>> +
>>                  /* type of the field */
>>                  mtype = btf_type_by_id(btf_vmlinux, member->type);
>>                  mname = __btf_name_by_offset(btf_vmlinux, member->name_off);
>> --
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-23 19:57   ` Yonghong Song
@ 2019-12-23 21:44     ` Andrii Nakryiko
  2019-12-23 22:15       ` Martin Lau
  2019-12-27  6:16     ` Martin Lau
  1 sibling, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 21:44 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Martin Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, netdev

On Mon, Dec 23, 2019 at 11:58 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > is a kernel struct with its func ptr implemented in bpf prog.
> > This new map is the interface to register/unregister/introspect
> > a bpf implemented kernel struct.
> >
> > The kernel struct is actually embedded inside another new struct
> > (or called the "value" struct in the code).  For example,
> > "struct tcp_congestion_ops" is embbeded in:
> > struct bpf_struct_ops_tcp_congestion_ops {
> >       refcount_t refcnt;
> >       enum bpf_struct_ops_state state;
> >       struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > }
> > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > The "bpftool map dump" will then be able to show the
> > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > is created automatically by a macro.  Having a separate "value" struct
> > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > initialization works before registering the struct_ops to the kernel
> > subsystem).  The libbpf will take care of finding and populating the
> > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> >
> > Register a struct_ops to a kernel subsystem:
> > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> >     set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> >     running kernel.
> >     Instead of reusing the attr->btf_value_type_id,
> >     btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
> >     used as the "user" btf which could store other useful sysadmin/debug
> >     info that may be introduced in the future,
> >     e.g. creation-date/compiler-details/map-creator...etc.
> > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> >     in the running kernel btf.  Populate the value of this object.
> >     The function ptr should be populated with the prog fds.
> > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> >     the map value.  The key is always "0".
> >
> > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > (e.g. ensure all mandatory func ptrs are implemented).
> > If everything looks good, it will register this kernel struct
> > to the kernel subsystem.  The map will not allow further update
> > from this point.
> >
> > Unregister a struct_ops from the kernel subsystem:
> > BPF_MAP_DELETE with key "0".
> >
> > Introspect a struct_ops:
> > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > have the prog _id_ populated as the func ptr.
> >
> > The map value state (enum bpf_struct_ops_state) will transit from:
> > INIT (map created) =>
> > INUSE (map updated, i.e. reg) =>
> > TOBEFREE (map value deleted, i.e. unreg)
> >
> > The kernel subsystem needs to call bpf_struct_ops_get() and
> > bpf_struct_ops_put() to manage the "refcnt" in the
> > "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
> > for the purpose of tracking the subsystem usage.  Another approach
> > is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> > the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> > the map-fd/pinned-map usage.  However, that will also tie down the
> > future semantics of map->refcnt and map->usercnt.
> >
> > The very first subsystem's refcnt (during reg()) holds one
> > count to map->refcnt.  When the very last subsystem's refcnt
> > is gone, it will also release the map->refcnt.  All bpf_prog will be
> > freed when the map->refcnt reaches 0 (i.e. during map_free()).
> >
> > Here is how the bpftool map command will look like:
> > [root@arch-fb-vm1 bpf]# bpftool map show
> > 6: struct_ops  name dctcp  flags 0x0
> >       key 4B  value 256B  max_entries 1  memlock 4096B
> >       btf_id 6
> > [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> > [{
> >          "value": {
> >              "refcnt": {
> >                  "refs": {
> >                      "counter": 1
> >                  }
> >              },
> >              "state": 1,
>
> The bpftool dump with "state" 1 is a little bit cryptic.
> Since this is common for all struct_ops maps, can we
> make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE?

This can (and probably should) be done generically in bpftool for any
field of enum type. Not blocking this patch set, though.

>
> >              "data": {
> >                  "list": {
> >                      "next": 0,
> >                      "prev": 0
> >                  },
> >                  "key": 0,
> >                  "flags": 2,
> >                  "init": 24,
> >                  "release": 0,
> >                  "ssthresh": 25,
> >                  "cong_avoid": 30,
> >                  "set_state": 27,
> >                  "cwnd_event": 28,
> >                  "in_ack_event": 26,
> >                  "undo_cwnd": 29,
> >                  "pkts_acked": 0,
> >                  "min_tso_segs": 0,
> >                  "sndbuf_expand": 0,
> >                  "cong_control": 0,
> >                  "get_info": 0,
> >                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> >                  ],

Same here, bpftool should be smart enough to figure out that this is a
string, not just an array of bytes.
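
A simple heuristic would already cover this case, e.g. (illustrative
sketch only, not existing bpftool code; needs <ctype.h>):

  /* print a char array as a string if it is NUL-terminated within the
   * array and contains only printable characters before the NUL
   */
  static bool is_printable_cstr(const char *p, __u32 n)
  {
          __u32 i;

          for (i = 0; i < n && p[i]; i++)
                  if (!isprint((unsigned char)p[i]))
                          return false;
          return i < n;
  }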

> >                  "owner": 0
> >              }
> >          }
> >      }
> > ]
> >

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-23 21:44     ` Andrii Nakryiko
@ 2019-12-23 22:15       ` Martin Lau
  0 siblings, 0 replies; 45+ messages in thread
From: Martin Lau @ 2019-12-23 22:15 UTC (permalink / raw)
  To: Andrii Nakryiko, Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Mon, Dec 23, 2019 at 01:44:29PM -0800, Andrii Nakryiko wrote:
> On Mon, Dec 23, 2019 at 11:58 AM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 12/20/19 10:26 PM, Martin KaFai Lau wrote:
> > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > > is a kernel struct with its func ptr implemented in bpf prog.
> > > This new map is the interface to register/unregister/introspect
> > > a bpf implemented kernel struct.
> > >
> > > The kernel struct is actually embedded inside another new struct
> > > (or called the "value" struct in the code).  For example,
> > > "struct tcp_congestion_ops" is embedded in:
> > > struct bpf_struct_ops_tcp_congestion_ops {
> > >       refcount_t refcnt;
> > >       enum bpf_struct_ops_state state;
> > >       struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > > }
> > > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > > The "bpftool map dump" will then be able to show the
> > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > > is created automatically by a macro.  Having a separate "value" struct
> > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > > initialization works before registering the struct_ops to the kernel
> > > subsystem).  The libbpf will take care of finding and populating the
> > > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> > >
> > > Register a struct_ops to a kernel subsystem:
> > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> > >     set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> > >     running kernel.
> > >     Instead of reusing the attr->btf_value_type_id,
> > >     btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
> > >     used as the "user" btf which could store other useful sysadmin/debug
> > >     info that may be introduced in the future,
> > >     e.g. creation-date/compiler-details/map-creator...etc.
> > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> > >     in the running kernel btf.  Populate the value of this object.
> > >     The function ptr should be populated with the prog fds.
> > > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> > >     the map value.  The key is always "0".
> > >
> > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > > (e.g. ensure all mandatory func ptrs are implemented).
> > > If everything looks good, it will register this kernel struct
> > > to the kernel subsystem.  The map will not allow further update
> > > from this point.
> > >
> > > Unregister a struct_ops from the kernel subsystem:
> > > BPF_MAP_DELETE with key "0".
> > >
> > > Introspect a struct_ops:
> > > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > > have the prog _id_ populated as the func ptr.
> > >
> > > The map value state (enum bpf_struct_ops_state) will transit from:
> > > INIT (map created) =>
> > > INUSE (map updated, i.e. reg) =>
> > > TOBEFREE (map value deleted, i.e. unreg)
> > >
> > > The kernel subsystem needs to call bpf_struct_ops_get() and
> > > bpf_struct_ops_put() to manage the "refcnt" in the
> > > "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
> > > for the purpose of tracking the subsystem usage.  Another approach
> > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> > > the map-fd/pinned-map usage.  However, that will also tie down the
> > > future semantics of map->refcnt and map->usercnt.
> > >
> > > The very first subsystem's refcnt (during reg()) holds one
> > > count to map->refcnt.  When the very last subsystem's refcnt
> > > is gone, it will also release the map->refcnt.  All bpf_prog will be
> > > freed when the map->refcnt reaches 0 (i.e. during map_free()).
> > >
> > > Here is how the bpftool map command will look like:
> > > [root@arch-fb-vm1 bpf]# bpftool map show
> > > 6: struct_ops  name dctcp  flags 0x0
> > >       key 4B  value 256B  max_entries 1  memlock 4096B
> > >       btf_id 6
> > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> > > [{
> > >          "value": {
> > >              "refcnt": {
> > >                  "refs": {
> > >                      "counter": 1
> > >                  }
> > >              },
> > >              "state": 1,
> >
> > The bpftool dump with "state" 1 is a little bit cryptic.
> > Since this is common for all struct_ops maps, can we
> > make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE?
> 
> This can (and probably should) be done generically in bpftool for any
> field of enum type. Not blocking this patch set, though.
> 
> >
> > >              "data": {
> > >                  "list": {
> > >                      "next": 0,
> > >                      "prev": 0
> > >                  },
> > >                  "key": 0,
> > >                  "flags": 2,
> > >                  "init": 24,
> > >                  "release": 0,
> > >                  "ssthresh": 25,
> > >                  "cong_avoid": 30,
> > >                  "set_state": 27,
> > >                  "cwnd_event": 28,
> > >                  "in_ack_event": 26,
> > >                  "undo_cwnd": 29,
> > >                  "pkts_acked": 0,
> > >                  "min_tso_segs": 0,
> > >                  "sndbuf_expand": 0,
> > >                  "cong_control": 0,
> > >                  "get_info": 0,
> > >                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> > >                  ],
> 
> Same here, bpftool should be smart enough to figure out that this is a
> string, not just an array of bytes.

Agree on both points above; bpftool could print these better.
Those are generic improvements to bpftool and not specific
to a particular map type.

> 
> > >                  "owner": 0
> > >              }
> > >          }
> > >      }
> > > ]
> > >
> 
> [...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-23 20:29   ` Andrii Nakryiko
@ 2019-12-23 22:29     ` Martin Lau
  2019-12-23 22:55       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Martin Lau @ 2019-12-23 22:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 12:29:37PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > This patch allows the kernel's struct ops (i.e. func ptr) to be
> > implemented in BPF.  The first use case in this series is the
> > "struct tcp_congestion_ops" which will be introduced in a
> > later patch.
> >
> > This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
> > The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
> > func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
> > of a kernel struct.  The attr->expected_attach_type is the member
> > "index" of that kernel struct.  The first member of a struct starts
> > with member index 0.  That will avoid ambiguity when a kernel struct
> > has multiple func ptrs with the same func signature.
> >
> > For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
> > to implement the "init" func ptr of the "struct tcp_congestion_ops".
> > The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
> > of the _running_ kernel.  The attr->expected_attach_type is 3.
> >
> > The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
> > by arch_prepare_bpf_trampoline that will be done in the next
> > patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
> >
> > "struct bpf_struct_ops" is introduced as a common interface for the kernel
> > struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
> > struct will need to implement an instance of the "struct bpf_struct_ops".
> >
> > The supporting kernel struct also needs to implement a bpf_verifier_ops.
> > During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
> > bpf_verifier_ops by searching the attr->attach_btf_id.
> >
> > A new "btf_struct_access" is also added to the bpf_verifier_ops such
> > that the supporting kernel struct can optionally provide its own specific
> > check on accessing the func arg (e.g. provide limited write access).
> >
> > After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
> > to initialize some values (e.g. the btf id of the supporting kernel
> > struct) and it can only be done once the btf_vmlinux is available.
> >
> > The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
> > if the return type of the prog->aux->attach_func_proto is "void".
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/linux/bpf.h               |  30 +++++++
> >  include/linux/bpf_types.h         |   4 +
> >  include/linux/btf.h               |  34 ++++++++
> >  include/uapi/linux/bpf.h          |   1 +
> >  kernel/bpf/Makefile               |   2 +-
> >  kernel/bpf/bpf_struct_ops.c       | 122 +++++++++++++++++++++++++++
> >  kernel/bpf/bpf_struct_ops_types.h |   4 +
> >  kernel/bpf/btf.c                  |  88 ++++++++++++++------
> >  kernel/bpf/syscall.c              |  17 ++--
> >  kernel/bpf/verifier.c             | 134 +++++++++++++++++++++++-------
> >  10 files changed, 372 insertions(+), 64 deletions(-)
> >  create mode 100644 kernel/bpf/bpf_struct_ops.c
> >  create mode 100644 kernel/bpf/bpf_struct_ops_types.h
> >
> 
> All looks good, apart from the concern with partially-initialized
> bpf_struct_ops.
> 
> [...]
> 
> > +const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> > +};
> > +
> > +void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> 
> > this always gets passed vmlinux's btf, so why not call it something short
> > and sweet like "btf"? _btf_vmlinux is kind of ugly and verbose.
> 
> > +{
> > +       const struct btf_member *member;
> > +       struct bpf_struct_ops *st_ops;
> > +       struct bpf_verifier_log log = {};
> > +       const struct btf_type *t;
> > +       const char *mname;
> > +       s32 type_id;
> > +       u32 i, j;
> > +
> 
> [...]
> 
> > +static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
> > +{
> > +       const struct btf_type *t, *func_proto;
> > +       const struct bpf_struct_ops *st_ops;
> > +       const struct btf_member *member;
> > +       struct bpf_prog *prog = env->prog;
> > +       u32 btf_id, member_idx;
> > +       const char *mname;
> > +
> > +       btf_id = prog->aux->attach_btf_id;
> > +       st_ops = bpf_struct_ops_find(btf_id);
> 
> if struct_ops initialization fails, type will be NULL and type_id will
> be 0, which we rely on here to not get partially-initialized
> bpf_struct_ops, right? Small comment mentioning this would be helpful.
> 
> 
> > +       if (!st_ops) {
> > +               verbose(env, "attach_btf_id %u is not a supported struct\n",
> > +                       btf_id);
> > +               return -ENOTSUPP;
> > +       }
> > +
> 
> [...]
> 
> >  static int check_attach_btf_id(struct bpf_verifier_env *env)
> >  {
> >         struct bpf_prog *prog = env->prog;
> > @@ -9520,6 +9591,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> >         long addr;
> >         u64 key;
> >
> > +       if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> > +               return check_struct_ops_btf_id(env);
> > +
> 
> There is a btf_id == 0 check below, you need to check that for
> STRUCT_OPS as well, otherwise you can get partially-initialized
> bpf_struct_ops struct in check_struct_ops_btf_id.
This btf_id == 0 check is done at the beginning of bpf_struct_ops_find().
Hence, bpf_struct_ops_find() won't try to search if btf_id is 0.

The st_ops fields are only set once everything has passed, so an individual
st_ops will not be partially initialized.
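
For reference, the lookup boils down to this (paraphrased sketch; see the
patch for the exact body):

  const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
  {
          unsigned int i;

          if (!type_id)
                  return NULL;    /* covers the btf_id == 0 case */

          for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
                  if (bpf_struct_ops[i]->type_id == type_id)
                          return bpf_struct_ops[i];
          }

          return NULL;
  }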


> 
> >         if (prog->type != BPF_PROG_TYPE_TRACING)
> >                 return 0;
> >
> > --
> > 2.17.1
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-23 22:29     ` Martin Lau
@ 2019-12-23 22:55       ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 22:55 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 2:30 PM Martin Lau <kafai@fb.com> wrote:
>
> On Mon, Dec 23, 2019 at 12:29:37PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > This patch allows the kernel's struct ops (i.e. func ptr) to be
> > > implemented in BPF.  The first use case in this series is the
> > > "struct tcp_congestion_ops" which will be introduced in a
> > > later patch.
> > >
> > > This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
> > > The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
> > > func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
> > > of a kernel struct.  The attr->expected_attach_type is the member
> > > "index" of that kernel struct.  The first member of a struct starts
> > > with member index 0.  That will avoid ambiguity when a kernel struct
> > > has multiple func ptrs with the same func signature.
> > >
> > > For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
> > > to implement the "init" func ptr of the "struct tcp_congestion_ops".
> > > The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
> > > of the _running_ kernel.  The attr->expected_attach_type is 3.
> > >
> > > The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
> > > by arch_prepare_bpf_trampoline that will be done in the next
> > > patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
> > >
> > > "struct bpf_struct_ops" is introduced as a common interface for the kernel
> > > struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
> > > struct will need to implement an instance of the "struct bpf_struct_ops".
> > >
> > > The supporting kernel struct also needs to implement a bpf_verifier_ops.
> > > During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
> > > bpf_verifier_ops by searching the attr->attach_btf_id.
> > >
> > > A new "btf_struct_access" is also added to the bpf_verifier_ops such
> > > that the supporting kernel struct can optionally provide its own specific
> > > check on accessing the func arg (e.g. provide limited write access).
> > >
> > > After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
> > > to initialize some values (e.g. the btf id of the supporting kernel
> > > struct) and it can only be done once the btf_vmlinux is available.
> > >
> > > The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
> > > if the return type of the prog->aux->attach_func_proto is "void".
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > >  include/linux/bpf.h               |  30 +++++++
> > >  include/linux/bpf_types.h         |   4 +
> > >  include/linux/btf.h               |  34 ++++++++
> > >  include/uapi/linux/bpf.h          |   1 +
> > >  kernel/bpf/Makefile               |   2 +-
> > >  kernel/bpf/bpf_struct_ops.c       | 122 +++++++++++++++++++++++++++
> > >  kernel/bpf/bpf_struct_ops_types.h |   4 +
> > >  kernel/bpf/btf.c                  |  88 ++++++++++++++------
> > >  kernel/bpf/syscall.c              |  17 ++--
> > >  kernel/bpf/verifier.c             | 134 +++++++++++++++++++++++-------
> > >  10 files changed, 372 insertions(+), 64 deletions(-)
> > >  create mode 100644 kernel/bpf/bpf_struct_ops.c
> > >  create mode 100644 kernel/bpf/bpf_struct_ops_types.h
> > >
> >
> > All looks good, apart from the concern with partially-initialized
> > bpf_struct_ops.
> >
> > [...]
> >
> > > +const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> > > +};
> > > +
> > > +void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> >
> > this always gets passed vmlinux's btf, so why not call it something short
> > and sweet like "btf"? _btf_vmlinux is kind of ugly and verbose.
> >
> > > +{
> > > +       const struct btf_member *member;
> > > +       struct bpf_struct_ops *st_ops;
> > > +       struct bpf_verifier_log log = {};
> > > +       const struct btf_type *t;
> > > +       const char *mname;
> > > +       s32 type_id;
> > > +       u32 i, j;
> > > +
> >
> > [...]
> >
> > > +static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
> > > +{
> > > +       const struct btf_type *t, *func_proto;
> > > +       const struct bpf_struct_ops *st_ops;
> > > +       const struct btf_member *member;
> > > +       struct bpf_prog *prog = env->prog;
> > > +       u32 btf_id, member_idx;
> > > +       const char *mname;
> > > +
> > > +       btf_id = prog->aux->attach_btf_id;
> > > +       st_ops = bpf_struct_ops_find(btf_id);
> >
> > if struct_ops initialization fails, type will be NULL and type_id will
> > be 0, which we rely on here to not get partially-initialized
> > bpf_struct_ops, right? Small comment mentioning this would be helpful.
> >
> >
> > > +       if (!st_ops) {
> > > +               verbose(env, "attach_btf_id %u is not a supported struct\n",
> > > +                       btf_id);
> > > +               return -ENOTSUPP;
> > > +       }
> > > +
> >
> > [...]
> >
> > >  static int check_attach_btf_id(struct bpf_verifier_env *env)
> > >  {
> > >         struct bpf_prog *prog = env->prog;
> > > @@ -9520,6 +9591,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> > >         long addr;
> > >         u64 key;
> > >
> > > +       if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> > > +               return check_struct_ops_btf_id(env);
> > > +
> >
> > There is a btf_id == 0 check below, you need to check that for
> > STRUCT_OPS as well, otherwise you can get partially-initialized
> > bpf_struct_ops struct in check_struct_ops_btf_id.
> This btf_id == 0 check is done at the beginning of bpf_struct_ops_find().
> Hence, bpf_struct_ops_find() won't try to search if btf_id is 0.
>

Ah right, I missed that check. Then yeah, it's not a concern. I still
don't like the _btf_vmlinux name, but that's just a nit.

Acked-by: Andrii Nakryiko <andriin@fb.com>

> The st_ops fields are only set once everything has passed, so an individual
> st_ops will not be partially initialized.
>
>
> >
> > >         if (prog->type != BPF_PROG_TYPE_TRACING)
> > >                 return 0;
> > >
> > > --
> > > 2.17.1
> > >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-21  6:26 ` [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
  2019-12-23 19:57   ` Yonghong Song
@ 2019-12-23 23:05   ` Andrii Nakryiko
  2019-12-28  1:47     ` Martin Lau
  2019-12-24 12:28   ` kbuild test robot
  2 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 23:05 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> is a kernel struct with its func ptr implemented in bpf prog.
> This new map is the interface to register/unregister/introspect
> a bpf implemented kernel struct.
>
> The kernel struct is actually embedded inside another new struct
> (or called the "value" struct in the code).  For example,
> "struct tcp_congestion_ops" is embbeded in:
> struct bpf_struct_ops_tcp_congestion_ops {
>         refcount_t refcnt;
>         enum bpf_struct_ops_state state;
>         struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> }
> The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> The "bpftool map dump" will then be able to show the
> state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> is created automatically by a macro.  Having a separate "value" struct
> will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> initialization works before registering the struct_ops to the kernel
> subsystem).  The libbpf will take care of finding and populating the
> "struct bpf_struct_ops_XYZ" from "struct XYZ".
>
> Register a struct_ops to a kernel subsystem:
> 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
>    set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
>    running kernel.
>    Instead of reusing the attr->btf_value_type_id,
>    btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
>    used as the "user" btf which could store other useful sysadmin/debug
>    info that may be introduced in the future,
>    e.g. creation-date/compiler-details/map-creator...etc.
> 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
>    in the running kernel btf.  Populate the value of this object.
>    The function ptr should be populated with the prog fds.
> 4. Call BPF_MAP_UPDATE with the object created in (3) as
>    the map value.  The key is always "0".
>
> During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> the specific struct_ops to do some final checks in "st_ops->init_member()"
> (e.g. ensure all mandatory func ptrs are implemented).
> If everything looks good, it will register this kernel struct
> to the kernel subsystem.  The map will not allow further update
> from this point.
>
> Unregister a struct_ops from the kernel subsystem:
> BPF_MAP_DELETE with key "0".
>
> Introspect a struct_ops:
> BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> have the prog _id_ populated as the func ptr.
>
> The map value state (enum bpf_struct_ops_state) will transit from:
> INIT (map created) =>
> INUSE (map updated, i.e. reg) =>
> TOBEFREE (map value deleted, i.e. unreg)
>
> The kernel subsystem needs to call bpf_struct_ops_get() and
> bpf_struct_ops_put() to manage the "refcnt" in the
> "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
> for the purpose of tracking the subsystem usage.  Another approach
> is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> the map-fd/pinned-map usage.  However, that will also tie down the
> future semantics of map->refcnt and map->usercnt.
>
> The very first subsystem's refcnt (during reg()) holds one
> count to map->refcnt.  When the very last subsystem's refcnt
> is gone, it will also release the map->refcnt.  All bpf_prog will be
> freed when the map->refcnt reaches 0 (i.e. during map_free()).
>
> Here is how the bpftool map command will look like:
> [root@arch-fb-vm1 bpf]# bpftool map show
> 6: struct_ops  name dctcp  flags 0x0
>         key 4B  value 256B  max_entries 1  memlock 4096B
>         btf_id 6
> [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> [{
>         "value": {
>             "refcnt": {
>                 "refs": {
>                     "counter": 1
>                 }
>             },
>             "state": 1,
>             "data": {
>                 "list": {
>                     "next": 0,
>                     "prev": 0
>                 },
>                 "key": 0,
>                 "flags": 2,
>                 "init": 24,
>                 "release": 0,
>                 "ssthresh": 25,
>                 "cong_avoid": 30,
>                 "set_state": 27,
>                 "cwnd_event": 28,
>                 "in_ack_event": 26,
>                 "undo_cwnd": 29,
>                 "pkts_acked": 0,
>                 "min_tso_segs": 0,
>                 "sndbuf_expand": 0,
>                 "cong_control": 0,
>                 "get_info": 0,
>                 "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
>                 ],
>                 "owner": 0
>             }
>         }
>     }
> ]
>
> Misc Notes:
> * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
>   It does an in-place update on "*value" instead of returning a pointer
>   back to syscall.c.  Otherwise, it would need a separate copy of the "zero" value
>   for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
>
> * The bpf_struct_ops_map_delete_elem() is also called without
>   preempt_disable() from map_delete_elem().  It is because
>   the "->unreg()" may require a sleepable context, e.g.
>   the "tcp_unregister_congestion_control()".
>
> * "const" is added to some of the existing "struct btf_func_model *"
>   function args to avoid a compiler warning caused by this patch.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

LGTM! A few questions below to improve my understanding.

Acked-by: Andrii Nakryiko <andriin@fb.com>

>  arch/x86/net/bpf_jit_comp.c |  11 +-
>  include/linux/bpf.h         |  49 +++-
>  include/linux/bpf_types.h   |   3 +
>  include/linux/btf.h         |  13 +
>  include/uapi/linux/bpf.h    |   7 +-
>  kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
>  kernel/bpf/btf.c            |  20 +-
>  kernel/bpf/map_in_map.c     |   3 +-
>  kernel/bpf/syscall.c        |  49 ++--
>  kernel/bpf/trampoline.c     |   5 +-
>  kernel/bpf/verifier.c       |   5 +
>  11 files changed, 593 insertions(+), 40 deletions(-)
>

[...]

> +               /* All non func ptr member must be 0 */
> +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> +                                              NULL)) {
> +                       u32 msize;
> +
> +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> +                                                &msize, NULL, NULL);
> +                       if (IS_ERR(mtype)) {
> +                               err = PTR_ERR(mtype);
> +                               goto reset_unlock;
> +                       }
> +
> +                       if (memchr_inv(udata + moff, 0, msize)) {


just double-checking: we are ok with having non-zeroed padding in a
struct, is that right?

> +                               err = -EINVAL;
> +                               goto reset_unlock;
> +                       }
> +
> +                       continue;
> +               }
> +

[...]

> +
> +               err = arch_prepare_bpf_trampoline(image,
> +                                                 &st_ops->func_models[i], 0,
> +                                                 &prog, 1, NULL, 0, NULL);
> +               if (err < 0)
> +                       goto reset_unlock;
> +
> +               *(void **)(kdata + moff) = image;
> +               image += err;

are there any alignment requirements on image pointer for trampoline?

> +
> +               /* put prog_id to udata */
> +               *(unsigned long *)(udata + moff) = prog->aux->id;
> +       }
> +

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
  2019-12-23 20:18   ` Yonghong Song
@ 2019-12-23 23:20   ` Andrii Nakryiko
  2019-12-24  7:16   ` kbuild test robot
  2019-12-24 13:06   ` kbuild test robot
  3 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 23:20 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch makes "struct tcp_congestion_ops" the first user
> of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
> in bpf.
>
> The BPF implemented tcp_congestion_ops can be used like
> regular kernel tcp-cc through sysctl and setsockopt.  e.g.
> [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
> net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
> net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
> net.ipv4.tcp_congestion_control = bpf_cubic
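
A minimal sketch of the setsockopt path, assuming a bpf tcp-cc registered
under the name "bpf_cubic":

#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* opt one socket into the bpf tcp-cc, exactly as with a built-in module */
static int use_bpf_cubic(int fd)
{
        const char name[] = "bpf_cubic";

        return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
}
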
>
> There has been attempt to move the TCP CC to the user space
> (e.g. CCP in TCP).   The common arguments are faster turn around,
> get away from long-tail kernel versions in production...etc,
> which are legit points.
>
> BPF has been the continuous effort to join both kernel and
> userspace upsides together (e.g. XDP to gain the performance
> advantage without bypassing the kernel).  The recent BPF
> advancements (in particular BTF-aware verifier, BPF trampoline,
> BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
> possible in BPF.  It allows a faster turnaround for testing algorithm
> in the production while leveraging the existing (and continue growing)
> BPF feature/framework instead of building one specifically for
> userspace TCP CC.
>
> This patch allows write access to a few fields in tcp-sock
> (in bpf_tcp_ca_btf_struct_access()).
>
> The optional "get_info" is unsupported now.  It can be added
> later.  One possible way is to output the info with a btf-id
> to describe the content.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/filter.h            |   2 +
>  include/net/tcp.h                 |   1 +
>  kernel/bpf/bpf_struct_ops_types.h |   7 +-
>  net/core/filter.c                 |   2 +-
>  net/ipv4/Makefile                 |   4 +
>  net/ipv4/bpf_tcp_ca.c             | 226 ++++++++++++++++++++++++++++++
>  net/ipv4/tcp_cong.c               |  14 +-
>  net/ipv4/tcp_ipv4.c               |   6 +-
>  net/ipv4/tcp_minisocks.c          |   4 +-
>  net/ipv4/tcp_output.c             |   4 +-
>  10 files changed, 255 insertions(+), 15 deletions(-)
>  create mode 100644 net/ipv4/bpf_tcp_ca.c
>

Naming nits below. Other than that:

Acked-by: Andrii Nakryiko <andriin@fb.com>

[...]

> +static const struct btf_type *tcp_sock_type;
> +static u32 tcp_sock_id, sock_id;
> +
> +static int bpf_tcp_ca_init(struct btf *_btf_vmlinux)
> +{

there is no reason to pass anything but vmlinux's BTF to this
function, so I think just having "btf" as a name is OK.

> +       s32 type_id;
> +
> +       type_id = btf_find_by_name_kind(_btf_vmlinux, "sock", BTF_KIND_STRUCT);
> +       if (type_id < 0)
> +               return -EINVAL;
> +       sock_id = type_id;
> +
> +       type_id = btf_find_by_name_kind(_btf_vmlinux, "tcp_sock",
> +                                       BTF_KIND_STRUCT);
> +       if (type_id < 0)
> +               return -EINVAL;
> +       tcp_sock_id = type_id;
> +       tcp_sock_type = btf_type_by_id(_btf_vmlinux, tcp_sock_id);
> +
> +       return 0;
> +}
> +
> +static bool check_optional(u32 member_offset)

check_xxx is quite ambiguous, in general: is it a check that it is
optional or that it's not optional? How about using
is_optional/is_unsupported to make this clear?


> +{
> +       unsigned int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(optional_ops); i++) {
> +               if (member_offset == optional_ops[i])
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-21  6:26 ` [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example Martin KaFai Lau
@ 2019-12-23 23:26   ` Andrii Nakryiko
  2019-12-24  1:31     ` Martin Lau
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-23 23:26 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch adds a bpf_dctcp example.  It currently does not do
> no-ECN fallback but the same could be done through the cgrp2-bpf.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
>  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
>  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
>  3 files changed, 656 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
>
> diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> new file mode 100644
> index 000000000000..7ba8c1b4157a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> @@ -0,0 +1,228 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __BPF_TCP_HELPERS_H
> +#define __BPF_TCP_HELPERS_H
> +
> +#include <stdbool.h>
> +#include <linux/types.h>
> +#include <bpf_helpers.h>
> +#include <bpf_core_read.h>
> +#include "bpf_trace_helpers.h"
> +
> +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)

Should we try to put those BPF programs into some section that would
indicate they are used with struct_ops? libbpf doesn't use or enforce
that (even though it could to derive and enforce that they are
STRUCT_OPS programs). So something like
SEC("struct_ops/<ideally-operation-name-here>"). I think having this
convention is very useful for consistency and to do a quick ELF dump
and see what is where. WDYT?
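
For illustration only (nothing in this series enforces it), the proposed
convention could look like this in the dctcp example; the program name and
body are placeholders, and the arg unpacking normally done by the
BPF_TCP_OPS_x macros is elided:

/* hypothetical section name under the proposed convention */
SEC("struct_ops/dctcp_init")
void bpf_dctcp_init(struct sock *sk)
{
        /* ... per-connection init; in practice the ctx is an array of u64
         * that the BPF_TCP_OPS_x/BPF_TRACE_x macros unpack into "sk" ...
         */
}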

> +
> +struct sock_common {
> +       unsigned char   skc_state;
> +} __attribute__((preserve_access_index));
> +
> +struct sock {
> +       struct sock_common      __sk_common;
> +} __attribute__((preserve_access_index));
> +
> +struct inet_sock {
> +       struct sock             sk;
> +} __attribute__((preserve_access_index));
> +
> +struct inet_connection_sock {
> +       struct inet_sock          icsk_inet;
> +       __u8                      icsk_ca_state:6,
> +                                 icsk_ca_setsockopt:1,
> +                                 icsk_ca_dst_locked:1;
> +       struct {
> +               __u8              pending;
> +       } icsk_ack;
> +       __u64                     icsk_ca_priv[104 / sizeof(__u64)];
> +} __attribute__((preserve_access_index));
> +
> +struct tcp_sock {
> +       struct inet_connection_sock     inet_conn;
> +
> +       __u32   rcv_nxt;
> +       __u32   snd_nxt;
> +       __u32   snd_una;
> +       __u8    ecn_flags;
> +       __u32   delivered;
> +       __u32   delivered_ce;
> +       __u32   snd_cwnd;
> +       __u32   snd_cwnd_cnt;
> +       __u32   snd_cwnd_clamp;
> +       __u32   snd_ssthresh;
> +       __u8    syn_data:1,     /* SYN includes data */
> +               syn_fastopen:1, /* SYN includes Fast Open option */
> +               syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
> +               syn_fastopen_ch:1, /* Active TFO re-enabling probe */
> +               syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
> +               save_syn:1,     /* Save headers of SYN packet */
> +               is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
> +               syn_smc:1;      /* SYN includes SMC */
> +       __u32   max_packets_out;
> +       __u32   lsndtime;
> +       __u32   prior_cwnd;
> +} __attribute__((preserve_access_index));
> +
> +static __always_inline struct inet_connection_sock *inet_csk(const struct sock *sk)
> +{
> +       return (struct inet_connection_sock *)sk;
> +}
> +
> +static __always_inline void *inet_csk_ca(const struct sock *sk)
> +{
> +       return (void *)inet_csk(sk)->icsk_ca_priv;
> +}
> +
> +static __always_inline struct tcp_sock *tcp_sk(const struct sock *sk)
> +{
> +       return (struct tcp_sock *)sk;
> +}
> +
> +static __always_inline bool before(__u32 seq1, __u32 seq2)
> +{
> +       return (__s32)(seq1-seq2) < 0;
> +}
> +#define after(seq2, seq1)      before(seq1, seq2)
> +
> +#define        TCP_ECN_OK              1
> +#define        TCP_ECN_QUEUE_CWR       2
> +#define        TCP_ECN_DEMAND_CWR      4
> +#define        TCP_ECN_SEEN            8
> +
> +enum inet_csk_ack_state_t {
> +       ICSK_ACK_SCHED  = 1,
> +       ICSK_ACK_TIMER  = 2,
> +       ICSK_ACK_PUSHED = 4,
> +       ICSK_ACK_PUSHED2 = 8,
> +       ICSK_ACK_NOW = 16       /* Send the next ACK immediately (once) */
> +};
> +
> +enum tcp_ca_event {
> +       CA_EVENT_TX_START = 0,
> +       CA_EVENT_CWND_RESTART = 1,
> +       CA_EVENT_COMPLETE_CWR = 2,
> +       CA_EVENT_LOSS = 3,
> +       CA_EVENT_ECN_NO_CE = 4,
> +       CA_EVENT_ECN_IS_CE = 5,
> +};
> +
> +enum tcp_ca_state {
> +       TCP_CA_Open = 0,
> +       TCP_CA_Disorder = 1,
> +       TCP_CA_CWR = 2,
> +       TCP_CA_Recovery = 3,
> +       TCP_CA_Loss = 4
> +};
> +
> +struct ack_sample {
> +       __u32 pkts_acked;
> +       __s32 rtt_us;
> +       __u32 in_flight;
> +} __attribute__((preserve_access_index));
> +
> +struct rate_sample {
> +       __u64  prior_mstamp; /* starting timestamp for interval */
> +       __u32  prior_delivered; /* tp->delivered at "prior_mstamp" */
> +       __s32  delivered;               /* number of packets delivered over interval */
> +       long interval_us;       /* time for tp->delivered to incr "delivered" */
> +       __u32 snd_interval_us;  /* snd interval for delivered packets */
> +       __u32 rcv_interval_us;  /* rcv interval for delivered packets */
> +       long rtt_us;            /* RTT of last (S)ACKed packet (or -1) */
> +       int  losses;            /* number of packets marked lost upon ACK */
> +       __u32  acked_sacked;    /* number of packets newly (S)ACKed upon ACK */
> +       __u32  prior_in_flight; /* in flight before this ACK */
> +       bool is_app_limited;    /* is sample from packet with bubble in pipe? */
> +       bool is_retrans;        /* is sample from retransmission? */
> +       bool is_ack_delayed;    /* is this (likely) a delayed ACK? */
> +} __attribute__((preserve_access_index));
> +
> +#define TCP_CA_NAME_MAX                16
> +#define TCP_CONG_NEEDS_ECN     0x2
> +
> +struct tcp_congestion_ops {
> +       __u32 flags;
> +
> +       /* initialize private data (optional) */
> +       void (*init)(struct sock *sk);
> +       /* cleanup private data  (optional) */
> +       void (*release)(struct sock *sk);
> +
> +       /* return slow start threshold (required) */
> +       __u32 (*ssthresh)(struct sock *sk);
> +       /* do new cwnd calculation (required) */
> +       void (*cong_avoid)(struct sock *sk, __u32 ack, __u32 acked);
> +       /* call before changing ca_state (optional) */
> +       void (*set_state)(struct sock *sk, __u8 new_state);
> +       /* call when cwnd event occurs (optional) */
> +       void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
> +       /* call when ack arrives (optional) */
> +       void (*in_ack_event)(struct sock *sk, __u32 flags);
> +       /* new value of cwnd after loss (required) */
> +       __u32  (*undo_cwnd)(struct sock *sk);
> +       /* hook for packet ack accounting (optional) */
> +       void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
> +       /* override sysctl_tcp_min_tso_segs */
> +       __u32 (*min_tso_segs)(struct sock *sk);
> +       /* returns the multiplier used in tcp_sndbuf_expand (optional) */
> +       __u32 (*sndbuf_expand)(struct sock *sk);
> +       /* call when packets are delivered to update cwnd and pacing rate,
> +        * after all the ca_state processing. (optional)
> +        */
> +       void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
> +
> +       char            name[TCP_CA_NAME_MAX];
> +};

Can all of these types come from vmlinux.h instead of being duplicated here?
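
For reference, a hedged sketch of the vmlinux.h route, assuming a header
generated with "bpftool btf dump file /sys/kernel/btf/vmlinux format c":

/* replaces the hand-copied structs above; the generated header is
 * expected to carry a preserve_access_index pragma, so kernel types
 * still get CO-RE relocations
 */
#include "vmlinux.h"
#include <bpf_helpers.h>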

> +
> +#define min(a, b) ((a) < (b) ? (a) : (b))
> +#define max(a, b) ((a) > (b) ? (a) : (b))
> +#define min_not_zero(x, y) ({                  \
> +       typeof(x) __x = (x);                    \
> +       typeof(y) __y = (y);                    \
> +       __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
> +

[...]

> +static struct bpf_object *load(const char *filename, const char *map_name,
> +                              struct bpf_link **link)
> +{
> +       struct bpf_object *obj;
> +       struct bpf_map *map;
> +       struct bpf_link *l;
> +       int err;
> +
> +       obj = bpf_object__open(filename);
> +       if (CHECK(IS_ERR(obj), "bpf_obj__open_file", "obj:%ld\n",
> +                 PTR_ERR(obj)))
> +               return obj;
> +
> +       err = bpf_object__load(obj);
> +       if (CHECK(err, "bpf_object__load", "err:%d\n", err)) {
> +               bpf_object__close(obj);
> +               return ERR_PTR(err);
> +       }
> +
> +       map = bpf_object__find_map_by_name(obj, map_name);
> +       if (CHECK(!map, "bpf_object__find_map_by_name", "%s not found\n",
> +                   map_name)) {
> +               bpf_object__close(obj);
> +               return ERR_PTR(-ENOENT);
> +       }
> +

use skeleton instead?
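
A hedged sketch of what a skeleton-based version might look like, assuming
a bpf_dctcp.skel.h generated by "bpftool gen skeleton" (the generated
symbol names are illustrative):

#include "bpf_dctcp.skel.h"

static void test_dctcp_skel(void)
{
        struct bpf_dctcp *skel;
        struct bpf_link *link;

        skel = bpf_dctcp__open_and_load();
        if (CHECK(!skel, "bpf_dctcp__open_and_load", "failed\n"))
                return;

        /* attach the ".struct_ops" map named "dctcp" */
        link = bpf_map__attach_struct_ops(skel->maps.dctcp);
        if (CHECK(IS_ERR(link), "bpf_map__attach_struct_ops", "err:%ld\n",
                  PTR_ERR(link)))
                goto out;

        do_test("bpf_dctcp");

        bpf_link__destroy(link);
out:
        bpf_dctcp__destroy(skel);
}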

> +       l = bpf_map__attach_struct_ops(map);
> +       if (CHECK(IS_ERR(l), "bpf_struct_ops_map__attach", "err:%ld\n",
> +                 PTR_ERR(l))) {
> +               bpf_object__close(obj);
> +               return (void *)l;
> +       }
> +
> +       *link = l;
> +
> +       return obj;
> +}
> +
> +static void test_dctcp(void)
> +{
> +       struct bpf_object *obj;
> +       /* compiler warning... */
> +       struct bpf_link *link = NULL;
> +
> +       obj = load("bpf_dctcp.o", "dctcp", &link);
> +       if (IS_ERR(obj))
> +               return;
> +
> +       do_test("bpf_dctcp");
> +
> +       bpf_link__destroy(link);
> +       bpf_object__close(obj);
> +}
> +
> +void test_bpf_tcp_ca(void)
> +{
> +       if (test__start_subtest("dctcp"))
> +               test_dctcp();
> +}

[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-23 23:26   ` Andrii Nakryiko
@ 2019-12-24  1:31     ` Martin Lau
  2019-12-24  7:01       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Martin Lau @ 2019-12-24  1:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > This patch adds a bpf_dctcp example.  It currently does not do
> > no-ECN fallback but the same could be done through the cgrp2-bpf.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> >  3 files changed, 656 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > new file mode 100644
> > index 000000000000..7ba8c1b4157a
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > @@ -0,0 +1,228 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef __BPF_TCP_HELPERS_H
> > +#define __BPF_TCP_HELPERS_H
> > +
> > +#include <stdbool.h>
> > +#include <linux/types.h>
> > +#include <bpf_helpers.h>
> > +#include <bpf_core_read.h>
> > +#include "bpf_trace_helpers.h"
> > +
> > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> 
> Should we try to put those BPF programs into some section that would
> > indicate they are used with struct_ops? libbpf doesn't use or enforce
> that (even though it could to derive and enforce that they are
> STRUCT_OPS programs). So something like
> SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> convention is very useful for consistency and to do a quick ELF dump
> and see what is where. WDYT?
I did not use it here because I don't want any misperception that it is
a required convention by libbpf.

Sure, I can prefix it here and comment that it is just a
convention but not a libbpf requirement.

> 
> > +
> > +struct sock_common {
> > +       unsigned char   skc_state;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct sock {
> > +       struct sock_common      __sk_common;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct inet_sock {
> > +       struct sock             sk;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct inet_connection_sock {
> > +       struct inet_sock          icsk_inet;
> > +       __u8                      icsk_ca_state:6,
> > +                                 icsk_ca_setsockopt:1,
> > +                                 icsk_ca_dst_locked:1;
> > +       struct {
> > +               __u8              pending;
> > +       } icsk_ack;
> > +       __u64                     icsk_ca_priv[104 / sizeof(__u64)];
> > +} __attribute__((preserve_access_index));
> > +
> > +struct tcp_sock {
> > +       struct inet_connection_sock     inet_conn;
> > +
> > +       __u32   rcv_nxt;
> > +       __u32   snd_nxt;
> > +       __u32   snd_una;
> > +       __u8    ecn_flags;
> > +       __u32   delivered;
> > +       __u32   delivered_ce;
> > +       __u32   snd_cwnd;
> > +       __u32   snd_cwnd_cnt;
> > +       __u32   snd_cwnd_clamp;
> > +       __u32   snd_ssthresh;
> > +       __u8    syn_data:1,     /* SYN includes data */
> > +               syn_fastopen:1, /* SYN includes Fast Open option */
> > +               syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
> > +               syn_fastopen_ch:1, /* Active TFO re-enabling probe */
> > +               syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
> > +               save_syn:1,     /* Save headers of SYN packet */
> > +               is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
> > +               syn_smc:1;      /* SYN includes SMC */
> > +       __u32   max_packets_out;
> > +       __u32   lsndtime;
> > +       __u32   prior_cwnd;
> > +} __attribute__((preserve_access_index));
> > +
> > +static __always_inline struct inet_connection_sock *inet_csk(const struct sock *sk)
> > +{
> > +       return (struct inet_connection_sock *)sk;
> > +}
> > +
> > +static __always_inline void *inet_csk_ca(const struct sock *sk)
> > +{
> > +       return (void *)inet_csk(sk)->icsk_ca_priv;
> > +}
> > +
> > +static __always_inline struct tcp_sock *tcp_sk(const struct sock *sk)
> > +{
> > +       return (struct tcp_sock *)sk;
> > +}
> > +
> > +static __always_inline bool before(__u32 seq1, __u32 seq2)
> > +{
> > +       return (__s32)(seq1-seq2) < 0;
> > +}
> > +#define after(seq2, seq1)      before(seq1, seq2)
> > +
> > +#define        TCP_ECN_OK              1
> > +#define        TCP_ECN_QUEUE_CWR       2
> > +#define        TCP_ECN_DEMAND_CWR      4
> > +#define        TCP_ECN_SEEN            8
> > +
> > +enum inet_csk_ack_state_t {
> > +       ICSK_ACK_SCHED  = 1,
> > +       ICSK_ACK_TIMER  = 2,
> > +       ICSK_ACK_PUSHED = 4,
> > +       ICSK_ACK_PUSHED2 = 8,
> > +       ICSK_ACK_NOW = 16       /* Send the next ACK immediately (once) */
> > +};
> > +
> > +enum tcp_ca_event {
> > +       CA_EVENT_TX_START = 0,
> > +       CA_EVENT_CWND_RESTART = 1,
> > +       CA_EVENT_COMPLETE_CWR = 2,
> > +       CA_EVENT_LOSS = 3,
> > +       CA_EVENT_ECN_NO_CE = 4,
> > +       CA_EVENT_ECN_IS_CE = 5,
> > +};
> > +
> > +enum tcp_ca_state {
> > +       TCP_CA_Open = 0,
> > +       TCP_CA_Disorder = 1,
> > +       TCP_CA_CWR = 2,
> > +       TCP_CA_Recovery = 3,
> > +       TCP_CA_Loss = 4
> > +};
> > +
> > +struct ack_sample {
> > +       __u32 pkts_acked;
> > +       __s32 rtt_us;
> > +       __u32 in_flight;
> > +} __attribute__((preserve_access_index));
> > +
> > +struct rate_sample {
> > +       __u64  prior_mstamp; /* starting timestamp for interval */
> > +       __u32  prior_delivered; /* tp->delivered at "prior_mstamp" */
> > +       __s32  delivered;               /* number of packets delivered over interval */
> > +       long interval_us;       /* time for tp->delivered to incr "delivered" */
> > +       __u32 snd_interval_us;  /* snd interval for delivered packets */
> > +       __u32 rcv_interval_us;  /* rcv interval for delivered packets */
> > +       long rtt_us;            /* RTT of last (S)ACKed packet (or -1) */
> > +       int  losses;            /* number of packets marked lost upon ACK */
> > +       __u32  acked_sacked;    /* number of packets newly (S)ACKed upon ACK */
> > +       __u32  prior_in_flight; /* in flight before this ACK */
> > +       bool is_app_limited;    /* is sample from packet with bubble in pipe? */
> > +       bool is_retrans;        /* is sample from retransmission? */
> > +       bool is_ack_delayed;    /* is this (likely) a delayed ACK? */
> > +} __attribute__((preserve_access_index));
> > +
> > +#define TCP_CA_NAME_MAX                16
> > +#define TCP_CONG_NEEDS_ECN     0x2
> > +
> > +struct tcp_congestion_ops {
> > +       __u32 flags;
> > +
> > +       /* initialize private data (optional) */
> > +       void (*init)(struct sock *sk);
> > +       /* cleanup private data  (optional) */
> > +       void (*release)(struct sock *sk);
> > +
> > +       /* return slow start threshold (required) */
> > +       __u32 (*ssthresh)(struct sock *sk);
> > +       /* do new cwnd calculation (required) */
> > +       void (*cong_avoid)(struct sock *sk, __u32 ack, __u32 acked);
> > +       /* call before changing ca_state (optional) */
> > +       void (*set_state)(struct sock *sk, __u8 new_state);
> > +       /* call when cwnd event occurs (optional) */
> > +       void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
> > +       /* call when ack arrives (optional) */
> > +       void (*in_ack_event)(struct sock *sk, __u32 flags);
> > +       /* new value of cwnd after loss (required) */
> > +       __u32  (*undo_cwnd)(struct sock *sk);
> > +       /* hook for packet ack accounting (optional) */
> > +       void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
> > +       /* override sysctl_tcp_min_tso_segs */
> > +       __u32 (*min_tso_segs)(struct sock *sk);
> > +       /* returns the multiplier used in tcp_sndbuf_expand (optional) */
> > +       __u32 (*sndbuf_expand)(struct sock *sk);
> > +       /* call when packets are delivered to update cwnd and pacing rate,
> > +        * after all the ca_state processing. (optional)
> > +        */
> > +       void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
> > +
> > +       char            name[TCP_CA_NAME_MAX];
> > +};
> 
> Can all of these types come from vmlinux.h instead of being duplicated here?
It can but I prefer leaving it as is in bpf_tcp_helpers.h like another
existing test in kfree_skb.c.  Without directly using the same struct in
vmlinux.h,  I think it is a good test for libbpf.
That reminds me to shuffle the member ordering a little in tcp_congestion_ops
here.

> 
> > +
> > +#define min(a, b) ((a) < (b) ? (a) : (b))
> > +#define max(a, b) ((a) > (b) ? (a) : (b))
> > +#define min_not_zero(x, y) ({                  \
> > +       typeof(x) __x = (x);                    \
> > +       typeof(y) __y = (y);                    \
> > +       __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
> > +
> 
> [...]
> 
> > +static struct bpf_object *load(const char *filename, const char *map_name,
> > +                              struct bpf_link **link)
> > +{
> > +       struct bpf_object *obj;
> > +       struct bpf_map *map;
> > +       struct bpf_link *l;
> > +       int err;
> > +
> > +       obj = bpf_object__open(filename);
> > +       if (CHECK(IS_ERR(obj), "bpf_obj__open_file", "obj:%ld\n",
> > +                 PTR_ERR(obj)))
> > +               return obj;
> > +
> > +       err = bpf_object__load(obj);
> > +       if (CHECK(err, "bpf_object__load", "err:%d\n", err)) {
> > +               bpf_object__close(obj);
> > +               return ERR_PTR(err);
> > +       }
> > +
> > +       map = bpf_object__find_map_by_name(obj, map_name);
> > +       if (CHECK(!map, "bpf_object__find_map_by_name", "%s not found\n",
> > +                   map_name)) {
> > +               bpf_object__close(obj);
> > +               return ERR_PTR(-ENOENT);
> > +       }
> > +
> 
> use skeleton instead?
Will give it a spin.

> 
> > +       l = bpf_map__attach_struct_ops(map);
> > +       if (CHECK(IS_ERR(l), "bpf_struct_ops_map__attach", "err:%ld\n",
> > +                 PTR_ERR(l))) {
> > +               bpf_object__close(obj);
> > +               return (void *)l;
> > +       }
> > +
> > +       *link = l;
> > +
> > +       return obj;
> > +}
> > +

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-24  1:31     ` Martin Lau
@ 2019-12-24  7:01       ` Andrii Nakryiko
  2019-12-24  7:32         ` Martin Lau
  2019-12-24 16:50         ` Martin Lau
  0 siblings, 2 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-24  7:01 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
>
> On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > This patch adds a bpf_dctcp example.  It currently does not do
> > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > >  3 files changed, 656 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > new file mode 100644
> > > index 000000000000..7ba8c1b4157a
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > @@ -0,0 +1,228 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef __BPF_TCP_HELPERS_H
> > > +#define __BPF_TCP_HELPERS_H
> > > +
> > > +#include <stdbool.h>
> > > +#include <linux/types.h>
> > > +#include <bpf_helpers.h>
> > > +#include <bpf_core_read.h>
> > > +#include "bpf_trace_helpers.h"
> > > +
> > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> >
> > Should we try to put those BPF programs into some section that would
> > indicate they are used with struct_ops? libbpf doesn't use or enforce
> > that (even though it could to derive and enforce that they are
> > STRUCT_OPS programs). So something like
> > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > convention is very useful for consistency and to do a quick ELF dump
> > and see what is where. WDYT?
> I did not use it here because I don't want any misperception that it is
> a required convention by libbpf.
>
> Sure, I can prefix it here and comment that it is just a
> convention but not a libbpf requirement.

Well, we can actually make it a requirement of sorts. Currently your
code expects that BPF program's type is UNSPEC and then it sets it to
STRUCT_OPS. Alternatively we can say that any BPF program in
SEC("struct_ops/<whatever>") will be automatically assigned
STRUCT_OPTS BPF program type (which is done generically in
bpf_object__open()), and then as .struct_ops section is parsed, all
those programs will be "assembled" by the code you added into a
struct_ops map.

It's a requirement "of sorts", because even if the user doesn't do that,
stuff will still work if the user manually calls
bpf_program__set_struct_ops(prog). Which actually reminds me that it
would be good to add bpf_program__set_struct_ops() and
bpf_program__is_struct_ops() APIs for completeness, similarly to how
KP's LSM patch set does.
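
A small sketch of that fallback path with the two APIs proposed here (the
names follow the proposal and existing libbpf patterns; they are not in
libbpf at this point in the thread):

/* force the type on any program the section-name heuristic missed */
struct bpf_program *prog;

bpf_object__for_each_program(prog, obj) {
        if (!bpf_program__is_struct_ops(prog))
                bpf_program__set_struct_ops(prog);
}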

BTW, libbpf will emit a debug message for every single BPF program whose
section it doesn't recognize, so it is still nice to have it be
something more or less standardized and recognizable by libbpf.

>
> >
> > > +

[...]

> >
> > Can all of these types come from vmlinux.h instead of being duplicated here?
> It can but I prefer leaving it as is in bpf_tcp_helpers.h like another
> existing test in kfree_skb.c.  Without directly using the same struct in
> vmlinux.h,  I think it is a good test for libbpf.
> > That reminds me to shuffle the member ordering a little in tcp_congestion_ops
> here.

Sure, no problem. When I looked at this it was a bit discouraging how
many types I'd need to duplicate, but surely we don't want to make
an impression that vmlinux.h is the only way to achieve this.

>
> >
> > > +
> > > +#define min(a, b) ((a) < (b) ? (a) : (b))
> > > +#define max(a, b) ((a) > (b) ? (a) : (b))
> > > +#define min_not_zero(x, y) ({                  \
> > > +       typeof(x) __x = (x);                    \
> > > +       typeof(y) __y = (y);                    \
> > > +       __x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
> > > +
> >
> > [...]
> >
> > > +static struct bpf_object *load(const char *filename, const char *map_name,
> > > +                              struct bpf_link **link)
> > > +{
> > > +       struct bpf_object *obj;
> > > +       struct bpf_map *map;
> > > +       struct bpf_link *l;
> > > +       int err;
> > > +
> > > +       obj = bpf_object__open(filename);
> > > +       if (CHECK(IS_ERR(obj), "bpf_obj__open_file", "obj:%ld\n",
> > > +                 PTR_ERR(obj)))
> > > +               return obj;
> > > +
> > > +       err = bpf_object__load(obj);
> > > +       if (CHECK(err, "bpf_object__load", "err:%d\n", err)) {
> > > +               bpf_object__close(obj);
> > > +               return ERR_PTR(err);
> > > +       }
> > > +
> > > +       map = bpf_object__find_map_by_name(obj, map_name);
> > > +       if (CHECK(!map, "bpf_object__find_map_by_name", "%s not found\n",
> > > +                   map_name)) {
> > > +               bpf_object__close(obj);
> > > +               return ERR_PTR(-ENOENT);
> > > +       }
> > > +
> >
> > use skeleton instead?
> Will give it a spin.
>
> >
> > > +       l = bpf_map__attach_struct_ops(map);
> > > +       if (CHECK(IS_ERR(l), "bpf_struct_ops_map__attach", "err:%ld\n",
> > > +                 PTR_ERR(l))) {
> > > +               bpf_object__close(obj);
> > > +               return (void *)l;
> > > +       }
> > > +
> > > +       *link = l;
> > > +
> > > +       return obj;
> > > +}
> > > +

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
  2019-12-23 20:18   ` Yonghong Song
  2019-12-23 23:20   ` Andrii Nakryiko
@ 2019-12-24  7:16   ` kbuild test robot
  2019-12-24 13:06   ` kbuild test robot
  3 siblings, 0 replies; 45+ messages in thread
From: kbuild test robot @ 2019-12-24  7:16 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: kbuild-all, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, kernel-team, netdev

[-- Attachment #1: Type: text/plain, Size: 8202 bytes --]

Hi Martin,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]
[cannot apply to bpf/master net/master v5.5-rc3 next-20191220]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/Introduce-BPF-STRUCT_OPS/20191224-085617
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: s390-debug_defconfig (attached as .config)
compiler: s390-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_init':
>> kernel/bpf/bpf_struct_ops.c:198:1: warning: the frame size of 1208 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^

vim +198 kernel/bpf/bpf_struct_ops.c

d69ac27055a81d Martin KaFai Lau 2019-12-20  113  
d69ac27055a81d Martin KaFai Lau 2019-12-20  114  	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
d69ac27055a81d Martin KaFai Lau 2019-12-20  115  					  BTF_KIND_STRUCT);
d69ac27055a81d Martin KaFai Lau 2019-12-20  116  	if (module_id < 0) {
d69ac27055a81d Martin KaFai Lau 2019-12-20  117  		pr_warn("Cannot find struct module in btf_vmlinux\n");
d69ac27055a81d Martin KaFai Lau 2019-12-20  118  		return;
d69ac27055a81d Martin KaFai Lau 2019-12-20  119  	}
d69ac27055a81d Martin KaFai Lau 2019-12-20  120  	module_type = btf_type_by_id(_btf_vmlinux, module_id);
d69ac27055a81d Martin KaFai Lau 2019-12-20  121  
b14e6918483a61 Martin KaFai Lau 2019-12-20  122  	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  123  		st_ops = bpf_struct_ops[i];
b14e6918483a61 Martin KaFai Lau 2019-12-20  124  
d69ac27055a81d Martin KaFai Lau 2019-12-20  125  		if (strlen(st_ops->name) + VALUE_PREFIX_LEN >=
d69ac27055a81d Martin KaFai Lau 2019-12-20  126  		    sizeof(value_name)) {
d69ac27055a81d Martin KaFai Lau 2019-12-20  127  			pr_warn("struct_ops name %s is too long\n",
d69ac27055a81d Martin KaFai Lau 2019-12-20  128  				st_ops->name);
d69ac27055a81d Martin KaFai Lau 2019-12-20  129  			continue;
d69ac27055a81d Martin KaFai Lau 2019-12-20  130  		}
d69ac27055a81d Martin KaFai Lau 2019-12-20  131  		sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name);
d69ac27055a81d Martin KaFai Lau 2019-12-20  132  
d69ac27055a81d Martin KaFai Lau 2019-12-20  133  		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
d69ac27055a81d Martin KaFai Lau 2019-12-20  134  						 BTF_KIND_STRUCT);
d69ac27055a81d Martin KaFai Lau 2019-12-20  135  		if (value_id < 0) {
d69ac27055a81d Martin KaFai Lau 2019-12-20  136  			pr_warn("Cannot find struct %s in btf_vmlinux\n",
d69ac27055a81d Martin KaFai Lau 2019-12-20  137  				value_name);
d69ac27055a81d Martin KaFai Lau 2019-12-20  138  			continue;
d69ac27055a81d Martin KaFai Lau 2019-12-20  139  		}
d69ac27055a81d Martin KaFai Lau 2019-12-20  140  
b14e6918483a61 Martin KaFai Lau 2019-12-20  141  		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
b14e6918483a61 Martin KaFai Lau 2019-12-20  142  						BTF_KIND_STRUCT);
b14e6918483a61 Martin KaFai Lau 2019-12-20  143  		if (type_id < 0) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  144  			pr_warn("Cannot find struct %s in btf_vmlinux\n",
b14e6918483a61 Martin KaFai Lau 2019-12-20  145  				st_ops->name);
b14e6918483a61 Martin KaFai Lau 2019-12-20  146  			continue;
b14e6918483a61 Martin KaFai Lau 2019-12-20  147  		}
b14e6918483a61 Martin KaFai Lau 2019-12-20  148  		t = btf_type_by_id(_btf_vmlinux, type_id);
b14e6918483a61 Martin KaFai Lau 2019-12-20  149  		if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  150  			pr_warn("Cannot support #%u members in struct %s\n",
b14e6918483a61 Martin KaFai Lau 2019-12-20  151  				btf_type_vlen(t), st_ops->name);
b14e6918483a61 Martin KaFai Lau 2019-12-20  152  			continue;
b14e6918483a61 Martin KaFai Lau 2019-12-20  153  		}
b14e6918483a61 Martin KaFai Lau 2019-12-20  154  
b14e6918483a61 Martin KaFai Lau 2019-12-20  155  		for_each_member(j, t, member) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  156  			const struct btf_type *func_proto;
b14e6918483a61 Martin KaFai Lau 2019-12-20  157  
b14e6918483a61 Martin KaFai Lau 2019-12-20  158  			mname = btf_name_by_offset(_btf_vmlinux,
b14e6918483a61 Martin KaFai Lau 2019-12-20  159  						   member->name_off);
b14e6918483a61 Martin KaFai Lau 2019-12-20  160  			if (!*mname) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  161  				pr_warn("anon member in struct %s is not supported\n",
b14e6918483a61 Martin KaFai Lau 2019-12-20  162  					st_ops->name);
b14e6918483a61 Martin KaFai Lau 2019-12-20  163  				break;
b14e6918483a61 Martin KaFai Lau 2019-12-20  164  			}
b14e6918483a61 Martin KaFai Lau 2019-12-20  165  
b14e6918483a61 Martin KaFai Lau 2019-12-20  166  			if (btf_member_bitfield_size(t, member)) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  167  				pr_warn("bit field member %s in struct %s is not supported\n",
b14e6918483a61 Martin KaFai Lau 2019-12-20  168  					mname, st_ops->name);
b14e6918483a61 Martin KaFai Lau 2019-12-20  169  				break;
b14e6918483a61 Martin KaFai Lau 2019-12-20  170  			}
b14e6918483a61 Martin KaFai Lau 2019-12-20  171  
b14e6918483a61 Martin KaFai Lau 2019-12-20  172  			func_proto = btf_type_resolve_func_ptr(_btf_vmlinux,
b14e6918483a61 Martin KaFai Lau 2019-12-20  173  							       member->type,
b14e6918483a61 Martin KaFai Lau 2019-12-20  174  							       NULL);
b14e6918483a61 Martin KaFai Lau 2019-12-20  175  			if (func_proto &&
b14e6918483a61 Martin KaFai Lau 2019-12-20  176  			    btf_distill_func_proto(&log, _btf_vmlinux,
b14e6918483a61 Martin KaFai Lau 2019-12-20  177  						   func_proto, mname,
b14e6918483a61 Martin KaFai Lau 2019-12-20  178  						   &st_ops->func_models[j])) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  179  				pr_warn("Error in parsing func ptr %s in struct %s\n",
b14e6918483a61 Martin KaFai Lau 2019-12-20  180  					mname, st_ops->name);
b14e6918483a61 Martin KaFai Lau 2019-12-20  181  				break;
b14e6918483a61 Martin KaFai Lau 2019-12-20  182  			}
b14e6918483a61 Martin KaFai Lau 2019-12-20  183  		}
b14e6918483a61 Martin KaFai Lau 2019-12-20  184  
b14e6918483a61 Martin KaFai Lau 2019-12-20  185  		if (j == btf_type_vlen(t)) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  186  			if (st_ops->init(_btf_vmlinux)) {
b14e6918483a61 Martin KaFai Lau 2019-12-20  187  				pr_warn("Error in init bpf_struct_ops %s\n",
b14e6918483a61 Martin KaFai Lau 2019-12-20  188  					st_ops->name);
b14e6918483a61 Martin KaFai Lau 2019-12-20  189  			} else {
b14e6918483a61 Martin KaFai Lau 2019-12-20  190  				st_ops->type_id = type_id;
b14e6918483a61 Martin KaFai Lau 2019-12-20  191  				st_ops->type = t;
d69ac27055a81d Martin KaFai Lau 2019-12-20  192  				st_ops->value_id = value_id;
d69ac27055a81d Martin KaFai Lau 2019-12-20  193  				st_ops->value_type =
d69ac27055a81d Martin KaFai Lau 2019-12-20  194  					btf_type_by_id(_btf_vmlinux, value_id);
b14e6918483a61 Martin KaFai Lau 2019-12-20  195  			}
b14e6918483a61 Martin KaFai Lau 2019-12-20  196  		}
b14e6918483a61 Martin KaFai Lau 2019-12-20  197  	}
b14e6918483a61 Martin KaFai Lau 2019-12-20 @198  }
b14e6918483a61 Martin KaFai Lau 2019-12-20  199  

:::::: The code at line 198 was first introduced by commit
:::::: b14e6918483a61bb02672580bde0aa60f4cce17d bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS

:::::: TO: Martin KaFai Lau <kafai@fb.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 19208 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-24  7:01       ` Andrii Nakryiko
@ 2019-12-24  7:32         ` Martin Lau
  2019-12-24 16:50         ` Martin Lau
  1 sibling, 0 replies; 45+ messages in thread
From: Martin Lau @ 2019-12-24  7:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> > > Can all of these types come from vmlinux.h instead of being duplicated here?
> > It can but I prefer leaving it as is in bpf_tcp_helpers.h like another
> > existing test in kfree_skb.c.  Without directly using the same struct in
> > vmlinux.h,  I think it is a good test for libbpf.
> > That reminds me to shuffle the member ordering a little in tcp_congestion_ops
> > here.
> 
> Sure, no problem. When I looked at this it was a bit discouraging how
> many types I'd need to duplicate, but surely we don't want to make
> an impression that vmlinux.h is the only way to achieve this.
IMO, it is a very compact set of fields that work for both dctcp and cubic.
Also, duplication is not a concern here.  The deviation between
the kernel and bpf_tcp_helpers.h is not a problem with CO-RE.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-21  6:26 ` [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
  2019-12-23 19:33   ` Yonghong Song
  2019-12-23 20:29   ` Andrii Nakryiko
@ 2019-12-24 11:46   ` kbuild test robot
  2 siblings, 0 replies; 45+ messages in thread
From: kbuild test robot @ 2019-12-24 11:46 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: kbuild-all, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, kernel-team, netdev

[-- Attachment #1: Type: text/plain, Size: 3877 bytes --]

Hi Martin,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[cannot apply to bpf/master net/master v5.5-rc3 next-20191219]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/Introduce-BPF-STRUCT_OPS/20191224-085617
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_init':
>> kernel/bpf/bpf_struct_ops.c:86:8: error: implicit declaration of function 'btf_distill_func_proto'; did you mean 'btf_type_is_func_proto'? [-Werror=implicit-function-declaration]
           btf_distill_func_proto(&log, _btf_vmlinux,
           ^~~~~~~~~~~~~~~~~~~~~~
           btf_type_is_func_proto
   cc1: some warnings being treated as errors

vim +86 kernel/bpf/bpf_struct_ops.c

    37	
    38	void bpf_struct_ops_init(struct btf *_btf_vmlinux)
    39	{
    40		const struct btf_member *member;
    41		struct bpf_struct_ops *st_ops;
    42		struct bpf_verifier_log log = {};
    43		const struct btf_type *t;
    44		const char *mname;
    45		s32 type_id;
    46		u32 i, j;
    47	
    48		for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
    49			st_ops = bpf_struct_ops[i];
    50	
    51			type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
    52							BTF_KIND_STRUCT);
    53			if (type_id < 0) {
    54				pr_warn("Cannot find struct %s in btf_vmlinux\n",
    55					st_ops->name);
    56				continue;
    57			}
    58			t = btf_type_by_id(_btf_vmlinux, type_id);
    59			if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
    60				pr_warn("Cannot support #%u members in struct %s\n",
    61					btf_type_vlen(t), st_ops->name);
    62				continue;
    63			}
    64	
    65			for_each_member(j, t, member) {
    66				const struct btf_type *func_proto;
    67	
    68				mname = btf_name_by_offset(_btf_vmlinux,
    69							   member->name_off);
    70				if (!*mname) {
    71					pr_warn("anon member in struct %s is not supported\n",
    72						st_ops->name);
    73					break;
    74				}
    75	
    76				if (btf_member_bitfield_size(t, member)) {
    77					pr_warn("bit field member %s in struct %s is not supported\n",
    78						mname, st_ops->name);
    79					break;
    80				}
    81	
    82				func_proto = btf_type_resolve_func_ptr(_btf_vmlinux,
    83								       member->type,
    84								       NULL);
    85				if (func_proto &&
  > 86				    btf_distill_func_proto(&log, _btf_vmlinux,
    87							   func_proto, mname,
    88							   &st_ops->func_models[j])) {
    89					pr_warn("Error in parsing func ptr %s in struct %s\n",
    90						mname, st_ops->name);
    91					break;
    92				}
    93			}
    94	
    95			if (j == btf_type_vlen(t)) {
    96				if (st_ops->init(_btf_vmlinux)) {
    97					pr_warn("Error in init bpf_struct_ops %s\n",
    98						st_ops->name);
    99				} else {
   100					st_ops->type_id = type_id;
   101					st_ops->type = t;
   102				}
   103			}
   104		}
   105	}
   106	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46481 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-21  6:26 ` [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
  2019-12-23 19:57   ` Yonghong Song
  2019-12-23 23:05   ` Andrii Nakryiko
@ 2019-12-24 12:28   ` kbuild test robot
  2 siblings, 0 replies; 45+ messages in thread
From: kbuild test robot @ 2019-12-24 12:28 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: kbuild-all, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, kernel-team, netdev

[-- Attachment #1: Type: text/plain, Size: 12699 bytes --]

Hi Martin,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[cannot apply to bpf/master net/master v5.5-rc3 next-20191219]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/Introduce-BPF-STRUCT_OPS/20191224-085617
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_init':
   kernel/bpf/bpf_struct_ops.c:176:8: error: implicit declaration of function 'btf_distill_func_proto'; did you mean 'btf_type_is_func_proto'? [-Werror=implicit-function-declaration]
           btf_distill_func_proto(&log, _btf_vmlinux,
           ^~~~~~~~~~~~~~~~~~~~~~
           btf_type_is_func_proto
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_update_elem':
>> kernel/bpf/bpf_struct_ops.c:408:2: error: implicit declaration of function 'bpf_map_inc'; did you mean 'bpf_map_put'? [-Werror=implicit-function-declaration]
     bpf_map_inc(map);
     ^~~~~~~~~~~
     bpf_map_put
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_free':
>> kernel/bpf/bpf_struct_ops.c:468:2: error: implicit declaration of function 'bpf_map_area_free'; did you mean 'bpf_prog_free'? [-Werror=implicit-function-declaration]
     bpf_map_area_free(st_map->progs);
     ^~~~~~~~~~~~~~~~~
     bpf_prog_free
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_alloc':
   kernel/bpf/bpf_struct_ops.c:515:8: error: implicit declaration of function 'bpf_map_charge_init'; did you mean 'bpf_prog_change_xdp'? [-Werror=implicit-function-declaration]
     err = bpf_map_charge_init(&mem, map_total_size);
           ^~~~~~~~~~~~~~~~~~~
           bpf_prog_change_xdp
>> kernel/bpf/bpf_struct_ops.c:519:11: error: implicit declaration of function 'bpf_map_area_alloc'; did you mean 'bpf_prog_alloc'? [-Werror=implicit-function-declaration]
     st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
              ^~~~~~~~~~~~~~~~~~
              bpf_prog_alloc
>> kernel/bpf/bpf_struct_ops.c:519:9: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
            ^
>> kernel/bpf/bpf_struct_ops.c:521:3: error: implicit declaration of function 'bpf_map_charge_finish'; did you mean 'bpf_map_flags_to_cap'? [-Werror=implicit-function-declaration]
      bpf_map_charge_finish(&mem);
      ^~~~~~~~~~~~~~~~~~~~~
      bpf_map_flags_to_cap
   kernel/bpf/bpf_struct_ops.c:527:17: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
                    ^
   kernel/bpf/bpf_struct_ops.c:528:16: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map->progs =
                   ^
   kernel/bpf/bpf_struct_ops.c:545:2: error: implicit declaration of function 'bpf_map_init_from_attr'; did you mean 'bpf_jit_get_func_addr'? [-Werror=implicit-function-declaration]
     bpf_map_init_from_attr(map, attr);
     ^~~~~~~~~~~~~~~~~~~~~~
     bpf_jit_get_func_addr
   kernel/bpf/bpf_struct_ops.c:546:2: error: implicit declaration of function 'bpf_map_charge_move'; did you mean 'bpf_prog_change_xdp'? [-Werror=implicit-function-declaration]
     bpf_map_charge_move(&map->memory, &mem);
     ^~~~~~~~~~~~~~~~~~~
     bpf_prog_change_xdp
   cc1: some warnings being treated as errors

vim +408 kernel/bpf/bpf_struct_ops.c

   289	
   290	static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
   291						  void *value, u64 flags)
   292	{
   293		struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
   294		const struct bpf_struct_ops *st_ops = st_map->st_ops;
   295		struct bpf_struct_ops_value *uvalue, *kvalue;
   296		const struct btf_member *member;
   297		const struct btf_type *t = st_ops->type;
   298		void *udata, *kdata;
   299		int prog_fd, err = 0;
   300		void *image;
   301		u32 i;
   302	
   303		if (flags)
   304			return -EINVAL;
   305	
   306		if (*(u32 *)key != 0)
   307			return -E2BIG;
   308	
   309		uvalue = (struct bpf_struct_ops_value *)value;
   310		if (uvalue->state || refcount_read(&uvalue->refcnt))
   311			return -EINVAL;
   312	
   313		uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
   314		kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
   315	
   316		spin_lock(&st_map->lock);
   317	
   318		if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
   319			err = -EBUSY;
   320			goto unlock;
   321		}
   322	
   323		memcpy(uvalue, value, map->value_size);
   324	
   325		udata = &uvalue->data;
   326		kdata = &kvalue->data;
   327		image = st_map->image;
   328	
   329		for_each_member(i, t, member) {
   330			const struct btf_type *mtype, *ptype;
   331			struct bpf_prog *prog;
   332			u32 moff;
   333	
   334			moff = btf_member_bit_offset(t, member) / 8;
   335			mtype = btf_type_by_id(btf_vmlinux, member->type);
   336			ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
   337			if (ptype == module_type) {
   338				*(void **)(kdata + moff) = BPF_MODULE_OWNER;
   339				continue;
   340			}
   341	
   342			err = st_ops->init_member(t, member, kdata, udata);
   343			if (err < 0)
   344				goto reset_unlock;
   345	
   346			/* The ->init_member() has handled this member */
   347			if (err > 0)
   348				continue;
   349	
   350			/* If st_ops->init_member does not handle it,
   351			 * we will only handle func ptrs and zero-ed members
   352			 * here.  Reject everything else.
   353			 */
   354	
   355			/* All non func ptr member must be 0 */
   356			if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
   357						       NULL)) {
   358				u32 msize;
   359	
   360				mtype = btf_resolve_size(btf_vmlinux, mtype,
   361							 &msize, NULL, NULL);
   362				if (IS_ERR(mtype)) {
   363					err = PTR_ERR(mtype);
   364					goto reset_unlock;
   365				}
   366	
   367				if (memchr_inv(udata + moff, 0, msize)) {
   368					err = -EINVAL;
   369					goto reset_unlock;
   370				}
   371	
   372				continue;
   373			}
   374	
   375			prog_fd = (int)(*(unsigned long *)(udata + moff));
   376			/* Similar check as the attr->attach_prog_fd */
   377			if (!prog_fd)
   378				continue;
   379	
   380			prog = bpf_prog_get(prog_fd);
   381			if (IS_ERR(prog)) {
   382				err = PTR_ERR(prog);
   383				goto reset_unlock;
   384			}
   385			st_map->progs[i] = prog;
   386	
   387			if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
   388			    prog->aux->attach_btf_id != st_ops->type_id ||
   389			    prog->expected_attach_type != i) {
   390				err = -EINVAL;
   391				goto reset_unlock;
   392			}
   393	
   394			err = arch_prepare_bpf_trampoline(image,
   395							  &st_ops->func_models[i], 0,
   396							  &prog, 1, NULL, 0, NULL);
   397			if (err < 0)
   398				goto reset_unlock;
   399	
   400			*(void **)(kdata + moff) = image;
   401			image += err;
   402	
   403			/* put prog_id to udata */
   404			*(unsigned long *)(udata + moff) = prog->aux->id;
   405		}
   406	
   407		refcount_set(&kvalue->refcnt, 1);
 > 408		bpf_map_inc(map);
   409	
   410		err = st_ops->reg(kdata);
   411		if (!err) {
   412			/* Pair with smp_load_acquire() during lookup */
   413			smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
   414			goto unlock;
   415		}
   416	
   417		/* Error during st_ops->reg() */
   418		bpf_map_put(map);
   419	
   420	reset_unlock:
   421		bpf_struct_ops_map_put_progs(st_map);
   422		memset(uvalue, 0, map->value_size);
   423		memset(kvalue, 0, map->value_size);
   424	
   425	unlock:
   426		spin_unlock(&st_map->lock);
   427		return err;
   428	}
   429	
   430	static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
   431	{
   432		enum bpf_struct_ops_state prev_state;
   433		struct bpf_struct_ops_map *st_map;
   434	
   435		st_map = (struct bpf_struct_ops_map *)map;
   436		prev_state = cmpxchg(&st_map->kvalue.state,
   437				     BPF_STRUCT_OPS_STATE_INUSE,
   438				     BPF_STRUCT_OPS_STATE_TOBEFREE);
   439		if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
   440			st_map->st_ops->unreg(&st_map->kvalue.data);
   441			if (refcount_dec_and_test(&st_map->kvalue.refcnt))
   442				bpf_map_put(map);
   443		}
   444	
   445		return 0;
   446	}
   447	
   448	static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
   449						     struct seq_file *m)
   450	{
   451		void *value;
   452	
   453		value = bpf_struct_ops_map_lookup_elem(map, key);
   454		if (!value)
   455			return;
   456	
   457		btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
   458				  value, m);
   459		seq_puts(m, "\n");
   460	}
   461	
   462	static void bpf_struct_ops_map_free(struct bpf_map *map)
   463	{
   464		struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
   465	
   466		if (st_map->progs)
   467			bpf_struct_ops_map_put_progs(st_map);
 > 468		bpf_map_area_free(st_map->progs);
   469		bpf_jit_free_exec(st_map->image);
   470		bpf_map_area_free(st_map->uvalue);
   471		bpf_map_area_free(st_map);
   472	}
   473	
   474	static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
   475	{
   476		if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
   477		    attr->map_flags || !attr->btf_vmlinux_value_type_id)
   478			return -EINVAL;
   479		return 0;
   480	}
   481	
   482	static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
   483	{
   484		const struct bpf_struct_ops *st_ops;
   485		size_t map_total_size, st_map_size;
   486		struct bpf_struct_ops_map *st_map;
   487		const struct btf_type *t, *vt;
   488		struct bpf_map_memory mem;
   489		struct bpf_map *map;
   490		int err;
   491	
   492		if (!capable(CAP_SYS_ADMIN))
   493			return ERR_PTR(-EPERM);
   494	
   495		st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
   496		if (!st_ops)
   497			return ERR_PTR(-ENOTSUPP);
   498	
   499		vt = st_ops->value_type;
   500		if (attr->value_size != vt->size)
   501			return ERR_PTR(-EINVAL);
   502	
   503		t = st_ops->type;
   504	
   505		st_map_size = sizeof(*st_map) +
   506			/* kvalue stores the
   507			 * struct bpf_struct_ops_tcp_congestions_ops
   508			 */
   509			(vt->size - sizeof(struct bpf_struct_ops_value));
   510		map_total_size = st_map_size +
   511			/* uvalue */
   512			sizeof(vt->size) +
   513			/* struct bpf_progs **progs */
   514			 btf_type_vlen(t) * sizeof(struct bpf_prog *);
 > 515		err = bpf_map_charge_init(&mem, map_total_size);
   516		if (err < 0)
   517			return ERR_PTR(err);
   518	
 > 519		st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
   520		if (!st_map) {
 > 521			bpf_map_charge_finish(&mem);
   522			return ERR_PTR(-ENOMEM);
   523		}
   524		st_map->st_ops = st_ops;
   525		map = &st_map->map;
   526	
   527		st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
   528		st_map->progs =
   529			bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
   530					   NUMA_NO_NODE);
   531		/* Each trampoline costs < 64 bytes.  Ensure one page
   532		 * is enough for max number of func ptrs.
   533		 */
   534		BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
   535		st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
   536		if (!st_map->uvalue || !st_map->progs || !st_map->image) {
   537			bpf_struct_ops_map_free(map);
   538			bpf_map_charge_finish(&mem);
   539			return ERR_PTR(-ENOMEM);
   540		}
   541	
   542		spin_lock_init(&st_map->lock);
   543		set_vm_flush_reset_perms(st_map->image);
   544		set_memory_x((long)st_map->image, 1);
   545		bpf_map_init_from_attr(map, attr);
   546		bpf_map_charge_move(&map->memory, &mem);
   547	
   548		return map;
   549	}
   550	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46481 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
                     ` (2 preceding siblings ...)
  2019-12-24  7:16   ` kbuild test robot
@ 2019-12-24 13:06   ` kbuild test robot
  3 siblings, 0 replies; 45+ messages in thread
From: kbuild test robot @ 2019-12-24 13:06 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: kbuild-all, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, kernel-team, netdev

[-- Attachment #1: Type: text/plain, Size: 11414 bytes --]

Hi Martin,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[cannot apply to bpf/master net/master v5.5-rc3 next-20191219]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/Introduce-BPF-STRUCT_OPS/20191224-085617
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_init':
   kernel/bpf/bpf_struct_ops.c:176:8: error: implicit declaration of function 'btf_distill_func_proto'; did you mean 'btf_type_is_func_proto'? [-Werror=implicit-function-declaration]
           btf_distill_func_proto(&log, _btf_vmlinux,
           ^~~~~~~~~~~~~~~~~~~~~~
           btf_type_is_func_proto
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_update_elem':
   kernel/bpf/bpf_struct_ops.c:408:2: error: implicit declaration of function 'bpf_map_inc'; did you mean 'bpf_map_put'? [-Werror=implicit-function-declaration]
     bpf_map_inc(map);
     ^~~~~~~~~~~
     bpf_map_put
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_free':
   kernel/bpf/bpf_struct_ops.c:468:2: error: implicit declaration of function 'bpf_map_area_free'; did you mean 'bpf_prog_free'? [-Werror=implicit-function-declaration]
     bpf_map_area_free(st_map->progs);
     ^~~~~~~~~~~~~~~~~
     bpf_prog_free
   kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_alloc':
>> kernel/bpf/bpf_struct_ops.c:515:8: error: implicit declaration of function 'bpf_map_charge_init'; did you mean 'ip_misc_proc_init'? [-Werror=implicit-function-declaration]
     err = bpf_map_charge_init(&mem, map_total_size);
           ^~~~~~~~~~~~~~~~~~~
           ip_misc_proc_init
   kernel/bpf/bpf_struct_ops.c:519:11: error: implicit declaration of function 'bpf_map_area_alloc'; did you mean 'bpf_prog_alloc'? [-Werror=implicit-function-declaration]
     st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
              ^~~~~~~~~~~~~~~~~~
              bpf_prog_alloc
   kernel/bpf/bpf_struct_ops.c:519:9: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
            ^
   kernel/bpf/bpf_struct_ops.c:521:3: error: implicit declaration of function 'bpf_map_charge_finish'; did you mean 'bpf_map_flags_to_cap'? [-Werror=implicit-function-declaration]
      bpf_map_charge_finish(&mem);
      ^~~~~~~~~~~~~~~~~~~~~
      bpf_map_flags_to_cap
   kernel/bpf/bpf_struct_ops.c:527:17: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
                    ^
   kernel/bpf/bpf_struct_ops.c:528:16: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
     st_map->progs =
                   ^
>> kernel/bpf/bpf_struct_ops.c:545:2: error: implicit declaration of function 'bpf_map_init_from_attr'; did you mean 'bioset_init_from_src'? [-Werror=implicit-function-declaration]
     bpf_map_init_from_attr(map, attr);
     ^~~~~~~~~~~~~~~~~~~~~~
     bioset_init_from_src
>> kernel/bpf/bpf_struct_ops.c:546:2: error: implicit declaration of function 'bpf_map_charge_move'; did you mean 'bio_map_user_iov'? [-Werror=implicit-function-declaration]
     bpf_map_charge_move(&map->memory, &mem);
     ^~~~~~~~~~~~~~~~~~~
     bio_map_user_iov
   cc1: some warnings being treated as errors

vim +515 kernel/bpf/bpf_struct_ops.c

d69ac27055a81d Martin KaFai Lau 2019-12-20  461  
d69ac27055a81d Martin KaFai Lau 2019-12-20  462  static void bpf_struct_ops_map_free(struct bpf_map *map)
d69ac27055a81d Martin KaFai Lau 2019-12-20  463  {
d69ac27055a81d Martin KaFai Lau 2019-12-20  464  	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
d69ac27055a81d Martin KaFai Lau 2019-12-20  465  
d69ac27055a81d Martin KaFai Lau 2019-12-20  466  	if (st_map->progs)
d69ac27055a81d Martin KaFai Lau 2019-12-20  467  		bpf_struct_ops_map_put_progs(st_map);
d69ac27055a81d Martin KaFai Lau 2019-12-20 @468  	bpf_map_area_free(st_map->progs);
d69ac27055a81d Martin KaFai Lau 2019-12-20  469  	bpf_jit_free_exec(st_map->image);
d69ac27055a81d Martin KaFai Lau 2019-12-20  470  	bpf_map_area_free(st_map->uvalue);
d69ac27055a81d Martin KaFai Lau 2019-12-20  471  	bpf_map_area_free(st_map);
d69ac27055a81d Martin KaFai Lau 2019-12-20  472  }
d69ac27055a81d Martin KaFai Lau 2019-12-20  473  
d69ac27055a81d Martin KaFai Lau 2019-12-20  474  static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
d69ac27055a81d Martin KaFai Lau 2019-12-20  475  {
d69ac27055a81d Martin KaFai Lau 2019-12-20  476  	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
d69ac27055a81d Martin KaFai Lau 2019-12-20  477  	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
d69ac27055a81d Martin KaFai Lau 2019-12-20  478  		return -EINVAL;
d69ac27055a81d Martin KaFai Lau 2019-12-20  479  	return 0;
d69ac27055a81d Martin KaFai Lau 2019-12-20  480  }
d69ac27055a81d Martin KaFai Lau 2019-12-20  481  
d69ac27055a81d Martin KaFai Lau 2019-12-20  482  static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
d69ac27055a81d Martin KaFai Lau 2019-12-20  483  {
d69ac27055a81d Martin KaFai Lau 2019-12-20  484  	const struct bpf_struct_ops *st_ops;
d69ac27055a81d Martin KaFai Lau 2019-12-20  485  	size_t map_total_size, st_map_size;
d69ac27055a81d Martin KaFai Lau 2019-12-20  486  	struct bpf_struct_ops_map *st_map;
d69ac27055a81d Martin KaFai Lau 2019-12-20  487  	const struct btf_type *t, *vt;
d69ac27055a81d Martin KaFai Lau 2019-12-20  488  	struct bpf_map_memory mem;
d69ac27055a81d Martin KaFai Lau 2019-12-20  489  	struct bpf_map *map;
d69ac27055a81d Martin KaFai Lau 2019-12-20  490  	int err;
d69ac27055a81d Martin KaFai Lau 2019-12-20  491  
d69ac27055a81d Martin KaFai Lau 2019-12-20  492  	if (!capable(CAP_SYS_ADMIN))
d69ac27055a81d Martin KaFai Lau 2019-12-20  493  		return ERR_PTR(-EPERM);
d69ac27055a81d Martin KaFai Lau 2019-12-20  494  
d69ac27055a81d Martin KaFai Lau 2019-12-20  495  	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
d69ac27055a81d Martin KaFai Lau 2019-12-20  496  	if (!st_ops)
d69ac27055a81d Martin KaFai Lau 2019-12-20  497  		return ERR_PTR(-ENOTSUPP);
d69ac27055a81d Martin KaFai Lau 2019-12-20  498  
d69ac27055a81d Martin KaFai Lau 2019-12-20  499  	vt = st_ops->value_type;
d69ac27055a81d Martin KaFai Lau 2019-12-20  500  	if (attr->value_size != vt->size)
d69ac27055a81d Martin KaFai Lau 2019-12-20  501  		return ERR_PTR(-EINVAL);
d69ac27055a81d Martin KaFai Lau 2019-12-20  502  
d69ac27055a81d Martin KaFai Lau 2019-12-20  503  	t = st_ops->type;
d69ac27055a81d Martin KaFai Lau 2019-12-20  504  
d69ac27055a81d Martin KaFai Lau 2019-12-20  505  	st_map_size = sizeof(*st_map) +
d69ac27055a81d Martin KaFai Lau 2019-12-20  506  		/* kvalue stores the
d69ac27055a81d Martin KaFai Lau 2019-12-20  507  		 * struct bpf_struct_ops_tcp_congestions_ops
d69ac27055a81d Martin KaFai Lau 2019-12-20  508  		 */
d69ac27055a81d Martin KaFai Lau 2019-12-20  509  		(vt->size - sizeof(struct bpf_struct_ops_value));
d69ac27055a81d Martin KaFai Lau 2019-12-20  510  	map_total_size = st_map_size +
d69ac27055a81d Martin KaFai Lau 2019-12-20  511  		/* uvalue */
d69ac27055a81d Martin KaFai Lau 2019-12-20  512  		sizeof(vt->size) +
d69ac27055a81d Martin KaFai Lau 2019-12-20  513  		/* struct bpf_progs **progs */
d69ac27055a81d Martin KaFai Lau 2019-12-20  514  		 btf_type_vlen(t) * sizeof(struct bpf_prog *);
d69ac27055a81d Martin KaFai Lau 2019-12-20 @515  	err = bpf_map_charge_init(&mem, map_total_size);
d69ac27055a81d Martin KaFai Lau 2019-12-20  516  	if (err < 0)
d69ac27055a81d Martin KaFai Lau 2019-12-20  517  		return ERR_PTR(err);
d69ac27055a81d Martin KaFai Lau 2019-12-20  518  
d69ac27055a81d Martin KaFai Lau 2019-12-20  519  	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
d69ac27055a81d Martin KaFai Lau 2019-12-20  520  	if (!st_map) {
d69ac27055a81d Martin KaFai Lau 2019-12-20  521  		bpf_map_charge_finish(&mem);
d69ac27055a81d Martin KaFai Lau 2019-12-20  522  		return ERR_PTR(-ENOMEM);
d69ac27055a81d Martin KaFai Lau 2019-12-20  523  	}
d69ac27055a81d Martin KaFai Lau 2019-12-20  524  	st_map->st_ops = st_ops;
d69ac27055a81d Martin KaFai Lau 2019-12-20  525  	map = &st_map->map;
d69ac27055a81d Martin KaFai Lau 2019-12-20  526  
d69ac27055a81d Martin KaFai Lau 2019-12-20 @527  	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
d69ac27055a81d Martin KaFai Lau 2019-12-20 @528  	st_map->progs =
d69ac27055a81d Martin KaFai Lau 2019-12-20  529  		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
d69ac27055a81d Martin KaFai Lau 2019-12-20  530  				   NUMA_NO_NODE);
d69ac27055a81d Martin KaFai Lau 2019-12-20  531  	/* Each trampoline costs < 64 bytes.  Ensure one page
d69ac27055a81d Martin KaFai Lau 2019-12-20  532  	 * is enough for max number of func ptrs.
d69ac27055a81d Martin KaFai Lau 2019-12-20  533  	 */
d69ac27055a81d Martin KaFai Lau 2019-12-20  534  	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
d69ac27055a81d Martin KaFai Lau 2019-12-20  535  	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
d69ac27055a81d Martin KaFai Lau 2019-12-20  536  	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
d69ac27055a81d Martin KaFai Lau 2019-12-20  537  		bpf_struct_ops_map_free(map);
d69ac27055a81d Martin KaFai Lau 2019-12-20  538  		bpf_map_charge_finish(&mem);
d69ac27055a81d Martin KaFai Lau 2019-12-20  539  		return ERR_PTR(-ENOMEM);
d69ac27055a81d Martin KaFai Lau 2019-12-20  540  	}
d69ac27055a81d Martin KaFai Lau 2019-12-20  541  
d69ac27055a81d Martin KaFai Lau 2019-12-20  542  	spin_lock_init(&st_map->lock);
d69ac27055a81d Martin KaFai Lau 2019-12-20  543  	set_vm_flush_reset_perms(st_map->image);
d69ac27055a81d Martin KaFai Lau 2019-12-20  544  	set_memory_x((long)st_map->image, 1);
d69ac27055a81d Martin KaFai Lau 2019-12-20 @545  	bpf_map_init_from_attr(map, attr);
d69ac27055a81d Martin KaFai Lau 2019-12-20 @546  	bpf_map_charge_move(&map->memory, &mem);
d69ac27055a81d Martin KaFai Lau 2019-12-20  547  
d69ac27055a81d Martin KaFai Lau 2019-12-20  548  	return map;
d69ac27055a81d Martin KaFai Lau 2019-12-20  549  }
d69ac27055a81d Martin KaFai Lau 2019-12-20  550  

:::::: The code at line 515 was first introduced by commit
:::::: d69ac27055a81d26ee1bfe54b9655cf81ebd5ac9 bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

:::::: TO: Martin KaFai Lau <kafai@fb.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46481 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-24  7:01       ` Andrii Nakryiko
  2019-12-24  7:32         ` Martin Lau
@ 2019-12-24 16:50         ` Martin Lau
  2019-12-26 19:02           ` Andrii Nakryiko
  1 sibling, 1 reply; 45+ messages in thread
From: Martin Lau @ 2019-12-24 16:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > This patch adds a bpf_dctcp example.  It currently does not do
> > > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > > >  3 files changed, 656 insertions(+)
> > > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > new file mode 100644
> > > > index 000000000000..7ba8c1b4157a
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > @@ -0,0 +1,228 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef __BPF_TCP_HELPERS_H
> > > > +#define __BPF_TCP_HELPERS_H
> > > > +
> > > > +#include <stdbool.h>
> > > > +#include <linux/types.h>
> > > > +#include <bpf_helpers.h>
> > > > +#include <bpf_core_read.h>
> > > > +#include "bpf_trace_helpers.h"
> > > > +
> > > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > >
> > > Should we try to put those BPF programs into some section that would
> > > indicate they are used with struct opts? libbpf doesn't use or enforce
> > > that (even though it could to derive and enforce that they are
> > > STRUCT_OPS programs). So something like
> > > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > > convention is very useful for consistency and to do a quick ELF dump
> > > and see what is where. WDYT?
> > I did not use it here because I don't want any misperception that it is
> > a required convention by libbpf.
> >
> > Sure, I can prefix it here and comment that it is just a
> > convention but not a libbpf's requirement.
> 
> Well, we can actually make it a requirement of sorts. Currently your
> code expects that BPF program's type is UNSPEC and then it sets it to
> STRUCT_OPS. Alternatively we can say that any BPF program in
> SEC("struct_ops/<whatever>") will be automatically assigned
> STRUCT_OPTS BPF program type (which is done generically in
> bpf_object__open()), and then as .struct_ops section is parsed, all
> those programs will be "assembled" by the code you added into a
> struct_ops map.
Setting BPF_PROG_TYPE_STRUCT_OPS can be done automatically at open
phase (during collect_reloc time).  I will make this change.

> 
> It's a requirement "of sorts", because even if user doesn't do that,
> stuff will still work, if user manually will call
> bpf_program__set_struct_ops(prog). Which actually reminds me that it
> would be good to add bpf_program__set_struct_ops() and
Although there is the BPF_PROG_TYPE_FNS macro,
I don't see how exposing bpf_prog__set_struct_ops(prog) as a LIBBPF_API is
useful, while it may actually cause confusion and errors.  How would calling
__set_struct_ops() on a prog that is not used in SEC(".struct_ops") help it
be loaded successfully as a struct_ops prog?

Assigning a bpf_prog to a function ptr under the SEC(".struct_ops")
is the only way for a program to be successfully loaded as the
struct_ops prog type.  An extra way to allow a prog to be changed to
the struct_ops prog_type is either useless or redundant.

If it is really necessary to have __set_struct_ops() as an API
for completeness, it can be added...

> bpf_program__is_struct_ops() APIs for completeness, similarly to how
is_struct_ops() makes sense.

> BTW, libbpf will emit debug message for every single BPF program it
> doesn't recognize section for, so it is still nice to have it be
> something more or less standardized and recognizable by libbpf.
I can make this debug (not error) message go away too after
setting the BPF_PROG_TYPE_STRUCT_OPS automatically at open time.
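
For a rough illustration of the convention being discussed, here is a minimal
sketch of a struct_ops program defined through the BPF_TCP_OPS_1 helper quoted
above, together with the SEC(".struct_ops") variable that libbpf assembles
into a BPF_MAP_TYPE_STRUCT_OPS map.  The "sample" names are hypothetical, and
the mirrored struct tcp_congestion_ops is assumed to come from
bpf_tcp_helpers.h; this is not necessarily the final libbpf requirement.

	/* Sketch only: a hypothetical congestion-control op plus the
	 * SEC(".struct_ops") variable that ties the func ptr to the prog.
	 */
	#include "bpf_tcp_helpers.h"	/* assumed to mirror struct tcp_congestion_ops */

	BPF_TCP_OPS_1(sample_init, void, struct sock *, sk)
	{
		/* per-connection initialization would go here */
	}

	SEC(".struct_ops")
	struct tcp_congestion_ops sample = {
		.init	= (void *)sample_init,
		.name	= "bpf_sample",
	};

	char _license[] SEC("license") = "GPL";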

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-24 16:50         ` Martin Lau
@ 2019-12-26 19:02           ` Andrii Nakryiko
  2019-12-26 20:25             ` Martin Lau
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-26 19:02 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Tue, Dec 24, 2019 at 8:50 AM Martin Lau <kafai@fb.com> wrote:
>
> On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> > On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
> > >
> > > On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > This patch adds a bpf_dctcp example.  It currently does not do
> > > > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > > > >
> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > ---
> > > > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > > > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > > > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > > > >  3 files changed, 656 insertions(+)
> > > > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > > > >
> > > > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > new file mode 100644
> > > > > index 000000000000..7ba8c1b4157a
> > > > > --- /dev/null
> > > > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > @@ -0,0 +1,228 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > +#ifndef __BPF_TCP_HELPERS_H
> > > > > +#define __BPF_TCP_HELPERS_H
> > > > > +
> > > > > +#include <stdbool.h>
> > > > > +#include <linux/types.h>
> > > > > +#include <bpf_helpers.h>
> > > > > +#include <bpf_core_read.h>
> > > > > +#include "bpf_trace_helpers.h"
> > > > > +
> > > > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > >
> > > > Should we try to put those BPF programs into some section that would
> > > > indicate they are used with struct opts? libbpf doesn't use or enforce
> > > > that (even though it could to derive and enforce that they are
> > > > STRUCT_OPS programs). So something like
> > > > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > > > convention is very useful for consistency and to do a quick ELF dump
> > > > and see what is where. WDYT?
> > > I did not use it here because I don't want any misperception that it is
> > > a required convention by libbpf.
> > >
> > > Sure, I can prefix it here and comment that it is just a
> > > convention but not a libbpf's requirement.
> >
> > Well, we can actually make it a requirement of sorts. Currently your
> > code expects that BPF program's type is UNSPEC and then it sets it to
> > STRUCT_OPS. Alternatively we can say that any BPF program in
> > SEC("struct_ops/<whatever>") will be automatically assigned
> > STRUCT_OPTS BPF program type (which is done generically in
> > bpf_object__open()), and then as .struct_ops section is parsed, all
> > those programs will be "assembled" by the code you added into a
> > struct_ops map.
> Setting BPF_PROG_TYPE_STRUCT_OPS can be done automatically at open
> phase (during collect_reloc time).  I will make this change.
>

Can you please extend the existing logic in __bpf_object__open() to do
this? See how libbpf_prog_type_by_name() is used for that.

> >
> > It's a requirement "of sorts", because even if user doesn't do that,
> > stuff will still work, if user manually will call
> > bpf_program__set_struct_ops(prog). Which actually reminds me that it
> > would be good to add bpf_program__set_struct_ops() and
> Although there is BPF_PROG_TYPE_FNS macro,
> I don't see moving bpf_prog__set_struct_ops(prog) to LIBBPF_API is useful
> while actually may cause confusion and error.  How could __set_struct_ops()
> a prog to struct_ops prog_type help a program, which is not used in
> SEC(".struct_ops"), to be loaded successfully as a struct_ops prog?
>
> Assigning a bpf_prog to a function ptr under the SEC(".struct_ops")
> is the only way for a program to be successfully loaded as
> struct_ops prog type.  Extra way to allow a prog to be changed to
> struct_ops prog_type is either useless or redundant.

Well, first of all, just for consistency with everything else. We have
such methods for all prog_types, so I'd like to avoid a special
snowflake one that doesn't.

Second, while high-level libbpf API provides all the magic to
construct STRUCT_OPS map based on .struct_ops section types,
technically, user might decide to do that using low-level map creation
API, right? So not making unnecessary assumptions and providing
complete APIs is a good thing, IMO. Especially if it costs basically
nothing in terms of code and maintenance.

>
> If it is really necessary to have __set_struct_ops() as a API
> for completeness, it can be added...
>
> > bpf_program__is_struct_ops() APIs for completeness, similarly to how
> is_struct_ops() makes sense.
>
> > BTW, libbpf will emit debug message for every single BPF program it
> > doesn't recognize section for, so it is still nice to have it be
> > something more or less standardized and recognizable by libbpf.
> I can make this debug (not error) message go away too after
> setting the BPF_PROG_TYPE_STRUCT_OPS automatically at open time.

Yep, that would be great.
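
For reference, the two "for completeness" APIs mentioned above would
presumably follow the existing per-prog-type pattern in libbpf.h; a sketch
(declarations only, not the final form):

	/* Sketch of the completeness APIs being discussed, following the
	 * naming pattern of the existing per-prog-type helpers in libbpf.h.
	 */
	LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
	LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);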

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-26 19:02           ` Andrii Nakryiko
@ 2019-12-26 20:25             ` Martin Lau
  2019-12-26 20:48               ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Martin Lau @ 2019-12-26 20:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Thu, Dec 26, 2019 at 11:02:26AM -0800, Andrii Nakryiko wrote:
> On Tue, Dec 24, 2019 at 8:50 AM Martin Lau <kafai@fb.com> wrote:
> >
> > On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> > > On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
> > > >
> > > > On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > > > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > This patch adds a bpf_dctcp example.  It currently does not do
> > > > > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > > > > >
> > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > > ---
> > > > > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > > > > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > > > > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > > > > >  3 files changed, 656 insertions(+)
> > > > > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > > > > >
> > > > > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..7ba8c1b4157a
> > > > > > --- /dev/null
> > > > > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > @@ -0,0 +1,228 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > +#ifndef __BPF_TCP_HELPERS_H
> > > > > > +#define __BPF_TCP_HELPERS_H
> > > > > > +
> > > > > > +#include <stdbool.h>
> > > > > > +#include <linux/types.h>
> > > > > > +#include <bpf_helpers.h>
> > > > > > +#include <bpf_core_read.h>
> > > > > > +#include "bpf_trace_helpers.h"
> > > > > > +
> > > > > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > >
> > > > > Should we try to put those BPF programs into some section that would
> > > > > indicate they are used with struct opts? libbpf doesn't use or enforce
> > > > > that (even though it could to derive and enforce that they are
> > > > > STRUCT_OPS programs). So something like
> > > > > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > > > > convention is very useful for consistency and to do a quick ELF dump
> > > > > and see what is where. WDYT?
> > > > I did not use it here because I don't want any misperception that it is
> > > > a required convention by libbpf.
> > > >
> > > > Sure, I can prefix it here and comment that it is just a
> > > > convention but not a libbpf's requirement.
> > >
> > > Well, we can actually make it a requirement of sorts. Currently your
> > > code expects that BPF program's type is UNSPEC and then it sets it to
> > > STRUCT_OPS. Alternatively we can say that any BPF program in
> > > SEC("struct_ops/<whatever>") will be automatically assigned
> > > STRUCT_OPTS BPF program type (which is done generically in
> > > bpf_object__open()), and then as .struct_ops section is parsed, all
> > > those programs will be "assembled" by the code you added into a
> > > struct_ops map.
> > Setting BPF_PROG_TYPE_STRUCT_OPS can be done automatically at open
> > phase (during collect_reloc time).  I will make this change.
> >
> 
> Can you please extend exiting logic in __bpf_object__open() to do
> this? See how libbpf_prog_type_by_name() is used for that.
Does it have to call libbpf_prog_type_by_name() if everything
has already been decided by the earlier
bpf_object__collect_struct_ops_map_reloc()?

> 
> > >
> > > It's a requirement "of sorts", because even if user doesn't do that,
> > > stuff will still work, if user manually will call
> > > bpf_program__set_struct_ops(prog). Which actually reminds me that it
> > > would be good to add bpf_program__set_struct_ops() and
> > Although there is BPF_PROG_TYPE_FNS macro,
> > I don't see moving bpf_prog__set_struct_ops(prog) to LIBBPF_API is useful
> > while actually may cause confusion and error.  How could __set_struct_ops()
> > a prog to struct_ops prog_type help a program, which is not used in
> > SEC(".struct_ops"), to be loaded successfully as a struct_ops prog?
> >
> > Assigning a bpf_prog to a function ptr under the SEC(".struct_ops")
> > is the only way for a program to be successfully loaded as
> > struct_ops prog type.  Extra way to allow a prog to be changed to
> > struct_ops prog_type is either useless or redundant.
> 
> Well, first of all, just for consistency with everything else. We have
> such methods for all prog_types, so I'd like to avoid a special
> snowflake one that doesn't.
Yes, adding it for consistency is fine, as I mentioned in the earlier reply,
as long as its usefulness is understood.

> Second, while high-level libbpf API provides all the magic to
> construct STRUCT_OPS map based on .struct_ops section types,
> technically, user might decide to do that using low-level map creation
> API, right?
How?

It is correct that the map API is reused as-is in SEC(".struct_ops").

For a prog, AFAICT, it is not possible to create a struct_ops
prog from raw and use it in a struct_ops map unless more LIBBPF_API
is added.  Let's put aside the need to find the btf_vmlinux
and its btf-types...etc.  At least, there is no LIBBPF_API to
set prog->attach_btf_id.  Considering the amount of preparation
needed to create a struct_ops map from raw, I would like
to see a real use case first before even considering what else
is needed and adding another LIBBPF_API that may not be used.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-26 20:25             ` Martin Lau
@ 2019-12-26 20:48               ` Andrii Nakryiko
  2019-12-26 22:20                 ` Martin Lau
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-26 20:48 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Thu, Dec 26, 2019 at 12:25 PM Martin Lau <kafai@fb.com> wrote:
>
> On Thu, Dec 26, 2019 at 11:02:26AM -0800, Andrii Nakryiko wrote:
> > On Tue, Dec 24, 2019 at 8:50 AM Martin Lau <kafai@fb.com> wrote:
> > >
> > > On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> > > > On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > > > > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > >
> > > > > > > This patch adds a bpf_dctcp example.  It currently does not do
> > > > > > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > > > > > >
> > > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > > > ---
> > > > > > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > > > > > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > > > > > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > > > > > >  3 files changed, 656 insertions(+)
> > > > > > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > > > > > >
> > > > > > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..7ba8c1b4157a
> > > > > > > --- /dev/null
> > > > > > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > @@ -0,0 +1,228 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#ifndef __BPF_TCP_HELPERS_H
> > > > > > > +#define __BPF_TCP_HELPERS_H
> > > > > > > +
> > > > > > > +#include <stdbool.h>
> > > > > > > +#include <linux/types.h>
> > > > > > > +#include <bpf_helpers.h>
> > > > > > > +#include <bpf_core_read.h>
> > > > > > > +#include "bpf_trace_helpers.h"
> > > > > > > +
> > > > > > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > >
> > > > > > Should we try to put those BPF programs into some section that would
> > > > > > indicate they are used with struct opts? libbpf doesn't use or enforce
> > > > > > that (even though it could to derive and enforce that they are
> > > > > > STRUCT_OPS programs). So something like
> > > > > > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > > > > > convention is very useful for consistency and to do a quick ELF dump
> > > > > > and see what is where. WDYT?
> > > > > I did not use it here because I don't want any misperception that it is
> > > > > a required convention by libbpf.
> > > > >
> > > > > Sure, I can prefix it here and comment that it is just a
> > > > > convention but not a libbpf's requirement.
> > > >
> > > > Well, we can actually make it a requirement of sorts. Currently your
> > > > code expects that BPF program's type is UNSPEC and then it sets it to
> > > > STRUCT_OPS. Alternatively we can say that any BPF program in
> > > > SEC("struct_ops/<whatever>") will be automatically assigned
> > > > STRUCT_OPTS BPF program type (which is done generically in
> > > > bpf_object__open()), and then as .struct_ops section is parsed, all
> > > > those programs will be "assembled" by the code you added into a
> > > > struct_ops map.
> > > Setting BPF_PROG_TYPE_STRUCT_OPS can be done automatically at open
> > > phase (during collect_reloc time).  I will make this change.
> > >
> >
> > Can you please extend exiting logic in __bpf_object__open() to do
> > this? See how libbpf_prog_type_by_name() is used for that.
> Does it have to call libbpf_prog_type_by_name() if everything
> has already been decided by the earlier
> bpf_object__collect_struct_ops_map_reloc()?

We can certainly change the logic to omit guessing the program type if
it's already set to something other than UNSPEC.

But all I'm asking is that, instead of using the #fname"_sec" section name,
you use "struct_ops/"#fname, because it's consistent with all other
program types. If you do that, then you don't have to do anything
extra (well, add a single entry to section_defs, of course); it will
just work as is, as sketched below.
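
Concretely, a sketch of that change against the BPF_TCP_OPS_x macros quoted
above (only the section string changes; the higher-arity variants would
follow the same pattern):

	/* Sketch: emit a "struct_ops/"-prefixed section name from the helper
	 * macros instead of #fname"_sec".
	 */
	#define BPF_TCP_OPS_0(fname, ret_type, ...) \
		BPF_TRACE_x(0, "struct_ops/"#fname, fname, ret_type, __VA_ARGS__)
	#define BPF_TCP_OPS_1(fname, ret_type, ...) \
		BPF_TRACE_x(1, "struct_ops/"#fname, fname, ret_type, __VA_ARGS__)
	/* ... BPF_TCP_OPS_2 through BPF_TCP_OPS_5 likewise ... */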

>
> >
> > > >
> > > > It's a requirement "of sorts", because even if user doesn't do that,
> > > > stuff will still work, if user manually will call
> > > > bpf_program__set_struct_ops(prog). Which actually reminds me that it
> > > > would be good to add bpf_program__set_struct_ops() and
> > > Although there is BPF_PROG_TYPE_FNS macro,
> > > I don't see moving bpf_prog__set_struct_ops(prog) to LIBBPF_API is useful
> > > while actually may cause confusion and error.  How could __set_struct_ops()
> > > a prog to struct_ops prog_type help a program, which is not used in
> > > SEC(".struct_ops"), to be loaded successfully as a struct_ops prog?
> > >
> > > Assigning a bpf_prog to a function ptr under the SEC(".struct_ops")
> > > is the only way for a program to be successfully loaded as
> > > struct_ops prog type.  Extra way to allow a prog to be changed to
> > > struct_ops prog_type is either useless or redundant.
> >
> > Well, first of all, just for consistency with everything else. We have
> > such methods for all prog_types, so I'd like to avoid a special
> > snowflake one that doesn't.
> Yes, for consistency is fine as I mentioned in the earlier reply,
> as long as it is understood the usefulness of it.
>
> > Second, while high-level libbpf API provides all the magic to
> > construct STRUCT_OPS map based on .struct_ops section types,
> > technically, user might decide to do that using low-level map creation
> > API, right?
> How?
>
> Correct that the map api is reused as is in SEC(".struct_ops").
>
> For prog, AFAICT, it is not possible to create struct_ops
> prog from raw and use it in struct_ops map unless more LIBBPF_API
> is added.  Lets put aside the need to find the btf_vmlinux
> and its btf-types...etc.  At least, there is no LIBBPF_API to
> set prog->attach_btf_id.  Considering the amount of preparation
> is needed to create a struct_ops map from raw,  I would like
> to see a real use case first before even considering what else
> is needed and add another LIBBPF_API that may not be used.

To be clear, I don't think anyone in their right mind should do this
by hand. I'm just saying that in the end it's not magic, just calls to
low-level map APIs. See above, though; all I care about is a consistent
pattern for section names: "program_type/<whatever-makes-sense>".

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-26 20:48               ` Andrii Nakryiko
@ 2019-12-26 22:20                 ` Martin Lau
  2019-12-26 22:25                   ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Martin Lau @ 2019-12-26 22:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Thu, Dec 26, 2019 at 12:48:09PM -0800, Andrii Nakryiko wrote:
> On Thu, Dec 26, 2019 at 12:25 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Thu, Dec 26, 2019 at 11:02:26AM -0800, Andrii Nakryiko wrote:
> > > On Tue, Dec 24, 2019 at 8:50 AM Martin Lau <kafai@fb.com> wrote:
> > > >
> > > > On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> > > > > On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > > > > > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > > >
> > > > > > > > This patch adds a bpf_dctcp example.  It currently does not do
> > > > > > > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > > > > > > >
> > > > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > > > > ---
> > > > > > > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > > > > > > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > > > > > > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > > > > > > >  3 files changed, 656 insertions(+)
> > > > > > > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > > > > > > >
> > > > > > > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..7ba8c1b4157a
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > > @@ -0,0 +1,228 @@
> > > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > > +#ifndef __BPF_TCP_HELPERS_H
> > > > > > > > +#define __BPF_TCP_HELPERS_H
> > > > > > > > +
> > > > > > > > +#include <stdbool.h>
> > > > > > > > +#include <linux/types.h>
> > > > > > > > +#include <bpf_helpers.h>
> > > > > > > > +#include <bpf_core_read.h>
> > > > > > > > +#include "bpf_trace_helpers.h"
> > > > > > > > +
> > > > > > > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > >
> > > > > > > Should we try to put those BPF programs into some section that would
> > > > > > > indicate they are used with struct opts? libbpf doesn't use or enforce
> > > > > > > that (even though it could to derive and enforce that they are
> > > > > > > STRUCT_OPS programs). So something like
> > > > > > > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > > > > > > convention is very useful for consistency and to do a quick ELF dump
> > > > > > > and see what is where. WDYT?
> > > > > > I did not use it here because I don't want any misperception that it is
> > > > > > a required convention by libbpf.
> > > > > >
> > > > > > Sure, I can prefix it here and comment that it is just a
> > > > > > convention but not a libbpf's requirement.
> > > > >
> > > > > Well, we can actually make it a requirement of sorts. Currently your
> > > > > code expects that BPF program's type is UNSPEC and then it sets it to
> > > > > STRUCT_OPS. Alternatively we can say that any BPF program in
> > > > > SEC("struct_ops/<whatever>") will be automatically assigned
> > > > > STRUCT_OPTS BPF program type (which is done generically in
> > > > > bpf_object__open()), and then as .struct_ops section is parsed, all
> > > > > those programs will be "assembled" by the code you added into a
> > > > > struct_ops map.
> > > > Setting BPF_PROG_TYPE_STRUCT_OPS can be done automatically at open
> > > > phase (during collect_reloc time).  I will make this change.
> > > >
> > >
> > > Can you please extend exiting logic in __bpf_object__open() to do
> > > this? See how libbpf_prog_type_by_name() is used for that.
> > Does it have to call libbpf_prog_type_by_name() if everything
> > has already been decided by the earlier
> > bpf_object__collect_struct_ops_map_reloc()?
> 
> We can certainly change the logic to omit guessing program type if
> it's already set to something else than UNSPEC.
> 
> But all I'm asking is that instead of using #fname"_sec" section name,
> is to use "struct_ops/"#fname, because it's consistent with all other
> program types. If you do that, then you don't have to do anything
> extra (well, add single entry to section_defs, of course), it will
> just work as is.
Re: adding "struct_ops/" to section_defs,
Sure. as long as SEC(".struct_ops") can use prog that
libbpf_prog_type_by_name() concluded it is either -ESRCH or
STRUCT_OPS.

It is not the only change though.  Other changes are still
needed in collect_reloc (e.g. check prog type mismatch).
They won't be much though.
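
For illustration, the single section_defs entry referred to above might look
roughly like the following, assuming the table keeps its existing
BPF_PROG_SEC() form; the exact macro and any flags are up to the final patch.

	/* Illustrative only: one new entry in libbpf's section_defs[] table so
	 * that SEC("struct_ops/...") programs get BPF_PROG_TYPE_STRUCT_OPS.
	 */
	BPF_PROG_SEC("struct_ops", BPF_PROG_TYPE_STRUCT_OPS),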

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example
  2019-12-26 22:20                 ` Martin Lau
@ 2019-12-26 22:25                   ` Andrii Nakryiko
  0 siblings, 0 replies; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-26 22:25 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Thu, Dec 26, 2019 at 2:20 PM Martin Lau <kafai@fb.com> wrote:
>
> On Thu, Dec 26, 2019 at 12:48:09PM -0800, Andrii Nakryiko wrote:
> > On Thu, Dec 26, 2019 at 12:25 PM Martin Lau <kafai@fb.com> wrote:
> > >
> > > On Thu, Dec 26, 2019 at 11:02:26AM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Dec 24, 2019 at 8:50 AM Martin Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Mon, Dec 23, 2019 at 11:01:55PM -0800, Andrii Nakryiko wrote:
> > > > > > On Mon, Dec 23, 2019 at 5:31 PM Martin Lau <kafai@fb.com> wrote:
> > > > > > >
> > > > > > > On Mon, Dec 23, 2019 at 03:26:50PM -0800, Andrii Nakryiko wrote:
> > > > > > > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > > > >
> > > > > > > > > This patch adds a bpf_dctcp example.  It currently does not do
> > > > > > > > > no-ECN fallback but the same could be done through the cgrp2-bpf.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > > > > > ---
> > > > > > > > >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
> > > > > > > > >  .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 218 +++++++++++++++++
> > > > > > > > >  tools/testing/selftests/bpf/progs/bpf_dctcp.c | 210 ++++++++++++++++
> > > > > > > > >  3 files changed, 656 insertions(+)
> > > > > > > > >  create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> > > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c
> > > > > > > > >
> > > > > > > > > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > > > new file mode 100644
> > > > > > > > > index 000000000000..7ba8c1b4157a
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > > > > > > > > @@ -0,0 +1,228 @@
> > > > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > > > +#ifndef __BPF_TCP_HELPERS_H
> > > > > > > > > +#define __BPF_TCP_HELPERS_H
> > > > > > > > > +
> > > > > > > > > +#include <stdbool.h>
> > > > > > > > > +#include <linux/types.h>
> > > > > > > > > +#include <bpf_helpers.h>
> > > > > > > > > +#include <bpf_core_read.h>
> > > > > > > > > +#include "bpf_trace_helpers.h"
> > > > > > > > > +
> > > > > > > > > +#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > > +#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > > +#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > > +#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > > +#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > > > +#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
> > > > > > > >
> > > > > > > > Should we try to put those BPF programs into some section that would
> > > > > > > > indicate they are used with struct_ops? libbpf doesn't use or enforce
> > > > > > > > that (even though it could, to derive and enforce that they are
> > > > > > > > STRUCT_OPS programs). So something like
> > > > > > > > SEC("struct_ops/<ideally-operation-name-here>"). I think having this
> > > > > > > > convention is very useful for consistency and for doing a quick ELF dump
> > > > > > > > to see what is where. WDYT?
> > > > > > > I did not use it here because I don't want any misperception that it is
> > > > > > > a convention required by libbpf.
> > > > > > >
> > > > > > > Sure, I can prefix it here and comment that it is just a
> > > > > > > convention and not a libbpf requirement.
> > > > > >
> > > > > > Well, we can actually make it a requirement of sorts. Currently your
> > > > > > code expects that the BPF program's type is UNSPEC and then sets it to
> > > > > > STRUCT_OPS. Alternatively, we can say that any BPF program in
> > > > > > SEC("struct_ops/<whatever>") will automatically be assigned the
> > > > > > STRUCT_OPS BPF program type (which is done generically in
> > > > > > bpf_object__open()), and then, as the .struct_ops section is parsed, all
> > > > > > those programs will be "assembled" by the code you added into a
> > > > > > struct_ops map.
> > > > > Setting BPF_PROG_TYPE_STRUCT_OPS can be done automatically at open
> > > > > phase (during collect_reloc time).  I will make this change.
> > > > >
> > > >
> > > > Can you please extend the existing logic in __bpf_object__open() to do
> > > > this? See how libbpf_prog_type_by_name() is used for that.
> > > Does it have to call libbpf_prog_type_by_name() if everything
> > > has already been decided by the earlier
> > > bpf_object__collect_struct_ops_map_reloc()?
> >
> > We can certainly change the logic to skip guessing the program type if
> > it's already set to something other than UNSPEC.
> >
> > But all I'm asking is to use "struct_ops/"#fname instead of the
> > #fname"_sec" section name, because it's consistent with all other
> > program types. If you do that, then you don't have to do anything
> > extra (well, add a single entry to section_defs, of course); it will
> > just work as is.
> Re: adding "struct_ops/" to section_defs,
> Sure, as long as SEC(".struct_ops") can use a prog for which
> libbpf_prog_type_by_name() concluded either -ESRCH or
> STRUCT_OPS.
>
> It is not the only change though.  Other changes are still
> needed in collect_reloc (e.g. checking for a prog type mismatch).
> They won't be much though.

sounds good, thanks!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support
  2019-12-23 19:54   ` Andrii Nakryiko
@ 2019-12-26 22:47     ` Martin Lau
  0 siblings, 0 replies; 45+ messages in thread
From: Martin Lau @ 2019-12-26 22:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 11:54:17AM -0800, Andrii Nakryiko wrote:
> > -       for (i = 0; i < obj->nr_maps; i++)
> > +       for (i = 0; i < obj->nr_maps; i++) {
> >                 zclose(obj->maps[i].fd);
> > +               if (obj->maps[i].st_ops)
> > +                       zfree(&obj->maps[i].st_ops->kern_vdata);
> 
> any specific reason to deallocate only kern_vdata? maybe just
> consolidate all the cleanup in bpf_object__close() instead?
I think it is the same reason why the map fd is closed here.
kern_vdata is allocated at load time, so it seems more
logical to me to deallocate it as soon as possible, at unload time.
For example, if the user calls bpf_object__load_xattr()
and one of the load steps fails, bpf_object__unload()
(but not bpf_object__close()) is called before returning.
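
To make the lifetime concrete, here is a rough sketch of the unload path
being described (libbpf internals heavily simplified; only the struct_ops
lines come from the actual diff):

int bpf_object__unload(struct bpf_object *obj)
{
	size_t i;

	if (!obj)
		return -EINVAL;

	for (i = 0; i < obj->nr_maps; i++) {
		zclose(obj->maps[i].fd);
		if (obj->maps[i].st_ops)
			zfree(&obj->maps[i].st_ops->kern_vdata);
	}

	/* ... the loaded BPF programs are torn down here as well ... */
	return 0;
}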

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-23 19:57   ` Yonghong Song
  2019-12-23 21:44     ` Andrii Nakryiko
@ 2019-12-27  6:16     ` Martin Lau
  1 sibling, 0 replies; 45+ messages in thread
From: Martin Lau @ 2019-12-27  6:16 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Mon, Dec 23, 2019 at 11:57:48AM -0800, Yonghong Song wrote:
[ ... ]
> > +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> > +					  void *value, u64 flags)
> > +{
> > +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> > +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> > +	struct bpf_struct_ops_value *uvalue, *kvalue;
> > +	const struct btf_member *member;
> > +	const struct btf_type *t = st_ops->type;
> > +	void *udata, *kdata;
> > +	int prog_fd, err = 0;
> > +	void *image;
> > +	u32 i;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (*(u32 *)key != 0)
> > +		return -E2BIG;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)value;
> > +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> > +		return -EINVAL;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> > +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> > +
> > +	spin_lock(&st_map->lock);
> > +
> > +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> > +		err = -EBUSY;
> > +		goto unlock;
> > +	}
> > +
> > +	memcpy(uvalue, value, map->value_size);
> > +
> > +	udata = &uvalue->data;
> > +	kdata = &kvalue->data;
> > +	image = st_map->image;
> > +
> > +	for_each_member(i, t, member) {
> > +		const struct btf_type *mtype, *ptype;
> > +		struct bpf_prog *prog;
> > +		u32 moff;
> > +
> > +		moff = btf_member_bit_offset(t, member) / 8;
> > +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> > +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> > +		if (ptype == module_type) {
> > +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> > +			continue;
> > +		}
> > +
> > +		err = st_ops->init_member(t, member, kdata, udata);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		/* The ->init_member() has handled this member */
> > +		if (err > 0)
> > +			continue;
> > +
> > +		/* If st_ops->init_member does not handle it,
> > +		 * we will only handle func ptrs and zero-ed members
> > +		 * here.  Reject everything else.
> > +		 */
> > +
> > +		/* All non func ptr member must be 0 */
> > +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > +					       NULL)) {
> > +			u32 msize;
> > +
> > +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> > +						 &msize, NULL, NULL);
> > +			if (IS_ERR(mtype)) {
> > +				err = PTR_ERR(mtype);
> > +				goto reset_unlock;
> > +			}
> > +
> > +			if (memchr_inv(udata + moff, 0, msize)) {
> > +				err = -EINVAL;
> > +				goto reset_unlock;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> > +		/* Similar check as the attr->attach_prog_fd */
> > +		if (!prog_fd)
> > +			continue;
> > +
> > +		prog = bpf_prog_get(prog_fd);
> > +		if (IS_ERR(prog)) {
> > +			err = PTR_ERR(prog);
> > +			goto reset_unlock;
> > +		}
> > +		st_map->progs[i] = prog;
> > +
> > +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> > +		    prog->aux->attach_btf_id != st_ops->type_id ||
> > +		    prog->expected_attach_type != i) {
> > +			err = -EINVAL;
> > +			goto reset_unlock;
> > +		}
> > +
> > +		err = arch_prepare_bpf_trampoline(image,
> > +						  &st_ops->func_models[i], 0,
> > +						  &prog, 1, NULL, 0, NULL);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		*(void **)(kdata + moff) = image;
> > +		image += err;
> > +
> > +		/* put prog_id to udata */
> > +		*(unsigned long *)(udata + moff) = prog->aux->id;
> > +	}
> 
> Should we check whether the user indeed provided the `module` member or
> not before declaring success?
Hmm... by 'module', you meant the "if (ptype == module_type)" logic
at the beginning?  Yes, it should also check that the user passed 0.
Will fix.
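
A hedged sketch of how that extra check could look (not the final code):

		/* When the member is the module owner pointer, also require
		 * that the user-supplied value is 0 before filling in
		 * BPF_MODULE_OWNER on the kernel side.
		 */
		if (ptype == module_type) {
			if (*(void **)(udata + moff)) {
				err = -EINVAL;
				goto reset_unlock;
			}
			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
			continue;
		}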

> > +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> > +{
> > +	const struct bpf_struct_ops *st_ops;
> > +	size_t map_total_size, st_map_size;
> > +	struct bpf_struct_ops_map *st_map;
> > +	const struct btf_type *t, *vt;
> > +	struct bpf_map_memory mem;
> > +	struct bpf_map *map;
> > +	int err;
> > +
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return ERR_PTR(-EPERM);
> > +
> > +	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
> > +	if (!st_ops)
> > +		return ERR_PTR(-ENOTSUPP);
> > +
> > +	vt = st_ops->value_type;
> > +	if (attr->value_size != vt->size)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	t = st_ops->type;
> > +
> > +	st_map_size = sizeof(*st_map) +
> > +		/* kvalue stores the
> > +		 * struct bpf_struct_ops_tcp_congestion_ops
> > +		 */
> > +		(vt->size - sizeof(struct bpf_struct_ops_value));
> > +	map_total_size = st_map_size +
> > +		/* uvalue */
> > +		sizeof(vt->size) +
> > +		/* struct bpf_progs **progs */
> > +		 btf_type_vlen(t) * sizeof(struct bpf_prog *);
> > +	err = bpf_map_charge_init(&mem, map_total_size);
> > +	if (err < 0)
> > +		return ERR_PTR(err);
> > +
> > +	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
> > +	if (!st_map) {
> > +		bpf_map_charge_finish(&mem);
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +	st_map->st_ops = st_ops;
> > +	map = &st_map->map;
> > +
> > +	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
> > +	st_map->progs =
> > +		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
> > +				   NUMA_NO_NODE);
> > +	/* Each trampoline costs < 64 bytes.  Ensure one page
> > +	 * is enough for max number of func ptrs.
> > +	 */
> > +	BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS);
> 
> This may be true for x86 now, but it may not hold for other future
> architectures. Not sure whether we should get the value from arch
> callbacks, or just bail out during map update if we ever grow beyond
> one page.
I will reuse the existing WARN_ON_ONCE() check in arch_prepare_bpf_trampoline().
Need to parameterize the end-of-image (i.e. the current PAGE_SIZE / 2 assumption).
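
For illustration, a hedged sketch of how the update path might pass the
image end down to the arch code (the extra argument is an assumption about
a future signature, not the current one):

		/* Let the arch code know where the trampoline image ends so
		 * it can fail gracefully instead of depending only on the
		 * PAGE_SIZE / 2 assumption plus the BUILD_BUG_ON at
		 * map_alloc time.
		 */
		err = arch_prepare_bpf_trampoline(image,
						  st_map->image + PAGE_SIZE,
						  &st_ops->func_models[i], 0,
						  &prog, 1, NULL, 0, NULL);
		if (err < 0)
			goto reset_unlock;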


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-23 23:05   ` Andrii Nakryiko
@ 2019-12-28  1:47     ` Martin Lau
  2019-12-28  2:24       ` Andrii Nakryiko
  0 siblings, 1 reply; 45+ messages in thread
From: Martin Lau @ 2019-12-28  1:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > is a kernel struct with its func ptr implemented in bpf prog.
> > This new map is the interface to register/unregister/introspect
> > a bpf implemented kernel struct.
> >
> > The kernel struct is actually embedded inside another new struct
> > (or called the "value" struct in the code).  For example,
> > "struct tcp_congestion_ops" is embbeded in:
> > struct bpf_struct_ops_tcp_congestion_ops {
> >         refcount_t refcnt;
> >         enum bpf_struct_ops_state state;
> >         struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > }
> > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > The "bpftool map dump" will then be able to show the
> > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > is created automatically by a macro.  Having a separate "value" struct
> > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > initialization works before registering the struct_ops to the kernel
> > subsystem).  The libbpf will take care of finding and populating the
> > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> >
> > Register a struct_ops to a kernel subsystem:
> > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> >    set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> >    running kernel.
> >    Instead of reusing the attr->btf_value_type_id,
> >    btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
> >    used as the "user" btf which could store other useful sysadmin/debug
> >    info that may be introduced in the future,
> >    e.g. creation-date/compiler-details/map-creator...etc.
> > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> >    in the running kernel btf.  Populate the value of this object.
> >    The function ptr should be populated with the prog fds.
> > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> >    the map value.  The key is always "0".
> >
> > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > (e.g. ensure all mandatory func ptrs are implemented).
> > If everything looks good, it will register this kernel struct
> > to the kernel subsystem.  The map will not allow further update
> > from this point.
> >
> > Unregister a struct_ops from the kernel subsystem:
> > BPF_MAP_DELETE with key "0".
> >
> > Introspect a struct_ops:
> > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > have the prog _id_ populated as the func ptr.
> >
> > The map value state (enum bpf_struct_ops_state) will transit from:
> > INIT (map created) =>
> > INUSE (map updated, i.e. reg) =>
> > TOBEFREE (map value deleted, i.e. unreg)
> >
> > The kernel subsystem needs to call bpf_struct_ops_get() and
> > bpf_struct_ops_put() to manage the "refcnt" in the
> > "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
> >   for the purpose of tracking the subsystem usage.  Another approach
> > is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> > the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> > the map-fd/pinned-map usage.  However, that will also tie down the
> > future semantics of map->refcnt and map->usercnt.
> >
> > The very first subsystem's refcnt (during reg()) holds one
> > count to map->refcnt.  When the very last subsystem's refcnt
> > is gone, it will also release the map->refcnt.  All bpf_prog will be
> > freed when the map->refcnt reaches 0 (i.e. during map_free()).
> >
> > Here is what the bpftool map command will look like:
> > [root@arch-fb-vm1 bpf]# bpftool map show
> > 6: struct_ops  name dctcp  flags 0x0
> >         key 4B  value 256B  max_entries 1  memlock 4096B
> >         btf_id 6
> > [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> > [{
> >         "value": {
> >             "refcnt": {
> >                 "refs": {
> >                     "counter": 1
> >                 }
> >             },
> >             "state": 1,
> >             "data": {
> >                 "list": {
> >                     "next": 0,
> >                     "prev": 0
> >                 },
> >                 "key": 0,
> >                 "flags": 2,
> >                 "init": 24,
> >                 "release": 0,
> >                 "ssthresh": 25,
> >                 "cong_avoid": 30,
> >                 "set_state": 27,
> >                 "cwnd_event": 28,
> >                 "in_ack_event": 26,
> >                 "undo_cwnd": 29,
> >                 "pkts_acked": 0,
> >                 "min_tso_segs": 0,
> >                 "sndbuf_expand": 0,
> >                 "cong_control": 0,
> >                 "get_info": 0,
> >                 "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> >                 ],
> >                 "owner": 0
> >             }
> >         }
> >     }
> > ]
> >
> > Misc Notes:
> > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
> >   It does an in-place update on "*value" instead of returning a pointer
> >   to syscall.c.  Otherwise, it needs a separate copy of "zero" value
> >   for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
> >
> > * The bpf_struct_ops_map_delete_elem() is also called without
> >   preempt_disable() from map_delete_elem().  It is because
> >   the "->unreg()" may require a sleepable context, e.g.
> >   the "tcp_unregister_congestion_control()".
> >
> > * "const" is added to some of the existing "struct btf_func_model *"
> >   function arg to avoid a compiler warning caused by this patch.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> 
> LGTM! Few questions below to improve my understanding.
> 
> Acked-by: Andrii Nakryiko <andriin@fb.com>
> 
> >  arch/x86/net/bpf_jit_comp.c |  11 +-
> >  include/linux/bpf.h         |  49 +++-
> >  include/linux/bpf_types.h   |   3 +
> >  include/linux/btf.h         |  13 +
> >  include/uapi/linux/bpf.h    |   7 +-
> >  kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
> >  kernel/bpf/btf.c            |  20 +-
> >  kernel/bpf/map_in_map.c     |   3 +-
> >  kernel/bpf/syscall.c        |  49 ++--
> >  kernel/bpf/trampoline.c     |   5 +-
> >  kernel/bpf/verifier.c       |   5 +
> >  11 files changed, 593 insertions(+), 40 deletions(-)
> >
> 
> [...]
> 
> > +               /* All non func ptr member must be 0 */
> > +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > +                                              NULL)) {
> > +                       u32 msize;
> > +
> > +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> > +                                                &msize, NULL, NULL);
> > +                       if (IS_ERR(mtype)) {
> > +                               err = PTR_ERR(mtype);
> > +                               goto reset_unlock;
> > +                       }
> > +
> > +                       if (memchr_inv(udata + moff, 0, msize)) {
> 
> 
> just double-checking: we are ok with having non-zeroed padding in a
> struct, is that right?
Sorry for the delay.

You meant the end-padding of the kernel-side struct (i.e. kdata, or kvalue)
could be non-zero?  The btf struct size (i.e. vt->size) should include
the padding, and the whole vt->size is initialized to 0.

Or did you mean the user-passed-in udata (or uvalue)?

> 
> > +                               err = -EINVAL;
> > +                               goto reset_unlock;
> > +                       }
> > +
> > +                       continue;
> > +               }
> > +
> 
> [...]
> 
> > +
> > +               err = arch_prepare_bpf_trampoline(image,
> > +                                                 &st_ops->func_models[i], 0,
> > +                                                 &prog, 1, NULL, 0, NULL);
> > +               if (err < 0)
> > +                       goto reset_unlock;
> > +
> > +               *(void **)(kdata + moff) = image;
> > +               image += err;
> 
> are there any alignment requirements on the image pointer for the trampoline?
Not that I know of from reading arch_prepare_bpf_trampoline(),
which can also generate code to call multiple bpf_progs.

> 
> > +
> > +               /* put prog_id to udata */
> > +               *(unsigned long *)(udata + moff) = prog->aux->id;
> > +       }
> > +
> 
> [...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-28  1:47     ` Martin Lau
@ 2019-12-28  2:24       ` Andrii Nakryiko
  2019-12-28  5:16         ` Martin Lau
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2019-12-28  2:24 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 27, 2019 at 5:47 PM Martin Lau <kafai@fb.com> wrote:
>
> On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > > is a kernel struct with its func ptr implemented in bpf prog.
> > > This new map is the interface to register/unregister/introspect
> > > a bpf implemented kernel struct.
> > >
> > > The kernel struct is actually embedded inside another new struct
> > > (or called the "value" struct in the code).  For example,
> > > "struct tcp_congestion_ops" is embbeded in:
> > > struct bpf_struct_ops_tcp_congestion_ops {
> > >         refcount_t refcnt;
> > >         enum bpf_struct_ops_state state;
> > >         struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > > }
> > > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > > The "bpftool map dump" will then be able to show the
> > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > > is created automatically by a macro.  Having a separate "value" struct
> > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > > initialization works before registering the struct_ops to the kernel
> > > subsystem).  The libbpf will take care of finding and populating the
> > > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> > >
> > > Register a struct_ops to a kernel subsystem:
> > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> > >    set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> > >    running kernel.
> > >    Instead of reusing the attr->btf_value_type_id,
> > >    btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
> > >    used as the "user" btf which could store other useful sysadmin/debug
> > >    info that may be introduced in the future,
> > >    e.g. creation-date/compiler-details/map-creator...etc.
> > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> > >    in the running kernel btf.  Populate the value of this object.
> > >    The function ptr should be populated with the prog fds.
> > > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> > >    the map value.  The key is always "0".
> > >
> > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > > (e.g. ensure all mandatory func ptrs are implemented).
> > > If everything looks good, it will register this kernel struct
> > > to the kernel subsystem.  The map will not allow further update
> > > from this point.
> > >
> > > Unregister a struct_ops from the kernel subsystem:
> > > BPF_MAP_DELETE with key "0".
> > >
> > > Introspect a struct_ops:
> > > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > > have the prog _id_ populated as the func ptr.
> > >
> > > The map value state (enum bpf_struct_ops_state) will transit from:
> > > INIT (map created) =>
> > > INUSE (map updated, i.e. reg) =>
> > > TOBEFREE (map value deleted, i.e. unreg)
> > >
> > > The kernel subsystem needs to call bpf_struct_ops_get() and
> > > bpf_struct_ops_put() to manage the "refcnt" in the
> > > "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
> > >   for the purpose of tracking the subsystem usage.  Another approach
> > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> > > the map-fd/pinned-map usage.  However, that will also tie down the
> > > future semantics of map->refcnt and map->usercnt.
> > >
> > > The very first subsystem's refcnt (during reg()) holds one
> > > count to map->refcnt.  When the very last subsystem's refcnt
> > > is gone, it will also release the map->refcnt.  All bpf_prog will be
> > > freed when the map->refcnt reaches 0 (i.e. during map_free()).
> > >
> > > Here is what the bpftool map command will look like:
> > > [root@arch-fb-vm1 bpf]# bpftool map show
> > > 6: struct_ops  name dctcp  flags 0x0
> > >         key 4B  value 256B  max_entries 1  memlock 4096B
> > >         btf_id 6
> > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> > > [{
> > >         "value": {
> > >             "refcnt": {
> > >                 "refs": {
> > >                     "counter": 1
> > >                 }
> > >             },
> > >             "state": 1,
> > >             "data": {
> > >                 "list": {
> > >                     "next": 0,
> > >                     "prev": 0
> > >                 },
> > >                 "key": 0,
> > >                 "flags": 2,
> > >                 "init": 24,
> > >                 "release": 0,
> > >                 "ssthresh": 25,
> > >                 "cong_avoid": 30,
> > >                 "set_state": 27,
> > >                 "cwnd_event": 28,
> > >                 "in_ack_event": 26,
> > >                 "undo_cwnd": 29,
> > >                 "pkts_acked": 0,
> > >                 "min_tso_segs": 0,
> > >                 "sndbuf_expand": 0,
> > >                 "cong_control": 0,
> > >                 "get_info": 0,
> > >                 "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> > >                 ],
> > >                 "owner": 0
> > >             }
> > >         }
> > >     }
> > > ]
> > >
> > > Misc Notes:
> > > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
> > >   It does an in-place update on "*value" instead of returning a pointer
> > >   to syscall.c.  Otherwise, it needs a separate copy of "zero" value
> > >   for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
> > >
> > > * The bpf_struct_ops_map_delete_elem() is also called without
> > >   preempt_disable() from map_delete_elem().  It is because
> > >   the "->unreg()" may require a sleepable context, e.g.
> > >   the "tcp_unregister_congestion_control()".
> > >
> > > * "const" is added to some of the existing "struct btf_func_model *"
> > >   function arg to avoid a compiler warning caused by this patch.
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> >
> > LGTM! Few questions below to improve my understanding.
> >
> > Acked-by: Andrii Nakryiko <andriin@fb.com>
> >
> > >  arch/x86/net/bpf_jit_comp.c |  11 +-
> > >  include/linux/bpf.h         |  49 +++-
> > >  include/linux/bpf_types.h   |   3 +
> > >  include/linux/btf.h         |  13 +
> > >  include/uapi/linux/bpf.h    |   7 +-
> > >  kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
> > >  kernel/bpf/btf.c            |  20 +-
> > >  kernel/bpf/map_in_map.c     |   3 +-
> > >  kernel/bpf/syscall.c        |  49 ++--
> > >  kernel/bpf/trampoline.c     |   5 +-
> > >  kernel/bpf/verifier.c       |   5 +
> > >  11 files changed, 593 insertions(+), 40 deletions(-)
> > >
> >
> > [...]
> >
> > > +               /* All non func ptr member must be 0 */
> > > +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > > +                                              NULL)) {
> > > +                       u32 msize;
> > > +
> > > +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> > > +                                                &msize, NULL, NULL);
> > > +                       if (IS_ERR(mtype)) {
> > > +                               err = PTR_ERR(mtype);
> > > +                               goto reset_unlock;
> > > +                       }
> > > +
> > > +                       if (memchr_inv(udata + moff, 0, msize)) {
> >
> >
> > just double-checking: we are ok with having non-zeroed padding in a
> > struct, is that right?
> Sorry for the delay.
>
> You meant the end-padding of the kernel-side struct (i.e. kdata, or kvalue)
> could be non-zero?  The btf struct size (i.e. vt->size) should include
> the padding, and the whole vt->size is initialized to 0.
>
> Or did you mean the user-passed-in udata (or uvalue)?

The latter, udata. You check member-by-member, but if there is padding
between fields or at the end of the struct, nothing is currently
checking it for zeroes. So it is probably safer to check that padding
as well.

>
> >
> > > +                               err = -EINVAL;
> > > +                               goto reset_unlock;
> > > +                       }
> > > +
> > > +                       continue;
> > > +               }
> > > +
> >
> > [...]
> >
> > > +
> > > +               err = arch_prepare_bpf_trampoline(image,
> > > +                                                 &st_ops->func_models[i], 0,
> > > +                                                 &prog, 1, NULL, 0, NULL);
> > > +               if (err < 0)
> > > +                       goto reset_unlock;
> > > +
> > > +               *(void **)(kdata + moff) = image;
> > > +               image += err;
> >
> > are there any alignment requirements on the image pointer for the trampoline?
> Not that I know of from reading arch_prepare_bpf_trampoline(),
> which can also generate code to call multiple bpf_progs.

Yeah, I think you are right, at least for x86. Variable-sized x86
instructions pretty much rule out any alignment. It might not be the
case for architectures with fixed-size instructions, but we should cross
that bridge when we get to it.

>
> >
> > > +
> > > +               /* put prog_id to udata */
> > > +               *(unsigned long *)(udata + moff) = prog->aux->id;
> > > +       }
> > > +
> >
> > [...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-28  2:24       ` Andrii Nakryiko
@ 2019-12-28  5:16         ` Martin Lau
  0 siblings, 0 replies; 45+ messages in thread
From: Martin Lau @ 2019-12-28  5:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 27, 2019 at 06:24:41PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 27, 2019 at 5:47 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > > > is a kernel struct with its func ptr implemented in bpf prog.
> > > > This new map is the interface to register/unregister/introspect
> > > > a bpf implemented kernel struct.
> > > >
> > > > The kernel struct is actually embedded inside another new struct
> > > > (or called the "value" struct in the code).  For example,
> > > > "struct tcp_congestion_ops" is embbeded in:
> > > > struct bpf_struct_ops_tcp_congestion_ops {
> > > >         refcount_t refcnt;
> > > >         enum bpf_struct_ops_state state;
> > > >         struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > > > }
> > > > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > > > The "bpftool map dump" will then be able to show the
> > > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > > > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > > > is created automatically by a macro.  Having a separate "value" struct
> > > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > > > initialization works before registering the struct_ops to the kernel
> > > > subsystem).  The libbpf will take care of finding and populating the
> > > > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> > > >
> > > > Register a struct_ops to a kernel subsystem:
> > > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> > > >    set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> > > >    running kernel.
> > > >    Instead of reusing the attr->btf_value_type_id,
> > > >    btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
> > > >    used as the "user" btf which could store other useful sysadmin/debug
> > > >    info that may be introduced in the future,
> > > >    e.g. creation-date/compiler-details/map-creator...etc.
> > > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> > > >    in the running kernel btf.  Populate the value of this object.
> > > >    The function ptr should be populated with the prog fds.
> > > > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> > > >    the map value.  The key is always "0".
> > > >
> > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > > > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > > > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > > > (e.g. ensure all mandatory func ptrs are implemented).
> > > > If everything looks good, it will register this kernel struct
> > > > to the kernel subsystem.  The map will not allow further update
> > > > from this point.
> > > >
> > > > Unregister a struct_ops from the kernel subsystem:
> > > > BPF_MAP_DELETE with key "0".
> > > >
> > > > Introspect a struct_ops:
> > > > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > > > have the prog _id_ populated as the func ptr.
> > > >
> > > > The map value state (enum bpf_struct_ops_state) will transit from:
> > > > INIT (map created) =>
> > > > INUSE (map updated, i.e. reg) =>
> > > > TOBEFREE (map value deleted, i.e. unreg)
> > > >
> > > > The kernel subsystem needs to call bpf_struct_ops_get() and
> > > > bpf_struct_ops_put() to manage the "refcnt" in the
> > > > "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
> > > >   for the purpose of tracking the subsystem usage.  Another approach
> > > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
> > > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out
> > > > the map-fd/pinned-map usage.  However, that will also tie down the
> > > > future semantics of map->refcnt and map->usercnt.
> > > >
> > > > The very first subsystem's refcnt (during reg()) holds one
> > > > count to map->refcnt.  When the very last subsystem's refcnt
> > > > is gone, it will also release the map->refcnt.  All bpf_prog will be
> > > > freed when the map->refcnt reaches 0 (i.e. during map_free()).
> > > >
> > > > Here is what the bpftool map command will look like:
> > > > [root@arch-fb-vm1 bpf]# bpftool map show
> > > > 6: struct_ops  name dctcp  flags 0x0
> > > >         key 4B  value 256B  max_entries 1  memlock 4096B
> > > >         btf_id 6
> > > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> > > > [{
> > > >         "value": {
> > > >             "refcnt": {
> > > >                 "refs": {
> > > >                     "counter": 1
> > > >                 }
> > > >             },
> > > >             "state": 1,
> > > >             "data": {
> > > >                 "list": {
> > > >                     "next": 0,
> > > >                     "prev": 0
> > > >                 },
> > > >                 "key": 0,
> > > >                 "flags": 2,
> > > >                 "init": 24,
> > > >                 "release": 0,
> > > >                 "ssthresh": 25,
> > > >                 "cong_avoid": 30,
> > > >                 "set_state": 27,
> > > >                 "cwnd_event": 28,
> > > >                 "in_ack_event": 26,
> > > >                 "undo_cwnd": 29,
> > > >                 "pkts_acked": 0,
> > > >                 "min_tso_segs": 0,
> > > >                 "sndbuf_expand": 0,
> > > >                 "cong_control": 0,
> > > >                 "get_info": 0,
> > > >                 "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
> > > >                 ],
> > > >                 "owner": 0
> > > >             }
> > > >         }
> > > >     }
> > > > ]
> > > >
> > > > Misc Notes:
> > > > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
> > > >   It does an in-place update on "*value" instead of returning a pointer
> > > >   to syscall.c.  Otherwise, it needs a separate copy of "zero" value
> > > >   for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
> > > >
> > > > * The bpf_struct_ops_map_delete_elem() is also called without
> > > >   preempt_disable() from map_delete_elem().  It is because
> > > >   the "->unreg()" may require a sleepable context, e.g.
> > > >   the "tcp_unregister_congestion_control()".
> > > >
> > > > * "const" is added to some of the existing "struct btf_func_model *"
> > > >   function arg to avoid a compiler warning caused by this patch.
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > >
> > > LGTM! Few questions below to improve my understanding.
> > >
> > > Acked-by: Andrii Nakryiko <andriin@fb.com>
> > >
> > > >  arch/x86/net/bpf_jit_comp.c |  11 +-
> > > >  include/linux/bpf.h         |  49 +++-
> > > >  include/linux/bpf_types.h   |   3 +
> > > >  include/linux/btf.h         |  13 +
> > > >  include/uapi/linux/bpf.h    |   7 +-
> > > >  kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++-
> > > >  kernel/bpf/btf.c            |  20 +-
> > > >  kernel/bpf/map_in_map.c     |   3 +-
> > > >  kernel/bpf/syscall.c        |  49 ++--
> > > >  kernel/bpf/trampoline.c     |   5 +-
> > > >  kernel/bpf/verifier.c       |   5 +
> > > >  11 files changed, 593 insertions(+), 40 deletions(-)
> > > >
> > >
> > > [...]
> > >
> > > > +               /* All non func ptr member must be 0 */
> > > > +               if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > > > +                                              NULL)) {
> > > > +                       u32 msize;
> > > > +
> > > > +                       mtype = btf_resolve_size(btf_vmlinux, mtype,
> > > > +                                                &msize, NULL, NULL);
> > > > +                       if (IS_ERR(mtype)) {
> > > > +                               err = PTR_ERR(mtype);
> > > > +                               goto reset_unlock;
> > > > +                       }
> > > > +
> > > > +                       if (memchr_inv(udata + moff, 0, msize)) {
> > >
> > >
> > > just double-checking: we are ok with having non-zeroed padding in a
> > > struct, is that right?
> > Sorry for the delay.
> >
> > You meant the end-padding of the kernel-side struct (i.e. kdata, or kvalue)
> > could be non-zero?  The btf struct size (i.e. vt->size) should include
> > the padding, and the whole vt->size is initialized to 0.
> >
> > Or did you mean the user-passed-in udata (or uvalue)?
> 
> The latter, udata. You check member-by-member, but if there is padding
> between fields or at the end of the struct, nothing is currently
> checking it for zeroes. So it is probably safer to check that padding
> as well.
Agree. Will do.
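
A minimal sketch of what the extra check could look like (the helper name
is made up; only memchr_inv() is from the existing code):

/* Verify that the gap between the end of one member and the start of the
 * next one (or the end of the struct) is all zero in the user-supplied
 * value.
 */
static int check_zero_gap(const char *udata, u32 prev_end, u32 next_off)
{
	if (next_off > prev_end &&
	    memchr_inv(udata + prev_end, 0, next_off - prev_end))
		return -EINVAL;
	return 0;
}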

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-12-28  5:17 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-21  6:25 [PATCH bpf-next v2 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
2019-12-21  6:25 ` [PATCH bpf-next v2 01/11] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
2019-12-21  6:25 ` [PATCH bpf-next v2 02/11] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
2019-12-21  6:26 ` [PATCH bpf-next v2 03/11] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
2019-12-21  6:26 ` [PATCH bpf-next v2 04/11] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
2019-12-23  7:49   ` Yonghong Song
2019-12-23 20:05   ` Andrii Nakryiko
2019-12-23 21:21     ` Yonghong Song
2019-12-21  6:26 ` [PATCH bpf-next v2 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
2019-12-23 19:33   ` Yonghong Song
2019-12-23 20:29   ` Andrii Nakryiko
2019-12-23 22:29     ` Martin Lau
2019-12-23 22:55       ` Andrii Nakryiko
2019-12-24 11:46   ` kbuild test robot
2019-12-21  6:26 ` [PATCH bpf-next v2 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
2019-12-23 19:57   ` Yonghong Song
2019-12-23 21:44     ` Andrii Nakryiko
2019-12-23 22:15       ` Martin Lau
2019-12-27  6:16     ` Martin Lau
2019-12-23 23:05   ` Andrii Nakryiko
2019-12-28  1:47     ` Martin Lau
2019-12-28  2:24       ` Andrii Nakryiko
2019-12-28  5:16         ` Martin Lau
2019-12-24 12:28   ` kbuild test robot
2019-12-21  6:26 ` [PATCH bpf-next v2 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
2019-12-23 20:18   ` Yonghong Song
2019-12-23 23:20   ` Andrii Nakryiko
2019-12-24  7:16   ` kbuild test robot
2019-12-24 13:06   ` kbuild test robot
2019-12-21  6:26 ` [PATCH bpf-next v2 08/11] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
2019-12-21  6:26 ` [PATCH bpf-next v2 09/11] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
2019-12-21  6:26 ` [PATCH bpf-next v2 10/11] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
2019-12-23 19:54   ` Andrii Nakryiko
2019-12-26 22:47     ` Martin Lau
2019-12-21  6:26 ` [PATCH bpf-next v2 11/11] bpf: Add bpf_dctcp example Martin KaFai Lau
2019-12-23 23:26   ` Andrii Nakryiko
2019-12-24  1:31     ` Martin Lau
2019-12-24  7:01       ` Andrii Nakryiko
2019-12-24  7:32         ` Martin Lau
2019-12-24 16:50         ` Martin Lau
2019-12-26 19:02           ` Andrii Nakryiko
2019-12-26 20:25             ` Martin Lau
2019-12-26 20:48               ` Andrii Nakryiko
2019-12-26 22:20                 ` Martin Lau
2019-12-26 22:25                   ` Andrii Nakryiko
