* [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS
@ 2019-12-14  0:47 Martin KaFai Lau
  2019-12-14  0:47 ` [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
                   ` (13 more replies)
  0 siblings, 14 replies; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This series introduces BPF STRUCT_OPS.  It is an infrastructure that
allows implementing certain kernel function pointers in BPF.
The first use case included in this series is to implement a
TCP congestion control algorithm in BPF (i.e. implement
struct tcp_congestion_ops in BPF).

There have been attempts to move the TCP CC to user space
(e.g. CCP in TCP).  The common arguments are a faster turnaround,
getting away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been a continuous effort to bring the upsides of both kernel
and userspace together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular the BTF-aware verifier, BPF trampoline,
BPF CO-RE...) have made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.

The idea is to allow implementing tcp_congestion_ops in bpf.
It allows a faster turnaround for testing algorithms in
production while leveraging the existing (and continuously growing)
BPF features/framework instead of building one specifically for
userspace TCP CC.
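
To give a flavour of the end result, a BPF congestion control written
against the libbpf STRUCT_OPS support (patch 11) and the selftest
helpers (patches 12 and 13) looks roughly like the sketch below.  The
includes, the section-name convention and the way the u64-args ctx is
unpacked into a normal looking function signature are glossed over
here; please see the bpf_dctcp/bpf_cubic selftests in this series for
the real examples.

	SEC("struct_ops/sample_init")
	void sample_init(struct sock *sk)
	{
		/* set up private CC state hanging off the socket */
	}

	SEC("struct_ops/sample_cong_avoid")
	void sample_cong_avoid(struct sock *sk, __u32 ack, __u32 acked)
	{
		/* grow/shrink tp->snd_cwnd based on ack/acked */
	}

	SEC(".struct_ops")
	struct tcp_congestion_ops sample_cc = {
		.init		= (void *)sample_init,
		.cong_avoid	= (void *)sample_cong_avoid,
		.name		= "bpf_sample_cc",
	};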

Please see the individual patches for details.

The bpftool support will be posted in follow-up patches.

Martin KaFai Lau (13):
  bpf: Save PTR_TO_BTF_ID register state when spilling to stack
  bpf: Avoid storing modifier to info->btf_id
  bpf: Add enum support to btf_ctx_access()
  bpf: Support bitfield read access in btf_struct_access
  bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  bpf: tcp: Support tcp_congestion_ops in bpf
  bpf: Add BPF_FUNC_tcp_send_ack helper
  bpf: Add BPF_FUNC_jiffies
  bpf: Synch uapi bpf.h to tools/
  bpf: libbpf: Add STRUCT_OPS support
  bpf: Add bpf_dctcp example
  bpf: Add bpf_cubic example

 arch/x86/net/bpf_jit_comp.c                   |  10 +-
 include/linux/bpf.h                           |  80 ++-
 include/linux/bpf_types.h                     |   7 +
 include/linux/btf.h                           |  45 ++
 include/linux/filter.h                        |   2 +
 include/net/tcp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  33 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_struct_ops.c                   | 585 +++++++++++++++++
 kernel/bpf/bpf_struct_ops_types.h             |   9 +
 kernel/bpf/btf.c                              | 132 ++--
 kernel/bpf/core.c                             |   1 +
 kernel/bpf/helpers.c                          |  25 +
 kernel/bpf/map_in_map.c                       |   3 +-
 kernel/bpf/syscall.c                          |  64 +-
 kernel/bpf/trampoline.c                       |   5 +-
 kernel/bpf/verifier.c                         | 140 +++-
 net/core/filter.c                             |   4 +-
 net/ipv4/Makefile                             |   4 +
 net/ipv4/bpf_tcp_ca.c                         | 247 ++++++++
 net/ipv4/tcp_cong.c                           |  14 +-
 net/ipv4/tcp_ipv4.c                           |   6 +-
 net/ipv4/tcp_minisocks.c                      |   4 +-
 net/ipv4/tcp_output.c                         |   4 +-
 tools/include/uapi/linux/bpf.h                |  33 +-
 tools/lib/bpf/bpf.c                           |  10 +-
 tools/lib/bpf/bpf.h                           |   5 +-
 tools/lib/bpf/libbpf.c                        | 599 +++++++++++++++++-
 tools/lib/bpf/libbpf.h                        |   3 +-
 tools/lib/bpf/libbpf_probes.c                 |   2 +
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 +++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 220 +++++++
 tools/testing/selftests/bpf/progs/bpf_cubic.c | 502 +++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 194 ++++++
 34 files changed, 3089 insertions(+), 134 deletions(-)
 create mode 100644 kernel/bpf/bpf_struct_ops.c
 create mode 100644 kernel/bpf/bpf_struct_ops_types.h
 create mode 100644 net/ipv4/bpf_tcp_ca.c
 create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_cubic.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c

-- 
2.17.1



* [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-16 19:48   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch makes the verifier save the PTR_TO_BTF_ID register state when
spilling to the stack.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 034ef81f935b..408264c1d55b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1915,6 +1915,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_BTF_ID:
 		return true;
 	default:
 		return false;
-- 
2.17.1



* [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
  2019-12-14  0:47 ` [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-16 21:34   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

info->btf_id expects the btf_id of a struct, so it should
store the final result after skipping modifiers (if any).

This patch also takes the chance to add a missing newline in one of the
bpf_log() messages.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 7d40da240891..88359a4bccb0 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3696,7 +3696,6 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 
 	/* this is a pointer to another type */
 	info->reg_type = PTR_TO_BTF_ID;
-	info->btf_id = t->type;
 
 	if (tgt_prog) {
 		ret = btf_translate_to_vmlinux(log, btf, t, tgt_prog->type);
@@ -3707,10 +3706,14 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 			return false;
 		}
 	}
+
+	info->btf_id = t->type;
 	t = btf_type_by_id(btf, t->type);
 	/* skip modifiers */
-	while (btf_type_is_modifier(t))
+	while (btf_type_is_modifier(t)) {
+		info->btf_id = t->type;
 		t = btf_type_by_id(btf, t->type);
+	}
 	if (!btf_type_is_struct(t)) {
 		bpf_log(log,
 			"func '%s' arg%d type %s is not a struct\n",
@@ -3736,7 +3739,7 @@ int btf_struct_access(struct bpf_verifier_log *log,
 again:
 	tname = __btf_name_by_offset(btf_vmlinux, t->name_off);
 	if (!btf_type_is_struct(t)) {
-		bpf_log(log, "Type '%s' is not a struct", tname);
+		bpf_log(log, "Type '%s' is not a struct\n", tname);
 		return -EINVAL;
 	}
 
-- 
2.17.1



* [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access()
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
  2019-12-14  0:47 ` [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
  2019-12-14  0:47 ` [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-16 21:36   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This allows a bpf prog (e.g. tracing) to attach
to a kernel function that takes an enum argument.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 88359a4bccb0..6e652643849b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3676,7 +3676,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 	/* skip modifiers */
 	while (btf_type_is_modifier(t))
 		t = btf_type_by_id(btf, t->type);
-	if (btf_type_is_int(t))
+	if (btf_type_is_int(t) || btf_type_is_enum(t))
 		/* accessing a scalar */
 		return true;
 	if (!btf_type_is_ptr(t)) {
-- 
2.17.1



* [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-16 22:05   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch allows a bitfield to be read as a scalar.  It currently
limits the access to sizeof(u64) and up to the end of the struct.  It is
needed by a later bpf-tcp-cc example that reads bitfields from
inet_connection_sock and tcp_sock.
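
For illustration (the struct below is made up, not from the kernel):

	struct foo {
		__u32	a;	/* bytes 0-3 */
		__u32	b:3,	/* bitfields packed from byte 4 */
			c:29;
	};

With this patch, a 1, 2 or 4 byte read at offset 4 through a
PTR_TO_BTF_ID pointer to struct foo is accepted and marked as a
SCALAR_VALUE: the read starts at the bitfield member's byte offset
(off == moff), the member's bit offset is byte aligned, the size is
at most sizeof(u64), and the read does not go past the end of the
struct.  A read that does not start at the member's byte offset does
not take this new path.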

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 6e652643849b..011194831499 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3744,10 +3744,6 @@ int btf_struct_access(struct bpf_verifier_log *log,
 	}
 
 	for_each_member(i, t, member) {
-		if (btf_member_bitfield_size(t, member))
-			/* bitfields are not supported yet */
-			continue;
-
 		/* offset of the field in bytes */
 		moff = btf_member_bit_offset(t, member) / 8;
 		if (off + size <= moff)
@@ -3757,6 +3753,15 @@ int btf_struct_access(struct bpf_verifier_log *log,
 		if (off < moff)
 			continue;
 
+		if (btf_member_bitfield_size(t, member)) {
+			if (off == moff &&
+			    !(btf_member_bit_offset(t, member) % 8) &&
+			    size <= sizeof(u64) &&
+			    off + size <= t->size)
+				return SCALAR_VALUE;
+			continue;
+		}
+
 		/* type of the field */
 		mtype = btf_type_by_id(btf_vmlinux, member->type);
 		mname = __btf_name_by_offset(btf_vmlinux, member->name_off);
-- 
2.17.1



* [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-17  6:14   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch allows the kernel's struct ops (i.e. func ptr) to be
implemented in BPF.  The first use case in this series is the
"struct tcp_congestion_ops", which will be introduced in a
later patch.

This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
of a kernel struct.  The attr->expected_attach_type is the member
"index" of that kernel struct.  The first member of a struct starts
with member index 0.  That will avoid ambiguity when a kernel struct
has multiple func ptrs with the same func signature.

For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
to implement the "init" func ptr of the "struct tcp_congestion_ops".
The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
of the _running_ kernel.  The attr->expected_attach_type is 3.
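
In terms of the bpf_attr, loading such a prog looks roughly like the
sketch below (includes and error handling omitted; ptr_to_u64(), the
insns/insn_cnt variables and the btf id lookup are assumed to have
been prepared elsewhere, e.g. from the running kernel's BTF):

	union bpf_attr attr = {};
	int prog_fd;

	attr.prog_type		  = BPF_PROG_TYPE_STRUCT_OPS;
	attr.insns		  = ptr_to_u64(insns);
	attr.insn_cnt		  = insn_cnt;
	attr.license		  = ptr_to_u64("GPL");
	/* btf id of "struct tcp_congestion_ops" in the running kernel */
	attr.attach_btf_id	  = tcp_congestion_ops_btf_id;
	/* member index of "init" in "struct tcp_congestion_ops" */
	attr.expected_attach_type = 3;

	prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));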

The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
by arch_prepare_bpf_trampoline.  That part will be done in the next
patch, which introduces BPF_MAP_TYPE_STRUCT_OPS.

"struct bpf_struct_ops" is introduced as a common interface for the kernel
struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
struct will need to implement an instance of the "struct bpf_struct_ops".

The supporting kernel struct also needs to implement a bpf_verifier_ops.
During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
bpf_verifier_ops by searching the attr->attach_btf_id.
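
As a rough sketch (the real tcp_congestion_ops support only comes in a
later patch of this series, so the names and callback bodies below are
placeholders), a supporting subsystem would provide something like:

	static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
		/* .get_func_proto / .is_valid_access / .btf_struct_access */
	};

	static int bpf_tcp_ca_init(struct btf *btf)
	{
		/* e.g. look up btf ids that the verifier_ops will need */
		return 0;
	}

	static int bpf_tcp_ca_check_member(const struct btf_type *t,
					   const struct btf_member *member)
	{
		/* reject members that cannot be implemented in BPF yet */
		return 0;
	}

	/* the bpf_##name convention matches the BPF_STRUCT_OPS_TYPE()
	 * macro; "tcp_congestion_ops" must be the kernel struct's name
	 * so that bpf_struct_ops_init() can find it in btf_vmlinux
	 */
	struct bpf_struct_ops bpf_tcp_congestion_ops = {
		.verifier_ops	= &bpf_tcp_ca_verifier_ops,
		.init		= bpf_tcp_ca_init,
		.check_member	= bpf_tcp_ca_check_member,
		.name		= "tcp_congestion_ops",
	};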

A new "btf_struct_access" is also added to the bpf_verifier_ops such
that the supporting kernel struct can optionally provide its own specific
check on accessing the func arg (e.g. provide limited write access).

After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
to initialize some values (e.g. the btf id of the supporting kernel
struct) and it can only be done once the btf_vmlinux is available.

The R0 checks at BPF_EXIT are skipped for the BPF_PROG_TYPE_STRUCT_OPS
prog if the return type of the prog->aux->attach_func_proto is "void".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h               |  30 +++++++
 include/linux/bpf_types.h         |   4 +
 include/linux/btf.h               |  34 ++++++++
 include/uapi/linux/bpf.h          |   1 +
 kernel/bpf/Makefile               |   2 +-
 kernel/bpf/bpf_struct_ops.c       | 124 +++++++++++++++++++++++++++
 kernel/bpf/bpf_struct_ops_types.h |   4 +
 kernel/bpf/btf.c                  |  88 ++++++++++++++------
 kernel/bpf/syscall.c              |  17 ++--
 kernel/bpf/verifier.c             | 134 +++++++++++++++++++++++-------
 10 files changed, 374 insertions(+), 64 deletions(-)
 create mode 100644 kernel/bpf/bpf_struct_ops.c
 create mode 100644 kernel/bpf/bpf_struct_ops_types.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d467983e61bb..1f0a5fc8c5ee 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -349,6 +349,10 @@ struct bpf_verifier_ops {
 				  const struct bpf_insn *src,
 				  struct bpf_insn *dst,
 				  struct bpf_prog *prog, u32 *target_size);
+	int (*btf_struct_access)(struct bpf_verifier_log *log,
+				 const struct btf_type *t, int off, int size,
+				 enum bpf_access_type atype,
+				 u32 *next_btf_id);
 };
 
 struct bpf_prog_offload_ops {
@@ -667,6 +671,32 @@ struct bpf_array_aux {
 	struct work_struct work;
 };
 
+struct btf_type;
+struct btf_member;
+
+#define BPF_STRUCT_OPS_MAX_NR_MEMBERS 64
+struct bpf_struct_ops {
+	const struct bpf_verifier_ops *verifier_ops;
+	int (*init)(struct btf *_btf_vmlinux);
+	int (*check_member)(const struct btf_type *t,
+			    const struct btf_member *member);
+	const struct btf_type *type;
+	const char *name;
+	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
+	u32 type_id;
+};
+
+#if defined(CONFIG_BPF_JIT)
+const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
+void bpf_struct_ops_init(struct btf *_btf_vmlinux);
+#else
+static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
+{
+	return NULL;
+}
+static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
+#endif
+
 struct bpf_array {
 	struct bpf_map map;
 	u32 elem_size;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 93740b3614d7..fadd243ffa2d 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -65,6 +65,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
 	      struct sk_reuseport_md, struct sk_reuseport_kern)
 #endif
+#if defined(CONFIG_BPF_JIT)
+BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
+	      void *, void *)
+#endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 79d4abc2556a..f74a09a7120b 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -53,6 +53,18 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 bool btf_type_is_void(const struct btf_type *t);
+s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
+const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
+					       u32 id, u32 *res_id);
+const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
+					    u32 id, u32 *res_id);
+const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
+						 u32 id, u32 *res_id);
+
+#define for_each_member(i, struct_type, member)			\
+	for (i = 0, member = btf_type_member(struct_type);	\
+	     i < btf_type_vlen(struct_type);			\
+	     i++, member++)
 
 static inline bool btf_type_is_ptr(const struct btf_type *t)
 {
@@ -84,6 +96,28 @@ static inline bool btf_type_is_func_proto(const struct btf_type *t)
 	return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC_PROTO;
 }
 
+static inline u16 btf_type_vlen(const struct btf_type *t)
+{
+	return BTF_INFO_VLEN(t->info);
+}
+
+static inline bool btf_type_kflag(const struct btf_type *t)
+{
+	return BTF_INFO_KFLAG(t->info);
+}
+
+static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
+					   const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset)
+					   : 0;
+}
+
+static inline const struct btf_member *btf_type_member(const struct btf_type *t)
+{
+	return (const struct btf_member *)(t + 1);
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
 const char *btf_name_by_offset(const struct btf *btf, u32 offset);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dbbcf0b02970..12900dfa1461 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -174,6 +174,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 	BPF_PROG_TYPE_TRACING,
+	BPF_PROG_TYPE_STRUCT_OPS,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index d4f330351f87..0e636387db6f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
-obj-$(CONFIG_BPF_JIT) += trampoline.o
+obj-$(CONFIG_BPF_JIT) += trampoline.o bpf_struct_ops.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
new file mode 100644
index 000000000000..817d5aac42e5
--- /dev/null
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2019 Facebook
+ */
+
+#include <linux/bpf.h>
+#include <linux/bpf_verifier.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <linux/slab.h>
+#include <linux/numa.h>
+#include <linux/seq_file.h>
+#include <linux/refcount.h>
+
+#define BPF_STRUCT_OPS_TYPE(_name)				\
+extern struct bpf_struct_ops bpf_##_name;
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+enum {
+#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name,
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+	__NR_BPF_STRUCT_OPS_TYPE,
+};
+
+static struct bpf_struct_ops * const bpf_struct_ops[] = {
+#define BPF_STRUCT_OPS_TYPE(_name)				\
+	[BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name,
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+};
+
+const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
+};
+
+const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
+};
+
+void bpf_struct_ops_init(struct btf *_btf_vmlinux)
+{
+	const struct btf_member *member;
+	struct bpf_struct_ops *st_ops;
+	struct bpf_verifier_log log = {};
+	const struct btf_type *t;
+	const char *mname;
+	s32 type_id;
+	u32 i, j;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		st_ops = bpf_struct_ops[i];
+
+		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
+						BTF_KIND_STRUCT);
+		if (type_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				st_ops->name);
+			continue;
+		}
+		t = btf_type_by_id(_btf_vmlinux, type_id);
+		if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
+			pr_warn("Cannot support #%u members in struct %s\n",
+				btf_type_vlen(t), st_ops->name);
+			continue;
+		}
+
+		for_each_member(j, t, member) {
+			const struct btf_type *func_proto;
+
+			mname = btf_name_by_offset(_btf_vmlinux,
+						   member->name_off);
+			if (!*mname) {
+				pr_warn("anon member in struct %s is not supported\n",
+					st_ops->name);
+				break;
+			}
+
+			if (btf_member_bitfield_size(t, member)) {
+				pr_warn("bit field member %s in struct %s is not supported\n",
+					mname, st_ops->name);
+				break;
+			}
+
+			func_proto = btf_type_resolve_func_ptr(_btf_vmlinux,
+							       member->type,
+							       NULL);
+			if (func_proto &&
+			    btf_distill_func_proto(&log, _btf_vmlinux,
+						   func_proto, mname,
+						   &st_ops->func_models[j])) {
+				pr_warn("Error in parsing func ptr %s in struct %s\n",
+					mname, st_ops->name);
+				break;
+			}
+		}
+
+		if (j == btf_type_vlen(t)) {
+			if (st_ops->init(_btf_vmlinux)) {
+				pr_warn("Error in init bpf_struct_ops %s\n",
+					st_ops->name);
+			} else {
+				st_ops->type_id = type_id;
+				st_ops->type = t;
+			}
+		}
+	}
+}
+
+extern struct btf *btf_vmlinux;
+
+const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
+{
+	unsigned int i;
+
+	if (!type_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->type_id == type_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
new file mode 100644
index 000000000000..7bb13ff49ec2
--- /dev/null
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -0,0 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* internal file - do not include directly */
+
+/* To be filled in a later patch */
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 011194831499..16924e5fa126 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -180,11 +180,6 @@
  */
 #define BTF_MAX_SIZE (16 * 1024 * 1024)
 
-#define for_each_member(i, struct_type, member)			\
-	for (i = 0, member = btf_type_member(struct_type);	\
-	     i < btf_type_vlen(struct_type);			\
-	     i++, member++)
-
 #define for_each_member_from(i, from, struct_type, member)		\
 	for (i = from, member = btf_type_member(struct_type) + from;	\
 	     i < btf_type_vlen(struct_type);				\
@@ -382,6 +377,65 @@ static bool btf_type_is_datasec(const struct btf_type *t)
 	return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
 }
 
+s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
+{
+	const struct btf_type *t;
+	const char *tname;
+	u32 i;
+
+	for (i = 1; i <= btf->nr_types; i++) {
+		t = btf->types[i];
+		if (BTF_INFO_KIND(t->info) != kind)
+			continue;
+
+		tname = btf_name_by_offset(btf, t->name_off);
+		if (!strcmp(tname, name))
+			return i;
+	}
+
+	return -ENOENT;
+}
+
+const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
+					       u32 id, u32 *res_id)
+{
+	const struct btf_type *t = btf_type_by_id(btf, id);
+
+	while (btf_type_is_modifier(t)) {
+		id = t->type;
+		t = btf_type_by_id(btf, t->type);
+	}
+
+	if (res_id)
+		*res_id = id;
+
+	return t;
+}
+
+const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
+					    u32 id, u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = btf_type_skip_modifiers(btf, id, NULL);
+	if (!btf_type_is_ptr(t))
+		return NULL;
+
+	return btf_type_skip_modifiers(btf, t->type, res_id);
+}
+
+const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
+						 u32 id, u32 *res_id)
+{
+	const struct btf_type *ptype;
+
+	ptype = btf_type_resolve_ptr(btf, id, res_id);
+	if (ptype && btf_type_is_func_proto(ptype))
+		return ptype;
+
+	return NULL;
+}
+
 /* Types that act only as a source, not sink or intermediate
  * type when resolving.
  */
@@ -446,16 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding)
 		return "UNKN";
 }
 
-static u16 btf_type_vlen(const struct btf_type *t)
-{
-	return BTF_INFO_VLEN(t->info);
-}
-
-static bool btf_type_kflag(const struct btf_type *t)
-{
-	return BTF_INFO_KFLAG(t->info);
-}
-
 static u32 btf_member_bit_offset(const struct btf_type *struct_type,
 			     const struct btf_member *member)
 {
@@ -463,13 +507,6 @@ static u32 btf_member_bit_offset(const struct btf_type *struct_type,
 					   : member->offset;
 }
 
-static u32 btf_member_bitfield_size(const struct btf_type *struct_type,
-				    const struct btf_member *member)
-{
-	return btf_type_kflag(struct_type) ? BTF_MEMBER_BITFIELD_SIZE(member->offset)
-					   : 0;
-}
-
 static u32 btf_type_int(const struct btf_type *t)
 {
 	return *(u32 *)(t + 1);
@@ -480,11 +517,6 @@ static const struct btf_array *btf_type_array(const struct btf_type *t)
 	return (const struct btf_array *)(t + 1);
 }
 
-static const struct btf_member *btf_type_member(const struct btf_type *t)
-{
-	return (const struct btf_member *)(t + 1);
-}
-
 static const struct btf_enum *btf_type_enum(const struct btf_type *t)
 {
 	return (const struct btf_enum *)(t + 1);
@@ -3604,6 +3636,8 @@ struct btf *btf_parse_vmlinux(void)
 		goto errout;
 	}
 
+	bpf_struct_ops_init(btf);
+
 	btf_verifier_env_free(env);
 	refcount_set(&btf->refcnt, 1);
 	return btf;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b08c362f4e02..19b2d57f7c04 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1672,17 +1672,22 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 			   enum bpf_attach_type expected_attach_type,
 			   u32 btf_id, u32 prog_fd)
 {
-	switch (prog_type) {
-	case BPF_PROG_TYPE_TRACING:
+	if (btf_id) {
 		if (btf_id > BTF_MAX_TYPE)
 			return -EINVAL;
-		break;
-	default:
-		if (btf_id || prog_fd)
+
+		switch (prog_type) {
+		case BPF_PROG_TYPE_TRACING:
+		case BPF_PROG_TYPE_STRUCT_OPS:
+			break;
+		default:
 			return -EINVAL;
-		break;
+		}
 	}
 
+	if (prog_fd && prog_type != BPF_PROG_TYPE_TRACING)
+		return -EINVAL;
+
 	switch (prog_type) {
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 		switch (expected_attach_type) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 408264c1d55b..4c1eaa1a2965 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2858,11 +2858,6 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	u32 btf_id;
 	int ret;
 
-	if (atype != BPF_READ) {
-		verbose(env, "only read is supported\n");
-		return -EACCES;
-	}
-
 	if (off < 0) {
 		verbose(env,
 			"R%d is ptr_%s invalid negative access: off=%d\n",
@@ -2879,17 +2874,32 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 		return -EACCES;
 	}
 
-	ret = btf_struct_access(&env->log, t, off, size, atype, &btf_id);
+	if (env->ops->btf_struct_access) {
+		ret = env->ops->btf_struct_access(&env->log, t, off, size,
+						  atype, &btf_id);
+	} else {
+		if (atype != BPF_READ) {
+			verbose(env, "only read is supported\n");
+			return -EACCES;
+		}
+
+		ret = btf_struct_access(&env->log, t, off, size, atype,
+					&btf_id);
+	}
+
 	if (ret < 0)
 		return ret;
 
-	if (ret == SCALAR_VALUE) {
-		mark_reg_unknown(env, regs, value_regno);
-		return 0;
+	if (atype == BPF_READ) {
+		if (ret == SCALAR_VALUE) {
+			mark_reg_unknown(env, regs, value_regno);
+			return 0;
+		}
+		mark_reg_known_zero(env, regs, value_regno);
+		regs[value_regno].type = PTR_TO_BTF_ID;
+		regs[value_regno].btf_id = btf_id;
 	}
-	mark_reg_known_zero(env, regs, value_regno);
-	regs[value_regno].type = PTR_TO_BTF_ID;
-	regs[value_regno].btf_id = btf_id;
+
 	return 0;
 }
 
@@ -6343,8 +6353,30 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 static int check_return_code(struct bpf_verifier_env *env)
 {
 	struct tnum enforce_attach_type_range = tnum_unknown;
+	const struct bpf_prog *prog = env->prog;
 	struct bpf_reg_state *reg;
 	struct tnum range = tnum_range(0, 1);
+	int err;
+
+	/* The struct_ops func-ptr's return type could be "void" */
+	if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
+	    !prog->aux->attach_func_proto->type)
+		return 0;
+
+	/* eBPF calling convetion is such that R0 is used
+	 * to return the value from eBPF program.
+	 * Make sure that it's readable at this time
+	 * of bpf_exit, which means that program wrote
+	 * something into it earlier
+	 */
+	err = check_reg_arg(env, BPF_REG_0, SRC_OP);
+	if (err)
+		return err;
+
+	if (is_pointer_value(env, BPF_REG_0)) {
+		verbose(env, "R0 leaks addr as return value\n");
+		return -EACCES;
+	}
 
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
@@ -8010,21 +8042,6 @@ static int do_check(struct bpf_verifier_env *env)
 				if (err)
 					return err;
 
-				/* eBPF calling convetion is such that R0 is used
-				 * to return the value from eBPF program.
-				 * Make sure that it's readable at this time
-				 * of bpf_exit, which means that program wrote
-				 * something into it earlier
-				 */
-				err = check_reg_arg(env, BPF_REG_0, SRC_OP);
-				if (err)
-					return err;
-
-				if (is_pointer_value(env, BPF_REG_0)) {
-					verbose(env, "R0 leaks addr as return value\n");
-					return -EACCES;
-				}
-
 				err = check_return_code(env);
 				if (err)
 					return err;
@@ -8833,12 +8850,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
 			break;
 		case PTR_TO_BTF_ID:
-			if (type == BPF_WRITE) {
+			if (type == BPF_READ) {
+				insn->code = BPF_LDX | BPF_PROBE_MEM |
+					BPF_SIZE((insn)->code);
+				env->prog->aux->num_exentries++;
+			} else if (env->prog->type != BPF_PROG_TYPE_STRUCT_OPS) {
 				verbose(env, "Writes through BTF pointers are not allowed\n");
 				return -EINVAL;
 			}
-			insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code);
-			env->prog->aux->num_exentries++;
 			continue;
 		default:
 			continue;
@@ -9505,6 +9524,58 @@ static void print_verification_stats(struct bpf_verifier_env *env)
 		env->peak_states, env->longest_mark_read_walk);
 }
 
+static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
+{
+	const struct btf_type *t, *func_proto;
+	const struct bpf_struct_ops *st_ops;
+	const struct btf_member *member;
+	struct bpf_prog *prog = env->prog;
+	u32 btf_id, member_idx;
+	const char *mname;
+
+	btf_id = prog->aux->attach_btf_id;
+	st_ops = bpf_struct_ops_find(btf_id);
+	if (!st_ops) {
+		verbose(env, "attach_btf_id %u is not a supported struct\n",
+			btf_id);
+		return -ENOTSUPP;
+	}
+
+	t = st_ops->type;
+	member_idx = prog->expected_attach_type;
+	if (member_idx >= btf_type_vlen(t)) {
+		verbose(env, "attach to invalid member idx %u of struct %s\n",
+			member_idx, st_ops->name);
+		return -EINVAL;
+	}
+
+	member = &btf_type_member(t)[member_idx];
+	mname = btf_name_by_offset(btf_vmlinux, member->name_off);
+	func_proto = btf_type_resolve_func_ptr(btf_vmlinux, member->type,
+					       NULL);
+	if (!func_proto) {
+		verbose(env, "attach to invalid member %s(@idx %u) of struct %s\n",
+			mname, member_idx, st_ops->name);
+		return -EINVAL;
+	}
+
+	if (st_ops->check_member) {
+		int err = st_ops->check_member(t, member);
+
+		if (err) {
+			verbose(env, "attach to unsupported member %s of struct %s\n",
+				mname, st_ops->name);
+			return err;
+		}
+	}
+
+	prog->aux->attach_func_proto = func_proto;
+	prog->aux->attach_func_name = mname;
+	env->ops = st_ops->verifier_ops;
+
+	return 0;
+}
+
 static int check_attach_btf_id(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog;
@@ -9520,6 +9591,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	long addr;
 	u64 key;
 
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
+		return check_struct_ops_btf_id(env);
+
 	if (prog->type != BPF_PROG_TYPE_TRACING)
 		return 0;
 
-- 
2.17.1



* [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-17  7:48   ` [Potential Spoof] " Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct with its func ptrs implemented in bpf progs.
This new map is the interface to register/unregister/introspect
a bpf-implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(or called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embedded in:
struct __bpf_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
};
The map value is "struct __bpf_tcp_congestion_ops".  The "bpftool map dump"
will then be able to show the state ("inuse"/"tobefree") and the
subsystem's refcnt (e.g. the number of tcp_sock in the
tcp_congestion_ops case).  This "value" struct is created automatically
by a macro.  Having a separate "value" struct will also make extending
"struct __bpf_XYZ" easier (e.g. adding "void (*init)(void)" to
"struct __bpf_XYZ" to do some initialization
work before registering the struct_ops to the kernel subsystem).
The libbpf will take care of finding and populating the "struct __bpf_XYZ"
from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
   set to the btf id "struct __bpf_tcp_congestion_ops" of the running
   kernel.
   Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id
   is added such that attr->btf_fd can still be used as the "user" btf
   which could store other useful sysadmin/debug info that may be
   introduced in the future,
   e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct __bpf_tcp_congestion_ops" object as described in
   the running kernel btf.  Populate the value of this object.
   The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
   the map value.  The key is always "0".

During BPF_MAP_UPDATE, the trampoline code that saves the kernel
func ptr's args as an array of u64 is generated.  BPF_MAP_UPDATE also
allows the specific struct_ops to do some final checks in
"st_ops->init_member()" (e.g. ensure all mandatory func ptrs are
implemented).  If everything looks good, it will register this kernel
struct to the kernel subsystem.  The map will not allow further
updates from this point.
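
A minimal userspace sketch of steps (2) and (4) above (includes and
error handling omitted; value_type_id, value_size and value are
assumed to have been prepared from the running kernel's BTF as
described in step (3), with the func ptr members of the embedded
kernel struct holding the prog fds from step (1)):

	union bpf_attr attr = {};
	__u32 zero = 0;
	int map_fd;

	/* (2) create the struct_ops map */
	attr.map_type			= BPF_MAP_TYPE_STRUCT_OPS;
	attr.key_size			= sizeof(__u32);
	attr.value_size			= value_size;
	attr.max_entries		= 1;
	attr.btf_vmlinux_value_type_id	= value_type_id;
	map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));

	/* (4) registering is an update of key 0, which ends up
	 * calling st_ops->reg() on success
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd	= map_fd;
	attr.key	= (__u64)(unsigned long)&zero;
	attr.value	= (__u64)(unsigned long)value;
	syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));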

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

Note that the above state is not exposed to the uapi/bpf.h.
It will be obtained from the btf of the running kernel.

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the "struct __bpf_XYZ".
This patch uses a separate refcnt for the purpose of tracking the
subsystem usage.  Another approach is to reuse the map->refcnt
and then "show" (i.e. during map_lookup) the subsystem's usage
by doing map->refcnt - map->usercnt to filter out the
map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt.  When the very last subsystem's refcnt
is gone, it will also release the map->refcnt.  All bpf_progs will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).

Here is how the bpftool map command will look like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
	key 4B  value 256B  max_entries 1  memlock 4096B
	btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an in-place update on "*value" instead of returning a pointer
  to syscall.c.  Otherwise, it would need a separate copy of the "zero"
  value for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

* The bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  This is because
  the "->unreg()" may require a sleepable context, e.g.
  the "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function args to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 arch/x86/net/bpf_jit_comp.c |  10 +-
 include/linux/bpf.h         |  49 +++-
 include/linux/bpf_types.h   |   3 +
 include/linux/btf.h         |  11 +
 include/uapi/linux/bpf.h    |   7 +-
 kernel/bpf/bpf_struct_ops.c | 465 +++++++++++++++++++++++++++++++++++-
 kernel/bpf/btf.c            |  20 +-
 kernel/bpf/map_in_map.c     |   3 +-
 kernel/bpf/syscall.c        |  47 ++--
 kernel/bpf/trampoline.c     |   5 +-
 kernel/bpf/verifier.c       |   5 +
 11 files changed, 585 insertions(+), 40 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4c8a2d1f8470..0b9b486432bd 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1328,7 +1328,7 @@ xadd:			if (is_imm8(insn->off))
 	return proglen;
 }
 
-static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 		      int stack_size)
 {
 	int i;
@@ -1344,7 +1344,7 @@ static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 			 int stack_size)
 {
 	int i;
@@ -1361,7 +1361,7 @@ static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 		      struct bpf_prog **progs, int prog_cnt, int stack_size)
 {
 	u8 *prog = *pprog;
@@ -1456,7 +1456,7 @@ static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
  * add rsp, 8                      // skip eth_type_trans's frame
  * ret                             // return to its caller
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image, const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call)
@@ -1529,7 +1529,7 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	 */
 	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
 		return -EFAULT;
-	return 0;
+	return (void *)prog - image;
 }
 
 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1f0a5fc8c5ee..349cedd7b97b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -17,6 +17,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
 #include <linux/mutex.h>
+#include <linux/module.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -106,6 +107,7 @@ struct bpf_map {
 	struct btf *btf;
 	struct bpf_map_memory memory;
 	char name[BPF_OBJ_NAME_LEN];
+	u32 btf_vmlinux_value_type_id;
 	bool unpriv_array;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 22 bytes hole */
@@ -183,7 +185,8 @@ static inline bool bpf_map_offload_neutral(const struct bpf_map *map)
 
 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
-	return map->btf && map->ops->map_seq_show_elem;
+	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
+		map->ops->map_seq_show_elem;
 }
 
 int map_check_no_btf(const struct bpf_map *map,
@@ -441,7 +444,8 @@ struct btf_func_model {
  *      fentry = a set of program to run before calling original function
  *      fexit = a set of program to run after original function
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call);
@@ -671,6 +675,7 @@ struct bpf_array_aux {
 	struct work_struct work;
 };
 
+struct bpf_struct_ops_value;
 struct btf_type;
 struct btf_member;
 
@@ -680,21 +685,61 @@ struct bpf_struct_ops {
 	int (*init)(struct btf *_btf_vmlinux);
 	int (*check_member)(const struct btf_type *t,
 			    const struct btf_member *member);
+	int (*init_member)(const struct btf_type *t,
+			   const struct btf_member *member,
+			   void *kdata, const void *udata);
+	int (*reg)(void *kdata);
+	void (*unreg)(void *kdata);
 	const struct btf_type *type;
+	const struct btf_type *value_type;
 	const char *name;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
 	u32 type_id;
+	u32 value_id;
 };
 
 #if defined(CONFIG_BPF_JIT)
+#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *_btf_vmlinux);
+bool bpf_struct_ops_get(const void *kdata);
+void bpf_struct_ops_put(const void *kdata);
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		return bpf_struct_ops_get(data);
+	else
+		return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		bpf_struct_ops_put(data);
+	else
+		module_put(owner);
+}
 #else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	return NULL;
 }
 static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	module_put(owner);
+}
+static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
+						     void *key,
+						     void *value)
+{
+	return -EINVAL;
+}
 #endif
 
 struct bpf_array {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index fadd243ffa2d..9f326e6ef885 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#if defined(CONFIG_BPF_JIT)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
+#endif
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f74a09a7120b..49094564f1f1 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -60,6 +60,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
 					    u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
 						 u32 id, u32 *res_id);
+const struct btf_type *
+btf_resolve_size(const struct btf *btf, const struct btf_type *type,
+		 u32 *type_size, const struct btf_type **elem_type,
+		 u32 *total_nelems);
 
 #define for_each_member(i, struct_type, member)			\
 	for (i = 0, member = btf_type_member(struct_type);	\
@@ -106,6 +110,13 @@ static inline bool btf_type_kflag(const struct btf_type *t)
 	return BTF_INFO_KFLAG(t->info);
 }
 
+static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
+					const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
+					   : member->offset;
+}
+
 static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
 					   const struct btf_member *member)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 12900dfa1461..8809212d9d6c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -136,6 +136,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };
 
 /* Note that tracing related programs such as
@@ -392,6 +393,10 @@ union bpf_attr {
 		__u32	btf_fd;		/* fd pointing to a BTF type data */
 		__u32	btf_key_type_id;	/* BTF type_id of the key */
 		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -3340,7 +3345,7 @@ struct bpf_map_info {
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 817d5aac42e5..00f49ac1342d 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -12,8 +12,68 @@
 #include <linux/seq_file.h>
 #include <linux/refcount.h>
 
+enum bpf_struct_ops_state {
+	BPF_STRUCT_OPS_STATE_INIT,
+	BPF_STRUCT_OPS_STATE_INUSE,
+	BPF_STRUCT_OPS_STATE_TOBEFREE,
+};
+
+#define BPF_STRUCT_OPS_COMMON_VALUE			\
+	refcount_t refcnt;				\
+	enum bpf_struct_ops_state state
+
+struct bpf_struct_ops_value {
+	BPF_STRUCT_OPS_COMMON_VALUE;
+	char data[0] ____cacheline_aligned_in_smp;
+};
+
+struct bpf_struct_ops_map {
+	struct bpf_map map;
+	const struct bpf_struct_ops *st_ops;
+	/* protect map_update */
+	spinlock_t lock;
+	/* progs has all the bpf_prog that is populated
+	 * to the func ptr of the kernel's struct
+	 * (in kvalue.data).
+	 */
+	struct bpf_prog **progs;
+	/* image is a page that has all the trampolines
+	 * that stores the func args before calling the bpf_prog.
+	 * A PAGE_SIZE "image" is enough to store all trampoline for
+	 * "progs[]".
+	 */
+	void *image;
+	/* uvalue->data stores the kernel struct
+	 * (e.g. tcp_congestion_ops) that is more useful
+	 * to userspace than the kvalue.  For example,
+	 * the bpf_prog's id is stored instead of the kernel
+	 * address of a func ptr.
+	 */
+	struct bpf_struct_ops_value *uvalue;
+	/* kvalue.data stores the actual kernel's struct
+	 * (e.g. tcp_congestion_ops) that will be
+	 * registered to the kernel subsystem.
+	 */
+	struct bpf_struct_ops_value kvalue;
+};
+
+#define VALUE_PREFIX "__bpf_"
+#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
+
+/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
+ * exposed to the userspace and its btf-type-id is stored
+ * at the map->btf_vmlinux_value_type_id.
+ *
+ * The *_name##_dummy is to ensure the BTF type is emitted.
+ */
+
 #define BPF_STRUCT_OPS_TYPE(_name)				\
-extern struct bpf_struct_ops bpf_##_name;
+extern struct bpf_struct_ops bpf_##_name;			\
+								\
+static struct __bpf_##_name {					\
+	BPF_STRUCT_OPS_COMMON_VALUE;				\
+	struct _name data ____cacheline_aligned_in_smp;		\
+} *_name##_dummy;
 #include "bpf_struct_ops_types.h"
 #undef BPF_STRUCT_OPS_TYPE
 
@@ -37,19 +97,46 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
 const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
 };
 
+static const struct btf_type *module_type;
+
 void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 {
+	char value_name[128] = VALUE_PREFIX;
+	s32 type_id, value_id, module_id;
 	const struct btf_member *member;
 	struct bpf_struct_ops *st_ops;
 	struct bpf_verifier_log log = {};
 	const struct btf_type *t;
 	const char *mname;
-	s32 type_id;
 	u32 i, j;
 
+	/* Avoid unused var compiler warning */
+#define BPF_STRUCT_OPS_TYPE(_name) (void)(_name##_dummy);
+#include "bpf_struct_ops_types.h"
+#undef BPF_STRUCT_OPS_TYPE
+
+	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
+					  BTF_KIND_STRUCT);
+	if (module_id < 0) {
+		pr_warn("Cannot find struct module in btf_vmlinux\n");
+		return;
+	}
+	module_type = btf_type_by_id(_btf_vmlinux, module_id);
+
 	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
 		st_ops = bpf_struct_ops[i];
 
+		value_name[VALUE_PREFIX_LEN] = '\0';
+		strncat(value_name + VALUE_PREFIX_LEN, st_ops->name,
+			sizeof(value_name) - VALUE_PREFIX_LEN - 1);
+		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
+						 BTF_KIND_STRUCT);
+		if (value_id < 0) {
+			pr_warn("Cannot find struct %s in btf_vmlinux\n",
+				value_name);
+			continue;
+		}
+
 		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
 						BTF_KIND_STRUCT);
 		if (type_id < 0) {
@@ -101,6 +188,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 			} else {
 				st_ops->type_id = type_id;
 				st_ops->type = t;
+				st_ops->value_id = value_id;
+				st_ops->value_type =
+					btf_type_by_id(_btf_vmlinux, value_id);
 			}
 		}
 	}
@@ -108,6 +198,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
 
 extern struct btf *btf_vmlinux;
 
+static const struct bpf_struct_ops *
+bpf_struct_ops_find_value(u32 value_id)
+{
+	unsigned int i;
+
+	if (!value_id || !btf_vmlinux)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
+		if (bpf_struct_ops[i]->value_id == value_id)
+			return bpf_struct_ops[i];
+	}
+
+	return NULL;
+}
+
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	unsigned int i;
@@ -122,3 +228,358 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 
 	return NULL;
 }
+
+static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
+					   void *next_key)
+{
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = (u32 *)next_key;
+
+	if (index >= map->max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == map->max_entries - 1)
+		return -ENOENT;
+
+	*next = index + 1;
+	return 0;
+}
+
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	enum bpf_struct_ops_state state;
+
+	if (unlikely(*(u32 *)key != 0))
+		return -ENOENT;
+
+	kvalue = &st_map->kvalue;
+	state = smp_load_acquire(&kvalue->state);
+	if (state == BPF_STRUCT_OPS_STATE_INIT) {
+		memset(value, 0, map->value_size);
+		return 0;
+	}
+
+	/* No lock is needed.  state and refcnt do not need
+	 * to be updated together under atomic context.
+	 */
+	uvalue = (struct bpf_struct_ops_value *)value;
+	memcpy(uvalue, st_map->uvalue, map->value_size);
+	uvalue->state = state;
+	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
+
+	return 0;
+}
+
+static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
+{
+	const struct btf_type *t = st_map->st_ops->type;
+	u32 i;
+
+	for (i = 0; i < btf_type_vlen(t); i++) {
+		if (st_map->progs[i]) {
+			bpf_prog_put(st_map->progs[i]);
+			st_map->progs[i] = NULL;
+		}
+	}
+}
+
+static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
+					  void *value, u64 flags)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+	const struct bpf_struct_ops *st_ops = st_map->st_ops;
+	struct bpf_struct_ops_value *uvalue, *kvalue;
+	const struct btf_member *member;
+	const struct btf_type *t = st_ops->type;
+	void *udata, *kdata;
+	int prog_fd, err = 0;
+	void *image;
+	u32 i;
+
+	if (flags)
+		return -EINVAL;
+
+	if (*(u32 *)key != 0)
+		return -E2BIG;
+
+	uvalue = (struct bpf_struct_ops_value *)value;
+	if (uvalue->state || refcount_read(&uvalue->refcnt))
+		return -EINVAL;
+
+	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
+	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
+
+	spin_lock(&st_map->lock);
+
+	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
+		err = -EBUSY;
+		goto unlock;
+	}
+
+	memcpy(uvalue, value, map->value_size);
+
+	udata = &uvalue->data;
+	kdata = &kvalue->data;
+	image = st_map->image;
+
+	for_each_member(i, t, member) {
+		const struct btf_type *mtype, *ptype;
+		struct bpf_prog *prog;
+		u32 moff;
+
+		moff = btf_member_bit_offset(t, member) / 8;
+		mtype = btf_type_by_id(btf_vmlinux, member->type);
+		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
+		if (ptype == module_type) {
+			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
+			continue;
+		}
+
+		err = st_ops->init_member(t, member, kdata, udata);
+		if (err < 0)
+			goto reset_unlock;
+
+		/* The ->init_member() has handled this member */
+		if (err > 0)
+			continue;
+
+		/* If st_ops->init_member does not handle it,
+		 * we will only handle func ptrs and zero-ed members
+		 * here.  Reject everything else.
+		 */
+
+		/* All non func ptr member must be 0 */
+		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
+					       NULL)) {
+			u32 msize;
+
+			mtype = btf_resolve_size(btf_vmlinux, mtype,
+						 &msize, NULL, NULL);
+			if (IS_ERR(mtype)) {
+				err = PTR_ERR(mtype);
+				goto reset_unlock;
+			}
+
+			if (memchr_inv(udata + moff, 0, msize)) {
+				err = -EINVAL;
+				goto reset_unlock;
+			}
+
+			continue;
+		}
+
+		prog_fd = (int)(*(unsigned long *)(udata + moff));
+		/* Similar check as the attr->attach_prog_fd */
+		if (!prog_fd)
+			continue;
+
+		prog = bpf_prog_get(prog_fd);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto reset_unlock;
+		}
+		st_map->progs[i] = prog;
+
+		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
+		    prog->aux->attach_btf_id != st_ops->type_id ||
+		    prog->expected_attach_type != i) {
+			err = -EINVAL;
+			goto reset_unlock;
+		}
+
+		err = arch_prepare_bpf_trampoline(image,
+						  &st_ops->func_models[i], 0,
+						  &prog, 1, NULL, 0, NULL);
+		if (err < 0)
+			goto reset_unlock;
+
+		*(void **)(kdata + moff) = image;
+		image += err;
+
+		/* put prog_id to udata */
+		*(unsigned long *)(udata + moff) = prog->aux->id;
+	}
+
+	refcount_set(&kvalue->refcnt, 1);
+	bpf_map_inc(map);
+
+	err = st_ops->reg(kdata);
+	if (!err) {
+		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
+		goto unlock;
+	}
+
+	/* Error during st_ops->reg() */
+	bpf_map_put(map);
+
+reset_unlock:
+	bpf_struct_ops_map_put_progs(st_map);
+	memset(uvalue, 0, map->value_size);
+	memset(kvalue, 0, map->value_size);
+
+unlock:
+	spin_unlock(&st_map->lock);
+	return err;
+}
+
+static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
+{
+	enum bpf_struct_ops_state prev_state;
+	struct bpf_struct_ops_map *st_map;
+
+	st_map = (struct bpf_struct_ops_map *)map;
+	prev_state = cmpxchg(&st_map->kvalue.state,
+			     BPF_STRUCT_OPS_STATE_INUSE,
+			     BPF_STRUCT_OPS_STATE_TOBEFREE);
+	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
+		st_map->st_ops->unreg(&st_map->kvalue.data);
+		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
+			bpf_map_put(map);
+	}
+
+	return 0;
+}
+
+static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
+					     struct seq_file *m)
+{
+	void *value;
+
+	value = bpf_struct_ops_map_lookup_elem(map, key);
+	if (!value)
+		return;
+
+	btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id,
+			  value, m);
+	seq_puts(m, "\n");
+}
+
+static void bpf_struct_ops_map_free(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	if (st_map->progs)
+		bpf_struct_ops_map_put_progs(st_map);
+	bpf_map_area_free(st_map->progs);
+	bpf_jit_free_exec(st_map->image);
+	bpf_map_area_free(st_map->uvalue);
+	bpf_map_area_free(st_map);
+}
+
+static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
+{
+	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
+	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
+		return -EINVAL;
+	return 0;
+}
+
+static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
+{
+	const struct bpf_struct_ops *st_ops;
+	size_t map_total_size, st_map_size;
+	struct bpf_struct_ops_map *st_map;
+	const struct btf_type *t, *vt;
+	struct bpf_map_memory mem;
+	struct bpf_map *map;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
+	if (!st_ops)
+		return ERR_PTR(-ENOTSUPP);
+
+	vt = st_ops->value_type;
+	if (attr->value_size != vt->size)
+		return ERR_PTR(-EINVAL);
+
+	t = st_ops->type;
+
+	st_map_size = sizeof(*st_map) +
+		/* kvalue stores the struct __bpf_tcp_congestion_ops */
+		(vt->size - sizeof(struct bpf_struct_ops_value));
+	map_total_size = st_map_size +
+		/* uvalue */
+		vt->size +
+		/* struct bpf_progs **progs */
+		 btf_type_vlen(t) * sizeof(struct bpf_prog *);
+	err = bpf_map_charge_init(&mem, map_total_size);
+	if (err < 0)
+		return ERR_PTR(err);
+
+	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	if (!st_map) {
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+	st_map->st_ops = st_ops;
+	map = &st_map->map;
+
+	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->progs =
+		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *),
+				   NUMA_NO_NODE);
+	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!st_map->uvalue || !st_map->progs || !st_map->image) {
+		bpf_struct_ops_map_free(map);
+		bpf_map_charge_finish(&mem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	spin_lock_init(&st_map->lock);
+	set_vm_flush_reset_perms(st_map->image);
+	set_memory_x((long)st_map->image, 1);
+	bpf_map_init_from_attr(map, attr);
+	bpf_map_charge_move(&map->memory, &mem);
+
+	return map;
+}
+
+const struct bpf_map_ops bpf_struct_ops_map_ops = {
+	.map_alloc_check = bpf_struct_ops_map_alloc_check,
+	.map_alloc = bpf_struct_ops_map_alloc,
+	.map_free = bpf_struct_ops_map_free,
+	.map_get_next_key = bpf_struct_ops_map_get_next_key,
+	.map_lookup_elem = bpf_struct_ops_map_lookup_elem,
+	.map_delete_elem = bpf_struct_ops_map_delete_elem,
+	.map_update_elem = bpf_struct_ops_map_update_elem,
+	.map_seq_show_elem = bpf_struct_ops_map_seq_show_elem,
+};
+
+/* "const void *" because some subsystem is
+ * passing a const (e.g. const struct tcp_congestion_ops *)
+ */
+bool bpf_struct_ops_get(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+
+	return refcount_inc_not_zero(&kvalue->refcnt);
+}
+
+void bpf_struct_ops_put(const void *kdata)
+{
+	struct bpf_struct_ops_value *kvalue;
+
+	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+	if (refcount_dec_and_test(&kvalue->refcnt)) {
+		struct bpf_struct_ops_map *st_map;
+
+		st_map = container_of(kvalue, struct bpf_struct_ops_map,
+				      kvalue);
+		bpf_map_put(&st_map->map);
+	}
+}
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 16924e5fa126..90837aaa86d6 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -500,13 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding)
 		return "UNKN";
 }
 
-static u32 btf_member_bit_offset(const struct btf_type *struct_type,
-			     const struct btf_member *member)
-{
-	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
-					   : member->offset;
-}
-
 static u32 btf_type_int(const struct btf_type *t)
 {
 	return *(u32 *)(t + 1);
@@ -1089,7 +1082,7 @@ static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env)
  * *elem_type: same as return type ("struct X")
  * *total_nelems: 1
  */
-static const struct btf_type *
+const struct btf_type *
 btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		 u32 *type_size, const struct btf_type **elem_type,
 		 u32 *total_nelems)
@@ -1143,8 +1136,10 @@ btf_resolve_size(const struct btf *btf, const struct btf_type *type,
 		return ERR_PTR(-EINVAL);
 
 	*type_size = nelems * size;
-	*total_nelems = nelems;
-	*elem_type = type;
+	if (total_nelems)
+		*total_nelems = nelems;
+	if (elem_type)
+		*elem_type = type;
 
 	return array_type ? : type;
 }
@@ -1858,7 +1853,10 @@ static void btf_modifier_seq_show(const struct btf *btf,
 				  u32 type_id, void *data,
 				  u8 bits_offset, struct seq_file *m)
 {
-	t = btf_type_id_resolve(btf, &type_id);
+	if (btf->resolved_ids)
+		t = btf_type_id_resolve(btf, &type_id);
+	else
+		t = btf_type_skip_modifiers(btf, type_id, NULL);
 
 	btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m);
 }
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 5e9366b33f0f..b3c48d1533cb 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -22,7 +22,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	 */
 	if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
 	    inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
-	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ||
+	    inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		fdput(f);
 		return ERR_PTR(-ENOTSUPP);
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b2d57f7c04..15f0505ce6d0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -628,7 +628,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD btf_value_type_id
+#define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
@@ -642,6 +642,14 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
+	if (attr->btf_vmlinux_value_type_id) {
+		if (attr->map_type != BPF_MAP_TYPE_STRUCT_OPS ||
+		    attr->btf_key_type_id || attr->btf_value_type_id)
+			return -EINVAL;
+	} else if (attr->btf_key_type_id && !attr->btf_value_type_id) {
+		return -EINVAL;
+	}
+
 	f_flags = bpf_get_file_flag(attr->map_flags);
 	if (f_flags < 0)
 		return f_flags;
@@ -664,32 +672,35 @@ static int map_create(union bpf_attr *attr)
 	atomic64_set(&map->usercnt, 1);
 	mutex_init(&map->freeze_mutex);
 
-	if (attr->btf_key_type_id || attr->btf_value_type_id) {
+	map->spin_lock_off = -EINVAL;
+	if (attr->btf_key_type_id || attr->btf_value_type_id ||
+	    /* Even the map's value is a kernel's struct,
+	     * the bpf_prog.o must have BTF to begin with
+	     * to figure out the corresponding kernel's
+	     * counter part.  Thus, attr->btf_fd has
+	     * to be valid also.
+	     */
+	    attr->btf_vmlinux_value_type_id) {
 		struct btf *btf;
 
-		if (!attr->btf_value_type_id) {
-			err = -EINVAL;
-			goto free_map;
-		}
-
 		btf = btf_get_by_fd(attr->btf_fd);
 		if (IS_ERR(btf)) {
 			err = PTR_ERR(btf);
 			goto free_map;
 		}
+		map->btf = btf;
 
-		err = map_check_btf(map, btf, attr->btf_key_type_id,
-				    attr->btf_value_type_id);
-		if (err) {
-			btf_put(btf);
-			goto free_map;
+		if (attr->btf_value_type_id) {
+			err = map_check_btf(map, btf, attr->btf_key_type_id,
+					    attr->btf_value_type_id);
+			if (err)
+				goto free_map;
 		}
 
-		map->btf = btf;
 		map->btf_key_type_id = attr->btf_key_type_id;
 		map->btf_value_type_id = attr->btf_value_type_id;
-	} else {
-		map->spin_lock_off = -EINVAL;
+		map->btf_vmlinux_value_type_id =
+			attr->btf_vmlinux_value_type_id;
 	}
 
 	err = security_bpf_map_alloc(map);
@@ -888,6 +899,8 @@ static int map_lookup_elem(union bpf_attr *attr)
 	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
 		   map->map_type == BPF_MAP_TYPE_STACK) {
 		err = map->ops->map_peek_elem(map, value);
+	} else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		err = bpf_struct_ops_map_sys_lookup_elem(map, key, value);
 	} else {
 		rcu_read_lock();
 		if (map->ops->map_lookup_elem_sys_only)
@@ -1092,7 +1105,8 @@ static int map_delete_elem(union bpf_attr *attr)
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
-	} else if (IS_FD_PROG_ARRAY(map)) {
+	} else if (IS_FD_PROG_ARRAY(map) ||
+		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
 		err = map->ops->map_delete_elem(map, key);
 		goto out;
 	}
@@ -2822,6 +2836,7 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
 		info.btf_key_type_id = map->btf_key_type_id;
 		info.btf_value_type_id = map->btf_value_type_id;
 	}
+	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 5ee301ddbd00..610109cfc7a8 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -110,7 +110,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
 					  fentry, fentry_cnt,
 					  fexit, fexit_cnt,
 					  tr->func.addr);
-	if (err)
+	if (err < 0)
 		goto out;
 
 	if (tr->selector)
@@ -244,7 +244,8 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start)
 }
 
 int __weak
-arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+arch_prepare_bpf_trampoline(void *image,
+			    const struct btf_func_model *m, u32 flags,
 			    struct bpf_prog **fentry_progs, int fentry_cnt,
 			    struct bpf_prog **fexit_progs, int fexit_cnt,
 			    void *orig_call)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4c1eaa1a2965..990f13165c52 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8149,6 +8149,11 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
+		verbose(env, "bpf_struct_ops map cannot be used in prog\n");
+		return -EINVAL;
+	}
+
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-17 17:36   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch makes "struct tcp_congestion_ops" the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

A BPF-implemented tcp_congestion_ops can be used like a
regular kernel tcp-cc through sysctl and setsockopt, e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic
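
For reference, a minimal userspace sketch of picking the registered
BPF cc on a socket with setsockopt() (assuming the "bpf_cubic"
struct_ops above has already been registered; needs <sys/socket.h>,
<netinet/in.h>, <netinet/tcp.h> and <string.h>):

	int fd = socket(AF_INET, SOCK_STREAM, 0);

	/* select the BPF cc like any other kernel tcp-cc */
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
		       "bpf_cubic", strlen("bpf_cubic")))
		perror("setsockopt(TCP_CONGESTION)");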

There have been attempts to move the TCP CC to user space
(e.g. CCP in TCP).  The common arguments are a faster turnaround,
getting away from long-tail kernel versions in production, etc.,
which are legitimate points.

BPF has been a continuous effort to combine the upsides of both
the kernel and userspace (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular the BTF-aware verifier, BPF trampoline,
BPF CO-RE...) have made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing an
algorithm in production while leveraging the existing (and continuously
growing) BPF features/framework instead of building one specifically
for userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).
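
As an illustration only (the real examples come in the later
bpf_dctcp/bpf_cubic patches; the function name here is made up),
a struct_ops prog receives its arguments as an array of u64, and
the verifier tracks the first one as a btf-id pointer to sock,
which bpf_tcp_ca_is_valid_access() promotes to tcp_sock.  A write
to one of the allowed fields then looks roughly like:

	__u32 example_ssthresh(__u64 *ctx)
	{
		struct tcp_sock *tp = (void *)ctx[0];

		/* write allowed by bpf_tcp_ca_btf_struct_access() */
		tp->snd_ssthresh = tp->snd_cwnd > 2 ? tp->snd_cwnd >> 1 : 2;
		return tp->snd_ssthresh;
	}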

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h            |   2 +
 include/net/tcp.h                 |   1 +
 kernel/bpf/bpf_struct_ops_types.h |   7 +-
 net/core/filter.c                 |   2 +-
 net/ipv4/Makefile                 |   4 +
 net/ipv4/bpf_tcp_ca.c             | 225 ++++++++++++++++++++++++++++++
 net/ipv4/tcp_cong.c               |  14 +-
 net/ipv4/tcp_ipv4.c               |   6 +-
 net/ipv4/tcp_minisocks.c          |   4 +-
 net/ipv4/tcp_output.c             |   4 +-
 10 files changed, 254 insertions(+), 15 deletions(-)
 create mode 100644 net/ipv4/bpf_tcp_ca.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 37ac7025031d..7c22c5e6528d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -844,6 +844,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
 			      bpf_aux_classic_check_t trans, bool save_orig);
 void bpf_prog_destroy(struct bpf_prog *fp);
+const struct bpf_func_proto *
+bpf_base_func_proto(enum bpf_func_id func_id);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
 int sk_attach_bpf(u32 ufd, struct sock *sk);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 86b9a8766648..fd87fa1df603 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1007,6 +1007,7 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
 #define TCP_CONG_NEEDS_ECN	0x2
+#define TCP_CONG_MASK	(TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
 
 union tcp_cc_info;
 
diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
index 7bb13ff49ec2..066d83ea1c99 100644
--- a/kernel/bpf/bpf_struct_ops_types.h
+++ b/kernel/bpf/bpf_struct_ops_types.h
@@ -1,4 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /* internal file - do not include directly */
 
-/* To be filled in a later patch */
+#ifdef CONFIG_BPF_JIT
+#ifdef CONFIG_INET
+#include <net/tcp.h>
+BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
+#endif
+#endif
diff --git a/net/core/filter.c b/net/core/filter.c
index a411f7835dee..fbb3698026bd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5975,7 +5975,7 @@ bool bpf_helper_changes_pkt_data(void *func)
 	return false;
 }
 
-static const struct bpf_func_proto *
+const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index d57ecfaf89d4..7360d9b3eaad 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -65,3 +65,7 @@ obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
+
+ifeq ($(CONFIG_BPF_SYSCALL),y)
+obj-$(CONFIG_BPF_JIT) += bpf_tcp_ca.o
+endif
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
new file mode 100644
index 000000000000..967af987bc26
--- /dev/null
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -0,0 +1,225 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook  */
+
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/tcp.h>
+
+static u32 optional_ops[] = {
+	offsetof(struct tcp_congestion_ops, init),
+	offsetof(struct tcp_congestion_ops, release),
+	offsetof(struct tcp_congestion_ops, set_state),
+	offsetof(struct tcp_congestion_ops, cwnd_event),
+	offsetof(struct tcp_congestion_ops, in_ack_event),
+	offsetof(struct tcp_congestion_ops, pkts_acked),
+	offsetof(struct tcp_congestion_ops, min_tso_segs),
+	offsetof(struct tcp_congestion_ops, sndbuf_expand),
+	offsetof(struct tcp_congestion_ops, cong_control),
+};
+
+static u32 unsupported_ops[] = {
+	offsetof(struct tcp_congestion_ops, get_info),
+};
+
+static const struct btf_type *tcp_sock_type;
+static u32 tcp_sock_id, sock_id;
+
+static int bpf_tcp_ca_init(struct btf *_btf_vmlinux)
+{
+	s32 type_id;
+
+	type_id = btf_find_by_name_kind(_btf_vmlinux, "sock", BTF_KIND_STRUCT);
+	if (type_id < 0)
+		return -EINVAL;
+	sock_id = type_id;
+
+	type_id = btf_find_by_name_kind(_btf_vmlinux, "tcp_sock",
+					BTF_KIND_STRUCT);
+	if (type_id < 0)
+		return -EINVAL;
+	tcp_sock_id = type_id;
+	tcp_sock_type = btf_type_by_id(_btf_vmlinux, tcp_sock_id);
+
+	return 0;
+}
+
+static bool check_optional(u32 member_offset)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(optional_ops); i++) {
+		if (member_offset == optional_ops[i])
+			return true;
+	}
+
+	return false;
+}
+
+static bool check_unsupported(u32 member_offset)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
+		if (member_offset == unsupported_ops[i])
+			return true;
+	}
+
+	return false;
+}
+
+extern struct btf *btf_vmlinux;
+
+static bool bpf_tcp_ca_is_valid_access(int off, int size,
+				       enum bpf_access_type type,
+				       const struct bpf_prog *prog,
+				       struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
+		return false;
+	if (type != BPF_READ)
+		return false;
+	if (off % size != 0)
+		return false;
+
+	if (!btf_ctx_access(off, size, type, prog, info))
+		return false;
+
+	if (info->reg_type == PTR_TO_BTF_ID && info->btf_id == sock_id)
+		/* promote it to tcp_sock */
+		info->btf_id = tcp_sock_id;
+
+	return true;
+}
+
+static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
+					const struct btf_type *t, int off,
+					int size, enum bpf_access_type atype,
+					u32 *next_btf_id)
+{
+	size_t end;
+
+	if (atype == BPF_READ)
+		return btf_struct_access(log, t, off, size, atype, next_btf_id);
+
+	if (t != tcp_sock_type) {
+		bpf_log(log, "only read is supported\n");
+		return -EACCES;
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct inet_connection_sock, icsk_ca_priv):
+		end = offsetofend(struct inet_connection_sock, icsk_ca_priv);
+		break;
+	case offsetof(struct inet_connection_sock, icsk_ack.pending):
+		end = offsetofend(struct inet_connection_sock,
+				  icsk_ack.pending);
+		break;
+	case offsetof(struct tcp_sock, snd_cwnd):
+		end = offsetofend(struct tcp_sock, snd_cwnd);
+		break;
+	case offsetof(struct tcp_sock, snd_cwnd_cnt):
+		end = offsetofend(struct tcp_sock, snd_cwnd_cnt);
+		break;
+	case offsetof(struct tcp_sock, snd_ssthresh):
+		end = offsetofend(struct tcp_sock, snd_ssthresh);
+		break;
+	case offsetof(struct tcp_sock, ecn_flags):
+		end = offsetofend(struct tcp_sock, ecn_flags);
+		break;
+	default:
+		bpf_log(log, "no write support to tcp_sock at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of tcp_sock ended at %zu\n",
+			off, size, end);
+		return -EACCES;
+	}
+
+	return NOT_INIT;
+}
+
+static const struct bpf_func_proto *
+bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
+			  const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
+	.get_func_proto		= bpf_tcp_ca_get_func_proto,
+	.is_valid_access	= bpf_tcp_ca_is_valid_access,
+	.btf_struct_access	= bpf_tcp_ca_btf_struct_access,
+};
+
+static int bpf_tcp_ca_init_member(const struct btf_type *t,
+				  const struct btf_member *member,
+				  void *kdata, const void *udata)
+{
+	const struct tcp_congestion_ops *utcp_ca;
+	struct tcp_congestion_ops *tcp_ca;
+	size_t tcp_ca_name_len;
+	int prog_fd;
+	u32 moff;
+
+	utcp_ca = (const struct tcp_congestion_ops *)udata;
+	tcp_ca = (struct tcp_congestion_ops *)kdata;
+
+	moff = btf_member_bit_offset(t, member) / 8;
+	switch (moff) {
+	case offsetof(struct tcp_congestion_ops, flags):
+		if (utcp_ca->flags & ~TCP_CONG_MASK)
+			return -EINVAL;
+		tcp_ca->flags = utcp_ca->flags;
+		return 1;
+	case offsetof(struct tcp_congestion_ops, name):
+		tcp_ca_name_len = strnlen(utcp_ca->name, sizeof(utcp_ca->name));
+		if (!tcp_ca_name_len ||
+		    tcp_ca_name_len == sizeof(utcp_ca->name))
+			return -EINVAL;
+		memcpy(tcp_ca->name, utcp_ca->name, sizeof(tcp_ca->name));
+		return 1;
+	}
+
+	if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, NULL))
+		return 0;
+
+	prog_fd = (int)(*(unsigned long *)(udata + moff));
+	if (!prog_fd && !check_optional(moff) && !check_unsupported(moff))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bpf_tcp_ca_check_member(const struct btf_type *t,
+				   const struct btf_member *member)
+{
+	if (check_unsupported(btf_member_bit_offset(t, member) / 8))
+		return -ENOTSUPP;
+	return 0;
+}
+
+static int bpf_tcp_ca_reg(void *kdata)
+{
+	return tcp_register_congestion_control(kdata);
+}
+
+static void bpf_tcp_ca_unreg(void *kdata)
+{
+	tcp_unregister_congestion_control(kdata);
+}
+
+struct bpf_struct_ops bpf_tcp_congestion_ops = {
+	.verifier_ops = &bpf_tcp_ca_verifier_ops,
+	.reg = bpf_tcp_ca_reg,
+	.unreg = bpf_tcp_ca_unreg,
+	.check_member = bpf_tcp_ca_check_member,
+	.init_member = bpf_tcp_ca_init_member,
+	.init = bpf_tcp_ca_init,
+	.name = "tcp_congestion_ops",
+};
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 3737ec096650..dc27f21bd815 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -162,7 +162,7 @@ void tcp_assign_congestion_control(struct sock *sk)
 
 	rcu_read_lock();
 	ca = rcu_dereference(net->ipv4.tcp_congestion_control);
-	if (unlikely(!try_module_get(ca->owner)))
+	if (unlikely(!bpf_try_module_get(ca, ca->owner)))
 		ca = &tcp_reno;
 	icsk->icsk_ca_ops = ca;
 	rcu_read_unlock();
@@ -208,7 +208,7 @@ void tcp_cleanup_congestion_control(struct sock *sk)
 
 	if (icsk->icsk_ca_ops->release)
 		icsk->icsk_ca_ops->release(sk);
-	module_put(icsk->icsk_ca_ops->owner);
+	bpf_module_put(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner);
 }
 
 /* Used by sysctl to change default congestion control */
@@ -222,12 +222,12 @@ int tcp_set_default_congestion_control(struct net *net, const char *name)
 	ca = tcp_ca_find_autoload(net, name);
 	if (!ca) {
 		ret = -ENOENT;
-	} else if (!try_module_get(ca->owner)) {
+	} else if (!bpf_try_module_get(ca, ca->owner)) {
 		ret = -EBUSY;
 	} else {
 		prev = xchg(&net->ipv4.tcp_congestion_control, ca);
 		if (prev)
-			module_put(prev->owner);
+			bpf_module_put(prev, prev->owner);
 
 		ca->flags |= TCP_CONG_NON_RESTRICTED;
 		ret = 0;
@@ -366,19 +366,19 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
 	} else if (!load) {
 		const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
 
-		if (try_module_get(ca->owner)) {
+		if (bpf_try_module_get(ca, ca->owner)) {
 			if (reinit) {
 				tcp_reinit_congestion_control(sk, ca);
 			} else {
 				icsk->icsk_ca_ops = ca;
-				module_put(old_ca->owner);
+				bpf_module_put(old_ca, old_ca->owner);
 			}
 		} else {
 			err = -EBUSY;
 		}
 	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin)) {
 		err = -EPERM;
-	} else if (!try_module_get(ca->owner)) {
+	} else if (!bpf_try_module_get(ca, ca->owner)) {
 		err = -EBUSY;
 	} else {
 		tcp_reinit_congestion_control(sk, ca);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 26637fce324d..45a88358168a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2619,7 +2619,8 @@ static void __net_exit tcp_sk_exit(struct net *net)
 	int cpu;
 
 	if (net->ipv4.tcp_congestion_control)
-		module_put(net->ipv4.tcp_congestion_control->owner);
+		bpf_module_put(net->ipv4.tcp_congestion_control,
+			       net->ipv4.tcp_congestion_control->owner);
 
 	for_each_possible_cpu(cpu)
 		inet_ctl_sock_destroy(*per_cpu_ptr(net->ipv4.tcp_sk, cpu));
@@ -2726,7 +2727,8 @@ static int __net_init tcp_sk_init(struct net *net)
 
 	/* Reno is always built in */
 	if (!net_eq(net, &init_net) &&
-	    try_module_get(init_net.ipv4.tcp_congestion_control->owner))
+	    bpf_try_module_get(init_net.ipv4.tcp_congestion_control,
+			       init_net.ipv4.tcp_congestion_control->owner))
 		net->ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control;
 	else
 		net->ipv4.tcp_congestion_control = &tcp_reno;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index c802bc80c400..ad3b56d9fa71 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -414,7 +414,7 @@ void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
 
 		rcu_read_lock();
 		ca = tcp_ca_find_key(ca_key);
-		if (likely(ca && try_module_get(ca->owner))) {
+		if (likely(ca && bpf_try_module_get(ca, ca->owner))) {
 			icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
 			icsk->icsk_ca_ops = ca;
 			ca_got_dst = true;
@@ -425,7 +425,7 @@ void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
 	/* If no valid choice made yet, assign current system default ca. */
 	if (!ca_got_dst &&
 	    (!icsk->icsk_ca_setsockopt ||
-	     !try_module_get(icsk->icsk_ca_ops->owner)))
+	     !bpf_try_module_get(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner)))
 		tcp_assign_congestion_control(sk);
 
 	tcp_set_ca_state(sk, TCP_CA_Open);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b184f03d7437..8e7187732ac1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3356,8 +3356,8 @@ static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst)
 
 	rcu_read_lock();
 	ca = tcp_ca_find_key(ca_key);
-	if (likely(ca && try_module_get(ca->owner))) {
-		module_put(icsk->icsk_ca_ops->owner);
+	if (likely(ca && bpf_try_module_get(ca, ca->owner))) {
+		bpf_module_put(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner);
 		icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
 		icsk->icsk_ca_ops = ca;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-17 17:41   ` Yonghong Song
  2019-12-14  0:47 ` [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies Martin KaFai Lau
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

Add a helper to send out a tcp-ack.  It will be used in the later
bpf_dctcp implementation, which needs to send out an ack
when the CE state changes.
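
A hedged sketch of how a bpf-tcp-cc prog could use it (illustrative
only; the real user is the bpf_dctcp example added later in this
series, and "ce_state_changed" is just an assumed local condition):

	/* tp is the verifier-tracked pointer to the in-kernel tcp_sock */
	if (ce_state_changed)
		/* ack what has been received so far */
		bpf_tcp_send_ack(tp, tp->rcv_nxt);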

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/uapi/linux/bpf.h | 11 ++++++++++-
 net/ipv4/bpf_tcp_ca.c    | 24 +++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8809212d9d6c..602449a56dde 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2827,6 +2827,14 @@ union bpf_attr {
  * 	Return
  * 		On success, the strictly positive length of the string,	including
  * 		the trailing NUL character. On error, a negative value.
+ *
+ * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
+ *	Description
+ *		Send out a tcp-ack. *tp* is the in-kernel struct tcp_sock.
+ *		*rcv_nxt* is the ack_seq to be sent out.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2944,7 +2952,8 @@ union bpf_attr {
 	FN(probe_read_user),		\
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
-	FN(probe_read_kernel_str),
+	FN(probe_read_kernel_str),	\
+	FN(tcp_send_ack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 967af987bc26..1fb86f7a93c1 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -144,11 +144,33 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
 	return NOT_INIT;
 }
 
+BPF_CALL_2(bpf_tcp_send_ack, struct tcp_sock *, tp, u32, rcv_nxt)
+{
+	/* bpf_tcp_ca prog cannot have NULL tp */
+	__tcp_send_ack((struct sock *)tp, rcv_nxt);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_tcp_send_ack_proto = {
+	.func		= bpf_tcp_send_ack,
+	.gpl_only	= false,
+	/* In case we want to report error later */
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_ANYTHING,
+	.btf_id		= &tcp_sock_id,
+};
+
 static const struct bpf_func_proto *
 bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
 			  const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	switch (func_id) {
+	case BPF_FUNC_tcp_send_ack:
+		return &bpf_tcp_send_ack_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
 }
 
 static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
@ 2019-12-14  0:47 ` Martin KaFai Lau
  2019-12-14  1:59   ` Eric Dumazet
  2019-12-14  0:48 ` [PATCH bpf-next 10/13] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:47 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch adds a helper to handle jiffies.  Some of the
tcp_sock's timing is stored in jiffies.  Although things
could be deduced from CONFIG_HZ, having an easy way to get
jiffies will make the later bpf-tcp-cc implementation easier.

While at it, instead of only reading jiffies, the helper also takes a
"flags" argument to help convert between ns and jiffies.

This helper is available to CAP_SYS_ADMIN.
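
A short, illustrative fragment showing the intended use from a
bpf prog ("rtt_ns" is an assumed local variable here):

	__u64 now_j  = bpf_jiffies(0, 0);                        /* current jiffies */
	__u64 rtt_j  = bpf_jiffies(rtt_ns, BPF_F_NS_TO_JIFFIES); /* ns -> jiffies */
	__u64 now_ns = bpf_jiffies(0, BPF_F_JIFFIES_TO_NS);      /* now, in ns */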

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      |  1 +
 include/uapi/linux/bpf.h | 16 +++++++++++++++-
 kernel/bpf/core.c        |  1 +
 kernel/bpf/helpers.c     | 25 +++++++++++++++++++++++++
 net/core/filter.c        |  2 ++
 5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 349cedd7b97b..00491961421e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1371,6 +1371,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto;
 extern const struct bpf_func_proto bpf_strtol_proto;
 extern const struct bpf_func_proto bpf_strtoul_proto;
 extern const struct bpf_func_proto bpf_tcp_sock_proto;
+extern const struct bpf_func_proto bpf_jiffies_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 602449a56dde..cf864a5f7d61 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2835,6 +2835,16 @@ union bpf_attr {
  *	Return
  *		0 on success, or a negative error in case of failure.
  *
+ * u64 bpf_jiffies(u64 in, u64 flags)
+ *	Description
+ *		jiffies helper.
+ *	Return
+ *		*flags*: 0, return the current jiffies.
+ *			 BPF_F_NS_TO_JIFFIES, convert *in* from ns to jiffies.
+ *			 BPF_F_JIFFIES_TO_NS, convert *in* from jiffies to
+ *			 ns.  If *in* is zero, it returns the current
+ *			 jiffies as ns.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2953,7 +2963,8 @@ union bpf_attr {
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
 	FN(probe_read_kernel_str),	\
-	FN(tcp_send_ack),
+	FN(tcp_send_ack),		\
+	FN(jiffies),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3032,6 +3043,9 @@ enum bpf_func_id {
 /* BPF_FUNC_sk_storage_get flags */
 #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
 
+#define BPF_F_NS_TO_JIFFIES		(1ULL << 0)
+#define BPF_F_JIFFIES_TO_NS		(1ULL << 1)
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
 	BPF_ADJ_ROOM_NET,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2ff01a716128..0ffbda9a13e9 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2134,6 +2134,7 @@ const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
 const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
 const struct bpf_func_proto bpf_spin_lock_proto __weak;
 const struct bpf_func_proto bpf_spin_unlock_proto __weak;
+const struct bpf_func_proto bpf_jiffies_proto __weak;
 
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index cada974c9f4e..e87c332d1b61 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,7 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/jiffies.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -486,4 +487,28 @@ const struct bpf_func_proto bpf_strtoul_proto = {
 	.arg3_type	= ARG_ANYTHING,
 	.arg4_type	= ARG_PTR_TO_LONG,
 };
+
+BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
+{
+	if (!flags)
+		return get_jiffies_64();
+
+	if (flags & BPF_F_NS_TO_JIFFIES) {
+		return nsecs_to_jiffies(in);
+	} else if (flags & BPF_F_JIFFIES_TO_NS) {
+		if (!in)
+			in = get_jiffies_64();
+		return jiffies_to_nsecs(in);
+	}
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_jiffies_proto = {
+	.func		= bpf_jiffies,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+	.arg2_type	= ARG_ANYTHING,
+};
 #endif
diff --git a/net/core/filter.c b/net/core/filter.c
index fbb3698026bd..355746715901 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6015,6 +6015,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
 		return bpf_get_trace_printk_proto();
+	case BPF_FUNC_jiffies:
+		return &bpf_jiffies_proto;
 	default:
 		return NULL;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 10/13] bpf: Synch uapi bpf.h to tools/
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2019-12-14  0:47 ` [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies Martin KaFai Lau
@ 2019-12-14  0:48 ` Martin KaFai Lau
  2019-12-14  0:48 ` [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch syncs uapi bpf.h to tools/.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/include/uapi/linux/bpf.h | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index dbbcf0b02970..cf864a5f7d61 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -136,6 +136,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };
 
 /* Note that tracing related programs such as
@@ -174,6 +175,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 	BPF_PROG_TYPE_TRACING,
+	BPF_PROG_TYPE_STRUCT_OPS,
 };
 
 enum bpf_attach_type {
@@ -391,6 +393,10 @@ union bpf_attr {
 		__u32	btf_fd;		/* fd pointing to a BTF type data */
 		__u32	btf_key_type_id;	/* BTF type_id of the key */
 		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -2821,6 +2827,24 @@ union bpf_attr {
  * 	Return
  * 		On success, the strictly positive length of the string,	including
  * 		the trailing NUL character. On error, a negative value.
+ *
+ * int bpf_tcp_send_ack(void *tp, u32 rcv_nxt)
+ *	Description
+ *		Send out a tcp-ack. *tp* is the in-kernel struct tcp_sock.
+ *		*rcv_nxt* is the ack_seq to be sent out.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_jiffies(u64 in, u64 flags)
+ *	Description
+ *		jiffies helper.
+ *	Return
+ *		*flags*: 0, return the current jiffies.
+ *			 BPF_F_NS_TO_JIFFIES, convert *in* from ns to jiffies.
+ *			 BPF_F_JIFFIES_TO_NS, convert *in* from jiffies to
+ *			 ns.  If *in* is zero, it returns the current
+ *			 jiffies as ns.
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2938,7 +2962,9 @@ union bpf_attr {
 	FN(probe_read_user),		\
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
-	FN(probe_read_kernel_str),
+	FN(probe_read_kernel_str),	\
+	FN(tcp_send_ack),		\
+	FN(jiffies),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3017,6 +3043,9 @@ enum bpf_func_id {
 /* BPF_FUNC_sk_storage_get flags */
 #define BPF_SK_STORAGE_GET_F_CREATE	(1ULL << 0)
 
+#define BPF_F_NS_TO_JIFFIES		(1ULL << 0)
+#define BPF_F_JIFFIES_TO_NS		(1ULL << 1)
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
 	BPF_ADJ_ROOM_NET,
@@ -3339,7 +3368,7 @@ struct bpf_map_info {
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32 :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (9 preceding siblings ...)
  2019-12-14  0:48 ` [PATCH bpf-next 10/13] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
@ 2019-12-14  0:48 ` Martin KaFai Lau
  2019-12-18  3:07   ` Andrii Nakryiko
  2019-12-14  0:48 ` [PATCH bpf-next 12/13] bpf: Add bpf_dctcp example Martin KaFai Lau
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch adds BPF STRUCT_OPS support to libbpf.

The only sec_name convention is SEC("struct_ops") to identify the
struct ops implemented in BPF, e.g.
SEC("struct_ops")
struct tcp_congestion_ops dctcp = {
	.init           = (void *)dctcp_init,  /* <-- a bpf_prog */
	/* ... some more func ptrs ... */
	.name           = "bpf_dctcp",
};

In the bpf_object__open phase, libbpf will look for the "struct_ops"
elf section and find out which btf-type the "struct_ops" is
implementing.  Note that the btf-type here refers to
a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
which bpf progs the func ptrs are referring to.

In the bpf_object__load phase, prepare_struct_ops() will load
btf_vmlinux and obtain the corresponding kernel btf-type.
With the kernel btf-type, it can then set the prog->type,
prog->attach_btf_id and prog->expected_attach_type.  Thus,
the prog's properties do not rely on its section name.

Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
process is as simple as: member-name match + btf-kind match + size match.
If these matching conditions fail, libbpf will reject.
The currently targeted support is "struct tcp_congestion_ops", most
of whose members are function pointers.
The member ordering of the bpf_prog's btf-type can be different from
the btf_vmlinux's btf-type.

Once the prog's properties are all set,
the libbpf will proceed to load all the progs.

After that, register_struct_ops() will create a map, finalize the
map-value by populating it with the prog-fds, and then register this
"struct_ops" with the kernel by updating the map-value into the map.

By default, libbpf does not unregister the struct_ops from the kernel
during bpf_object__close().  It can be changed by setting the new
"unreg_st_ops" in bpf_object_open_opts.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/lib/bpf/bpf.c           |  10 +-
 tools/lib/bpf/bpf.h           |   5 +-
 tools/lib/bpf/libbpf.c        | 599 +++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h        |   3 +-
 tools/lib/bpf/libbpf_probes.c |   2 +
 5 files changed, 612 insertions(+), 7 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 98596e15390f..ebb9d7066173 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -95,7 +95,11 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
 	attr.btf_key_type_id = create_attr->btf_key_type_id;
 	attr.btf_value_type_id = create_attr->btf_value_type_id;
 	attr.map_ifindex = create_attr->map_ifindex;
-	attr.inner_map_fd = create_attr->inner_map_fd;
+	if (attr.map_type == BPF_MAP_TYPE_STRUCT_OPS)
+		attr.btf_vmlinux_value_type_id =
+			create_attr->btf_vmlinux_value_type_id;
+	else
+		attr.inner_map_fd = create_attr->inner_map_fd;
 
 	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
@@ -228,7 +232,9 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
 	memset(&attr, 0, sizeof(attr));
 	attr.prog_type = load_attr->prog_type;
 	attr.expected_attach_type = load_attr->expected_attach_type;
-	if (attr.prog_type == BPF_PROG_TYPE_TRACING) {
+	if (attr.prog_type == BPF_PROG_TYPE_STRUCT_OPS) {
+		attr.attach_btf_id = load_attr->attach_btf_id;
+	} else if (attr.prog_type == BPF_PROG_TYPE_TRACING) {
 		attr.attach_btf_id = load_attr->attach_btf_id;
 		attr.attach_prog_fd = load_attr->attach_prog_fd;
 	} else {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 3c791fa8e68e..1ddbf7f33b83 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -48,7 +48,10 @@ struct bpf_create_map_attr {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__u32 map_ifindex;
-	__u32 inner_map_fd;
+	union {
+		__u32 inner_map_fd;
+		__u32 btf_vmlinux_value_type_id;
+	};
 };
 
 LIBBPF_API int
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 27d5f7ecba32..ffb5cdd7db5a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -67,6 +67,10 @@
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
+static struct btf *bpf_core_find_kernel_btf(void);
+static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj,
+							int idx);
+
 static int __base_pr(enum libbpf_print_level level, const char *format,
 		     va_list args)
 {
@@ -128,6 +132,8 @@ void libbpf_print(enum libbpf_print_level level, const char *format, ...)
 # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ
 #endif
 
+#define BPF_STRUCT_OPS_SEC_NAME "struct_ops"
+
 static inline __u64 ptr_to_u64(const void *ptr)
 {
 	return (__u64) (unsigned long) ptr;
@@ -233,6 +239,32 @@ struct bpf_map {
 	bool reused;
 };
 
+struct bpf_struct_ops {
+	const char *var_name;
+	const char *tname;
+	const struct btf_type *type;
+	struct bpf_program **progs;
+	__u32 *kern_func_off;
+	/* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
+	void *data;
+	/* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
+	 * format.
+	 * struct __bpf_tcp_congestion_ops {
+	 *	[... some other kernel fields ...]
+	 *	struct tcp_congestion_ops data;
+	 * }
+	 * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
+	 * prepare_struct_ops() will populate the "data" into
+	 * "kern_vdata".
+	 */
+	void *kern_vdata;
+	__u32 type_id;
+	__u32 kern_vtype_id;
+	__u32 kern_vtype_size;
+	int fd;
+	bool unreg;
+};
+
 struct bpf_secdata {
 	void *rodata;
 	void *data;
@@ -251,6 +283,7 @@ struct bpf_object {
 	size_t nr_maps;
 	size_t maps_cap;
 	struct bpf_secdata sections;
+	struct bpf_struct_ops st_ops;
 
 	bool loaded;
 	bool has_pseudo_calls;
@@ -270,6 +303,7 @@ struct bpf_object {
 		Elf_Data *data;
 		Elf_Data *rodata;
 		Elf_Data *bss;
+		Elf_Data *st_ops_data;
 		size_t strtabidx;
 		struct {
 			GElf_Shdr shdr;
@@ -282,6 +316,7 @@ struct bpf_object {
 		int data_shndx;
 		int rodata_shndx;
 		int bss_shndx;
+		int st_ops_shndx;
 	} efile;
 	/*
 	 * All loaded bpf_object is linked in a list, which is
@@ -509,6 +544,508 @@ static __u32 get_kernel_version(void)
 	return KERNEL_VERSION(major, minor, patch);
 }
 
+static int bpf_object__register_struct_ops(struct bpf_object *obj)
+{
+	struct bpf_create_map_attr map_attr = {};
+	struct bpf_struct_ops *st_ops;
+	const char *tname;
+	__u32 i, zero = 0;
+	int fd, err;
+
+	st_ops = &obj->st_ops;
+	if (!st_ops->kern_vdata)
+		return 0;
+
+	tname = st_ops->tname;
+	for (i = 0; i < btf_vlen(st_ops->type); i++) {
+		struct bpf_program *prog = st_ops->progs[i];
+		void *kern_data;
+		int prog_fd;
+
+		if (!prog)
+			continue;
+
+		prog_fd = bpf_program__nth_fd(prog, 0);
+		if (prog_fd < 0) {
+			pr_warn("struct_ops register %s: prog %s is not loaded\n",
+				tname, prog->name);
+			return -EINVAL;
+		}
+
+		kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
+		*(unsigned long *)kern_data = prog_fd;
+	}
+
+	map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
+	map_attr.key_size = sizeof(unsigned int);
+	map_attr.value_size = st_ops->kern_vtype_size;
+	map_attr.max_entries = 1;
+	map_attr.btf_fd = btf__fd(obj->btf);
+	map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
+	map_attr.name = st_ops->var_name;
+
+	fd = bpf_create_map_xattr(&map_attr);
+	if (fd < 0) {
+		err = -errno;
+		pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
+			tname);
+		return err;
+	}
+
+	err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);
+	if (err) {
+		err = -errno;
+		close(fd);
+		pr_warn("struct_ops register %s: Error in updating struct_ops map\n",
+			tname);
+		return err;
+	}
+
+	st_ops->fd = fd;
+
+	return 0;
+}
+
+static int bpf_struct_ops__unregister(struct bpf_struct_ops *st_ops)
+{
+	if (st_ops->fd != -1) {
+		__u32 zero = 0;
+		int err = 0;
+
+		if (bpf_map_delete_elem(st_ops->fd, &zero))
+			err = -errno;
+		zclose(st_ops->fd);
+
+		return err;
+	}
+
+	return 0;
+}
+
+static const struct btf_type *
+resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
+static const struct btf_type *
+resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
+
+static const struct btf_member *
+find_member_by_offset(const struct btf_type *t, __u32 offset)
+{
+	struct btf_member *m;
+	int i;
+
+	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
+		if (btf_member_bit_offset(t, i) == offset)
+			return m;
+	}
+
+	return NULL;
+}
+
+static const struct btf_member *
+find_member_by_name(const struct btf *btf, const struct btf_type *t,
+		    const char *name)
+{
+	struct btf_member *m;
+	int i;
+
+	for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
+		if (!strcmp(btf__name_by_offset(btf, m->name_off), name))
+			return m;
+	}
+
+	return NULL;
+}
+
+#define STRUCT_OPS_VALUE_PREFIX "__bpf_"
+#define STRUCT_OPS_VALUE_PREFIX_LEN (sizeof(STRUCT_OPS_VALUE_PREFIX) - 1)
+
+static int
+bpf_struct_ops__get_kern_types(const struct btf *btf, const char *tname,
+			       const struct btf_type **type, __u32 *type_id,
+			       const struct btf_type **vtype, __u32 *vtype_id,
+			       const struct btf_member **data_member)
+{
+	const struct btf_type *kern_type, *kern_vtype;
+	const struct btf_member *kern_data_member;
+	__s32 kern_vtype_id, kern_type_id;
+	char vtname[128] = STRUCT_OPS_VALUE_PREFIX;
+	__u32 i;
+
+	kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
+	if (kern_type_id < 0) {
+		pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
+			tname);
+		return -ENOTSUP;
+	}
+	kern_type = btf__type_by_id(btf, kern_type_id);
+
+	/* Find the corresponding "map_value" type that will be used
+	 * in map_update(BPF_MAP_TYPE_STRUCT_OPS).  For example,
+	 * find "struct __bpf_tcp_congestion_ops" from the btf_vmlinux.
+	 */
+	strncat(vtname + STRUCT_OPS_VALUE_PREFIX_LEN, tname,
+		sizeof(vtname) - STRUCT_OPS_VALUE_PREFIX_LEN - 1);
+	kern_vtype_id = btf__find_by_name_kind(btf, vtname,
+					       BTF_KIND_STRUCT);
+	if (kern_vtype_id < 0) {
+		pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
+			vtname);
+		return -ENOTSUP;
+	}
+	kern_vtype = btf__type_by_id(btf, kern_vtype_id);
+
+	/* Find "struct tcp_congestion_ops" from
+	 * struct __bpf_tcp_congestion_ops {
+	 *	[ ... ]
+	 *	struct tcp_congestion_ops data;
+	 * }
+	 */
+	for (i = 0, kern_data_member = btf_members(kern_vtype);
+	     i < btf_vlen(kern_vtype);
+	     i++, kern_data_member++) {
+		if (kern_data_member->type == kern_type_id)
+			break;
+	}
+	if (i == btf_vlen(kern_vtype)) {
+		pr_warn("struct_ops prepare: struct %s data is not found in struct %s\n",
+			tname, vtname);
+		return -EINVAL;
+	}
+
+	*type = kern_type;
+	*type_id = kern_type_id;
+	*vtype = kern_vtype;
+	*vtype_id = kern_vtype_id;
+	*data_member = kern_data_member;
+
+	return 0;
+}
+
+static int bpf_object__prepare_struct_ops(struct bpf_object *obj)
+{
+	const struct btf_member *member, *kern_member, *kern_data_member;
+	const struct btf_type *type, *kern_type, *kern_vtype;
+	__u32 i, kern_type_id, kern_vtype_id, kern_data_off;
+	struct bpf_struct_ops *st_ops;
+	void *data, *kern_data;
+	const struct btf *btf;
+	struct btf *kern_btf;
+	const char *tname;
+	int err;
+
+	st_ops = &obj->st_ops;
+	if (!st_ops->data)
+		return 0;
+
+	btf = obj->btf;
+	type = st_ops->type;
+	tname = st_ops->tname;
+
+	kern_btf = bpf_core_find_kernel_btf();
+	if (IS_ERR(kern_btf))
+		return PTR_ERR(kern_btf);
+
+	err = bpf_struct_ops__get_kern_types(kern_btf, tname,
+					     &kern_type, &kern_type_id,
+					     &kern_vtype, &kern_vtype_id,
+					     &kern_data_member);
+	if (err)
+		goto done;
+
+	pr_debug("struct_ops prepare %s: type_id:%u kern_type_id:%u kern_vtype_id:%u\n",
+		 tname, st_ops->type_id, kern_type_id, kern_vtype_id);
+
+	kern_data_off = kern_data_member->offset / 8;
+	st_ops->kern_vtype_size = kern_vtype->size;
+	st_ops->kern_vtype_id = kern_vtype_id;
+
+	st_ops->kern_vdata = calloc(1, st_ops->kern_vtype_size);
+	if (!st_ops->kern_vdata) {
+		err = -ENOMEM;
+		goto done;
+	}
+
+	data = st_ops->data;
+	kern_data = st_ops->kern_vdata + kern_data_off;
+
+	err = -ENOTSUP;
+	for (i = 0, member = btf_members(type); i < btf_vlen(type);
+	     i++, member++) {
+		const struct btf_type *mtype, *kern_mtype;
+		__u32 mtype_id, kern_mtype_id;
+		void *mdata, *kern_mdata;
+		__s64 msize, kern_msize;
+		__u32 moff, kern_moff;
+		__u32 kern_member_idx;
+		const char *mname;
+
+		mname = btf__name_by_offset(btf, member->name_off);
+		kern_member = find_member_by_name(kern_btf, kern_type, mname);
+		if (!kern_member) {
+			pr_warn("struct_ops prepare %s: Cannot find member %s in kernel BTF\n",
+				tname, mname);
+			goto done;
+		}
+
+		kern_member_idx = kern_member - btf_members(kern_type);
+		if (btf_member_bitfield_size(type, i) ||
+		    btf_member_bitfield_size(kern_type, kern_member_idx)) {
+			pr_warn("struct_ops prepare %s: bitfield %s is not supported\n",
+				tname, mname);
+			goto done;
+		}
+
+		moff = member->offset / 8;
+		kern_moff = kern_member->offset / 8;
+
+		mdata = data + moff;
+		kern_mdata = kern_data + kern_moff;
+
+		mtype_id = member->type;
+		kern_mtype_id = kern_member->type;
+
+		mtype = resolve_ptr(btf, mtype_id, NULL);
+		kern_mtype = resolve_ptr(kern_btf, kern_mtype_id, NULL);
+		if (mtype && kern_mtype) {
+			struct bpf_program *prog;
+
+			if (!btf_is_func_proto(mtype) ||
+			    !btf_is_func_proto(kern_mtype)) {
+				pr_warn("struct_ops prepare %s: non func ptr %s is not supported\n",
+					tname, mname);
+				goto done;
+			}
+
+			prog = st_ops->progs[i];
+			if (!prog) {
+				pr_debug("struct_ops prepare %s: func ptr %s is not set\n",
+					 tname, mname);
+				continue;
+			}
+
+			if (prog->type != BPF_PROG_TYPE_UNSPEC ||
+			    prog->attach_btf_id) {
+				pr_warn("struct_ops prepare %s: cannot use prog %s with attach_btf_id %d for func ptr %s\n",
+					tname, prog->name, prog->attach_btf_id, mname);
+				err = -EINVAL;
+				goto done;
+			}
+
+			prog->type = BPF_PROG_TYPE_STRUCT_OPS;
+			prog->attach_btf_id = kern_type_id;
+			/* expected_attach_type is the member index */
+			prog->expected_attach_type = kern_member_idx;
+
+			st_ops->kern_func_off[i] = kern_data_off + kern_moff;
+
+			pr_debug("struct_ops prepare %s: func ptr %s is set to prog %s from data(+%u) to kern_data(+%u)\n",
+				 tname, mname, prog->name, moff, kern_moff);
+
+			continue;
+		}
+
+		mtype_id = btf__resolve_type(btf, mtype_id);
+		kern_mtype_id = btf__resolve_type(kern_btf, kern_mtype_id);
+		if (mtype_id < 0 || kern_mtype_id < 0) {
+			pr_warn("struct_ops prepare %s: Cannot resolve the type for %s\n",
+				tname, mname);
+			goto done;
+		}
+
+		mtype = btf__type_by_id(btf, mtype_id);
+		kern_mtype = btf__type_by_id(kern_btf, kern_mtype_id);
+		if (BTF_INFO_KIND(mtype->info) !=
+		    BTF_INFO_KIND(kern_mtype->info)) {
+			pr_warn("struct_ops prepare %s: Unmatched member type %s %u != %u(kernel)\n",
+				tname, mname,
+				BTF_INFO_KIND(mtype->info),
+				BTF_INFO_KIND(kern_mtype->info));
+			goto done;
+		}
+
+		msize = btf__resolve_size(btf, mtype_id);
+		kern_msize = btf__resolve_size(kern_btf, kern_mtype_id);
+		if (msize < 0 || kern_msize < 0 || msize != kern_msize) {
+			pr_warn("struct_ops prepare %s: Error in size of member %s: %zd != %zd(kernel)\n",
+				tname, mname,
+				(ssize_t)msize, (ssize_t)kern_msize);
+			goto done;
+		}
+
+		pr_debug("struct_ops prepare %s: copy %s %u bytes from data(+%u) to kern_data(+%u)\n",
+			 tname, mname, (unsigned int)msize,
+			 moff, kern_moff);
+		memcpy(kern_mdata, mdata, msize);
+	}
+
+	err = 0;
+
+done:
+	/* On error case, bpf_object__unload() will free the
+	 * st_ops->kern_vdata.
+	 */
+	btf__free(kern_btf);
+	return err;
+}
+
+static int bpf_object__collect_struct_ops_reloc(struct bpf_object *obj,
+						GElf_Shdr *shdr,
+						Elf_Data *data)
+{
+	const struct btf_member *member;
+	struct bpf_struct_ops *st_ops;
+	struct bpf_program *prog;
+	const char *name, *tname;
+	unsigned int shdr_idx;
+	const struct btf *btf;
+	Elf_Data *symbols;
+	unsigned int moff;
+	GElf_Sym sym;
+	GElf_Rel rel;
+	int i, nrels;
+
+	symbols = obj->efile.symbols;
+	btf = obj->btf;
+	st_ops = &obj->st_ops;
+	tname = st_ops->tname;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+	for (i = 0; i < nrels; i++) {
+		if (!gelf_getrel(data, i, &rel)) {
+			pr_warn("struct_ops reloc %s: failed to get %d reloc\n",
+				tname, i);
+			return -LIBBPF_ERRNO__FORMAT;
+		}
+
+		if (!gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym)) {
+			pr_warn("struct_ops reloc %s: symbol %" PRIx64 " not found\n",
+				tname, GELF_R_SYM(rel.r_info));
+			return -LIBBPF_ERRNO__FORMAT;
+		}
+
+		name = elf_strptr(obj->efile.elf, obj->efile.strtabidx,
+				  sym.st_name) ? : "<?>";
+
+		pr_debug("%s relo for %lld value %lld name %d (\'%s\')\n",
+			 tname,
+			 (long long) (rel.r_info >> 32),
+			 (long long) sym.st_value, sym.st_name, name);
+
+		shdr_idx = sym.st_shndx;
+		moff = rel.r_offset;
+		pr_debug("struct_ops reloc %s: moff=%u, shdr_idx=%u\n",
+			 tname, moff, shdr_idx);
+
+		if (shdr_idx >= SHN_LORESERVE) {
+			pr_warn("struct_ops reloc %s: moff=%u shdr_idx=%u unsupported non-static function\n",
+				tname, moff, shdr_idx);
+			return -LIBBPF_ERRNO__RELOC;
+		}
+
+		member = find_member_by_offset(st_ops->type, moff * 8);
+		if (!member) {
+			pr_warn("struct_ops reloc %s: cannot find member at moff=%u\n",
+				tname, moff);
+			return -EINVAL;
+		}
+		name = btf__name_by_offset(btf, member->name_off);
+
+		if (!resolve_func_ptr(btf, member->type, NULL)) {
+			pr_warn("struct_ops reloc %s: cannot relocate non func ptr %s\n",
+				tname, name);
+			return -EINVAL;
+		}
+
+		prog = bpf_object__find_prog_by_idx(obj, shdr_idx);
+		if (!prog) {
+			pr_warn("struct_ops reloc %s: cannot find prog at shdr_idx %u to relocate func ptr %s\n",
+				tname, shdr_idx, name);
+			return -EINVAL;
+		}
+
+		st_ops->progs[member - btf_members(st_ops->type)] = prog;
+	}
+
+	return 0;
+}
+
+static int bpf_object__init_struct_ops(struct bpf_object *obj)
+{
+	const struct btf_type *type, *datasec;
+	const struct btf_var_secinfo *vsi;
+	struct bpf_struct_ops *st_ops;
+	const char *tname, *var_name;
+	__s32 type_id, datasec_id;
+	const struct btf *btf;
+
+	if (obj->efile.st_ops_shndx == -1)
+		return 0;
+
+	btf = obj->btf;
+	st_ops = &obj->st_ops;
+	datasec_id = btf__find_by_name_kind(btf, BPF_STRUCT_OPS_SEC_NAME,
+					    BTF_KIND_DATASEC);
+	if (datasec_id < 0) {
+		pr_warn("struct_ops init: DATASEC %s not found\n",
+			BPF_STRUCT_OPS_SEC_NAME);
+		return -EINVAL;
+	}
+
+	datasec = btf__type_by_id(btf, datasec_id);
+	if (btf_vlen(datasec) != 1) {
+		pr_warn("struct_ops init: multiple VAR in DATASEC %s\n",
+			BPF_STRUCT_OPS_SEC_NAME);
+		return -ENOTSUP;
+	}
+	vsi = btf_var_secinfos(datasec);
+
+	type = btf__type_by_id(obj->btf, vsi->type);
+	if (!btf_is_var(type)) {
+		pr_warn("struct_ops init: vsi->type %u is not a VAR\n",
+			vsi->type);
+		return -EINVAL;
+	}
+	var_name = btf__name_by_offset(obj->btf, type->name_off);
+
+	type_id = btf__resolve_type(obj->btf, vsi->type);
+	if (type_id < 0) {
+		pr_warn("struct_ops init: Cannot resolve var type_id %u in DATASEC %s\n",
+			vsi->type, BPF_STRUCT_OPS_SEC_NAME);
+		return -EINVAL;
+	}
+
+	type = btf__type_by_id(obj->btf, type_id);
+	tname = btf__name_by_offset(obj->btf, type->name_off);
+	if (!btf_is_struct(type)) {
+		pr_warn("struct_ops init: %s is not a struct\n", tname);
+		return -EINVAL;
+	}
+
+	if (type->size != obj->efile.st_ops_data->d_size) {
+		pr_warn("struct_ops init: %s unmatched size %u (BTF DATASEC) != %zu (ELF)\n",
+			tname, type->size, obj->efile.st_ops_data->d_size);
+		return -EINVAL;
+	}
+
+	st_ops->data = malloc(type->size);
+	st_ops->progs = calloc(btf_vlen(type), sizeof(*st_ops->progs));
+	st_ops->kern_func_off = malloc(btf_vlen(type) *
+				       sizeof(*st_ops->kern_func_off));
+	/* bpf_object__close() will take care of freeing these */
+	if (!st_ops->data || !st_ops->progs || !st_ops->kern_func_off)
+		return -ENOMEM;
+	memcpy(st_ops->data, obj->efile.st_ops_data->d_buf, type->size);
+
+	st_ops->tname = tname;
+	st_ops->type = type;
+	st_ops->type_id = type_id;
+	st_ops->var_name = var_name;
+
+	pr_debug("struct_ops init: %s found. type_id:%u\n", tname, type_id);
+
+	return 0;
+}
+
 static struct bpf_object *bpf_object__new(const char *path,
 					  const void *obj_buf,
 					  size_t obj_buf_sz,
@@ -550,6 +1087,9 @@ static struct bpf_object *bpf_object__new(const char *path,
 	obj->efile.data_shndx = -1;
 	obj->efile.rodata_shndx = -1;
 	obj->efile.bss_shndx = -1;
+	obj->efile.st_ops_shndx = -1;
+
+	obj->st_ops.fd = -1;
 
 	obj->kern_version = get_kernel_version();
 	obj->loaded = false;
@@ -572,6 +1112,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
 	obj->efile.data = NULL;
 	obj->efile.rodata = NULL;
 	obj->efile.bss = NULL;
+	obj->efile.st_ops_data = NULL;
 
 	zfree(&obj->efile.reloc_sects);
 	obj->efile.nr_reloc_sects = 0;
@@ -757,6 +1298,9 @@ int bpf_object__section_size(const struct bpf_object *obj, const char *name,
 	} else if (!strcmp(name, ".rodata")) {
 		if (obj->efile.rodata)
 			*size = obj->efile.rodata->d_size;
+	} else if (!strcmp(name, BPF_STRUCT_OPS_SEC_NAME)) {
+		if (obj->efile.st_ops_data)
+			*size = obj->efile.st_ops_data->d_size;
 	} else {
 		ret = bpf_object_search_section_size(obj, name, &d_size);
 		if (!ret)
@@ -1060,6 +1604,30 @@ skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
 	return t;
 }
 
+static const struct btf_type *
+resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = skip_mods_and_typedefs(btf, id, NULL);
+	if (!btf_is_ptr(t))
+		return NULL;
+
+	return skip_mods_and_typedefs(btf, t->type, res_id);
+}
+
+static const struct btf_type *
+resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t;
+
+	t = resolve_ptr(btf, id, res_id);
+	if (t && btf_is_func_proto(t))
+		return t;
+
+	return NULL;
+}
+
 /*
  * Fetch integer attribute of BTF map definition. Such attributes are
  * represented using a pointer to an array, in which dimensionality of array
@@ -1509,7 +2077,7 @@ static void bpf_object__sanitize_btf_ext(struct bpf_object *obj)
 
 static bool bpf_object__is_btf_mandatory(const struct bpf_object *obj)
 {
-	return obj->efile.btf_maps_shndx >= 0;
+	return obj->efile.btf_maps_shndx >= 0 || obj->efile.st_ops_shndx >= 0;
 }
 
 static int bpf_object__init_btf(struct bpf_object *obj,
@@ -1689,6 +2257,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 			} else if (strcmp(name, ".rodata") == 0) {
 				obj->efile.rodata = data;
 				obj->efile.rodata_shndx = idx;
+			} else if (strcmp(name, BPF_STRUCT_OPS_SEC_NAME) == 0) {
+				obj->efile.st_ops_data = data;
+				obj->efile.st_ops_shndx = idx;
 			} else {
 				pr_debug("skip section(%d) %s\n", idx, name);
 			}
@@ -1698,7 +2269,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 			int sec = sh.sh_info; /* points to other section */
 
 			/* Only do relo for section with exec instructions */
-			if (!section_have_execinstr(obj, sec)) {
+			if (!section_have_execinstr(obj, sec) &&
+			    !strstr(name, BPF_STRUCT_OPS_SEC_NAME)) {
 				pr_debug("skip relo %s(%d) for section(%d)\n",
 					 name, idx, sec);
 				continue;
@@ -1735,6 +2307,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
 		err = bpf_object__sanitize_and_load_btf(obj);
 	if (!err)
 		err = bpf_object__init_prog_names(obj);
+	if (!err)
+		err = bpf_object__init_struct_ops(obj);
 	return err;
 }
 
@@ -3700,6 +4274,13 @@ static int bpf_object__collect_reloc(struct bpf_object *obj)
 			return -LIBBPF_ERRNO__INTERNAL;
 		}
 
+		if (idx == obj->efile.st_ops_shndx) {
+			err = bpf_object__collect_struct_ops_reloc(obj, shdr, data);
+			if (err)
+				return err;
+			continue;
+		}
+
 		prog = bpf_object__find_prog_by_idx(obj, idx);
 		if (!prog) {
 			pr_warn("relocation failed: no section(%d)\n", idx);
@@ -3734,7 +4315,9 @@ load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt,
 	load_attr.insns = insns;
 	load_attr.insns_cnt = insns_cnt;
 	load_attr.license = license;
-	if (prog->type == BPF_PROG_TYPE_TRACING) {
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+		load_attr.attach_btf_id = prog->attach_btf_id;
+	} else if (prog->type == BPF_PROG_TYPE_TRACING) {
 		load_attr.attach_prog_fd = prog->attach_prog_fd;
 		load_attr.attach_btf_id = prog->attach_btf_id;
 	} else {
@@ -3952,6 +4535,7 @@ __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz,
 	if (IS_ERR(obj))
 		return obj;
 
+	obj->st_ops.unreg = OPTS_GET(opts, unreg_st_ops, false);
 	obj->relaxed_core_relocs = OPTS_GET(opts, relaxed_core_relocs, false);
 	relaxed_maps = OPTS_GET(opts, relaxed_maps, false);
 	pin_root_path = OPTS_GET(opts, pin_root_path, NULL);
@@ -4077,6 +4661,10 @@ int bpf_object__unload(struct bpf_object *obj)
 	for (i = 0; i < obj->nr_programs; i++)
 		bpf_program__unload(&obj->programs[i]);
 
+	if (obj->st_ops.unreg)
+		bpf_struct_ops__unregister(&obj->st_ops);
+	zfree(&obj->st_ops.kern_vdata);
+
 	return 0;
 }
 
@@ -4100,7 +4688,9 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
 
 	CHECK_ERR(bpf_object__create_maps(obj), err, out);
 	CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out);
+	CHECK_ERR(bpf_object__prepare_struct_ops(obj), err, out);
 	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
+	CHECK_ERR(bpf_object__register_struct_ops(obj), err, out);
 
 	return 0;
 out:
@@ -4690,6 +5280,9 @@ void bpf_object__close(struct bpf_object *obj)
 			bpf_program__exit(&obj->programs[i]);
 	}
 	zfree(&obj->programs);
+	zfree(&obj->st_ops.data);
+	zfree(&obj->st_ops.progs);
+	zfree(&obj->st_ops.kern_func_off);
 
 	list_del(&obj->list);
 	free(obj);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 0dbf4bfba0c4..db255fce4948 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -109,8 +109,9 @@ struct bpf_object_open_opts {
 	 */
 	const char *pin_root_path;
 	__u32 attach_prog_fd;
+	bool unreg_st_ops;
 };
-#define bpf_object_open_opts__last_field attach_prog_fd
+#define bpf_object_open_opts__last_field unreg_st_ops
 
 LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
 LIBBPF_API struct bpf_object *
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index a9eb8b322671..7f06942e9574 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -103,6 +103,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 	case BPF_PROG_TYPE_TRACING:
+	case BPF_PROG_TYPE_STRUCT_OPS:
 	default:
 		break;
 	}
@@ -251,6 +252,7 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex)
 	case BPF_MAP_TYPE_XSKMAP:
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+	case BPF_MAP_TYPE_STRUCT_OPS:
 	default:
 		break;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 12/13] bpf: Add bpf_dctcp example
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (10 preceding siblings ...)
  2019-12-14  0:48 ` [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
@ 2019-12-14  0:48 ` Martin KaFai Lau
  2019-12-14  0:48 ` [PATCH bpf-next 13/13] bpf: Add bpf_cubic example Martin KaFai Lau
  2019-12-14  2:26 ` [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Eric Dumazet
  13 siblings, 0 replies; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch adds a bpf_dctcp example.  It currently does not implement
the no-ECN fallback, but the same could be done through cgrp2-bpf.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
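Note for reviewers: the test is registered with test_progs as
"bpf_tcp_ca", so something like "./test_progs -t bpf_tcp_ca" run from
tools/testing/selftests/bpf should exercise the dctcp subtest (assuming
a kernel built with the earlier patches of this series).
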
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 228 ++++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 198 +++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 194 +++++++++++++++
 3 files changed, 620 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bpf_tcp_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_dctcp.c

diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
new file mode 100644
index 000000000000..7ba8c1b4157a
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -0,0 +1,228 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __BPF_TCP_HELPERS_H
+#define __BPF_TCP_HELPERS_H
+
+#include <stdbool.h>
+#include <linux/types.h>
+#include <bpf_helpers.h>
+#include <bpf_core_read.h>
+#include "bpf_trace_helpers.h"
+
+#define BPF_TCP_OPS_0(fname, ret_type, ...) BPF_TRACE_x(0, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_1(fname, ret_type, ...) BPF_TRACE_x(1, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_2(fname, ret_type, ...) BPF_TRACE_x(2, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_3(fname, ret_type, ...) BPF_TRACE_x(3, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_4(fname, ret_type, ...) BPF_TRACE_x(4, #fname"_sec", fname, ret_type, __VA_ARGS__)
+#define BPF_TCP_OPS_5(fname, ret_type, ...) BPF_TRACE_x(5, #fname"_sec", fname, ret_type, __VA_ARGS__)
+
+struct sock_common {
+	unsigned char	skc_state;
+} __attribute__((preserve_access_index));
+
+struct sock {
+	struct sock_common	__sk_common;
+} __attribute__((preserve_access_index));
+
+struct inet_sock {
+	struct sock		sk;
+} __attribute__((preserve_access_index));
+
+struct inet_connection_sock {
+	struct inet_sock	  icsk_inet;
+	__u8			  icsk_ca_state:6,
+				  icsk_ca_setsockopt:1,
+				  icsk_ca_dst_locked:1;
+	struct {
+		__u8		  pending;
+	} icsk_ack;
+	__u64			  icsk_ca_priv[104 / sizeof(__u64)];
+} __attribute__((preserve_access_index));
+
+struct tcp_sock {
+	struct inet_connection_sock	inet_conn;
+
+	__u32	rcv_nxt;
+	__u32	snd_nxt;
+	__u32	snd_una;
+	__u8	ecn_flags;
+	__u32	delivered;
+	__u32	delivered_ce;
+	__u32	snd_cwnd;
+	__u32	snd_cwnd_cnt;
+	__u32	snd_cwnd_clamp;
+	__u32	snd_ssthresh;
+	__u8	syn_data:1,	/* SYN includes data */
+		syn_fastopen:1,	/* SYN includes Fast Open option */
+		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
+		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
+		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
+		save_syn:1,	/* Save headers of SYN packet */
+		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
+		syn_smc:1;	/* SYN includes SMC */
+	__u32	max_packets_out;
+	__u32	lsndtime;
+	__u32	prior_cwnd;
+} __attribute__((preserve_access_index));
+
+static __always_inline struct inet_connection_sock *inet_csk(const struct sock *sk)
+{
+	return (struct inet_connection_sock *)sk;
+}
+
+static __always_inline void *inet_csk_ca(const struct sock *sk)
+{
+	return (void *)inet_csk(sk)->icsk_ca_priv;
+}
+
+static __always_inline struct tcp_sock *tcp_sk(const struct sock *sk)
+{
+	return (struct tcp_sock *)sk;
+}
+
+static __always_inline bool before(__u32 seq1, __u32 seq2)
+{
+	return (__s32)(seq1-seq2) < 0;
+}
+#define after(seq2, seq1) 	before(seq1, seq2)
+
+#define	TCP_ECN_OK		1
+#define	TCP_ECN_QUEUE_CWR	2
+#define	TCP_ECN_DEMAND_CWR	4
+#define	TCP_ECN_SEEN		8
+
+enum inet_csk_ack_state_t {
+	ICSK_ACK_SCHED	= 1,
+	ICSK_ACK_TIMER  = 2,
+	ICSK_ACK_PUSHED = 4,
+	ICSK_ACK_PUSHED2 = 8,
+	ICSK_ACK_NOW = 16	/* Send the next ACK immediately (once) */
+};
+
+enum tcp_ca_event {
+	CA_EVENT_TX_START = 0,
+	CA_EVENT_CWND_RESTART = 1,
+	CA_EVENT_COMPLETE_CWR = 2,
+	CA_EVENT_LOSS = 3,
+	CA_EVENT_ECN_NO_CE = 4,
+	CA_EVENT_ECN_IS_CE = 5,
+};
+
+enum tcp_ca_state {
+	TCP_CA_Open = 0,
+	TCP_CA_Disorder = 1,
+	TCP_CA_CWR = 2,
+	TCP_CA_Recovery = 3,
+	TCP_CA_Loss = 4
+};
+
+struct ack_sample {
+	__u32 pkts_acked;
+	__s32 rtt_us;
+	__u32 in_flight;
+} __attribute__((preserve_access_index));
+
+struct rate_sample {
+	__u64  prior_mstamp; /* starting timestamp for interval */
+	__u32  prior_delivered;	/* tp->delivered at "prior_mstamp" */
+	__s32  delivered;		/* number of packets delivered over interval */
+	long interval_us;	/* time for tp->delivered to incr "delivered" */
+	__u32 snd_interval_us;	/* snd interval for delivered packets */
+	__u32 rcv_interval_us;	/* rcv interval for delivered packets */
+	long rtt_us;		/* RTT of last (S)ACKed packet (or -1) */
+	int  losses;		/* number of packets marked lost upon ACK */
+	__u32  acked_sacked;	/* number of packets newly (S)ACKed upon ACK */
+	__u32  prior_in_flight;	/* in flight before this ACK */
+	bool is_app_limited;	/* is sample from packet with bubble in pipe? */
+	bool is_retrans;	/* is sample from retransmission? */
+	bool is_ack_delayed;	/* is this (likely) a delayed ACK? */
+} __attribute__((preserve_access_index));
+
+#define TCP_CA_NAME_MAX		16
+#define TCP_CONG_NEEDS_ECN	0x2
+
+struct tcp_congestion_ops {
+	__u32 flags;
+
+	/* initialize private data (optional) */
+	void (*init)(struct sock *sk);
+	/* cleanup private data  (optional) */
+	void (*release)(struct sock *sk);
+
+	/* return slow start threshold (required) */
+	__u32 (*ssthresh)(struct sock *sk);
+	/* do new cwnd calculation (required) */
+	void (*cong_avoid)(struct sock *sk, __u32 ack, __u32 acked);
+	/* call before changing ca_state (optional) */
+	void (*set_state)(struct sock *sk, __u8 new_state);
+	/* call when cwnd event occurs (optional) */
+	void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
+	/* call when ack arrives (optional) */
+	void (*in_ack_event)(struct sock *sk, __u32 flags);
+	/* new value of cwnd after loss (required) */
+	__u32  (*undo_cwnd)(struct sock *sk);
+	/* hook for packet ack accounting (optional) */
+	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
+	/* override sysctl_tcp_min_tso_segs */
+	__u32 (*min_tso_segs)(struct sock *sk);
+	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
+	__u32 (*sndbuf_expand)(struct sock *sk);
+	/* call when packets are delivered to update cwnd and pacing rate,
+	 * after all the ca_state processing. (optional)
+	 */
+	void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
+
+	char 		name[TCP_CA_NAME_MAX];
+};
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+#define max(a, b) ((a) > (b) ? (a) : (b))
+#define min_not_zero(x, y) ({			\
+	typeof(x) __x = (x);			\
+	typeof(y) __y = (y);			\
+	__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
+
+static __always_inline __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked)
+{
+	__u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);
+
+	acked -= cwnd - tp->snd_cwnd;
+	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
+
+	return acked;
+}
+
+static __always_inline bool tcp_in_slow_start(const struct tcp_sock *tp)
+{
+	return tp->snd_cwnd < tp->snd_ssthresh;
+}
+
+static __always_inline bool tcp_is_cwnd_limited(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	/* If in slow start, ensure cwnd grows to twice what was ACKed. */
+	if (tcp_in_slow_start(tp))
+		return tp->snd_cwnd < 2 * tp->max_packets_out;
+
+	return !!BPF_CORE_READ_BITFIELD(tp, is_cwnd_limited);
+}
+
+static __always_inline void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, __u32 acked)
+{
+	/* If credits accumulated at a higher w, apply them gently now. */
+	if (tp->snd_cwnd_cnt >= w) {
+		tp->snd_cwnd_cnt = 0;
+		tp->snd_cwnd++;
+	}
+
+	tp->snd_cwnd_cnt += acked;
+	if (tp->snd_cwnd_cnt >= w) {
+		__u32 delta = tp->snd_cwnd_cnt / w;
+
+		tp->snd_cwnd_cnt -= delta * w;
+		tp->snd_cwnd += delta;
+	}
+	tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
new file mode 100644
index 000000000000..035de76bf8ed
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+#include <linux/err.h>
+#include <test_progs.h>
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+static const struct timeval timeo_sec = { .tv_sec = 10 };
+static const size_t timeo_optlen = sizeof(timeo_sec);
+static int stop, duration;
+
+static int settimeo(int fd)
+{
+	int err;
+
+	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo_sec,
+			 timeo_optlen);
+	if (CHECK(err == -1, "setsockopt(fd, SO_RCVTIMEO)", "errno:%d\n",
+		  errno))
+		return -1;
+
+	err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo_sec,
+			 timeo_optlen);
+	if (CHECK(err == -1, "setsockopt(fd, SO_SNDTIMEO)", "errno:%d\n",
+		  errno))
+		return -1;
+
+	return 0;
+}
+
+static int settcpca(int fd, const char *tcp_ca)
+{
+	int err;
+
+	err = setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, tcp_ca, strlen(tcp_ca));
+	if (CHECK(err == -1, "setsockopt(fd, TCP_CONGESTION)", "errno:%d\n",
+		  errno))
+		return -1;
+
+	return 0;
+}
+
+static void *server(void *arg)
+{
+	int lfd = (int)(long)arg, err = 0, fd;
+	ssize_t nr_sent = 0, bytes = 0;
+	char batch[1500];
+
+	fd = accept(lfd, NULL, NULL);
+	while (fd == -1) {
+		if (errno == EINTR)
+			continue;
+		err = -errno;
+		goto done;
+	}
+
+	if (settimeo(fd)) {
+		err = -errno;
+		goto done;
+	}
+
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_sent = send(fd, &batch,
+			       min(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_sent == -1 && errno == EINTR)
+			continue;
+		if (nr_sent == -1) {
+			err = -errno;
+			break;
+		}
+		bytes += nr_sent;
+	}
+
+	CHECK(bytes != total_bytes, "send", "%zd != %u nr_sent:%zd errno:%d\n",
+	      bytes, total_bytes, nr_sent, errno);
+
+done:
+	if (fd != -1)
+		close(fd);
+	if (err) {
+		WRITE_ONCE(stop, 1);
+		return ERR_PTR(err);
+	}
+	return NULL;
+}
+
+static void do_test(const char *tcp_ca)
+{
+	struct sockaddr_in6 sa6 = {};
+	ssize_t nr_recv = 0, bytes = 0;
+	int lfd = -1, fd = -1;
+	pthread_t srv_thread;
+	socklen_t addrlen = sizeof(sa6);
+	void *thread_ret;
+	char batch[1500];
+	int err;
+
+	WRITE_ONCE(stop, 0);
+
+	lfd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (CHECK(lfd == -1, "socket", "errno:%d\n", errno))
+		return;
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (CHECK(fd == -1, "socket", "errno:%d\n", errno)) {
+		close(lfd);
+		return;
+	}
+
+	if (settcpca(lfd, tcp_ca) || settcpca(fd, tcp_ca) ||
+	    settimeo(lfd) || settimeo(fd))
+		goto done;
+
+	/* bind, listen and start server thread to accept */
+	sa6.sin6_family = AF_INET6;
+	sa6.sin6_addr = in6addr_loopback;
+	err = bind(lfd, (struct sockaddr *)&sa6, addrlen);
+	if (CHECK(err == -1, "bind", "errno:%d\n", errno))
+		goto done;
+	err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
+	if (CHECK(err == -1, "getsockname", "errno:%d\n", errno))
+		goto done;
+	err = listen(lfd, 1);
+	if (CHECK(err == -1, "listen", "errno:%d\n", errno))
+		goto done;
+	err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
+	if (CHECK(err != 0, "pthread_create", "err:%d\n", err))
+		goto done;
+
+	/* connect to server */
+	err = connect(fd, (struct sockaddr *)&sa6, addrlen);
+	if (CHECK(err == -1, "connect", "errno:%d\n", errno))
+		goto wait_thread;
+
+	/* recv total_bytes */
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_recv = recv(fd, &batch,
+			       min(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_recv == -1 && errno == EINTR)
+			continue;
+		if (nr_recv == -1)
+			break;
+		bytes += nr_recv;
+	}
+
+	CHECK(bytes != total_bytes, "recv", "%zd != %u nr_recv:%zd errno:%d\n",
+	      bytes, total_bytes, nr_recv, errno);
+
+wait_thread:
+	WRITE_ONCE(stop, 1);
+	pthread_join(srv_thread, &thread_ret);
+	CHECK(IS_ERR(thread_ret), "pthread_join", "thread_ret:%ld",
+	      PTR_ERR(thread_ret));
+done:
+	close(lfd);
+	close(fd);
+}
+
+static struct bpf_object *load(const char *filename)
+{
+	DECLARE_LIBBPF_OPTS(bpf_object_open_opts, open_opts,
+		.unreg_st_ops = true,
+	);
+	struct bpf_object *obj;
+	int err;
+
+	obj = bpf_object__open_file(filename, &open_opts);
+	if (CHECK(IS_ERR(obj), "bpf_obj__open_file", "obj:%ld\n",
+		  PTR_ERR(obj)))
+		return obj;
+
+	err = bpf_object__load(obj);
+	if (CHECK(err, "bpf_object__load", "err:%d\n", err)) {
+		bpf_object__close(obj);
+		return ERR_PTR(err);
+	}
+
+	return obj;
+}
+
+static void test_dctcp(void)
+{
+	struct bpf_object *obj;
+
+	obj = load("bpf_dctcp.o");
+	if (IS_ERR(obj))
+		return;
+
+	do_test("bpf_dctcp");
+
+	bpf_object__close(obj);
+}
+
+void test_bpf_tcp_ca(void)
+{
+	if (test__start_subtest("dctcp"))
+		test_dctcp();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
new file mode 100644
index 000000000000..794954832adc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -0,0 +1,194 @@
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define DCTCP_MAX_ALPHA	1024U
+
+struct dctcp {
+	__u32 old_delivered;
+	__u32 old_delivered_ce;
+	__u32 prior_rcv_nxt;
+	__u32 dctcp_alpha;
+	__u32 next_seq;
+	__u32 ce_state;
+	__u32 loss_cwnd;
+};
+
+static unsigned int dctcp_shift_g = 4; /* g = 1/2^4 */
+static unsigned int dctcp_alpha_on_init = DCTCP_MAX_ALPHA;
+
+static __always_inline void dctcp_reset(const struct tcp_sock *tp,
+					struct dctcp *ca)
+{
+	ca->next_seq = tp->snd_nxt;
+
+	ca->old_delivered = tp->delivered;
+	ca->old_delivered_ce = tp->delivered_ce;
+}
+
+BPF_TCP_OPS_1(dctcp_init, void, struct sock *, sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct dctcp *ca = inet_csk_ca(sk);
+
+	ca->prior_rcv_nxt = tp->rcv_nxt;
+	ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
+	ca->loss_cwnd = 0;
+	ca->ce_state = 0;
+
+	dctcp_reset(tp, ca);
+}
+
+BPF_TCP_OPS_1(dctcp_ssthresh, __u32, struct sock *, sk)
+{
+	struct dctcp *ca = inet_csk_ca(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	ca->loss_cwnd = tp->snd_cwnd;
+	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
+}
+
+BPF_TCP_OPS_2(dctcp_update_alpha, void,
+	      struct sock *, sk, __u32, flags)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct dctcp *ca = inet_csk_ca(sk);
+
+	/* Expired RTT */
+	if (!before(tp->snd_una, ca->next_seq)) {
+		__u32 delivered_ce = tp->delivered_ce - ca->old_delivered_ce;
+		__u32 alpha = ca->dctcp_alpha;
+
+		/* alpha = (1 - g) * alpha + g * F */
+
+		alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g);
+		if (delivered_ce) {
+			__u32 delivered = tp->delivered - ca->old_delivered;
+
+			/* If dctcp_shift_g == 1, a 32bit value would overflow
+			 * after 8 M packets.
+			 */
+			delivered_ce <<= (10 - dctcp_shift_g);
+			delivered_ce /= max(1U, delivered);
+
+			alpha = min(alpha + delivered_ce, DCTCP_MAX_ALPHA);
+		}
+		ca->dctcp_alpha = alpha;
+		dctcp_reset(tp, ca);
+	}
+}
+
+static __always_inline void dctcp_react_to_loss(struct sock *sk)
+{
+	struct dctcp *ca = inet_csk_ca(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	ca->loss_cwnd = tp->snd_cwnd;
+	tp->snd_ssthresh = max(tp->snd_cwnd >> 1U, 2U);
+}
+
+BPF_TCP_OPS_2(dctcp_state, void, struct sock *, sk, __u8, new_state)
+{
+	if (new_state == TCP_CA_Recovery &&
+	    new_state != BPF_CORE_READ_BITFIELD(inet_csk(sk), icsk_ca_state))
+		dctcp_react_to_loss(sk);
+	/* We handle RTO in dctcp_cwnd_event to ensure that we perform only
+	 * one loss-adjustment per RTT.
+	 */
+}
+
+static __always_inline void dctcp_ece_ack_cwr(struct sock *sk, __u32 ce_state)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (ce_state == 1)
+		tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+	else
+		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+}
+
+/* Minimal DCTCP CE state machine:
+ *
+ * S:	0 <- last pkt was non-CE
+ *	1 <- last pkt was CE
+ */
+static __always_inline
+void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
+			  __u32 *prior_rcv_nxt, __u32 *ce_state)
+{
+	__u32 new_ce_state = (evt == CA_EVENT_ECN_IS_CE) ? 1 : 0;
+
+	if (*ce_state != new_ce_state) {
+		/* CE state has changed, force an immediate ACK to
+		 * reflect the new CE state. If an ACK was delayed,
+		 * send that first to reflect the prior CE state.
+		 */
+		if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
+			dctcp_ece_ack_cwr(sk, *ce_state);
+			bpf_tcp_send_ack(sk, *prior_rcv_nxt);
+		}
+		inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+	}
+	*prior_rcv_nxt = tcp_sk(sk)->rcv_nxt;
+	*ce_state = new_ce_state;
+	dctcp_ece_ack_cwr(sk, new_ce_state);
+}
+
+BPF_TCP_OPS_2(dctcp_cwnd_event, void,
+	      struct sock *, sk, enum tcp_ca_event, ev)
+{
+	struct dctcp *ca = inet_csk_ca(sk);
+
+	switch (ev) {
+	case CA_EVENT_ECN_IS_CE:
+	case CA_EVENT_ECN_NO_CE:
+		dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state);
+		break;
+	case CA_EVENT_LOSS:
+		dctcp_react_to_loss(sk);
+		break;
+	default:
+		/* Don't care for the rest. */
+		break;
+	}
+}
+
+BPF_TCP_OPS_1(dctcp_cwnd_undo, __u32, struct sock *, sk)
+{
+	const struct dctcp *ca = inet_csk_ca(sk);
+
+	return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd);
+}
+
+BPF_TCP_OPS_3(tcp_reno_cong_avoid, void,
+	      struct sock *, sk, __u32, ack, __u32, acked)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tcp_is_cwnd_limited(sk))
+		return;
+
+	/* In "safe" area, increase. */
+	if (tcp_in_slow_start(tp)) {
+		acked = tcp_slow_start(tp, acked);
+		if (!acked)
+			return;
+	}
+	/* In dangerous area, increase slowly. */
+	tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
+}
+
+SEC("struct_ops")
+struct tcp_congestion_ops dctcp = {
+	.init		= (void *)dctcp_init,
+	.in_ack_event   = (void *)dctcp_update_alpha,
+	.cwnd_event	= (void *)dctcp_cwnd_event,
+	.ssthresh	= (void *)dctcp_ssthresh,
+	.cong_avoid	= (void *)tcp_reno_cong_avoid,
+	.undo_cwnd	= (void *)dctcp_cwnd_undo,
+	.set_state	= (void *)dctcp_state,
+	.flags		= TCP_CONG_NEEDS_ECN,
+	.name		= "bpf_dctcp",
+};
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next 13/13] bpf: Add bpf_cubic example
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (11 preceding siblings ...)
  2019-12-14  0:48 ` [PATCH bpf-next 12/13] bpf: Add bpf_dctcp example Martin KaFai Lau
@ 2019-12-14  0:48 ` Martin KaFai Lau
  2019-12-14  2:26 ` [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Eric Dumazet
  13 siblings, 0 replies; 51+ messages in thread
From: Martin KaFai Lau @ 2019-12-14  0:48 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev

This patch adds the bpf_cubic TCP congestion control example.
The CONFIG_HZ=1000 requirement will go away when
the libbpf extern-var support is ready.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
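Note: the cubic subtest greps /proc/config.gz for CONFIG_HZ=1000 and
test__skip()s otherwise, so on kernels built with a different HZ it
should simply show up as skipped rather than fail.
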
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |  22 +
 tools/testing/selftests/bpf/progs/bpf_cubic.c | 502 ++++++++++++++++++
 2 files changed, 524 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_cubic.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index 035de76bf8ed..3d787ecafa4a 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -178,6 +178,26 @@ static struct bpf_object *load(const char *filename)
 	return obj;
 }
 
+static void test_cubic(void)
+{
+	struct bpf_object *obj;
+	int err;
+
+	err = system("zgrep 'CONFIG_HZ=1000$' /proc/config.gz >& /dev/null");
+	if (err) {
+		test__skip();
+		return;
+	}
+
+	obj = load("bpf_cubic.o");
+	if (IS_ERR(obj))
+		return;
+
+	do_test("bpf_cubic");
+
+	bpf_object__close(obj);
+}
+
 static void test_dctcp(void)
 {
 	struct bpf_object *obj;
@@ -195,4 +215,6 @@ void test_bpf_tcp_ca(void)
 {
 	if (test__start_subtest("dctcp"))
 		test_dctcp();
+	if (test__start_subtest("cubic"))
+		test_cubic();
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c b/tools/testing/selftests/bpf/progs/bpf_cubic.c
new file mode 100644
index 000000000000..ca77d6a34406
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/bpf.h>
+#include "bpf_tcp_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
+
+#define BICTCP_BETA_SCALE    1024	/* Scale factor beta calculation
+					 * max_cwnd = snd_cwnd * beta
+					 */
+#define	BICTCP_HZ		10	/* BIC HZ 2^10 = 1024 */
+
+/* Two methods of hybrid slow start */
+#define HYSTART_ACK_TRAIN	0x1
+#define HYSTART_DELAY		0x2
+
+/* Number of delay samples for detecting the increase of delay */
+#define HYSTART_MIN_SAMPLES	8
+#define HYSTART_DELAY_MIN	(4U<<3)
+#define HYSTART_DELAY_MAX	(16U<<3)
+#define HYSTART_DELAY_THRESH(x)	clamp(x, HYSTART_DELAY_MIN, HYSTART_DELAY_MAX)
+
+static int fast_convergence = 1;
+static const int beta = 717;	/* = 717/1024 (BICTCP_BETA_SCALE) */
+static int initial_ssthresh;
+static const int bic_scale = 41;
+static int tcp_friendliness = 1;
+
+static int hystart = 1;
+static int hystart_detect = HYSTART_ACK_TRAIN | HYSTART_DELAY;
+static int hystart_low_window = 16;
+static int hystart_ack_delta = 2;
+
+static __u32 cube_rtt_scale = (bic_scale * 10);	/* 1024*c/rtt */
+static __u32 beta_scale = 8*(BICTCP_BETA_SCALE+beta) / 3
+		/ (BICTCP_BETA_SCALE - beta);
+/* calculate the "K" for (wmax-cwnd) = c/rtt * K^3
+ *  so K = cubic_root( (wmax-cwnd)*rtt/c )
+ * the unit of K is bictcp_HZ=2^10, not HZ
+ *
+ *  c = bic_scale >> 10
+ *  rtt = 100ms
+ *
+ * the following code has been designed and tested for
+ * cwnd < 1 million packets
+ * RTT < 100 seconds
+ * HZ < 1,000,00  (corresponding to 10 nano-second)
+ */
+
+/* 1/c * 2^2*bictcp_HZ * srtt, 2^40 */
+static __u64 cube_factor = (__u64)(1ull << (10+3*BICTCP_HZ)) / (bic_scale * 10);
+
+/* BIC TCP Parameters */
+struct bictcp {
+	__u32	cnt;		/* increase cwnd by 1 after ACKs */
+	__u32	last_max_cwnd;	/* last maximum snd_cwnd */
+	__u32	last_cwnd;	/* the last snd_cwnd */
+	__u32	last_time;	/* time when updated last_cwnd */
+	__u32	bic_origin_point;/* origin point of bic function */
+	__u32	bic_K;		/* time to origin point
+				   from the beginning of the current epoch */
+	__u32	delay_min;	/* min delay (msec << 3) */
+	__u32	epoch_start;	/* beginning of an epoch */
+	__u32	ack_cnt;	/* number of acks */
+	__u32	tcp_cwnd;	/* estimated tcp cwnd */
+	__u16	unused;
+	__u8	sample_cnt;	/* number of samples to decide curr_rtt */
+	__u8	found;		/* the exit point is found? */
+	__u32	round_start;	/* beginning of each round */
+	__u32	end_seq;	/* end_seq of the round */
+	__u32	last_ack;	/* last time when the ACK spacing is close */
+	__u32	curr_rtt;	/* the minimum rtt of current round */
+};
+
+static __always_inline void bictcp_reset(struct bictcp *ca)
+{
+	ca->cnt = 0;
+	ca->last_max_cwnd = 0;
+	ca->last_cwnd = 0;
+	ca->last_time = 0;
+	ca->bic_origin_point = 0;
+	ca->bic_K = 0;
+	ca->delay_min = 0;
+	ca->epoch_start = 0;
+	ca->ack_cnt = 0;
+	ca->tcp_cwnd = 0;
+	ca->found = 0;
+}
+
+#define HZ 1000UL
+#define NSEC_PER_MSEC	1000000UL
+#define USEC_PER_MSEC	1000UL
+
+static __always_inline __u64 msecs_to_jiffies(__u32 m)
+{
+	return bpf_jiffies((__u64)m * NSEC_PER_MSEC, BPF_F_NS_TO_JIFFIES);
+}
+
+static __always_inline __u32 bictcp_clock(void)
+{
+	return bpf_jiffies(0, BPF_F_JIFFIES_TO_NS) / NSEC_PER_MSEC;
+}
+
+#define tcp_jiffies32 ((__u32)bpf_jiffies(0, 0))
+
+static __always_inline void bictcp_hystart_reset(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	ca->round_start = ca->last_ack = bictcp_clock();
+	ca->end_seq = tp->snd_nxt;
+	ca->curr_rtt = 0;
+	ca->sample_cnt = 0;
+}
+
+BPF_TCP_OPS_1(bictcp_init, void, struct sock *, sk)
+{
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	bictcp_reset(ca);
+
+	if (hystart)
+		bictcp_hystart_reset(sk);
+
+	if (!hystart && initial_ssthresh)
+		tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
+}
+
+BPF_TCP_OPS_2(bictcp_cwnd_event, void, struct sock *, sk,
+	      enum tcp_ca_event, event)
+{
+	if (event == CA_EVENT_TX_START) {
+		struct bictcp *ca = inet_csk_ca(sk);
+		__u32 now = tcp_jiffies32;
+		__s32 delta;
+
+		delta = now - tcp_sk(sk)->lsndtime;
+
+		/* We were application limited (idle) for a while.
+		 * Shift epoch_start to keep cwnd growth to cubic curve.
+		 */
+		if (ca->epoch_start && delta > 0) {
+			ca->epoch_start += delta;
+			if (after(ca->epoch_start, now))
+				ca->epoch_start = now;
+		}
+		return;
+	}
+}
+
+#define BITS_PER_LONG (sizeof(long) * 8)
+static __always_inline unsigned long __fls(unsigned long word)
+{
+	int num = BITS_PER_LONG - 1;
+
+	if (!(word & (~0ul << 32))) {
+		num -= 32;
+		word <<= 32;
+	}
+
+	if (!(word & (~0ul << (BITS_PER_LONG-16)))) {
+		num -= 16;
+		word <<= 16;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-8)))) {
+		num -= 8;
+		word <<= 8;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-4)))) {
+		num -= 4;
+		word <<= 4;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-2)))) {
+		num -= 2;
+		word <<= 2;
+	}
+	if (!(word & (~0ul << (BITS_PER_LONG-1))))
+		num -= 1;
+	return num;
+}
+
+static __always_inline int fls64(__u64 x)
+{
+	if (x == 0)
+		return 0;
+	return __fls(x) + 1;
+}
+
+static __always_inline __u64 div64_u64(__u64 dividend, __u64 divisor)
+{
+	return dividend / divisor;
+}
+
+/*
+ * cbrt(x) MSB values for x MSB values in [0..63].
+ * Precomputed then refined by hand - Willy Tarreau
+ *
+ * For x in [0..63],
+ *   v = cbrt(x << 18) - 1
+ *   cbrt(x) = (v[x] + 10) >> 6
+ */
+static const __u8 v[] = {
+	/* 0x00 */    0,   54,   54,   54,  118,  118,  118,  118,
+	/* 0x08 */  123,  129,  134,  138,  143,  147,  151,  156,
+	/* 0x10 */  157,  161,  164,  168,  170,  173,  176,  179,
+	/* 0x18 */  181,  185,  187,  190,  192,  194,  197,  199,
+	/* 0x20 */  200,  202,  204,  206,  209,  211,  213,  215,
+	/* 0x28 */  217,  219,  221,  222,  224,  225,  227,  229,
+	/* 0x30 */  231,  232,  234,  236,  237,  239,  240,  242,
+	/* 0x38 */  244,  245,  246,  248,  250,  251,  252,  254,
+};
+
+/* calculate the cubic root of x using a table lookup followed by one
+ * Newton-Raphson iteration.
+ * Avg err ~= 0.195%
+ */
+static __always_inline __u32 cubic_root(__u64 a)
+{
+	__u32 x, b, shift;
+	b = fls64(a);
+	if (a < 64) {
+		/* a in [0..63] */
+		return ((__u32)v[(__u32)a] + 35) >> 6;
+	}
+
+	/* b >= 7 */
+
+	b = ((b * 84) >> 8) - 1;
+	shift = (a >> (b * 3));
+
+	/* it is needed for verifier's bound check on v */
+	if (shift >= 64)
+		return 0;
+
+	x = ((__u32)(((__u32)v[shift] + 10) << b)) >> 6;
+
+	/*
+	 * Newton-Raphson iteration
+	 *                         2
+	 * x    = ( 2 * x  +  a / x  ) / 3
+	 *  k+1          k         k
+	 */
+	x = (2 * x + (__u32)div64_u64(a, (__u64)x * (__u64)(x - 1)));
+	x = ((x * 341) >> 10);
+	return x;
+}
+
+/*
+ * Compute congestion window to use.
+ */
+static __always_inline void bictcp_update(struct bictcp *ca, __u32 cwnd,
+					  __u32 acked)
+{
+	__u32 delta, bic_target, max_cnt;
+	__u64 offs, t;
+
+	ca->ack_cnt += acked;	/* count the number of ACKed packets */
+
+	if (ca->last_cwnd == cwnd &&
+	    (__s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
+		return;
+
+	/* The CUBIC function can update ca->cnt at most once per jiffy.
+	 * On all cwnd reduction events, ca->epoch_start is set to 0,
+	 * which will force a recalculation of ca->cnt.
+	 */
+	if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
+		goto tcp_friendliness;
+
+	ca->last_cwnd = cwnd;
+	ca->last_time = tcp_jiffies32;
+
+	if (ca->epoch_start == 0) {
+		ca->epoch_start = tcp_jiffies32;	/* record beginning */
+		ca->ack_cnt = acked;			/* start counting */
+		ca->tcp_cwnd = cwnd;			/* sync with cubic */
+
+		if (ca->last_max_cwnd <= cwnd) {
+			ca->bic_K = 0;
+			ca->bic_origin_point = cwnd;
+		} else {
+			/* Compute new K based on
+			 * (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ)
+			 */
+			ca->bic_K = cubic_root(cube_factor
+					       * (ca->last_max_cwnd - cwnd));
+			ca->bic_origin_point = ca->last_max_cwnd;
+		}
+	}
+
+	/* cubic function - calc*/
+	/* calculate c * time^3 / rtt,
+	 *  while considering overflow in calculation of time^3
+	 * (so time^3 is done by using 64 bit)
+	 * and without the support of division of 64bit numbers
+	 * (so all divisions are done by using 32 bit)
+	 *  also NOTE the unit of those variables
+	 *	  time  = (t - K) / 2^bictcp_HZ
+	 *	  c = bic_scale >> 10
+	 * rtt  = (srtt >> 3) / HZ
+	 * !!! The following code does not have overflow problems,
+	 * if the cwnd < 1 million packets !!!
+	 */
+
+	t = (__s32)(tcp_jiffies32 - ca->epoch_start);
+	t += msecs_to_jiffies(ca->delay_min >> 3);
+	/* change the unit from HZ to bictcp_HZ */
+	t <<= BICTCP_HZ;
+	t /= HZ;
+
+	if (t < ca->bic_K)		/* t - K */
+		offs = ca->bic_K - t;
+	else
+		offs = t - ca->bic_K;
+
+	/* c/rtt * (t-K)^3 */
+	delta = (cube_rtt_scale * offs * offs * offs) >> (10+3*BICTCP_HZ);
+	if (t < ca->bic_K)                            /* below origin*/
+		bic_target = ca->bic_origin_point - delta;
+	else                                          /* above origin*/
+		bic_target = ca->bic_origin_point + delta;
+
+	/* cubic function - calc bictcp_cnt*/
+	if (bic_target > cwnd) {
+		ca->cnt = cwnd / (bic_target - cwnd);
+	} else {
+		ca->cnt = 100 * cwnd;              /* very small increment*/
+	}
+
+	/*
+	 * The initial growth of cubic function may be too conservative
+	 * when the available bandwidth is still unknown.
+	 */
+	if (ca->last_max_cwnd == 0 && ca->cnt > 20)
+		ca->cnt = 20;	/* increase cwnd 5% per RTT */
+
+tcp_friendliness:
+	/* TCP Friendly */
+	if (tcp_friendliness) {
+		__u32 scale = beta_scale;
+		__u32 n;
+
+		/* update tcp cwnd */
+		delta = (cwnd * scale) >> 3;
+		if (delta) {
+			n = ca->ack_cnt / delta;
+			ca->ack_cnt -= n * delta;
+			ca->tcp_cwnd += n;
+		}
+
+		if (ca->tcp_cwnd > cwnd) {	/* if bic is slower than tcp */
+			delta = ca->tcp_cwnd - cwnd;
+			max_cnt = cwnd / delta;
+			if (ca->cnt > max_cnt)
+				ca->cnt = max_cnt;
+		}
+	}
+
+	/* The maximum rate of cwnd increase CUBIC allows is 1 packet per
+	 * 2 packets ACKed, meaning cwnd grows at 1.5x per RTT.
+	 */
+	ca->cnt = max(ca->cnt, 2U);
+}
+
+BPF_TCP_OPS_3(bictcp_cong_avoid, void, struct sock *, sk,
+	      __u32, ack, __u32, acked)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	if (!tcp_is_cwnd_limited(sk))
+		return;
+
+	if (tcp_in_slow_start(tp)) {
+		if (hystart && after(ack, ca->end_seq))
+			bictcp_hystart_reset(sk);
+		acked = tcp_slow_start(tp, acked);
+		if (!acked)
+			return;
+	}
+	bictcp_update(ca, tp->snd_cwnd, acked);
+	tcp_cong_avoid_ai(tp, ca->cnt, acked);
+}
+
+BPF_TCP_OPS_1(bictcp_recalc_ssthresh, __u32,
+	      struct sock *, sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	ca->epoch_start = 0;	/* end of epoch */
+
+	/* Wmax and fast convergence */
+	if (tp->snd_cwnd < ca->last_max_cwnd && fast_convergence)
+		ca->last_max_cwnd = (tp->snd_cwnd * (BICTCP_BETA_SCALE + beta))
+			/ (2 * BICTCP_BETA_SCALE);
+	else
+		ca->last_max_cwnd = tp->snd_cwnd;
+
+	return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
+}
+
+BPF_TCP_OPS_2(bictcp_state, void, struct sock *, sk,
+	      __u8, new_state)
+{
+	if (new_state == TCP_CA_Loss) {
+		bictcp_reset(inet_csk_ca(sk));
+		bictcp_hystart_reset(sk);
+	}
+}
+
+static __always_inline void hystart_update(struct sock *sk, __u32 delay)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+
+	if (ca->found & hystart_detect)
+		return;
+
+	if (hystart_detect & HYSTART_ACK_TRAIN) {
+		__u32 now = bictcp_clock();
+
+		/* first detection parameter - ack-train detection */
+		if ((__s32)(now - ca->last_ack) <= hystart_ack_delta) {
+			ca->last_ack = now;
+			if ((__s32)(now - ca->round_start) > ca->delay_min >> 4) {
+				ca->found |= HYSTART_ACK_TRAIN;
+				tp->snd_ssthresh = tp->snd_cwnd;
+			}
+		}
+	}
+
+	if (hystart_detect & HYSTART_DELAY) {
+		/* obtain the minimum delay of more than sampling packets */
+		if (ca->sample_cnt < HYSTART_MIN_SAMPLES) {
+			if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
+				ca->curr_rtt = delay;
+
+			ca->sample_cnt++;
+		} else {
+			if (ca->curr_rtt > ca->delay_min +
+			    HYSTART_DELAY_THRESH(ca->delay_min >> 3)) {
+				ca->found |= HYSTART_DELAY;
+				tp->snd_ssthresh = tp->snd_cwnd;
+			}
+		}
+	}
+}
+
+/* Track delayed acknowledgment ratio using sliding window
+ * ratio = (15*ratio + sample) / 16
+ */
+BPF_TCP_OPS_2(bictcp_acked, void, struct sock *, sk,
+	      const struct ack_sample *, sample)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	struct bictcp *ca = inet_csk_ca(sk);
+	__u32 delay;
+
+	/* Some calls are for duplicates without timestamps */
+	if (sample->rtt_us < 0)
+		return;
+
+	/* Discard delay samples right after fast recovery */
+	if (ca->epoch_start && (__s32)(tcp_jiffies32 - ca->epoch_start) < HZ)
+		return;
+
+	delay = (sample->rtt_us << 3) / USEC_PER_MSEC;
+	if (delay == 0)
+		delay = 1;
+
+	/* first time call or link delay decreases */
+	if (ca->delay_min == 0 || ca->delay_min > delay)
+		ca->delay_min = delay;
+
+	/* hystart triggers when cwnd is larger than some threshold */
+	if (hystart && tcp_in_slow_start(tp) &&
+	    tp->snd_cwnd >= hystart_low_window)
+		hystart_update(sk, delay);
+}
+
+BPF_TCP_OPS_1(tcp_reno_undo_cwnd, __u32, struct sock *, sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	return max(tp->snd_cwnd, tp->prior_cwnd);
+}
+
+SEC("struct_ops")
+struct tcp_congestion_ops cubictcp = {
+	.init		= (void *)bictcp_init,
+	.ssthresh	= (void *)bictcp_recalc_ssthresh,
+	.cong_avoid	= (void *)bictcp_cong_avoid,
+	.set_state	= (void *)bictcp_state,
+	.undo_cwnd	= (void *)tcp_reno_undo_cwnd,
+	.cwnd_event	= (void *)bictcp_cwnd_event,
+	.pkts_acked     = (void *)bictcp_acked,
+	.name		= "bpf_cubic",
+};
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-14  0:47 ` [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies Martin KaFai Lau
@ 2019-12-14  1:59   ` Eric Dumazet
  2019-12-14 19:25     ` Neal Cardwell
  2019-12-16 19:14     ` Martin Lau
  0 siblings, 2 replies; 51+ messages in thread
From: Eric Dumazet @ 2019-12-14  1:59 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> This patch adds a helper to handle jiffies.  Some of the
> tcp_sock's timing is stored in jiffies.  Although things
> could be deduced by CONFIG_HZ, having an easy way to get
> jiffies will make the later bpf-tcp-cc implementation easier.
> 

...

> +
> +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
> +{
> +	if (!flags)
> +		return get_jiffies_64();
> +
> +	if (flags & BPF_F_NS_TO_JIFFIES) {
> +		return nsecs_to_jiffies(in);
> +	} else if (flags & BPF_F_JIFFIES_TO_NS) {
> +		if (!in)
> +			in = get_jiffies_64();
> +		return jiffies_to_nsecs(in);
> +	}
> +
> +	return 0;
> +}

This looks a bit convoluted :)

Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.

We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
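
e.g. something like this (just a sketch) should be enough on the CC side:

	u32 now_ms = div_u64(tp->tcp_mstamp, USEC_PER_MSEC);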


Have you thought of finding a way to not duplicate the code for cubic and dctcp, maybe
by including a template ?

Maintaining two copies means that future changes need more maintenance work.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS
  2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
                   ` (12 preceding siblings ...)
  2019-12-14  0:48 ` [PATCH bpf-next 13/13] bpf: Add bpf_cubic example Martin KaFai Lau
@ 2019-12-14  2:26 ` Eric Dumazet
  13 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2019-12-14  2:26 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, kernel-team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> This series introduces BPF STRUCT_OPS.  It is an infra to allow
> implementing some specific kernel's function pointers in BPF.
> The first use case included in this series is to implement
> TCP congestion control algorithm in BPF  (i.e. implement
> struct tcp_congestion_ops in BPF).
> 
> There has been attempt to move the TCP CC to the user space
> (e.g. CCP in TCP).   The common arguments are faster turn around,
> get away from long-tail kernel versions in production...etc,
> which are legit points.
> 
> BPF has been the continuous effort to join both kernel and
> userspace upsides together (e.g. XDP to gain the performance
> advantage without bypassing the kernel).  The recent BPF
> advancements (in particular BTF-aware verifier, BPF trampoline,
> BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
> possible in BPF.
> 
> The idea is to allow implementing tcp_congestion_ops in bpf.
> It allows a faster turnaround for testing algorithm in the
> production while leveraging the existing (and continue growing) BPF
> feature/framework instead of building one specifically for
> userspace TCP CC.
>

This is awesome work Martin !


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-14  1:59   ` Eric Dumazet
@ 2019-12-14 19:25     ` Neal Cardwell
  2019-12-16 19:30       ` Martin Lau
  2019-12-17  8:26       ` Jakub Sitnicki
  2019-12-16 19:14     ` Martin Lau
  1 sibling, 2 replies; 51+ messages in thread
From: Neal Cardwell @ 2019-12-14 19:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Martin KaFai Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, Netdev

On Fri, Dec 13, 2019 at 9:00 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> > This patch adds a helper to handle jiffies.  Some of the
> > tcp_sock's timing is stored in jiffies.  Although things
> > could be deduced by CONFIG_HZ, having an easy way to get
> > jiffies will make the later bpf-tcp-cc implementation easier.
> >
>
> ...
>
> > +
> > +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
> > +{
> > +     if (!flags)
> > +             return get_jiffies_64();
> > +
> > +     if (flags & BPF_F_NS_TO_JIFFIES) {
> > +             return nsecs_to_jiffies(in);
> > +     } else if (flags & BPF_F_JIFFIES_TO_NS) {
> > +             if (!in)
> > +                     in = get_jiffies_64();
> > +             return jiffies_to_nsecs(in);
> > +     }
> > +
> > +     return 0;
> > +}
>
> This looks a bit convoluted :)
>
> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
>
> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.

If the jiffies functionality stays, how about 3 simple functions that
correspond to the underlying C functions, perhaps something like:

  bpf_nsecs_to_jiffies(nsecs)
  bpf_jiffies_to_nsecs(jiffies)
  bpf_get_jiffies_64()

Separate functions might be easier to read/maintain (and may even be
faster, given the corresponding reduction in branches).
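
Roughly something like this, perhaps (untested; the bpf_func_proto and
UAPI wiring is omitted):

  BPF_CALL_0(bpf_get_jiffies_64)
  {
  	return get_jiffies_64();
  }

  BPF_CALL_1(bpf_nsecs_to_jiffies, u64, nsecs)
  {
  	return nsecs_to_jiffies(nsecs);
  }

  BPF_CALL_1(bpf_jiffies_to_nsecs, u64, j)
  {
  	return jiffies_to_nsecs(j);
  }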

neal

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-14  1:59   ` Eric Dumazet
  2019-12-14 19:25     ` Neal Cardwell
@ 2019-12-16 19:14     ` Martin Lau
  2019-12-16 19:33       ` Eric Dumazet
  2019-12-16 23:08       ` Alexei Starovoitov
  1 sibling, 2 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-16 19:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Fri, Dec 13, 2019 at 05:59:54PM -0800, Eric Dumazet wrote:
> 
> 
> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> > This patch adds a helper to handle jiffies.  Some of the
> > tcp_sock's timing is stored in jiffies.  Although things
> > could be deduced by CONFIG_HZ, having an easy way to get
> > jiffies will make the later bpf-tcp-cc implementation easier.
> > 
> 
> ...
> 
> > +
> > +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
> > +{
> > +	if (!flags)
> > +		return get_jiffies_64();
> > +
> > +	if (flags & BPF_F_NS_TO_JIFFIES) {
> > +		return nsecs_to_jiffies(in);
> > +	} else if (flags & BPF_F_JIFFIES_TO_NS) {
> > +		if (!in)
> > +			in = get_jiffies_64();
> > +		return jiffies_to_nsecs(in);
> > +	}
> > +
> > +	return 0;
> > +}
> 
> This looks a bit convoluted :)
> 
> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
> 
> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
Thanks for the feedback!

I have a few questions that need some help.

Does it mean tp->tcp_mstamp can be used as the "now" in cubic?
or tcp_clock_ns() should still be called in cubic, e.g. to replace
bictcp_clock() and tcp_jiffies32?

BPF currently has a helper calling ktime_get_mono_fast_ns() which looks
different from tcp_clock_ns().

The lsndtime is in jiffies.  I think it can probably be converted to ms before
using it in cubic.  There is also some BICTCP_HZ logic in bictcp_update() where
it is not obvious to me how to convert it to a ms base.

> 
> 
> Have you thought of finding a way to not duplicate the code for cubic and dctcp, maybe
> by including a template ?
> 
> Maintaining two copies means that future changes need more maintenance work.
At least for bpf_dctcp.c, I did not expect it could be that close to tcp_dctcp.c
when I first started converting it.  tcp_cubic/bpf_cubic still has some TBDs
on jiffies/msec.

Agree that it is beneficial to have one copy.   It is likely
I need to make some changes on the tcp_*.c side also.  Hence, I prefer
to give it a try in a separate series, e.g. reverting the kernel-side
changes will be easier.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-14 19:25     ` Neal Cardwell
@ 2019-12-16 19:30       ` Martin Lau
  2019-12-17  8:26       ` Jakub Sitnicki
  1 sibling, 0 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-16 19:30 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, Netdev

On Sat, Dec 14, 2019 at 02:25:14PM -0500, Neal Cardwell wrote:
> On Fri, Dec 13, 2019 at 9:00 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >
> >
> > On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> > > This patch adds a helper to handle jiffies.  Some of the
> > > tcp_sock's timing is stored in jiffies.  Although things
> > > could be deduced by CONFIG_HZ, having an easy way to get
> > > jiffies will make the later bpf-tcp-cc implementation easier.
> > >
> >
> > ...
> >
> > > +
> > > +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
> > > +{
> > > +     if (!flags)
> > > +             return get_jiffies_64();
> > > +
> > > +     if (flags & BPF_F_NS_TO_JIFFIES) {
> > > +             return nsecs_to_jiffies(in);
> > > +     } else if (flags & BPF_F_JIFFIES_TO_NS) {
> > > +             if (!in)
> > > +                     in = get_jiffies_64();
> > > +             return jiffies_to_nsecs(in);
> > > +     }
> > > +
> > > +     return 0;
> > > +}
> >
> > This looks a bit convoluted :)
> >
> > Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
> >
> > We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
> 
> If the jiffies functionality stays, how about 3 simple functions that
> correspond to the underlying C functions, perhaps something like:
> 
>   bpf_nsecs_to_jiffies(nsecs)
>   bpf_jiffies_to_nsecs(jiffies)
>   bpf_get_jiffies_64()
> 
> Separate functions might be easier to read/maintain (and may even be
> faster, given the corresponding reduction in branches).
Yes.  It could be different bpf helpers.

I will take another look at these.
I may not need the nsecs <=> jiffies conversions with CONFIG_HZ and
Andrii's recent extern-var support.  The first attempt I tried
ended up with a lot of code on the bpf_prog side.  I may not have done
it right.  I will give it another try.
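
Ideally the HZ dependency could then be expressed directly on the
bpf_prog side, something like this (exact syntax TBD, depending on how
the extern support ends up looking):

	extern unsigned long CONFIG_HZ __kconfig;

	#define HZ CONFIG_HZ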

Thanks for the feedback!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-16 19:14     ` Martin Lau
@ 2019-12-16 19:33       ` Eric Dumazet
  2019-12-16 21:17         ` Martin Lau
  2019-12-16 23:08       ` Alexei Starovoitov
  1 sibling, 1 reply; 51+ messages in thread
From: Eric Dumazet @ 2019-12-16 19:33 UTC (permalink / raw)
  To: Martin Lau, Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev



On 12/16/19 11:14 AM, Martin Lau wrote:
> On Fri, Dec 13, 2019 at 05:59:54PM -0800, Eric Dumazet wrote:
>>
>>
>> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
>>> This patch adds a helper to handle jiffies.  Some of the
>>> tcp_sock's timing is stored in jiffies.  Although things
>>> could be deduced by CONFIG_HZ, having an easy way to get
>>> jiffies will make the later bpf-tcp-cc implementation easier.
>>>
>>
>> ...
>>
>>> +
>>> +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
>>> +{
>>> +	if (!flags)
>>> +		return get_jiffies_64();
>>> +
>>> +	if (flags & BPF_F_NS_TO_JIFFIES) {
>>> +		return nsecs_to_jiffies(in);
>>> +	} else if (flags & BPF_F_JIFFIES_TO_NS) {
>>> +		if (!in)
>>> +			in = get_jiffies_64();
>>> +		return jiffies_to_nsecs(in);
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>
>> This looks a bit convoluted :)
>>
>> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
>>
>> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
> Thanks for the feedbacks!
> 
> I have a few questions that need some helps.
> 
> Does it mean tp->tcp_mstamp can be used as the "now" in cubic?

TCP makes sure to update tp->tcp_mstamp at least once when handling
a particular packet.  We did that to avoid calling the possibly expensive
kernel time service (some platforms do not have a fast TSC).
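
A minimal sketch of using it that way, assuming a ms-granularity "now" is
what the CC code needs (illustrative only):

	/* sketch: derive a ms "now" from the usec tp->tcp_mstamp */
	u64 now_ms = div_u64(tp->tcp_mstamp, USEC_PER_MSEC);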

> or tcp_clock_ns() should still be called in cubic, e.g. to replace
> bictcp_clock() and tcp_jiffies32?

Yeah, there is this lsndtime and tcp_jiffies32 thing, but maybe
we can find a way to fetch jiffies32 without having to call a bpf helper ?

> 
> BPF currently has a helper calling ktime_get_mono_fast_ns() which looks
> different from tcp_clock_ns().

That's maybe because of NMI requirements.

TCP, on the other hand, runs in process or BH context.

But it should not matter, cubic should not have to call them.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack
  2019-12-14  0:47 ` [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
@ 2019-12-16 19:48   ` Yonghong Song
  0 siblings, 0 replies; 51+ messages in thread
From: Yonghong Song @ 2019-12-16 19:48 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> This patch makes the verifier save the PTR_TO_BTF_ID register state when
> spilling to the stack.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-16 19:33       ` Eric Dumazet
@ 2019-12-16 21:17         ` Martin Lau
  0 siblings, 0 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-16 21:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Mon, Dec 16, 2019 at 11:33:01AM -0800, Eric Dumazet wrote:
> 
> 
> On 12/16/19 11:14 AM, Martin Lau wrote:
> > On Fri, Dec 13, 2019 at 05:59:54PM -0800, Eric Dumazet wrote:
> >>
> >>
> >> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> >>> This patch adds a helper to handle jiffies.  Some of the
> >>> tcp_sock's timing is stored in jiffies.  Although things
> >>> could be deduced by CONFIG_HZ, having an easy way to get
> >>> jiffies will make the later bpf-tcp-cc implementation easier.
> >>>
> >>
> >> ...
> >>
> >>> +
> >>> +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
> >>> +{
> >>> +	if (!flags)
> >>> +		return get_jiffies_64();
> >>> +
> >>> +	if (flags & BPF_F_NS_TO_JIFFIES) {
> >>> +		return nsecs_to_jiffies(in);
> >>> +	} else if (flags & BPF_F_JIFFIES_TO_NS) {
> >>> +		if (!in)
> >>> +			in = get_jiffies_64();
> >>> +		return jiffies_to_nsecs(in);
> >>> +	}
> >>> +
> >>> +	return 0;
> >>> +}
> >>
> >> This looks a bit convoluted :)
> >>
> >> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
> >>
> >> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
> > Thanks for the feedbacks!
> > 
> > I have a few questions that need some helps.
> > 
> > Does it mean tp->tcp_mstamp can be used as the "now" in cubic?
> 
> TCP makes sure to update tp->tcp_mstamp at least once when handling
> a particular packet. We did that to avoid calling possibly expensive
> kernel time service (Some platforms do not have fast TSC) 
> 
> > or tcp_clock_ns() should still be called in cubic, e.g. to replace
> > bictcp_clock() and tcp_jiffies32?
> 
> Yeah, there is this lsndtime and tcp_jiffies32 thing, but maybe
> we can find a way to fetch jiffies32 without having to call a bpf helper ?
Loading a kernel global variable is not yet supported.
Thus, a helper is needed, but it could be inlined like array_map_gen_lookup().
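
A rough sketch of what that inlining could look like in the verifier's
fixup pass (illustrative only, not part of this series):

	struct bpf_insn ld_addr[2] = {
		/* BPF_LD_IMM64() expands to two instructions */
		BPF_LD_IMM64(BPF_REG_0, (unsigned long)&jiffies),
	};
	struct bpf_insn insn_buf[3];

	insn_buf[0] = ld_addr[0];
	insn_buf[1] = ld_addr[1];
	/* load the 64-bit jiffies value into R0 in place of the call */
	insn_buf[2] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
	/* bpf_patch_insn_data() would then splice insn_buf over the call */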

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id
  2019-12-14  0:47 ` [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
@ 2019-12-16 21:34   ` Yonghong Song
  0 siblings, 0 replies; 51+ messages in thread
From: Yonghong Song @ 2019-12-16 21:34 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> info->btf_id expects the btf_id of a struct, so it should
> store the final result after skipping modifiers (if any).
> 
> It also takes this chance to add a missing newline in one of the
> bpf_log() messages.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access()
  2019-12-14  0:47 ` [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
@ 2019-12-16 21:36   ` Yonghong Song
  0 siblings, 0 replies; 51+ messages in thread
From: Yonghong Song @ 2019-12-16 21:36 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> It allows bpf prog (e.g. tracing) to attach
> to a kernel function that takes enum argument.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access
  2019-12-14  0:47 ` [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
@ 2019-12-16 22:05   ` Yonghong Song
  0 siblings, 0 replies; 51+ messages in thread
From: Yonghong Song @ 2019-12-16 22:05 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> This patch allows bitfield access as a scalar.  It currently limits
> the access to sizeof(u64) and upto the end of the struct.  It is needed
> in a later bpf-tcp-cc example that reads bitfield from
> inet_connection_sock and tcp_sock.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   kernel/bpf/btf.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 6e652643849b..011194831499 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3744,10 +3744,6 @@ int btf_struct_access(struct bpf_verifier_log *log,
>   	}
>   
>   	for_each_member(i, t, member) {
> -		if (btf_member_bitfield_size(t, member))
> -			/* bitfields are not supported yet */
> -			continue;
> -
>   		/* offset of the field in bytes */
>   		moff = btf_member_bit_offset(t, member) / 8;
>   		if (off + size <= moff)
> @@ -3757,6 +3753,15 @@ int btf_struct_access(struct bpf_verifier_log *log,
>   		if (off < moff)
>   			continue;
>   
> +		if (btf_member_bitfield_size(t, member)) {
> +			if (off == moff &&
> +			    !(btf_member_bit_offset(t, member) % 8) &&

This check '!(btf_member_bit_offset(t, member) % 8)' is not needed.

> +			    size <= sizeof(u64) &&

This one is not needed since the verifier gets 'size' from load/store
instructions, which is guaranteed to be <= sizeof(u64).

> +			    off + size <= t->size)
> +				return SCALAR_VALUE;
> +			continue;
> +		}
> +
>   		/* type of the field */
>   		mtype = btf_type_by_id(btf_vmlinux, member->type);
>   		mname = __btf_name_by_offset(btf_vmlinux, member->name_off);
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-16 19:14     ` Martin Lau
  2019-12-16 19:33       ` Eric Dumazet
@ 2019-12-16 23:08       ` Alexei Starovoitov
  2019-12-17  0:34         ` Eric Dumazet
  1 sibling, 1 reply; 51+ messages in thread
From: Alexei Starovoitov @ 2019-12-16 23:08 UTC (permalink / raw)
  To: Martin Lau, Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On 12/16/19 11:14 AM, Martin Lau wrote:
> At least for bpf_dctcp.c, I did not expect it could be that close to tcp_dctcp.c
> when I just started converted it.  tcp_cubic/bpf_cubic still has some TBD
> on jiffies/msec.
> 
> Agree that it is beneficial to have one copy.   It is likely
> I need to make some changes on the tcp_*.c side also.  Hence, I prefer
> to give it a try in a separate series, e.g. revert the kernel side
> changes will be easier.

I've looked at bpf_cubic.c and bpf_dctcp.c as examples of what this
patch set can do. They're selftests of the feature.
What's the value of keeping them in sync with real kernel cc-s?
I think it's fine if they quickly diverge.
The value of them as selftests is important though. Quite a bit of BTF
and verifier logic is being tested.
Maybe add a comment saying that bpf_cubic.c is like cubic, but doesn't
have to be exactly cubic?
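
Something along these lines, perhaps (wording is only a suggestion):

/* bpf_cubic.c is a cubic-like congestion control used as a selftest for
 * BPF STRUCT_OPS.  It is not kept in sync with net/ipv4/tcp_cubic.c and
 * is not meant for production use.
 */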

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-16 23:08       ` Alexei Starovoitov
@ 2019-12-17  0:34         ` Eric Dumazet
  0 siblings, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2019-12-17  0:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin Lau, Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev



On 12/16/19 3:08 PM, Alexei Starovoitov wrote:
> On 12/16/19 11:14 AM, Martin Lau wrote:
>> At least for bpf_dctcp.c, I did not expect it could be that close to tcp_dctcp.c
>> when I just started converted it.  tcp_cubic/bpf_cubic still has some TBD
>> on jiffies/msec.
>>
>> Agree that it is beneficial to have one copy.   It is likely
>> I need to make some changes on the tcp_*.c side also.  Hence, I prefer
>> to give it a try in a separate series, e.g. revert the kernel side
>> changes will be easier.
> 
> I've looked at bpf_cubic.c and bpf_dctcp.c as examples of what this
> patch set can do. They're selftests of the feature.
> What's the value of keeping them in sync with real kernel cc-s?
> I think it's fine if they quickly diverge.
> The value of them as selftests is important though. Quite a bit of BTF
> and verifier logic is being tested.
> May be add a comment saying that bpf_cubic.c is like cubic, but doesn't
> have to be exactly cubic ?
> 

The reason I mentioned this is that I am currently working on a fix for the
Hystart logic, which is quite broken at the moment.

(hystart_train detection triggers in cases it should not)

But yes, if we add a comment warning potential users, this should be fine.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-14  0:47 ` [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-17  6:14   ` Yonghong Song
  2019-12-18 16:41     ` Martin Lau
  0 siblings, 1 reply; 51+ messages in thread
From: Yonghong Song @ 2019-12-17  6:14 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> This patch allows the kernel's struct ops (i.e. func ptr) to be
> implemented in BPF.  The first use case in this series is the
> "struct tcp_congestion_ops" which will be introduced in a
> latter patch.
> 
> This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
> The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
> func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
> of a kernel struct.  The attr->expected_attach_type is the member
> "index" of that kernel struct.  The first member of a struct starts
> with member index 0.  That will avoid ambiguity when a kernel struct
> has multiple func ptrs with the same func signature.
> 
> For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
> to implement the "init" func ptr of the "struct tcp_congestion_ops".
> The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
> of the _running_ kernel.  The attr->expected_attach_type is 3.
> 
> The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
> by arch_prepare_bpf_trampoline that will be done in the next
> patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
> 
> "struct bpf_struct_ops" is introduced as a common interface for the kernel
> struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
> struct will need to implement an instance of the "struct bpf_struct_ops".
> 
> The supporting kernel struct also needs to implement a bpf_verifier_ops.
> During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
> bpf_verifier_ops by searching the attr->attach_btf_id.
> 
> A new "btf_struct_access" is also added to the bpf_verifier_ops such
> that the supporting kernel struct can optionally provide its own specific
> check on accessing the func arg (e.g. provide limited write access).
> 
> After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
> to initialize some values (e.g. the btf id of the supporting kernel
> struct) and it can only be done once the btf_vmlinux is available.
> 
> The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
> if the return type of the prog->aux->attach_func_proto is "void".
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   include/linux/bpf.h               |  30 +++++++
>   include/linux/bpf_types.h         |   4 +
>   include/linux/btf.h               |  34 ++++++++
>   include/uapi/linux/bpf.h          |   1 +
>   kernel/bpf/Makefile               |   2 +-
>   kernel/bpf/bpf_struct_ops.c       | 124 +++++++++++++++++++++++++++
>   kernel/bpf/bpf_struct_ops_types.h |   4 +
>   kernel/bpf/btf.c                  |  88 ++++++++++++++------
>   kernel/bpf/syscall.c              |  17 ++--
>   kernel/bpf/verifier.c             | 134 +++++++++++++++++++++++-------
>   10 files changed, 374 insertions(+), 64 deletions(-)
>   create mode 100644 kernel/bpf/bpf_struct_ops.c
>   create mode 100644 kernel/bpf/bpf_struct_ops_types.h
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index d467983e61bb..1f0a5fc8c5ee 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -349,6 +349,10 @@ struct bpf_verifier_ops {
>   				  const struct bpf_insn *src,
>   				  struct bpf_insn *dst,
>   				  struct bpf_prog *prog, u32 *target_size);
> +	int (*btf_struct_access)(struct bpf_verifier_log *log,
> +				 const struct btf_type *t, int off, int size,
> +				 enum bpf_access_type atype,
> +				 u32 *next_btf_id);
>   };
>   
>   struct bpf_prog_offload_ops {
> @@ -667,6 +671,32 @@ struct bpf_array_aux {
>   	struct work_struct work;
>   };
>   
> +struct btf_type;
> +struct btf_member;
> +
> +#define BPF_STRUCT_OPS_MAX_NR_MEMBERS 64
> +struct bpf_struct_ops {
> +	const struct bpf_verifier_ops *verifier_ops;
> +	int (*init)(struct btf *_btf_vmlinux);
> +	int (*check_member)(const struct btf_type *t,
> +			    const struct btf_member *member);
> +	const struct btf_type *type;
> +	const char *name;
> +	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
> +	u32 type_id;
> +};
> +
> +#if defined(CONFIG_BPF_JIT)
> +const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
> +void bpf_struct_ops_init(struct btf *_btf_vmlinux);
> +#else
> +static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
> +{
> +	return NULL;
> +}
> +static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { }
> +#endif
> +
>   struct bpf_array {
>   	struct bpf_map map;
>   	u32 elem_size;
[...]
> +const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
> +};
> +
> +const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> +};
> +
> +void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> +{
> +	const struct btf_member *member;
> +	struct bpf_struct_ops *st_ops;
> +	struct bpf_verifier_log log = {};
> +	const struct btf_type *t;
> +	const char *mname;
> +	s32 type_id;
> +	u32 i, j;
> +
> +	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> +		st_ops = bpf_struct_ops[i];
> +
> +		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
> +						BTF_KIND_STRUCT);
> +		if (type_id < 0) {
> +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> +				st_ops->name);
> +			continue;
> +		}
> +		t = btf_type_by_id(_btf_vmlinux, type_id);
> +		if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
> +			pr_warn("Cannot support #%u members in struct %s\n",
> +				btf_type_vlen(t), st_ops->name);
> +			continue;
> +		}
> +
> +		for_each_member(j, t, member) {
> +			const struct btf_type *func_proto;
> +
> +			mname = btf_name_by_offset(_btf_vmlinux,
> +						   member->name_off);
> +			if (!*mname) {
> +				pr_warn("anon member in struct %s is not supported\n",
> +					st_ops->name);
> +				break;
> +			}
> +
> +			if (btf_member_bitfield_size(t, member)) {
> +				pr_warn("bit field member %s in struct %s is not supported\n",
> +					mname, st_ops->name);
> +				break;
> +			}
> +
> +			func_proto = btf_type_resolve_func_ptr(_btf_vmlinux,
> +							       member->type,
> +							       NULL);
> +			if (func_proto &&
> +			    btf_distill_func_proto(&log, _btf_vmlinux,
> +						   func_proto, mname,
> +						   &st_ops->func_models[j])) {
> +				pr_warn("Error in parsing func ptr %s in struct %s\n",
> +					mname, st_ops->name);
> +				break;
> +			}
> +		}
> +
> +		if (j == btf_type_vlen(t)) {
> +			if (st_ops->init(_btf_vmlinux)) {

is it possible that st_ops->init might be a NULL pointer?

> +				pr_warn("Error in init bpf_struct_ops %s\n",
> +					st_ops->name);
> +			} else {
> +				st_ops->type_id = type_id;
> +				st_ops->type = t;
> +			}
> +		}
> +	}
> +}
> +
> +extern struct btf *btf_vmlinux;
> +
[...]
> index 408264c1d55b..4c1eaa1a2965 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2858,11 +2858,6 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
>   	u32 btf_id;
>   	int ret;
>   
> -	if (atype != BPF_READ) {
> -		verbose(env, "only read is supported\n");
> -		return -EACCES;
> -	}
> -
>   	if (off < 0) {
>   		verbose(env,
>   			"R%d is ptr_%s invalid negative access: off=%d\n",
> @@ -2879,17 +2874,32 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
>   		return -EACCES;
>   	}
>   
> -	ret = btf_struct_access(&env->log, t, off, size, atype, &btf_id);
> +	if (env->ops->btf_struct_access) {
> +		ret = env->ops->btf_struct_access(&env->log, t, off, size,
> +						  atype, &btf_id);
> +	} else {
> +		if (atype != BPF_READ) {
> +			verbose(env, "only read is supported\n");
> +			return -EACCES;
> +		}
> +
> +		ret = btf_struct_access(&env->log, t, off, size, atype,
> +					&btf_id);
> +	}
> +
>   	if (ret < 0)
>   		return ret;
>   
> -	if (ret == SCALAR_VALUE) {
> -		mark_reg_unknown(env, regs, value_regno);
> -		return 0;
> +	if (atype == BPF_READ) {
> +		if (ret == SCALAR_VALUE) {
> +			mark_reg_unknown(env, regs, value_regno);
> +			return 0;
> +		}
> +		mark_reg_known_zero(env, regs, value_regno);
> +		regs[value_regno].type = PTR_TO_BTF_ID;
> +		regs[value_regno].btf_id = btf_id;
>   	}
> -	mark_reg_known_zero(env, regs, value_regno);
> -	regs[value_regno].type = PTR_TO_BTF_ID;
> -	regs[value_regno].btf_id = btf_id;
> +
>   	return 0;
>   }
>   
> @@ -6343,8 +6353,30 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
>   static int check_return_code(struct bpf_verifier_env *env)
>   {
>   	struct tnum enforce_attach_type_range = tnum_unknown;
> +	const struct bpf_prog *prog = env->prog;
>   	struct bpf_reg_state *reg;
>   	struct tnum range = tnum_range(0, 1);
> +	int err;
> +
> +	/* The struct_ops func-ptr's return type could be "void" */
> +	if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
> +	    !prog->aux->attach_func_proto->type)
> +		return 0;
> +
> +	/* eBPF calling convetion is such that R0 is used
> +	 * to return the value from eBPF program.
> +	 * Make sure that it's readable at this time
> +	 * of bpf_exit, which means that program wrote
> +	 * something into it earlier
> +	 */
> +	err = check_reg_arg(env, BPF_REG_0, SRC_OP);
> +	if (err)
> +		return err;
> +
> +	if (is_pointer_value(env, BPF_REG_0)) {
> +		verbose(env, "R0 leaks addr as return value\n");
> +		return -EACCES;
> +	}
>   
>   	switch (env->prog->type) {
>   	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> @@ -8010,21 +8042,6 @@ static int do_check(struct bpf_verifier_env *env)
>   				if (err)
>   					return err;
>   
> -				/* eBPF calling convetion is such that R0 is used
> -				 * to return the value from eBPF program.
> -				 * Make sure that it's readable at this time
> -				 * of bpf_exit, which means that program wrote
> -				 * something into it earlier
> -				 */
> -				err = check_reg_arg(env, BPF_REG_0, SRC_OP);
> -				if (err)
> -					return err;
> -
> -				if (is_pointer_value(env, BPF_REG_0)) {
> -					verbose(env, "R0 leaks addr as return value\n");
> -					return -EACCES;
> -				}
> -
>   				err = check_return_code(env);
>   				if (err)
>   					return err;
> @@ -8833,12 +8850,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>   			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
>   			break;
>   		case PTR_TO_BTF_ID:
> -			if (type == BPF_WRITE) {
> +			if (type == BPF_READ) {
> +				insn->code = BPF_LDX | BPF_PROBE_MEM |
> +					BPF_SIZE((insn)->code);
> +				env->prog->aux->num_exentries++;
> +			} else if (env->prog->type != BPF_PROG_TYPE_STRUCT_OPS) {
>   				verbose(env, "Writes through BTF pointers are not allowed\n");
>   				return -EINVAL;
>   			}
> -			insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code);
> -			env->prog->aux->num_exentries++;

Do we need to increase num_exentries for BPF_WRITE as well?

>   			continue;
>   		default:
>   			continue;
> @@ -9505,6 +9524,58 @@ static void print_verification_stats(struct bpf_verifier_env *env)
>   		env->peak_states, env->longest_mark_read_walk);
>   }
>   
[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Potential Spoof] [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-14  0:47 ` [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
@ 2019-12-17  7:48   ` Yonghong Song
  2019-12-20  7:22     ` Martin Lau
  0 siblings, 1 reply; 51+ messages in thread
From: Yonghong Song @ 2019-12-17  7:48 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> is a kernel struct with its func ptr implemented in bpf prog.
> This new map is the interface to register/unregister/introspect
> a bpf implemented kernel struct.
> 
> The kernel struct is actually embedded inside another new struct
> (or called the "value" struct in the code).  For example,
> "struct tcp_congestion_ops" is embbeded in:
> struct __bpf_tcp_congestion_ops {
> 	refcount_t refcnt;
> 	enum bpf_struct_ops_state state;
> 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> }
> The map value is "struct __bpf_tcp_congestion_ops".  The "bpftool map dump"
> will then be able to show the state ("inuse"/"tobefree") and the number of
> subsystem's refcnt (e.g. number of tcp_sock in the tcp_congestion_ops case).
> This "value" struct is created automatically by a macro.  Having a separate
> "value" struct will also make extending "struct __bpf_XYZ" easier (e.g. adding
> "void (*init)(void)" to "struct __bpf_XYZ" to do some initialization
> works before registering the struct_ops to the kernel subsystem).
> The libbpf will take care of finding and populating the "struct __bpf_XYZ"
> from "struct XYZ".
> 
> Register a struct_ops to a kernel subsystem:
> 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
>     set to the btf id "struct __bpf_tcp_congestion_ops" of the running
>     kernel.
>     Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id
>     is added such that attr->btf_fd can still be used as the "user" btf
>     which could store other useful sysadmin/debug info that may be
>     introduced in the furture,
>     e.g. creation-date/compiler-details/map-creator...etc.
> 3. Create a "struct __bpf_tcp_congestion_ops" object as described in
>     the running kernel btf.  Populate the value of this object.
>     The function ptr should be populated with the prog fds.
> 4. Call BPF_MAP_UPDATE with the object created in (3) as
>     the map value.  The key is always "0".

This is really a special one-element map, so the key "0" should work.
Not sure whether we should generalize this and the maps for global variables
into a kind of key-less map. Just a thought.
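
For reference, a minimal sketch of step 4 above from user space with
libbpf (map_fd and the value layout are assumptions; the real value must
follow the running kernel's BTF for "struct __bpf_tcp_congestion_ops"):

	__u32 key = 0;
	char value[256] = {};	/* func-ptr fields hold prog fds (step 3) */
	int err;

	err = bpf_map_update_elem(map_fd, &key, value, 0 /* flags */);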

> 
> During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> the specific struct_ops to do some final checks in "st_ops->init_member()"
> (e.g. ensure all mandatory func ptrs are implemented).
> If everything looks good, it will register this kernel struct
> to the kernel subsystem.  The map will not allow further update
> from this point.
> 
> Unregister a struct_ops from the kernel subsystem:
> BPF_MAP_DELETE with key "0".
> 
> Introspect a struct_ops:
> BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> have the prog _id_ populated as the func ptr.
> 
> The map value state (enum bpf_struct_ops_state) will transit from:
> INIT (map created) =>
> INUSE (map updated, i.e. reg) =>
> TOBEFREE (map value deleted, i.e. unreg)
> 
> Note that the above state is not exposed to the uapi/bpf.h.
> It will be obtained from the btf of the running kernel.

It is not really from btf, right? It is from a kernel internal
data structure which will be copied to user space.

Since such information is available to bpftool dump and is common
to all st_ops maps, I am wondering whether we should expose
it through uapi.

> 
> The kernel subsystem needs to call bpf_struct_ops_get() and
> bpf_struct_ops_put() to manage the "refcnt" in the "struct __bpf_XYZ".
> This patch uses a separate refcnt for the purpose of tracking the
> subsystem usage.  Another approach is to reuse the map->refcnt
> and then "show" (i.e. during map_lookup) the subsystem's usage
> by doing map->refcnt - map->usercnt to filter out the
> map-fd/pinned-map usage.  However, that will also tie down the
> future semantics of map->refcnt and map->usercnt.
> 
> The very first subsystem's refcnt (during reg()) holds one
> count to map->refcnt.  When the very last subsystem's refcnt
> is gone, it will also release the map->refcnt.  All bpf_prog will be
> freed when the map->refcnt reaches 0 (i.e. during map_free()).
> 
> Here is how the bpftool map command will look like:
> [root@arch-fb-vm1 bpf]# bpftool map show
> 6: struct_ops  name dctcp  flags 0x0
> 	key 4B  value 256B  max_entries 1  memlock 4096B
> 	btf_id 6
> [root@arch-fb-vm1 bpf]# bpftool map dump id 6
> [{
>          "value": {
>              "refcnt": {
>                  "refs": {
>                      "counter": 1
>                  }
>              },
>              "state": 1,
>              "data": {
>                  "list": {
>                      "next": 0,
>                      "prev": 0
>                  },
>                  "key": 0,
>                  "flags": 2,
>                  "init": 24,
>                  "release": 0,
>                  "ssthresh": 25,
>                  "cong_avoid": 30,
>                  "set_state": 27,
>                  "cwnd_event": 28,
>                  "in_ack_event": 26,
>                  "undo_cwnd": 29,
>                  "pkts_acked": 0,
>                  "min_tso_segs": 0,
>                  "sndbuf_expand": 0,
>                  "cong_control": 0,
>                  "get_info": 0,
>                  "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0

Not related to this patch, but it would be good if we could
make the "name" printing better.

>                  ],
>                  "owner": 0
>              }
>          }
>      }
> ]
> 
> Misc Notes:
> * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
>    It does an inplace update on "*value" instead returning a pointer
>    to syscall.c.  Otherwise, it needs a separate copy of "zero" value
>    for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
> 
> * The bpf_struct_ops_map_delete_elem() is also called without
>    preempt_disable() from map_delete_elem().  It is because
>    the "->unreg()" may requires sleepable context, e.g.
>    the "tcp_unregister_congestion_control()".

This is probably fine; we do not have a per-cpu data structure here, and
lookup will fail after the init stage. Some comments in the code would be good.

> 
> * "const" is added to some of the existing "struct btf_func_model *"
>    function arg to avoid a compiler warning caused by this patch.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   arch/x86/net/bpf_jit_comp.c |  10 +-
>   include/linux/bpf.h         |  49 +++-
>   include/linux/bpf_types.h   |   3 +
>   include/linux/btf.h         |  11 +
>   include/uapi/linux/bpf.h    |   7 +-
>   kernel/bpf/bpf_struct_ops.c | 465 +++++++++++++++++++++++++++++++++++-
>   kernel/bpf/btf.c            |  20 +-
>   kernel/bpf/map_in_map.c     |   3 +-
>   kernel/bpf/syscall.c        |  47 ++--
>   kernel/bpf/trampoline.c     |   5 +-
>   kernel/bpf/verifier.c       |   5 +
>   11 files changed, 585 insertions(+), 40 deletions(-)
> 
[...]
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index fadd243ffa2d..9f326e6ef885 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>   #endif
>   BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
>   BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> +#if defined(CONFIG_BPF_JIT)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
> +#endif
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index f74a09a7120b..49094564f1f1 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -60,6 +60,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
>   					    u32 id, u32 *res_id);
>   const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
>   						 u32 id, u32 *res_id);
> +const struct btf_type *
> +btf_resolve_size(const struct btf *btf, const struct btf_type *type,
> +		 u32 *type_size, const struct btf_type **elem_type,
> +		 u32 *total_nelems);
>   
>   #define for_each_member(i, struct_type, member)			\
>   	for (i = 0, member = btf_type_member(struct_type);	\
> @@ -106,6 +110,13 @@ static inline bool btf_type_kflag(const struct btf_type *t)
>   	return BTF_INFO_KFLAG(t->info);
>   }
>   
> +static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
> +					const struct btf_member *member)
> +{
> +	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
> +					   : member->offset;
> +}
> +
>   static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
>   					   const struct btf_member *member)
>   {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 12900dfa1461..8809212d9d6c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -136,6 +136,7 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_STACK,
>   	BPF_MAP_TYPE_SK_STORAGE,
>   	BPF_MAP_TYPE_DEVMAP_HASH,
> +	BPF_MAP_TYPE_STRUCT_OPS,
>   };
>   
>   /* Note that tracing related programs such as
> @@ -392,6 +393,10 @@ union bpf_attr {
>   		__u32	btf_fd;		/* fd pointing to a BTF type data */
>   		__u32	btf_key_type_id;	/* BTF type_id of the key */
>   		__u32	btf_value_type_id;	/* BTF type_id of the value */
> +		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
> +						   * struct stored as the
> +						   * map value
> +						   */
>   	};
>   
>   	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> @@ -3340,7 +3345,7 @@ struct bpf_map_info {
>   	__u32 map_flags;
>   	char  name[BPF_OBJ_NAME_LEN];
>   	__u32 ifindex;
> -	__u32 :32;
> +	__u32 btf_vmlinux_value_type_id;
>   	__u64 netns_dev;
>   	__u64 netns_ino;
>   	__u32 btf_id;
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 817d5aac42e5..00f49ac1342d 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -12,8 +12,68 @@
>   #include <linux/seq_file.h>
>   #include <linux/refcount.h>
>   
> +enum bpf_struct_ops_state {
> +	BPF_STRUCT_OPS_STATE_INIT,
> +	BPF_STRUCT_OPS_STATE_INUSE,
> +	BPF_STRUCT_OPS_STATE_TOBEFREE,
> +};
> +
> +#define BPF_STRUCT_OPS_COMMON_VALUE			\
> +	refcount_t refcnt;				\
> +	enum bpf_struct_ops_state state
> +
> +struct bpf_struct_ops_value {
> +	BPF_STRUCT_OPS_COMMON_VALUE;
> +	char data[0] ____cacheline_aligned_in_smp;
> +};
> +
> +struct bpf_struct_ops_map {
> +	struct bpf_map map;
> +	const struct bpf_struct_ops *st_ops;
> +	/* protect map_update */
> +	spinlock_t lock;
> +	/* progs has all the bpf_prog that is populated
> +	 * to the func ptr of the kernel's struct
> +	 * (in kvalue.data).
> +	 */
> +	struct bpf_prog **progs;
> +	/* image is a page that has all the trampolines
> +	 * that stores the func args before calling the bpf_prog.
> +	 * A PAGE_SIZE "image" is enough to store all trampoline for
> +	 * "progs[]".
> +	 */
> +	void *image;
> +	/* uvalue->data stores the kernel struct
> +	 * (e.g. tcp_congestion_ops) that is more useful
> +	 * to userspace than the kvalue.  For example,
> +	 * the bpf_prog's id is stored instead of the kernel
> +	 * address of a func ptr.
> +	 */
> +	struct bpf_struct_ops_value *uvalue;
> +	/* kvalue.data stores the actual kernel's struct
> +	 * (e.g. tcp_congestion_ops) that will be
> +	 * registered to the kernel subsystem.
> +	 */
> +	struct bpf_struct_ops_value kvalue;
> +};
> +
> +#define VALUE_PREFIX "__bpf_"
> +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
> +
> +/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
> + * exposed to the userspace and its btf-type-id is stored
> + * at the map->btf_vmlinux_value_type_id.
> + *
> + * The *_name##_dummy is to ensure the BTF type is emitted.
> + */
> +
>   #define BPF_STRUCT_OPS_TYPE(_name)				\
> -extern struct bpf_struct_ops bpf_##_name;
> +extern struct bpf_struct_ops bpf_##_name;			\
> +								\
> +static struct __bpf_##_name {					\
> +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> +	struct _name data ____cacheline_aligned_in_smp;		\
> +} *_name##_dummy;

There are other ways to retain types in debug info without
creating new variables. For example, you can use it in a cast
like
     (void *)(struct __bpf_##_name *)v
Not sure whether we could easily find a place for such a cast or not.

>   #include "bpf_struct_ops_types.h"
>   #undef BPF_STRUCT_OPS_TYPE
>   
> @@ -37,19 +97,46 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
>   const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
>   };
>   
> +static const struct btf_type *module_type;
> +
>   void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   {
> +	char value_name[128] = VALUE_PREFIX;
> +	s32 type_id, value_id, module_id;
>   	const struct btf_member *member;
>   	struct bpf_struct_ops *st_ops;
>   	struct bpf_verifier_log log = {};
>   	const struct btf_type *t;
>   	const char *mname;
> -	s32 type_id;
>   	u32 i, j;
>   
> +	/* Avoid unused var compiler warning */
> +#define BPF_STRUCT_OPS_TYPE(_name) (void)(_name##_dummy);
> +#include "bpf_struct_ops_types.h"
> +#undef BPF_STRUCT_OPS_TYPE
> +
> +	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> +					  BTF_KIND_STRUCT);
> +	if (module_id < 0) {
> +		pr_warn("Cannot find struct module in btf_vmlinux\n");
> +		return;
> +	}
> +	module_type = btf_type_by_id(_btf_vmlinux, module_id);
> +
>   	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
>   		st_ops = bpf_struct_ops[i];
>   
> +		value_name[VALUE_PREFIX_LEN] = '\0';
> +		strncat(value_name + VALUE_PREFIX_LEN, st_ops->name,
> +			sizeof(value_name) - VALUE_PREFIX_LEN - 1);

Do we have restrictions on the length of st_ops->name?
We probably do not want truncation, right?

> +		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> +						 BTF_KIND_STRUCT);
> +		if (value_id < 0) {
> +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> +				value_name);
> +			continue;
> +		}
> +
>   		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
>   						BTF_KIND_STRUCT);
>   		if (type_id < 0) {
> @@ -101,6 +188,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   			} else {
>   				st_ops->type_id = type_id;
>   				st_ops->type = t;
> +				st_ops->value_id = value_id;
> +				st_ops->value_type =
> +					btf_type_by_id(_btf_vmlinux, value_id);
>   			}
>   		}
>   	}
> @@ -108,6 +198,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux)
>   
>   extern struct btf *btf_vmlinux;
>   
> +static const struct bpf_struct_ops *
> +bpf_struct_ops_find_value(u32 value_id)
> +{
> +	unsigned int i;
> +
> +	if (!value_id || !btf_vmlinux)
> +		return NULL;
> +
> +	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> +		if (bpf_struct_ops[i]->value_id == value_id)
> +			return bpf_struct_ops[i];
> +	}
> +
> +	return NULL;
> +}
> +
>   const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
>   {
>   	unsigned int i;
> @@ -122,3 +228,358 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
>   
>   	return NULL;
>   }
> +
> +static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key,
> +					   void *next_key)
> +{
> +	u32 index = key ? *(u32 *)key : U32_MAX;
> +	u32 *next = (u32 *)next_key;
> +
> +	if (index >= map->max_entries) {
> +		*next = 0;
> +		return 0;
> +	}

We know that max_entries must be 1. Maybe we can simplify the code
accordingly.

> +
> +	if (index == map->max_entries - 1)
> +		return -ENOENT;
> +
> +	*next = index + 1;
> +	return 0;
> +}
> +
> +int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
> +				       void *value)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +	struct bpf_struct_ops_value *uvalue, *kvalue;
> +	enum bpf_struct_ops_state state;
> +
> +	if (unlikely(*(u32 *)key != 0))
> +		return -ENOENT;
> +
> +	kvalue = &st_map->kvalue;
> +	state = smp_load_acquire(&kvalue->state);

Could we have a one-line comment here? Also for the smp_store_release() below?
A simple comment at each important synchronization/barrier point will help
code review a lot.
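
For example, something like (my reading of the pairing, as a suggestion
only):

	/* Pairs with the smp_store_release() in map_update:
	 * a lookup that observes BPF_STRUCT_OPS_STATE_INUSE is also
	 * guaranteed to observe the fully populated value written
	 * before the state was published.
	 */
	state = smp_load_acquire(&kvalue->state);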

> +	if (state == BPF_STRUCT_OPS_STATE_INIT) {
> +		memset(value, 0, map->value_size);
> +		return 0;
> +	}
> +
> +	/* No lock is needed.  state and refcnt do not need
> +	 * to be updated together under atomic context.
> +	 */
> +	uvalue = (struct bpf_struct_ops_value *)value;
> +	memcpy(uvalue, st_map->uvalue, map->value_size);
> +	uvalue->state = state;
> +	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
> +
> +	return 0;
> +}
> +
> +static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map)
> +{
> +	const struct btf_type *t = st_map->st_ops->type;
> +	u32 i;
> +
> +	for (i = 0; i < btf_type_vlen(t); i++) {
> +		if (st_map->progs[i]) {
> +			bpf_prog_put(st_map->progs[i]);
> +			st_map->progs[i] = NULL;
> +		}
> +	}
> +}
> +
> +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> +					  void *value, u64 flags)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> +	struct bpf_struct_ops_value *uvalue, *kvalue;
> +	const struct btf_member *member;
> +	const struct btf_type *t = st_ops->type;
> +	void *udata, *kdata;
> +	int prog_fd, err = 0;
> +	void *image;
> +	u32 i;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (*(u32 *)key != 0)
> +		return -E2BIG;
> +
> +	uvalue = (struct bpf_struct_ops_value *)value;
> +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> +		return -EINVAL;
> +
> +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> +
> +	spin_lock(&st_map->lock);
> +
> +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> +		err = -EBUSY;
> +		goto unlock;
> +	}
> +
> +	memcpy(uvalue, value, map->value_size);
> +
> +	udata = &uvalue->data;
> +	kdata = &kvalue->data;
> +	image = st_map->image;
> +
> +	for_each_member(i, t, member) {
> +		const struct btf_type *mtype, *ptype;
> +		struct bpf_prog *prog;
> +		u32 moff;
> +
> +		moff = btf_member_bit_offset(t, member) / 8;
> +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> +		if (ptype == module_type) {
> +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> +			continue;
> +		}
> +
> +		err = st_ops->init_member(t, member, kdata, udata);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		/* The ->init_member() has handled this member */
> +		if (err > 0)
> +			continue;
> +
> +		/* If st_ops->init_member does not handle it,
> +		 * we will only handle func ptrs and zero-ed members
> +		 * here.  Reject everything else.
> +		 */
> +
> +		/* All non func ptr member must be 0 */
> +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> +					       NULL)) {
> +			u32 msize;
> +
> +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> +						 &msize, NULL, NULL);
> +			if (IS_ERR(mtype)) {
> +				err = PTR_ERR(mtype);
> +				goto reset_unlock;
> +			}
> +
> +			if (memchr_inv(udata + moff, 0, msize)) {
> +				err = -EINVAL;
> +				goto reset_unlock;
> +			}
> +
> +			continue;
> +		}
> +
> +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> +		/* Similar check as the attr->attach_prog_fd */
> +		if (!prog_fd)
> +			continue;
> +
> +		prog = bpf_prog_get(prog_fd);
> +		if (IS_ERR(prog)) {
> +			err = PTR_ERR(prog);
> +			goto reset_unlock;
> +		}
> +		st_map->progs[i] = prog;
> +
> +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> +		    prog->aux->attach_btf_id != st_ops->type_id ||
> +		    prog->expected_attach_type != i) {
> +			err = -EINVAL;
> +			goto reset_unlock;
> +		}
> +
> +		err = arch_prepare_bpf_trampoline(image,
> +						  &st_ops->func_models[i], 0,
> +						  &prog, 1, NULL, 0, NULL);
> +		if (err < 0)
> +			goto reset_unlock;
> +
> +		*(void **)(kdata + moff) = image;
> +		image += err;

Do we still want to check whether the image goes beyond the page boundary or not?

> +
> +		/* put prog_id to udata */
> +		*(unsigned long *)(udata + moff) = prog->aux->id;
> +	}
> +
> +	refcount_set(&kvalue->refcnt, 1);
> +	bpf_map_inc(map);
> +
> +	err = st_ops->reg(kdata);
> +	if (!err) {
> +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> +		goto unlock;
> +	}
> +
> +	/* Error during st_ops->reg() */
> +	bpf_map_put(map);
> +
> +reset_unlock:
> +	bpf_struct_ops_map_put_progs(st_map);
> +	memset(uvalue, 0, map->value_size);
> +	memset(kvalue, 0, map->value_size);
> +
> +unlock:
> +	spin_unlock(&st_map->lock);
> +	return err;
> +}
> +
> +static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +	enum bpf_struct_ops_state prev_state;
> +	struct bpf_struct_ops_map *st_map;
> +
> +	st_map = (struct bpf_struct_ops_map *)map;
> +	prev_state = cmpxchg(&st_map->kvalue.state,
> +			     BPF_STRUCT_OPS_STATE_INUSE,
> +			     BPF_STRUCT_OPS_STATE_TOBEFREE);
> +	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
> +		st_map->st_ops->unreg(&st_map->kvalue.data);
> +		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> +			bpf_map_put(map);
> +	}
> +
> +	return 0;
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-14 19:25     ` Neal Cardwell
  2019-12-16 19:30       ` Martin Lau
@ 2019-12-17  8:26       ` Jakub Sitnicki
  2019-12-17 18:22         ` Martin Lau
  1 sibling, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2019-12-17  8:26 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Eric Dumazet, Neal Cardwell, bpf, Alexei Starovoitov,
	Daniel Borkmann, David Miller, Kernel Team, Netdev

On Sat, Dec 14, 2019 at 08:25 PM CET, Neal Cardwell wrote:
> On Fri, Dec 13, 2019 at 9:00 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
>> > This patch adds a helper to handle jiffies.  Some of the
>> > tcp_sock's timing is stored in jiffies.  Although things
>> > could be deduced by CONFIG_HZ, having an easy way to get
>> > jiffies will make the later bpf-tcp-cc implementation easier.
>> >
>>
>> ...
>>
>> > +
>> > +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
>> > +{
>> > +     if (!flags)
>> > +             return get_jiffies_64();
>> > +
>> > +     if (flags & BPF_F_NS_TO_JIFFIES) {
>> > +             return nsecs_to_jiffies(in);
>> > +     } else if (flags & BPF_F_JIFFIES_TO_NS) {
>> > +             if (!in)
>> > +                     in = get_jiffies_64();
>> > +             return jiffies_to_nsecs(in);
>> > +     }
>> > +
>> > +     return 0;
>> > +}
>>
>> This looks a bit convoluted :)
>>
>> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
>>
>> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
>
> If the jiffies functionality stays, how about 3 simple functions that
> correspond to the underlying C functions, perhaps something like:
>
>   bpf_nsecs_to_jiffies(nsecs)
>   bpf_jiffies_to_nsecs(jiffies)
>   bpf_get_jiffies_64()
>
> Separate functions might be easier to read/maintain (and may even be
> faster, given the corresponding reduction in branches).

Having bpf_nsecs_to_jiffies() would also be handy for BPF sockops progs
that configure the SYN-RTO timeout (BPF_SOCK_OPS_TIMEOUT_INIT).

Right now user-space needs to go look for CONFIG_HZ in /proc/config.gz
or /boot/config-`uname -r`, or derive it from clock resolution [0]

        clock_getres(CLOCK_REALTIME_COARSE, &res);
        jiffy = res.tv_nsec / 1000000;

to pass timeout in jiffies to the BPF prog.
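
For completeness, a small self-contained sketch of that workaround
(values and names are illustrative):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec res;
	unsigned long jiffy_ms, timeout_ms = 3000;

	if (clock_getres(CLOCK_REALTIME_COARSE, &res))
		return 1;

	/* coarse clock resolution == one jiffy */
	jiffy_ms = res.tv_nsec / 1000000;	/* e.g. 4 ms for CONFIG_HZ=250 */
	if (!jiffy_ms)
		jiffy_ms = 1;			/* CONFIG_HZ >= 1000 */

	printf("timeout: %lu ms = %lu jiffies\n",
	       timeout_ms, timeout_ms / jiffy_ms);
	return 0;
}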

-jkbs

[0] https://www.mail-archive.com/kernelnewbies@nl.linux.org/msg08850.html


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf
  2019-12-14  0:47 ` [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
@ 2019-12-17 17:36   ` Yonghong Song
  0 siblings, 0 replies; 51+ messages in thread
From: Yonghong Song @ 2019-12-17 17:36 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> This patch makes "struct tcp_congestion_ops" to be the first user
> of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
> in bpf.
> 
> The BPF implemented tcp_congestion_ops can be used like
> regular kernel tcp-cc through sysctl and setsockopt.  e.g.
> [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
> net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
> net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
> net.ipv4.tcp_congestion_control = bpf_cubic
> 
> There has been attempt to move the TCP CC to the user space
> (e.g. CCP in TCP).   The common arguments are faster turn around,
> get away from long-tail kernel versions in production...etc,
> which are legit points.
> 
> BPF has been the continuous effort to join both kernel and
> userspace upsides together (e.g. XDP to gain the performance
> advantage without bypassing the kernel).  The recent BPF
> advancements (in particular BTF-aware verifier, BPF trampoline,
> BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
> possible in BPF.  It allows a faster turnaround for testing algorithm
> in the production while leveraging the existing (and continue growing)
> BPF feature/framework instead of building one specifically for
> userspace TCP CC.
> 
> This patch allows write access to a few fields in tcp-sock
> (in bpf_tcp_ca_btf_struct_access()).
> 
> The optional "get_info" is unsupported now.  It can be added
> later.  One possible way is to output the info with a btf-id
> to describe the content.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   include/linux/filter.h            |   2 +
>   include/net/tcp.h                 |   1 +
>   kernel/bpf/bpf_struct_ops_types.h |   7 +-
>   net/core/filter.c                 |   2 +-
>   net/ipv4/Makefile                 |   4 +
>   net/ipv4/bpf_tcp_ca.c             | 225 ++++++++++++++++++++++++++++++
>   net/ipv4/tcp_cong.c               |  14 +-
>   net/ipv4/tcp_ipv4.c               |   6 +-
>   net/ipv4/tcp_minisocks.c          |   4 +-
>   net/ipv4/tcp_output.c             |   4 +-
>   10 files changed, 254 insertions(+), 15 deletions(-)
>   create mode 100644 net/ipv4/bpf_tcp_ca.c
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 37ac7025031d..7c22c5e6528d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -844,6 +844,8 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
>   int bpf_prog_create_from_user(struct bpf_prog **pfp, struct sock_fprog *fprog,
>   			      bpf_aux_classic_check_t trans, bool save_orig);
>   void bpf_prog_destroy(struct bpf_prog *fp);
> +const struct bpf_func_proto *
> +bpf_base_func_proto(enum bpf_func_id func_id);
>   
>   int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
>   int sk_attach_bpf(u32 ufd, struct sock *sk);
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 86b9a8766648..fd87fa1df603 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1007,6 +1007,7 @@ enum tcp_ca_ack_event_flags {
>   #define TCP_CONG_NON_RESTRICTED 0x1
>   /* Requires ECN/ECT set on all packets */
>   #define TCP_CONG_NEEDS_ECN	0x2
> +#define TCP_CONG_MASK	(TCP_CONG_NON_RESTRICTED | TCP_CONG_NEEDS_ECN)
>   
>   union tcp_cc_info;
>   
> diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
> index 7bb13ff49ec2..066d83ea1c99 100644
> --- a/kernel/bpf/bpf_struct_ops_types.h
> +++ b/kernel/bpf/bpf_struct_ops_types.h
> @@ -1,4 +1,9 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   /* internal file - do not include directly */
>   
> -/* To be filled in a later patch */
> +#ifdef CONFIG_BPF_JIT
> +#ifdef CONFIG_INET
> +#include <net/tcp.h>
> +BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
> +#endif
> +#endif
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a411f7835dee..fbb3698026bd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5975,7 +5975,7 @@ bool bpf_helper_changes_pkt_data(void *func)
>   	return false;
>   }
>   
> -static const struct bpf_func_proto *
> +const struct bpf_func_proto *
>   bpf_base_func_proto(enum bpf_func_id func_id)
>   {
>   	switch (func_id) {
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index d57ecfaf89d4..7360d9b3eaad 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -65,3 +65,7 @@ obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
>   
>   obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
>   		      xfrm4_output.o xfrm4_protocol.o
> +
> +ifeq ($(CONFIG_BPF_SYSCALL),y)
> +obj-$(CONFIG_BPF_JIT) += bpf_tcp_ca.o
> +endif
> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> new file mode 100644
> index 000000000000..967af987bc26
> --- /dev/null
> +++ b/net/ipv4/bpf_tcp_ca.c
> @@ -0,0 +1,225 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook  */
> +
> +#include <linux/types.h>
> +#include <linux/bpf_verifier.h>
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/filter.h>
> +#include <net/tcp.h>
> +
> +static u32 optional_ops[] = {
> +	offsetof(struct tcp_congestion_ops, init),
> +	offsetof(struct tcp_congestion_ops, release),
> +	offsetof(struct tcp_congestion_ops, set_state),
> +	offsetof(struct tcp_congestion_ops, cwnd_event),
> +	offsetof(struct tcp_congestion_ops, in_ack_event),
> +	offsetof(struct tcp_congestion_ops, pkts_acked),
> +	offsetof(struct tcp_congestion_ops, min_tso_segs),
> +	offsetof(struct tcp_congestion_ops, sndbuf_expand),
> +	offsetof(struct tcp_congestion_ops, cong_control),
> +};
> +
> +static u32 unsupported_ops[] = {
> +	offsetof(struct tcp_congestion_ops, get_info),
> +};

Could you elaborate, at least by adding some comments, on how
required fields are handled?

> +
> +static const struct btf_type *tcp_sock_type;
> +static u32 tcp_sock_id, sock_id;
> +
> +static int bpf_tcp_ca_init(struct btf *_btf_vmlinux)
> +{
> +	s32 type_id;
> +
> +	type_id = btf_find_by_name_kind(_btf_vmlinux, "sock", BTF_KIND_STRUCT);
> +	if (type_id < 0)
> +		return -EINVAL;
> +	sock_id = type_id;
> +
> +	type_id = btf_find_by_name_kind(_btf_vmlinux, "tcp_sock",
> +					BTF_KIND_STRUCT);
> +	if (type_id < 0)
> +		return -EINVAL;
> +	tcp_sock_id = type_id;
> +	tcp_sock_type = btf_type_by_id(_btf_vmlinux, tcp_sock_id);
> +
> +	return 0;
> +}
> +
> +static bool check_optional(u32 member_offset)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(optional_ops); i++) {
> +		if (member_offset == optional_ops[i])
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static bool check_unsupported(u32 member_offset)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
> +		if (member_offset == unsupported_ops[i])
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
[...]
> +
> +static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
> +	.get_func_proto		= bpf_tcp_ca_get_func_proto,
> +	.is_valid_access	= bpf_tcp_ca_is_valid_access,
> +	.btf_struct_access	= bpf_tcp_ca_btf_struct_access,
> +};
> +
> +static int bpf_tcp_ca_init_member(const struct btf_type *t,
> +				  const struct btf_member *member,
> +				  void *kdata, const void *udata)
> +{
> +	const struct tcp_congestion_ops *utcp_ca;
> +	struct tcp_congestion_ops *tcp_ca;
> +	size_t tcp_ca_name_len;
> +	int prog_fd;
> +	u32 moff;
> +
> +	utcp_ca = (const struct tcp_congestion_ops *)udata;
> +	tcp_ca = (struct tcp_congestion_ops *)kdata;
> +
> +	moff = btf_member_bit_offset(t, member) / 8;
> +	switch (moff) {
> +	case offsetof(struct tcp_congestion_ops, flags):
> +		if (utcp_ca->flags & ~TCP_CONG_MASK)
> +			return -EINVAL;
> +		tcp_ca->flags = utcp_ca->flags;
> +		return 1;
> +	case offsetof(struct tcp_congestion_ops, name):
> +		tcp_ca_name_len = strnlen(utcp_ca->name, sizeof(utcp_ca->name));
> +		if (!tcp_ca_name_len ||
> +		    tcp_ca_name_len == sizeof(utcp_ca->name))
> +			return -EINVAL;
> +		memcpy(tcp_ca->name, utcp_ca->name, sizeof(tcp_ca->name));
> +		return 1;
> +	}
> +
> +	if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, NULL))
> +		return 0;
> +
> +	prog_fd = (int)(*(unsigned long *)(udata + moff));
> +	if (!prog_fd && !check_optional(moff) && !check_unsupported(moff))
> +		return -EINVAL;

So if a member is optional or unsupported, we will return -EINVAL?
I am probably missing something here.

> +
> +	return 0;
> +}
> +
> +static int bpf_tcp_ca_check_member(const struct btf_type *t,
> +				   const struct btf_member *member)
> +{
> +	if (check_unsupported(btf_member_bit_offset(t, member) / 8))
> +		return -ENOTSUPP;
> +	return 0;
> +}
> +
> +static int bpf_tcp_ca_reg(void *kdata)
> +{
> +	return tcp_register_congestion_control(kdata);
> +}
> +
> +static void bpf_tcp_ca_unreg(void *kdata)
> +{
> +	tcp_unregister_congestion_control(kdata);
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper
  2019-12-14  0:47 ` [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
@ 2019-12-17 17:41   ` Yonghong Song
  0 siblings, 0 replies; 51+ messages in thread
From: Yonghong Song @ 2019-12-17 17:41 UTC (permalink / raw)
  To: Martin Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, David Miller, Kernel Team, netdev



On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> Add a helper to send out a tcp-ack.  It will be used in the later
> bpf_dctcp implementation that requires to send out an ack
> when the CE state changed.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-17  8:26       ` Jakub Sitnicki
@ 2019-12-17 18:22         ` Martin Lau
  2019-12-17 21:04           ` Eric Dumazet
  2019-12-18  9:03           ` Jakub Sitnicki
  0 siblings, 2 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-17 18:22 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Eric Dumazet, Neal Cardwell, bpf, Alexei Starovoitov,
	Daniel Borkmann, David Miller, Kernel Team, Netdev

On Tue, Dec 17, 2019 at 09:26:31AM +0100, Jakub Sitnicki wrote:
> On Sat, Dec 14, 2019 at 08:25 PM CET, Neal Cardwell wrote:
> > On Fri, Dec 13, 2019 at 9:00 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >>
> >>
> >> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> >> > This patch adds a helper to handle jiffies.  Some of the
> >> > tcp_sock's timing is stored in jiffies.  Although things
> >> > could be deduced by CONFIG_HZ, having an easy way to get
> >> > jiffies will make the later bpf-tcp-cc implementation easier.
> >> >
> >>
> >> ...
> >>
> >> > +
> >> > +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
> >> > +{
> >> > +     if (!flags)
> >> > +             return get_jiffies_64();
> >> > +
> >> > +     if (flags & BPF_F_NS_TO_JIFFIES) {
> >> > +             return nsecs_to_jiffies(in);
> >> > +     } else if (flags & BPF_F_JIFFIES_TO_NS) {
> >> > +             if (!in)
> >> > +                     in = get_jiffies_64();
> >> > +             return jiffies_to_nsecs(in);
> >> > +     }
> >> > +
> >> > +     return 0;
> >> > +}
> >>
> >> This looks a bit convoluted :)
> >>
> >> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
> >>
> >> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
> >
> > If the jiffies functionality stays, how about 3 simple functions that
> > correspond to the underlying C functions, perhaps something like:
> >
> >   bpf_nsecs_to_jiffies(nsecs)
> >   bpf_jiffies_to_nsecs(jiffies)
> >   bpf_get_jiffies_64()
> >
> > Separate functions might be easier to read/maintain (and may even be
> > faster, given the corresponding reduction in branches).
> 
> Having bpf_nsecs_to_jiffies() would be also handy for BPF sockops progs
> that configure SYN-RTO timeout (BPF_SOCK_OPS_TIMEOUT_INIT).
> 
> Right now user-space needs to go look for CONFIG_HZ in /proc/config.gz
Andrii's extern variable work (already landed) allows a bpf_prog
to read CONFIG_HZ as a global variable.  It is the path that I am
pursuing now for jiffies/nsecs conversion without relying on
a helper.
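
For illustration, a rough sketch of that direction (not code from this
series; it assumes the __kconfig extern convention from the extern-variable
work, the usual bpf_helpers.h definitions, and a CONFIG_HZ value that
divides USEC_PER_SEC evenly):

extern unsigned long CONFIG_HZ __kconfig;

#define USEC_PER_SEC	1000000UL

/* libbpf resolves CONFIG_HZ at load time, so a bpf-tcp-cc can convert
 * jiffies to usec without calling into a helper.
 */
static __always_inline __u64 bpf_jiffies_to_usecs(__u64 j)
{
	return j * (USEC_PER_SEC / CONFIG_HZ);
}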

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-17 18:22         ` Martin Lau
@ 2019-12-17 21:04           ` Eric Dumazet
  2019-12-18  9:03           ` Jakub Sitnicki
  1 sibling, 0 replies; 51+ messages in thread
From: Eric Dumazet @ 2019-12-17 21:04 UTC (permalink / raw)
  To: Martin Lau, Jakub Sitnicki
  Cc: Eric Dumazet, Neal Cardwell, bpf, Alexei Starovoitov,
	Daniel Borkmann, David Miller, Kernel Team, Netdev


> Andrii's extern variable work (already landed) allows a bpf_prog
> to read CONFIG_HZ as a global variable.  It is the path that I am
> pursuing now for jiffies/nsecs conversion without relying on
> a helper.

I am traveling today, but I plan to send a patch series for cubic,
switching to usec resolution to solve its inability to properly
detect ack trains in the datacenter.

But it will still use jiffies32 in some spots,
as you already mentioned, because of tp->lsndtime.

This means bpf could also stick to tp->tcp_mstamp.

extract :

-static inline u32 bictcp_clock(void)
+static inline u32 bictcp_clock_us(const struct sock *sk)
 {
-#if HZ < 1000
-       return ktime_to_ms(ktime_get_real());
-#else
-       return jiffies_to_msecs(jiffies);
-#endif
+       return tcp_sk(sk)->tcp_mstamp;
 }
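
For illustration, a rough sketch of the BPF-side equivalent of the
extract above (not code from the series; it assumes BTF-typed pointer
reads of tcp_sock, as this series allows, and the usual bpf_helpers.h
definitions):

static __always_inline const struct tcp_sock *tcp_sk(const struct sock *sk)
{
	return (const struct tcp_sock *)sk;
}

/* usec timestamp maintained by the kernel; no jiffies needed */
static __always_inline __u32 bictcp_clock_us(const struct sock *sk)
{
	return tcp_sk(sk)->tcp_mstamp;
}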

 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-14  0:48 ` [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
@ 2019-12-18  3:07   ` Andrii Nakryiko
  2019-12-18  7:03     ` Martin Lau
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-18  3:07 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch adds BPF STRUCT_OPS support to libbpf.
>
> The only sec_name convention is SEC("struct_ops") to identify the
> struct ops implemented in BPF, e.g.
> SEC("struct_ops")
> struct tcp_congestion_ops dctcp = {
>         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
>         /* ... some more func prts ... */
>         .name           = "bpf_dctcp",
> };
>
> In the bpf_object__open phase, libbpf will look for the "struct_ops"
> elf section and find out what is the btf-type the "struct_ops" is
> implementing.  Note that the btf-type here is referring to
> a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> where are the bpf progs that the func ptrs are referring to.
>
> In the bpf_object__load phase, the prepare_struct_ops() will load
> the btf_vmlinux and obtain the corresponding kernel's btf-type.
> With the kernel's btf-type, it can then set the prog->type,
> prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> the prog's properties do not rely on its section name.
>
> Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> process is as simple as: member-name match + btf-kind match + size match.
> If these matching conditions fail, libbpf will reject.
> The current targeting support is "struct tcp_congestion_ops" which
> most of its members are function pointers.
> The member ordering of the bpf_prog's btf-type can be different from
> the btf_vmlinux's btf-type.
>
> Once the prog's properties are all set,
> the libbpf will proceed to load all the progs.
>
> After that, register_struct_ops() will create a map, finalize the
> map-value by populating it with the prog-fd, and then register this
> "struct_ops" to the kernel by updating the map-value to the map.
>
> By default, libbpf does not unregister the struct_ops from the kernel
> during bpf_object__close().  It can be changed by setting the new
> "unreg_st_ops" in bpf_object_open_opts.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

This looks pretty good to me. The two big things are exposing struct_ops
as a real struct bpf_map, so that users can interact with it using
libbpf APIs, and splitting struct_ops map creation from
registration. bpf_object__load() should only make sure all maps are
created and progs are loaded/verified, but no BPF program can be
called yet. Then attach is the phase where registration happens.


>  tools/lib/bpf/bpf.c           |  10 +-
>  tools/lib/bpf/bpf.h           |   5 +-
>  tools/lib/bpf/libbpf.c        | 599 +++++++++++++++++++++++++++++++++-
>  tools/lib/bpf/libbpf.h        |   3 +-
>  tools/lib/bpf/libbpf_probes.c |   2 +
>  5 files changed, 612 insertions(+), 7 deletions(-)
>

[...]

>  LIBBPF_API int
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 27d5f7ecba32..ffb5cdd7db5a 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -67,6 +67,10 @@
>
>  #define __printf(a, b) __attribute__((format(printf, a, b)))
>
> +static struct btf *bpf_core_find_kernel_btf(void);

this is not CO-RE specific anymore; we should probably just rename it
to bpf_find_kernel_btf

> +static struct bpf_program *bpf_object__find_prog_by_idx(struct bpf_object *obj,
> +                                                       int idx);
> +
>  static int __base_pr(enum libbpf_print_level level, const char *format,
>                      va_list args)
>  {
> @@ -128,6 +132,8 @@ void libbpf_print(enum libbpf_print_level level, const char *format, ...)
>  # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ
>  #endif
>
> +#define BPF_STRUCT_OPS_SEC_NAME "struct_ops"

This is a special ELF section recognized by libbpf, so similarly to
".maps" (and ".kconfig", which I'm renaming from ".extern"), I think
this should be ".struct_ops" (or I'd even drop the underscore and go
with ".structops", but I'm not insisting).

> +
>  static inline __u64 ptr_to_u64(const void *ptr)
>  {
>         return (__u64) (unsigned long) ptr;
> @@ -233,6 +239,32 @@ struct bpf_map {
>         bool reused;
>  };
>
> +struct bpf_struct_ops {
> +       const char *var_name;
> +       const char *tname;
> +       const struct btf_type *type;
> +       struct bpf_program **progs;
> +       __u32 *kern_func_off;
> +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> +       void *data;
> +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf

Using the __bpf_ prefix for these struct_ops-specific types is a bit too
generic (e.g., for the raw_tp stuff Alexei used btf_trace_). So maybe make
it btf_ops_ or btf_structops_?


> +        * format.
> +        * struct __bpf_tcp_congestion_ops {
> +        *      [... some other kernel fields ...]
> +        *      struct tcp_congestion_ops data;
> +        * }
> +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).

The comment isn't very clear... do you mean that the data pointed to by
kern_vdata is sizeof(...) bytes?

> +        * prepare_struct_ops() will populate the "data" into
> +        * "kern_vdata".
> +        */
> +       void *kern_vdata;
> +       __u32 type_id;
> +       __u32 kern_vtype_id;
> +       __u32 kern_vtype_size;
> +       int fd;
> +       bool unreg;

This unreg flag (and the default behavior of not unregistering) is
bothering me a bit... Shouldn't this be controlled by the map's lifetime,
at least? E.g., if no one pins that map, then the struct_ops should be
unregistered on map destruction. If an application wants to keep the BPF
programs attached, it should make sure to pin the map before the
userspace part exits? Is this problematic in any way?

> +};
> +
>  struct bpf_secdata {
>         void *rodata;
>         void *data;
> @@ -251,6 +283,7 @@ struct bpf_object {
>         size_t nr_maps;
>         size_t maps_cap;
>         struct bpf_secdata sections;
> +       struct bpf_struct_ops st_ops;

These bpf_struct_ops strictly belong to that special struct_ops
map, right? So I'd say we should change struct bpf_map to contain
an extra per-map piece of info. We can combine that with the current
mmaped pointer for internal maps:


struct bpf_map {
    ...
    union {
        void *mmaped;
        struct bpf_struct_ops *st_ops;
    };
};

That way those special maps can have an extra piece of information
specific to that special map's type.


>
>         bool loaded;
>         bool has_pseudo_calls;
> @@ -270,6 +303,7 @@ struct bpf_object {
>                 Elf_Data *data;
>                 Elf_Data *rodata;
>                 Elf_Data *bss;
> +               Elf_Data *st_ops_data;
>                 size_t strtabidx;
>                 struct {
>                         GElf_Shdr shdr;
> @@ -282,6 +316,7 @@ struct bpf_object {
>                 int data_shndx;
>                 int rodata_shndx;
>                 int bss_shndx;
> +               int st_ops_shndx;
>         } efile;
>         /*
>          * All loaded bpf_object is linked in a list, which is
> @@ -509,6 +544,508 @@ static __u32 get_kernel_version(void)
>         return KERNEL_VERSION(major, minor, patch);
>  }
>
> +static int bpf_object__register_struct_ops(struct bpf_object *obj)
> +{
> +       struct bpf_create_map_attr map_attr = {};
> +       struct bpf_struct_ops *st_ops;
> +       const char *tname;
> +       __u32 i, zero = 0;
> +       int fd, err;
> +
> +       st_ops = &obj->st_ops;
> +       if (!st_ops->kern_vdata)
> +               return 0;

this shouldn't happen, right? I'd drop the check or at least return an error.

> +
> +       tname = st_ops->tname;
> +       for (i = 0; i < btf_vlen(st_ops->type); i++) {
> +               struct bpf_program *prog = st_ops->progs[i];
> +               void *kern_data;
> +               int prog_fd;
> +
> +               if (!prog)
> +                       continue;
> +
> +               prog_fd = bpf_program__nth_fd(prog, 0);

nit: just bpf_program__fd(prog)

> +               if (prog_fd < 0) {
> +                       pr_warn("struct_ops register %s: prog %s is not loaded\n",
> +                               tname, prog->name);
> +                       return -EINVAL;
> +               }

This is a redundant check; register_struct_ops will not be called if any
program fails to load.

> +
> +               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
> +               *(unsigned long *)kern_data = prog_fd;
> +       }
> +
> +       map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
> +       map_attr.key_size = sizeof(unsigned int);
> +       map_attr.value_size = st_ops->kern_vtype_size;
> +       map_attr.max_entries = 1;
> +       map_attr.btf_fd = btf__fd(obj->btf);
> +       map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
> +       map_attr.name = st_ops->var_name;
> +
> +       fd = bpf_create_map_xattr(&map_attr);

we should try to reuse bpf_object__init_internal_map(). This will add
a struct bpf_map which users can iterate over, look up by name, etc.
We had a similar discussion when Daniel was adding global data maps,
and we conclusively decided that these special maps have to be
represented in libbpf as struct bpf_map as well.

> +       if (fd < 0) {
> +               err = -errno;
> +               pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
> +                       tname);
> +               return err;
> +       }
> +
> +       err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);

This is what "activates" the struct_ops, so it has to happen outside of
load; load shouldn't trigger execution of BPF programs. So something
like bpf_map__attach_struct_ops(), or, if we introduce a new concept for
struct_ops, bpf_struct_ops__attach(), which can be called explicitly
by the user or automatically from the skeleton's <skeleton>__attach().
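
For illustration, a rough sketch of how that split could look from the
caller's side (the attach function name is only the one proposed above,
nothing final, and error handling is elided):

	struct bpf_object *obj;
	struct bpf_map *map;

	obj = bpf_object__open_file("bpf_dctcp.o", NULL);
	bpf_object__load(obj);			/* create maps, load/verify progs */

	/* nothing is registered yet; attach is a separate, explicit step */
	map = bpf_object__find_map_by_name(obj, "dctcp");
	bpf_map__attach_struct_ops(map);	/* proposed name; registers the CC */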


> +       if (err) {
> +               err = -errno;
> +               close(fd);
> +               pr_warn("struct_ops register %s: Error in updating struct_ops map\n",
> +                       tname);
> +               return err;
> +       }
> +
> +       st_ops->fd = fd;
> +
> +       return 0;
> +}
> +
> +static int bpf_struct_ops__unregister(struct bpf_struct_ops *st_ops)
> +{
> +       if (st_ops->fd != -1) {
> +               __u32 zero = 0;
> +               int err = 0;
> +
> +               if (bpf_map_delete_elem(st_ops->fd, &zero))
> +                       err = -errno;
> +               zclose(st_ops->fd);
> +
> +               return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct btf_type *
> +resolve_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
> +static const struct btf_type *
> +resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id);
> +
> +static const struct btf_member *
> +find_member_by_offset(const struct btf_type *t, __u32 offset)

nit: find_member_by_bit_offset (offset -> bit_offset)?

> +{
> +       struct btf_member *m;
> +       int i;
> +
> +       for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
> +               if (btf_member_bit_offset(t, i) == offset)
> +                       return m;
> +       }
> +
> +       return NULL;
> +}
> +
> +static const struct btf_member *
> +find_member_by_name(const struct btf *btf, const struct btf_type *t,
> +                   const char *name)
> +{
> +       struct btf_member *m;
> +       int i;
> +
> +       for (i = 0, m = btf_members(t); i < btf_vlen(t); i++, m++) {
> +               if (!strcmp(btf__name_by_offset(btf, m->name_off), name))
> +                       return m;
> +       }
> +
> +       return NULL;
> +}
> +
> +#define STRUCT_OPS_VALUE_PREFIX "__bpf_"
> +#define STRUCT_OPS_VALUE_PREFIX_LEN (sizeof(STRUCT_OPS_VALUE_PREFIX) - 1)
> +
> +static int
> +bpf_struct_ops__get_kern_types(const struct btf *btf, const char *tname,

nit: there is no "bpf_struct_ops" object in libbpf and this is not its
method, so it's a violation of libbpf's naming convention; please
consider renaming it to something like "find_struct_ops_kern_types"

> +                              const struct btf_type **type, __u32 *type_id,
> +                              const struct btf_type **vtype, __u32 *vtype_id,
> +                              const struct btf_member **data_member)
> +{
> +       const struct btf_type *kern_type, *kern_vtype;
> +       const struct btf_member *kern_data_member;
> +       __s32 kern_vtype_id, kern_type_id;
> +       char vtname[128] = STRUCT_OPS_VALUE_PREFIX;
> +       __u32 i;
> +
> +       kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
> +       if (kern_type_id < 0) {
> +               pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
> +                       tname);
> +               return -ENOTSUP;

just return kern_type_id (pass through btf__find_by_name_kind's
result). Same below.

> +       }
> +       kern_type = btf__type_by_id(btf, kern_type_id);
> +
> +       /* Find the corresponding "map_value" type that will be used
> +        * in map_update(BPF_MAP_TYPE_STRUCT_OPS).  For example,
> +        * find "struct __bpf_tcp_congestion_ops" from the btf_vmlinux.
> +        */
> +       strncat(vtname + STRUCT_OPS_VALUE_PREFIX_LEN, tname,
> +               sizeof(vtname) - STRUCT_OPS_VALUE_PREFIX_LEN - 1);
> +       kern_vtype_id = btf__find_by_name_kind(btf, vtname,
> +                                              BTF_KIND_STRUCT);
> +       if (kern_vtype_id < 0) {
> +               pr_warn("struct_ops prepare: struct %s is not found in kernel BTF\n",
> +                       vtname);
> +               return -ENOTSUP;
> +       }
> +       kern_vtype = btf__type_by_id(btf, kern_vtype_id);
> +
> +       /* Find "struct tcp_congestion_ops" from
> +        * struct __bpf_tcp_congestion_ops {
> +        *      [ ... ]
> +        *      struct tcp_congestion_ops data;
> +        * }
> +        */
> +       for (i = 0, kern_data_member = btf_members(kern_vtype);
> +            i < btf_vlen(kern_vtype);
> +            i++, kern_data_member++) {

nit: the multi-line for is kind of ugly; maybe move the kern_data_member
assignment out of the for?
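
i.e., something along these lines (a sketch of the suggested restructuring):

	kern_data_member = btf_members(kern_vtype);
	for (i = 0; i < btf_vlen(kern_vtype); i++, kern_data_member++) {
		if (kern_data_member->type == kern_type_id)
			break;
	}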

> +               if (kern_data_member->type == kern_type_id)
> +                       break;
> +       }
> +       if (i == btf_vlen(kern_vtype)) {
> +               pr_warn("struct_ops prepare: struct %s data is not found in struct %s\n",
> +                       tname, vtname);
> +               return -EINVAL;
> +       }
> +

[...]

>  static int bpf_object__init_btf(struct bpf_object *obj,
> @@ -1689,6 +2257,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
>                         } else if (strcmp(name, ".rodata") == 0) {
>                                 obj->efile.rodata = data;
>                                 obj->efile.rodata_shndx = idx;
> +                       } else if (strcmp(name, BPF_STRUCT_OPS_SEC_NAME) == 0) {
> +                               obj->efile.st_ops_data = data;
> +                               obj->efile.st_ops_shndx = idx;
>                         } else {
>                                 pr_debug("skip section(%d) %s\n", idx, name);
>                         }
> @@ -1698,7 +2269,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj, bool relaxed_maps,
>                         int sec = sh.sh_info; /* points to other section */
>
>                         /* Only do relo for section with exec instructions */
> -                       if (!section_have_execinstr(obj, sec)) {
> +                       if (!section_have_execinstr(obj, sec) &&
> +                           !strstr(name, BPF_STRUCT_OPS_SEC_NAME)) {

why substring match?

>                                 pr_debug("skip relo %s(%d) for section(%d)\n",
>                                          name, idx, sec);
>                                 continue;

[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18  3:07   ` Andrii Nakryiko
@ 2019-12-18  7:03     ` Martin Lau
  2019-12-18  7:20       ` Martin Lau
  2019-12-18 16:34       ` Andrii Nakryiko
  0 siblings, 2 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-18  7:03 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > This patch adds BPF STRUCT_OPS support to libbpf.
> >
> > The only sec_name convention is SEC("struct_ops") to identify the
> > struct ops implemented in BPF, e.g.
> > SEC("struct_ops")
> > struct tcp_congestion_ops dctcp = {
> >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> >         /* ... some more func prts ... */
> >         .name           = "bpf_dctcp",
> > };
> >
> > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > elf section and find out what is the btf-type the "struct_ops" is
> > implementing.  Note that the btf-type here is referring to
> > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > where are the bpf progs that the func ptrs are referring to.
> >
> > In the bpf_object__load phase, the prepare_struct_ops() will load
> > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > With the kernel's btf-type, it can then set the prog->type,
> > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > the prog's properties do not rely on its section name.
> >
> > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > process is as simple as: member-name match + btf-kind match + size match.
> > If these matching conditions fail, libbpf will reject.
> > The current targeting support is "struct tcp_congestion_ops" which
> > most of its members are function pointers.
> > The member ordering of the bpf_prog's btf-type can be different from
> > the btf_vmlinux's btf-type.
> >
> > Once the prog's properties are all set,
> > the libbpf will proceed to load all the progs.
> >
> > After that, register_struct_ops() will create a map, finalize the
> > map-value by populating it with the prog-fd, and then register this
> > "struct_ops" to the kernel by updating the map-value to the map.
> >
> > By default, libbpf does not unregister the struct_ops from the kernel
> > during bpf_object__close().  It can be changed by setting the new
> > "unreg_st_ops" in bpf_object_open_opts.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> 
> This looks pretty good to me. The big two things is exposing structops
> as real struct bpf_map, so that users can interact with it using
> libbpf APIs, as well as splitting struct_ops map creation and
> registration. bpf_object__load() should only make sure all maps are
> created, progs are loaded/verified, but none of BPF program can yet be
> called. Then attach is the phase where registration happens.
Thanks for the review.

[ ... ]

> >  static inline __u64 ptr_to_u64(const void *ptr)
> >  {
> >         return (__u64) (unsigned long) ptr;
> > @@ -233,6 +239,32 @@ struct bpf_map {
> >         bool reused;
> >  };
> >
> > +struct bpf_struct_ops {
> > +       const char *var_name;
> > +       const char *tname;
> > +       const struct btf_type *type;
> > +       struct bpf_program **progs;
> > +       __u32 *kern_func_off;
> > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > +       void *data;
> > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> 
> Using __bpf_ prefix for this struct_ops-specific types is a bit too
> generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> it btf_ops_ or btf_structops_?
Is the concern about name collision?

The prefix was picked to use a more representative name.
struct_ops uses many bpf pieces and btf is one of them.
Very soon, all new code will depend on BTF and the btf_ prefix
could become generic as well.

Unlike tracepoint, there is no non-btf version of struct_ops.

> 
> 
> > +        * format.
> > +        * struct __bpf_tcp_congestion_ops {
> > +        *      [... some other kernel fields ...]
> > +        *      struct tcp_congestion_ops data;
> > +        * }
> > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> 
> Comment isn't very clear.. do you mean that data pointed to by
> kern_vdata is of sizeof(...) bytes?
> 
> > +        * prepare_struct_ops() will populate the "data" into
> > +        * "kern_vdata".
> > +        */
> > +       void *kern_vdata;
> > +       __u32 type_id;
> > +       __u32 kern_vtype_id;
> > +       __u32 kern_vtype_size;
> > +       int fd;
> > +       bool unreg;
> 
> This unreg flag (and default behavior to not unregister) is bothering
> me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> E.g., if no one pins that map - then struct_ops should be unregistered
> on map destruction. If application wants to keep BPF programs
> attached, it should make sure to pin map, before userspace part exits?
> Is this problematic in any way?
I don't think it should in the struct_ops case.  I think of the
struct_ops map as a set of progs "attached" to a subsystem (tcp_cong
in this case), and these map-progs stay (or keep being attached) until it
is detached.  It is like how any other attached bpf_prog keeps running
without caring whether the bpf_prog is pinned or not.

About the "bool unreg;", the default can be changed to true if
that makes more sense.

[ ... ]

> 
> > +
> > +               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
> > +               *(unsigned long *)kern_data = prog_fd;
> > +       }
> > +
> > +       map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
> > +       map_attr.key_size = sizeof(unsigned int);
> > +       map_attr.value_size = st_ops->kern_vtype_size;
> > +       map_attr.max_entries = 1;
> > +       map_attr.btf_fd = btf__fd(obj->btf);
> > +       map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
> > +       map_attr.name = st_ops->var_name;
> > +
> > +       fd = bpf_create_map_xattr(&map_attr);
> 
> we should try to reuse bpf_object__init_internal_map(). This will add
> struct bpf_map which users can iterate over and look up by name, etc.
> We had similar discussion when Daniel was adding  global data maps,
> and we conclusively decided that these special maps have to be
> represented in libbpf as struct bpf_map as well.
I will take a look.

> 
> > +       if (fd < 0) {
> > +               err = -errno;
> > +               pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
> > +                       tname);
> > +               return err;
> > +       }
> > +
> > +       err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18  7:03     ` Martin Lau
@ 2019-12-18  7:20       ` Martin Lau
  2019-12-18 16:36         ` Andrii Nakryiko
  2019-12-18 16:34       ` Andrii Nakryiko
  1 sibling, 1 reply; 51+ messages in thread
From: Martin Lau @ 2019-12-18  7:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Tue, Dec 17, 2019 at 11:03:45PM -0800, Martin Lau wrote:
> On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > This patch adds BPF STRUCT_OPS support to libbpf.
> > >
> > > The only sec_name convention is SEC("struct_ops") to identify the
> > > struct ops implemented in BPF, e.g.
> > > SEC("struct_ops")
> > > struct tcp_congestion_ops dctcp = {
> > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> > >         /* ... some more func prts ... */
> > >         .name           = "bpf_dctcp",
> > > };
> > >
> > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > > elf section and find out what is the btf-type the "struct_ops" is
> > > implementing.  Note that the btf-type here is referring to
> > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > > where are the bpf progs that the func ptrs are referring to.
> > >
> > > In the bpf_object__load phase, the prepare_struct_ops() will load
> > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > > With the kernel's btf-type, it can then set the prog->type,
> > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > > the prog's properties do not rely on its section name.
> > >
> > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > > process is as simple as: member-name match + btf-kind match + size match.
> > > If these matching conditions fail, libbpf will reject.
> > > The current targeting support is "struct tcp_congestion_ops" which
> > > most of its members are function pointers.
> > > The member ordering of the bpf_prog's btf-type can be different from
> > > the btf_vmlinux's btf-type.
> > >
> > > Once the prog's properties are all set,
> > > the libbpf will proceed to load all the progs.
> > >
> > > After that, register_struct_ops() will create a map, finalize the
> > > map-value by populating it with the prog-fd, and then register this
> > > "struct_ops" to the kernel by updating the map-value to the map.
> > >
> > > By default, libbpf does not unregister the struct_ops from the kernel
> > > during bpf_object__close().  It can be changed by setting the new
> > > "unreg_st_ops" in bpf_object_open_opts.
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > 
> > This looks pretty good to me. The big two things is exposing structops
> > as real struct bpf_map, so that users can interact with it using
> > libbpf APIs, as well as splitting struct_ops map creation and
> > registration. bpf_object__load() should only make sure all maps are
> > created, progs are loaded/verified, but none of BPF program can yet be
> > called. Then attach is the phase where registration happens.
> Thanks for the review.
> 
> [ ... ]
> 
> > >  static inline __u64 ptr_to_u64(const void *ptr)
> > >  {
> > >         return (__u64) (unsigned long) ptr;
> > > @@ -233,6 +239,32 @@ struct bpf_map {
> > >         bool reused;
> > >  };
> > >
> > > +struct bpf_struct_ops {
> > > +       const char *var_name;
> > > +       const char *tname;
> > > +       const struct btf_type *type;
> > > +       struct bpf_program **progs;
> > > +       __u32 *kern_func_off;
> > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > +       void *data;
> > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > 
> > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > it btf_ops_ or btf_structops_?
> Is it a concern on name collision?
> 
> The prefix pick is to use a more representative name.
> struct_ops use many bpf pieces and btf is one of them.
> Very soon, all new codes will depend on BTF and btf_ prefix
> could become generic also.
> 
> Unlike tracepoint, there is no non-btf version of struct_ops.
Maybe bpf_struct_ops_?

It was my early pick, but it reads quite weirdly:
bpf_[struct]_<ops>_[tcp_congestion]_<ops>.

Hence, I went with __bpf_<actual-name-of-the-kernel-struct> in this series.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies
  2019-12-17 18:22         ` Martin Lau
  2019-12-17 21:04           ` Eric Dumazet
@ 2019-12-18  9:03           ` Jakub Sitnicki
  1 sibling, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2019-12-18  9:03 UTC (permalink / raw)
  To: Martin Lau
  Cc: Eric Dumazet, Neal Cardwell, bpf, Alexei Starovoitov,
	Daniel Borkmann, David Miller, Kernel Team, Netdev

On Tue, Dec 17, 2019 at 07:22 PM CET, Martin Lau wrote:
> On Tue, Dec 17, 2019 at 09:26:31AM +0100, Jakub Sitnicki wrote:
>> On Sat, Dec 14, 2019 at 08:25 PM CET, Neal Cardwell wrote:
>> > On Fri, Dec 13, 2019 at 9:00 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >>
>> >>
>> >>
>> >> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
>> >> > This patch adds a helper to handle jiffies.  Some of the
>> >> > tcp_sock's timing is stored in jiffies.  Although things
>> >> > could be deduced by CONFIG_HZ, having an easy way to get
>> >> > jiffies will make the later bpf-tcp-cc implementation easier.
>> >> >
>> >>
>> >> ...
>> >>
>> >> > +
>> >> > +BPF_CALL_2(bpf_jiffies, u64, in, u64, flags)
>> >> > +{
>> >> > +     if (!flags)
>> >> > +             return get_jiffies_64();
>> >> > +
>> >> > +     if (flags & BPF_F_NS_TO_JIFFIES) {
>> >> > +             return nsecs_to_jiffies(in);
>> >> > +     } else if (flags & BPF_F_JIFFIES_TO_NS) {
>> >> > +             if (!in)
>> >> > +                     in = get_jiffies_64();
>> >> > +             return jiffies_to_nsecs(in);
>> >> > +     }
>> >> > +
>> >> > +     return 0;
>> >> > +}
>> >>
>> >> This looks a bit convoluted :)
>> >>
>> >> Note that we could possibly change net/ipv4/tcp_cubic.c to no longer use jiffies at all.
>> >>
>> >> We have in tp->tcp_mstamp an accurate timestamp (in usec) that can be converted to ms.
>> >
>> > If the jiffies functionality stays, how about 3 simple functions that
>> > correspond to the underlying C functions, perhaps something like:
>> >
>> >   bpf_nsecs_to_jiffies(nsecs)
>> >   bpf_jiffies_to_nsecs(jiffies)
>> >   bpf_get_jiffies_64()
>> >
>> > Separate functions might be easier to read/maintain (and may even be
>> > faster, given the corresponding reduction in branches).
>>
>> Having bpf_nsecs_to_jiffies() would be also handy for BPF sockops progs
>> that configure SYN-RTO timeout (BPF_SOCK_OPS_TIMEOUT_INIT).
>>
>> Right now user-space needs to go look for CONFIG_HZ in /proc/config.gz
> Andrii's extern variable work (already landed) allows a bpf_prog
> to read CONFIG_HZ as a global variable.  It is the path that I am
> pursuing now for jiffies/nsecs conversion without relying on
> a helper.

Thank you for the pointer, and Andrii for implementing it.
Selftest [0] from the extern-var-support series demonstrates it nicely.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=330a73a7b6ca93a415de1b7da68d7a0698fe4937

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18  7:03     ` Martin Lau
  2019-12-18  7:20       ` Martin Lau
@ 2019-12-18 16:34       ` Andrii Nakryiko
  2019-12-18 17:33         ` Martin Lau
  1 sibling, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-18 16:34 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
>
> On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > This patch adds BPF STRUCT_OPS support to libbpf.
> > >
> > > The only sec_name convention is SEC("struct_ops") to identify the
> > > struct ops implemented in BPF, e.g.
> > > SEC("struct_ops")
> > > struct tcp_congestion_ops dctcp = {
> > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> > >         /* ... some more func prts ... */
> > >         .name           = "bpf_dctcp",
> > > };
> > >
> > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > > elf section and find out what is the btf-type the "struct_ops" is
> > > implementing.  Note that the btf-type here is referring to
> > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > > where are the bpf progs that the func ptrs are referring to.
> > >
> > > In the bpf_object__load phase, the prepare_struct_ops() will load
> > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > > With the kernel's btf-type, it can then set the prog->type,
> > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > > the prog's properties do not rely on its section name.
> > >
> > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > > process is as simple as: member-name match + btf-kind match + size match.
> > > If these matching conditions fail, libbpf will reject.
> > > The current targeting support is "struct tcp_congestion_ops" which
> > > most of its members are function pointers.
> > > The member ordering of the bpf_prog's btf-type can be different from
> > > the btf_vmlinux's btf-type.
> > >
> > > Once the prog's properties are all set,
> > > the libbpf will proceed to load all the progs.
> > >
> > > After that, register_struct_ops() will create a map, finalize the
> > > map-value by populating it with the prog-fd, and then register this
> > > "struct_ops" to the kernel by updating the map-value to the map.
> > >
> > > By default, libbpf does not unregister the struct_ops from the kernel
> > > during bpf_object__close().  It can be changed by setting the new
> > > "unreg_st_ops" in bpf_object_open_opts.
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> >
> > This looks pretty good to me. The big two things is exposing structops
> > as real struct bpf_map, so that users can interact with it using
> > libbpf APIs, as well as splitting struct_ops map creation and
> > registration. bpf_object__load() should only make sure all maps are
> > created, progs are loaded/verified, but none of BPF program can yet be
> > called. Then attach is the phase where registration happens.
> Thanks for the review.
>
> [ ... ]
>
> > >  static inline __u64 ptr_to_u64(const void *ptr)
> > >  {
> > >         return (__u64) (unsigned long) ptr;
> > > @@ -233,6 +239,32 @@ struct bpf_map {
> > >         bool reused;
> > >  };
> > >
> > > +struct bpf_struct_ops {
> > > +       const char *var_name;
> > > +       const char *tname;
> > > +       const struct btf_type *type;
> > > +       struct bpf_program **progs;
> > > +       __u32 *kern_func_off;
> > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > +       void *data;
> > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> >
> > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > it btf_ops_ or btf_structops_?
> Is it a concern on name collision?
>
> The prefix pick is to use a more representative name.
> struct_ops use many bpf pieces and btf is one of them.
> Very soon, all new codes will depend on BTF and btf_ prefix
> could become generic also.
>
> Unlike tracepoint, there is no non-btf version of struct_ops.

Not so much name collision as being able to immediately recognize
that it's used to provide type information for struct_ops. Think about
some automated tooling parsing vmlinux BTF and trying to create
derivative types for those btf_trace_xxx and __bpf_xxx types. Having
a unique prefix that identifies what kind of type-providing struct it is
is very useful for a generic tool like that, while __bpf_ doesn't
indicate in any way that it's for struct_ops.

>
> >
> >
> > > +        * format.
> > > +        * struct __bpf_tcp_congestion_ops {
> > > +        *      [... some other kernel fields ...]
> > > +        *      struct tcp_congestion_ops data;
> > > +        * }
> > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> >
> > Comment isn't very clear.. do you mean that data pointed to by
> > kern_vdata is of sizeof(...) bytes?
> >
> > > +        * prepare_struct_ops() will populate the "data" into
> > > +        * "kern_vdata".
> > > +        */
> > > +       void *kern_vdata;
> > > +       __u32 type_id;
> > > +       __u32 kern_vtype_id;
> > > +       __u32 kern_vtype_size;
> > > +       int fd;
> > > +       bool unreg;
> >
> > This unreg flag (and default behavior to not unregister) is bothering
> > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> > E.g., if no one pins that map - then struct_ops should be unregistered
> > on map destruction. If application wants to keep BPF programs
> > attached, it should make sure to pin map, before userspace part exits?
> > Is this problematic in any way?
> I don't think it should in the struct_ops case.  I think of the
> struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
> in this case) and this map-progs stay (or keep attaching) until it is
> detached.  Like other attached bpf_prog keeps running without
> caring if the bpf_prog is pinned or not.

I'll let someone else comment on how this behaves for cgroup, xdp,
etc., but for tracing, for example, we have FD-based BPF links, which
will detach the program automatically when the FD is closed. I think the
idea is to extend this to other types of BPF programs as well, so there
is no risk of leaving some stray BPF program running after an unintended
crash of the userspace program. When an application explicitly needs a
BPF program to outlive its userspace control app, this can be
achieved by pinning the map/program in BPFFS.
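
For illustration, a rough sketch of that flow (assuming the proposed
semantics where the struct_ops stays registered as long as the map is
alive; the pin path below is only an example):

	int err;

	/* keep the CC registered after the loader process exits; without
	 * a pin (or another holder of the map FD), closing the map would
	 * unregister the struct_ops under the proposed semantics
	 */
	err = bpf_map__pin(map, "/sys/fs/bpf/bpf_dctcp");
	if (err)
		fprintf(stderr, "failed to pin struct_ops map: %d\n", err);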

>
> About the "bool unreg;", the default can be changed to true if
> it makes more sense.
>
> [ ... ]
>
> >
> > > +
> > > +               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
> > > +               *(unsigned long *)kern_data = prog_fd;
> > > +       }
> > > +
> > > +       map_attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
> > > +       map_attr.key_size = sizeof(unsigned int);
> > > +       map_attr.value_size = st_ops->kern_vtype_size;
> > > +       map_attr.max_entries = 1;
> > > +       map_attr.btf_fd = btf__fd(obj->btf);
> > > +       map_attr.btf_vmlinux_value_type_id = st_ops->kern_vtype_id;
> > > +       map_attr.name = st_ops->var_name;
> > > +
> > > +       fd = bpf_create_map_xattr(&map_attr);
> >
> > we should try to reuse bpf_object__init_internal_map(). This will add
> > struct bpf_map which users can iterate over and look up by name, etc.
> > We had similar discussion when Daniel was adding  global data maps,
> > and we conclusively decided that these special maps have to be
> > represented in libbpf as struct bpf_map as well.
> I will take a look.
>
> >
> > > +       if (fd < 0) {
> > > +               err = -errno;
> > > +               pr_warn("struct_ops register %s: Error in creating struct_ops map\n",
> > > +                       tname);
> > > +               return err;
> > > +       }
> > > +
> > > +       err = bpf_map_update_elem(fd, &zero, st_ops->kern_vdata, 0);

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18  7:20       ` Martin Lau
@ 2019-12-18 16:36         ` Andrii Nakryiko
  0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-18 16:36 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Tue, Dec 17, 2019 at 11:20 PM Martin Lau <kafai@fb.com> wrote:
>
> On Tue, Dec 17, 2019 at 11:03:45PM -0800, Martin Lau wrote:
> > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > This patch adds BPF STRUCT_OPS support to libbpf.
> > > >
> > > > The only sec_name convention is SEC("struct_ops") to identify the
> > > > struct ops implemented in BPF, e.g.
> > > > SEC("struct_ops")
> > > > struct tcp_congestion_ops dctcp = {
> > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> > > >         /* ... some more func prts ... */
> > > >         .name           = "bpf_dctcp",
> > > > };
> > > >
> > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > > > elf section and find out what is the btf-type the "struct_ops" is
> > > > implementing.  Note that the btf-type here is referring to
> > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > > > where are the bpf progs that the func ptrs are referring to.
> > > >
> > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > > > With the kernel's btf-type, it can then set the prog->type,
> > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > > > the prog's properties do not rely on its section name.
> > > >
> > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > > > process is as simple as: member-name match + btf-kind match + size match.
> > > > If these matching conditions fail, libbpf will reject.
> > > > The current targeting support is "struct tcp_congestion_ops" which
> > > > most of its members are function pointers.
> > > > The member ordering of the bpf_prog's btf-type can be different from
> > > > the btf_vmlinux's btf-type.
> > > >
> > > > Once the prog's properties are all set,
> > > > the libbpf will proceed to load all the progs.
> > > >
> > > > After that, register_struct_ops() will create a map, finalize the
> > > > map-value by populating it with the prog-fd, and then register this
> > > > "struct_ops" to the kernel by updating the map-value to the map.
> > > >
> > > > By default, libbpf does not unregister the struct_ops from the kernel
> > > > during bpf_object__close().  It can be changed by setting the new
> > > > "unreg_st_ops" in bpf_object_open_opts.
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > >
> > > This looks pretty good to me. The big two things is exposing structops
> > > as real struct bpf_map, so that users can interact with it using
> > > libbpf APIs, as well as splitting struct_ops map creation and
> > > registration. bpf_object__load() should only make sure all maps are
> > > created, progs are loaded/verified, but none of BPF program can yet be
> > > called. Then attach is the phase where registration happens.
> > Thanks for the review.
> >
> > [ ... ]
> >
> > > >  static inline __u64 ptr_to_u64(const void *ptr)
> > > >  {
> > > >         return (__u64) (unsigned long) ptr;
> > > > @@ -233,6 +239,32 @@ struct bpf_map {
> > > >         bool reused;
> > > >  };
> > > >
> > > > +struct bpf_struct_ops {
> > > > +       const char *var_name;
> > > > +       const char *tname;
> > > > +       const struct btf_type *type;
> > > > +       struct bpf_program **progs;
> > > > +       __u32 *kern_func_off;
> > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > > +       void *data;
> > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > >
> > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > > it btf_ops_ or btf_structops_?
> > Is it a concern on name collision?
> >
> > The prefix pick is to use a more representative name.
> > struct_ops use many bpf pieces and btf is one of them.
> > Very soon, all new codes will depend on BTF and btf_ prefix
> > could become generic also.
> >
> > Unlike tracepoint, there is no non-btf version of struct_ops.
> May be bpf_struct_ops_?
>
> It was my early pick but it read quite weird,
> bpf_[struct]_<ops>_[tcp_congestion]_<ops>.
>
> Hence, I go with __bpf_<actual-name-of-the-kernel-struct> in this series.

bpf_struct_ops_ is much better, IMO, but given this struct serves only
the purpose of providing type information to the kernel, I think
btf_struct_ops_ is more justified.
And this <ops>_xxx_<ops> duplication doesn't bother me at all, again,
because it's not directly used in C code. But believe me, having a
unique prefix is so good, even in the simplest case of grepping through
vmlinux.h.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
  2019-12-17  6:14   ` Yonghong Song
@ 2019-12-18 16:41     ` Martin Lau
  0 siblings, 0 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-18 16:41 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Mon, Dec 16, 2019 at 10:14:09PM -0800, Yonghong Song wrote:

[ ... ]

> > +void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> > +{
> > +	const struct btf_member *member;
> > +	struct bpf_struct_ops *st_ops;
> > +	struct bpf_verifier_log log = {};
> > +	const struct btf_type *t;
> > +	const char *mname;
> > +	s32 type_id;
> > +	u32 i, j;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> > +		st_ops = bpf_struct_ops[i];
> > +
> > +		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
> > +						BTF_KIND_STRUCT);
> > +		if (type_id < 0) {
> > +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> > +				st_ops->name);
> > +			continue;
> > +		}
> > +		t = btf_type_by_id(_btf_vmlinux, type_id);
> > +		if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) {
> > +			pr_warn("Cannot support #%u members in struct %s\n",
> > +				btf_type_vlen(t), st_ops->name);
> > +			continue;
> > +		}
> > +
> > +		for_each_member(j, t, member) {
> > +			const struct btf_type *func_proto;
> > +
> > +			mname = btf_name_by_offset(_btf_vmlinux,
> > +						   member->name_off);
> > +			if (!*mname) {
> > +				pr_warn("anon member in struct %s is not supported\n",
> > +					st_ops->name);
> > +				break;
> > +			}
> > +
> > +			if (btf_member_bitfield_size(t, member)) {
> > +				pr_warn("bit field member %s in struct %s is not supported\n",
> > +					mname, st_ops->name);
> > +				break;
> > +			}
> > +
> > +			func_proto = btf_type_resolve_func_ptr(_btf_vmlinux,
> > +							       member->type,
> > +							       NULL);
> > +			if (func_proto &&
> > +			    btf_distill_func_proto(&log, _btf_vmlinux,
> > +						   func_proto, mname,
> > +						   &st_ops->func_models[j])) {
> > +				pr_warn("Error in parsing func ptr %s in struct %s\n",
> > +					mname, st_ops->name);
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (j == btf_type_vlen(t)) {
> > +			if (st_ops->init(_btf_vmlinux)) {
> 
> is it possible that st_ops->init might be a NULL pointer?
Not now.  The check could be added if there were a
struct_ops that did not need init.
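
i.e., if such a struct_ops ever shows up, the guarded form could simply
be (a sketch, treating a missing ->init as success):

		if (j == btf_type_vlen(t)) {
			/* only call ->init if the struct_ops provides one */
			if (st_ops->init && st_ops->init(_btf_vmlinux)) {
				pr_warn("Error in init bpf_struct_ops %s\n",
					st_ops->name);
			} else {
				st_ops->type_id = type_id;
				st_ops->type = t;
			}
		}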

> 
> > +				pr_warn("Error in init bpf_struct_ops %s\n",
> > +					st_ops->name);
> > +			} else {
> > +				st_ops->type_id = type_id;
> > +				st_ops->type = t;
> > +			}
> > +		}
> > +	}
> > +}

[ ... ]

> > @@ -6343,8 +6353,30 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
> >   static int check_return_code(struct bpf_verifier_env *env)
> >   {
> >   	struct tnum enforce_attach_type_range = tnum_unknown;
> > +	const struct bpf_prog *prog = env->prog;
> >   	struct bpf_reg_state *reg;
> >   	struct tnum range = tnum_range(0, 1);
> > +	int err;
> > +
> > +	/* The struct_ops func-ptr's return type could be "void" */
> > +	if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
> > +	    !prog->aux->attach_func_proto->type)
> > +		return 0;
> > +
> > +	/* eBPF calling convetion is such that R0 is used
> > +	 * to return the value from eBPF program.
> > +	 * Make sure that it's readable at this time
> > +	 * of bpf_exit, which means that program wrote
> > +	 * something into it earlier
> > +	 */
> > +	err = check_reg_arg(env, BPF_REG_0, SRC_OP);
> > +	if (err)
> > +		return err;
> > +
> > +	if (is_pointer_value(env, BPF_REG_0)) {
> > +		verbose(env, "R0 leaks addr as return value\n");
> > +		return -EACCES;
> > +	}
> >   
> >   	switch (env->prog->type) {
> >   	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > @@ -8010,21 +8042,6 @@ static int do_check(struct bpf_verifier_env *env)
> >   				if (err)
> >   					return err;
> >   
> > -				/* eBPF calling convetion is such that R0 is used
> > -				 * to return the value from eBPF program.
> > -				 * Make sure that it's readable at this time
> > -				 * of bpf_exit, which means that program wrote
> > -				 * something into it earlier
> > -				 */
> > -				err = check_reg_arg(env, BPF_REG_0, SRC_OP);
> > -				if (err)
> > -					return err;
> > -
> > -				if (is_pointer_value(env, BPF_REG_0)) {
> > -					verbose(env, "R0 leaks addr as return value\n");
> > -					return -EACCES;
> > -				}
> > -
> >   				err = check_return_code(env);
> >   				if (err)
> >   					return err;
> > @@ -8833,12 +8850,14 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
> >   			convert_ctx_access = bpf_xdp_sock_convert_ctx_access;
> >   			break;
> >   		case PTR_TO_BTF_ID:
> > -			if (type == BPF_WRITE) {
> > +			if (type == BPF_READ) {
> > +				insn->code = BPF_LDX | BPF_PROBE_MEM |
> > +					BPF_SIZE((insn)->code);
> > +				env->prog->aux->num_exentries++;
> > +			} else if (env->prog->type != BPF_PROG_TYPE_STRUCT_OPS) {
> >   				verbose(env, "Writes through BTF pointers are not allowed\n");
> >   				return -EINVAL;
> >   			}
> > -			insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code);
> > -			env->prog->aux->num_exentries++;
> 
> Do we need to increase num_exentries for BPF_WRITE as well?
Not needed, since no exception-table entry has to be generated
for this write access for BPF_PROG_TYPE_STRUCT_OPS.

The individual struct_ops (e.g. the bpf_tcp_ca_btf_struct_access()
in patch 7) ensures the write is fine, which is like the
current convert_ctx_access() in filter.c but with BTF's help.

I will add some comments on this.
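
For illustration, a rough sketch of what this enables on the BPF side
(a hypothetical snippet, assuming snd_cwnd is one of the tcp_sock fields
that bpf_tcp_ca_btf_struct_access() allows writes to):

static void bpf_cc_example(struct sock *sk)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;

	/* a plain store through a BTF-typed pointer; it is vetted by the
	 * struct_ops' btf_struct_access() at verification time, so no
	 * exception table entry is generated for it
	 */
	tp->snd_cwnd = tp->snd_ssthresh;
}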

Thanks for the review!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18 16:34       ` Andrii Nakryiko
@ 2019-12-18 17:33         ` Martin Lau
  2019-12-18 18:14           ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Martin Lau @ 2019-12-18 17:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> >
> > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > This patch adds BPF STRUCT_OPS support to libbpf.
> > > >
> > > > The only sec_name convention is SEC("struct_ops") to identify the
> > > > struct ops implemented in BPF, e.g.
> > > > SEC("struct_ops")
> > > > struct tcp_congestion_ops dctcp = {
> > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> > > >         /* ... some more func prts ... */
> > > >         .name           = "bpf_dctcp",
> > > > };
> > > >
> > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > > > elf section and find out what is the btf-type the "struct_ops" is
> > > > implementing.  Note that the btf-type here is referring to
> > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > > > where are the bpf progs that the func ptrs are referring to.
> > > >
> > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > > > With the kernel's btf-type, it can then set the prog->type,
> > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > > > the prog's properties do not rely on its section name.
> > > >
> > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > > > process is as simple as: member-name match + btf-kind match + size match.
> > > > If these matching conditions fail, libbpf will reject.
> > > > The current targeting support is "struct tcp_congestion_ops" which
> > > > most of its members are function pointers.
> > > > The member ordering of the bpf_prog's btf-type can be different from
> > > > the btf_vmlinux's btf-type.
> > > >
> > > > Once the prog's properties are all set,
> > > > the libbpf will proceed to load all the progs.
> > > >
> > > > After that, register_struct_ops() will create a map, finalize the
> > > > map-value by populating it with the prog-fd, and then register this
> > > > "struct_ops" to the kernel by updating the map-value to the map.
> > > >
> > > > By default, libbpf does not unregister the struct_ops from the kernel
> > > > during bpf_object__close().  It can be changed by setting the new
> > > > "unreg_st_ops" in bpf_object_open_opts.
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > >
> > > This looks pretty good to me. The big two things is exposing structops
> > > as real struct bpf_map, so that users can interact with it using
> > > libbpf APIs, as well as splitting struct_ops map creation and
> > > registration. bpf_object__load() should only make sure all maps are
> > > created, progs are loaded/verified, but none of BPF program can yet be
> > > called. Then attach is the phase where registration happens.
> > Thanks for the review.
> >
> > [ ... ]
> >
> > > >  static inline __u64 ptr_to_u64(const void *ptr)
> > > >  {
> > > >         return (__u64) (unsigned long) ptr;
> > > > @@ -233,6 +239,32 @@ struct bpf_map {
> > > >         bool reused;
> > > >  };
> > > >
> > > > +struct bpf_struct_ops {
> > > > +       const char *var_name;
> > > > +       const char *tname;
> > > > +       const struct btf_type *type;
> > > > +       struct bpf_program **progs;
> > > > +       __u32 *kern_func_off;
> > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > > +       void *data;
> > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > >
> > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > > it btf_ops_ or btf_structops_?
> > Is it a concern on name collision?
> >
> > The prefix pick is to use a more representative name.
> > struct_ops use many bpf pieces and btf is one of them.
> > Very soon, all new codes will depend on BTF and btf_ prefix
> > could become generic also.
> >
> > Unlike tracepoint, there is no non-btf version of struct_ops.
> 
> Not so much name collision, as being able to immediately recognize
> that it's used to provide type information for struct_ops. Think about
> some automated tooling parsing vmlinux BTF and trying to create some
> derivative types for those btf_trace_xxx and __bpf_xxx types. Having
> unique prefix that identifies what kind of type-providing struct it is
> is very useful to do generic tool like that. While __bpf_ isn't
> specifying in any ways that it's for struct_ops.
> 
> >
> > >
> > >
> > > > +        * format.
> > > > +        * struct __bpf_tcp_congestion_ops {
> > > > +        *      [... some other kernel fields ...]
> > > > +        *      struct tcp_congestion_ops data;
> > > > +        * }
> > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> > >
> > > Comment isn't very clear.. do you mean that data pointed to by
> > > kern_vdata is of sizeof(...) bytes?
> > >
> > > > +        * prepare_struct_ops() will populate the "data" into
> > > > +        * "kern_vdata".
> > > > +        */
> > > > +       void *kern_vdata;
> > > > +       __u32 type_id;
> > > > +       __u32 kern_vtype_id;
> > > > +       __u32 kern_vtype_size;
> > > > +       int fd;
> > > > +       bool unreg;
> > >
> > > This unreg flag (and default behavior to not unregister) is bothering
> > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> > > E.g., if no one pins that map - then struct_ops should be unregistered
> > > on map destruction. If application wants to keep BPF programs
> > > attached, it should make sure to pin map, before userspace part exits?
> > > Is this problematic in any way?
> > I don't think it should in the struct_ops case.  I think of the
> > struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
> > in this case) and this map-progs stay (or keep attaching) until it is
> > detached.  Like other attached bpf_prog keeps running without
> > caring if the bpf_prog is pinned or not.
> 
> I'll let someone else comment on how this behaves for cgroup, xdp,
> etc,
> but for tracing, for example, we have FD-based BPF links, which
> will detach program automatically when FD is closed. I think the idea
> is to extend this to other types of BPF programs as well, so there is
> no risk of leaving some stray BPF program running after unintended
Like xdp_prog, struct_ops does not have a separate fd-based link.
Such a link can be created for struct_ops, xdp_prog and others later.
I don't see a conflict here.

> crash of userspace program. When application explicitly needs BPF
> program to outlive its userspace control app, then this can be
> achieved by pinning map/program in BPFFS.
If the concern is about not leaving struct_ops behind,
let's assume there is no "detach" and unregistration happens only when
the very last userspace handle (FD/pin) of the map goes away.
What would then be an easy way to remove bpf_cubic from the system:

[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
    net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
    net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
    net.ipv4.tcp_congestion_control = bpf_cubic
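
(For contrast, the explicit unregister path in this series is a single
map-delete, as described in patch 6 where the key is always 0.  A
sketch, assuming map_fd is obtained from a pinned file or via
bpf_map_get_fd_by_id():

	__u32 key = 0;

	/* unregisters the struct_ops, i.e. BPF_MAP_DELETE with key 0 */
	bpf_map_delete_elem(map_fd, &key);

so removing bpf_cubic does not depend on tracking down whichever
process originally created the map.)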

> 
> >
> > About the "bool unreg;", the default can be changed to true if
> > it makes more sense.
> >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18 17:33         ` Martin Lau
@ 2019-12-18 18:14           ` Andrii Nakryiko
  2019-12-18 20:19             ` Martin Lau
  2019-12-19  8:53             ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-18 18:14 UTC (permalink / raw)
  To: Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
>
> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> > >
> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
> > > > >
> > > > > The only sec_name convention is SEC("struct_ops") to identify the
> > > > > struct ops implemented in BPF, e.g.
> > > > > SEC("struct_ops")
> > > > > struct tcp_congestion_ops dctcp = {
> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> > > > >         /* ... some more func prts ... */
> > > > >         .name           = "bpf_dctcp",
> > > > > };
> > > > >
> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > > > > elf section and find out what is the btf-type the "struct_ops" is
> > > > > implementing.  Note that the btf-type here is referring to
> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > > > > where are the bpf progs that the func ptrs are referring to.
> > > > >
> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > > > > With the kernel's btf-type, it can then set the prog->type,
> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > > > > the prog's properties do not rely on its section name.
> > > > >
> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > > > > process is as simple as: member-name match + btf-kind match + size match.
> > > > > If these matching conditions fail, libbpf will reject.
> > > > > The current targeting support is "struct tcp_congestion_ops" which
> > > > > most of its members are function pointers.
> > > > > The member ordering of the bpf_prog's btf-type can be different from
> > > > > the btf_vmlinux's btf-type.
> > > > >
> > > > > Once the prog's properties are all set,
> > > > > the libbpf will proceed to load all the progs.
> > > > >
> > > > > After that, register_struct_ops() will create a map, finalize the
> > > > > map-value by populating it with the prog-fd, and then register this
> > > > > "struct_ops" to the kernel by updating the map-value to the map.
> > > > >
> > > > > By default, libbpf does not unregister the struct_ops from the kernel
> > > > > during bpf_object__close().  It can be changed by setting the new
> > > > > "unreg_st_ops" in bpf_object_open_opts.
> > > > >
> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > ---
> > > >
> > > > This looks pretty good to me. The big two things is exposing structops
> > > > as real struct bpf_map, so that users can interact with it using
> > > > libbpf APIs, as well as splitting struct_ops map creation and
> > > > registration. bpf_object__load() should only make sure all maps are
> > > > created, progs are loaded/verified, but none of BPF program can yet be
> > > > called. Then attach is the phase where registration happens.
> > > Thanks for the review.
> > >
> > > [ ... ]
> > >
> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
> > > > >  {
> > > > >         return (__u64) (unsigned long) ptr;
> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
> > > > >         bool reused;
> > > > >  };
> > > > >
> > > > > +struct bpf_struct_ops {
> > > > > +       const char *var_name;
> > > > > +       const char *tname;
> > > > > +       const struct btf_type *type;
> > > > > +       struct bpf_program **progs;
> > > > > +       __u32 *kern_func_off;
> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > > > +       void *data;
> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > > >
> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > > > it btf_ops_ or btf_structops_?
> > > Is it a concern on name collision?
> > >
> > > The prefix pick is to use a more representative name.
> > > struct_ops use many bpf pieces and btf is one of them.
> > > Very soon, all new codes will depend on BTF and btf_ prefix
> > > could become generic also.
> > >
> > > Unlike tracepoint, there is no non-btf version of struct_ops.
> >
> > Not so much name collision, as being able to immediately recognize
> > that it's used to provide type information for struct_ops. Think about
> > some automated tooling parsing vmlinux BTF and trying to create some
> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
> > unique prefix that identifies what kind of type-providing struct it is
> > is very useful to do generic tool like that. While __bpf_ isn't
> > specifying in any ways that it's for struct_ops.
> >
> > >
> > > >
> > > >
> > > > > +        * format.
> > > > > +        * struct __bpf_tcp_congestion_ops {
> > > > > +        *      [... some other kernel fields ...]
> > > > > +        *      struct tcp_congestion_ops data;
> > > > > +        * }
> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> > > >
> > > > Comment isn't very clear.. do you mean that data pointed to by
> > > > kern_vdata is of sizeof(...) bytes?
> > > >
> > > > > +        * prepare_struct_ops() will populate the "data" into
> > > > > +        * "kern_vdata".
> > > > > +        */
> > > > > +       void *kern_vdata;
> > > > > +       __u32 type_id;
> > > > > +       __u32 kern_vtype_id;
> > > > > +       __u32 kern_vtype_size;
> > > > > +       int fd;
> > > > > +       bool unreg;
> > > >
> > > > This unreg flag (and default behavior to not unregister) is bothering
> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> > > > E.g., if no one pins that map - then struct_ops should be unregistered
> > > > on map destruction. If application wants to keep BPF programs
> > > > attached, it should make sure to pin map, before userspace part exits?
> > > > Is this problematic in any way?
> > > I don't think it should in the struct_ops case.  I think of the
> > > struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
> > > in this case) and this map-progs stay (or keep attaching) until it is
> > > detached.  Like other attached bpf_prog keeps running without
> > > caring if the bpf_prog is pinned or not.
> >
> > I'll let someone else comment on how this behaves for cgroup, xdp,
> > etc,
> > but for tracing, for example, we have FD-based BPF links, which
> > will detach program automatically when FD is closed. I think the idea
> > is to extend this to other types of BPF programs as well, so there is
> > no risk of leaving some stray BPF program running after unintended
> Like xdp_prog, struct_ops does not have another fd-based-link.
> This link can be created for struct_ops, xdp_prog and others later.
> I don't see a conflict here.

My point was that the default behavior should be conservative: free up
resources automatically on process exit, unless specifically pinned by
the user.
But this discussion made me realize that we are missing one thing from
the general bpf_link framework. See below.

>
> > crash of userspace program. When application explicitly needs BPF
> > program to outlive its userspace control app, then this can be
> > achieved by pinning map/program in BPFFS.
> If the concern is about not leaving struct_ops behind,
> lets assume there is no "detach" and only depends on the very
> last userspace's handles (FD/pinned) of a map goes away,
> what may be an easy way to remove bpf_cubic from the system:

Yeah, I think this "last map FD close frees up resources/detaches" is
a good behavior.

Where we do have a problem is with bpf_link__destroy() unconditionally
also detaching whatever was attached (tracepoint, kprobe, or whatever
was done to create the bpf_link in the first place). Now,
bpf_link__destroy() has to be called by the user (or skeleton) to at
least free up the malloc()'ed structs. But it appears that it is not
always desirable that the underlying BPF program gets detached upon
bpf_link destruction. I think this will be the case for xdp and others
as well.

I think the good and generic way to go about this is to have a general
concept of destroying the link without detaching BPF programs.
E.g., what if we had a new API call `void bpf_link__unlink()`, which
would mark that link as not requiring detaching of the underlying BPF
program? When bpf_link__destroy() is called later, it would just free
the resources allocated to maintain the bpf_link itself, but wouldn't
detach any BPF programs/resources.

With this, the user will have to explicitly specify that they don't
want to detach even when the skeleton/link is destroyed. If we get
consensus on this, I can add support for it to all the existing
bpf_links and you can build on that?
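
A rough sketch of what I have in mind (the libbpf-internal layout here
is illustrative only, nothing is implemented yet):

#include <stdbool.h>
#include <stdlib.h>

struct bpf_link {
	int (*detach)(struct bpf_link *link);
	bool disconnected;		/* set by bpf_link__unlink() */
};

/* Keep whatever the link attached (kprobe, tracepoint, ...) in place
 * even after the link object itself is freed.
 */
void bpf_link__unlink(struct bpf_link *link)
{
	link->disconnected = true;
}

int bpf_link__destroy(struct bpf_link *link)
{
	int err = 0;

	if (!link->disconnected && link->detach)
		err = link->detach(link);
	free(link);
	return err;
}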

>
> [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
>     net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
>     net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
>     net.ipv4.tcp_congestion_control = bpf_cubic
>
> >
> > >
> > > About the "bool unreg;", the default can be changed to true if
> > > it makes more sense.
> > >

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18 18:14           ` Andrii Nakryiko
@ 2019-12-18 20:19             ` Martin Lau
  2019-12-19  8:53             ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 51+ messages in thread
From: Martin Lau @ 2019-12-18 20:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

On Wed, Dec 18, 2019 at 10:14:04AM -0800, Andrii Nakryiko wrote:
[ ... ]
> 
> Where we do have problem is with bpf_link__destroy() unconditionally
> also detaching whatever was attached (tracepoint, kprobe, or whatever
> was done to create bpf_link in the first place). Now,
> bpf_link__destroy() has to be called by user (or skeleton) to at least
> free up malloc()'ed structs. But it appears that it's not always
> desirable that upon bpf_link destruction underlying BPF program gets
> detached. I think this will be the case for xdp and others as well.
> 
> I think the good and generic way to go about this is to have this as a
> general concept of destroying the link without detaching BPF programs.
> E.g., what if we have new API call `void bpf_link__unlink()`, which
> will mark that link as not requiring to detach underlying BPF program.
> When bpf_link__destroy() is called later, it will just free resources
> allocated to maintain bpf_link itself, but won't detach any BPF
> programs/resources.
> 
> With this, user will have to explicitly specify that he doesn't want
> to detach even when skeleton/link is destroyed. If we get consensus on
> this, I can add support for this to all the existing bpf_links and you
> can build on that?
Keeping the current struct_ops unreg mechanism (i.e.
bpf_struct_ops__unregister(), to be renamed) and
having a way to opt-out sounds good to me.  Thanks.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-18 18:14           ` Andrii Nakryiko
  2019-12-18 20:19             ` Martin Lau
@ 2019-12-19  8:53             ` Toke Høiland-Jørgensen
  2019-12-19 20:49               ` Andrii Nakryiko
  1 sibling, 1 reply; 51+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-12-19  8:53 UTC (permalink / raw)
  To: Andrii Nakryiko, Martin Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, Networking

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
>>
>> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
>> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
>> > >
>> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
>> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> > > > >
>> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
>> > > > >
>> > > > > The only sec_name convention is SEC("struct_ops") to identify the
>> > > > > struct ops implemented in BPF, e.g.
>> > > > > SEC("struct_ops")
>> > > > > struct tcp_congestion_ops dctcp = {
>> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
>> > > > >         /* ... some more func prts ... */
>> > > > >         .name           = "bpf_dctcp",
>> > > > > };
>> > > > >
>> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
>> > > > > elf section and find out what is the btf-type the "struct_ops" is
>> > > > > implementing.  Note that the btf-type here is referring to
>> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
>> > > > > where are the bpf progs that the func ptrs are referring to.
>> > > > >
>> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
>> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
>> > > > > With the kernel's btf-type, it can then set the prog->type,
>> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
>> > > > > the prog's properties do not rely on its section name.
>> > > > >
>> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
>> > > > > process is as simple as: member-name match + btf-kind match + size match.
>> > > > > If these matching conditions fail, libbpf will reject.
>> > > > > The current targeting support is "struct tcp_congestion_ops" which
>> > > > > most of its members are function pointers.
>> > > > > The member ordering of the bpf_prog's btf-type can be different from
>> > > > > the btf_vmlinux's btf-type.
>> > > > >
>> > > > > Once the prog's properties are all set,
>> > > > > the libbpf will proceed to load all the progs.
>> > > > >
>> > > > > After that, register_struct_ops() will create a map, finalize the
>> > > > > map-value by populating it with the prog-fd, and then register this
>> > > > > "struct_ops" to the kernel by updating the map-value to the map.
>> > > > >
>> > > > > By default, libbpf does not unregister the struct_ops from the kernel
>> > > > > during bpf_object__close().  It can be changed by setting the new
>> > > > > "unreg_st_ops" in bpf_object_open_opts.
>> > > > >
>> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> > > > > ---
>> > > >
>> > > > This looks pretty good to me. The big two things is exposing structops
>> > > > as real struct bpf_map, so that users can interact with it using
>> > > > libbpf APIs, as well as splitting struct_ops map creation and
>> > > > registration. bpf_object__load() should only make sure all maps are
>> > > > created, progs are loaded/verified, but none of BPF program can yet be
>> > > > called. Then attach is the phase where registration happens.
>> > > Thanks for the review.
>> > >
>> > > [ ... ]
>> > >
>> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
>> > > > >  {
>> > > > >         return (__u64) (unsigned long) ptr;
>> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
>> > > > >         bool reused;
>> > > > >  };
>> > > > >
>> > > > > +struct bpf_struct_ops {
>> > > > > +       const char *var_name;
>> > > > > +       const char *tname;
>> > > > > +       const struct btf_type *type;
>> > > > > +       struct bpf_program **progs;
>> > > > > +       __u32 *kern_func_off;
>> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
>> > > > > +       void *data;
>> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
>> > > >
>> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
>> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
>> > > > it btf_ops_ or btf_structops_?
>> > > Is it a concern on name collision?
>> > >
>> > > The prefix pick is to use a more representative name.
>> > > struct_ops use many bpf pieces and btf is one of them.
>> > > Very soon, all new codes will depend on BTF and btf_ prefix
>> > > could become generic also.
>> > >
>> > > Unlike tracepoint, there is no non-btf version of struct_ops.
>> >
>> > Not so much name collision, as being able to immediately recognize
>> > that it's used to provide type information for struct_ops. Think about
>> > some automated tooling parsing vmlinux BTF and trying to create some
>> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
>> > unique prefix that identifies what kind of type-providing struct it is
>> > is very useful to do generic tool like that. While __bpf_ isn't
>> > specifying in any ways that it's for struct_ops.
>> >
>> > >
>> > > >
>> > > >
>> > > > > +        * format.
>> > > > > +        * struct __bpf_tcp_congestion_ops {
>> > > > > +        *      [... some other kernel fields ...]
>> > > > > +        *      struct tcp_congestion_ops data;
>> > > > > +        * }
>> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
>> > > >
>> > > > Comment isn't very clear.. do you mean that data pointed to by
>> > > > kern_vdata is of sizeof(...) bytes?
>> > > >
>> > > > > +        * prepare_struct_ops() will populate the "data" into
>> > > > > +        * "kern_vdata".
>> > > > > +        */
>> > > > > +       void *kern_vdata;
>> > > > > +       __u32 type_id;
>> > > > > +       __u32 kern_vtype_id;
>> > > > > +       __u32 kern_vtype_size;
>> > > > > +       int fd;
>> > > > > +       bool unreg;
>> > > >
>> > > > This unreg flag (and default behavior to not unregister) is bothering
>> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
>> > > > E.g., if no one pins that map - then struct_ops should be unregistered
>> > > > on map destruction. If application wants to keep BPF programs
>> > > > attached, it should make sure to pin map, before userspace part exits?
>> > > > Is this problematic in any way?
>> > > I don't think it should in the struct_ops case.  I think of the
>> > > struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
>> > > in this case) and this map-progs stay (or keep attaching) until it is
>> > > detached.  Like other attached bpf_prog keeps running without
>> > > caring if the bpf_prog is pinned or not.
>> >
>> > I'll let someone else comment on how this behaves for cgroup, xdp,
>> > etc,
>> > but for tracing, for example, we have FD-based BPF links, which
>> > will detach program automatically when FD is closed. I think the idea
>> > is to extend this to other types of BPF programs as well, so there is
>> > no risk of leaving some stray BPF program running after unintended
>> Like xdp_prog, struct_ops does not have another fd-based-link.
>> This link can be created for struct_ops, xdp_prog and others later.
>> I don't see a conflict here.
>
> My point was that default behavior should be conservative: free up
> resources automatically on process exit, unless specifically pinned by
> user.
> But this discussion made me realize that we miss one thing from
> general bpf_link framework. See below.
>
>>
>> > crash of userspace program. When application explicitly needs BPF
>> > program to outlive its userspace control app, then this can be
>> > achieved by pinning map/program in BPFFS.
>> If the concern is about not leaving struct_ops behind,
>> lets assume there is no "detach" and only depends on the very
>> last userspace's handles (FD/pinned) of a map goes away,
>> what may be an easy way to remove bpf_cubic from the system:
>
> Yeah, I think this "last map FD close frees up resources/detaches" is
> a good behavior.
>
> Where we do have problem is with bpf_link__destroy() unconditionally
> also detaching whatever was attached (tracepoint, kprobe, or whatever
> was done to create bpf_link in the first place). Now,
> bpf_link__destroy() has to be called by user (or skeleton) to at least
> free up malloc()'ed structs. But it appears that it's not always
> desirable that upon bpf_link destruction underlying BPF program gets
> detached. I think this will be the case for xdp and others as well.

For XDP the model has thus far been "once attached, the program stays
until explicitly detached". Changing that would certainly be surprising,
so I agree that splitting the API is best (not that I'm sure how many
XDP programs will end up using that API, but that's a different
concern)...

-Toke


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-19  8:53             ` Toke Høiland-Jørgensen
@ 2019-12-19 20:49               ` Andrii Nakryiko
  2019-12-20 10:16                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-19 20:49 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, Networking

On Thu, Dec 19, 2019 at 12:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
> >>
> >> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> >> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> >> > >
> >> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> >> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> > > > >
> >> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
> >> > > > >
> >> > > > > The only sec_name convention is SEC("struct_ops") to identify the
> >> > > > > struct ops implemented in BPF, e.g.
> >> > > > > SEC("struct_ops")
> >> > > > > struct tcp_congestion_ops dctcp = {
> >> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> >> > > > >         /* ... some more func prts ... */
> >> > > > >         .name           = "bpf_dctcp",
> >> > > > > };
> >> > > > >
> >> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> >> > > > > elf section and find out what is the btf-type the "struct_ops" is
> >> > > > > implementing.  Note that the btf-type here is referring to
> >> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> >> > > > > where are the bpf progs that the func ptrs are referring to.
> >> > > > >
> >> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> >> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> >> > > > > With the kernel's btf-type, it can then set the prog->type,
> >> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> >> > > > > the prog's properties do not rely on its section name.
> >> > > > >
> >> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> >> > > > > process is as simple as: member-name match + btf-kind match + size match.
> >> > > > > If these matching conditions fail, libbpf will reject.
> >> > > > > The current targeting support is "struct tcp_congestion_ops" which
> >> > > > > most of its members are function pointers.
> >> > > > > The member ordering of the bpf_prog's btf-type can be different from
> >> > > > > the btf_vmlinux's btf-type.
> >> > > > >
> >> > > > > Once the prog's properties are all set,
> >> > > > > the libbpf will proceed to load all the progs.
> >> > > > >
> >> > > > > After that, register_struct_ops() will create a map, finalize the
> >> > > > > map-value by populating it with the prog-fd, and then register this
> >> > > > > "struct_ops" to the kernel by updating the map-value to the map.
> >> > > > >
> >> > > > > By default, libbpf does not unregister the struct_ops from the kernel
> >> > > > > during bpf_object__close().  It can be changed by setting the new
> >> > > > > "unreg_st_ops" in bpf_object_open_opts.
> >> > > > >
> >> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> >> > > > > ---
> >> > > >
> >> > > > This looks pretty good to me. The big two things is exposing structops
> >> > > > as real struct bpf_map, so that users can interact with it using
> >> > > > libbpf APIs, as well as splitting struct_ops map creation and
> >> > > > registration. bpf_object__load() should only make sure all maps are
> >> > > > created, progs are loaded/verified, but none of BPF program can yet be
> >> > > > called. Then attach is the phase where registration happens.
> >> > > Thanks for the review.
> >> > >
> >> > > [ ... ]
> >> > >
> >> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
> >> > > > >  {
> >> > > > >         return (__u64) (unsigned long) ptr;
> >> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
> >> > > > >         bool reused;
> >> > > > >  };
> >> > > > >
> >> > > > > +struct bpf_struct_ops {
> >> > > > > +       const char *var_name;
> >> > > > > +       const char *tname;
> >> > > > > +       const struct btf_type *type;
> >> > > > > +       struct bpf_program **progs;
> >> > > > > +       __u32 *kern_func_off;
> >> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> >> > > > > +       void *data;
> >> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> >> > > >
> >> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> >> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> >> > > > it btf_ops_ or btf_structops_?
> >> > > Is it a concern on name collision?
> >> > >
> >> > > The prefix pick is to use a more representative name.
> >> > > struct_ops use many bpf pieces and btf is one of them.
> >> > > Very soon, all new codes will depend on BTF and btf_ prefix
> >> > > could become generic also.
> >> > >
> >> > > Unlike tracepoint, there is no non-btf version of struct_ops.
> >> >
> >> > Not so much name collision, as being able to immediately recognize
> >> > that it's used to provide type information for struct_ops. Think about
> >> > some automated tooling parsing vmlinux BTF and trying to create some
> >> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
> >> > unique prefix that identifies what kind of type-providing struct it is
> >> > is very useful to do generic tool like that. While __bpf_ isn't
> >> > specifying in any ways that it's for struct_ops.
> >> >
> >> > >
> >> > > >
> >> > > >
> >> > > > > +        * format.
> >> > > > > +        * struct __bpf_tcp_congestion_ops {
> >> > > > > +        *      [... some other kernel fields ...]
> >> > > > > +        *      struct tcp_congestion_ops data;
> >> > > > > +        * }
> >> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> >> > > >
> >> > > > Comment isn't very clear.. do you mean that data pointed to by
> >> > > > kern_vdata is of sizeof(...) bytes?
> >> > > >
> >> > > > > +        * prepare_struct_ops() will populate the "data" into
> >> > > > > +        * "kern_vdata".
> >> > > > > +        */
> >> > > > > +       void *kern_vdata;
> >> > > > > +       __u32 type_id;
> >> > > > > +       __u32 kern_vtype_id;
> >> > > > > +       __u32 kern_vtype_size;
> >> > > > > +       int fd;
> >> > > > > +       bool unreg;
> >> > > >
> >> > > > This unreg flag (and default behavior to not unregister) is bothering
> >> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> >> > > > E.g., if no one pins that map - then struct_ops should be unregistered
> >> > > > on map destruction. If application wants to keep BPF programs
> >> > > > attached, it should make sure to pin map, before userspace part exits?
> >> > > > Is this problematic in any way?
> >> > > I don't think it should in the struct_ops case.  I think of the
> >> > > struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
> >> > > in this case) and this map-progs stay (or keep attaching) until it is
> >> > > detached.  Like other attached bpf_prog keeps running without
> >> > > caring if the bpf_prog is pinned or not.
> >> >
> >> > I'll let someone else comment on how this behaves for cgroup, xdp,
> >> > etc,
> >> > but for tracing, for example, we have FD-based BPF links, which
> >> > will detach program automatically when FD is closed. I think the idea
> >> > is to extend this to other types of BPF programs as well, so there is
> >> > no risk of leaving some stray BPF program running after unintended
> >> Like xdp_prog, struct_ops does not have another fd-based-link.
> >> This link can be created for struct_ops, xdp_prog and others later.
> >> I don't see a conflict here.
> >
> > My point was that default behavior should be conservative: free up
> > resources automatically on process exit, unless specifically pinned by
> > user.
> > But this discussion made me realize that we miss one thing from
> > general bpf_link framework. See below.
> >
> >>
> >> > crash of userspace program. When application explicitly needs BPF
> >> > program to outlive its userspace control app, then this can be
> >> > achieved by pinning map/program in BPFFS.
> >> If the concern is about not leaving struct_ops behind,
> >> lets assume there is no "detach" and only depends on the very
> >> last userspace's handles (FD/pinned) of a map goes away,
> >> what may be an easy way to remove bpf_cubic from the system:
> >
> > Yeah, I think this "last map FD close frees up resources/detaches" is
> > a good behavior.
> >
> > Where we do have problem is with bpf_link__destroy() unconditionally
> > also detaching whatever was attached (tracepoint, kprobe, or whatever
> > was done to create bpf_link in the first place). Now,
> > bpf_link__destroy() has to be called by user (or skeleton) to at least
> > free up malloc()'ed structs. But it appears that it's not always
> > desirable that upon bpf_link destruction underlying BPF program gets
> > detached. I think this will be the case for xdp and others as well.
>
> For XDP the model has thus far been "once attached, the program stays
> until explicitly detached". Changing that would certainly be surprising,
> so I agree that splitting the API is best (not that I'm sure how many
> XDP programs will end up using that API, but that's a different
> concern)...

This would be a new FD-based API for XDP; I don't think we can change
the existing API. But I think the default behavior should still be to
auto-detach, unless explicitly "pinned" in whatever way. That would
prevent surprising "leakage" of BPF programs for unsuspecting users.

>
> -Toke
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 06/13] bpf: Introduce  BPF_MAP_TYPE_STRUCT_OPS
  2019-12-17  7:48   ` [Potential Spoof] " Yonghong Song
@ 2019-12-20  7:22     ` Martin Lau
  2019-12-20 16:52       ` Martin Lau
  0 siblings, 1 reply; 51+ messages in thread
From: Martin Lau @ 2019-12-20  7:22 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Mon, Dec 16, 2019 at 11:48:18PM -0800, Yonghong Song wrote:
> 
> 
> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > is a kernel struct with its func ptr implemented in bpf prog.
> > This new map is the interface to register/unregister/introspect
> > a bpf implemented kernel struct.
> > 
> > The kernel struct is actually embedded inside another new struct
> > (or called the "value" struct in the code).  For example,
> > "struct tcp_congestion_ops" is embbeded in:
> > struct __bpf_tcp_congestion_ops {
> > 	refcount_t refcnt;
> > 	enum bpf_struct_ops_state state;
> > 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > }
> > The map value is "struct __bpf_tcp_congestion_ops".  The "bpftool map dump"
> > will then be able to show the state ("inuse"/"tobefree") and the number of
> > subsystem's refcnt (e.g. number of tcp_sock in the tcp_congestion_ops case).
> > This "value" struct is created automatically by a macro.  Having a separate
> > "value" struct will also make extending "struct __bpf_XYZ" easier (e.g. adding
> > "void (*init)(void)" to "struct __bpf_XYZ" to do some initialization
> > works before registering the struct_ops to the kernel subsystem).
> > The libbpf will take care of finding and populating the "struct __bpf_XYZ"
> > from "struct XYZ".
> > 
> > Register a struct_ops to a kernel subsystem:
> > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> >     set to the btf id "struct __bpf_tcp_congestion_ops" of the running
> >     kernel.
> >     Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id
> >     is added such that attr->btf_fd can still be used as the "user" btf
> >     which could store other useful sysadmin/debug info that may be
> >     introduced in the furture,
> >     e.g. creation-date/compiler-details/map-creator...etc.
> > 3. Create a "struct __bpf_tcp_congestion_ops" object as described in
> >     the running kernel btf.  Populate the value of this object.
> >     The function ptr should be populated with the prog fds.
> > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> >     the map value.  The key is always "0".
> 
> This is really a special one element map. The key "0" should work.
> Not sure whether we should generalize this and maps for global variables
> to a kind of key-less map. Just some thought.
key-less.  I think it mostly means that no key is passed, or NULL is
passed as the key.  Not sure it is worth the uapi and userspace
disruption, e.g. think about "bpftool map dump".
I did try adding a new bpf cmd to do register/unregister which does
not need the key.  I stopped in the middle because it is not worth it
once the lookup side is also considered.

Also, like the global-value map, attr->btf_key_type_id is 0, which is
the "void" btf-type, and I think that is as good a way as any of
saying it is keyless.  bpftool is already prepared for this keyless
signal.  The difference between passing 0 or NULL to represent a
"void" key is also arguably small.
In the struct_ops case, only btf_vmlinux_value_type_id is added;
nothing is added for the key.
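
To illustrate (a sketch; kern_vtype_id/kern_vtype_size stand in for the
id/size resolved from the kernel BTF, and the exact sizes are
illustrative):

	union bpf_attr attr = {};

	attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
	attr.key_size = sizeof(__u32);		/* key is always 0 */
	attr.value_size = kern_vtype_size;	/* struct __bpf_tcp_congestion_ops */
	attr.max_entries = 1;
	/* only the value type comes from the kernel BTF ... */
	attr.btf_vmlinux_value_type_id = kern_vtype_id;
	/* ... while btf_key_type_id stays 0, i.e. a "void" key */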

> 
> > 
> > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > (e.g. ensure all mandatory func ptrs are implemented).
> > If everything looks good, it will register this kernel struct
> > to the kernel subsystem.  The map will not allow further update
> > from this point.
> > 
> > Unregister a struct_ops from the kernel subsystem:
> > BPF_MAP_DELETE with key "0".
> > 
> > Introspect a struct_ops:
> > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > have the prog _id_ populated as the func ptr.
> > 
> > The map value state (enum bpf_struct_ops_state) will transit from:
> > INIT (map created) =>
> > INUSE (map updated, i.e. reg) =>
> > TOBEFREE (map value deleted, i.e. unreg)
> > 
> > Note that the above state is not exposed to the uapi/bpf.h.
> > It will be obtained from the btf of the running kernel.
> 
> It is not really from btf, right? It is from kernel internal
> data structure which will be copied to user space.
> 
> Since such information is available to bpftool dump and is common
> to all st_ops maps. I am wondering whether we should expose
> this through uapi.
The data is from the kernel map's value.

The enum's type and the names of its values, i.e. "INIT", "INUSE",
and "TOBEFREE", come from the kernel BTF.  They do not need to be
exposed in uapi; the kernel BTF is the way to get them.
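
i.e. tooling only needs something of this shape from the kernel side
(a sketch; the enumerator names follow the state transitions described
in the commit message):

	enum bpf_struct_ops_state {
		BPF_STRUCT_OPS_STATE_INIT,
		BPF_STRUCT_OPS_STATE_INUSE,
		BPF_STRUCT_OPS_STATE_TOBEFREE,
	};

bpftool can resolve both the numeric value and its name purely from
btf_vmlinux, so nothing has to be duplicated in uapi/bpf.h.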

[ ... ]

> > +/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
> > + * exposed to the userspace and its btf-type-id is stored
> > + * at the map->btf_vmlinux_value_type_id.
> > + *
> > + * The *_name##_dummy is to ensure the BTF type is emitted.
> > + */
> > +
> >   #define BPF_STRUCT_OPS_TYPE(_name)				\
> > -extern struct bpf_struct_ops bpf_##_name;
> > +extern struct bpf_struct_ops bpf_##_name;			\
> > +								\
> > +static struct __bpf_##_name {					\
> > +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> > +	struct _name data ____cacheline_aligned_in_smp;		\
> > +} *_name##_dummy;
> 
> There are other ways to retain types in debug info without
> creating new variables. For example, you can use it in a cast
> like
>      (void *)(struct __bpf_##_name *)v
hmm... What is v?

> Not sure whether we could easily find a place for such casting or not.
> 
> >   #include "bpf_struct_ops_types.h"
> >   #undef BPF_STRUCT_OPS_TYPE
> >   
> > @@ -37,19 +97,46 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
> >   const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> >   };
> >   
> > +static const struct btf_type *module_type;
> > +
> >   void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> >   {
> > +	char value_name[128] = VALUE_PREFIX;
> > +	s32 type_id, value_id, module_id;
> >   	const struct btf_member *member;
> >   	struct bpf_struct_ops *st_ops;
> >   	struct bpf_verifier_log log = {};
> >   	const struct btf_type *t;
> >   	const char *mname;
> > -	s32 type_id;
> >   	u32 i, j;
> >   
> > +	/* Avoid unused var compiler warning */
> > +#define BPF_STRUCT_OPS_TYPE(_name) (void)(_name##_dummy);
> > +#include "bpf_struct_ops_types.h"
> > +#undef BPF_STRUCT_OPS_TYPE
> > +
> > +	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> > +					  BTF_KIND_STRUCT);
> > +	if (module_id < 0) {
> > +		pr_warn("Cannot find struct module in btf_vmlinux\n");
> > +		return;
> > +	}
> > +	module_type = btf_type_by_id(_btf_vmlinux, module_id);
> > +
> >   	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> >   		st_ops = bpf_struct_ops[i];
> >   
> > +		value_name[VALUE_PREFIX_LEN] = '\0';
> > +		strncat(value_name + VALUE_PREFIX_LEN, st_ops->name,
> > +			sizeof(value_name) - VALUE_PREFIX_LEN - 1);
> 
> Do we have restrictions on the length of st_ops->name?
> We probably do not want truncation, right?
It is unlikely that the following btf_find_by_name_kind() would
succeed on a truncated name.

I will add a check here to ensure there is no truncation.

> 
> > +		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> > +						 BTF_KIND_STRUCT);
> > +		if (value_id < 0) {
> > +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> > +				value_name);
> > +			continue;
> > +		}
> > +
> >   		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
> >   						BTF_KIND_STRUCT);
> >   		if (type_id < 0) {

[ ... ]

> > +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> > +					  void *value, u64 flags)
> > +{
> > +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> > +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> > +	struct bpf_struct_ops_value *uvalue, *kvalue;
> > +	const struct btf_member *member;
> > +	const struct btf_type *t = st_ops->type;
> > +	void *udata, *kdata;
> > +	int prog_fd, err = 0;
> > +	void *image;
> > +	u32 i;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (*(u32 *)key != 0)
> > +		return -E2BIG;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)value;
> > +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> > +		return -EINVAL;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> > +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> > +
> > +	spin_lock(&st_map->lock);
> > +
> > +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> > +		err = -EBUSY;
> > +		goto unlock;
> > +	}
> > +
> > +	memcpy(uvalue, value, map->value_size);
> > +
> > +	udata = &uvalue->data;
> > +	kdata = &kvalue->data;
> > +	image = st_map->image;
> > +
> > +	for_each_member(i, t, member) {
> > +		const struct btf_type *mtype, *ptype;
> > +		struct bpf_prog *prog;
> > +		u32 moff;
> > +
> > +		moff = btf_member_bit_offset(t, member) / 8;
> > +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> > +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> > +		if (ptype == module_type) {
> > +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> > +			continue;
> > +		}
> > +
> > +		err = st_ops->init_member(t, member, kdata, udata);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		/* The ->init_member() has handled this member */
> > +		if (err > 0)
> > +			continue;
> > +
> > +		/* If st_ops->init_member does not handle it,
> > +		 * we will only handle func ptrs and zero-ed members
> > +		 * here.  Reject everything else.
> > +		 */
> > +
> > +		/* All non func ptr member must be 0 */
> > +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > +					       NULL)) {
> > +			u32 msize;
> > +
> > +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> > +						 &msize, NULL, NULL);
> > +			if (IS_ERR(mtype)) {
> > +				err = PTR_ERR(mtype);
> > +				goto reset_unlock;
> > +			}
> > +
> > +			if (memchr_inv(udata + moff, 0, msize)) {
> > +				err = -EINVAL;
> > +				goto reset_unlock;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> > +		/* Similar check as the attr->attach_prog_fd */
> > +		if (!prog_fd)
> > +			continue;
> > +
> > +		prog = bpf_prog_get(prog_fd);
> > +		if (IS_ERR(prog)) {
> > +			err = PTR_ERR(prog);
> > +			goto reset_unlock;
> > +		}
> > +		st_map->progs[i] = prog;
> > +
> > +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> > +		    prog->aux->attach_btf_id != st_ops->type_id ||
> > +		    prog->expected_attach_type != i) {
> > +			err = -EINVAL;
> > +			goto reset_unlock;
> > +		}
> > +
> > +		err = arch_prepare_bpf_trampoline(image,
> > +						  &st_ops->func_models[i], 0,
> > +						  &prog, 1, NULL, 0, NULL);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		*(void **)(kdata + moff) = image;
> > +		image += err;
> 
> Do we still want to check whether image out of page boundary or not?
It should never happen, and it would also be too late to check here.

BPF_STRUCT_OPS_MAX_NR_MEMBERS (which is 64) is picked based on each
trampoline taking less than 64 bytes.
Thus, PAGE_SIZE / 64 bytes => 64 members.

I can add a BUILD_BUG_ON() to ensure that a future change to
BPF_STRUCT_OPS_MAX_NR_MEMBERS won't violate this.
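
Something like this (a sketch; the 64-byte-per-trampoline bound is the
assumption stated above):

	/* All trampolines of one struct_ops share a single page, and
	 * each member's trampoline is assumed to take < 64 bytes.
	 */
	BUILD_BUG_ON(BPF_STRUCT_OPS_MAX_NR_MEMBERS * 64 > PAGE_SIZE);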

> 
> > +
> > +		/* put prog_id to udata */
> > +		*(unsigned long *)(udata + moff) = prog->aux->id;
> > +	}
> > +
> > +	refcount_set(&kvalue->refcnt, 1);
> > +	bpf_map_inc(map);
> > +
> > +	err = st_ops->reg(kdata);
> > +	if (!err) {
> > +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> > +		goto unlock;
> > +	}
> > +
> > +	/* Error during st_ops->reg() */
> > +	bpf_map_put(map);
> > +
> > +reset_unlock:
> > +	bpf_struct_ops_map_put_progs(st_map);
> > +	memset(uvalue, 0, map->value_size);
> > +	memset(kvalue, 0, map->value_size);
> > +
> > +unlock:
> > +	spin_unlock(&st_map->lock);
> > +	return err;
> > +}
> > +
> > +static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
> > +{
> > +	enum bpf_struct_ops_state prev_state;
> > +	struct bpf_struct_ops_map *st_map;
> > +
> > +	st_map = (struct bpf_struct_ops_map *)map;
> > +	prev_state = cmpxchg(&st_map->kvalue.state,
> > +			     BPF_STRUCT_OPS_STATE_INUSE,
> > +			     BPF_STRUCT_OPS_STATE_TOBEFREE);
> > +	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
> > +		st_map->st_ops->unreg(&st_map->kvalue.data);
> > +		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> > +			bpf_map_put(map);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> [...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-19 20:49               ` Andrii Nakryiko
@ 2019-12-20 10:16                 ` Toke Høiland-Jørgensen
  2019-12-20 17:34                   ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-12-20 10:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Martin Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, Networking

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Dec 19, 2019 at 12:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
>> >>
>> >> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
>> >> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
>> >> > >
>> >> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
>> >> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > > > >
>> >> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
>> >> > > > >
>> >> > > > > The only sec_name convention is SEC("struct_ops") to identify the
>> >> > > > > struct ops implemented in BPF, e.g.
>> >> > > > > SEC("struct_ops")
>> >> > > > > struct tcp_congestion_ops dctcp = {
>> >> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
>> >> > > > >         /* ... some more func prts ... */
>> >> > > > >         .name           = "bpf_dctcp",
>> >> > > > > };
>> >> > > > >
>> >> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
>> >> > > > > elf section and find out what is the btf-type the "struct_ops" is
>> >> > > > > implementing.  Note that the btf-type here is referring to
>> >> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
>> >> > > > > where are the bpf progs that the func ptrs are referring to.
>> >> > > > >
>> >> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
>> >> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
>> >> > > > > With the kernel's btf-type, it can then set the prog->type,
>> >> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
>> >> > > > > the prog's properties do not rely on its section name.
>> >> > > > >
>> >> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
>> >> > > > > process is as simple as: member-name match + btf-kind match + size match.
>> >> > > > > If these matching conditions fail, libbpf will reject.
>> >> > > > > The current targeting support is "struct tcp_congestion_ops" which
>> >> > > > > most of its members are function pointers.
>> >> > > > > The member ordering of the bpf_prog's btf-type can be different from
>> >> > > > > the btf_vmlinux's btf-type.
>> >> > > > >
>> >> > > > > Once the prog's properties are all set,
>> >> > > > > the libbpf will proceed to load all the progs.
>> >> > > > >
>> >> > > > > After that, register_struct_ops() will create a map, finalize the
>> >> > > > > map-value by populating it with the prog-fd, and then register this
>> >> > > > > "struct_ops" to the kernel by updating the map-value to the map.
>> >> > > > >
>> >> > > > > By default, libbpf does not unregister the struct_ops from the kernel
>> >> > > > > during bpf_object__close().  It can be changed by setting the new
>> >> > > > > "unreg_st_ops" in bpf_object_open_opts.
>> >> > > > >
>> >> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>> >> > > > > ---
>> >> > > >
>> >> > > > This looks pretty good to me. The big two things is exposing structops
>> >> > > > as real struct bpf_map, so that users can interact with it using
>> >> > > > libbpf APIs, as well as splitting struct_ops map creation and
>> >> > > > registration. bpf_object__load() should only make sure all maps are
>> >> > > > created, progs are loaded/verified, but none of BPF program can yet be
>> >> > > > called. Then attach is the phase where registration happens.
>> >> > > Thanks for the review.
>> >> > >
>> >> > > [ ... ]
>> >> > >
>> >> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
>> >> > > > >  {
>> >> > > > >         return (__u64) (unsigned long) ptr;
>> >> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
>> >> > > > >         bool reused;
>> >> > > > >  };
>> >> > > > >
>> >> > > > > +struct bpf_struct_ops {
>> >> > > > > +       const char *var_name;
>> >> > > > > +       const char *tname;
>> >> > > > > +       const struct btf_type *type;
>> >> > > > > +       struct bpf_program **progs;
>> >> > > > > +       __u32 *kern_func_off;
>> >> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
>> >> > > > > +       void *data;
>> >> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
>> >> > > >
>> >> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
>> >> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
>> >> > > > it btf_ops_ or btf_structops_?
>> >> > > Is it a concern on name collision?
>> >> > >
>> >> > > The prefix pick is to use a more representative name.
>> >> > > struct_ops use many bpf pieces and btf is one of them.
>> >> > > Very soon, all new codes will depend on BTF and btf_ prefix
>> >> > > could become generic also.
>> >> > >
>> >> > > Unlike tracepoint, there is no non-btf version of struct_ops.
>> >> >
>> >> > Not so much name collision, as being able to immediately recognize
>> >> > that it's used to provide type information for struct_ops. Think about
>> >> > some automated tooling parsing vmlinux BTF and trying to create some
>> >> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
>> >> > unique prefix that identifies what kind of type-providing struct it is
>> >> > is very useful to do generic tool like that. While __bpf_ isn't
>> >> > specifying in any ways that it's for struct_ops.
>> >> >
>> >> > >
>> >> > > >
>> >> > > >
>> >> > > > > +        * format.
>> >> > > > > +        * struct __bpf_tcp_congestion_ops {
>> >> > > > > +        *      [... some other kernel fields ...]
>> >> > > > > +        *      struct tcp_congestion_ops data;
>> >> > > > > +        * }
>> >> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
>> >> > > >
>> >> > > > Comment isn't very clear.. do you mean that data pointed to by
>> >> > > > kern_vdata is of sizeof(...) bytes?
>> >> > > >
>> >> > > > > +        * prepare_struct_ops() will populate the "data" into
>> >> > > > > +        * "kern_vdata".
>> >> > > > > +        */
>> >> > > > > +       void *kern_vdata;
>> >> > > > > +       __u32 type_id;
>> >> > > > > +       __u32 kern_vtype_id;
>> >> > > > > +       __u32 kern_vtype_size;
>> >> > > > > +       int fd;
>> >> > > > > +       bool unreg;
>> >> > > >
>> >> > > > This unreg flag (and default behavior to not unregister) is bothering
>> >> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
>> >> > > > E.g., if no one pins that map - then struct_ops should be unregistered
>> >> > > > on map destruction. If application wants to keep BPF programs
>> >> > > > attached, it should make sure to pin map, before userspace part exits?
>> >> > > > Is this problematic in any way?
>> >> > > I don't think it should in the struct_ops case.  I think of the
>> >> > > struct_ops map as a set of progs "attached" to a subsystem (tcp_cong
>> >> > > in this case), and these map-progs stay (or keep being attached) until
>> >> > > explicitly detached.  Like any other attached bpf_prog, they keep
>> >> > > running regardless of whether the bpf_prog is pinned or not.
>> >> >
>> >> > I'll let someone else comment on how this behaves for cgroup, xdp,
>> >> > etc,
>> >> > but for tracing, for example, we have FD-based BPF links, which
>> >> > will detach program automatically when FD is closed. I think the idea
>> >> > is to extend this to other types of BPF programs as well, so there is
>> >> > no risk of leaving some stray BPF program running after unintended
>> >> Like xdp_prog, struct_ops does not have a separate fd-based link.
>> >> Such a link can be created for struct_ops, xdp_prog and others later.
>> >> I don't see a conflict here.
>> >
>> > My point was that the default behavior should be conservative: free up
>> > resources automatically on process exit, unless specifically pinned by
>> > the user.
>> > But this discussion made me realize that we miss one thing from the
>> > general bpf_link framework. See below.
>> >
>> >>
>> >> > crash of userspace program. When application explicitly needs BPF
>> >> > program to outlive its userspace control app, then this can be
>> >> > achieved by pinning map/program in BPFFS.
>> >> If the concern is about not leaving struct_ops behind,
>> >> let's assume there is no "detach" and everything depends on the very
>> >> last userspace handle (FD/pinned) of a map going away.
>> >> What would be an easy way to remove bpf_cubic from the system?
>> >
>> > Yeah, I think this "last map FD close frees up resources/detaches" is
>> > a good behavior.
>> >
>> > Where we do have a problem is with bpf_link__destroy() unconditionally
>> > also detaching whatever was attached (tracepoint, kprobe, or whatever
>> > was done to create the bpf_link in the first place). Now,
>> > bpf_link__destroy() has to be called by the user (or skeleton) to at
>> > least free up malloc()'ed structs. But it appears that it's not always
>> > desirable that the underlying BPF program gets detached upon bpf_link
>> > destruction. I think this will be the case for xdp and others as well.
>>
>> For XDP the model has thus far been "once attached, the program stays
>> until explicitly detached". Changing that would certainly be surprising,
>> so I agree that splitting the API is best (not that I'm sure how many
>> XDP programs will end up using that API, but that's a different
>> concern)...
>
> This would be a new FD-based API for XDP; I don't think we can change the
> existing API. But I think the default behavior should still be to
> auto-detach, unless explicitly "pinned" in whatever way. That would
> prevent surprising "leakage" of BPF programs for unsuspecting users.

But why do we need a new API for attaching XDP programs? Also, what are
the use cases where it makes sense to have this kind of "transient" XDP
program? The only one I can think of is something like xdpdump, which
moves packets to userspace (and should stop doing that when the
userspace listener goes away). But with bpf-to-bpf tracing, xdpdump
won't actually be an XDP program, so what's left? The system firewall
rules don't go away when the program that installed them exits either;
why should an XDP program?

-Toke


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 06/13] bpf: Introduce  BPF_MAP_TYPE_STRUCT_OPS
  2019-12-20  7:22     ` Martin Lau
@ 2019-12-20 16:52       ` Martin Lau
  2019-12-20 18:41         ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Martin Lau @ 2019-12-20 16:52 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, David Miller,
	Kernel Team, netdev

On Thu, Dec 19, 2019 at 11:22:17PM -0800, Martin Lau wrote:

> [ ... ]
> 
> > > +/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
> > > + * exposed to the userspace and its btf-type-id is stored
> > > + * at the map->btf_vmlinux_value_type_id.
> > > + *
> > > + * The *_name##_dummy is to ensure the BTF type is emitted.
> > > + */
> > > +
> > >   #define BPF_STRUCT_OPS_TYPE(_name)				\
> > > -extern struct bpf_struct_ops bpf_##_name;
> > > +extern struct bpf_struct_ops bpf_##_name;			\
> > > +								\
> > > +static struct __bpf_##_name {					\
> > > +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> > > +	struct _name data ____cacheline_aligned_in_smp;		\
> > > +} *_name##_dummy;
> > 
> > There are other ways to retain types in debug info without
> > creating new variables. For example, you can use it in a cast
> > like
> >      (void *)(struct __bpf_##_name *)v
> hmm... What is v?
Got it.  "v" could be any dummy pointer in a function.
I will use (void) instead of (void *) to avoid a compiler warning.

> 
> > Not sure whether we could easily find a place for such casting or not.
This can be done in bpf_struct_ops_init().
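A minimal sketch of what that could look like (the exact spot inside
bpf_struct_ops_init() and the NULL stand-in for "v" are my assumptions):

/* sketch: referencing the kernel-side value type in a cast keeps
 * struct __bpf_tcp_congestion_ops in the debug info / BTF without
 * declaring a dummy variable; the (void) silences the unused-value
 * warning.  NULL stands in for any dummy pointer.
 */
void bpf_struct_ops_init(struct btf *btf)
{
	(void)(struct __bpf_tcp_congestion_ops *)NULL;

	/* ... the rest of the init work ... */
}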

Thanks for the tips!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
  2019-12-20 10:16                 ` Toke Høiland-Jørgensen
@ 2019-12-20 17:34                   ` Andrii Nakryiko
  0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-20 17:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin Lau, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, Networking

On Fri, Dec 20, 2019 at 2:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Thu, Dec 19, 2019 at 12:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
> >> >>
> >> >> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> >> >> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> >> >> > >
> >> >> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> >> >> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> > > > >
> >> >> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
> >> >> > > > >
> >> >> > > > > The only sec_name convention is SEC("struct_ops") to identify the
> >> >> > > > > struct ops implemented in BPF, e.g.
> >> >> > > > > SEC("struct_ops")
> >> >> > > > > struct tcp_congestion_ops dctcp = {
> >> >> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> >> >> > > > >         /* ... some more func prts ... */
> >> >> > > > >         .name           = "bpf_dctcp",
> >> >> > > > > };
> >> >> > > > >
> >> >> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> >> >> > > > > elf section and find out what is the btf-type the "struct_ops" is
> >> >> > > > > implementing.  Note that the btf-type here is referring to
> >> >> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> >> >> > > > > where are the bpf progs that the func ptrs are referring to.
> >> >> > > > >
> >> >> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> >> >> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> >> >> > > > > With the kernel's btf-type, it can then set the prog->type,
> >> >> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> >> >> > > > > the prog's properties do not rely on its section name.
> >> >> > > > >
> >> >> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> >> >> > > > > process is as simple as: member-name match + btf-kind match + size match.
> >> >> > > > > If these matching conditions fail, libbpf will reject.
> >> >> > > > > The current targeting support is "struct tcp_congestion_ops" which
> >> >> > > > > most of its members are function pointers.
> >> >> > > > > The member ordering of the bpf_prog's btf-type can be different from
> >> >> > > > > the btf_vmlinux's btf-type.
> >> >> > > > >
> >> >> > > > > Once the prog's properties are all set,
> >> >> > > > > the libbpf will proceed to load all the progs.
> >> >> > > > >
> >> >> > > > > After that, register_struct_ops() will create a map, finalize the
> >> >> > > > > map-value by populating it with the prog-fd, and then register this
> >> >> > > > > "struct_ops" to the kernel by updating the map-value to the map.
> >> >> > > > >
> >> >> > > > > By default, libbpf does not unregister the struct_ops from the kernel
> >> >> > > > > during bpf_object__close().  It can be changed by setting the new
> >> >> > > > > "unreg_st_ops" in bpf_object_open_opts.
> >> >> > > > >
> >> >> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> >> >> > > > > ---
> >> >> > > >
> >> >> > > > This looks pretty good to me. The two big things are exposing struct_ops
> >> >> > > > as a real struct bpf_map, so that users can interact with it using
> >> >> > > > libbpf APIs, as well as splitting struct_ops map creation and
> >> >> > > > registration. bpf_object__load() should only make sure all maps are
> >> >> > > > created and progs are loaded/verified, but no BPF program can be
> >> >> > > > called yet. Then attach is the phase where registration happens.
> >> >> > > Thanks for the review.
> >> >> > >
> >> >> > > [ ... ]
> >> >> > >
> >> >> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
> >> >> > > > >  {
> >> >> > > > >         return (__u64) (unsigned long) ptr;
> >> >> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
> >> >> > > > >         bool reused;
> >> >> > > > >  };
> >> >> > > > >
> >> >> > > > > +struct bpf_struct_ops {
> >> >> > > > > +       const char *var_name;
> >> >> > > > > +       const char *tname;
> >> >> > > > > +       const struct btf_type *type;
> >> >> > > > > +       struct bpf_program **progs;
> >> >> > > > > +       __u32 *kern_func_off;
> >> >> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> >> >> > > > > +       void *data;
> >> >> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> >> >> > > >
> >> >> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> >> >> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> >> >> > > > it btf_ops_ or btf_structops_?
> >> >> > > Is it a concern on name collision?
> >> >> > >
> >> >> > > The prefix was picked to be a more representative name.
> >> >> > > struct_ops uses many bpf pieces, and btf is one of them.
> >> >> > > Very soon, all new code will depend on BTF, so a btf_ prefix
> >> >> > > could become just as generic.
> >> >> > >
> >> >> > > Unlike tracepoint, there is no non-btf version of struct_ops.
> >> >> >
> >> >> > Not so much a name collision as being able to immediately recognize
> >> >> > that it's used to provide type information for struct_ops. Think about
> >> >> > some automated tooling parsing vmlinux BTF and trying to create
> >> >> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
> >> >> > a unique prefix that identifies what kind of type-providing struct it
> >> >> > is makes building a generic tool like that much easier, while __bpf_
> >> >> > doesn't indicate in any way that it's for struct_ops.
> >> >> >
> >> >> > >
> >> >> > > >
> >> >> > > >
> >> >> > > > > +        * format.
> >> >> > > > > +        * struct __bpf_tcp_congestion_ops {
> >> >> > > > > +        *      [... some other kernel fields ...]
> >> >> > > > > +        *      struct tcp_congestion_ops data;
> >> >> > > > > +        * }
> >> >> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> >> >> > > >
> >> >> > > > Comment isn't very clear.. do you mean that data pointed to by
> >> >> > > > kern_vdata is of sizeof(...) bytes?
> >> >> > > >
> >> >> > > > > +        * prepare_struct_ops() will populate the "data" into
> >> >> > > > > +        * "kern_vdata".
> >> >> > > > > +        */
> >> >> > > > > +       void *kern_vdata;
> >> >> > > > > +       __u32 type_id;
> >> >> > > > > +       __u32 kern_vtype_id;
> >> >> > > > > +       __u32 kern_vtype_size;
> >> >> > > > > +       int fd;
> >> >> > > > > +       bool unreg;
> >> >> > > >
> >> >> > > > This unreg flag (and default behavior to not unregister) is bothering
> >> >> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> >> >> > > > E.g., if no one pins that map - then struct_ops should be unregistered
> >> >> > > > on map destruction. If application wants to keep BPF programs
> >> >> > > > attached, it should make sure to pin map, before userspace part exits?
> >> >> > > > Is this problematic in any way?
> >> >> > > I don't think it should in the struct_ops case.  I think of the
> >> >> > > struct_ops map as a set of progs "attached" to a subsystem (tcp_cong
> >> >> > > in this case), and these map-progs stay (or keep being attached) until
> >> >> > > explicitly detached.  Like any other attached bpf_prog, they keep
> >> >> > > running regardless of whether the bpf_prog is pinned or not.
> >> >> >
> >> >> > I'll let someone else comment on how this behaves for cgroup, xdp,
> >> >> > etc,
> >> >> > but for tracing, for example, we have FD-based BPF links, which
> >> >> > will detach program automatically when FD is closed. I think the idea
> >> >> > is to extend this to other types of BPF programs as well, so there is
> >> >> > no risk of leaving some stray BPF program running after unintended
> >> >> Like xdp_prog, struct_ops does not have a separate fd-based link.
> >> >> Such a link can be created for struct_ops, xdp_prog and others later.
> >> >> I don't see a conflict here.
> >> >
> >> > My point was that the default behavior should be conservative: free up
> >> > resources automatically on process exit, unless specifically pinned by
> >> > the user.
> >> > But this discussion made me realize that we miss one thing from the
> >> > general bpf_link framework. See below.
> >> >
> >> >>
> >> >> > crash of userspace program. When application explicitly needs BPF
> >> >> > program to outlive its userspace control app, then this can be
> >> >> > achieved by pinning map/program in BPFFS.
> >> >> If the concern is about not leaving struct_ops behind,
> >> >> let's assume there is no "detach" and everything depends on the very
> >> >> last userspace handle (FD/pinned) of a map going away.
> >> >> What would be an easy way to remove bpf_cubic from the system?
> >> >
> >> > Yeah, I think this "last map FD close frees up resources/detaches" is
> >> > a good behavior.
> >> >
> >> > Where we do have a problem is with bpf_link__destroy() unconditionally
> >> > also detaching whatever was attached (tracepoint, kprobe, or whatever
> >> > was done to create the bpf_link in the first place). Now,
> >> > bpf_link__destroy() has to be called by the user (or skeleton) to at
> >> > least free up malloc()'ed structs. But it appears that it's not always
> >> > desirable that the underlying BPF program gets detached upon bpf_link
> >> > destruction. I think this will be the case for xdp and others as well.
> >>
> >> For XDP the model has thus far been "once attached, the program stays
> >> until explicitly detached". Changing that would certainly be surprising,
> >> so I agree that splitting the API is best (not that I'm sure how many
> >> XDP programs will end up using that API, but that's a different
> >> concern)...
> >
> > This would be a new FD-based API for XDP; I don't think we can change the
> > existing API. But I think the default behavior should still be to
> > auto-detach, unless explicitly "pinned" in whatever way. That would
> > prevent surprising "leakage" of BPF programs for unsuspecting users.
>
> But why do we need a new API for attaching XDP programs? Also, what are
> the use cases where it makes sense to have this kind of "transient" XDP
> program? The only one I can think of is something like xdpdump, which

During development, for instance, when your buggy userspace program
crashes? I think by default all those attached BPF programs should be
auto-detachable, if possible. That's the direction that worked out
really well with kprobes/tracepoints/perf_events. Previously, using the
old APIs, you'd attach a kprobe, and if userspace didn't clean up, that
kprobe would stay attached in the system, consuming resources without
users noticing (which is especially critical in production).
Switching to an auto-detachable, FD-based interface greatly improved
that experience. I think this is a good model going forward.

In practice, for production use cases, it will be just a trivial piece
of code to keep it attached:

struct bpf_link *xdp_link = bpf_program__attach_xdp(...);
/* now if the userspace program crashes, the xdp BPF program will stay attached */
bpf_link__disconnect(xdp_link);

> moves packets to userspace (and should stop doing that when the
> userspace listener goes away). But with bpf-to-bpf tracing, xdpdump
> won't actually be an XDP program, so what's left? The system firewall
> rules don't go away when the program that installed them exits either;
> why should an XDP program?

See above, I'm not saying that it shouldn't be possible to keep it
attached. I'm just arguing it's not a good default, because it can
catch developers off guard and cause problems, especially in
production environments. In the end, it is a resource leak, unless you
want and expect it.
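
And the explicit opt-in stays cheap as well; a sketch with map pinning as it
exists in libbpf today (the map name and BPFFS path are made up for
illustration):

/* sketch: keep the struct_ops registered past process exit by pinning
 * its map in BPFFS; map name and path are illustrative.
 */
#include <errno.h>
#include <bpf/libbpf.h>

static int pin_struct_ops(struct bpf_object *obj)
{
	struct bpf_map *map = bpf_object__find_map_by_name(obj, "dctcp");

	if (!map)
		return -ENOENT;
	return bpf_map__pin(map, "/sys/fs/bpf/tcp_ca_dctcp");
}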

>
> -Toke
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
  2019-12-20 16:52       ` Martin Lau
@ 2019-12-20 18:41         ` Andrii Nakryiko
  0 siblings, 0 replies; 51+ messages in thread
From: Andrii Nakryiko @ 2019-12-20 18:41 UTC (permalink / raw)
  To: Martin Lau
  Cc: Yonghong Song, bpf, Alexei Starovoitov, Daniel Borkmann,
	David Miller, Kernel Team, netdev

On Fri, Dec 20, 2019 at 8:52 AM Martin Lau <kafai@fb.com> wrote:
>
> On Thu, Dec 19, 2019 at 11:22:17PM -0800, Martin Lau wrote:
>
> > [ ... ]
> >
> > > > +/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
> > > > + * exposed to the userspace and its btf-type-id is stored
> > > > + * at the map->btf_vmlinux_value_type_id.
> > > > + *
> > > > + * The *_name##_dummy is to ensure the BTF type is emitted.
> > > > + */
> > > > +
> > > >   #define BPF_STRUCT_OPS_TYPE(_name)                              \
> > > > -extern struct bpf_struct_ops bpf_##_name;
> > > > +extern struct bpf_struct_ops bpf_##_name;                        \
> > > > +                                                         \
> > > > +static struct __bpf_##_name {                                    \
> > > > + BPF_STRUCT_OPS_COMMON_VALUE;                            \
> > > > + struct _name data ____cacheline_aligned_in_smp;         \
> > > > +} *_name##_dummy;
> > >
> > > There are other ways to retain types in debug info without
> > > creating new variables. For example, you can use it in a cast
> > > like
> > >      (void *)(struct __bpf_##_name *)v
> > hmm... What is v?
> Got it.  "v" could be any dummy pointer in a function.
> I will use (void) instead of (void *) to avoid a compiler warning.
>

This discussion inspired me to try this:

#define PRESERVE_TYPE_INFO(type) ((void)(type *)0)

... somewhere in any function ...

PRESERVE_TYPE_INFO(struct whatever_struct);

And it works! We should probably put this helper macro somewhere in
include/linux/bpf.h and use it consistently for cases like this.
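For illustration, a sketch of how it could replace the *_name##_dummy
variable (the wrapper function and call site are just assumptions):

/* sketch: keep the BTF type alive with a no-op use instead of a
 * dummy variable; the call site is illustrative.
 */
#define PRESERVE_TYPE_INFO(type) ((void)(type *)0)

static void bpf_struct_ops_preserve_types(void)
{
	PRESERVE_TYPE_INFO(struct __bpf_tcp_congestion_ops);
}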

> >
> > > Not sure whether we could easily find a place for such casting or not.
> This can be done in bpf_struct_ops_init().
>
> Thanks for the tips!

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2019-12-20 18:41 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-14  0:47 [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Martin KaFai Lau
2019-12-14  0:47 ` [PATCH bpf-next 01/13] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
2019-12-16 19:48   ` Yonghong Song
2019-12-14  0:47 ` [PATCH bpf-next 02/13] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
2019-12-16 21:34   ` Yonghong Song
2019-12-14  0:47 ` [PATCH bpf-next 03/13] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
2019-12-16 21:36   ` Yonghong Song
2019-12-14  0:47 ` [PATCH bpf-next 04/13] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
2019-12-16 22:05   ` Yonghong Song
2019-12-14  0:47 ` [PATCH bpf-next 05/13] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
2019-12-17  6:14   ` Yonghong Song
2019-12-18 16:41     ` Martin Lau
2019-12-14  0:47 ` [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
2019-12-17  7:48   ` [Potential Spoof] " Yonghong Song
2019-12-20  7:22     ` Martin Lau
2019-12-20 16:52       ` Martin Lau
2019-12-20 18:41         ` Andrii Nakryiko
2019-12-14  0:47 ` [PATCH bpf-next 07/13] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
2019-12-17 17:36   ` Yonghong Song
2019-12-14  0:47 ` [PATCH bpf-next 08/13] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
2019-12-17 17:41   ` Yonghong Song
2019-12-14  0:47 ` [PATCH bpf-next 09/13] bpf: Add BPF_FUNC_jiffies Martin KaFai Lau
2019-12-14  1:59   ` Eric Dumazet
2019-12-14 19:25     ` Neal Cardwell
2019-12-16 19:30       ` Martin Lau
2019-12-17  8:26       ` Jakub Sitnicki
2019-12-17 18:22         ` Martin Lau
2019-12-17 21:04           ` Eric Dumazet
2019-12-18  9:03           ` Jakub Sitnicki
2019-12-16 19:14     ` Martin Lau
2019-12-16 19:33       ` Eric Dumazet
2019-12-16 21:17         ` Martin Lau
2019-12-16 23:08       ` Alexei Starovoitov
2019-12-17  0:34         ` Eric Dumazet
2019-12-14  0:48 ` [PATCH bpf-next 10/13] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
2019-12-14  0:48 ` [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
2019-12-18  3:07   ` Andrii Nakryiko
2019-12-18  7:03     ` Martin Lau
2019-12-18  7:20       ` Martin Lau
2019-12-18 16:36         ` Andrii Nakryiko
2019-12-18 16:34       ` Andrii Nakryiko
2019-12-18 17:33         ` Martin Lau
2019-12-18 18:14           ` Andrii Nakryiko
2019-12-18 20:19             ` Martin Lau
2019-12-19  8:53             ` Toke Høiland-Jørgensen
2019-12-19 20:49               ` Andrii Nakryiko
2019-12-20 10:16                 ` Toke Høiland-Jørgensen
2019-12-20 17:34                   ` Andrii Nakryiko
2019-12-14  0:48 ` [PATCH bpf-next 12/13] bpf: Add bpf_dctcp example Martin KaFai Lau
2019-12-14  0:48 ` [PATCH bpf-next 13/13] bpf: Add bpf_cubic example Martin KaFai Lau
2019-12-14  2:26 ` [PATCH bpf-next 00/13] Introduce BPF STRUCT_OPS Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).