* [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-05  5:59   ` Alexei Starovoitov
  2021-02-04 23:48 ` [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

There is no functionality change. This refactoring is intended
to facilitate the next patch, which introduces BPF_PSEUDO_FUNC.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/verifier.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5e09632efddb..db294b75d03b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -228,6 +228,12 @@ static void bpf_map_key_store(struct bpf_insn_aux_data *aux, u64 state)
 			     (poisoned ? BPF_MAP_KEY_POISON : 0ULL);
 }
 
+static bool bpf_pseudo_call(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_JMP | BPF_CALL) &&
+	       insn->src_reg == BPF_PSEUDO_CALL;
+}
+
 struct bpf_call_arg_meta {
 	struct bpf_map *map_ptr;
 	bool raw_mode;
@@ -1486,9 +1492,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
 
 	/* determine subprog starts. The end is one before the next starts */
 	for (i = 0; i < insn_cnt; i++) {
-		if (insn[i].code != (BPF_JMP | BPF_CALL))
-			continue;
-		if (insn[i].src_reg != BPF_PSEUDO_CALL)
+		if (!bpf_pseudo_call(insn + i))
 			continue;
 		if (!env->bpf_capable) {
 			verbose(env,
@@ -3074,9 +3078,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 continue_func:
 	subprog_end = subprog[idx + 1].start;
 	for (; i < subprog_end; i++) {
-		if (insn[i].code != (BPF_JMP | BPF_CALL))
-			continue;
-		if (insn[i].src_reg != BPF_PSEUDO_CALL)
+		if (!bpf_pseudo_call(insn + i))
 			continue;
 		/* remember insn and function to return to */
 		ret_insn[frame] = i + 1;
@@ -10844,8 +10846,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		return 0;
 
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
-		if (insn->code != (BPF_JMP | BPF_CALL) ||
-		    insn->src_reg != BPF_PSEUDO_CALL)
+		if (!bpf_pseudo_call(insn))
 			continue;
 		/* Upon error here we cannot fall back to interpreter but
 		 * need a hard reject of the program. Thus -EFAULT is
@@ -10974,8 +10975,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	for (i = 0; i < env->subprog_cnt; i++) {
 		insn = func[i]->insnsi;
 		for (j = 0; j < func[i]->len; j++, insn++) {
-			if (insn->code != (BPF_JMP | BPF_CALL) ||
-			    insn->src_reg != BPF_PSEUDO_CALL)
+			if (!bpf_pseudo_call(insn))
 				continue;
 			subprog = insn->off;
 			insn->imm = BPF_CAST_CALL(func[subprog]->bpf_func) -
@@ -11020,8 +11020,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	 * later look the same as if they were interpreted only.
 	 */
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
-		if (insn->code != (BPF_JMP | BPF_CALL) ||
-		    insn->src_reg != BPF_PSEUDO_CALL)
+		if (!bpf_pseudo_call(insn))
 			continue;
 		insn->off = env->insn_aux_data[i].call_imm;
 		subprog = find_subprog(env, i + insn->off + 1);
@@ -11050,8 +11049,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	/* cleanup main prog to be interpreted */
 	prog->jit_requested = 0;
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
-		if (insn->code != (BPF_JMP | BPF_CALL) ||
-		    insn->src_reg != BPF_PSEUDO_CALL)
+		if (!bpf_pseudo_call(insn))
 			continue;
 		insn->off = 0;
 		insn->imm = env->insn_aux_data[i].call_imm;
@@ -11086,8 +11084,7 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 		return -EINVAL;
 	}
 	for (i = 0; i < prog->len; i++, insn++) {
-		if (insn->code != (BPF_JMP | BPF_CALL) ||
-		    insn->src_reg != BPF_PSEUDO_CALL)
+		if (!bpf_pseudo_call(insn))
 			continue;
 		depth = get_callee_stack_depth(env, insn, i);
 		if (depth < 0)
-- 
2.24.1



* [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper
@ 2021-02-04 23:48 Yonghong Song
  2021-02-04 23:48 ` [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function Yonghong Song
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

This patch set introduces the bpf_for_each_map_elem() helper.
The helper permits a bpf program to iterate through all elements
of a particular map.

The work was originally inspired by an internal discussion where
firewall rules are kept in a map and a bpf prog wants to
check packet 5-tuples against all rules in the map.
A bounded loop can be used, but it has a few drawbacks.
As the number of loop iterations goes up, verification time goes up too.
For really large maps, verification may fail.
A helper which abstracts out the loop itself does not have
this verification time issue.

A recent discussion in [1] involves iterating over all hash map
elements in a bpf program. Currently, iterating over all hashmap
elements in a bpf program is not easy if the key space is really big.
Having a helper to abstract out the loop itself is even more
meaningful there.

The proposed helper signature looks like:
  long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
where callback_fn is a static function and callback_ctx is
a piece of data allocated on the caller's stack which can be
accessed by callback_fn. The callback_fn signature may be
different for different maps. For example, for hash/array maps,
the signature might be
  long callback_fn(map, key, val, callback_ctx)
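
For illustration (this sketch is not part of the series; the map name,
section name and callback name below are made up, and the exact callback
prototype is map-specific), a bpf program using the proposed helper
against a hashmap could look roughly like:

  #include <bpf/bpf_helpers.h>    /* SEC(), __uint(), __type() */
  /* plus vmlinux.h (or equivalent) for struct bpf_map, __u32, __u64 */

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 64);
          __type(key, __u32);
          __type(value, __u64);
  } rules SEC(".maps");

  struct callback_ctx {
          __u32 matched_key;
          __u32 matched;
  };

  static long check_rule(struct bpf_map *map, __u32 *key, __u64 *val,
                         struct callback_ctx *data)
  {
          if (*val > 100) {
                  data->matched_key = *key;
                  data->matched = 1;
                  return 1;       /* stop iterating */
          }
          return 0;               /* continue with the next element */
  }

  SEC("tp/syscalls/sys_enter_getpid")
  int prog(void *ctx)
  {
          struct callback_ctx data = {};

          bpf_for_each_map_elem(&rules, check_rule, &data, 0);
          /* data.matched/data.matched_key now reflect the scan result */
          return 0;
  }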

In the rest of the series, Patch 1 does some refactoring. Patch 2
implements core kernel support for the helper. Patches 3 and 4
add hashmap and arraymap support. Patches 5 and 6 add
libbpf and bpftool support. Patches 7 and 8 add selftests
for hashmap and arraymap.

[1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/

Yonghong Song (8):
  bpf: refactor BPF_PSEUDO_CALL checking as a helper function
  bpf: add bpf_for_each_map_elem() helper
  bpf: add hashtab support for bpf_for_each_map_elem() helper
  bpf: add arraymap support for bpf_for_each_map_elem() helper
  libbpf: support local function pointer relocation
  bpftool: print local function pointer properly
  selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper
  selftests/bpf: add arraymap test for bpf_for_each_map_elem() helper

 include/linux/bpf.h                           |  18 ++
 include/linux/bpf_verifier.h                  |   3 +
 include/uapi/linux/bpf.h                      |  28 ++
 kernel/bpf/arraymap.c                         |  36 +++
 kernel/bpf/bpf_iter.c                         |  16 +
 kernel/bpf/hashtab.c                          |  57 ++++
 kernel/bpf/helpers.c                          |   2 +
 kernel/bpf/verifier.c                         | 299 ++++++++++++++++--
 kernel/trace/bpf_trace.c                      |   2 +
 tools/bpf/bpftool/xlated_dumper.c             |   3 +
 tools/include/uapi/linux/bpf.h                |  28 ++
 tools/lib/bpf/libbpf.c                        |  33 +-
 .../selftests/bpf/prog_tests/for_each.c       | 145 +++++++++
 .../bpf/progs/for_each_array_map_elem.c       |  71 +++++
 .../bpf/progs/for_each_hash_map_elem.c        | 103 ++++++
 15 files changed, 811 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each.c
 create mode 100644 tools/testing/selftests/bpf/progs/for_each_array_map_elem.c
 create mode 100644 tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c

-- 
2.24.1



* [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
  2021-02-04 23:48 ` [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-05  5:49   ` Alexei Starovoitov
  2021-02-08 18:16   ` Andrii Nakryiko
  2021-02-04 23:48 ` [PATCH bpf-next 3/8] bpf: add hashtab support for " Yonghong Song
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

The bpf_for_each_map_elem() helper is introduced, which
iterates over all map elements with a callback function. The
helper signature looks like
  long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
and for each map element, callback_fn will be called. For example,
for a hashmap, the callback signature may look like
  long callback_fn(map, key, val, callback_ctx)

There are two known use cases for this. One is from upstream ([1]) where
a for_each_map_elem helper may help implement a timeout mechanism
in a more generic way. Another is from our internal discussion
for a firewall use case where a map contains all the rules. The packet
data can be compared to all these rules to decide whether to allow or
deny the packet.

For array maps, users can already use a bounded loop to traverse
elements. Using this helper avoids the bounded loop. For other
types of maps (e.g., hash maps) where a bounded loop is hard or
impossible to use, this helper provides a convenient way to
operate on all elements.

For callback_fn, besides the map and the map element, a callback_ctx,
allocated on the caller's stack, is also passed to the callback
function. This callback_ctx argument can provide additional
input and allows writing to the caller's stack for output.

If the callback_fn returns 0, the helper will proceed to the next
element, if available. If the callback_fn returns 1, the helper
will stop iterating and return to the bpf program. Other return
values are not used for now.
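
For instance (illustrative only, using the hashmap-style prototype
shown above), a minimal callback obeying these return-value semantics
could be:

  static long callback_fn(struct bpf_map *map, __u32 *key, __u64 *val,
                          void *callback_ctx)
  {
          if (*val == 0)
                  return 1;       /* skip the rest of the elements */
          return 0;               /* continue with the next element */
  }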

Currently, this helper is only available with the jit. It is possible
to make it work with the interpreter with some effort, but I leave that
as future work.

[1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |  14 ++
 include/linux/bpf_verifier.h   |   3 +
 include/uapi/linux/bpf.h       |  28 ++++
 kernel/bpf/bpf_iter.c          |  16 +++
 kernel/bpf/helpers.c           |   2 +
 kernel/bpf/verifier.c          | 251 ++++++++++++++++++++++++++++++---
 kernel/trace/bpf_trace.c       |   2 +
 tools/include/uapi/linux/bpf.h |  28 ++++
 8 files changed, 328 insertions(+), 16 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 321966fc35db..c8b72ae16cc5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -40,6 +40,7 @@ struct bpf_local_storage;
 struct bpf_local_storage_map;
 struct kobject;
 struct mem_cgroup;
+struct bpf_func_state;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -130,6 +131,13 @@ struct bpf_map_ops {
 	bool (*map_meta_equal)(const struct bpf_map *meta0,
 			       const struct bpf_map *meta1);
 
+
+	int (*map_set_for_each_callback_args)(struct bpf_verifier_env *env,
+					      struct bpf_func_state *caller,
+					      struct bpf_func_state *callee);
+	int (*map_for_each_callback)(struct bpf_map *map, void *callback_fn,
+				     void *callback_ctx, u64 flags);
+
 	/* BTF name and id of struct allocated by map_alloc */
 	const char * const map_btf_name;
 	int *map_btf_id;
@@ -296,6 +304,8 @@ enum bpf_arg_type {
 	ARG_CONST_ALLOC_SIZE_OR_ZERO,	/* number of allocated bytes requested */
 	ARG_PTR_TO_BTF_ID_SOCK_COMMON,	/* pointer to in-kernel sock_common or bpf-mirrored bpf_sock */
 	ARG_PTR_TO_PERCPU_BTF_ID,	/* pointer to in-kernel percpu type */
+	ARG_PTR_TO_FUNC,	/* pointer to a bpf program function */
+	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	__BPF_ARG_TYPE_MAX,
 };
 
@@ -412,6 +422,8 @@ enum bpf_reg_type {
 	PTR_TO_RDWR_BUF,	 /* reg points to a read/write buffer */
 	PTR_TO_RDWR_BUF_OR_NULL, /* reg points to a read/write buffer or NULL */
 	PTR_TO_PERCPU_BTF_ID,	 /* reg points to a percpu kernel variable */
+	PTR_TO_FUNC,		 /* reg points to a bpf program function */
+	PTR_TO_MAP_KEY,		 /* reg points to map element key */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -794,6 +806,7 @@ struct bpf_prog_aux {
 	bool func_proto_unreliable;
 	bool sleepable;
 	bool tail_call_reachable;
+	bool with_callback_fn;
 	enum bpf_tramp_prog_type trampoline_prog_type;
 	struct hlist_node tramp_hlist;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
@@ -1888,6 +1901,7 @@ extern const struct bpf_func_proto bpf_per_cpu_ptr_proto;
 extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
 extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
 extern const struct bpf_func_proto bpf_sock_from_file_proto;
+extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index dfe6f85d97dd..c4366b3da342 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -68,6 +68,8 @@ struct bpf_reg_state {
 			unsigned long raw1;
 			unsigned long raw2;
 		} raw;
+
+		u32 subprog; /* for PTR_TO_FUNC */
 	};
 	/* For PTR_TO_PACKET, used to find other pointers with the same variable
 	 * offset, so they can share range knowledge.
@@ -204,6 +206,7 @@ struct bpf_func_state {
 	int acquired_refs;
 	struct bpf_reference_state *refs;
 	int allocated_stack;
+	bool with_callback_fn;
 	struct bpf_stack_state *stack;
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..d55bd4557376 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -393,6 +393,15 @@ enum bpf_link_type {
  *                   is struct/union.
  */
 #define BPF_PSEUDO_BTF_ID	3
+/* insn[0].src_reg:  BPF_PSEUDO_FUNC
+ * insn[0].imm:      insn offset to the func
+ * insn[1].imm:      0
+ * insn[0].off:      0
+ * insn[1].off:      0
+ * ldimm64 rewrite:  address of the function
+ * verifier type:    PTR_TO_FUNC.
+ */
+#define BPF_PSEUDO_FUNC		4
 
 /* when bpf_call->src_reg == BPF_PSEUDO_CALL, bpf_call->imm == pc-relative
  * offset to another bpf function
@@ -3836,6 +3845,24 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		For each element in **map**, call **callback_fn** function with
+ *		**map**, **callback_ctx** and other map-specific parameters.
+ *		For example, for hash and array maps, the callback signature can
+ *		be `u64 callback_fn(map, map_key, map_value, callback_ctx)`.
+ *		The **callback_fn** should be a static function and
+ *		the **callback_ctx** should be a pointer to the stack.
+ *		The **flags** is used to control certain aspects of the helper.
+ *		Currently, the **flags** must be 0.
+ *
+ *		If **callback_fn** return 0, the helper will continue to the next
+ *		element. If return value is 1, the helper will skip the rest of
+ *		elements and return. Other return values are not used now.
+ *	Return
+ *		0 for success, **-EINVAL** for invalid **flags** or unsupported
+ *		**callback_fn** return value.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4001,6 +4028,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(for_each_map_elem),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 5454161407f1..5187f49d3216 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -675,3 +675,19 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
 	 */
 	return ret == 0 ? 0 : -EAGAIN;
 }
+
+BPF_CALL_4(bpf_for_each_map_elem, struct bpf_map *, map, void *, callback_fn,
+	   void *, callback_ctx, u64, flags)
+{
+	return map->ops->map_for_each_callback(map, callback_fn, callback_ctx, flags);
+}
+
+const struct bpf_func_proto bpf_for_each_map_elem_proto = {
+	.func		= bpf_for_each_map_elem,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+	.arg3_type	= ARG_PTR_TO_STACK_OR_NULL,
+	.arg4_type	= ARG_ANYTHING,
+};
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 308427fe03a3..074800226327 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -708,6 +708,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_ringbuf_discard_proto;
 	case BPF_FUNC_ringbuf_query:
 		return &bpf_ringbuf_query_proto;
+	case BPF_FUNC_for_each_map_elem:
+		return &bpf_for_each_map_elem_proto;
 	default:
 		break;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index db294b75d03b..050b067a0be6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -234,6 +234,12 @@ static bool bpf_pseudo_call(const struct bpf_insn *insn)
 	       insn->src_reg == BPF_PSEUDO_CALL;
 }
 
+static bool bpf_pseudo_func(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
+	       insn->src_reg == BPF_PSEUDO_FUNC;
+}
+
 struct bpf_call_arg_meta {
 	struct bpf_map *map_ptr;
 	bool raw_mode;
@@ -409,6 +415,7 @@ static bool reg_type_not_null(enum bpf_reg_type type)
 	return type == PTR_TO_SOCKET ||
 		type == PTR_TO_TCP_SOCK ||
 		type == PTR_TO_MAP_VALUE ||
+		type == PTR_TO_MAP_KEY ||
 		type == PTR_TO_SOCK_COMMON;
 }
 
@@ -451,7 +458,8 @@ static bool arg_type_may_be_null(enum bpf_arg_type type)
 	       type == ARG_PTR_TO_MEM_OR_NULL ||
 	       type == ARG_PTR_TO_CTX_OR_NULL ||
 	       type == ARG_PTR_TO_SOCKET_OR_NULL ||
-	       type == ARG_PTR_TO_ALLOC_MEM_OR_NULL;
+	       type == ARG_PTR_TO_ALLOC_MEM_OR_NULL ||
+	       type == ARG_PTR_TO_STACK_OR_NULL;
 }
 
 /* Determine whether the function releases some resources allocated by another
@@ -534,6 +542,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_RDONLY_BUF_OR_NULL] = "rdonly_buf_or_null",
 	[PTR_TO_RDWR_BUF]	= "rdwr_buf",
 	[PTR_TO_RDWR_BUF_OR_NULL] = "rdwr_buf_or_null",
+	[PTR_TO_FUNC]		= "func",
+	[PTR_TO_MAP_KEY]	= "map_key",
 };
 
 static char slot_type_char[] = {
@@ -605,6 +615,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 			if (type_is_pkt_pointer(t))
 				verbose(env, ",r=%d", reg->range);
 			else if (t == CONST_PTR_TO_MAP ||
+				 t == PTR_TO_MAP_KEY ||
 				 t == PTR_TO_MAP_VALUE ||
 				 t == PTR_TO_MAP_VALUE_OR_NULL)
 				verbose(env, ",ks=%d,vs=%d",
@@ -1492,6 +1503,19 @@ static int check_subprogs(struct bpf_verifier_env *env)
 
 	/* determine subprog starts. The end is one before the next starts */
 	for (i = 0; i < insn_cnt; i++) {
+		if (bpf_pseudo_func(insn + i)) {
+			if (!env->bpf_capable) {
+				verbose(env,
+					"function pointers are allowed for CAP_BPF and CAP_SYS_ADMIN\n");
+				return -EPERM;
+			}
+			ret = add_subprog(env, i + insn[i].imm + 1);
+			if (ret < 0)
+				return ret;
+			/* remember subprog */
+			insn[i + 1].imm = find_subprog(env, i + insn[i].imm + 1);
+			continue;
+		}
 		if (!bpf_pseudo_call(insn + i))
 			continue;
 		if (!env->bpf_capable) {
@@ -2223,6 +2247,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_PERCPU_BTF_ID:
 	case PTR_TO_MEM:
 	case PTR_TO_MEM_OR_NULL:
+	case PTR_TO_FUNC:
+	case PTR_TO_MAP_KEY:
 		return true;
 	default:
 		return false;
@@ -2567,6 +2593,10 @@ static int __check_mem_access(struct bpf_verifier_env *env, int regno,
 
 	reg = &cur_regs(env)[regno];
 	switch (reg->type) {
+	case PTR_TO_MAP_KEY:
+		verbose(env, "invalid access to map key, key_size=%d off=%d size=%d\n",
+			mem_size, off, size);
+		break;
 	case PTR_TO_MAP_VALUE:
 		verbose(env, "invalid access to map value, value_size=%d off=%d size=%d\n",
 			mem_size, off, size);
@@ -2977,6 +3007,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_FLOW_KEYS:
 		pointer_desc = "flow keys ";
 		break;
+	case PTR_TO_MAP_KEY:
+		pointer_desc = "key ";
+		break;
 	case PTR_TO_MAP_VALUE:
 		pointer_desc = "value ";
 		break;
@@ -3078,7 +3111,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 continue_func:
 	subprog_end = subprog[idx + 1].start;
 	for (; i < subprog_end; i++) {
-		if (!bpf_pseudo_call(insn + i))
+		if (!bpf_pseudo_call(insn + i) && !bpf_pseudo_func(insn + i))
 			continue;
 		/* remember insn and function to return to */
 		ret_insn[frame] = i + 1;
@@ -3430,7 +3463,19 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 	/* for access checks, reg->off is just part of off */
 	off += reg->off;
 
-	if (reg->type == PTR_TO_MAP_VALUE) {
+	if (reg->type == PTR_TO_MAP_KEY) {
+		if (t == BPF_WRITE) {
+			verbose(env, "write to change key R%d not allowed\n", regno);
+			return -EACCES;
+		}
+
+		err = check_mem_region_access(env, regno, off, size,
+					      reg->map_ptr->key_size, false);
+		if (err)
+			return err;
+		if (value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
+	} else if (reg->type == PTR_TO_MAP_VALUE) {
 		if (t == BPF_WRITE && value_regno >= 0 &&
 		    is_pointer_value(env, value_regno)) {
 			verbose(env, "R%d leaks addr into map\n", value_regno);
@@ -3858,6 +3903,9 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	case PTR_TO_PACKET_META:
 		return check_packet_access(env, regno, reg->off, access_size,
 					   zero_size_allowed);
+	case PTR_TO_MAP_KEY:
+		return check_mem_region_access(env, regno, reg->off, access_size,
+					       reg->map_ptr->key_size, false);
 	case PTR_TO_MAP_VALUE:
 		if (check_map_access_type(env, regno, reg->off, access_size,
 					  meta && meta->raw_mode ? BPF_WRITE :
@@ -4049,6 +4097,7 @@ static const struct bpf_reg_types map_key_value_types = {
 		PTR_TO_STACK,
 		PTR_TO_PACKET,
 		PTR_TO_PACKET_META,
+		PTR_TO_MAP_KEY,
 		PTR_TO_MAP_VALUE,
 	},
 };
@@ -4080,6 +4129,7 @@ static const struct bpf_reg_types mem_types = {
 		PTR_TO_STACK,
 		PTR_TO_PACKET,
 		PTR_TO_PACKET_META,
+		PTR_TO_MAP_KEY,
 		PTR_TO_MAP_VALUE,
 		PTR_TO_MEM,
 		PTR_TO_RDONLY_BUF,
@@ -4092,6 +4142,7 @@ static const struct bpf_reg_types int_ptr_types = {
 		PTR_TO_STACK,
 		PTR_TO_PACKET,
 		PTR_TO_PACKET_META,
+		PTR_TO_MAP_KEY,
 		PTR_TO_MAP_VALUE,
 	},
 };
@@ -4104,6 +4155,8 @@ static const struct bpf_reg_types const_map_ptr_types = { .types = { CONST_PTR_T
 static const struct bpf_reg_types btf_ptr_types = { .types = { PTR_TO_BTF_ID } };
 static const struct bpf_reg_types spin_lock_types = { .types = { PTR_TO_MAP_VALUE } };
 static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_PERCPU_BTF_ID } };
+static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
+static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -4132,6 +4185,8 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_INT]		= &int_ptr_types,
 	[ARG_PTR_TO_LONG]		= &int_ptr_types,
 	[ARG_PTR_TO_PERCPU_BTF_ID]	= &percpu_btf_ptr_types,
+	[ARG_PTR_TO_FUNC]		= &func_ptr_types,
+	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -4932,12 +4987,92 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	return 0;
 }
 
+static int check_map_elem_callback(struct bpf_verifier_env *env, int *insn_idx)
+{
+	struct bpf_verifier_state *state = env->cur_state;
+	struct bpf_prog_aux *aux = env->prog->aux;
+	struct bpf_func_state *caller, *callee;
+	struct bpf_map *map;
+	int err, subprog;
+
+	if (state->curframe + 1 >= MAX_CALL_FRAMES) {
+		verbose(env, "the call stack of %d frames is too deep\n",
+			state->curframe + 2);
+		return -E2BIG;
+	}
+
+	caller = state->frame[state->curframe];
+	if (state->frame[state->curframe + 1]) {
+		verbose(env, "verifier bug. Frame %d already allocated\n",
+			state->curframe + 1);
+		return -EFAULT;
+	}
+
+	caller->with_callback_fn = true;
+
+	callee = kzalloc(sizeof(*callee), GFP_KERNEL);
+	if (!callee)
+		return -ENOMEM;
+	state->frame[state->curframe + 1] = callee;
+
+	/* callee cannot access r0, r6 - r9 for reading and has to write
+	 * into its own stack before reading from it.
+	 * callee can read/write into caller's stack
+	 */
+	init_func_state(env, callee,
+			/* remember the callsite, it will be used by bpf_exit */
+			*insn_idx /* callsite */,
+			state->curframe + 1 /* frameno within this callchain */,
+			subprog /* subprog number within this prog */);
+
+	/* Transfer references to the callee */
+	err = transfer_reference_state(callee, caller);
+	if (err)
+		return err;
+
+	subprog = caller->regs[BPF_REG_2].subprog;
+	if (aux->func_info && aux->func_info_aux[subprog].linkage != BTF_FUNC_STATIC) {
+		verbose(env, "callback function R2 not static\n");
+		return -EINVAL;
+	}
+
+	map = caller->regs[BPF_REG_1].map_ptr;
+	if (!map->ops->map_set_for_each_callback_args ||
+	    !map->ops->map_for_each_callback) {
+		verbose(env, "callback function not allowed for map R1\n");
+		return -ENOTSUPP;
+	}
+
+	/* the following is only for hashmap, different maps
+	 * can have different callback signatures.
+	 */
+	err = map->ops->map_set_for_each_callback_args(env, caller, callee);
+	if (err)
+		return err;
+
+	clear_caller_saved_regs(env, caller->regs);
+
+	/* only increment it after check_reg_arg() finished */
+	state->curframe++;
+
+	/* and go analyze first insn of the callee */
+	*insn_idx = env->subprog_info[subprog].start - 1;
+
+	if (env->log.level & BPF_LOG_LEVEL) {
+		verbose(env, "caller:\n");
+		print_verifier_state(env, caller);
+		verbose(env, "callee:\n");
+		print_verifier_state(env, callee);
+	}
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
 	struct bpf_func_state *caller, *callee;
 	struct bpf_reg_state *r0;
-	int err;
+	int i, err;
 
 	callee = state->frame[state->curframe];
 	r0 = &callee->regs[BPF_REG_0];
@@ -4955,7 +5090,17 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 	state->curframe--;
 	caller = state->frame[state->curframe];
 	/* return to the caller whatever r0 had in the callee */
-	caller->regs[BPF_REG_0] = *r0;
+	if (caller->with_callback_fn) {
+		/* reset caller saved regs, the helper calling callback_fn
+		 * has RET_INTEGER return types.
+		 */
+		for (i = 0; i < CALLER_SAVED_REGS; i++)
+			mark_reg_not_init(env, caller->regs, caller_saved[i]);
+		caller->regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG;
+		mark_reg_unknown(env, caller->regs, BPF_REG_0);
+	} else {
+		caller->regs[BPF_REG_0] = *r0;
+	}
 
 	/* Transfer references to the caller */
 	err = transfer_reference_state(caller, callee);
@@ -5091,7 +5236,8 @@ static int check_reference_leak(struct bpf_verifier_env *env)
 	return state->acquired_refs ? -EINVAL : 0;
 }
 
-static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
+static int check_helper_call(struct bpf_verifier_env *env, int func_id, int *insn_idx,
+			     bool map_elem_callback)
 {
 	const struct bpf_func_proto *fn = NULL;
 	struct bpf_reg_state *regs;
@@ -5151,11 +5297,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			return err;
 	}
 
-	err = record_func_map(env, &meta, func_id, insn_idx);
+	err = record_func_map(env, &meta, func_id, *insn_idx);
 	if (err)
 		return err;
 
-	err = record_func_key(env, &meta, func_id, insn_idx);
+	err = record_func_key(env, &meta, func_id, *insn_idx);
 	if (err)
 		return err;
 
@@ -5163,7 +5309,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 	 * is inferred from register state.
 	 */
 	for (i = 0; i < meta.access_size; i++) {
-		err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B,
+		err = check_mem_access(env, *insn_idx, meta.regno, i, BPF_B,
 				       BPF_WRITE, -1, false);
 		if (err)
 			return err;
@@ -5195,6 +5341,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		return -EINVAL;
 	}
 
+	if (map_elem_callback) {
+		env->prog->aux->with_callback_fn = true;
+		return check_map_elem_callback(env, insn_idx);
+	}
+
 	/* reset caller saved regs */
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		mark_reg_not_init(env, regs, caller_saved[i]);
@@ -5306,7 +5457,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		/* For release_reference() */
 		regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id;
 	} else if (is_acquire_function(func_id, meta.map_ptr)) {
-		int id = acquire_reference_state(env, insn_idx);
+		int id = acquire_reference_state(env, *insn_idx);
 
 		if (id < 0)
 			return id;
@@ -5448,6 +5599,14 @@ static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg,
 		else
 			*ptr_limit = -off;
 		return 0;
+	case PTR_TO_MAP_KEY:
+		if (mask_to_left) {
+			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
+		} else {
+			off = ptr_reg->smin_value + ptr_reg->off;
+			*ptr_limit = ptr_reg->map_ptr->key_size - off;
+		}
+		return 0;
 	case PTR_TO_MAP_VALUE:
 		if (mask_to_left) {
 			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
@@ -5614,6 +5773,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
+	case PTR_TO_MAP_KEY:
 	case PTR_TO_MAP_VALUE:
 		if (!env->allow_ptr_leaks && !known && (smin_val < 0) != (smax_val < 0)) {
 			verbose(env, "R%d has unknown scalar with mixed signed bounds, pointer arithmetic with it prohibited for !root\n",
@@ -7818,6 +7978,12 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		return 0;
 	}
 
+	if (insn->src_reg == BPF_PSEUDO_FUNC) {
+		dst_reg->type = PTR_TO_FUNC;
+		dst_reg->subprog = insn[1].imm;
+		return 0;
+	}
+
 	map = env->used_maps[aux->map_index];
 	mark_reg_known_zero(env, regs, insn->dst_reg);
 	dst_reg->map_ptr = map;
@@ -8195,9 +8361,23 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
 
 	/* All non-branch instructions have a single fall-through edge. */
 	if (BPF_CLASS(insns[t].code) != BPF_JMP &&
-	    BPF_CLASS(insns[t].code) != BPF_JMP32)
+	    BPF_CLASS(insns[t].code) != BPF_JMP32 &&
+	    !bpf_pseudo_func(insns + t))
 		return push_insn(t, t + 1, FALLTHROUGH, env, false);
 
+	if (bpf_pseudo_func(insns + t)) {
+		ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
+		if (ret)
+			return ret;
+
+		if (t + 1 < insn_cnt)
+			init_explored_state(env, t + 1);
+		init_explored_state(env, t);
+		ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
+				env, false);
+		return ret;
+	}
+
 	switch (BPF_OP(insns[t].code)) {
 	case BPF_EXIT:
 		return DONE_EXPLORING;
@@ -8819,6 +8999,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 			 */
 			return false;
 		}
+	case PTR_TO_MAP_KEY:
 	case PTR_TO_MAP_VALUE:
 		/* If the new min/max/var_off satisfy the old ones and
 		 * everything else matches, we are OK.
@@ -9646,6 +9827,8 @@ static int do_check(struct bpf_verifier_env *env)
 
 			env->jmps_processed++;
 			if (opcode == BPF_CALL) {
+				bool map_elem_callback;
+
 				if (BPF_SRC(insn->code) != BPF_K ||
 				    insn->off != 0 ||
 				    (insn->src_reg != BPF_REG_0 &&
@@ -9662,13 +9845,15 @@ static int do_check(struct bpf_verifier_env *env)
 					verbose(env, "function calls are not allowed while holding a lock\n");
 					return -EINVAL;
 				}
+				map_elem_callback = insn->src_reg != BPF_PSEUDO_CALL &&
+						   insn->imm == BPF_FUNC_for_each_map_elem;
 				if (insn->src_reg == BPF_PSEUDO_CALL)
 					err = check_func_call(env, insn, &env->insn_idx);
 				else
-					err = check_helper_call(env, insn->imm, env->insn_idx);
+					err = check_helper_call(env, insn->imm, &env->insn_idx,
+								map_elem_callback);
 				if (err)
 					return err;
-
 			} else if (opcode == BPF_JA) {
 				if (BPF_SRC(insn->code) != BPF_K ||
 				    insn->imm != 0 ||
@@ -10090,6 +10275,12 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
 				goto next_insn;
 			}
 
+			if (insn[0].src_reg == BPF_PSEUDO_FUNC) {
+				aux = &env->insn_aux_data[i];
+				aux->ptr_type = PTR_TO_FUNC;
+				goto next_insn;
+			}
+
 			/* In final convert_pseudo_ld_imm64() step, this is
 			 * converted into regular 64-bit imm load insn.
 			 */
@@ -10222,9 +10413,13 @@ static void convert_pseudo_ld_imm64(struct bpf_verifier_env *env)
 	int insn_cnt = env->prog->len;
 	int i;
 
-	for (i = 0; i < insn_cnt; i++, insn++)
-		if (insn->code == (BPF_LD | BPF_IMM | BPF_DW))
-			insn->src_reg = 0;
+	for (i = 0; i < insn_cnt; i++, insn++) {
+		if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
+			continue;
+		if (insn->src_reg == BPF_PSEUDO_FUNC)
+			continue;
+		insn->src_reg = 0;
+	}
 }
 
 /* single env->prog->insni[off] instruction was replaced with the range
@@ -10846,6 +11041,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		return 0;
 
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
+		if (bpf_pseudo_func(insn)) {
+			env->insn_aux_data[i].call_imm = insn->imm;
+			/* subprog is encoded in insn[1].imm */
+			continue;
+		}
+
 		if (!bpf_pseudo_call(insn))
 			continue;
 		/* Upon error here we cannot fall back to interpreter but
@@ -10975,6 +11176,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	for (i = 0; i < env->subprog_cnt; i++) {
 		insn = func[i]->insnsi;
 		for (j = 0; j < func[i]->len; j++, insn++) {
+			if (bpf_pseudo_func(insn)) {
+				subprog = insn[1].imm;
+				insn[0].imm = (u32)(long)func[subprog]->bpf_func;
+				insn[1].imm = ((u64)(long)func[subprog]->bpf_func) >> 32;
+				continue;
+			}
 			if (!bpf_pseudo_call(insn))
 				continue;
 			subprog = insn->off;
@@ -11020,6 +11227,11 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	 * later look the same as if they were interpreted only.
 	 */
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
+		if (bpf_pseudo_func(insn)) {
+			insn[0].imm = env->insn_aux_data[i].call_imm;
+			insn[1].imm = find_subprog(env, i + insn[0].imm + 1);
+			continue;
+		}
 		if (!bpf_pseudo_call(insn))
 			continue;
 		insn->off = env->insn_aux_data[i].call_imm;
@@ -11083,6 +11295,13 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 		verbose(env, "tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls\n");
 		return -EINVAL;
 	}
+	if (env->subprog_cnt > 1 && env->prog->aux->with_callback_fn) {
+		/* When JIT fails the progs with callback calls
+		 * have to be rejected, since interpreter doesn't support them yet.
+		 */
+		verbose(env, "callbacks are not allowed in non-JITed programs\n");
+		return -EINVAL;
+	}
 	for (i = 0; i < prog->len; i++, insn++) {
 		if (!bpf_pseudo_call(insn))
 			continue;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..8338333bfeb0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1366,6 +1366,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_per_cpu_ptr_proto;
 	case BPF_FUNC_this_cpu_ptr:
 		return &bpf_this_cpu_ptr_proto;
+	case BPF_FUNC_for_each_map_elem:
+		return &bpf_for_each_map_elem_proto;
 	default:
 		return NULL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c001766adcbc..d55bd4557376 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -393,6 +393,15 @@ enum bpf_link_type {
  *                   is struct/union.
  */
 #define BPF_PSEUDO_BTF_ID	3
+/* insn[0].src_reg:  BPF_PSEUDO_FUNC
+ * insn[0].imm:      insn offset to the func
+ * insn[1].imm:      0
+ * insn[0].off:      0
+ * insn[1].off:      0
+ * ldimm64 rewrite:  address of the function
+ * verifier type:    PTR_TO_FUNC.
+ */
+#define BPF_PSEUDO_FUNC		4
 
 /* when bpf_call->src_reg == BPF_PSEUDO_CALL, bpf_call->imm == pc-relative
  * offset to another bpf function
@@ -3836,6 +3845,24 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		For each element in **map**, call **callback_fn** function with
+ *		**map**, **callback_ctx** and other map-specific parameters.
+ *		For example, for hash and array maps, the callback signature can
+ *		be `u64 callback_fn(map, map_key, map_value, callback_ctx)`.
+ *		The **callback_fn** should be a static function and
+ *		the **callback_ctx** should be a pointer to the stack.
+ *		The **flags** is used to control certain aspects of the helper.
+ *		Currently, the **flags** must be 0.
+ *
+ *		If **callback_fn** return 0, the helper will continue to the next
+ *		element. If return value is 1, the helper will skip the rest of
+ *		elements and return. Other return values are not used now.
+ *	Return
+ *		0 for success, **-EINVAL** for invalid **flags** or unsupported
+ *		**callback_fn** return value.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4001,6 +4028,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(for_each_map_elem),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.24.1



* [PATCH bpf-next 3/8] bpf: add hashtab support for bpf_for_each_map_elem() helper
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
  2021-02-04 23:48 ` [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function Yonghong Song
  2021-02-04 23:48 ` [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-05  6:23   ` Alexei Starovoitov
  2021-02-04 23:48 ` [PATCH bpf-next 4/8] bpf: add arraymap " Yonghong Song
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

This patch adds support for hashmap, percpu hashmap,
lru hashmap and percpu lru hashmap.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h   |  4 +++
 kernel/bpf/hashtab.c  | 57 +++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c | 23 +++++++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c8b72ae16cc5..31e0447cadd8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1396,6 +1396,10 @@ void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux,
 int bpf_iter_map_fill_link_info(const struct bpf_iter_aux_info *aux,
 				struct bpf_link_info *info);
 
+int map_set_for_each_callback_args(struct bpf_verifier_env *env,
+				   struct bpf_func_state *caller,
+				   struct bpf_func_state *callee);
+
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c1ac7f964bc9..40f5404cfb01 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1869,6 +1869,55 @@ static const struct bpf_iter_seq_info iter_seq_info = {
 	.seq_priv_size		= sizeof(struct bpf_iter_seq_hash_map_info),
 };
 
+static int bpf_for_each_hash_elem(struct bpf_map *map, void *callback_fn,
+				  void *callback_ctx, u64 flags)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_nulls_head *head;
+	struct hlist_nulls_node *n;
+	struct htab_elem *elem;
+	u32 roundup_key_size;
+	void __percpu *pptr;
+	struct bucket *b;
+	void *key, *val;
+	bool is_percpu;
+	long ret = 0;
+	int i;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	is_percpu = htab_is_percpu(htab);
+
+	roundup_key_size = round_up(map->key_size, 8);
+	for (i = 0; i < htab->n_buckets; i++) {
+		b = &htab->buckets[i];
+		rcu_read_lock();
+		head = &b->head;
+		hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
+			key = elem->key;
+			if (!is_percpu) {
+				val = elem->key + roundup_key_size;
+			} else {
+				/* current cpu value for percpu map */
+				pptr = htab_elem_get_ptr(elem, map->key_size);
+				val = this_cpu_ptr(pptr);
+			}
+			ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
+					(u64)(long)key, (u64)(long)val,
+					(u64)(long)callback_ctx, 0);
+			if (ret) {
+				rcu_read_unlock();
+				ret = (ret == 1) ? 0 : -EINVAL;
+				goto out;
+			}
+		}
+		rcu_read_unlock();
+	}
+out:
+	return ret;
+}
+
 static int htab_map_btf_id;
 const struct bpf_map_ops htab_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
@@ -1881,6 +1930,8 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_delete_elem = htab_map_delete_elem,
 	.map_gen_lookup = htab_map_gen_lookup,
 	.map_seq_show_elem = htab_map_seq_show_elem,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_hash_elem,
 	BATCH_OPS(htab),
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_map_btf_id,
@@ -1900,6 +1951,8 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_delete_elem = htab_lru_map_delete_elem,
 	.map_gen_lookup = htab_lru_map_gen_lookup,
 	.map_seq_show_elem = htab_map_seq_show_elem,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_hash_elem,
 	BATCH_OPS(htab_lru),
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_lru_map_btf_id,
@@ -2019,6 +2072,8 @@ const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_update_elem = htab_percpu_map_update_elem,
 	.map_delete_elem = htab_map_delete_elem,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_hash_elem,
 	BATCH_OPS(htab_percpu),
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_percpu_map_btf_id,
@@ -2036,6 +2091,8 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_update_elem = htab_lru_percpu_map_update_elem,
 	.map_delete_elem = htab_lru_map_delete_elem,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_hash_elem,
 	BATCH_OPS(htab_lru_percpu),
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_lru_percpu_map_btf_id,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 050b067a0be6..32c8dcc27da8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4987,6 +4987,29 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	return 0;
 }
 
+int map_set_for_each_callback_args(struct bpf_verifier_env *env,
+				   struct bpf_func_state *caller,
+				   struct bpf_func_state *callee)
+{
+	/* pointer to map */
+	callee->regs[BPF_REG_1] = caller->regs[BPF_REG_1];
+
+	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].map_ptr = caller->regs[BPF_REG_1].map_ptr;
+
+	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
+	callee->regs[BPF_REG_3].map_ptr = caller->regs[BPF_REG_1].map_ptr;
+
+	/* pointer to stack or null */
+	callee->regs[BPF_REG_4] = caller->regs[BPF_REG_3];
+
+	/* unused */
+	__mark_reg_unknown(env, &callee->regs[BPF_REG_5]);
+	return 0;
+}
+
 static int check_map_elem_callback(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
-- 
2.24.1



* [PATCH bpf-next 4/8] bpf: add arraymap support for bpf_for_each_map_elem() helper
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
                   ` (2 preceding siblings ...)
  2021-02-04 23:48 ` [PATCH bpf-next 3/8] bpf: add hashtab support for " Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-04 23:48 ` [PATCH bpf-next 5/8] libbpf: support local function pointer relocation Yonghong Song
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

This patch adds support for arraymap and percpu arraymap.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/arraymap.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1f8453343bf2..af0496b9981a 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -625,6 +625,38 @@ static const struct bpf_iter_seq_info iter_seq_info = {
 	.seq_priv_size		= sizeof(struct bpf_iter_seq_array_map_info),
 };
 
+static int bpf_for_each_array_elem(struct bpf_map *map, void *callback_fn,
+				   void *callback_ctx, u64 flags)
+{
+	struct bpf_array *array;
+	bool is_percpu;
+	u32 i, index;
+	long ret = 0;
+	void *val;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	is_percpu = map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
+	array = container_of(map, struct bpf_array, map);
+	for (i = 0; i < map->max_entries; i++) {
+		index = i & array->index_mask;
+		if (is_percpu)
+			val = this_cpu_ptr(array->pptrs[i]);
+		else
+			val = array->value + array->elem_size * i;
+		ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
+					(u64)(long)&index, (u64)(long)val,
+					(u64)(long)callback_ctx, 0);
+		if (ret) {
+			ret = (ret == 1) ? 0 : -EINVAL;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 static int array_map_btf_id;
 const struct bpf_map_ops array_map_ops = {
 	.map_meta_equal = array_map_meta_equal,
@@ -643,6 +675,8 @@ const struct bpf_map_ops array_map_ops = {
 	.map_check_btf = array_map_check_btf,
 	.map_lookup_batch = generic_map_lookup_batch,
 	.map_update_batch = generic_map_update_batch,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_array_elem,
 	.map_btf_name = "bpf_array",
 	.map_btf_id = &array_map_btf_id,
 	.iter_seq_info = &iter_seq_info,
@@ -660,6 +694,8 @@ const struct bpf_map_ops percpu_array_map_ops = {
 	.map_delete_elem = array_map_delete_elem,
 	.map_seq_show_elem = percpu_array_map_seq_show_elem,
 	.map_check_btf = array_map_check_btf,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_for_each_array_elem,
 	.map_btf_name = "bpf_array",
 	.map_btf_id = &percpu_array_map_btf_id,
 	.iter_seq_info = &iter_seq_info,
-- 
2.24.1



* [PATCH bpf-next 5/8] libbpf: support local function pointer relocation
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
                   ` (3 preceding siblings ...)
  2021-02-04 23:48 ` [PATCH bpf-next 4/8] bpf: add arraymap " Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-08 18:52   ` Andrii Nakryiko
  2021-02-04 23:48 ` [PATCH bpf-next 6/8] bpftool: print local function pointer properly Yonghong Song
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

A new relocation, RELO_LOCAL_FUNC, is added to capture
local (static) function pointers loaded with ld_imm64
insns. Such ld_imm64 insns are marked with
BPF_PSEUDO_FUNC and are passed to the kernel so the
kernel can replace them with the proper actual jited
function addresses.
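
For reference (illustrative snippet; "my_map" and "cb" are made-up
names), such an ld_imm64 is produced whenever a bpf program passes a
static function to the helper:

  static long cb(struct bpf_map *map, __u32 *key, __u64 *val, void *ctx)
  {
          return 0;
  }

  ...
  bpf_for_each_map_elem(&my_map, cb, NULL, 0);

The reference to cb is compiled into an ld_imm64 with a relocation
against the .text section; libbpf rewrites the insn to BPF_PSEUDO_FUNC
and fixes up the subprog instruction offset, and the kernel later
replaces it with the jited function address.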

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/libbpf.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abbc3800568..a5146c9e3e06 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -188,6 +188,7 @@ enum reloc_type {
 	RELO_CALL,
 	RELO_DATA,
 	RELO_EXTERN,
+	RELO_LOCAL_FUNC,
 };
 
 struct reloc_desc {
@@ -574,6 +575,12 @@ static bool insn_is_subprog_call(const struct bpf_insn *insn)
 	       insn->off == 0;
 }
 
+static bool insn_is_pseudo_func(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
+	       insn->src_reg == BPF_PSEUDO_FUNC;
+}
+
 static int
 bpf_object__init_prog(struct bpf_object *obj, struct bpf_program *prog,
 		      const char *name, size_t sec_idx, const char *sec_name,
@@ -3395,6 +3402,16 @@ static int bpf_program__record_reloc(struct bpf_program *prog,
 		return 0;
 	}
 
+	if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
+	    GELF_ST_BIND(sym->st_info) == STB_LOCAL &&
+	    GELF_ST_TYPE(sym->st_info) == STT_SECTION &&
+	    shdr_idx == obj->efile.text_shndx) {
+		reloc_desc->type = RELO_LOCAL_FUNC;
+		reloc_desc->insn_idx = insn_idx;
+		reloc_desc->sym_off = sym->st_value;
+		return 0;
+	}
+
 	if (insn->code != (BPF_LD | BPF_IMM | BPF_DW)) {
 		pr_warn("prog '%s': invalid relo against '%s' for insns[%d].code 0x%x\n",
 			prog->name, sym_name, insn_idx, insn->code);
@@ -6172,6 +6189,9 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)
 			}
 			relo->processed = true;
 			break;
+		case RELO_LOCAL_FUNC:
+			insn[0].src_reg = BPF_PSEUDO_FUNC;
+			/* fallthrough */
 		case RELO_CALL:
 			/* will be handled as a follow up pass */
 			break;
@@ -6358,11 +6378,11 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
 
 	for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {
 		insn = &main_prog->insns[prog->sub_insn_off + insn_idx];
-		if (!insn_is_subprog_call(insn))
+		if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))
 			continue;
 
 		relo = find_prog_insn_relo(prog, insn_idx);
-		if (relo && relo->type != RELO_CALL) {
+		if (relo && relo->type != RELO_CALL && relo->type != RELO_LOCAL_FUNC) {
 			pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",
 				prog->name, insn_idx, relo->type);
 			return -LIBBPF_ERRNO__RELOC;
@@ -6374,8 +6394,15 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
 			 * call always has imm = -1, but for static functions
 			 * relocation is against STT_SECTION and insn->imm
 			 * points to a start of a static function
+			 *
+			 * for local func relocation, the imm field encodes
+			 * the byte offset in the corresponding section.
 			 */
-			sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
+			if (relo->type == RELO_CALL)
+				sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
+			else
+				sub_insn_idx = relo->sym_off / BPF_INSN_SZ +
+					       insn->imm / BPF_INSN_SZ + 1;
 		} else {
 			/* if subprogram call is to a static function within
 			 * the same ELF section, there won't be any relocation
-- 
2.24.1



* [PATCH bpf-next 6/8] bpftool: print local function pointer properly
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
                   ` (4 preceding siblings ...)
  2021-02-04 23:48 ` [PATCH bpf-next 5/8] libbpf: support local function pointer relocation Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-08 18:22   ` Andrii Nakryiko
  2021-02-04 23:48 ` [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper Yonghong Song
  2021-02-04 23:48 ` [PATCH bpf-next 8/8] selftests/bpf: add arraymap " Yonghong Song
  7 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

With the later hashmap example, the bpftool xlated output may
look like:
  int dump_task(struct bpf_iter__task * ctx):
  ; struct task_struct *task = ctx->task;
     0: (79) r2 = *(u64 *)(r1 +8)
  ; if (task == (void *)0 || called > 0)
  ...
    19: (18) r2 = subprog[+18]
    30: (18) r2 = subprog[+26]
  ...
  36: (95) exit
  __u64 check_hash_elem(struct bpf_map * map, __u32 * key, __u64 * val,
                        struct callback_ctx * data):
  ; struct bpf_iter__task *ctx = data->ctx;
    37: (79) r5 = *(u64 *)(r4 +0)
  ...
    55: (95) exit
  __u64 check_percpu_elem(struct bpf_map * map, __u32 * key,
                          __u64 * val, void * unused):
  ; check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val, void *unused)
    56: (bf) r6 = r3
  ...
    83: (18) r2 = subprog[+-46]

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/xlated_dumper.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/bpf/bpftool/xlated_dumper.c b/tools/bpf/bpftool/xlated_dumper.c
index 8608cd68cdd0..7bdd90503727 100644
--- a/tools/bpf/bpftool/xlated_dumper.c
+++ b/tools/bpf/bpftool/xlated_dumper.c
@@ -196,6 +196,9 @@ static const char *print_imm(void *private_data,
 	else if (insn->src_reg == BPF_PSEUDO_MAP_VALUE)
 		snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
 			 "map[id:%u][0]+%u", insn->imm, (insn + 1)->imm);
+	else if (insn->src_reg == BPF_PSEUDO_FUNC)
+		snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
+			 "subprog[+%d]", insn->imm + 1);
 	else
 		snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
 			 "0x%llx", (unsigned long long)full_imm);
-- 
2.24.1



* [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
                   ` (5 preceding siblings ...)
  2021-02-04 23:48 ` [PATCH bpf-next 6/8] bpftool: print local function pointer properly Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-08 18:34   ` Andrii Nakryiko
  2021-02-04 23:48 ` [PATCH bpf-next 8/8] selftests/bpf: add arraymap " Yonghong Song
  7 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

A test case is added for hashmap and percpu hashmap. The test
also exercises nested bpf_for_each_map_elem() calls, like:
    bpf_prog:
      bpf_for_each_map_elem(func1)
    func1:
      bpf_for_each_map_elem(func2)
    func2:

  $ ./test_progs -n 44
  #44/1 hash_map:OK
  #44 for_each:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/prog_tests/for_each.c       |  91 ++++++++++++++++
 .../bpf/progs/for_each_hash_map_elem.c        | 103 ++++++++++++++++++
 2 files changed, 194 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each.c
 create mode 100644 tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c

diff --git a/tools/testing/selftests/bpf/prog_tests/for_each.c b/tools/testing/selftests/bpf/prog_tests/for_each.c
new file mode 100644
index 000000000000..7a399fbc89a4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/for_each.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "for_each_hash_map_elem.skel.h"
+
+static int duration;
+
+static void do_dummy_read(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	char buf[16] = {};
+	int iter_fd, len;
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (CHECK(IS_ERR(link), "attach_iter", "attach_iter failed\n"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (CHECK(iter_fd < 0, "create_iter", "create_iter failed\n"))
+		goto free_link;
+
+	/* not check contents, but ensure read() ends without error */
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+	CHECK(len < 0, "read", "read failed: %s\n", strerror(errno));
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+}
+
+static void test_hash_map(void)
+{
+	int i, hashmap_fd, percpu_map_fd, err;
+	struct for_each_hash_map_elem *skel;
+	__u64 *percpu_valbuf = NULL;
+	__u32 key, num_cpus;
+	__u64 val;
+
+	skel = for_each_hash_map_elem__open_and_load();
+	if (CHECK(!skel, "for_each_hash_map_elem__open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	hashmap_fd = bpf_map__fd(skel->maps.hashmap);
+	for (i = 0; i < bpf_map__max_entries(skel->maps.hashmap); i++) {
+		key = i;
+		val = i + 1;
+		err = bpf_map_update_elem(hashmap_fd, &key, &val, BPF_ANY);
+		if (CHECK(err, "map_update", "map_update failed\n"))
+			goto out;
+	}
+
+	num_cpus = bpf_num_possible_cpus();
+	percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
+	percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
+	if (CHECK_FAIL(!percpu_valbuf))
+		goto out;
+
+	key = 1;
+	for (i = 0; i < num_cpus; i++)
+		percpu_valbuf[i] = i + 1;
+	err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
+	if (CHECK(err, "percpu_map_update", "map_update failed\n"))
+		goto out;
+
+	do_dummy_read(skel->progs.dump_task);
+
+	ASSERT_EQ(skel->bss->called, 1, "called");
+	ASSERT_EQ(skel->bss->hashmap_output, 4, "output_val");
+
+	key = 1;
+	err = bpf_map_lookup_elem(hashmap_fd, &key, &val);
+	ASSERT_ERR(err, "hashmap_lookup");
+
+	ASSERT_EQ(skel->bss->percpu_called, 1, "percpu_called");
+	CHECK_FAIL(skel->bss->cpu >= num_cpus);
+	ASSERT_EQ(skel->bss->percpu_key, 1, "percpu_key");
+	ASSERT_EQ(skel->bss->percpu_val, skel->bss->cpu + 1, "percpu_val");
+	ASSERT_EQ(skel->bss->percpu_output, 100, "percpu_output");
+out:
+	free(percpu_valbuf);
+	for_each_hash_map_elem__destroy(skel);
+}
+
+void test_for_each(void)
+{
+	if (test__start_subtest("hash_map"))
+		test_hash_map();
+}
diff --git a/tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c b/tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c
new file mode 100644
index 000000000000..7808a5aa75e7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 3);
+	__type(key, __u32);
+	__type(value, __u64);
+} hashmap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} percpu_map SEC(".maps");
+
+struct callback_ctx {
+	struct bpf_iter__task *ctx;
+	int input;
+	int output;
+};
+
+static __u64
+check_hash_elem(struct bpf_map *map, __u32 *key, __u64 *val,
+		struct callback_ctx *data)
+{
+	struct bpf_iter__task *ctx = data->ctx;
+	__u32 k;
+	__u64 v;
+
+	if (ctx) {
+		k = *key;
+		v = *val;
+		if (ctx->meta->seq_num == 10 && k == 10 && v == 10)
+			data->output = 3; /* impossible path */
+		else
+			data->output = 4;
+	} else {
+		data->output = data->input;
+		bpf_map_delete_elem(map, key);
+	}
+
+	return 0;
+}
+
+__u32 cpu = 0;
+__u32 percpu_called = 0;
+__u32 percpu_key = 0;
+__u64 percpu_val = 0;
+
+static __u64
+check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val,
+		  struct callback_ctx *data)
+{
+	percpu_called++;
+	cpu = bpf_get_smp_processor_id();
+	percpu_key = *key;
+	percpu_val = *val;
+
+	bpf_for_each_map_elem(&hashmap, check_hash_elem, data, 0);
+	return 0;
+}
+
+int called = 0;
+int hashmap_output = 0;
+int percpu_output = 0;
+
+SEC("iter/task")
+int dump_task(struct bpf_iter__task *ctx)
+{
+	struct seq_file *seq =  ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+	struct callback_ctx data;
+	int ret;
+
+	/* only call once since we will delete map elements */
+	if (task == (void *)0 || called > 0)
+		return 0;
+
+	called++;
+
+	data.ctx = ctx;
+	data.input = task->tgid;
+	data.output = 0;
+	ret = bpf_for_each_map_elem(&hashmap, check_hash_elem, &data, 0);
+	if (ret)
+		return 0;
+
+	hashmap_output = data.output;
+
+	data.ctx = 0;
+	data.input = 100;
+	data.output = 0;
+	bpf_for_each_map_elem(&percpu_map, check_percpu_elem, &data, 0);
+	percpu_output = data.output;
+
+	return 0;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 8/8] selftests/bpf: add arraymap test for bpf_for_each_map_elem() helper
  2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
                   ` (6 preceding siblings ...)
  2021-02-04 23:48 ` [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper Yonghong Song
@ 2021-02-04 23:48 ` Yonghong Song
  2021-02-08 18:35   ` Andrii Nakryiko
  7 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-04 23:48 UTC (permalink / raw)
  To: bpf; +Cc: Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

A test is added for arraymap and percpu arraymap. The test also
exercises the early return from the callback, in which case the
helper does not traverse all elements.
    $ ./test_progs -n 44
    #44/1 hash_map:OK
    #44/2 array_map:OK
    #44 for_each:OK
    Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../selftests/bpf/prog_tests/for_each.c       | 54 ++++++++++++++
 .../bpf/progs/for_each_array_map_elem.c       | 71 +++++++++++++++++++
 2 files changed, 125 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/for_each_array_map_elem.c

diff --git a/tools/testing/selftests/bpf/prog_tests/for_each.c b/tools/testing/selftests/bpf/prog_tests/for_each.c
index 7a399fbc89a4..58074212875b 100644
--- a/tools/testing/selftests/bpf/prog_tests/for_each.c
+++ b/tools/testing/selftests/bpf/prog_tests/for_each.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2021 Facebook */
 #include <test_progs.h>
 #include "for_each_hash_map_elem.skel.h"
+#include "for_each_array_map_elem.skel.h"
 
 static int duration;
 
@@ -84,8 +85,61 @@ static void test_hash_map(void)
 	for_each_hash_map_elem__destroy(skel);
 }
 
+static void test_array_map(void)
+{
+	int i, arraymap_fd, percpu_map_fd, err;
+	struct for_each_array_map_elem *skel;
+	__u32 key, num_cpus, max_entries;
+	__u64 *percpu_valbuf = NULL;
+	__u64 val, expected_total;
+
+	skel = for_each_array_map_elem__open_and_load();
+	if (CHECK(!skel, "for_each_array_map_elem__open_and_load",
+		  "skeleton open_and_load failed\n"))
+		return;
+
+	arraymap_fd = bpf_map__fd(skel->maps.arraymap);
+	expected_total = 0;
+	max_entries = bpf_map__max_entries(skel->maps.arraymap);
+	for (i = 0; i < max_entries; i++) {
+		key = i;
+		val = i + 1;
+		/* skip the last iteration for expected total */
+		if (i != max_entries - 1)
+			expected_total += val;
+		err = bpf_map_update_elem(arraymap_fd, &key, &val, BPF_ANY);
+		if (CHECK(err, "map_update", "map_update failed\n"))
+			goto out;
+	}
+
+	num_cpus = bpf_num_possible_cpus();
+	percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
+	percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
+	if (CHECK_FAIL(!percpu_valbuf))
+		goto out;
+
+	key = 0;
+	for (i = 0; i < num_cpus; i++)
+		percpu_valbuf[i] = i + 1;
+	err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
+	if (CHECK(err, "percpu_map_update", "map_update failed\n"))
+		goto out;
+
+	do_dummy_read(skel->progs.dump_task);
+
+	ASSERT_EQ(skel->bss->called, 1, "called");
+	ASSERT_EQ(skel->bss->arraymap_output, expected_total, "array_output");
+	ASSERT_EQ(skel->bss->cpu + 1, skel->bss->percpu_val, "percpu_val");
+
+out:
+	free(percpu_valbuf);
+	for_each_array_map_elem__destroy(skel);
+}
+
 void test_for_each(void)
 {
 	if (test__start_subtest("hash_map"))
 		test_hash_map();
+	if (test__start_subtest("array_map"))
+		test_array_map();
 }
diff --git a/tools/testing/selftests/bpf/progs/for_each_array_map_elem.c b/tools/testing/selftests/bpf/progs/for_each_array_map_elem.c
new file mode 100644
index 000000000000..f1f14dcd6e68
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/for_each_array_map_elem.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 3);
+	__type(key, __u32);
+	__type(value, __u64);
+} arraymap SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} percpu_map SEC(".maps");
+
+struct callback_ctx {
+	int output;
+};
+
+static __u64
+check_array_elem(struct bpf_map *map, __u32 *key, __u64 *val,
+		 struct callback_ctx *data)
+{
+	data->output += *val;
+	if (*key == 1)
+		return 1; /* stop the iteration */
+	return 0;
+}
+
+__u32 cpu = 0;
+__u64 percpu_val = 0;
+
+static __u64
+check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val,
+		  struct callback_ctx *data)
+{
+	cpu = bpf_get_smp_processor_id();
+	percpu_val = *val;
+	return 0;
+}
+
+u32 called = 0;
+u32 arraymap_output = 0;
+
+SEC("iter/task")
+int dump_task(struct bpf_iter__task *ctx)
+{
+	struct seq_file *seq =  ctx->meta->seq;
+	struct task_struct *task = ctx->task;
+	struct callback_ctx data;
+
+	/* only call once */
+	if (called > 0)
+		return 0;
+
+	called++;
+
+	data.output = 0;
+	bpf_for_each_map_elem(&arraymap, check_array_elem, &data, 0);
+	arraymap_output = data.output;
+
+	bpf_for_each_map_elem(&percpu_map, check_percpu_elem, 0, 0);
+
+	return 0;
+}
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper
  2021-02-04 23:48 ` [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
@ 2021-02-05  5:49   ` Alexei Starovoitov
  2021-02-05 17:39     ` Yonghong Song
  2021-02-08 18:16   ` Andrii Nakryiko
  1 sibling, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-02-05  5:49 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

On Thu, Feb 04, 2021 at 03:48:29PM -0800, Yonghong Song wrote:
> The bpf_for_each_map_elem() helper is introduced which
> iterates all map elements with a callback function. The
> helper signature looks like
>   long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
> and for each map element, the callback_fn will be called. For example,
> like hashmap, the callback signature may look like
>   long callback_fn(map, key, val, callback_ctx)
> 
> There are two known use cases for this. One is from upstream ([1]) where
> a for_each_map_elem helper may help implement a timeout mechanism
> in a more generic way. Another is from our internal discussion
> for a firewall use case where a map contains all the rules. The packet
> data can be compared to all these rules to decide allow or deny
> the packet.
> 
> For array maps, users can already use a bounded loop to traverse
> elements. Using this helper can avoid using bounded loop. For other
> type of maps (e.g., hash maps) where bounded loop is hard or
> impossible to use, this helper provides a convenient way to
> operate on all elements.
> 
> For callback_fn, besides map and map element, a callback_ctx,
> allocated on caller stack, is also passed to the callback
> function. This callback_ctx argument can provide additional
> input and allow to write to caller stack for output.
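
To make the calling convention concrete, here is a minimal BPF-side sketch
(illustrative names; the selftest patches later in the series contain the
complete versions):

#include "bpf_iter.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 3);
	__type(key, __u32);
	__type(value, __u64);
} hashmap SEC(".maps");

struct callback_ctx {
	__u64 output;
};

/* called once per element; return 0 to continue, 1 to stop early */
static __u64
sum_elem(struct bpf_map *map, __u32 *key, __u64 *val,
	 struct callback_ctx *data)
{
	data->output += *val;
	return 0;
}

SEC("iter/task")
int sum_prog(struct bpf_iter__task *ctx)
{
	struct callback_ctx data = { .output = 0 };

	bpf_for_each_map_elem(&hashmap, sum_elem, &data, 0);
	return 0;
}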

The approach and implementation look great!
Few ideas below:

> +static int check_map_elem_callback(struct bpf_verifier_env *env, int *insn_idx)
> +{
> +	struct bpf_verifier_state *state = env->cur_state;
> +	struct bpf_prog_aux *aux = env->prog->aux;
> +	struct bpf_func_state *caller, *callee;
> +	struct bpf_map *map;
> +	int err, subprog;
> +
> +	if (state->curframe + 1 >= MAX_CALL_FRAMES) {
> +		verbose(env, "the call stack of %d frames is too deep\n",
> +			state->curframe + 2);
> +		return -E2BIG;
> +	}
> +
> +	caller = state->frame[state->curframe];
> +	if (state->frame[state->curframe + 1]) {
> +		verbose(env, "verifier bug. Frame %d already allocated\n",
> +			state->curframe + 1);
> +		return -EFAULT;
> +	}
> +
> +	caller->with_callback_fn = true;
> +
> +	callee = kzalloc(sizeof(*callee), GFP_KERNEL);
> +	if (!callee)
> +		return -ENOMEM;
> +	state->frame[state->curframe + 1] = callee;
> +
> +	/* callee cannot access r0, r6 - r9 for reading and has to write
> +	 * into its own stack before reading from it.
> +	 * callee can read/write into caller's stack
> +	 */
> +	init_func_state(env, callee,
> +			/* remember the callsite, it will be used by bpf_exit */
> +			*insn_idx /* callsite */,
> +			state->curframe + 1 /* frameno within this callchain */,
> +			subprog /* subprog number within this prog */);
> +
> +	/* Transfer references to the callee */
> +	err = transfer_reference_state(callee, caller);
> +	if (err)
> +		return err;
> +
> +	subprog = caller->regs[BPF_REG_2].subprog;
> +	if (aux->func_info && aux->func_info_aux[subprog].linkage != BTF_FUNC_STATIC) {
> +		verbose(env, "callback function R2 not static\n");
> +		return -EINVAL;
> +	}
> +
> +	map = caller->regs[BPF_REG_1].map_ptr;

Take a look at for (i = 0; i < 5; i++)  err = check_func_arg loop and record_func_map.
It stores the map pointer into map_ptr_state and makes sure it's unique,
so that program doesn't try to pass two different maps into the same 'call insn'.
It can make this function a bit more generic.
There would be no need to hard code regs[BPF_REG_1].
The code would take it from map_ptr_state.
Also it will help later with optimizing
  return map->ops->map_for_each_callback(map, callback_fn, callback_ctx, flags);
since the map pointer will be the same the optimization (that is applied to other map
operations) can be applied for this callback as well.
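
For example, something along these lines (sketch only; meta.map_ptr is what
check_func_arg() records for the ARG_CONST_MAP_PTR argument, the exact
plumbing is up to the next revision):

	/* after the check_func_arg() loop and record_func_map() */
	map = meta.map_ptr;
	if (!map) {
		verbose(env, "kernel subsystem misconfigured verifier\n");
		return -EINVAL;
	}
	if (!map->ops->map_set_for_each_callback_args ||
	    !map->ops->map_for_each_callback) {
		verbose(env, "callback function not allowed for this map\n");
		return -ENOTSUPP;
	}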

The regs[BPF_REG_2] can be generalized a bit as well.
I think the linkage != BTF_FUNC_STATIC check can be moved to the early check_ld_imm phase.
While here, the check_func_arg() loop can look for the PTR_TO_FUNC type,
remember the subprog into meta (just like map_ptr_state) and ... continues below

> +	if (!map->ops->map_set_for_each_callback_args ||
> +	    !map->ops->map_for_each_callback) {
> +		verbose(env, "callback function not allowed for map R1\n");
> +		return -ENOTSUPP;
> +	}
> +
> +	/* the following is only for hashmap, different maps
> +	 * can have different callback signatures.
> +	 */
> +	err = map->ops->map_set_for_each_callback_args(env, caller, callee);
> +	if (err)
> +		return err;
> +
> +	clear_caller_saved_regs(env, caller->regs);
> +
> +	/* only increment it after check_reg_arg() finished */
> +	state->curframe++;
> +
> +	/* and go analyze first insn of the callee */
> +	*insn_idx = env->subprog_info[subprog].start - 1;
> +
> +	if (env->log.level & BPF_LOG_LEVEL) {
> +		verbose(env, "caller:\n");
> +		print_verifier_state(env, caller);
> +		verbose(env, "callee:\n");
> +		print_verifier_state(env, callee);
> +	}
> +	return 0;
> +}
> +
>  static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>  {
>  	struct bpf_verifier_state *state = env->cur_state;
>  	struct bpf_func_state *caller, *callee;
>  	struct bpf_reg_state *r0;
> -	int err;
> +	int i, err;
>  
>  	callee = state->frame[state->curframe];
>  	r0 = &callee->regs[BPF_REG_0];
> @@ -4955,7 +5090,17 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>  	state->curframe--;
>  	caller = state->frame[state->curframe];
>  	/* return to the caller whatever r0 had in the callee */
> -	caller->regs[BPF_REG_0] = *r0;
> +	if (caller->with_callback_fn) {
> +		/* reset caller saved regs, the helper calling callback_fn
> +		 * has RET_INTEGER return types.
> +		 */
> +		for (i = 0; i < CALLER_SAVED_REGS; i++)
> +			mark_reg_not_init(env, caller->regs, caller_saved[i]);
> +		caller->regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG;
> +		mark_reg_unknown(env, caller->regs, BPF_REG_0);

this part can stay in check_helper_call().

> +	} else {
> +		caller->regs[BPF_REG_0] = *r0;
> +	}
>  
>  	/* Transfer references to the caller */
>  	err = transfer_reference_state(caller, callee);
> @@ -5091,7 +5236,8 @@ static int check_reference_leak(struct bpf_verifier_env *env)
>  	return state->acquired_refs ? -EINVAL : 0;
>  }
>  
> -static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
> +static int check_helper_call(struct bpf_verifier_env *env, int func_id, int *insn_idx,
> +			     bool map_elem_callback)
>  {
>  	const struct bpf_func_proto *fn = NULL;
>  	struct bpf_reg_state *regs;
> @@ -5151,11 +5297,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>  			return err;
>  	}
>  
> -	err = record_func_map(env, &meta, func_id, insn_idx);
> +	err = record_func_map(env, &meta, func_id, *insn_idx);
>  	if (err)
>  		return err;
>  
> -	err = record_func_key(env, &meta, func_id, insn_idx);
> +	err = record_func_key(env, &meta, func_id, *insn_idx);
>  	if (err)
>  		return err;
>  
> @@ -5163,7 +5309,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>  	 * is inferred from register state.
>  	 */
>  	for (i = 0; i < meta.access_size; i++) {
> -		err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B,
> +		err = check_mem_access(env, *insn_idx, meta.regno, i, BPF_B,
>  				       BPF_WRITE, -1, false);
>  		if (err)
>  			return err;
> @@ -5195,6 +5341,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>  		return -EINVAL;
>  	}
>  
> +	if (map_elem_callback) {
> +		env->prog->aux->with_callback_fn = true;
> +		return check_map_elem_callback(env, insn_idx);

Instead of returning early here,
the check_func_arg() loop can look for the PTR_TO_FUNC type,
then allocate the new callee state,
do map_set_for_each_callback_args() here,
and then proceed further.
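
I.e. roughly (sketch; the helper name and the extra meta argument are illustrative):

	/* in check_helper_call(), after the check_func_arg() loop */
	if (func_id == BPF_FUNC_for_each_map_elem) {
		/* allocate the callee frame, set up its regs via
		 * map->ops->map_set_for_each_callback_args() and point
		 * *insn_idx at the first insn of the callback
		 */
		err = check_map_elem_callback(env, insn_idx, &meta);
		if (err)
			return err;
	}
	/* then fall through to the existing caller-regs clearing and
	 * r0 marking below
	 */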

> +	}
> +
>  	/* reset caller saved regs */
>  	for (i = 0; i < CALLER_SAVED_REGS; i++) {
>  		mark_reg_not_init(env, regs, caller_saved[i]);

Instead of doing this loop in prepare_func_exit().
This code can just proceed here and clear caller regs. This loop can stay as-is.
The transfer of caller->callee would happen already.

Then there are a few lines here that the diff didn't show.
They do regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG and mark_reg_unknown.
No need to do them in prepare_func_exit().
This function can proceed further reusing this caller regs clearing loop and r0 marking.

Then before returning from check_helper_call()
it will do what you have in check_map_elem_callback() and it will adjust *insn_idx.

At this point caller would have regs cleared and r0=undef.
And callee would have regs setup the way map_set_for_each_callback_args callback meant to do it.
The only thing prepare_func_exit would need to do is to make sure that assignment:
caller->regs[BPF_REG_0] = *r0
doesn't happen. caller's r0 was already set to undef.
To achieve that I think would be a bit cleaner to mark callee state instead of caller state.
So instead of caller->with_callback_fn=true maybe callee->in_callback_fn=true ?
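
I.e. something like (sketch):

	/* in prepare_func_exit() */
	if (!callee->in_callback_fn)
		/* return to the caller whatever r0 had in the callee */
		caller->regs[BPF_REG_0] = *r0;
	/* for callback frames the caller's r0 was already marked
	 * unknown in check_helper_call()
	 */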

> @@ -5306,7 +5457,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>  		/* For release_reference() */
>  		regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id;
>  	} else if (is_acquire_function(func_id, meta.map_ptr)) {
> -		int id = acquire_reference_state(env, insn_idx);
> +		int id = acquire_reference_state(env, *insn_idx);
>  
>  		if (id < 0)
>  			return id;
> @@ -5448,6 +5599,14 @@ static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg,
>  		else
>  			*ptr_limit = -off;
>  		return 0;
> +	case PTR_TO_MAP_KEY:
> +		if (mask_to_left) {
> +			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
> +		} else {
> +			off = ptr_reg->smin_value + ptr_reg->off;
> +			*ptr_limit = ptr_reg->map_ptr->key_size - off;
> +		}
> +		return 0;
>  	case PTR_TO_MAP_VALUE:
>  		if (mask_to_left) {
>  			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
> @@ -5614,6 +5773,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
>  		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
>  			dst, reg_type_str[ptr_reg->type]);
>  		return -EACCES;
> +	case PTR_TO_MAP_KEY:
>  	case PTR_TO_MAP_VALUE:
>  		if (!env->allow_ptr_leaks && !known && (smin_val < 0) != (smax_val < 0)) {
>  			verbose(env, "R%d has unknown scalar with mixed signed bounds, pointer arithmetic with it prohibited for !root\n",
> @@ -7818,6 +7978,12 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
>  		return 0;
>  	}
>  
> +	if (insn->src_reg == BPF_PSEUDO_FUNC) {
> +		dst_reg->type = PTR_TO_FUNC;
> +		dst_reg->subprog = insn[1].imm;

Like here, the check for linkage==static can happen?

> +		return 0;
> +	}
> +
>  	map = env->used_maps[aux->map_index];
>  	mark_reg_known_zero(env, regs, insn->dst_reg);
>  	dst_reg->map_ptr = map;
> @@ -8195,9 +8361,23 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
>  
>  	/* All non-branch instructions have a single fall-through edge. */
>  	if (BPF_CLASS(insns[t].code) != BPF_JMP &&
> -	    BPF_CLASS(insns[t].code) != BPF_JMP32)
> +	    BPF_CLASS(insns[t].code) != BPF_JMP32 &&
> +	    !bpf_pseudo_func(insns + t))
>  		return push_insn(t, t + 1, FALLTHROUGH, env, false);
>  
> +	if (bpf_pseudo_func(insns + t)) {
> +		ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
> +		if (ret)
> +			return ret;
> +
> +		if (t + 1 < insn_cnt)
> +			init_explored_state(env, t + 1);
> +		init_explored_state(env, t);
> +		ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
> +				env, false);
> +		return ret;
> +	}
> +
>  	switch (BPF_OP(insns[t].code)) {
>  	case BPF_EXIT:
>  		return DONE_EXPLORING;
> @@ -8819,6 +8999,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
>  			 */
>  			return false;
>  		}
> +	case PTR_TO_MAP_KEY:
>  	case PTR_TO_MAP_VALUE:
>  		/* If the new min/max/var_off satisfy the old ones and
>  		 * everything else matches, we are OK.
> @@ -9646,6 +9827,8 @@ static int do_check(struct bpf_verifier_env *env)
>  
>  			env->jmps_processed++;
>  			if (opcode == BPF_CALL) {
> +				bool map_elem_callback;
> +
>  				if (BPF_SRC(insn->code) != BPF_K ||
>  				    insn->off != 0 ||
>  				    (insn->src_reg != BPF_REG_0 &&
> @@ -9662,13 +9845,15 @@ static int do_check(struct bpf_verifier_env *env)
>  					verbose(env, "function calls are not allowed while holding a lock\n");
>  					return -EINVAL;
>  				}
> +				map_elem_callback = insn->src_reg != BPF_PSEUDO_CALL &&
> +						   insn->imm == BPF_FUNC_for_each_map_elem;
>  				if (insn->src_reg == BPF_PSEUDO_CALL)
>  					err = check_func_call(env, insn, &env->insn_idx);
>  				else
> -					err = check_helper_call(env, insn->imm, env->insn_idx);
> +					err = check_helper_call(env, insn->imm, &env->insn_idx,
> +								map_elem_callback);

Then hopefully this extra 'map_elem_callback' boolean won't be needed.
Only the env->insn_idx -> &env->insn_idx change would remain.
In that sense check_helper_call will become a superset of check_func_call.
Maybe some code between them can be shared too.
Beyond bpf_for_each_map_elem() helper other helpers might use PTR_TO_FUNC.
I hope with this approach all of them will be handled a bit more generically.

>  				if (err)
>  					return err;
> -
>  			} else if (opcode == BPF_JA) {
>  				if (BPF_SRC(insn->code) != BPF_K ||
>  				    insn->imm != 0 ||
> @@ -10090,6 +10275,12 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
>  				goto next_insn;
>  			}
>  
> +			if (insn[0].src_reg == BPF_PSEUDO_FUNC) {
> +				aux = &env->insn_aux_data[i];
> +				aux->ptr_type = PTR_TO_FUNC;
> +				goto next_insn;
> +			}
> +
>  			/* In final convert_pseudo_ld_imm64() step, this is
>  			 * converted into regular 64-bit imm load insn.
>  			 */
> @@ -10222,9 +10413,13 @@ static void convert_pseudo_ld_imm64(struct bpf_verifier_env *env)
>  	int insn_cnt = env->prog->len;
>  	int i;
>  
> -	for (i = 0; i < insn_cnt; i++, insn++)
> -		if (insn->code == (BPF_LD | BPF_IMM | BPF_DW))
> -			insn->src_reg = 0;
> +	for (i = 0; i < insn_cnt; i++, insn++) {
> +		if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
> +			continue;
> +		if (insn->src_reg == BPF_PSEUDO_FUNC)
> +			continue;
> +		insn->src_reg = 0;
> +	}
>  }
>  
>  /* single env->prog->insni[off] instruction was replaced with the range
> @@ -10846,6 +11041,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  		return 0;
>  
>  	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
> +		if (bpf_pseudo_func(insn)) {
> +			env->insn_aux_data[i].call_imm = insn->imm;
> +			/* subprog is encoded in insn[1].imm */
> +			continue;
> +		}
> +
>  		if (!bpf_pseudo_call(insn))
>  			continue;
>  		/* Upon error here we cannot fall back to interpreter but
> @@ -10975,6 +11176,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  	for (i = 0; i < env->subprog_cnt; i++) {
>  		insn = func[i]->insnsi;
>  		for (j = 0; j < func[i]->len; j++, insn++) {
> +			if (bpf_pseudo_func(insn)) {
> +				subprog = insn[1].imm;
> +				insn[0].imm = (u32)(long)func[subprog]->bpf_func;
> +				insn[1].imm = ((u64)(long)func[subprog]->bpf_func) >> 32;
> +				continue;
> +			}
>  			if (!bpf_pseudo_call(insn))
>  				continue;
>  			subprog = insn->off;
> @@ -11020,6 +11227,11 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>  	 * later look the same as if they were interpreted only.
>  	 */
>  	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
> +		if (bpf_pseudo_func(insn)) {
> +			insn[0].imm = env->insn_aux_data[i].call_imm;
> +			insn[1].imm = find_subprog(env, i + insn[0].imm + 1);
> +			continue;
> +		}
>  		if (!bpf_pseudo_call(insn))
>  			continue;
>  		insn->off = env->insn_aux_data[i].call_imm;
> @@ -11083,6 +11295,13 @@ static int fixup_call_args(struct bpf_verifier_env *env)
>  		verbose(env, "tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls\n");
>  		return -EINVAL;
>  	}
> +	if (env->subprog_cnt > 1 && env->prog->aux->with_callback_fn) {

Does this bool really need to be part of 'aux'?
There is a loop below that does if (!bpf_pseudo_call(...))
to fix up insns for the interpreter.
Maybe add if (bpf_pseudo_func()) { return callbacks are not allowed in non-JITed }
to the loop below as well?
It's a trade-off between memory and a few extra insns.
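
I.e. (sketch, reusing the error message from this patch):

	for (i = 0; i < prog->len; i++, insn++) {
		if (bpf_pseudo_func(insn)) {
			/* interpreter doesn't support callbacks yet */
			verbose(env, "callbacks are not allowed in non-JITed programs\n");
			return -EINVAL;
		}
		if (!bpf_pseudo_call(insn))
			continue;
		/* existing fixup for the interpreter ... */
	}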

> +		/* When JIT fails the progs with callback calls
> +		 * have to be rejected, since interpreter doesn't support them yet.
> +		 */
> +		verbose(env, "callbacks are not allowed in non-JITed programs\n");
> +		return -EINVAL;
> +	}
>  	for (i = 0; i < prog->len; i++, insn++) {
>  		if (!bpf_pseudo_call(insn))
>  			continue;

to this loop.

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function
  2021-02-04 23:48 ` [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function Yonghong Song
@ 2021-02-05  5:59   ` Alexei Starovoitov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexei Starovoitov @ 2021-02-05  5:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

On Thu, Feb 04, 2021 at 03:48:27PM -0800, Yonghong Song wrote:
> There is no functionality change. This refactoring intends
> to facilitate next patch change with BPF_PSEUDO_FUNC.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/verifier.c | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5e09632efddb..db294b75d03b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -228,6 +228,12 @@ static void bpf_map_key_store(struct bpf_insn_aux_data *aux, u64 state)
>  			     (poisoned ? BPF_MAP_KEY_POISON : 0ULL);
>  }
>  
> +static bool bpf_pseudo_call(const struct bpf_insn *insn)
> +{
> +	return insn->code == (BPF_JMP | BPF_CALL) &&
> +	       insn->src_reg == BPF_PSEUDO_CALL;
> +}
> +
>  struct bpf_call_arg_meta {
>  	struct bpf_map *map_ptr;
>  	bool raw_mode;
> @@ -1486,9 +1492,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
>  
>  	/* determine subprog starts. The end is one before the next starts */
>  	for (i = 0; i < insn_cnt; i++) {
> -		if (insn[i].code != (BPF_JMP | BPF_CALL))
> -			continue;
> -		if (insn[i].src_reg != BPF_PSEUDO_CALL)
> +		if (!bpf_pseudo_call(insn + i))

Nice cleanup! I've applied this patch from the set.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 3/8] bpf: add hashtab support for bpf_for_each_map_elem() helper
  2021-02-04 23:48 ` [PATCH bpf-next 3/8] bpf: add hashtab support for " Yonghong Song
@ 2021-02-05  6:23   ` Alexei Starovoitov
  2021-02-05 17:49     ` Yonghong Song
  0 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2021-02-05  6:23 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team

On Thu, Feb 04, 2021 at 03:48:30PM -0800, Yonghong Song wrote:
> This patch added support for hashmap, percpu hashmap,
> lru hashmap and percpu lru hashmap.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/bpf.h   |  4 +++
>  kernel/bpf/hashtab.c  | 57 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c | 23 +++++++++++++++++
>  3 files changed, 84 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c8b72ae16cc5..31e0447cadd8 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1396,6 +1396,10 @@ void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux,
>  int bpf_iter_map_fill_link_info(const struct bpf_iter_aux_info *aux,
>  				struct bpf_link_info *info);
>  
> +int map_set_for_each_callback_args(struct bpf_verifier_env *env,
> +				   struct bpf_func_state *caller,
> +				   struct bpf_func_state *callee);
> +
>  int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>  int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index c1ac7f964bc9..40f5404cfb01 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1869,6 +1869,55 @@ static const struct bpf_iter_seq_info iter_seq_info = {
>  	.seq_priv_size		= sizeof(struct bpf_iter_seq_hash_map_info),
>  };
>  
> +static int bpf_for_each_hash_elem(struct bpf_map *map, void *callback_fn,
> +				  void *callback_ctx, u64 flags)
> +{
> +	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +	struct hlist_nulls_head *head;
> +	struct hlist_nulls_node *n;
> +	struct htab_elem *elem;
> +	u32 roundup_key_size;
> +	void __percpu *pptr;
> +	struct bucket *b;
> +	void *key, *val;
> +	bool is_percpu;
> +	long ret = 0;
> +	int i;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	is_percpu = htab_is_percpu(htab);
> +
> +	roundup_key_size = round_up(map->key_size, 8);
> +	for (i = 0; i < htab->n_buckets; i++) {
> +		b = &htab->buckets[i];
> +		rcu_read_lock();
> +		head = &b->head;
> +		hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
> +			key = elem->key;
> +			if (!is_percpu) {
> +				val = elem->key + roundup_key_size;
> +			} else {
> +				/* current cpu value for percpu map */
> +				pptr = htab_elem_get_ptr(elem, map->key_size);
> +				val = this_cpu_ptr(pptr);
> +			}
> +			ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
> +					(u64)(long)key, (u64)(long)val,
> +					(u64)(long)callback_ctx, 0);
> +			if (ret) {
> +				rcu_read_unlock();
> +				ret = (ret == 1) ? 0 : -EINVAL;

One more thing that I should have mentioned in patch 2.
In prepare_func_exit() it would be good to add:
if (callee->in_callback_fn)
  check that r0 is readable and in tnum_range(0, 1).
and then don't assign r0 reg_state anywhere.
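
Something like (sketch; assumes the callee->in_callback_fn marking discussed
for patch 2):

	if (callee->in_callback_fn) {
		/* callback_fn must return either 0 or 1 */
		struct tnum range = tnum_range(0, 1);

		if (r0->type != SCALAR_VALUE || !tnum_in(range, r0->var_off)) {
			verbose(env, "R0 is not a known value (0 or 1)\n");
			return -EINVAL;
		}
	}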

> +				goto out;
> +			}
> +		}
> +		rcu_read_unlock();

Sleepable progs can do cond_resched here.
How about adding migrate_disable() before the for() loop
to make sure that for_each(per_cpu_map,...) is still meaningful, and
if (!in_atomic())
   cond_resched();
here.
Since this helper is called from bpf progs only, the in_atomic() check
(whether the prog was sleepable or not) is accurate.

> +	}
  
  migrate_enable() here after the loop.
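
I.e. roughly (sketch of the suggested structure, not the posted patch):

	migrate_disable();
	for (i = 0; i < htab->n_buckets; i++) {
		b = &htab->buckets[i];
		rcu_read_lock();
		/* ... walk the bucket and call callback_fn ... */
		rcu_read_unlock();
		/* sleepable progs may reschedule between buckets */
		if (!in_atomic())
			cond_resched();
	}
	migrate_enable();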

> +out:
> +	return ret;
> +}
> +
>  static int htab_map_btf_id;
>  const struct bpf_map_ops htab_map_ops = {
>  	.map_meta_equal = bpf_map_meta_equal,
> @@ -1881,6 +1930,8 @@ const struct bpf_map_ops htab_map_ops = {
>  	.map_delete_elem = htab_map_delete_elem,
>  	.map_gen_lookup = htab_map_gen_lookup,
>  	.map_seq_show_elem = htab_map_seq_show_elem,
> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
> +	.map_for_each_callback = bpf_for_each_hash_elem,
>  	BATCH_OPS(htab),
>  	.map_btf_name = "bpf_htab",
>  	.map_btf_id = &htab_map_btf_id,
> @@ -1900,6 +1951,8 @@ const struct bpf_map_ops htab_lru_map_ops = {
>  	.map_delete_elem = htab_lru_map_delete_elem,
>  	.map_gen_lookup = htab_lru_map_gen_lookup,
>  	.map_seq_show_elem = htab_map_seq_show_elem,
> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
> +	.map_for_each_callback = bpf_for_each_hash_elem,
>  	BATCH_OPS(htab_lru),
>  	.map_btf_name = "bpf_htab",
>  	.map_btf_id = &htab_lru_map_btf_id,
> @@ -2019,6 +2072,8 @@ const struct bpf_map_ops htab_percpu_map_ops = {
>  	.map_update_elem = htab_percpu_map_update_elem,
>  	.map_delete_elem = htab_map_delete_elem,
>  	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
> +	.map_for_each_callback = bpf_for_each_hash_elem,
>  	BATCH_OPS(htab_percpu),
>  	.map_btf_name = "bpf_htab",
>  	.map_btf_id = &htab_percpu_map_btf_id,
> @@ -2036,6 +2091,8 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
>  	.map_update_elem = htab_lru_percpu_map_update_elem,
>  	.map_delete_elem = htab_lru_map_delete_elem,
>  	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
> +	.map_for_each_callback = bpf_for_each_hash_elem,
>  	BATCH_OPS(htab_lru_percpu),
>  	.map_btf_name = "bpf_htab",
>  	.map_btf_id = &htab_lru_percpu_map_btf_id,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 050b067a0be6..32c8dcc27da8 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4987,6 +4987,29 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  	return 0;
>  }
>  
> +int map_set_for_each_callback_args(struct bpf_verifier_env *env,
> +				   struct bpf_func_state *caller,
> +				   struct bpf_func_state *callee)
> +{
> +	/* pointer to map */
> +	callee->regs[BPF_REG_1] = caller->regs[BPF_REG_1];
> +
> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
> +	callee->regs[BPF_REG_2].map_ptr = caller->regs[BPF_REG_1].map_ptr;
> +
> +	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
> +	callee->regs[BPF_REG_3].map_ptr = caller->regs[BPF_REG_1].map_ptr;
> +
> +	/* pointer to stack or null */
> +	callee->regs[BPF_REG_4] = caller->regs[BPF_REG_3];

This hard coding of regs 1 through 4 makes sense.
Maybe add a comment with the bpf_for_each_map_elem and callback_fn prototypes,
so it's more obvious what's going on.
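
E.g. (sketch, based on the signatures in the patch 2 commit message; exact
wording is up to you):

	/*
	 * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn,
	 *                            void *callback_ctx, u64 flags);
	 * where, for hashmaps, callback_fn is
	 * long (*callback_fn)(struct bpf_map *map, void *key, void *value,
	 *                     void *callback_ctx);
	 */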

> +
> +	/* unused */
> +	__mark_reg_unknown(env, &callee->regs[BPF_REG_5]);

I think it should be __mark_reg_not_init to make sure that callback_fn
doesn't use r5.
That will help with future extensions via new flags arg that is passed
into bpf_for_each_map_elem.
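
I.e. (sketch):

	/* unused; callback_fn must not read r5 */
	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);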

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper
  2021-02-05  5:49   ` Alexei Starovoitov
@ 2021-02-05 17:39     ` Yonghong Song
  0 siblings, 0 replies; 28+ messages in thread
From: Yonghong Song @ 2021-02-05 17:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team



On 2/4/21 9:49 PM, Alexei Starovoitov wrote:
> On Thu, Feb 04, 2021 at 03:48:29PM -0800, Yonghong Song wrote:
>> The bpf_for_each_map_elem() helper is introduced which
>> iterates all map elements with a callback function. The
>> helper signature looks like
>>    long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
>> and for each map element, the callback_fn will be called. For example,
>> like hashmap, the callback signature may look like
>>    long callback_fn(map, key, val, callback_ctx)
>>
>> There are two known use cases for this. One is from upstream ([1]) where
>> a for_each_map_elem helper may help implement a timeout mechanism
>> in a more generic way. Another is from our internal discussion
>> for a firewall use case where a map contains all the rules. The packet
>> data can be compared to all these rules to decide allow or deny
>> the packet.
>>
>> For array maps, users can already use a bounded loop to traverse
>> elements. Using this helper can avoid using bounded loop. For other
>> type of maps (e.g., hash maps) where bounded loop is hard or
>> impossible to use, this helper provides a convenient way to
>> operate on all elements.
>>
>> For callback_fn, besides map and map element, a callback_ctx,
>> allocated on caller stack, is also passed to the callback
>> function. This callback_ctx argument can provide additional
>> input and allow to write to caller stack for output.
> 
> The approach and implementation look great!
> Few ideas below:
> 
>> +static int check_map_elem_callback(struct bpf_verifier_env *env, int *insn_idx)
>> +{
>> +	struct bpf_verifier_state *state = env->cur_state;
>> +	struct bpf_prog_aux *aux = env->prog->aux;
>> +	struct bpf_func_state *caller, *callee;
>> +	struct bpf_map *map;
>> +	int err, subprog;
>> +
>> +	if (state->curframe + 1 >= MAX_CALL_FRAMES) {
>> +		verbose(env, "the call stack of %d frames is too deep\n",
>> +			state->curframe + 2);
>> +		return -E2BIG;
>> +	}
>> +
>> +	caller = state->frame[state->curframe];
>> +	if (state->frame[state->curframe + 1]) {
>> +		verbose(env, "verifier bug. Frame %d already allocated\n",
>> +			state->curframe + 1);
>> +		return -EFAULT;
>> +	}
>> +
>> +	caller->with_callback_fn = true;
>> +
>> +	callee = kzalloc(sizeof(*callee), GFP_KERNEL);
>> +	if (!callee)
>> +		return -ENOMEM;
>> +	state->frame[state->curframe + 1] = callee;
>> +
>> +	/* callee cannot access r0, r6 - r9 for reading and has to write
>> +	 * into its own stack before reading from it.
>> +	 * callee can read/write into caller's stack
>> +	 */
>> +	init_func_state(env, callee,
>> +			/* remember the callsite, it will be used by bpf_exit */
>> +			*insn_idx /* callsite */,
>> +			state->curframe + 1 /* frameno within this callchain */,
>> +			subprog /* subprog number within this prog */);
>> +
>> +	/* Transfer references to the callee */
>> +	err = transfer_reference_state(callee, caller);
>> +	if (err)
>> +		return err;
>> +
>> +	subprog = caller->regs[BPF_REG_2].subprog;
>> +	if (aux->func_info && aux->func_info_aux[subprog].linkage != BTF_FUNC_STATIC) {
>> +		verbose(env, "callback function R2 not static\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	map = caller->regs[BPF_REG_1].map_ptr;
> 
> Take a look at for (i = 0; i < 5; i++)  err = check_func_arg loop and record_func_map.
> It stores the map pointer into map_ptr_state and makes sure it's unique,
> so that program doesn't try to pass two different maps into the same 'call insn'.
> It can make this function a bit more generic.
> There would be no need to hard code regs[BPF_REG_1].
> The code would take it from map_ptr_state.
> Also it will help later with optimizing
>    return map->ops->map_for_each_callback(map, callback_fn, callback_ctx, flags);
> since the map pointer will be the same the optimization (that is applied to other map
> operations) can be applied for this callback as well.

sounds good. will try this approach in the next revision.

> 
> The regs[BPF_REG_2] can be generalized a bit as well.
> I think the linkage != BTF_FUNC_STATIC check can be moved to the early check_ld_imm phase.
> While here, the check_func_arg() loop can look for the PTR_TO_FUNC type,
> remember the subprog into meta (just like map_ptr_state) and ... continues below

Yes, PTR_TO_FUNC might be used for future helpers like bpf_mod_timer()
or bpf_for_each_task() etc. Making it more general does make sense.

> 
>> +	if (!map->ops->map_set_for_each_callback_args ||
>> +	    !map->ops->map_for_each_callback) {
>> +		verbose(env, "callback function not allowed for map R1\n");
>> +		return -ENOTSUPP;
>> +	}
>> +
>> +	/* the following is only for hashmap, different maps
>> +	 * can have different callback signatures.
>> +	 */
>> +	err = map->ops->map_set_for_each_callback_args(env, caller, callee);
>> +	if (err)
>> +		return err;
>> +
>> +	clear_caller_saved_regs(env, caller->regs);
>> +
>> +	/* only increment it after check_reg_arg() finished */
>> +	state->curframe++;
>> +
>> +	/* and go analyze first insn of the callee */
>> +	*insn_idx = env->subprog_info[subprog].start - 1;
>> +
>> +	if (env->log.level & BPF_LOG_LEVEL) {
>> +		verbose(env, "caller:\n");
>> +		print_verifier_state(env, caller);
>> +		verbose(env, "callee:\n");
>> +		print_verifier_state(env, callee);
>> +	}
>> +	return 0;
>> +}
>> +
>>   static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>>   {
>>   	struct bpf_verifier_state *state = env->cur_state;
>>   	struct bpf_func_state *caller, *callee;
>>   	struct bpf_reg_state *r0;
>> -	int err;
>> +	int i, err;
>>   
>>   	callee = state->frame[state->curframe];
>>   	r0 = &callee->regs[BPF_REG_0];
>> @@ -4955,7 +5090,17 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>>   	state->curframe--;
>>   	caller = state->frame[state->curframe];
>>   	/* return to the caller whatever r0 had in the callee */
>> -	caller->regs[BPF_REG_0] = *r0;
>> +	if (caller->with_callback_fn) {
>> +		/* reset caller saved regs, the helper calling callback_fn
>> +		 * has RET_INTEGER return types.
>> +		 */
>> +		for (i = 0; i < CALLER_SAVED_REGS; i++)
>> +			mark_reg_not_init(env, caller->regs, caller_saved[i]);
>> +		caller->regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG;
>> +		mark_reg_unknown(env, caller->regs, BPF_REG_0);
> 
> this part can stay in check_helper_call().

Yes, verifying the callback function called from the helper is the most
complex part of my patch set. Your suggestions above should make it
more streamlined and a better fit with the existing infrastructure.
Will give it a try.

> 
>> +	} else {
>> +		caller->regs[BPF_REG_0] = *r0;
>> +	}
>>   
>>   	/* Transfer references to the caller */
>>   	err = transfer_reference_state(caller, callee);
>> @@ -5091,7 +5236,8 @@ static int check_reference_leak(struct bpf_verifier_env *env)
>>   	return state->acquired_refs ? -EINVAL : 0;
>>   }
>>   
>> -static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
>> +static int check_helper_call(struct bpf_verifier_env *env, int func_id, int *insn_idx,
>> +			     bool map_elem_callback)
>>   {
>>   	const struct bpf_func_proto *fn = NULL;
>>   	struct bpf_reg_state *regs;
>> @@ -5151,11 +5297,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>>   			return err;
>>   	}
>>   
>> -	err = record_func_map(env, &meta, func_id, insn_idx);
>> +	err = record_func_map(env, &meta, func_id, *insn_idx);
>>   	if (err)
>>   		return err;
>>   
>> -	err = record_func_key(env, &meta, func_id, insn_idx);
>> +	err = record_func_key(env, &meta, func_id, *insn_idx);
>>   	if (err)
>>   		return err;
>>   
>> @@ -5163,7 +5309,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>>   	 * is inferred from register state.
>>   	 */
>>   	for (i = 0; i < meta.access_size; i++) {
>> -		err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B,
>> +		err = check_mem_access(env, *insn_idx, meta.regno, i, BPF_B,
>>   				       BPF_WRITE, -1, false);
>>   		if (err)
>>   			return err;
>> @@ -5195,6 +5341,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>>   		return -EINVAL;
>>   	}
>>   
>> +	if (map_elem_callback) {
>> +		env->prog->aux->with_callback_fn = true;
>> +		return check_map_elem_callback(env, insn_idx);
> 
> Instead of returning early here,
> the check_func_arg() loop can look for the PTR_TO_FUNC type,
> then allocate the new callee state,
> do map_set_for_each_callback_args() here,
> and then proceed further.

ditto. Will re-organize to avoid early return here.

> 
>> +	}
>> +
>>   	/* reset caller saved regs */
>>   	for (i = 0; i < CALLER_SAVED_REGS; i++) {
>>   		mark_reg_not_init(env, regs, caller_saved[i]);
> 
> Instead of doing this loop in prepare_func_exit().
> This code can just proceed here and clear caller regs. This loop can stay as-is.
> The transfer of caller->callee would happen already.
> 
> Then there are a few lines here that the diff didn't show.
> They do regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG and mark_reg_unknown.
> No need to do them in prepare_func_exit().
> This function can proceed further reusing this caller regs clearing loop and r0 marking.
> 
> Then before returning from check_helper_call()
> it will do what you have in check_map_elem_callback() and it will adjust *insn_idx.
> 
> At this point caller would have regs cleared and r0=undef.
> And callee would have regs setup the way map_set_for_each_callback_args callback meant to do it.
> The only thing prepare_func_exit would need to do is to make sure that assignment:
> caller->regs[BPF_REG_0] = *r0
> doesn't happen. caller's r0 was already set to undef.
> To achieve that I think would be a bit cleaner to mark callee state instead of caller state.
> So instead of caller->with_callback_fn=true maybe callee->in_callback_fn=true ?

Agree. Will try to do normal helper return verification here
instead of a cut-down version in prepare_func_exit().

> 
>> @@ -5306,7 +5457,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>>   		/* For release_reference() */
>>   		regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id;
>>   	} else if (is_acquire_function(func_id, meta.map_ptr)) {
>> -		int id = acquire_reference_state(env, insn_idx);
>> +		int id = acquire_reference_state(env, *insn_idx);
>>   
>>   		if (id < 0)
>>   			return id;
>> @@ -5448,6 +5599,14 @@ static int retrieve_ptr_limit(const struct bpf_reg_state *ptr_reg,
>>   		else
>>   			*ptr_limit = -off;
>>   		return 0;
>> +	case PTR_TO_MAP_KEY:
>> +		if (mask_to_left) {
>> +			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
>> +		} else {
>> +			off = ptr_reg->smin_value + ptr_reg->off;
>> +			*ptr_limit = ptr_reg->map_ptr->key_size - off;
>> +		}
>> +		return 0;
>>   	case PTR_TO_MAP_VALUE:
>>   		if (mask_to_left) {
>>   			*ptr_limit = ptr_reg->umax_value + ptr_reg->off;
>> @@ -5614,6 +5773,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
>>   		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
>>   			dst, reg_type_str[ptr_reg->type]);
>>   		return -EACCES;
>> +	case PTR_TO_MAP_KEY:
>>   	case PTR_TO_MAP_VALUE:
>>   		if (!env->allow_ptr_leaks && !known && (smin_val < 0) != (smax_val < 0)) {
>>   			verbose(env, "R%d has unknown scalar with mixed signed bounds, pointer arithmetic with it prohibited for !root\n",
>> @@ -7818,6 +7978,12 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
>>   		return 0;
>>   	}
>>   
>> +	if (insn->src_reg == BPF_PSEUDO_FUNC) {
>> +		dst_reg->type = PTR_TO_FUNC;
>> +		dst_reg->subprog = insn[1].imm;
> 
> Like here, the check for linkage==static can happen?

will do.

> 
>> +		return 0;
>> +	}
>> +
>>   	map = env->used_maps[aux->map_index];
>>   	mark_reg_known_zero(env, regs, insn->dst_reg);
>>   	dst_reg->map_ptr = map;
>> @@ -8195,9 +8361,23 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
>>   
>>   	/* All non-branch instructions have a single fall-through edge. */
>>   	if (BPF_CLASS(insns[t].code) != BPF_JMP &&
>> -	    BPF_CLASS(insns[t].code) != BPF_JMP32)
>> +	    BPF_CLASS(insns[t].code) != BPF_JMP32 &&
>> +	    !bpf_pseudo_func(insns + t))
>>   		return push_insn(t, t + 1, FALLTHROUGH, env, false);
>>   
>> +	if (bpf_pseudo_func(insns + t)) {
>> +		ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
>> +		if (ret)
>> +			return ret;
>> +
>> +		if (t + 1 < insn_cnt)
>> +			init_explored_state(env, t + 1);
>> +		init_explored_state(env, t);
>> +		ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
>> +				env, false);
>> +		return ret;
>> +	}
>> +
>>   	switch (BPF_OP(insns[t].code)) {
>>   	case BPF_EXIT:
>>   		return DONE_EXPLORING;
>> @@ -8819,6 +8999,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
>>   			 */
>>   			return false;
>>   		}
>> +	case PTR_TO_MAP_KEY:
>>   	case PTR_TO_MAP_VALUE:
>>   		/* If the new min/max/var_off satisfy the old ones and
>>   		 * everything else matches, we are OK.
>> @@ -9646,6 +9827,8 @@ static int do_check(struct bpf_verifier_env *env)
>>   
>>   			env->jmps_processed++;
>>   			if (opcode == BPF_CALL) {
>> +				bool map_elem_callback;
>> +
>>   				if (BPF_SRC(insn->code) != BPF_K ||
>>   				    insn->off != 0 ||
>>   				    (insn->src_reg != BPF_REG_0 &&
>> @@ -9662,13 +9845,15 @@ static int do_check(struct bpf_verifier_env *env)
>>   					verbose(env, "function calls are not allowed while holding a lock\n");
>>   					return -EINVAL;
>>   				}
>> +				map_elem_callback = insn->src_reg != BPF_PSEUDO_CALL &&
>> +						   insn->imm == BPF_FUNC_for_each_map_elem;
>>   				if (insn->src_reg == BPF_PSEUDO_CALL)
>>   					err = check_func_call(env, insn, &env->insn_idx);
>>   				else
>> -					err = check_helper_call(env, insn->imm, env->insn_idx);
>> +					err = check_helper_call(env, insn->imm, &env->insn_idx,
>> +								map_elem_callback);
> 
> Then hopefully this extra 'map_elem_callback' boolean won't be needed.
> Only the env->insn_idx -> &env->insn_idx change would remain.
> In that sense check_helper_call will become a superset of check_func_call.
> Maybe some code between them can be shared too.
> Beyond bpf_for_each_map_elem() helper other helpers might use PTR_TO_FUNC.
> I hope with this approach all of them will be handled a bit more generically.

We shouldn't use this since we have insn->imm as the helper id. Will 
remove it.

> 
>>   				if (err)
>>   					return err;
>> -
>>   			} else if (opcode == BPF_JA) {
>>   				if (BPF_SRC(insn->code) != BPF_K ||
>>   				    insn->imm != 0 ||
>> @@ -10090,6 +10275,12 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
>>   				goto next_insn;
>>   			}
>>   
>> +			if (insn[0].src_reg == BPF_PSEUDO_FUNC) {
>> +				aux = &env->insn_aux_data[i];
>> +				aux->ptr_type = PTR_TO_FUNC;
>> +				goto next_insn;
>> +			}
>> +
>>   			/* In final convert_pseudo_ld_imm64() step, this is
>>   			 * converted into regular 64-bit imm load insn.
>>   			 */
>> @@ -10222,9 +10413,13 @@ static void convert_pseudo_ld_imm64(struct bpf_verifier_env *env)
>>   	int insn_cnt = env->prog->len;
>>   	int i;
>>   
>> -	for (i = 0; i < insn_cnt; i++, insn++)
>> -		if (insn->code == (BPF_LD | BPF_IMM | BPF_DW))
>> -			insn->src_reg = 0;
>> +	for (i = 0; i < insn_cnt; i++, insn++) {
>> +		if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
>> +			continue;
>> +		if (insn->src_reg == BPF_PSEUDO_FUNC)
>> +			continue;
>> +		insn->src_reg = 0;
>> +	}
>>   }
>>   
>>   /* single env->prog->insni[off] instruction was replaced with the range
>> @@ -10846,6 +11041,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>>   		return 0;
>>   
>>   	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
>> +		if (bpf_pseudo_func(insn)) {
>> +			env->insn_aux_data[i].call_imm = insn->imm;
>> +			/* subprog is encoded in insn[1].imm */
>> +			continue;
>> +		}
>> +
>>   		if (!bpf_pseudo_call(insn))
>>   			continue;
>>   		/* Upon error here we cannot fall back to interpreter but
>> @@ -10975,6 +11176,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>>   	for (i = 0; i < env->subprog_cnt; i++) {
>>   		insn = func[i]->insnsi;
>>   		for (j = 0; j < func[i]->len; j++, insn++) {
>> +			if (bpf_pseudo_func(insn)) {
>> +				subprog = insn[1].imm;
>> +				insn[0].imm = (u32)(long)func[subprog]->bpf_func;
>> +				insn[1].imm = ((u64)(long)func[subprog]->bpf_func) >> 32;
>> +				continue;
>> +			}
>>   			if (!bpf_pseudo_call(insn))
>>   				continue;
>>   			subprog = insn->off;
>> @@ -11020,6 +11227,11 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>>   	 * later look the same as if they were interpreted only.
>>   	 */
>>   	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
>> +		if (bpf_pseudo_func(insn)) {
>> +			insn[0].imm = env->insn_aux_data[i].call_imm;
>> +			insn[1].imm = find_subprog(env, i + insn[0].imm + 1);
>> +			continue;
>> +		}
>>   		if (!bpf_pseudo_call(insn))
>>   			continue;
>>   		insn->off = env->insn_aux_data[i].call_imm;
>> @@ -11083,6 +11295,13 @@ static int fixup_call_args(struct bpf_verifier_env *env)
>>   		verbose(env, "tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls\n");
>>   		return -EINVAL;
>>   	}
>> +	if (env->subprog_cnt > 1 && env->prog->aux->with_callback_fn) {
> 
> Does this bool really need to be part of 'aux'?
> There is a loop below that does if (!bpf_pseudo_call(...))
> to fix up insns for the interpreter.
> Maybe add if (bpf_pseudo_func()) { return callbacks are not allowed in non-JITed }
> to the loop below as well?
> It's a trade-off between memory and a few extra insns.

I use this bit similarly to tailcall. But I agree with you that we can do the
check in the loop below.

> 
>> +		/* When JIT fails the progs with callback calls
>> +		 * have to be rejected, since interpreter doesn't support them yet.
>> +		 */
>> +		verbose(env, "callbacks are not allowed in non-JITed programs\n");
>> +		return -EINVAL;
>> +	}
>>   	for (i = 0; i < prog->len; i++, insn++) {
>>   		if (!bpf_pseudo_call(insn))
>>   			continue;
> 
> to this loop.
> 
> Thanks!
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 3/8] bpf: add hashtab support for bpf_for_each_map_elem() helper
  2021-02-05  6:23   ` Alexei Starovoitov
@ 2021-02-05 17:49     ` Yonghong Song
  0 siblings, 0 replies; 28+ messages in thread
From: Yonghong Song @ 2021-02-05 17:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, kernel-team



On 2/4/21 10:23 PM, Alexei Starovoitov wrote:
> On Thu, Feb 04, 2021 at 03:48:30PM -0800, Yonghong Song wrote:
>> This patch added support for hashmap, percpu hashmap,
>> lru hashmap and percpu lru hashmap.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   include/linux/bpf.h   |  4 +++
>>   kernel/bpf/hashtab.c  | 57 +++++++++++++++++++++++++++++++++++++++++++
>>   kernel/bpf/verifier.c | 23 +++++++++++++++++
>>   3 files changed, 84 insertions(+)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index c8b72ae16cc5..31e0447cadd8 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1396,6 +1396,10 @@ void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux,
>>   int bpf_iter_map_fill_link_info(const struct bpf_iter_aux_info *aux,
>>   				struct bpf_link_info *info);
>>   
>> +int map_set_for_each_callback_args(struct bpf_verifier_env *env,
>> +				   struct bpf_func_state *caller,
>> +				   struct bpf_func_state *callee);
>> +
>>   int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
>>   int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index c1ac7f964bc9..40f5404cfb01 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -1869,6 +1869,55 @@ static const struct bpf_iter_seq_info iter_seq_info = {
>>   	.seq_priv_size		= sizeof(struct bpf_iter_seq_hash_map_info),
>>   };
>>   
>> +static int bpf_for_each_hash_elem(struct bpf_map *map, void *callback_fn,
>> +				  void *callback_ctx, u64 flags)
>> +{
>> +	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>> +	struct hlist_nulls_head *head;
>> +	struct hlist_nulls_node *n;
>> +	struct htab_elem *elem;
>> +	u32 roundup_key_size;
>> +	void __percpu *pptr;
>> +	struct bucket *b;
>> +	void *key, *val;
>> +	bool is_percpu;
>> +	long ret = 0;
>> +	int i;
>> +
>> +	if (flags != 0)
>> +		return -EINVAL;
>> +
>> +	is_percpu = htab_is_percpu(htab);
>> +
>> +	roundup_key_size = round_up(map->key_size, 8);
>> +	for (i = 0; i < htab->n_buckets; i++) {
>> +		b = &htab->buckets[i];
>> +		rcu_read_lock();
>> +		head = &b->head;
>> +		hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
>> +			key = elem->key;
>> +			if (!is_percpu) {
>> +				val = elem->key + roundup_key_size;
>> +			} else {
>> +				/* current cpu value for percpu map */
>> +				pptr = htab_elem_get_ptr(elem, map->key_size);
>> +				val = this_cpu_ptr(pptr);
>> +			}
>> +			ret = BPF_CAST_CALL(callback_fn)((u64)(long)map,
>> +					(u64)(long)key, (u64)(long)val,
>> +					(u64)(long)callback_ctx, 0);
>> +			if (ret) {
>> +				rcu_read_unlock();
>> +				ret = (ret == 1) ? 0 : -EINVAL;
> 
> one more thing that I should have mentioned in patch 2.
> In prepare_func_exit would be good to add:
> if (callee->in_callback_fn)
>    check that r0 is readable and in tnum_range(0, 1).
> and then don't assign r0 reg_state anywhere.

Yes, indeed. I added the constraint of return value 0/1 at the
last minute. My previous constraint was 0 for continue and non-0 for
return. I changed it as I felt that was not extensible.

Will add the constraint that r0 is readable and within tnum_range(0, 1)
in the next revision.
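
Something along these lines in prepare_func_exit() is what I have in mind
(rough sketch, untested; assumes an in_callback_fn flag on bpf_func_state
as you described, exact messages/error codes to be decided):

    if (callee->in_callback_fn) {
        /* enforce R0 return value range [0, 1] for callback functions */
        struct tnum range = tnum_range(0, 1);
        struct bpf_reg_state *r0 = &callee->regs[BPF_REG_0];

        if (r0->type != SCALAR_VALUE) {
            verbose(env, "R0 not a scalar value\n");
            return -EACCES;
        }
        if (!tnum_in(range, r0->var_off)) {
            verbose(env, "callback_fn must return 0 or 1\n");
            return -EINVAL;
        }
        /* and do not copy r0 into the caller's r0 in this case */
    }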

> 
>> +				goto out;
>> +			}
>> +		}
>> +		rcu_read_unlock();
> 
> Sleepable progs can do cond_resched here.
> How about adding migrate_disable before for() loop
> to make sure that for_each(per_cpu_map,...) is still meaningful and
> if (!in_atomic())
>     cond_resched();
> here.
> Since this helper is called from bpf progs only the in_atomic check
> (whether prog was sleepable or not) is accurate.
> 
>> +	}
>    
>    migrate_enable() here after the loop.

Will do. I have not thought about sleepable programs much yet; I suspect
I may be missing some additional checking somewhere.
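
For the record, the overall shape I plan to use is roughly the following
(sketch only; the early-return/error paths would need the migrate_enable()
as well):

    migrate_disable();
    for (i = 0; i < htab->n_buckets; i++) {
        b = &htab->buckets[i];
        rcu_read_lock();
        head = &b->head;
        hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
            /* ... set up key/val and call callback_fn as before ... */
        }
        rcu_read_unlock();
        /* sleepable progs may reschedule between buckets */
        if (!in_atomic())
            cond_resched();
    }
    migrate_enable();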

> 
>> +out:
>> +	return ret;
>> +}
>> +
>>   static int htab_map_btf_id;
>>   const struct bpf_map_ops htab_map_ops = {
>>   	.map_meta_equal = bpf_map_meta_equal,
>> @@ -1881,6 +1930,8 @@ const struct bpf_map_ops htab_map_ops = {
>>   	.map_delete_elem = htab_map_delete_elem,
>>   	.map_gen_lookup = htab_map_gen_lookup,
>>   	.map_seq_show_elem = htab_map_seq_show_elem,
>> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
>> +	.map_for_each_callback = bpf_for_each_hash_elem,
>>   	BATCH_OPS(htab),
>>   	.map_btf_name = "bpf_htab",
>>   	.map_btf_id = &htab_map_btf_id,
>> @@ -1900,6 +1951,8 @@ const struct bpf_map_ops htab_lru_map_ops = {
>>   	.map_delete_elem = htab_lru_map_delete_elem,
>>   	.map_gen_lookup = htab_lru_map_gen_lookup,
>>   	.map_seq_show_elem = htab_map_seq_show_elem,
>> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
>> +	.map_for_each_callback = bpf_for_each_hash_elem,
>>   	BATCH_OPS(htab_lru),
>>   	.map_btf_name = "bpf_htab",
>>   	.map_btf_id = &htab_lru_map_btf_id,
>> @@ -2019,6 +2072,8 @@ const struct bpf_map_ops htab_percpu_map_ops = {
>>   	.map_update_elem = htab_percpu_map_update_elem,
>>   	.map_delete_elem = htab_map_delete_elem,
>>   	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
>> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
>> +	.map_for_each_callback = bpf_for_each_hash_elem,
>>   	BATCH_OPS(htab_percpu),
>>   	.map_btf_name = "bpf_htab",
>>   	.map_btf_id = &htab_percpu_map_btf_id,
>> @@ -2036,6 +2091,8 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
>>   	.map_update_elem = htab_lru_percpu_map_update_elem,
>>   	.map_delete_elem = htab_lru_map_delete_elem,
>>   	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
>> +	.map_set_for_each_callback_args = map_set_for_each_callback_args,
>> +	.map_for_each_callback = bpf_for_each_hash_elem,
>>   	BATCH_OPS(htab_lru_percpu),
>>   	.map_btf_name = "bpf_htab",
>>   	.map_btf_id = &htab_lru_percpu_map_btf_id,
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 050b067a0be6..32c8dcc27da8 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -4987,6 +4987,29 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>   	return 0;
>>   }
>>   
>> +int map_set_for_each_callback_args(struct bpf_verifier_env *env,
>> +				   struct bpf_func_state *caller,
>> +				   struct bpf_func_state *callee)
>> +{
>> +	/* pointer to map */
>> +	callee->regs[BPF_REG_1] = caller->regs[BPF_REG_1];
>> +
>> +	callee->regs[BPF_REG_2].type = PTR_TO_MAP_KEY;
>> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
>> +	callee->regs[BPF_REG_2].map_ptr = caller->regs[BPF_REG_1].map_ptr;
>> +
>> +	callee->regs[BPF_REG_3].type = PTR_TO_MAP_VALUE;
>> +	__mark_reg_known_zero(&callee->regs[BPF_REG_3]);
>> +	callee->regs[BPF_REG_3].map_ptr = caller->regs[BPF_REG_1].map_ptr;
>> +
>> +	/* pointer to stack or null */
>> +	callee->regs[BPF_REG_4] = caller->regs[BPF_REG_3];
> 
> This hard coding of regs 1 through 4 makes sense.
> May be add a comment with bpf_for_each_map_elem and callback_fn prototypes,
> so it's more obvious what's going on.

Yes, an explicit callback_fn prototype comment makes sense.
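
E.g. something along these lines above map_set_for_each_callback_args()
(sketch; exact wording to be finalized):

    /* The helper and its callback have the following prototypes:
     *
     *   long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn,
     *                              void *callback_ctx, u64 flags);
     *   long callback_fn(struct bpf_map *map, void *key, void *value,
     *                    void *callback_ctx);
     *
     * so r1 - r4 below correspond to the callback arguments.
     */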

> 
>> +
>> +	/* unused */
>> +	__mark_reg_unknown(env, &callee->regs[BPF_REG_5]);
> 
> I think it should be __mark_reg_not_init to make sure that callback_fn
> doesn't use r5.
> That will help with future extensions via new flags arg that is passed
> into bpf_for_each_map_elem.

Will do.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper
  2021-02-04 23:48 ` [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
  2021-02-05  5:49   ` Alexei Starovoitov
@ 2021-02-08 18:16   ` Andrii Nakryiko
  2021-02-09  6:41     ` Yonghong Song
  1 sibling, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-08 18:16 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> The bpf_for_each_map_elem() helper is introduced which
> iterates all map elements with a callback function. The
> helper signature looks like
>   long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
> and for each map element, the callback_fn will be called. For example,
> like hashmap, the callback signature may look like
>   long callback_fn(map, key, val, callback_ctx)
>
> There are two known use cases for this. One is from upstream ([1]) where
> a for_each_map_elem helper may help implement a timeout mechanism
> in a more generic way. Another is from our internal discussion
> for a firewall use case where a map contains all the rules. The packet
> data can be compared to all these rules to decide allow or deny
> the packet.
>
> For array maps, users can already use a bounded loop to traverse
> elements. Using this helper can avoid using bounded loop. For other
> type of maps (e.g., hash maps) where bounded loop is hard or
> impossible to use, this helper provides a convenient way to
> operate on all elements.
>
> For callback_fn, besides map and map element, a callback_ctx,
> allocated on caller stack, is also passed to the callback
> function. This callback_ctx argument can provide additional
> input and allow to write to caller stack for output.
>
> If the callback_fn returns 0, the helper will iterate through next
> element if available. If the callback_fn returns 1, the helper
> will stop iterating and returns to the bpf program. Other return
> values are not used for now.
>
> Currently, this helper is only available with jit. It is possible
> to make it work with the interpreter with some effort but I leave it
> as the future work.
>
> [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

This is a great feature! Few questions and nits below.

>  include/linux/bpf.h            |  14 ++
>  include/linux/bpf_verifier.h   |   3 +
>  include/uapi/linux/bpf.h       |  28 ++++
>  kernel/bpf/bpf_iter.c          |  16 +++
>  kernel/bpf/helpers.c           |   2 +
>  kernel/bpf/verifier.c          | 251 ++++++++++++++++++++++++++++++---
>  kernel/trace/bpf_trace.c       |   2 +
>  tools/include/uapi/linux/bpf.h |  28 ++++
>  8 files changed, 328 insertions(+), 16 deletions(-)
>

[...]

>  const struct bpf_func_proto *bpf_tracing_func_proto(
>         enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index dfe6f85d97dd..c4366b3da342 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -68,6 +68,8 @@ struct bpf_reg_state {
>                         unsigned long raw1;
>                         unsigned long raw2;
>                 } raw;
> +
> +               u32 subprog; /* for PTR_TO_FUNC */

is it offset to subprog (in bytes or instructions?) or it's subprog
index? Let's make it clear with a better name or at least a comment.

>         };
>         /* For PTR_TO_PACKET, used to find other pointers with the same variable
>          * offset, so they can share range knowledge.
> @@ -204,6 +206,7 @@ struct bpf_func_state {
>         int acquired_refs;
>         struct bpf_reference_state *refs;
>         int allocated_stack;
> +       bool with_callback_fn;
>         struct bpf_stack_state *stack;
>  };
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c001766adcbc..d55bd4557376 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -393,6 +393,15 @@ enum bpf_link_type {
>   *                   is struct/union.
>   */
>  #define BPF_PSEUDO_BTF_ID      3
> +/* insn[0].src_reg:  BPF_PSEUDO_FUNC
> + * insn[0].imm:      insn offset to the func
> + * insn[1].imm:      0
> + * insn[0].off:      0
> + * insn[1].off:      0
> + * ldimm64 rewrite:  address of the function
> + * verifier type:    PTR_TO_FUNC.
> + */
> +#define BPF_PSEUDO_FUNC                4
>
>  /* when bpf_call->src_reg == BPF_PSEUDO_CALL, bpf_call->imm == pc-relative
>   * offset to another bpf function
> @@ -3836,6 +3845,24 @@ union bpf_attr {
>   *     Return
>   *             A pointer to a struct socket on success or NULL if the file is
>   *             not a socket.
> + *
> + * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)

struct bpf_map * here might be problematic. In other instances where
we pass map (bpf_map_update_elem, for example) we specify this as
(void *). Let's do that instead here?

> + *     Description
> + *             For each element in **map**, call **callback_fn** function with
> + *             **map**, **callback_ctx** and other map-specific parameters.
> + *             For example, for hash and array maps, the callback signature can
> + *             be `u64 callback_fn(map, map_key, map_value, callback_ctx)`.
> + *             The **callback_fn** should be a static function and
> + *             the **callback_ctx** should be a pointer to the stack.
> + *             The **flags** is used to control certain aspects of the helper.
> + *             Currently, the **flags** must be 0.
> + *
> + *             If **callback_fn** return 0, the helper will continue to the next
> + *             element. If return value is 1, the helper will skip the rest of
> + *             elements and return. Other return values are not used now.
> + *     Return
> + *             0 for success, **-EINVAL** for invalid **flags** or unsupported
> + *             **callback_fn** return value.

just a thought: returning the number of elements *actually* iterated
seems useful (even though I don't have a specific use case right now).

>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -4001,6 +4028,7 @@ union bpf_attr {
>         FN(ktime_get_coarse_ns),        \
>         FN(ima_inode_hash),             \
>         FN(sock_from_file),             \
> +       FN(for_each_map_elem),          \

to be more in sync with other map operations, can we call this
`bpf_map_for_each_elem`? I think it makes sense and doesn't read
backwards at all.

>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 5454161407f1..5187f49d3216 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -675,3 +675,19 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
>          */
>         return ret == 0 ? 0 : -EAGAIN;
>  }
> +
> +BPF_CALL_4(bpf_for_each_map_elem, struct bpf_map *, map, void *, callback_fn,
> +          void *, callback_ctx, u64, flags)
> +{
> +       return map->ops->map_for_each_callback(map, callback_fn, callback_ctx, flags);
> +}
> +
> +const struct bpf_func_proto bpf_for_each_map_elem_proto = {
> +       .func           = bpf_for_each_map_elem,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_CONST_MAP_PTR,
> +       .arg2_type      = ARG_PTR_TO_FUNC,
> +       .arg3_type      = ARG_PTR_TO_STACK_OR_NULL,

I looked through this code just once but haven't noticed anything that
would strictly require that pointer is specifically to stack. Can this
be made into a pointer to any allocated memory? E.g., why can't we
allow passing a pointer to a ringbuf sample, for instance? Or
MAP_VALUE?

> +       .arg4_type      = ARG_ANYTHING,
> +};
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 308427fe03a3..074800226327 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -708,6 +708,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>                 return &bpf_ringbuf_discard_proto;
>         case BPF_FUNC_ringbuf_query:
>                 return &bpf_ringbuf_query_proto;
> +       case BPF_FUNC_for_each_map_elem:
> +               return &bpf_for_each_map_elem_proto;
>         default:
>                 break;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index db294b75d03b..050b067a0be6 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -234,6 +234,12 @@ static bool bpf_pseudo_call(const struct bpf_insn *insn)
>                insn->src_reg == BPF_PSEUDO_CALL;
>  }
>

[...]

>         map = env->used_maps[aux->map_index];
>         mark_reg_known_zero(env, regs, insn->dst_reg);
>         dst_reg->map_ptr = map;
> @@ -8195,9 +8361,23 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
>
>         /* All non-branch instructions have a single fall-through edge. */
>         if (BPF_CLASS(insns[t].code) != BPF_JMP &&
> -           BPF_CLASS(insns[t].code) != BPF_JMP32)
> +           BPF_CLASS(insns[t].code) != BPF_JMP32 &&
> +           !bpf_pseudo_func(insns + t))
>                 return push_insn(t, t + 1, FALLTHROUGH, env, false);
>
> +       if (bpf_pseudo_func(insns + t)) {


if you check this before above JMP|JMP32 check, you won't need to do
!bpf_pseudo_func, right? I think it's cleaner.

> +               ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
> +               if (ret)
> +                       return ret;
> +
> +               if (t + 1 < insn_cnt)
> +                       init_explored_state(env, t + 1);
> +               init_explored_state(env, t);
> +               ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
> +                               env, false);
> +               return ret;
> +       }
> +
>         switch (BPF_OP(insns[t].code)) {
>         case BPF_EXIT:
>                 return DONE_EXPLORING;

[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 6/8] bpftool: print local function pointer properly
  2021-02-04 23:48 ` [PATCH bpf-next 6/8] bpftool: print local function pointer properly Yonghong Song
@ 2021-02-08 18:22   ` Andrii Nakryiko
  2021-02-09  6:42     ` Yonghong Song
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-08 18:22 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> With later hashmap example, using bpftool xlated output may
> look like:
>   int dump_task(struct bpf_iter__task * ctx):
>   ; struct task_struct *task = ctx->task;
>      0: (79) r2 = *(u64 *)(r1 +8)
>   ; if (task == (void *)0 || called > 0)
>   ...
>     19: (18) r2 = subprog[+18]
>     30: (18) r2 = subprog[+26]
>   ...
>   36: (95) exit
>   __u64 check_hash_elem(struct bpf_map * map, __u32 * key, __u64 * val,
>                         struct callback_ctx * data):
>   ; struct bpf_iter__task *ctx = data->ctx;
>     37: (79) r5 = *(u64 *)(r4 +0)
>   ...
>     55: (95) exit
>   __u64 check_percpu_elem(struct bpf_map * map, __u32 * key,
>                           __u64 * val, void * unused):
>   ; check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val, void *unused)
>     56: (bf) r6 = r3
>   ...
>     83: (18) r2 = subprog[+-46]

this +-46 looks very confusing...

>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/bpf/bpftool/xlated_dumper.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/tools/bpf/bpftool/xlated_dumper.c b/tools/bpf/bpftool/xlated_dumper.c
> index 8608cd68cdd0..7bdd90503727 100644
> --- a/tools/bpf/bpftool/xlated_dumper.c
> +++ b/tools/bpf/bpftool/xlated_dumper.c
> @@ -196,6 +196,9 @@ static const char *print_imm(void *private_data,
>         else if (insn->src_reg == BPF_PSEUDO_MAP_VALUE)
>                 snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>                          "map[id:%u][0]+%u", insn->imm, (insn + 1)->imm);
> +       else if (insn->src_reg == BPF_PSEUDO_FUNC)
> +               snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
> +                        "subprog[+%d]", insn->imm + 1);

why not `subprog[%+d]` instead (see above about confusing output)

>         else
>                 snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>                          "0x%llx", (unsigned long long)full_imm);
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper
  2021-02-04 23:48 ` [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper Yonghong Song
@ 2021-02-08 18:34   ` Andrii Nakryiko
  2021-02-09  6:46     ` Yonghong Song
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-08 18:34 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> A test case is added for hashmap and percpu hashmap. The test
> also exercises nested bpf_for_each_map_elem() calls like
>     bpf_prog:
>       bpf_for_each_map_elem(func1)
>     func1:
>       bpf_for_each_map_elem(func2)
>     func2:
>
>   $ ./test_progs -n 44
>   #44/1 hash_map:OK
>   #44 for_each:OK
>   Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  .../selftests/bpf/prog_tests/for_each.c       |  91 ++++++++++++++++
>  .../bpf/progs/for_each_hash_map_elem.c        | 103 ++++++++++++++++++
>  2 files changed, 194 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each.c
>  create mode 100644 tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c
>

[...]

> +       num_cpus = bpf_num_possible_cpus();
> +       percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
> +       percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
> +       if (CHECK_FAIL(!percpu_valbuf))
> +               goto out;
> +
> +       key = 1;
> +       for (i = 0; i < num_cpus; i++)
> +               percpu_valbuf[i] = i + 1;
> +       err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
> +       if (CHECK(err, "percpu_map_update", "map_update failed\n"))
> +               goto out;
> +
> +       do_dummy_read(skel->progs.dump_task);

why use iter/task programs to trigger your test BPF code? This test
doesn't seem to rely on anything iter-specific, so it's much simpler
(and less code) to just use the typical sys_enter approach with
usleep(1) as a trigger function, no?

> +
> +       ASSERT_EQ(skel->bss->called, 1, "called");
> +       ASSERT_EQ(skel->bss->hashmap_output, 4, "output_val");
> +
> +       key = 1;
> +       err = bpf_map_lookup_elem(hashmap_fd, &key, &val);
> +       ASSERT_ERR(err, "hashmap_lookup");
> +
> +       ASSERT_EQ(skel->bss->percpu_called, 1, "percpu_called");
> +       CHECK_FAIL(skel->bss->cpu >= num_cpus);

please don't use CHECK_FAIL: use CHECK or one of ASSERT_xxx variants

> +       ASSERT_EQ(skel->bss->percpu_key, 1, "percpu_key");
> +       ASSERT_EQ(skel->bss->percpu_val, skel->bss->cpu + 1, "percpu_val");
> +       ASSERT_EQ(skel->bss->percpu_output, 100, "percpu_output");
> +out:
> +       free(percpu_valbuf);
> +       for_each_hash_map_elem__destroy(skel);
> +}
> +
> +void test_for_each(void)
> +{
> +       if (test__start_subtest("hash_map"))
> +               test_hash_map();
> +}

[...]

> +
> +__u32 cpu = 0;
> +__u32 percpu_called = 0;
> +__u32 percpu_key = 0;
> +__u64 percpu_val = 0;
> +
> +static __u64
> +check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val,
> +                 struct callback_ctx *data)
> +{
> +       percpu_called++;
> +       cpu = bpf_get_smp_processor_id();

It's a bit counter-intuitive (at least I was confused initially) that
for a per-cpu array for_each() will iterate only current CPU's
elements. I think it's worthwhile to emphasize this in
bpf_for_each_map_elem() documentation (at least), and call out in
corresponding patches adding per-cpu array/hash iteration support.

> +       percpu_key = *key;
> +       percpu_val = *val;
> +
> +       bpf_for_each_map_elem(&hashmap, check_hash_elem, data, 0);
> +       return 0;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 8/8] selftests/bpf: add arraymap test for bpf_for_each_map_elem() helper
  2021-02-04 23:48 ` [PATCH bpf-next 8/8] selftests/bpf: add arraymap " Yonghong Song
@ 2021-02-08 18:35   ` Andrii Nakryiko
  2021-02-09  6:50     ` Yonghong Song
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-08 18:35 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>
> A test is added for arraymap and percpu arraymap. The test also
> exercises the early return for the helper which does not
> traverse all elements.
>     $ ./test_progs -n 44
>     #44/1 hash_map:OK
>     #44/2 array_map:OK
>     #44 for_each:OK
>     Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  .../selftests/bpf/prog_tests/for_each.c       | 54 ++++++++++++++
>  .../bpf/progs/for_each_array_map_elem.c       | 71 +++++++++++++++++++
>  2 files changed, 125 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/for_each_array_map_elem.c
>

[...]

> +
> +       arraymap_fd = bpf_map__fd(skel->maps.arraymap);
> +       expected_total = 0;
> +       max_entries = bpf_map__max_entries(skel->maps.arraymap);
> +       for (i = 0; i < max_entries; i++) {
> +               key = i;
> +               val = i + 1;
> +               /* skip the last iteration for expected total */
> +               if (i != max_entries - 1)
> +                       expected_total += val;
> +               err = bpf_map_update_elem(arraymap_fd, &key, &val, BPF_ANY);
> +               if (CHECK(err, "map_update", "map_update failed\n"))
> +                       goto out;
> +       }
> +
> +       num_cpus = bpf_num_possible_cpus();
> +        percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);

> formatting is off, please check with checkpatch.pl


> +        percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
> +        if (CHECK_FAIL(!percpu_valbuf))

please don't use CHECK_FAIL, ASSERT_PTR_OK would work nicely here

> +                goto out;
> +
> +       key = 0;
> +        for (i = 0; i < num_cpus; i++)
> +                percpu_valbuf[i] = i + 1;
> +       err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
> +       if (CHECK(err, "percpu_map_update", "map_update failed\n"))
> +               goto out;
> +
> +       do_dummy_read(skel->progs.dump_task);

see previous patch, iter/tasks seems like an overkill for these tests

> +
> +       ASSERT_EQ(skel->bss->called, 1, "called");
> +       ASSERT_EQ(skel->bss->arraymap_output, expected_total, "array_output");
> +       ASSERT_EQ(skel->bss->cpu + 1, skel->bss->percpu_val, "percpu_val");
> +
> +out:
> +       free(percpu_valbuf);
> +       for_each_array_map_elem__destroy(skel);
> +}
> +

[...]

> +SEC("iter/task")
> +int dump_task(struct bpf_iter__task *ctx)
> +{
> +       struct seq_file *seq =  ctx->meta->seq;
> +       struct task_struct *task = ctx->task;
> +       struct callback_ctx data;
> +
> +       /* only call once */
> +       if (called > 0)
> +               return 0;
> +
> +       called++;
> +
> +       data.output = 0;
> +       bpf_for_each_map_elem(&arraymap, check_array_elem, &data, 0);
> +       arraymap_output = data.output;
> +
> +       bpf_for_each_map_elem(&percpu_map, check_percpu_elem, 0, 0);

nit: NULL, 0 ?

> +
> +       return 0;
> +}
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 5/8] libbpf: support local function pointer relocation
  2021-02-04 23:48 ` [PATCH bpf-next 5/8] libbpf: support local function pointer relocation Yonghong Song
@ 2021-02-08 18:52   ` Andrii Nakryiko
  2021-02-09  6:56     ` Yonghong Song
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-08 18:52 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Thu, Feb 4, 2021 at 5:54 PM Yonghong Song <yhs@fb.com> wrote:
>
> A new relocation RELO_LOCAL_FUNC is added to capture
> local (static) function pointers loaded with ld_imm64
> insns. Such ld_imm64 insns are marked with
> BPF_PSEUDO_FUNC and will be passed to kernel so
> kernel can replace them with proper actual jited
> func addresses.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/lib/bpf/libbpf.c | 33 ++++++++++++++++++++++++++++++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 2abbc3800568..a5146c9e3e06 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -188,6 +188,7 @@ enum reloc_type {
>         RELO_CALL,
>         RELO_DATA,
>         RELO_EXTERN,
> +       RELO_LOCAL_FUNC,

libbpf internally is using SUBPROG notation. I think "LOCAL" part is
confusing, so I'd drop it. How about just RELO_SUBPROG? We can
separately refactor these names to distinguish RELO_CALL from the new
one. It would be more clear if RELO_CALL was called RELO_SUBPROG_CALL,
and the new one either RELO_SUBPROG_ADDR or RELO_SUBPROG_REF (as in
subprog reference)

>  };
>
>  struct reloc_desc {
> @@ -574,6 +575,12 @@ static bool insn_is_subprog_call(const struct bpf_insn *insn)
>                insn->off == 0;
>  }
>
> +static bool insn_is_pseudo_func(const struct bpf_insn *insn)
> +{
> +       return insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&

there is is_ldimm64() function for this check (just move it up here,
it's a single-liner)

> +              insn->src_reg == BPF_PSEUDO_FUNC;
> +}
> +
>  static int
>  bpf_object__init_prog(struct bpf_object *obj, struct bpf_program *prog,
>                       const char *name, size_t sec_idx, const char *sec_name,
> @@ -3395,6 +3402,16 @@ static int bpf_program__record_reloc(struct bpf_program *prog,
>                 return 0;
>         }
>
> +       if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&

just move this check below the next if that checks !is_ldimm64, no
need to do it here early.

> +           GELF_ST_BIND(sym->st_info) == STB_LOCAL &&
> +           GELF_ST_TYPE(sym->st_info) == STT_SECTION &&
> +           shdr_idx == obj->efile.text_shndx) {

see above how RELO_CALL is handled: shdr_idx != 0 check is missing. We
also validate that sym->st_value is multiple of BPF_INSN_SZ.

> +               reloc_desc->type = RELO_LOCAL_FUNC;
> +               reloc_desc->insn_idx = insn_idx;
> +               reloc_desc->sym_off = sym->st_value;
> +               return 0;
> +       }
> +
>         if (insn->code != (BPF_LD | BPF_IMM | BPF_DW)) {

feel free to use is_ldimm64 here as well, thanks!

>                 pr_warn("prog '%s': invalid relo against '%s' for insns[%d].code 0x%x\n",
>                         prog->name, sym_name, insn_idx, insn->code);
> @@ -6172,6 +6189,9 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)
>                         }
>                         relo->processed = true;
>                         break;
> +               case RELO_LOCAL_FUNC:
> +                       insn[0].src_reg = BPF_PSEUDO_FUNC;
> +                       /* fallthrough */

fallthrough into an empty break clause seems a bit weird... just break
and leave the same comment as below?

>                 case RELO_CALL:
>                         /* will be handled as a follow up pass */
>                         break;
> @@ -6358,11 +6378,11 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
>
>         for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {
>                 insn = &main_prog->insns[prog->sub_insn_off + insn_idx];
> -               if (!insn_is_subprog_call(insn))
> +               if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))
>                         continue;
>
>                 relo = find_prog_insn_relo(prog, insn_idx);
> -               if (relo && relo->type != RELO_CALL) {
> +               if (relo && relo->type != RELO_CALL && relo->type != RELO_LOCAL_FUNC) {
>                         pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",
>                                 prog->name, insn_idx, relo->type);
>                         return -LIBBPF_ERRNO__RELOC;
> @@ -6374,8 +6394,15 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
>                          * call always has imm = -1, but for static functions
>                          * relocation is against STT_SECTION and insn->imm
>                          * points to a start of a static function
> +                        *
> +                        * for local func relocation, the imm field encodes
> +                        * the byte offset in the corresponding section.
>                          */
> -                       sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
> +                       if (relo->type == RELO_CALL)
> +                               sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
> +                       else
> +                               sub_insn_idx = relo->sym_off / BPF_INSN_SZ +
> +                                              insn->imm / BPF_INSN_SZ + 1;

nit: keep it on a single line, it still fits within 100 characters and
is easier to visually compare to RELO_CALL case.

>                 } else {
>                         /* if subprogram call is to a static function within
>                          * the same ELF section, there won't be any relocation

don't we have to adjust insn->imm for this case as well? Let's add
selftests to make sure this works.

> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper
  2021-02-08 18:16   ` Andrii Nakryiko
@ 2021-02-09  6:41     ` Yonghong Song
  2021-02-09 17:33       ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-09  6:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team



On 2/8/21 10:16 AM, Andrii Nakryiko wrote:
> On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> The bpf_for_each_map_elem() helper is introduced which
>> iterates all map elements with a callback function. The
>> helper signature looks like
>>    long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
>> and for each map element, the callback_fn will be called. For example,
>> like hashmap, the callback signature may look like
>>    long callback_fn(map, key, val, callback_ctx)
>>
>> There are two known use cases for this. One is from upstream ([1]) where
>> a for_each_map_elem helper may help implement a timeout mechanism
>> in a more generic way. Another is from our internal discussion
>> for a firewall use case where a map contains all the rules. The packet
>> data can be compared to all these rules to decide allow or deny
>> the packet.
>>
>> For array maps, users can already use a bounded loop to traverse
>> elements. Using this helper can avoid using bounded loop. For other
>> type of maps (e.g., hash maps) where bounded loop is hard or
>> impossible to use, this helper provides a convenient way to
>> operate on all elements.
>>
>> For callback_fn, besides map and map element, a callback_ctx,
>> allocated on caller stack, is also passed to the callback
>> function. This callback_ctx argument can provide additional
>> input and allow to write to caller stack for output.
>>
>> If the callback_fn returns 0, the helper will iterate through next
>> element if available. If the callback_fn returns 1, the helper
>> will stop iterating and returns to the bpf program. Other return
>> values are not used for now.
>>
>> Currently, this helper is only available with jit. It is possible
>> to make it work with the interpreter with some effort but I leave it
>> as the future work.
>>
>> [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
> 
> This is a great feature! Few questions and nits below.
> 
>>   include/linux/bpf.h            |  14 ++
>>   include/linux/bpf_verifier.h   |   3 +
>>   include/uapi/linux/bpf.h       |  28 ++++
>>   kernel/bpf/bpf_iter.c          |  16 +++
>>   kernel/bpf/helpers.c           |   2 +
>>   kernel/bpf/verifier.c          | 251 ++++++++++++++++++++++++++++++---
>>   kernel/trace/bpf_trace.c       |   2 +
>>   tools/include/uapi/linux/bpf.h |  28 ++++
>>   8 files changed, 328 insertions(+), 16 deletions(-)
>>
> 
> [...]
> 
>>   const struct bpf_func_proto *bpf_tracing_func_proto(
>>          enum bpf_func_id func_id, const struct bpf_prog *prog);
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index dfe6f85d97dd..c4366b3da342 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -68,6 +68,8 @@ struct bpf_reg_state {
>>                          unsigned long raw1;
>>                          unsigned long raw2;
>>                  } raw;
>> +
>> +               u32 subprog; /* for PTR_TO_FUNC */
> 
> is it offset to subprog (in bytes or instructions?) or it's subprog
> index? Let's make it clear with a better name or at least a comment.

This is the subprog number (i.e., an index into some subprog-related arrays).
In verifier.c, subprog or subprogno is used to represent the subprog
number. I will use subprogno in the next revision.

> 
>>          };
>>          /* For PTR_TO_PACKET, used to find other pointers with the same variable
>>           * offset, so they can share range knowledge.
>> @@ -204,6 +206,7 @@ struct bpf_func_state {
>>          int acquired_refs;
>>          struct bpf_reference_state *refs;
>>          int allocated_stack;
>> +       bool with_callback_fn;
>>          struct bpf_stack_state *stack;
>>   };
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c001766adcbc..d55bd4557376 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -393,6 +393,15 @@ enum bpf_link_type {
>>    *                   is struct/union.
>>    */
>>   #define BPF_PSEUDO_BTF_ID      3
>> +/* insn[0].src_reg:  BPF_PSEUDO_FUNC
>> + * insn[0].imm:      insn offset to the func
>> + * insn[1].imm:      0
>> + * insn[0].off:      0
>> + * insn[1].off:      0
>> + * ldimm64 rewrite:  address of the function
>> + * verifier type:    PTR_TO_FUNC.
>> + */
>> +#define BPF_PSEUDO_FUNC                4
>>
>>   /* when bpf_call->src_reg == BPF_PSEUDO_CALL, bpf_call->imm == pc-relative
>>    * offset to another bpf function
>> @@ -3836,6 +3845,24 @@ union bpf_attr {
>>    *     Return
>>    *             A pointer to a struct socket on success or NULL if the file is
>>    *             not a socket.
>> + *
>> + * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)
> 
> struct bpf_map * here might be problematic. In other instances where
> we pass map (bpf_map_update_elem, for example) we specify this as
> (void *). Let's do that instead here?

We should be fine here. bpf_map_lookup_elem() etc. all have "struct
bpf_map *map"; it is rewritten by bpf_helpers_doc.py to "void *map".

> 
>> + *     Description
>> + *             For each element in **map**, call **callback_fn** function with
>> + *             **map**, **callback_ctx** and other map-specific parameters.
>> + *             For example, for hash and array maps, the callback signature can
>> + *             be `u64 callback_fn(map, map_key, map_value, callback_ctx)`.
>> + *             The **callback_fn** should be a static function and
>> + *             the **callback_ctx** should be a pointer to the stack.
>> + *             The **flags** is used to control certain aspects of the helper.
>> + *             Currently, the **flags** must be 0.
>> + *
>> + *             If **callback_fn** return 0, the helper will continue to the next
>> + *             element. If return value is 1, the helper will skip the rest of
>> + *             elements and return. Other return values are not used now.
>> + *     Return
>> + *             0 for success, **-EINVAL** for invalid **flags** or unsupported
>> + *             **callback_fn** return value.
> 
> just a thought: returning the number of elements *actually* iterated
> seems useful (even though I don't have a specific use case right now).

Good idea. Will change to this in the next revision.
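
For the hashmap case, the change would look roughly like this (sketch;
the -EINVAL return for bad flags stays as is):

    u64 num_elems = 0;
    long ret = 0;

    for (i = 0; i < htab->n_buckets; i++) {
        /* ... bucket and RCU setup as before ... */
        hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
            num_elems++;
            /* ... ret = callback_fn(...) as before ... */
            if (ret) {
                rcu_read_unlock();
                goto out;
            }
        }
        rcu_read_unlock();
    }
out:
    return num_elems;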

> 
>>    */
>>   #define __BPF_FUNC_MAPPER(FN)          \
>>          FN(unspec),                     \
>> @@ -4001,6 +4028,7 @@ union bpf_attr {
>>          FN(ktime_get_coarse_ns),        \
>>          FN(ima_inode_hash),             \
>>          FN(sock_from_file),             \
>> +       FN(for_each_map_elem),          \
> 
> to be more in sync with other map operations, can we call this
> `bpf_map_for_each_elem`? I think it makes sense and doesn't read
> backwards at all.

I am using the for_each_ prefix since in the future I (or others) may add
more for_each_* helpers, e.g., for_each_task, for_each_hlist_rcu, etc.
This represents a family of helpers with callback functions, so I
would like to stick with the for_each_* names.

> 
>>          /* */
>>
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
>> index 5454161407f1..5187f49d3216 100644
>> --- a/kernel/bpf/bpf_iter.c
>> +++ b/kernel/bpf/bpf_iter.c
>> @@ -675,3 +675,19 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
>>           */
>>          return ret == 0 ? 0 : -EAGAIN;
>>   }
>> +
>> +BPF_CALL_4(bpf_for_each_map_elem, struct bpf_map *, map, void *, callback_fn,
>> +          void *, callback_ctx, u64, flags)
>> +{
>> +       return map->ops->map_for_each_callback(map, callback_fn, callback_ctx, flags);
>> +}
>> +
>> +const struct bpf_func_proto bpf_for_each_map_elem_proto = {
>> +       .func           = bpf_for_each_map_elem,
>> +       .gpl_only       = false,
>> +       .ret_type       = RET_INTEGER,
>> +       .arg1_type      = ARG_CONST_MAP_PTR,
>> +       .arg2_type      = ARG_PTR_TO_FUNC,
>> +       .arg3_type      = ARG_PTR_TO_STACK_OR_NULL,
> 
> I looked through this code just once but haven't noticed anything that
> would strictly require that pointer is specifically to stack. Can this
> be made into a pointer to any allocated memory? E.g., why can't we
> allow passing a pointer to a ringbuf sample, for instance? Or
> MAP_VALUE?

ARG_PTR_TO_STACK_OR_NULL is the most flexible one. For example, if you
want to pass a map_value or a ringbuf sample, you can assign these values
to the stack like
    struct ctx_t {
       struct map_value_t *map_value;
       char *ringbuf_mem;
    } tmp;
    tmp.map_value = ...;
    tmp.ringbuf_mem = ...;
    bpf_for_each_map_elem(map, callback_fn, &tmp, flags);
and callback_fn will be able to access map_value/ringbuf_mem
with their original register types.

This does not allow passing ringbuf/map_value etc. as first-class
citizens, but I think this is a good compromise to permit greater
flexibility.

> 
>> +       .arg4_type      = ARG_ANYTHING,
>> +};
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 308427fe03a3..074800226327 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -708,6 +708,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
>>                  return &bpf_ringbuf_discard_proto;
>>          case BPF_FUNC_ringbuf_query:
>>                  return &bpf_ringbuf_query_proto;
>> +       case BPF_FUNC_for_each_map_elem:
>> +               return &bpf_for_each_map_elem_proto;
>>          default:
>>                  break;
>>          }
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index db294b75d03b..050b067a0be6 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -234,6 +234,12 @@ static bool bpf_pseudo_call(const struct bpf_insn *insn)
>>                 insn->src_reg == BPF_PSEUDO_CALL;
>>   }
>>
> 
> [...]
> 
>>          map = env->used_maps[aux->map_index];
>>          mark_reg_known_zero(env, regs, insn->dst_reg);
>>          dst_reg->map_ptr = map;
>> @@ -8195,9 +8361,23 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
>>
>>          /* All non-branch instructions have a single fall-through edge. */
>>          if (BPF_CLASS(insns[t].code) != BPF_JMP &&
>> -           BPF_CLASS(insns[t].code) != BPF_JMP32)
>> +           BPF_CLASS(insns[t].code) != BPF_JMP32 &&
>> +           !bpf_pseudo_func(insns + t))
>>                  return push_insn(t, t + 1, FALLTHROUGH, env, false);
>>
>> +       if (bpf_pseudo_func(insns + t)) {
> 
> 
> if you check this before above JMP|JMP32 check, you won't need to do
> !bpf_pseudo_func, right? I think it's cleaner.

Agree, will change in v2.
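
I.e. roughly (sketch):

    if (bpf_pseudo_func(insns + t)) {
        ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
        if (ret)
            return ret;

        if (t + 1 < insn_cnt)
            init_explored_state(env, t + 1);
        init_explored_state(env, t);
        return push_insn(t, t + insns[t].imm + 1, BRANCH, env, false);
    }

    /* All non-branch instructions have a single fall-through edge. */
    if (BPF_CLASS(insns[t].code) != BPF_JMP &&
        BPF_CLASS(insns[t].code) != BPF_JMP32)
        return push_insn(t, t + 1, FALLTHROUGH, env, false);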

> 
>> +               ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
>> +               if (ret)
>> +                       return ret;
>> +
>> +               if (t + 1 < insn_cnt)
>> +                       init_explored_state(env, t + 1);
>> +               init_explored_state(env, t);
>> +               ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
>> +                               env, false);
>> +               return ret;
>> +       }
>> +
>>          switch (BPF_OP(insns[t].code)) {
>>          case BPF_EXIT:
>>                  return DONE_EXPLORING;
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 6/8] bpftool: print local function pointer properly
  2021-02-08 18:22   ` Andrii Nakryiko
@ 2021-02-09  6:42     ` Yonghong Song
  0 siblings, 0 replies; 28+ messages in thread
From: Yonghong Song @ 2021-02-09  6:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team



On 2/8/21 10:22 AM, Andrii Nakryiko wrote:
> On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> With later hashmap example, using bpftool xlated output may
>> look like:
>>    int dump_task(struct bpf_iter__task * ctx):
>>    ; struct task_struct *task = ctx->task;
>>       0: (79) r2 = *(u64 *)(r1 +8)
>>    ; if (task == (void *)0 || called > 0)
>>    ...
>>      19: (18) r2 = subprog[+18]
>>      30: (18) r2 = subprog[+26]
>>    ...
>>    36: (95) exit
>>    __u64 check_hash_elem(struct bpf_map * map, __u32 * key, __u64 * val,
>>                          struct callback_ctx * data):
>>    ; struct bpf_iter__task *ctx = data->ctx;
>>      37: (79) r5 = *(u64 *)(r4 +0)
>>    ...
>>      55: (95) exit
>>    __u64 check_percpu_elem(struct bpf_map * map, __u32 * key,
>>                            __u64 * val, void * unused):
>>    ; check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val, void *unused)
>>      56: (bf) r6 = r3
>>    ...
>>      83: (18) r2 = subprog[+-46]
> 
> this +-46 looks very confusing...

Makes sense, will use %+d to get either +46 or -46.

> 
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/bpf/bpftool/xlated_dumper.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/tools/bpf/bpftool/xlated_dumper.c b/tools/bpf/bpftool/xlated_dumper.c
>> index 8608cd68cdd0..7bdd90503727 100644
>> --- a/tools/bpf/bpftool/xlated_dumper.c
>> +++ b/tools/bpf/bpftool/xlated_dumper.c
>> @@ -196,6 +196,9 @@ static const char *print_imm(void *private_data,
>>          else if (insn->src_reg == BPF_PSEUDO_MAP_VALUE)
>>                  snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>>                           "map[id:%u][0]+%u", insn->imm, (insn + 1)->imm);
>> +       else if (insn->src_reg == BPF_PSEUDO_FUNC)
>> +               snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>> +                        "subprog[+%d]", insn->imm + 1);
> 
> why not `subprog[%+d]` instead (see above about confusing output)
> 
>>          else
>>                  snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>>                           "0x%llx", (unsigned long long)full_imm);
>> --
>> 2.24.1
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper
  2021-02-08 18:34   ` Andrii Nakryiko
@ 2021-02-09  6:46     ` Yonghong Song
  2021-02-09 17:36       ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-09  6:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team



On 2/8/21 10:34 AM, Andrii Nakryiko wrote:
> On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> A test case is added for hashmap and percpu hashmap. The test
>> also exercises nested bpf_for_each_map_elem() calls like
>>      bpf_prog:
>>        bpf_for_each_map_elem(func1)
>>      func1:
>>        bpf_for_each_map_elem(func2)
>>      func2:
>>
>>    $ ./test_progs -n 44
>>    #44/1 hash_map:OK
>>    #44 for_each:OK
>>    Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   .../selftests/bpf/prog_tests/for_each.c       |  91 ++++++++++++++++
>>   .../bpf/progs/for_each_hash_map_elem.c        | 103 ++++++++++++++++++
>>   2 files changed, 194 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c
>>
> 
> [...]
> 
>> +       num_cpus = bpf_num_possible_cpus();
>> +       percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
>> +       percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
>> +       if (CHECK_FAIL(!percpu_valbuf))
>> +               goto out;
>> +
>> +       key = 1;
>> +       for (i = 0; i < num_cpus; i++)
>> +               percpu_valbuf[i] = i + 1;
>> +       err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
>> +       if (CHECK(err, "percpu_map_update", "map_update failed\n"))
>> +               goto out;
>> +
>> +       do_dummy_read(skel->progs.dump_task);
> 
> why use iter/task programs to trigger your test BPF code? This test
> doesn't seem to rely on anything iter-specific, so it's much simpler
> (and less code) to just use the typical sys_enter approach with
> usleep(1) as a trigger function, no?

I am aware of this. I did not change it in v1 mainly because I wanted to
get some comments on the API and verifier changes first.
I will use bpf_prog_test_run() to call the program in v2.
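
On the userspace side that would look roughly like this (sketch; assumes
the program is moved to a section type that test_run supports, e.g. a tc
classifier, and skel->progs.test_prog / pkt_v4 are placeholder names):

    __u32 duration = 0, retval;
    int err;

    err = bpf_prog_test_run(bpf_program__fd(skel->progs.test_prog),
                            1, &pkt_v4, sizeof(pkt_v4),
                            NULL, NULL, &retval, &duration);
    if (CHECK(err || retval, "test_run", "err %d retval %d\n", err, retval))
        goto out;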

> 
>> +
>> +       ASSERT_EQ(skel->bss->called, 1, "called");
>> +       ASSERT_EQ(skel->bss->hashmap_output, 4, "output_val");
>> +
>> +       key = 1;
>> +       err = bpf_map_lookup_elem(hashmap_fd, &key, &val);
>> +       ASSERT_ERR(err, "hashmap_lookup");
>> +
>> +       ASSERT_EQ(skel->bss->percpu_called, 1, "percpu_called");
>> +       CHECK_FAIL(skel->bss->cpu >= num_cpus);
> 
> please don't use CHECK_FAIL: use CHECK or one of ASSERT_xxx variants

We do not have ASSERT_GE; I may add one.
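
Something along the lines of the following in test_progs.h, modeled on the
existing ASSERT_EQ (sketch):

    #define ASSERT_GE(actual, expected, name) ({                        \
        static int duration = 0;                                        \
        typeof(actual) ___act = (actual);                               \
        typeof(expected) ___exp = (expected);                           \
        bool ___ok = ___act >= ___exp;                                  \
        CHECK(!___ok, (name),                                           \
              "unexpected %s: actual %lld < expected %lld\n",           \
              (name), (long long)(___act), (long long)(___exp));        \
        ___ok;                                                          \
    })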

> 
>> +       ASSERT_EQ(skel->bss->percpu_key, 1, "percpu_key");
>> +       ASSERT_EQ(skel->bss->percpu_val, skel->bss->cpu + 1, "percpu_val");
>> +       ASSERT_EQ(skel->bss->percpu_output, 100, "percpu_output");
>> +out:
>> +       free(percpu_valbuf);
>> +       for_each_hash_map_elem__destroy(skel);
>> +}
>> +
>> +void test_for_each(void)
>> +{
>> +       if (test__start_subtest("hash_map"))
>> +               test_hash_map();
>> +}
> 
> [...]
> 
>> +
>> +__u32 cpu = 0;
>> +__u32 percpu_called = 0;
>> +__u32 percpu_key = 0;
>> +__u64 percpu_val = 0;
>> +
>> +static __u64
>> +check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val,
>> +                 struct callback_ctx *data)
>> +{
>> +       percpu_called++;
>> +       cpu = bpf_get_smp_processor_id();
> 
> It's a bit counter-intuitive (at least I was confused initially) that
> for a per-cpu array for_each() will iterate only current CPU's
> elements. I think it's worthwhile to emphasize this in
> bpf_for_each_map_elem() documentation (at least), and call out in
> corresponding patches adding per-cpu array/hash iteration support.

Right. Will emphasize this in uapi bpf.h and also in comments in the code.

> 
>> +       percpu_key = *key;
>> +       percpu_val = *val;
>> +
>> +       bpf_for_each_map_elem(&hashmap, check_hash_elem, data, 0);
>> +       return 0;
>> +}
>> +
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 8/8] selftests/bpf: add arraymap test for bpf_for_each_map_elem() helper
  2021-02-08 18:35   ` Andrii Nakryiko
@ 2021-02-09  6:50     ` Yonghong Song
  2021-02-09 17:38       ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-09  6:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team



On 2/8/21 10:35 AM, Andrii Nakryiko wrote:
> On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> A test is added for arraymap and percpu arraymap. The test also
>> exercises the early return for the helper which does not
>> traverse all elements.
>>      $ ./test_progs -n 44
>>      #44/1 hash_map:OK
>>      #44/2 array_map:OK
>>      #44 for_each:OK
>>      Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   .../selftests/bpf/prog_tests/for_each.c       | 54 ++++++++++++++
>>   .../bpf/progs/for_each_array_map_elem.c       | 71 +++++++++++++++++++
>>   2 files changed, 125 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/progs/for_each_array_map_elem.c
>>
> 
> [...]
> 
>> +
>> +       arraymap_fd = bpf_map__fd(skel->maps.arraymap);
>> +       expected_total = 0;
>> +       max_entries = bpf_map__max_entries(skel->maps.arraymap);
>> +       for (i = 0; i < max_entries; i++) {
>> +               key = i;
>> +               val = i + 1;
>> +               /* skip the last iteration for expected total */
>> +               if (i != max_entries - 1)
>> +                       expected_total += val;
>> +               err = bpf_map_update_elem(arraymap_fd, &key, &val, BPF_ANY);
>> +               if (CHECK(err, "map_update", "map_update failed\n"))
>> +                       goto out;
>> +       }
>> +
>> +       num_cpus = bpf_num_possible_cpus();
>> +        percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
> 
> formatting is off, please check with checkfile.pl

Sure. will do.

> 
> 
>> +        percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
>> +        if (CHECK_FAIL(!percpu_valbuf))
> 
> please don't use CHECK_FAIL, ASSERT_PTR_OK would work nicely here

Okay.

> 
>> +                goto out;
>> +
>> +       key = 0;
>> +        for (i = 0; i < num_cpus; i++)
>> +                percpu_valbuf[i] = i + 1;
>> +       err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
>> +       if (CHECK(err, "percpu_map_update", "map_update failed\n"))
>> +               goto out;
>> +
>> +       do_dummy_read(skel->progs.dump_task);
> 
> see previous patch, iter/tasks seems like an overkill for these tests

Yes, will use bpf_prog_test_run() in v2.

> 
>> +
>> +       ASSERT_EQ(skel->bss->called, 1, "called");
>> +       ASSERT_EQ(skel->bss->arraymap_output, expected_total, "array_output");
>> +       ASSERT_EQ(skel->bss->cpu + 1, skel->bss->percpu_val, "percpu_val");
>> +
>> +out:
>> +       free(percpu_valbuf);
>> +       for_each_array_map_elem__destroy(skel);
>> +}
>> +
> 
> [...]
> 
>> +SEC("iter/task")
>> +int dump_task(struct bpf_iter__task *ctx)
>> +{
>> +       struct seq_file *seq =  ctx->meta->seq;
>> +       struct task_struct *task = ctx->task;
>> +       struct callback_ctx data;
>> +
>> +       /* only call once */
>> +       if (called > 0)
>> +               return 0;
>> +
>> +       called++;
>> +
>> +       data.output = 0;
>> +       bpf_for_each_map_elem(&arraymap, check_array_elem, &data, 0);
>> +       arraymap_output = data.output;
>> +
>> +       bpf_for_each_map_elem(&percpu_map, check_percpu_elem, 0, 0);
> 
> nit: NULL, 0 ?

We do not have NULL defined in vmlinux.h or bpf_helpers.h, hence I am using
0. I should at least use (void *)0 here, I think.

> 
>> +
>> +       return 0;
>> +}
>> --
>> 2.24.1
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 5/8] libbpf: support local function pointer relocation
  2021-02-08 18:52   ` Andrii Nakryiko
@ 2021-02-09  6:56     ` Yonghong Song
  2021-02-09 17:31       ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Yonghong Song @ 2021-02-09  6:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team



On 2/8/21 10:52 AM, Andrii Nakryiko wrote:
> On Thu, Feb 4, 2021 at 5:54 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> A new relocation RELO_LOCAL_FUNC is added to capture
>> local (static) function pointers loaded with ld_imm64
>> insns. Such ld_imm64 insns are marked with
>> BPF_PSEUDO_FUNC and will be passed to kernel so
>> kernel can replace them with proper actual jited
>> func addresses.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/lib/bpf/libbpf.c | 33 ++++++++++++++++++++++++++++++---
>>   1 file changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 2abbc3800568..a5146c9e3e06 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -188,6 +188,7 @@ enum reloc_type {
>>          RELO_CALL,
>>          RELO_DATA,
>>          RELO_EXTERN,
>> +       RELO_LOCAL_FUNC,
> 
> libbpf internally is using SUBPROG notation. I think "LOCAL" part is
> confusing, so I'd drop it. How about just RELO_SUBPROG? We can
> separately refactor these names to distinguish RELO_CALL from the new
> one. It would be more clear if RELO_CALL was called RELO_SUBPROG_CALL,
> and the new one either RELO_SUBPROG_ADDR or RELO_SUBPROG_REF (as in
> subprog reference)

Yes, we can use RELO_SUBPROG_ADDR.

> 
>>   };
>>
>>   struct reloc_desc {
>> @@ -574,6 +575,12 @@ static bool insn_is_subprog_call(const struct bpf_insn *insn)
>>                 insn->off == 0;
>>   }
>>
>> +static bool insn_is_pseudo_func(const struct bpf_insn *insn)
>> +{
>> +       return insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
> 
> there is is_ldimm64() function for this check (just move it up here,
> it's a single-liner)

I did not know about it. Will use it in the next revision.

> 
>> +              insn->src_reg == BPF_PSEUDO_FUNC;
>> +}
>> +
>>   static int
>>   bpf_object__init_prog(struct bpf_object *obj, struct bpf_program *prog,
>>                        const char *name, size_t sec_idx, const char *sec_name,
>> @@ -3395,6 +3402,16 @@ static int bpf_program__record_reloc(struct bpf_program *prog,
>>                  return 0;
>>          }
>>
>> +       if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
> 
> just move this check below the next if that checks !is_ldimm64, no
> need to do it here early.

Okay.

> 
>> +           GELF_ST_BIND(sym->st_info) == STB_LOCAL &&
>> +           GELF_ST_TYPE(sym->st_info) == STT_SECTION &&
>> +           shdr_idx == obj->efile.text_shndx) {
> 
> see above how RELO_CALL is handled: shdr_idx != 0 check is missing. We
> also validate that sym->st_value is multiple of BPF_INSN_SZ.

Okay. Will add additional checking.
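
I.e. roughly, mirroring the checks in the RELO_CALL branch (sketch, using
the RELO_SUBPROG_ADDR name from your other comment):

    if (!shdr_idx || shdr_idx != obj->efile.text_shndx) {
        pr_warn("prog '%s': bad subprog addr relo against '%s'\n",
                prog->name, sym_name);
        return -LIBBPF_ERRNO__RELOC;
    }
    if (sym->st_value % BPF_INSN_SZ) {
        pr_warn("prog '%s': bad subprog addr relo against '%s' at offset %zu\n",
                prog->name, sym_name, (size_t)sym->st_value);
        return -LIBBPF_ERRNO__RELOC;
    }
    reloc_desc->type = RELO_SUBPROG_ADDR;
    reloc_desc->insn_idx = insn_idx;
    reloc_desc->sym_off = sym->st_value;
    return 0;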

> 
>> +               reloc_desc->type = RELO_LOCAL_FUNC;
>> +               reloc_desc->insn_idx = insn_idx;
>> +               reloc_desc->sym_off = sym->st_value;
>> +               return 0;
>> +       }
>> +
>>          if (insn->code != (BPF_LD | BPF_IMM | BPF_DW)) {
> 
> feel free to use is_ldimm64 here as well, thanks!
> 
>>                  pr_warn("prog '%s': invalid relo against '%s' for insns[%d].code 0x%x\n",
>>                          prog->name, sym_name, insn_idx, insn->code);
>> @@ -6172,6 +6189,9 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)
>>                          }
>>                          relo->processed = true;
>>                          break;
>> +               case RELO_LOCAL_FUNC:
>> +                       insn[0].src_reg = BPF_PSEUDO_FUNC;
>> +                       /* fallthrough */
> 
> fallthrough into an empty break clause seems a bit weird... just break
> and leave the same comment as below?

Yes, "break" seems cleaner.
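
I.e. roughly (again using the RELO_SUBPROG_ADDR name):

    case RELO_SUBPROG_ADDR:
            insn[0].src_reg = BPF_PSEUDO_FUNC;
            /* will be handled as a follow up pass */
            break;
    case RELO_CALL:
            /* will be handled as a follow up pass */
            break;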

> 
>>                  case RELO_CALL:
>>                          /* will be handled as a follow up pass */
>>                          break;
>> @@ -6358,11 +6378,11 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
>>
>>          for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {
>>                  insn = &main_prog->insns[prog->sub_insn_off + insn_idx];
>> -               if (!insn_is_subprog_call(insn))
>> +               if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))
>>                          continue;
>>
>>                  relo = find_prog_insn_relo(prog, insn_idx);
>> -               if (relo && relo->type != RELO_CALL) {
>> +               if (relo && relo->type != RELO_CALL && relo->type != RELO_LOCAL_FUNC) {
>>                          pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",
>>                                  prog->name, insn_idx, relo->type);
>>                          return -LIBBPF_ERRNO__RELOC;
>> @@ -6374,8 +6394,15 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
>>                           * call always has imm = -1, but for static functions
>>                           * relocation is against STT_SECTION and insn->imm
>>                           * points to a start of a static function
>> +                        *
>> +                        * for local func relocation, the imm field encodes
>> +                        * the byte offset in the corresponding section.
>>                           */
>> -                       sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
>> +                       if (relo->type == RELO_CALL)
>> +                               sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
>> +                       else
>> +                               sub_insn_idx = relo->sym_off / BPF_INSN_SZ +
>> +                                              insn->imm / BPF_INSN_SZ + 1;
> 
> nit: keep it on a single line, it still fits within 100 characters and
> is easier to visually compare to RELO_CALL case.

Okay.

> 
>>                  } else {
>>                          /* if subprogram call is to a static function within
>>                           * the same ELF section, there won't be any relocation
> 
> don't we have to adjust insn->imm for this case as well? Let's add
> selftests to make sure this works.

This is for relo == NULL. I think my code (RELO_LOCAL_FUNC or
RELO_SUBPROG_ADDR) won't hit this since there are always relocations.
That is why I didn't do anything here.

> 
>> --
>> 2.24.1
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 5/8] libbpf: support local function pointer relocation
  2021-02-09  6:56     ` Yonghong Song
@ 2021-02-09 17:31       ` Andrii Nakryiko
  0 siblings, 0 replies; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-09 17:31 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Mon, Feb 8, 2021 at 10:56 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/8/21 10:52 AM, Andrii Nakryiko wrote:
> > On Thu, Feb 4, 2021 at 5:54 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> A new relocation RELO_LOCAL_FUNC is added to capture
> >> local (static) function pointers loaded with ld_imm64
> >> insns. Such ld_imm64 insns are marked with
> >> BPF_PSEUDO_FUNC and will be passed to kernel so
> >> kernel can replace them with proper actual jited
> >> func addresses.
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   tools/lib/bpf/libbpf.c | 33 ++++++++++++++++++++++++++++++---
> >>   1 file changed, 30 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index 2abbc3800568..a5146c9e3e06 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -188,6 +188,7 @@ enum reloc_type {
> >>          RELO_CALL,
> >>          RELO_DATA,
> >>          RELO_EXTERN,
> >> +       RELO_LOCAL_FUNC,
> >
> > libbpf internally is using SUBPROG notation. I think "LOCAL" part is
> > confusing, so I'd drop it. How about just RELO_SUBPROG? We can
> > separately refactor these names to distinguish RELO_CALL from the new
> > one. It would be more clear if RELO_CALL was called RELO_SUBPROG_CALL,
> > and the new one either RELO_SUBPROG_ADDR or RELO_SUBPROG_REF (as in
> > subprog reference)
>
> Yes, we can use RELO_SUBPROG_ADDR.
>
> >
> >>   };
> >>
> >>   struct reloc_desc {
> >> @@ -574,6 +575,12 @@ static bool insn_is_subprog_call(const struct bpf_insn *insn)
> >>                 insn->off == 0;
> >>   }
> >>
> >> +static bool insn_is_pseudo_func(const struct bpf_insn *insn)
> >> +{
> >> +       return insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
> >
> > there is is_ldimm64() function for this check (just move it up here,
> > it's a single-liner)
>
> I did not know it. Will use in the next revision.
>
> >
> >> +              insn->src_reg == BPF_PSEUDO_FUNC;
> >> +}
> >> +
> >>   static int
> >>   bpf_object__init_prog(struct bpf_object *obj, struct bpf_program *prog,
> >>                        const char *name, size_t sec_idx, const char *sec_name,
> >> @@ -3395,6 +3402,16 @@ static int bpf_program__record_reloc(struct bpf_program *prog,
> >>                  return 0;
> >>          }
> >>
> >> +       if (insn->code == (BPF_LD | BPF_IMM | BPF_DW) &&
> >
> > just move this check below the next if that checks !is_ldimm64, no
> > need to do it here early.
>
> Okay.
>
> >
> >> +           GELF_ST_BIND(sym->st_info) == STB_LOCAL &&
> >> +           GELF_ST_TYPE(sym->st_info) == STT_SECTION &&
> >> +           shdr_idx == obj->efile.text_shndx) {
> >
> > see above how RELO_CALL is handled: shdr_idx != 0 check is missing. We
> > also validate that sym->st_value is multiple of BPF_INSN_SZ.
>
> Okay. Will add additional checking.
>
> >
> >> +               reloc_desc->type = RELO_LOCAL_FUNC;
> >> +               reloc_desc->insn_idx = insn_idx;
> >> +               reloc_desc->sym_off = sym->st_value;
> >> +               return 0;
> >> +       }
> >> +
> >>          if (insn->code != (BPF_LD | BPF_IMM | BPF_DW)) {
> >
> > feel free to use is_ldimm64 here as well, thanks!
> >
> >>                  pr_warn("prog '%s': invalid relo against '%s' for insns[%d].code 0x%x\n",
> >>                          prog->name, sym_name, insn_idx, insn->code);
> >> @@ -6172,6 +6189,9 @@ bpf_object__relocate_data(struct bpf_object *obj, struct bpf_program *prog)
> >>                          }
> >>                          relo->processed = true;
> >>                          break;
> >> +               case RELO_LOCAL_FUNC:
> >> +                       insn[0].src_reg = BPF_PSEUDO_FUNC;
> >> +                       /* fallthrough */
> >
> > fallthrough into an empty break clause seems a bit weird... just break
> > and leave the same comment as below?
>
> Yes, "break" seems cleaner.
>
> >
> >>                  case RELO_CALL:
> >>                          /* will be handled as a follow up pass */
> >>                          break;
> >> @@ -6358,11 +6378,11 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
> >>
> >>          for (insn_idx = 0; insn_idx < prog->sec_insn_cnt; insn_idx++) {
> >>                  insn = &main_prog->insns[prog->sub_insn_off + insn_idx];
> >> -               if (!insn_is_subprog_call(insn))
> >> +               if (!insn_is_subprog_call(insn) && !insn_is_pseudo_func(insn))
> >>                          continue;
> >>
> >>                  relo = find_prog_insn_relo(prog, insn_idx);
> >> -               if (relo && relo->type != RELO_CALL) {
> >> +               if (relo && relo->type != RELO_CALL && relo->type != RELO_LOCAL_FUNC) {
> >>                          pr_warn("prog '%s': unexpected relo for insn #%zu, type %d\n",
> >>                                  prog->name, insn_idx, relo->type);
> >>                          return -LIBBPF_ERRNO__RELOC;
> >> @@ -6374,8 +6394,15 @@ bpf_object__reloc_code(struct bpf_object *obj, struct bpf_program *main_prog,
> >>                           * call always has imm = -1, but for static functions
> >>                           * relocation is against STT_SECTION and insn->imm
> >>                           * points to a start of a static function
> >> +                        *
> >> +                        * for local func relocation, the imm field encodes
> >> +                        * the byte offset in the corresponding section.
> >>                           */
> >> -                       sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
> >> +                       if (relo->type == RELO_CALL)
> >> +                               sub_insn_idx = relo->sym_off / BPF_INSN_SZ + insn->imm + 1;
> >> +                       else
> >> +                               sub_insn_idx = relo->sym_off / BPF_INSN_SZ +
> >> +                                              insn->imm / BPF_INSN_SZ + 1;
> >
> > nit: keep it on a single line, it still fits within 100 characters and
> > is easier to visually compare to RELO_CALL case.
>
> Okay.
>
> >
> >>                  } else {
> >>                          /* if subprogram call is to a static function within
> >>                           * the same ELF section, there won't be any relocation
> >
> > don't we have to adjust insn->imm for this case as well? Let's add
> > selftests to make sure this works.
>
> This is for relo == NULL. I think my code (RELO_LOCAL_FUNC or
> RELO_SUBPROG_ADDR) won't hit this since there are always relocations.
> That is why I didn't do anything here.

Hm.. is it handled differently from calls then? When there is a call
from one static subprog to another static subprog, there is no
relocation (because the compiler can guarantee that both functions are
in the same section and their relative location won't change,
presumably). Why is this not the case for getting the address of a
function here?

Regardless, if this can't happen, then let's check and error out,
instead of silently producing invalid values.
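
Something along these lines, perhaps (a sketch, exact wording up to you):

    } else if (insn_is_pseudo_func(insn)) {
            /* a function address load should always come with a
             * relocation, so error out instead of producing a bad insn
             */
            pr_warn("prog '%s': missing subprog addr relo for insn #%zu\n",
                    prog->name, insn_idx);
            return -LIBBPF_ERRNO__RELOC;
    } else {
            /* existing no-relocation subprog call handling */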


>
> >
> >> --
> >> 2.24.1
> >>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper
  2021-02-09  6:41     ` Yonghong Song
@ 2021-02-09 17:33       ` Andrii Nakryiko
  0 siblings, 0 replies; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-09 17:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Mon, Feb 8, 2021 at 10:41 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/8/21 10:16 AM, Andrii Nakryiko wrote:
> > On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> The bpf_for_each_map_elem() helper is introduced which
> >> iterates all map elements with a callback function. The
> >> helper signature looks like
> >>    long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
> >> and for each map element, the callback_fn will be called. For example,
> >> like hashmap, the callback signature may look like
> >>    long callback_fn(map, key, val, callback_ctx)
> >>
> >> There are two known use cases for this. One is from upstream ([1]) where
> >> a for_each_map_elem helper may help implement a timeout mechanism
> >> in a more generic way. Another is from our internal discussion
> >> for a firewall use case where a map contains all the rules. The packet
> >> data can be compared to all these rules to decide allow or deny
> >> the packet.
> >>
> >> For array maps, users can already use a bounded loop to traverse
> >> elements. Using this helper can avoid using bounded loop. For other
> >> type of maps (e.g., hash maps) where bounded loop is hard or
> >> impossible to use, this helper provides a convenient way to
> >> operate on all elements.
> >>
> >> For callback_fn, besides map and map element, a callback_ctx,
> >> allocated on caller stack, is also passed to the callback
> >> function. This callback_ctx argument can provide additional
> >> input and allow to write to caller stack for output.
> >>
> >> If the callback_fn returns 0, the helper will iterate through next
> >> element if available. If the callback_fn returns 1, the helper
> >> will stop iterating and returns to the bpf program. Other return
> >> values are not used for now.
> >>
> >> Currently, this helper is only available with jit. It is possible
> >> to make it work with interpreter with some effort but I leave it
> >> as the future work.
> >>
> >> [1]: https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@gmail.com/
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >
> > This is a great feature! Few questions and nits below.
> >
> >>   include/linux/bpf.h            |  14 ++
> >>   include/linux/bpf_verifier.h   |   3 +
> >>   include/uapi/linux/bpf.h       |  28 ++++
> >>   kernel/bpf/bpf_iter.c          |  16 +++
> >>   kernel/bpf/helpers.c           |   2 +
> >>   kernel/bpf/verifier.c          | 251 ++++++++++++++++++++++++++++++---
> >>   kernel/trace/bpf_trace.c       |   2 +
> >>   tools/include/uapi/linux/bpf.h |  28 ++++
> >>   8 files changed, 328 insertions(+), 16 deletions(-)
> >>
> >
> > [...]
> >
> >>   const struct bpf_func_proto *bpf_tracing_func_proto(
> >>          enum bpf_func_id func_id, const struct bpf_prog *prog);
> >> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> >> index dfe6f85d97dd..c4366b3da342 100644
> >> --- a/include/linux/bpf_verifier.h
> >> +++ b/include/linux/bpf_verifier.h
> >> @@ -68,6 +68,8 @@ struct bpf_reg_state {
> >>                          unsigned long raw1;
> >>                          unsigned long raw2;
> >>                  } raw;
> >> +
> >> +               u32 subprog; /* for PTR_TO_FUNC */
> >
> > is it offset to subprog (in bytes or instructions?) or it's subprog
> > index? Let's make it clear with a better name or at least a comment.
>
> This is the subprog number (or an index into some subprog-related
> arrays). In verifier.c, subprog or subprogno is used to represent the
> subprog number. I will use subprogno in the next revision.
>

yeah, that's more clear

> >
> >>          };
> >>          /* For PTR_TO_PACKET, used to find other pointers with the same variable
> >>           * offset, so they can share range knowledge.
> >> @@ -204,6 +206,7 @@ struct bpf_func_state {
> >>          int acquired_refs;
> >>          struct bpf_reference_state *refs;
> >>          int allocated_stack;
> >> +       bool with_callback_fn;
> >>          struct bpf_stack_state *stack;
> >>   };
> >>
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index c001766adcbc..d55bd4557376 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -393,6 +393,15 @@ enum bpf_link_type {
> >>    *                   is struct/union.
> >>    */
> >>   #define BPF_PSEUDO_BTF_ID      3
> >> +/* insn[0].src_reg:  BPF_PSEUDO_FUNC
> >> + * insn[0].imm:      insn offset to the func
> >> + * insn[1].imm:      0
> >> + * insn[0].off:      0
> >> + * insn[1].off:      0
> >> + * ldimm64 rewrite:  address of the function
> >> + * verifier type:    PTR_TO_FUNC.
> >> + */
> >> +#define BPF_PSEUDO_FUNC                4
> >>
> >>   /* when bpf_call->src_reg == BPF_PSEUDO_CALL, bpf_call->imm == pc-relative
> >>    * offset to another bpf function
> >> @@ -3836,6 +3845,24 @@ union bpf_attr {
> >>    *     Return
> >>    *             A pointer to a struct socket on success or NULL if the file is
> >>    *             not a socket.
> >> + *
> >> + * long bpf_for_each_map_elem(struct bpf_map *map, void *callback_fn, void *callback_ctx, u64 flags)
> >
> > struct bpf_map * here might be problematic. In other instances where
> > we pass map (bpf_map_update_elem, for example) we specify this as
> > (void *). Let's do that instead here?
>
> We should be fine here. bpf_map_lookup_elem etc. all have "struct
> bpf_map *map"; it is rewritten by bpf_helpers_doc.py to "void *map".

ok, cool
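
For reference, the generated definition in bpf_helper_defs.h then ends
up looking roughly like this (the numeric helper id at the end is
assigned by the script from the UAPI enum, so treat it as a placeholder):

    static long (*bpf_for_each_map_elem)(void *map, void *callback_fn,
                                         void *callback_ctx, __u64 flags) = (void *) 164;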

>
> >
> >> + *     Description
> >> + *             For each element in **map**, call **callback_fn** function with
> >> + *             **map**, **callback_ctx** and other map-specific parameters.
> >> + *             For example, for hash and array maps, the callback signature can
> >> + *             be `u64 callback_fn(map, map_key, map_value, callback_ctx)`.
> >> + *             The **callback_fn** should be a static function and
> >> + *             the **callback_ctx** should be a pointer to the stack.
> >> + *             The **flags** is used to control certain aspects of the helper.
> >> + *             Currently, the **flags** must be 0.
> >> + *
> >> + *             If **callback_fn** return 0, the helper will continue to the next
> >> + *             element. If return value is 1, the helper will skip the rest of
> >> + *             elements and return. Other return values are not used now.
> >> + *     Return
> >> + *             0 for success, **-EINVAL** for invalid **flags** or unsupported
> >> + *             **callback_fn** return value.
> >
> > just a thought: returning the number of elements *actually* iterated
> > seems useful (even though I don't have a specific use case right now).
>
> Good idea. Will change to this in the next revision.
>
> >
> >>    */
> >>   #define __BPF_FUNC_MAPPER(FN)          \
> >>          FN(unspec),                     \
> >> @@ -4001,6 +4028,7 @@ union bpf_attr {
> >>          FN(ktime_get_coarse_ns),        \
> >>          FN(ima_inode_hash),             \
> >>          FN(sock_from_file),             \
> >> +       FN(for_each_map_elem),          \
> >
> > to be more in sync with other map operations, can we call this
> > `bpf_map_for_each_elem`? I think it makes sense and doesn't read
> > backwards at all.
>
> I am using for_each prefix as in the future I (or others) may add
> more for_each_* helpers, e.g., for_each_task, for_each_hlist_rcu, etc.
> This represents a family of helpers with callback functions. So I
> would like to stick with for_each_* names.
>

fair enough, not a big deal

> >
> >>          /* */
> >>
> >>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> >> index 5454161407f1..5187f49d3216 100644
> >> --- a/kernel/bpf/bpf_iter.c
> >> +++ b/kernel/bpf/bpf_iter.c
> >> @@ -675,3 +675,19 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx)
> >>           */
> >>          return ret == 0 ? 0 : -EAGAIN;
> >>   }
> >> +
> >> +BPF_CALL_4(bpf_for_each_map_elem, struct bpf_map *, map, void *, callback_fn,
> >> +          void *, callback_ctx, u64, flags)
> >> +{
> >> +       return map->ops->map_for_each_callback(map, callback_fn, callback_ctx, flags);
> >> +}
> >> +
> >> +const struct bpf_func_proto bpf_for_each_map_elem_proto = {
> >> +       .func           = bpf_for_each_map_elem,
> >> +       .gpl_only       = false,
> >> +       .ret_type       = RET_INTEGER,
> >> +       .arg1_type      = ARG_CONST_MAP_PTR,
> >> +       .arg2_type      = ARG_PTR_TO_FUNC,
> >> +       .arg3_type      = ARG_PTR_TO_STACK_OR_NULL,
> >
> > I looked through this code just once but haven't noticed anything that
> > would strictly require that pointer is specifically to stack. Can this
> > be made into a pointer to any allocated memory? E.g., why can't we
> > allow passing a pointer to a ringbuf sample, for instance? Or
> > MAP_VALUE?
>
> ARG_PTR_TO_STACK_OR_NULL is the most flexible one. For example, if you
> want to pass map_value or ringbuf sample, you can assign these values
> to the stack like
>     struct ctx_t {
>        struct map_value_t *map_value;
>        char *ringbuf_mem;
>     } tmp;
>     tmp.map_value = ...;
>     tmp.ringbuf_mem = ...;
>     bpf_for_each_map_elem(map, callback_fn, &tmp, flags);
> and callback_fn will be able to access map_value/ringbuf_mem
> with their original register types.
>
> This does not allow passing ringbuf/map_value etc. as first-class
> citizens. But I think this is a good compromise to permit greater
> flexibility.

Yeah, thanks for the explanation.
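
For anyone reading along, a slightly fuller sketch of that pattern (the
map names and value types here are made up for illustration):

    struct callback_ctx {
            __u64 *cfg;     /* map value obtained via bpf_map_lookup_elem() */
            __u64 output;
    };

    static __u64 check_elem(struct bpf_map *map, __u32 *key, __u64 *val,
                            struct callback_ctx *data)
    {
            if (*val > *data->cfg)
                    data->output++;
            return 0;       /* 0 means continue to the next element */
    }

    /* in the program body */
    __u32 zero = 0;
    __u64 *cfg = bpf_map_lookup_elem(&config_map, &zero);
    struct callback_ctx data = {};

    if (!cfg)
            return 0;
    data.cfg = cfg;
    bpf_for_each_map_elem(&hashmap, check_elem, &data, 0);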

>
> >
> >> +       .arg4_type      = ARG_ANYTHING,
> >> +};
> >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> >> index 308427fe03a3..074800226327 100644
> >> --- a/kernel/bpf/helpers.c
> >> +++ b/kernel/bpf/helpers.c
> >> @@ -708,6 +708,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> >>                  return &bpf_ringbuf_discard_proto;
> >>          case BPF_FUNC_ringbuf_query:
> >>                  return &bpf_ringbuf_query_proto;
> >> +       case BPF_FUNC_for_each_map_elem:
> >> +               return &bpf_for_each_map_elem_proto;
> >>          default:
> >>                  break;
> >>          }
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index db294b75d03b..050b067a0be6 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -234,6 +234,12 @@ static bool bpf_pseudo_call(const struct bpf_insn *insn)
> >>                 insn->src_reg == BPF_PSEUDO_CALL;
> >>   }
> >>
> >
> > [...]
> >
> >>          map = env->used_maps[aux->map_index];
> >>          mark_reg_known_zero(env, regs, insn->dst_reg);
> >>          dst_reg->map_ptr = map;
> >> @@ -8195,9 +8361,23 @@ static int visit_insn(int t, int insn_cnt, struct bpf_verifier_env *env)
> >>
> >>          /* All non-branch instructions have a single fall-through edge. */
> >>          if (BPF_CLASS(insns[t].code) != BPF_JMP &&
> >> -           BPF_CLASS(insns[t].code) != BPF_JMP32)
> >> +           BPF_CLASS(insns[t].code) != BPF_JMP32 &&
> >> +           !bpf_pseudo_func(insns + t))
> >>                  return push_insn(t, t + 1, FALLTHROUGH, env, false);
> >>
> >> +       if (bpf_pseudo_func(insns + t)) {
> >
> >
> > if you check this before above JMP|JMP32 check, you won't need to do
> > !bpf_pseudo_func, right? I think it's cleaner.
>
> Agree. will change in v2.
>
> >
> >> +               ret = push_insn(t, t + 1, FALLTHROUGH, env, false);
> >> +               if (ret)
> >> +                       return ret;
> >> +
> >> +               if (t + 1 < insn_cnt)
> >> +                       init_explored_state(env, t + 1);
> >> +               init_explored_state(env, t);
> >> +               ret = push_insn(t, t + insns[t].imm + 1, BRANCH,
> >> +                               env, false);
> >> +               return ret;
> >> +       }
> >> +
> >>          switch (BPF_OP(insns[t].code)) {
> >>          case BPF_EXIT:
> >>                  return DONE_EXPLORING;
> >
> > [...]
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper
  2021-02-09  6:46     ` Yonghong Song
@ 2021-02-09 17:36       ` Andrii Nakryiko
  0 siblings, 0 replies; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-09 17:36 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Mon, Feb 8, 2021 at 10:46 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/8/21 10:34 AM, Andrii Nakryiko wrote:
> > On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> A test case is added for hashmap and percpu hashmap. The test
> >> also exercises nested bpf_for_each_map_elem() calls like
> >>      bpf_prog:
> >>        bpf_for_each_map_elem(func1)
> >>      func1:
> >>        bpf_for_each_map_elem(func2)
> >>      func2:
> >>
> >>    $ ./test_progs -n 44
> >>    #44/1 hash_map:OK
> >>    #44 for_each:OK
> >>    Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   .../selftests/bpf/prog_tests/for_each.c       |  91 ++++++++++++++++
> >>   .../bpf/progs/for_each_hash_map_elem.c        | 103 ++++++++++++++++++
> >>   2 files changed, 194 insertions(+)
> >>   create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each.c
> >>   create mode 100644 tools/testing/selftests/bpf/progs/for_each_hash_map_elem.c
> >>
> >
> > [...]
> >
> >> +       num_cpus = bpf_num_possible_cpus();
> >> +       percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
> >> +       percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
> >> +       if (CHECK_FAIL(!percpu_valbuf))
> >> +               goto out;
> >> +
> >> +       key = 1;
> >> +       for (i = 0; i < num_cpus; i++)
> >> +               percpu_valbuf[i] = i + 1;
> >> +       err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
> >> +       if (CHECK(err, "percpu_map_update", "map_update failed\n"))
> >> +               goto out;
> >> +
> >> +       do_dummy_read(skel->progs.dump_task);
> >
> > why use iter/task programs to trigger your test BPF code? This test
> > doesn't seem to rely on anything iter-specific, so it's much simpler
> > (and less code) to just use the typical sys_enter approach with
> > usleep(1) as a trigger function, no?
>
> I am aware of this. I did not change this in v1 mainly because I
> wanted to get some comments on the API and verifier changes etc. for
> v1. I will use bpf_prog_test_run() to call the program in v2.
>
> >
> >> +
> >> +       ASSERT_EQ(skel->bss->called, 1, "called");
> >> +       ASSERT_EQ(skel->bss->hashmap_output, 4, "output_val");
> >> +
> >> +       key = 1;
> >> +       err = bpf_map_lookup_elem(hashmap_fd, &key, &val);
> >> +       ASSERT_ERR(err, "hashmap_lookup");
> >> +
> >> +       ASSERT_EQ(skel->bss->percpu_called, 1, "percpu_called");
> >> +       CHECK_FAIL(skel->bss->cpu >= num_cpus);
> >
> > please don't use CHECK_FAIL: use CHECK or one of ASSERT_xxx variants
>
> We do not have ASSERT_GE; I may invent one.

Yeah, it has come up multiple times now. It would be great to have all
the inequality variants so that we can stop adding new CHECK()s, which
have notoriously confusing semantics. Thanks!
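
A sketch of what ASSERT_GE could look like, modeled on the existing
ASSERT_EQ in test_progs.h:

    #define ASSERT_GE(actual, expected, name) ({                        \
            static int duration = 0;                                    \
            typeof(actual) ___act = (actual);                           \
            typeof(expected) ___exp = (expected);                       \
            bool ___ok = ___act >= ___exp;                              \
            CHECK(!___ok, (name),                                       \
                  "unexpected %s: actual %lld < expected %lld\n",       \
                  (name), (long long)(___act), (long long)(___exp));    \
            ___ok;                                                      \
    })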

>
> >
> >> +       ASSERT_EQ(skel->bss->percpu_key, 1, "percpu_key");
> >> +       ASSERT_EQ(skel->bss->percpu_val, skel->bss->cpu + 1, "percpu_val");
> >> +       ASSERT_EQ(skel->bss->percpu_output, 100, "percpu_output");
> >> +out:
> >> +       free(percpu_valbuf);
> >> +       for_each_hash_map_elem__destroy(skel);
> >> +}
> >> +
> >> +void test_for_each(void)
> >> +{
> >> +       if (test__start_subtest("hash_map"))
> >> +               test_hash_map();
> >> +}
> >
> > [...]
> >
> >> +
> >> +__u32 cpu = 0;
> >> +__u32 percpu_called = 0;
> >> +__u32 percpu_key = 0;
> >> +__u64 percpu_val = 0;
> >> +
> >> +static __u64
> >> +check_percpu_elem(struct bpf_map *map, __u32 *key, __u64 *val,
> >> +                 struct callback_ctx *data)
> >> +{
> >> +       percpu_called++;
> >> +       cpu = bpf_get_smp_processor_id();
> >
> > It's a bit counter-intuitive (at least I was confused initially) that
> > for a per-cpu array for_each() will iterate only current CPU's
> > elements. I think it's worthwhile to emphasize this in
> > bpf_for_each_map_elem() documentation (at least), and call out in
> > corresponding patches adding per-cpu array/hash iteration support.
>
> Right. Will emphasize this in uapi bpf.h and also comments in the code.
>
> >
> >> +       percpu_key = *key;
> >> +       percpu_val = *val;
> >> +
> >> +       bpf_for_each_map_elem(&hashmap, check_hash_elem, data, 0);
> >> +       return 0;
> >> +}
> >> +
> >
> > [...]
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 8/8] selftests/bpf: add arraymap test for bpf_for_each_map_elem() helper
  2021-02-09  6:50     ` Yonghong Song
@ 2021-02-09 17:38       ` Andrii Nakryiko
  0 siblings, 0 replies; 28+ messages in thread
From: Andrii Nakryiko @ 2021-02-09 17:38 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Cong Wang, Daniel Borkmann, Kernel Team

On Mon, Feb 8, 2021 at 10:50 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 2/8/21 10:35 AM, Andrii Nakryiko wrote:
> > On Thu, Feb 4, 2021 at 5:53 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> A test is added for arraymap and percpu arraymap. The test also
> >> exercises the early return for the helper which does not
> >> traverse all elements.
> >>      $ ./test_progs -n 44
> >>      #44/1 hash_map:OK
> >>      #44/2 array_map:OK
> >>      #44 for_each:OK
> >>      Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> >>
> >> Signed-off-by: Yonghong Song <yhs@fb.com>
> >> ---
> >>   .../selftests/bpf/prog_tests/for_each.c       | 54 ++++++++++++++
> >>   .../bpf/progs/for_each_array_map_elem.c       | 71 +++++++++++++++++++
> >>   2 files changed, 125 insertions(+)
> >>   create mode 100644 tools/testing/selftests/bpf/progs/for_each_array_map_elem.c
> >>
> >
> > [...]
> >
> >> +
> >> +       arraymap_fd = bpf_map__fd(skel->maps.arraymap);
> >> +       expected_total = 0;
> >> +       max_entries = bpf_map__max_entries(skel->maps.arraymap);
> >> +       for (i = 0; i < max_entries; i++) {
> >> +               key = i;
> >> +               val = i + 1;
> >> +               /* skip the last iteration for expected total */
> >> +               if (i != max_entries - 1)
> >> +                       expected_total += val;
> >> +               err = bpf_map_update_elem(arraymap_fd, &key, &val, BPF_ANY);
> >> +               if (CHECK(err, "map_update", "map_update failed\n"))
> >> +                       goto out;
> >> +       }
> >> +
> >> +       num_cpus = bpf_num_possible_cpus();
> >> +        percpu_map_fd = bpf_map__fd(skel->maps.percpu_map);
> >
> > formatting is off, please check with checkpatch.pl
>
> Sure. will do.
>
> >
> >
> >> +        percpu_valbuf = malloc(sizeof(__u64) * num_cpus);
> >> +        if (CHECK_FAIL(!percpu_valbuf))
> >
> > please don't use CHECK_FAIL, ASSERT_PTR_OK would work nicely here
>
> Okay.
>
> >
> >> +                goto out;
> >> +
> >> +       key = 0;
> >> +        for (i = 0; i < num_cpus; i++)
> >> +                percpu_valbuf[i] = i + 1;
> >> +       err = bpf_map_update_elem(percpu_map_fd, &key, percpu_valbuf, BPF_ANY);
> >> +       if (CHECK(err, "percpu_map_update", "map_update failed\n"))
> >> +               goto out;
> >> +
> >> +       do_dummy_read(skel->progs.dump_task);
> >
> > see previous patch, iter/tasks seems like an overkill for these tests
>
> Yes, will use bpf_prog_test_run() in v2.
>
> >
> >> +
> >> +       ASSERT_EQ(skel->bss->called, 1, "called");
> >> +       ASSERT_EQ(skel->bss->arraymap_output, expected_total, "array_output");
> >> +       ASSERT_EQ(skel->bss->cpu + 1, skel->bss->percpu_val, "percpu_val");
> >> +
> >> +out:
> >> +       free(percpu_valbuf);
> >> +       for_each_array_map_elem__destroy(skel);
> >> +}
> >> +
> >
> > [...]
> >
> >> +SEC("iter/task")
> >> +int dump_task(struct bpf_iter__task *ctx)
> >> +{
> >> +       struct seq_file *seq =  ctx->meta->seq;
> >> +       struct task_struct *task = ctx->task;
> >> +       struct callback_ctx data;
> >> +
> >> +       /* only call once */
> >> +       if (called > 0)
> >> +               return 0;
> >> +
> >> +       called++;
> >> +
> >> +       data.output = 0;
> >> +       bpf_for_each_map_elem(&arraymap, check_array_elem, &data, 0);
> >> +       arraymap_output = data.output;
> >> +
> >> +       bpf_for_each_map_elem(&percpu_map, check_percpu_elem, 0, 0);
> >
> > nit: NULL, 0 ?
>
> We do not have NULL defined in vmlinux.h or bpf_helpers.h. Hence I am
> using 0; I should at least use (void *)0 here, I think.

yeah, this NULL problem is annoying. Let's just add NULL to vmlinux.h;
you are not the first one to complain. We can do that separately from
this patch set, of course.

>
> >
> >> +
> >> +       return 0;
> >> +}
> >> --
> >> 2.24.1
> >>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2021-02-09 17:40 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-04 23:48 [PATCH bpf-next 0/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
2021-02-04 23:48 ` [PATCH bpf-next 1/8] bpf: refactor BPF_PSEUDO_CALL checking as a helper function Yonghong Song
2021-02-05  5:59   ` Alexei Starovoitov
2021-02-04 23:48 ` [PATCH bpf-next 2/8] bpf: add bpf_for_each_map_elem() helper Yonghong Song
2021-02-05  5:49   ` Alexei Starovoitov
2021-02-05 17:39     ` Yonghong Song
2021-02-08 18:16   ` Andrii Nakryiko
2021-02-09  6:41     ` Yonghong Song
2021-02-09 17:33       ` Andrii Nakryiko
2021-02-04 23:48 ` [PATCH bpf-next 3/8] bpf: add hashtab support for " Yonghong Song
2021-02-05  6:23   ` Alexei Starovoitov
2021-02-05 17:49     ` Yonghong Song
2021-02-04 23:48 ` [PATCH bpf-next 4/8] bpf: add arraymap " Yonghong Song
2021-02-04 23:48 ` [PATCH bpf-next 5/8] libbpf: support local function pointer relocation Yonghong Song
2021-02-08 18:52   ` Andrii Nakryiko
2021-02-09  6:56     ` Yonghong Song
2021-02-09 17:31       ` Andrii Nakryiko
2021-02-04 23:48 ` [PATCH bpf-next 6/8] bpftool: print local function pointer properly Yonghong Song
2021-02-08 18:22   ` Andrii Nakryiko
2021-02-09  6:42     ` Yonghong Song
2021-02-04 23:48 ` [PATCH bpf-next 7/8] selftests/bpf: add hashmap test for bpf_for_each_map_elem() helper Yonghong Song
2021-02-08 18:34   ` Andrii Nakryiko
2021-02-09  6:46     ` Yonghong Song
2021-02-09 17:36       ` Andrii Nakryiko
2021-02-04 23:48 ` [PATCH bpf-next 8/8] selftests/bpf: add arraymap " Yonghong Song
2021-02-08 18:35   ` Andrii Nakryiko
2021-02-09  6:50     ` Yonghong Song
2021-02-09 17:38       ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).