linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v3 0/3] Introduce BPF map tracing capability
@ 2021-11-02  2:14 Joe Burton
  2021-11-02  2:14 ` [RFC PATCH v3 1/3] bpf: Add map tracing functions and call sites Joe Burton
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Joe Burton @ 2021-11-02  2:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf
  Cc: Petar Penkov, Stanislav Fomichev, Joe Burton

From: Joe Burton <jevburton@google.com>

This is the third version of a patch series implementing map tracing.

Map tracing enables executing BPF programs upon BPF map updates. This
might be useful to perform upgrades of stateful programs; e.g., tracing
programs can propagate changes to maps that occur during an upgrade
operation.

This version uses trampoline hooks to provide the capability.
fentry/fexit/fmod_ret programs can attach to two new functions:
        int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
                void* val, u32 flags);
        int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);

These hooks work as intended for the following map types:
        BPF_MAP_TYPE_ARRAY
        BPF_MAP_TYPE_PERCPU_ARRAY
        BPF_MAP_TYPE_HASH
        BPF_MAP_TYPE_PERCPU_HASH
        BPF_MAP_TYPE_LRU_HASH
        BPF_MAP_TYPE_LRU_PERCPU_HASH

The only guarantee about the semantics of these hooks is that they execute
before the operation takes place. We cannot call them with locks held
because the hooked program might try to acquire the same locks. Thus they
may be invoked in situations where the traced map is not ultimately
updated.
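
For example, a tracing program can attach to the update hook like this
(minimal sketch; the full selftests are in patch 2):

        SEC("fentry/bpf_map_trace_update_elem")
        int BPF_PROG(trace_update, struct bpf_map *map, void *key,
                     void *value, u64 map_flags)
        {
                /* Runs before the traced map is written. */
                return 0;
        }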

The original proposal suggested exposing a function for each
(map type) x (access type). The problem I encountered is that e.g.
percpu hashtables use a custom function for some access types
(htab_percpu_map_update_elem) but a common function for others
(htab_map_delete_elem). Thus a userspace application would have to
maintain a unique list of functions to attach to for each map type;
moreover, this list could change across kernel versions. Map tracing is
easier to use with fewer functions, at the cost of tracing programs
being triggered more times.

To prevent the compiler from optimizing out the calls to my tracing
functions, I use the asm("") trick described in gcc's
__attribute__((noinline)) documentation. Experimentally, this trick
works with clang as well.
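
The hooks themselves are no-op functions whose only side effect is that
asm("") (see patch 1):

        noinline int bpf_map_trace_update_elem(struct bpf_map *map, void *key,
                                               void *value, u64 map_flags)
        {
                /* Noop side effect prevents call site from being optimized out. */
                asm("");
                return 0;
        }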

Joe Burton (3):
  bpf: Add map tracing functions and call sites
  bpf: Add selftests
  bpf: Add real world example for map tracing

 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/arraymap.c                         |   6 +
 kernel/bpf/hashtab.c                          |  25 ++
 kernel/bpf/map_trace.c                        |  25 ++
 kernel/bpf/map_trace.h                        |  18 +
 .../selftests/bpf/prog_tests/map_trace.c      | 422 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_map_trace.c       |  95 ++++
 .../bpf/progs/bpf_map_trace_common.h          |  12 +
 .../progs/bpf_map_trace_real_world_common.h   | 125 ++++++
 .../bpf_map_trace_real_world_migration.c      | 102 +++++
 .../bpf/progs/bpf_map_trace_real_world_new.c  |   4 +
 .../bpf/progs/bpf_map_trace_real_world_old.c  |   5 +
 12 files changed, 840 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/map_trace.c
 create mode 100644 kernel/bpf/map_trace.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_trace.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_migration.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_new.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_old.c

-- 
2.33.1.1089.g2158813163f-goog



* [RFC PATCH v3 1/3] bpf: Add map tracing functions and call sites
  2021-11-02  2:14 [RFC PATCH v3 0/3] Introduce BPF map tracing capability Joe Burton
@ 2021-11-02  2:14 ` Joe Burton
  2021-11-02  2:14 ` [RFC PATCH v3 2/3] bpf: Add selftests Joe Burton
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Joe Burton @ 2021-11-02  2:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf
  Cc: Petar Penkov, Stanislav Fomichev, Joe Burton

From: Joe Burton <jevburton@google.com>

Add two functions that fentry/fexit/fmod_ret programs can attach to:
	bpf_map_trace_update_elem
	bpf_map_trace_delete_elem
These functions have the same arguments as
bpf_map_{update,delete}_elem.

Invoke these functions from the following map types:
	BPF_MAP_TYPE_ARRAY
	BPF_MAP_TYPE_PERCPU_ARRAY
	BPF_MAP_TYPE_HASH
	BPF_MAP_TYPE_PERCPU_HASH
	BPF_MAP_TYPE_LRU_HASH
	BPF_MAP_TYPE_LRU_PERCPU_HASH

The only guarantee about these functions is that they are invoked
before the corresponding action occurs. Other conditions may prevent
the corresponding action from occurring after the function is invoked.

Signed-off-by: Joe Burton <jevburton@google.com>
---
 kernel/bpf/Makefile    |  2 +-
 kernel/bpf/arraymap.c  |  6 ++++++
 kernel/bpf/hashtab.c   | 25 +++++++++++++++++++++++++
 kernel/bpf/map_trace.c | 25 +++++++++++++++++++++++++
 kernel/bpf/map_trace.h | 18 ++++++++++++++++++
 5 files changed, 75 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/map_trace.c
 create mode 100644 kernel/bpf/map_trace.h

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index cf6ca339f3cd..03ab5c058e73 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -9,7 +9,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
-obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o map_trace.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 5e1ccfae916b..a0b4f1769e17 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -13,6 +13,7 @@
 #include <linux/rcupdate_trace.h>
 
 #include "map_in_map.h"
+#include "map_trace.h"
 
 #define ARRAY_CREATE_FLAG_MASK \
 	(BPF_F_NUMA_NODE | BPF_F_MMAPABLE | BPF_F_ACCESS_MASK | \
@@ -300,6 +301,7 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
 	char *val;
+	int err;
 
 	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
@@ -317,6 +319,10 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 		     !map_value_has_spin_lock(map)))
 		return -EINVAL;
 
+	err = bpf_map_trace_update_elem(map, key, value, map_flags);
+	if (unlikely(err))
+		return err;
+
 	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
 		memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
 		       value, map->value_size);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d29af9988f37..c1816a615d82 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -13,6 +13,7 @@
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
+#include "map_trace.h"
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
@@ -1041,6 +1042,10 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
+	ret = bpf_map_trace_update_elem(map, key, value, map_flags);
+	if (unlikely(ret))
+		return ret;
+
 	if (unlikely(map_flags & BPF_F_LOCK)) {
 		if (unlikely(!map_value_has_spin_lock(map)))
 			return -EINVAL;
@@ -1133,6 +1138,10 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 		/* unknown flags */
 		return -EINVAL;
 
+	ret = bpf_map_trace_update_elem(map, key, value, map_flags);
+	if (unlikely(ret))
+		return ret;
+
 	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
 		     !rcu_read_lock_bh_held());
 
@@ -1201,6 +1210,10 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 		/* unknown flags */
 		return -EINVAL;
 
+	ret = bpf_map_trace_update_elem(map, key, value, map_flags);
+	if (unlikely(ret))
+		return ret;
+
 	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
 		     !rcu_read_lock_bh_held());
 
@@ -1256,6 +1269,10 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 		/* unknown flags */
 		return -EINVAL;
 
+	ret = bpf_map_trace_update_elem(map, key, value, map_flags);
+	if (unlikely(ret))
+		return ret;
+
 	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
 		     !rcu_read_lock_bh_held());
 
@@ -1334,6 +1351,10 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
 		     !rcu_read_lock_bh_held());
 
+	ret = bpf_map_trace_delete_elem(map, key);
+	if (unlikely(ret))
+		return ret;
+
 	key_size = map->key_size;
 
 	hash = htab_map_hash(key, key_size, htab->hashrnd);
@@ -1370,6 +1391,10 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
 		     !rcu_read_lock_bh_held());
 
+	ret = bpf_map_trace_delete_elem(map, key);
+	if (unlikely(ret))
+		return ret;
+
 	key_size = map->key_size;
 
 	hash = htab_map_hash(key, key_size, htab->hashrnd);
diff --git a/kernel/bpf/map_trace.c b/kernel/bpf/map_trace.c
new file mode 100644
index 000000000000..661b433f1451
--- /dev/null
+++ b/kernel/bpf/map_trace.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021 Google */
+#include "map_trace.h"
+
+noinline int bpf_map_trace_update_elem(struct bpf_map *map, void *key,
+				       void *value, u64 map_flags)
+{
+	/*
+	 * Noop side effect prevents call site from being optimized out.
+	 */
+	asm("");
+	return 0;
+}
+ALLOW_ERROR_INJECTION(bpf_map_trace_update_elem, ERRNO);
+
+noinline int bpf_map_trace_delete_elem(struct bpf_map *map, void *key)
+{
+	/*
+	 * Noop side effect prevents call site from being optimized out.
+	 */
+	asm("");
+	return 0;
+}
+ALLOW_ERROR_INJECTION(bpf_map_trace_delete_elem, ERRNO);
+
diff --git a/kernel/bpf/map_trace.h b/kernel/bpf/map_trace.h
new file mode 100644
index 000000000000..12356a2e1f9f
--- /dev/null
+++ b/kernel/bpf/map_trace.h
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2021 Google */
+#pragma once
+
+#include <linux/bpf.h>
+
+/*
+ * Map tracing hooks. They are called from some, but not all, bpf map types.
+ * For those map types which call them, the only guarantee is that they are
+ * called before the corresponding action (bpf_map_update_elem, etc.) takes
+ * effect. Thus an fmod_ret program may use these hooks to prevent a map from
+ * being mutated via the corresponding helpers.
+ */
+noinline int bpf_map_trace_update_elem(struct bpf_map *map, void *key,
+				       void *value, u64 map_flags);
+
+noinline int bpf_map_trace_delete_elem(struct bpf_map *map, void *key);
+
-- 
2.33.1.1089.g2158813163f-goog



* [RFC PATCH v3 2/3] bpf: Add selftests
  2021-11-02  2:14 [RFC PATCH v3 0/3] Introduce BPF map tracing capability Joe Burton
  2021-11-02  2:14 ` [RFC PATCH v3 1/3] bpf: Add map tracing functions and call sites Joe Burton
@ 2021-11-02  2:14 ` Joe Burton
  2021-11-04  6:32   ` Hou Tao
  2021-11-02  2:14 ` [RFC PATCH v3 3/3] bpf: Add real world example for map tracing Joe Burton
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Joe Burton @ 2021-11-02  2:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf
  Cc: Petar Penkov, Stanislav Fomichev, Joe Burton

From: Joe Burton <jevburton@google.com>

Add selftests verifying that each supported map type is traced.

Signed-off-by: Joe Burton <jevburton@google.com>
---
 .../selftests/bpf/prog_tests/map_trace.c      | 166 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_map_trace.c       |  95 ++++++++++
 .../bpf/progs/bpf_map_trace_common.h          |  12 ++
 3 files changed, 273 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_trace.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_common.h

diff --git a/tools/testing/selftests/bpf/prog_tests/map_trace.c b/tools/testing/selftests/bpf/prog_tests/map_trace.c
new file mode 100644
index 000000000000..4b54a8e3769a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/map_trace.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Google */
+#include <test_progs.h>
+
+#include "bpf_map_trace.skel.h"
+#include "progs/bpf_map_trace_common.h"
+
+#include <sys/mount.h>
+#include <sys/stat.h>
+
+enum BoolOrErr {
+	TRUE = 0,
+	FALSE = 1,
+	ERROR = 2,
+};
+
+enum BoolOrErr percpu_key_is_set(struct bpf_map *map, uint32_t map_key)
+{
+	int num_cpus = libbpf_num_possible_cpus();
+	uint64_t *percpu_map_val = NULL;
+	int map_fd = bpf_map__fd(map);
+	enum BoolOrErr ret = ERROR;
+	int err;
+	int i;
+
+	if (!ASSERT_GE(num_cpus, 1, "get number of cpus"))
+		goto out;
+
+	percpu_map_val = malloc(sizeof(*percpu_map_val) * num_cpus);
+	if (!ASSERT_NEQ(percpu_map_val, NULL, "allocate percpu map array"))
+		goto out;
+
+	err = bpf_map_lookup_elem(map_fd, &map_key, percpu_map_val);
+	if (!ASSERT_EQ(err, 0, "map lookup update_elem"))
+		goto out;
+
+	ret = FALSE;
+	for (i = 0; i < num_cpus; i++)
+		if (percpu_map_val[i] != 0)
+			ret = TRUE;
+
+out:
+	if (percpu_map_val != NULL)
+		free(percpu_map_val);
+
+	return ret;
+}
+
+enum BoolOrErr key_is_set(struct bpf_map *map, uint32_t map_key)
+{
+	int map_fd = bpf_map__fd(map);
+	uint32_t map_val;
+	int rc;
+
+	rc = bpf_map_lookup_elem(map_fd, &map_key, &map_val);
+	if (!ASSERT_EQ(rc, 0, "array map lookup update_elem"))
+		return ERROR;
+
+	return (map_val == 0 ? FALSE : TRUE);
+}
+
+void verify_map_contents(struct bpf_map_trace *skel)
+{
+	enum BoolOrErr rc_or_err;
+	struct bpf_map *map;
+
+	map = skel->maps.array_map;
+	rc_or_err = key_is_set(map, ACCESS_LOC__TRACE_UPDATE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "array map updates are traced"))
+		return;
+	rc_or_err = key_is_set(map, ACCESS_LOC__TRACE_DELETE);
+	if (!ASSERT_EQ(rc_or_err, FALSE, "array map deletions are not traced"))
+		return;
+
+	map = skel->maps.percpu_array_map;
+	rc_or_err = percpu_key_is_set(map, ACCESS_LOC__TRACE_UPDATE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "percpu array map updates are traced"))
+		return;
+	rc_or_err = percpu_key_is_set(map, ACCESS_LOC__TRACE_DELETE);
+	if (!ASSERT_EQ(rc_or_err, FALSE,
+		       "percpu array map deletions are not traced"))
+		return;
+
+	map = skel->maps.hash_map;
+	rc_or_err = key_is_set(map, ACCESS_LOC__TRACE_UPDATE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "hash map updates are traced"))
+		return;
+	rc_or_err = key_is_set(map, ACCESS_LOC__TRACE_DELETE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "hash map deletions are traced"))
+		return;
+
+	map = skel->maps.percpu_hash_map;
+	rc_or_err = percpu_key_is_set(map, ACCESS_LOC__TRACE_UPDATE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "percpu hash map updates are traced"))
+		return;
+	rc_or_err = percpu_key_is_set(map, ACCESS_LOC__TRACE_DELETE);
+	if (!ASSERT_EQ(rc_or_err, TRUE,
+		       "percpu hash map deletions are traced"))
+		return;
+
+	map = skel->maps.lru_hash_map;
+	rc_or_err = key_is_set(map, ACCESS_LOC__TRACE_UPDATE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "lru_hash map updates are traced"))
+		return;
+	rc_or_err = key_is_set(map, ACCESS_LOC__TRACE_DELETE);
+	if (!ASSERT_EQ(rc_or_err, TRUE, "lru_hash map deletions are traced"))
+		return;
+
+	map = skel->maps.percpu_lru_hash_map;
+	rc_or_err = percpu_key_is_set(map, ACCESS_LOC__TRACE_UPDATE);
+	if (!ASSERT_EQ(rc_or_err, TRUE,
+		       "percpu lru hash map updates are traced"))
+		return;
+	rc_or_err = percpu_key_is_set(map, ACCESS_LOC__TRACE_DELETE);
+	if (!ASSERT_EQ(rc_or_err, TRUE,
+		       "percpu lru hash map deletions are traced"))
+		return;
+}
+
+void map_trace_test(void)
+{
+	struct bpf_map_trace *skel;
+	ssize_t bytes_written;
+	char write_buf = 'a';
+	int write_fd = -1;
+	int rc;
+
+	/*
+	 * Load and attach programs.
+	 */
+	skel = bpf_map_trace__open_and_load();
+	if (!ASSERT_NEQ(skel, NULL, "open/load skeleton"))
+		return;
+
+	rc = bpf_map_trace__attach(skel);
+	if (!ASSERT_EQ(rc, 0, "attach skeleton"))
+		goto out;
+
+	/*
+	 * Invoke core BPF program.
+	 */
+	write_fd = open("/tmp/map_trace_test_file", O_CREAT | O_WRONLY);
+	if (!ASSERT_GE(write_fd, 0, "open tmp file for writing"))
+		goto out;
+
+	bytes_written = write(write_fd, &write_buf, sizeof(write_buf));
+	if (!ASSERT_EQ(bytes_written, sizeof(write_buf), "write to tmp file"))
+		goto out;
+
+	/*
+	 * Verify that tracing programs were invoked as expected.
+	 */
+	verify_map_contents(skel);
+
+out:
+	if (skel)
+		bpf_map_trace__destroy(skel);
+	if (write_fd != -1)
+		close(write_fd);
+}
+
+void test_map_trace(void)
+{
+	map_trace_test();
+}
+
diff --git a/tools/testing/selftests/bpf/progs/bpf_map_trace.c b/tools/testing/selftests/bpf/progs/bpf_map_trace.c
new file mode 100644
index 000000000000..6135cd86b521
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_map_trace.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Google */
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+#include <string.h>
+
+#include "bpf_map_trace_common.h"
+
+#define DECLARE_MAP(name, map_type) \
+		struct { \
+			__uint(type, map_type); \
+			__uint(max_entries, __ACCESS_LOC__MAX); \
+			__type(key, u32); \
+			__type(value, u32); \
+		} name SEC(".maps")
+
+DECLARE_MAP(array_map, BPF_MAP_TYPE_ARRAY);
+DECLARE_MAP(percpu_array_map, BPF_MAP_TYPE_PERCPU_ARRAY);
+DECLARE_MAP(hash_map, BPF_MAP_TYPE_HASH);
+DECLARE_MAP(percpu_hash_map, BPF_MAP_TYPE_PERCPU_HASH);
+DECLARE_MAP(lru_hash_map, BPF_MAP_TYPE_LRU_HASH);
+DECLARE_MAP(percpu_lru_hash_map, BPF_MAP_TYPE_LRU_PERCPU_HASH);
+
+static inline void __log_location(void *map,
+				  enum MapAccessLocations location)
+{
+	u32 key = location;
+	u32 val = 1;
+
+	bpf_map_update_elem(map, &key, &val, /*flags=*/0);
+}
+
+static inline void log_location(struct bpf_map *map,
+				enum MapAccessLocations location)
+{
+	if (map == &array_map)
+		__log_location(&array_map, location);
+	if (map == &percpu_array_map)
+		__log_location(&percpu_array_map, location);
+	if (map == &hash_map)
+		__log_location(&hash_map, location);
+	if (map == &percpu_hash_map)
+		__log_location(&percpu_hash_map, location);
+	if (map == &lru_hash_map)
+		__log_location(&lru_hash_map, location);
+	if (map == &percpu_lru_hash_map)
+		__log_location(&percpu_lru_hash_map, location);
+}
+
+SEC("fentry/bpf_map_trace_update_elem")
+int BPF_PROG(fentry__bpf_map_trace_update_elem,
+	     struct bpf_map *map, void *key,
+	     void *value, u64 map_flags)
+{
+	log_location(map, ACCESS_LOC__TRACE_UPDATE);
+	return 0;
+}
+
+SEC("fentry/bpf_map_trace_delete_elem")
+int BPF_PROG(fentry__bpf_map_trace_delete_elem,
+	     struct bpf_map *map, void *key)
+{
+	log_location(map, ACCESS_LOC__TRACE_DELETE);
+	return 0;
+}
+
+static inline void do_map_accesses(void *map)
+{
+	u32 key = ACCESS_LOC__APP;
+	u32 val = 1;
+
+	bpf_map_update_elem(map, &key, &val, /*flags=*/0);
+	bpf_map_delete_elem(map, &key);
+}
+
+SEC("fentry/__x64_sys_write")
+int BPF_PROG(fentry__x64_sys_write, struct pt_regs *regs, int ret)
+{
+	/*
+	 * Trigger an update and a delete for every map type under test.
+	 * We want to verify that bpf_map_trace_{update,delete}_elem() fire
+	 * for each map type.
+	 */
+	do_map_accesses(&array_map);
+	do_map_accesses(&percpu_array_map);
+	do_map_accesses(&hash_map);
+	do_map_accesses(&percpu_hash_map);
+	do_map_accesses(&lru_hash_map);
+	do_map_accesses(&percpu_lru_hash_map);
+	return 0;
+}
+
diff --git a/tools/testing/selftests/bpf/progs/bpf_map_trace_common.h b/tools/testing/selftests/bpf/progs/bpf_map_trace_common.h
new file mode 100644
index 000000000000..3aac75953508
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_map_trace_common.h
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Google */
+#pragma once
+
+enum MapAccessLocations {
+	ACCESS_LOC__APP,
+	ACCESS_LOC__TRACE_UPDATE,
+	ACCESS_LOC__TRACE_DELETE,
+
+	__ACCESS_LOC__MAX,
+};
+
-- 
2.33.1.1089.g2158813163f-goog



* [RFC PATCH v3 3/3] bpf: Add real world example for map tracing
  2021-11-02  2:14 [RFC PATCH v3 0/3] Introduce BPF map tracing capability Joe Burton
  2021-11-02  2:14 ` [RFC PATCH v3 1/3] bpf: Add map tracing functions and call sites Joe Burton
  2021-11-02  2:14 ` [RFC PATCH v3 2/3] bpf: Add selftests Joe Burton
@ 2021-11-02  2:14 ` Joe Burton
  2021-11-03  0:12 ` [RFC PATCH v3 0/3] Introduce BPF map tracing capability Alexei Starovoitov
  2021-11-03 10:34 ` Jamal Hadi Salim
  4 siblings, 0 replies; 18+ messages in thread
From: Joe Burton @ 2021-11-02  2:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf
  Cc: Petar Penkov, Stanislav Fomichev, Joe Burton

From: Joe Burton <jevburton@google.com>

Add an example that demonstrates how map tracing helps us avoid race
conditions while upgrading stateful programs.

Signed-off-by: Joe Burton <jevburton@google.com>
---
 .../selftests/bpf/prog_tests/map_trace.c      | 256 ++++++++++++++++++
 .../progs/bpf_map_trace_real_world_common.h   | 125 +++++++++
 .../bpf_map_trace_real_world_migration.c      | 102 +++++++
 .../bpf/progs/bpf_map_trace_real_world_new.c  |   4 +
 .../bpf/progs/bpf_map_trace_real_world_old.c  |   5 +
 5 files changed, 492 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_migration.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_new.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_old.c

diff --git a/tools/testing/selftests/bpf/prog_tests/map_trace.c b/tools/testing/selftests/bpf/prog_tests/map_trace.c
index 4b54a8e3769a..922af9ab06e9 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_trace.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_trace.c
@@ -3,6 +3,9 @@
 #include <test_progs.h>
 
 #include "bpf_map_trace.skel.h"
+#include "bpf_map_trace_real_world_migration.skel.h"
+#include "bpf_map_trace_real_world_new.skel.h"
+#include "bpf_map_trace_real_world_old.skel.h"
 #include "progs/bpf_map_trace_common.h"
 
 #include <sys/mount.h>
@@ -159,8 +162,261 @@ void map_trace_test(void)
 		close(write_fd);
 }
 
+int real_world_example__attach_migration(
+		struct bpf_map_trace_real_world_migration *migration_skel,
+		struct bpf_link **iter_link,
+		struct bpf_link **map_trace_link_update,
+		struct bpf_link **map_trace_link_delete)
+{
+	union bpf_iter_link_info iter_link_info;
+	struct bpf_iter_attach_opts iter_opts;
+	int64_t error;
+
+	*map_trace_link_update = bpf_program__attach(
+			migration_skel->progs.copy_on_write__update);
+	error = libbpf_get_error(*map_trace_link_update);
+	if (!ASSERT_EQ(error, 0,
+		       "copy_on_write update bpf_program__attach failure"))
+		return 1;
+
+	*map_trace_link_delete = bpf_program__attach(
+			migration_skel->progs.copy_on_write__delete);
+	error = libbpf_get_error(*map_trace_link_delete);
+	if (!ASSERT_EQ(error, 0,
+		       "copy_on_write delete bpf_program__attach failure"))
+		return 1;
+
+	memset(&iter_link_info, 0, sizeof(iter_link_info));
+	iter_link_info.map.map_fd = bpf_map__fd(migration_skel->maps.old_map);
+
+	memset(&iter_opts, 0, sizeof(iter_opts));
+	iter_opts.sz = sizeof(iter_opts);
+	iter_opts.link_info = &iter_link_info;
+	iter_opts.link_info_len = sizeof(iter_link_info);
+	*iter_link = bpf_program__attach_iter(
+			migration_skel->progs.bulk_migration, &iter_opts);
+	error = libbpf_get_error(*iter_link);
+	if (!ASSERT_EQ(error, 0, "bpf_program__attach_iter failure"))
+		return 1;
+
+	return 0;
+}
+
+int open_and_write_files(const char *path, size_t num_files)
+{
+	int *fds = malloc(sizeof(int) * num_files);
+	ssize_t bytes_written;
+	const char buf = 'a';
+	size_t i, j;
+	int ret = 0;
+
+	if (fds == NULL)
+		return 1;
+
+	for (i = 0; i < num_files; i++) {
+		fds[i] = open(path, O_WRONLY | O_CREAT);
+
+		if (fds[i] < 0) {
+			ret = 2;
+			break;
+		}
+		bytes_written = write(fds[i], &buf, sizeof(buf));
+		if (bytes_written != sizeof(buf)) {
+			ret = 3;
+			break;
+		}
+	}
+	for (j = 0; j < i; j++)
+		close(fds[j]);
+	return ret;
+}
+
+void real_world_example(void)
+{
+	struct bpf_map_trace_real_world_migration *migration_skel = NULL;
+	int file_fd_should_write = -1, file_fd_should_not_write = -1;
+	struct bpf_map_trace_real_world_new *new_skel = NULL;
+	struct bpf_map_trace_real_world_old *old_skel = NULL;
+	struct bpf_link *map_trace_link_update = NULL;
+	struct bpf_link *map_trace_link_delete = NULL;
+	struct bpf_link *iter_link = NULL;
+	const bool enable_filtering = 1;
+	const uint32_t pid = getpid();
+	uint32_t max_open_files;
+	char file_buf = 'a';
+	int iter_fd = -1;
+	char iter_buf[1];
+	int rc;
+
+	/*
+	 * Begin by loading and attaching the old version of our program.
+	 */
+	old_skel = bpf_map_trace_real_world_old__open_and_load();
+	if (!ASSERT_NEQ(old_skel, NULL, "open/load old skeleton"))
+		return;
+	rc = bpf_map_trace_real_world_old__attach(old_skel);
+	if (!ASSERT_EQ(rc, 0, "attach old skeleton")) {
+		fprintf(stderr, "Failed to attach skeleton: %d\n", errno);
+		goto out;
+	}
+	rc = bpf_map_update_elem(bpf_map__fd(old_skel->maps.filtered_pids),
+				 &pid, &enable_filtering, /*flags=*/0);
+	if (!ASSERT_EQ(rc, 0, "configure process to be filtered"))
+		goto out;
+	if (!ASSERT_EQ(open_and_write_files("/tmp/tst_file", 1), 0,
+		       "program allows writing a single new file"))
+		goto out;
+	max_open_files = bpf_map__max_entries(old_skel->maps.allow_reads);
+	if (!ASSERT_NEQ(open_and_write_files("/tmp/tst_file",
+					     max_open_files + 1), 0,
+		       "program blocks writing too many new files"))
+		goto out;
+
+	/*
+	 * Then load the new version of the program.
+	 */
+	new_skel = bpf_map_trace_real_world_new__open_and_load();
+	if (!ASSERT_NEQ(new_skel, NULL, "open/load new skeleton"))
+		goto out;
+
+	/*
+	 * Hook up the migration programs. This gives the old map
+	 * copy-on-write semantics.
+	 */
+	migration_skel = bpf_map_trace_real_world_migration__open();
+	if (!ASSERT_NEQ(migration_skel, NULL, "open migration skeleton"))
+		goto out;
+	rc = bpf_map__reuse_fd(migration_skel->maps.old_map,
+			       bpf_map__fd(old_skel->maps.allow_reads));
+	if (!ASSERT_EQ(rc, 0, "reuse old map fd"))
+		goto out;
+	rc = bpf_map__reuse_fd(migration_skel->maps.new_map,
+			       bpf_map__fd(new_skel->maps.allow_reads));
+	if (!ASSERT_EQ(rc, 0, "reuse new map fd"))
+		goto out;
+	rc = bpf_map_trace_real_world_migration__load(migration_skel);
+	if (!ASSERT_EQ(rc, 0, "load migration skeleton"))
+		goto out;
+	rc = real_world_example__attach_migration(migration_skel,
+						  &iter_link,
+						  &map_trace_link_update,
+						  &map_trace_link_delete);
+	if (!ASSERT_EQ(rc, 0, "attach migration programs"))
+		goto out;
+
+	/*
+	 * Simulated race condition type 1: An application opens an fd before
+	 * bulk transfer and closes it after.
+	 */
+	file_fd_should_not_write = open("/tmp/tst_file", O_WRONLY | O_CREAT);
+	if (!ASSERT_GE(file_fd_should_not_write, 0,
+		       "open file before bulk migration"))
+		goto out;
+
+	/*
+	 * Perform bulk transfer.
+	 */
+	iter_fd = bpf_iter_create(bpf_link__fd(iter_link));
+	if (!ASSERT_GE(iter_fd, 0, "create iterator"))
+		goto out;
+	rc = read(iter_fd, &iter_buf, sizeof(iter_buf));
+	if (!ASSERT_EQ(rc, 0, "execute map iterator"))
+		goto out;
+	rc = bpf_map_update_elem(bpf_map__fd(new_skel->maps.filtered_pids),
+				 &pid, &enable_filtering, /*flags=*/0);
+	if (!ASSERT_EQ(rc, 0, "configure process to be filtered"))
+		goto out;
+
+	/*
+	 * Simulated race condition type 1 (continued). This close() does not
+	 * propagate to the new map without copy-on-write semantics, so it
+	 * would occupy a spot in the map until our app happens to close an fd
+	 * with the same number. This would subtly degrade the contract with
+	 * the application.
+	 */
+	close(file_fd_should_not_write);
+	file_fd_should_not_write = -1;
+
+	/*
+	 * Simulated race condition type 2: An application opens a file
+	 * descriptor after bulk transfer. This openat() does not propagate to
+	 * the new map without copy-on-write, so our app would not be able to
+	 * write to it.
+	 */
+	file_fd_should_write = open("/tmp/tst_file", O_WRONLY | O_CREAT);
+	if (!ASSERT_GE(file_fd_should_write, 0,
+		       "open file after bulk migration"))
+		goto out;
+
+	/*
+	 * State is migrated. Load new programs.
+	 */
+	rc = bpf_map_trace_real_world_new__attach(new_skel);
+	if (!ASSERT_EQ(rc, 0, "failed to attach new programs"))
+		goto out;
+
+	/*
+	 * Unload migration progs.
+	 */
+	close(iter_fd);
+	iter_fd = -1;
+	bpf_link__destroy(map_trace_link_update);
+	map_trace_link_update = NULL;
+	bpf_link__destroy(map_trace_link_delete);
+	map_trace_link_delete = NULL;
+	bpf_link__destroy(iter_link);
+	iter_link = NULL;
+	bpf_map_trace_real_world_migration__destroy(migration_skel);
+	migration_skel = NULL;
+
+	/*
+	 * Unload old programs.
+	 */
+	bpf_map_trace_real_world_old__destroy(old_skel);
+	old_skel = NULL;
+
+	if (!ASSERT_EQ(open_and_write_files("/tmp/tst_file", 1), 0,
+		       "program allows writing a single new file"))
+		goto out;
+	max_open_files = bpf_map__max_entries(new_skel->maps.allow_reads);
+	if (!ASSERT_NEQ(open_and_write_files("/tmp/tst_file",
+					     max_open_files + 1), 0,
+		       "program blocks writing too many new files"))
+		goto out;
+	/*
+	 * Simulated race condition type 2 (continued): If we didn't do
+	 * copy-on-write, this would be expected to fail, since the FD would
+	 * not be in the new map.
+	 */
+	rc = write(file_fd_should_write, &file_buf, sizeof(file_buf));
+	if (!ASSERT_EQ(rc, sizeof(file_buf),
+		       "migrated program allows writing to file opened before migration"))
+		goto out;
+
+out:
+	if (old_skel)
+		bpf_map_trace_real_world_old__destroy(old_skel);
+	if (new_skel)
+		bpf_map_trace_real_world_new__destroy(new_skel);
+	if (migration_skel)
+		bpf_map_trace_real_world_migration__destroy(migration_skel);
+	if (map_trace_link_update)
+		bpf_link__destroy(map_trace_link_update);
+	if (map_trace_link_delete)
+		bpf_link__destroy(map_trace_link_delete);
+	if (iter_link)
+		bpf_link__destroy(iter_link);
+	if (iter_fd > -1)
+		close(iter_fd);
+	if (file_fd_should_write > -1)
+		close(file_fd_should_write);
+	if (file_fd_should_not_write > -1)
+		close(file_fd_should_not_write);
+}
+
 void test_map_trace(void)
 {
 	map_trace_test();
+	real_world_example();
 }
 
diff --git a/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_common.h b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_common.h
new file mode 100644
index 000000000000..230610e1b5d5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_common.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2021 Google */
+#pragma once
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+#include <string.h>
+
+/*
+ * Mock "real world" application.
+ *
+ * Blocks all writes from a set of applications. A limited number of newly
+ * openat()ed file descriptors may be written to. Writes to
+ * already-open file descriptors are blocked.
+ *
+ * The affected processes are selected by populating filtered_pids.
+ *
+ * It is intended as an example of a stateful policy-enforcement application
+ * which benefits from map tracing. It is not intended to be useful.
+ */
+
+/*
+ * This is the only difference between the old and new application. Since we're
+ * enforcing a policy based on this data, we want to migrate it. Since the
+ * application can modify the data in parallel, we need to give this map
+ * copy-on-write semantics so that those changes propagate.
+ */
+#if defined(OLD_VERSION)
+struct allow_reads_key {
+	uint32_t pid;
+	int fd;
+};
+#else
+struct allow_reads_key {
+	int fd;
+	uint32_t pid;
+};
+#endif
+struct allow_reads_value {
+	bool do_allow;
+};
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 16);
+	__type(key, struct allow_reads_key);
+	__type(value, struct allow_reads_value);
+} allow_reads SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 16);
+	__type(key, uint32_t);
+	__type(value, bool);
+} filtered_pids SEC(".maps");
+
+
+SEC("kretprobe/__x64_sys_openat")
+int BPF_KRETPROBE(kretprobe__x64_sys_openat, int ret)
+{
+	struct allow_reads_key key;
+	struct allow_reads_value val;
+	uint32_t pid;
+	char *pid_is_filtered;
+
+	pid = (bpf_get_current_pid_tgid() >> 32) & 0xFFFFFFFF;
+	memset(&key, 0, sizeof(key));
+	key.pid = pid;
+	key.fd = ret;
+	val.do_allow = true;
+
+	if (ret < 0)
+		return 0;
+
+	pid_is_filtered = bpf_map_lookup_elem(&filtered_pids, &pid);
+	if (!pid_is_filtered)
+		return 0;
+
+	if (!*pid_is_filtered)
+		return 0;
+
+	/*
+	 * Ignore errors. Failing to insert has the effect of blocking writes
+	 * on that file descriptor.
+	 */
+	bpf_map_update_elem(&allow_reads, &key, &val, /*flags=*/0);
+	return 0;
+}
+
+SEC("fmod_ret/__x64_sys_write")
+int BPF_PROG(fmod_ret__x64_sys_write, struct pt_regs *regs, int ret)
+{
+	int fd = PT_REGS_PARM1(regs);
+	struct allow_reads_value *val;
+	struct allow_reads_key key;
+
+	memset(&key, 0, sizeof(key));
+	key.pid = (bpf_get_current_pid_tgid() >> 32) & 0xFFFFFFFF;
+	key.fd = fd;
+	val = bpf_map_lookup_elem(&allow_reads, &key);
+	if (!val)
+		return -EPERM;
+	return val->do_allow ? 0 : -EPERM;
+}
+
+SEC("fmod_ret/__x64_sys_close")
+int BPF_PROG(fmod_ret__x64_sys_close, struct pt_regs *regs, int ret)
+{
+	int fd = PT_REGS_PARM1(regs);
+	struct allow_reads_key key;
+	struct allow_reads_value val;
+
+	memset(&key, 0, sizeof(key));
+	key.pid = (bpf_get_current_pid_tgid() >> 32) & 0xFFFFFFFF;
+	key.fd = fd;
+	val.do_allow = true;
+
+	bpf_map_delete_elem(&allow_reads, &key);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+
diff --git a/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_migration.c b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_migration.c
new file mode 100644
index 000000000000..18d61aa7ce4f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_migration.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Google */
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+/* In the "real" real world, we would use BTF to generate a program which knows
+ * about the old and new map ABI. To keep things simple we'll just use a
+ * statically defined program which knows about them.
+ */
+struct allow_reads_key__old {
+	uint32_t pid;
+	int fd;
+};
+struct allow_reads_key__new {
+	int fd;
+	uint32_t pid;
+};
+struct allow_reads_value__old {
+	bool do_allow;
+};
+struct allow_reads_value__new {
+	bool do_allow;
+};
+
+/* Likewise, in the "real" real world we would simply generate a program
+ * containing the fd of this map. For libbpf to generate a skeleton for us we
+ * need to duplicate this definition.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 100);
+	__type(key, struct allow_reads_key__old);
+	__type(value, struct allow_reads_value__old);
+} old_map SEC(".maps");
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 100);
+	__type(key, struct allow_reads_key__new);
+	__type(value, struct allow_reads_value__new);
+} new_map SEC(".maps");
+
+static inline void read_migrate_write(void *key, void *value)
+{
+	struct allow_reads_key__old old_key = {};
+	struct allow_reads_key__new new_key = {};
+	char old_value = 0;
+
+	if (bpf_probe_read(&old_key, sizeof(old_key), key))
+		return; /* Could write to a map here */
+	if (bpf_probe_read(&old_value, sizeof(old_value), value))
+		return; /* Could write to a map here */
+
+	new_key.pid = old_key.pid;
+	new_key.fd = old_key.fd;
+
+	bpf_map_update_elem(&new_map, &new_key, &old_value, /*flags=*/0);
+}
+
+SEC("fentry/bpf_map_trace_update_elem")
+int BPF_PROG(copy_on_write__update,
+	     struct bpf_map *map, void *key,
+	     void *value, u64 map_flags)
+{
+	if (map == &old_map)
+		read_migrate_write(key, value);
+	return 0;
+}
+
+static inline void read_migrate_delete(void *key)
+{
+	struct allow_reads_key__old old_key = {};
+	struct allow_reads_key__new new_key = {};
+
+	if (bpf_probe_read(&old_key, sizeof(old_key), key))
+		return; /* Could write to a map here */
+
+	new_key.pid = old_key.pid;
+	new_key.fd = old_key.fd;
+
+	bpf_map_delete_elem(&new_map, &new_key);
+}
+
+SEC("fentry/bpf_map_trace_delete_elem")
+int BPF_PROG(copy_on_write__delete,
+	     struct bpf_map *map, void *key)
+{
+	if (map == &old_map)
+		read_migrate_delete(key);
+	return 0;
+}
+
+SEC("iter/bpf_map_elem")
+int bulk_migration(struct bpf_iter__bpf_map_elem *ctx)
+{
+	read_migrate_write(ctx->key, ctx->value);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+
diff --git a/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_new.c b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_new.c
new file mode 100644
index 000000000000..9b7c4ca1deed
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_new.c
@@ -0,0 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Google */
+#include "bpf_map_trace_real_world_common.h"
+
diff --git a/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_old.c b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_old.c
new file mode 100644
index 000000000000..9f0bdd7baf71
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_map_trace_real_world_old.c
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Google */
+#define OLD_VERSION
+#include "bpf_map_trace_real_world_common.h"
+
-- 
2.33.1.1089.g2158813163f-goog



* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-02  2:14 [RFC PATCH v3 0/3] Introduce BPF map tracing capability Joe Burton
                   ` (2 preceding siblings ...)
  2021-11-02  2:14 ` [RFC PATCH v3 3/3] bpf: Add real world example for map tracing Joe Burton
@ 2021-11-03  0:12 ` Alexei Starovoitov
  2021-11-03 17:29   ` Yonghong Song
  2021-11-03 10:34 ` Jamal Hadi Salim
  4 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2021-11-03  0:12 UTC (permalink / raw)
  To: Joe Burton
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf, Petar Penkov,
	Stanislav Fomichev, Joe Burton

On Tue, Nov 02, 2021 at 02:14:29AM +0000, Joe Burton wrote:
> From: Joe Burton <jevburton@google.com>
> 
> This is the third version of a patch series implementing map tracing.
> 
> Map tracing enables executing BPF programs upon BPF map updates. This
> might be useful to perform upgrades of stateful programs; e.g., tracing
> programs can propagate changes to maps that occur during an upgrade
> operation.
> 
> This version uses trampoline hooks to provide the capability.
> fentry/fexit/fmod_ret programs can attach to two new functions:
>         int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
>                 void* val, u32 flags);
>         int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
> 
> These hooks work as intended for the following map types:
>         BPF_MAP_TYPE_ARRAY
>         BPF_MAP_TYPE_PERCPU_ARRAY
>         BPF_MAP_TYPE_HASH
>         BPF_MAP_TYPE_PERCPU_HASH
>         BPF_MAP_TYPE_LRU_HASH
>         BPF_MAP_TYPE_LRU_PERCPU_HASH
> 
> The only guarantee about the semantics of these hooks is that they execute
> before the operation takes place. We cannot call them with locks held
> because the hooked program might try to acquire the same locks. Thus they
> may be invoked in situations where the traced map is not ultimately
> updated.
> 
> The original proposal suggested exposing a function for each
> (map type) x (access type). The problem I encountered is that e.g.
> percpu hashtables use a custom function for some access types
> (htab_percpu_map_update_elem) but a common function for others
> (htab_map_delete_elem). Thus a userspace application would have to
> maintain a unique list of functions to attach to for each map type;
> moreover, this list could change across kernel versions. Map tracing is
> easier to use with fewer functions, at the cost of tracing programs
> being triggered more times.

Good point about htab_percpu.
The patches look good to me.
Few minor bits:
- pls don't use #pragma once.
  There was a discussion not too long ago about it and the conclusion
  was that let's not use it.
  It slipped into few selftest/bpf, but let's not introduce more users.
- noinline is not needed in prototype.
- bpf_probe_read is deprecated. Pls use bpf_probe_read_kernel.

and thanks for detailed patch 3.

> To prevent the compiler from optimizing out the calls to my tracing
> functions, I use the asm("") trick described in gcc's
> __attribute__((noinline)) documentation. Experimentally, this trick
> works with clang as well.

I think noinline is enough. I don't think you need that asm in there.

In parallel let's figure out how to do:
SEC("fentry/bpf_map_trace_update_elem")
int BPF_PROG(copy_on_write__update,
             struct bpf_map *map,
             struct allow_reads_key__old *key,
             void *value, u64 map_flags)

It kinda sucks that bpf_probe_read_kernel is necessary to read key/values.
It would be much nicer to be able to specify the exact struct for the key and
access it directly.
The verifier does this already for map iterator.
It's 'void *' on the kernel side while iterator prog can cast this pointer
to specific 'struct key *' and access it directly.
See bpf_iter_reg->ctx_arg_info and btf_ctx_access().
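
For example, a map element iterator prog can already do something like
this (rough sketch):

SEC("iter/bpf_map_elem")
int dump_allow_reads(struct bpf_iter__bpf_map_elem *ctx)
{
	struct allow_reads_key__old *key = ctx->key;

	if (!key)
		return 0;
	/* direct access, no bpf_probe_read_kernel() needed */
	bpf_printk("pid %u fd %d", key->pid, key->fd);
	return 0;
}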

For fentry into bpf_map_trace_update_elem it's a bit more challenging,
since it will be called for all maps and there is no way to statically
check that specific_map->key_size is within prog->aux->max_rdonly_access.

May be we can do a dynamic cast helper (simlar to those that cast sockets)
that will check for key_size at run-time?
Another alternative is to allow 'void *' -> PTR_TO_BTF_ID conversion
and let inlined probe_read do the job.


* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-02  2:14 [RFC PATCH v3 0/3] Introduce BPF map tracing capability Joe Burton
                   ` (3 preceding siblings ...)
  2021-11-03  0:12 ` [RFC PATCH v3 0/3] Introduce BPF map tracing capability Alexei Starovoitov
@ 2021-11-03 10:34 ` Jamal Hadi Salim
  2021-11-03 17:12   ` Joe Burton
  4 siblings, 1 reply; 18+ messages in thread
From: Jamal Hadi Salim @ 2021-11-03 10:34 UTC (permalink / raw)
  To: Joe Burton, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf
  Cc: Petar Penkov, Stanislav Fomichev, Joe Burton

On 2021-11-01 22:14, Joe Burton wrote:
> From: Joe Burton<jevburton@google.com>
> 
> This is the third version of a patch series implementing map tracing.
> 
> Map tracing enables executing BPF programs upon BPF map updates. This
> might be useful to perform upgrades of stateful programs; e.g., tracing
> programs can propagate changes to maps that occur during an upgrade
> operation.
> 
> This version uses trampoline hooks to provide the capability.
> fentry/fexit/fmod_ret programs can attach to two new functions:
>          int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
>                  void* val, u32 flags);
>          int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
> 
> These hooks work as intended for the following map types:
>          BPF_MAP_TYPE_ARRAY
>          BPF_MAP_TYPE_PERCPU_ARRAY
>          BPF_MAP_TYPE_HASH
>          BPF_MAP_TYPE_PERCPU_HASH
>          BPF_MAP_TYPE_LRU_HASH
>          BPF_MAP_TYPE_LRU_PERCPU_HASH
> 
> The only guarantee about the semantics of these hooks is that they execute
> before the operation takes place. We cannot call them with locks held
> because the hooked program might try to acquire the same locks. Thus they
> may be invoked in situations where the traced map is not ultimately
> updated.

Sorry, I may have missed something obvious while staring at the patches,
but:
Don't you want the notification if the command actually was successful
on the map? If the command failed for whatever reason there's nothing
to synchronize? Unless you use that as an indicator to re-read the map?

cheers,
jamal


* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-03 10:34 ` Jamal Hadi Salim
@ 2021-11-03 17:12   ` Joe Burton
  2021-11-04 10:59     ` Jamal Hadi Salim
  0 siblings, 1 reply; 18+ messages in thread
From: Joe Burton @ 2021-11-03 17:12 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf, Petar Penkov,
	Stanislav Fomichev, Joe Burton

That's a good point. Since the probe is invoked before the update takes
place, it would not be possible to account for the possibility that the
update failed.

Unless someone wants the `pre update' hook, I'll simply adjust the
existing hooks' semantics so that they are invoked after the update.
As discussed, this better suits the intended use case.
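
Roughly, that means moving the call site onto the success path. E.g. for
the array map, something like this (simplified sketch, declarations and
earlier checks elided):

  static int array_map_update_elem(struct bpf_map *map, void *key,
                void *value, u64 map_flags)
  {
        ...
        if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
                memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
                       value, map->value_size);
        else
                copy_map_value(map, val, value);

        /* Only reached once the element has actually been written. */
        bpf_map_trace_update_elem(map, key, value, map_flags);
        return 0;
  }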

On Wed, Nov 3, 2021 at 3:34 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On 2021-11-01 22:14, Joe Burton wrote:
> > From: Joe Burton<jevburton@google.com>
> >
> > This is the third version of a patch series implementing map tracing.
> >
> > Map tracing enables executing BPF programs upon BPF map updates. This
> > might be useful to perform upgrades of stateful programs; e.g., tracing
> > programs can propagate changes to maps that occur during an upgrade
> > operation.
> >
> > This version uses trampoline hooks to provide the capability.
> > fentry/fexit/fmod_ret programs can attach to two new functions:
> >          int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
> >                  void* val, u32 flags);
> >          int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
> >
> > These hooks work as intended for the following map types:
> >          BPF_MAP_TYPE_ARRAY
> >          BPF_MAP_TYPE_PERCPU_ARRAY
> >          BPF_MAP_TYPE_HASH
> >          BPF_MAP_TYPE_PERCPU_HASH
> >          BPF_MAP_TYPE_LRU_HASH
> >          BPF_MAP_TYPE_LRU_PERCPU_HASH
> >
> > The only guarantee about the semantics of these hooks is that they execute
> > before the operation takes place. We cannot call them with locks held
> > because the hooked program might try to acquire the same locks. Thus they
> > may be invoked in situations where the traced map is not ultimately
> > updated.
>
> Sorry, I may have missed something obvious while staring at the patches,
> but:
> Dont you want the notification if the command actually was successful
> on the map? If the command failed for whatever reason theres nothing
> to synchronize? Unless you use that as an indicator to re-read the map?
>
> cheers,
> jamal


* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-03  0:12 ` [RFC PATCH v3 0/3] Introduce BPF map tracing capability Alexei Starovoitov
@ 2021-11-03 17:29   ` Yonghong Song
  2021-11-03 17:45     ` Joe Burton
  0 siblings, 1 reply; 18+ messages in thread
From: Yonghong Song @ 2021-11-03 17:29 UTC (permalink / raw)
  To: Alexei Starovoitov, Joe Burton
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	linux-kernel, netdev, bpf, Petar Penkov, Stanislav Fomichev,
	Joe Burton



On 11/2/21 5:12 PM, Alexei Starovoitov wrote:
> On Tue, Nov 02, 2021 at 02:14:29AM +0000, Joe Burton wrote:
>> From: Joe Burton <jevburton@google.com>
>>
>> This is the third version of a patch series implementing map tracing.
>>
>> Map tracing enables executing BPF programs upon BPF map updates. This
>> might be useful to perform upgrades of stateful programs; e.g., tracing
>> programs can propagate changes to maps that occur during an upgrade
>> operation.
>>
>> This version uses trampoline hooks to provide the capability.
>> fentry/fexit/fmod_ret programs can attach to two new functions:
>>          int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
>>                  void* val, u32 flags);
>>          int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
>>
>> These hooks work as intended for the following map types:
>>          BPF_MAP_TYPE_ARRAY
>>          BPF_MAP_TYPE_PERCPU_ARRAY
>>          BPF_MAP_TYPE_HASH
>>          BPF_MAP_TYPE_PERCPU_HASH
>>          BPF_MAP_TYPE_LRU_HASH
>>          BPF_MAP_TYPE_LRU_PERCPU_HASH
>>
>> The only guarantee about the semantics of these hooks is that they execute
>> before the operation takes place. We cannot call them with locks held
>> because the hooked program might try to acquire the same locks. Thus they
>> may be invoked in situations where the traced map is not ultimately
>> updated.
>>
>> The original proposal suggested exposing a function for each
>> (map type) x (access type). The problem I encountered is that e.g.
>> percpu hashtables use a custom function for some access types
>> (htab_percpu_map_update_elem) but a common function for others
>> (htab_map_delete_elem). Thus a userspace application would have to
>> maintain a unique list of functions to attach to for each map type;
>> moreover, this list could change across kernel versions. Map tracing is
>> easier to use with fewer functions, at the cost of tracing programs
>> being triggered more times.
> 
> Good point about htab_percpu.
> The patches look good to me.
> Few minor bits:
> - pls don't use #pragma once.
>    There was a discussion not too long ago about it and the conclusion
>    was that let's not use it.
>    It slipped into few selftest/bpf, but let's not introduce more users.
> - noinline is not needed in prototype.
> - bpf_probe_read is deprecated. Pls use bpf_probe_read_kernel.
> 
> and thanks for detailed patch 3.
> 
>> To prevent the compiler from optimizing out the calls to my tracing
>> functions, I use the asm("") trick described in gcc's
>> __attribute__((noinline)) documentation. Experimentally, this trick
>> works with clang as well.
> 
> I think noinline is enough. I don't think you need that asm in there.

I tried a simple program using clang lto and the optimization 
(optimizing away the call itself) doesn't happen.

[$ ~/tmp2] cat t1.c
__attribute__((noinline)) int foo() {
   return 0;
}
[$ ~/tmp2] cat t2.c
extern int foo();
int main() {
   return foo();
}
[$ ~/tmp2] cat run.sh
clang -flto=full -O2 t1.c t2.c -c
clang -flto=full -fuse-ld=lld -O2 t1.o t2.o -o a.out
[$ ~/tmp2] ./run.sh
[$ ~/tmp2] llvm-objdump -d a.out
...
0000000000201750 <foo>:
   201750: 31 c0                         xorl    %eax, %eax
   201752: c3                            retq
   201753: cc                            int3
   201754: cc                            int3
   201755: cc                            int3
   201756: cc                            int3
   201757: cc                            int3
   201758: cc                            int3
   201759: cc                            int3
   20175a: cc                            int3
   20175b: cc                            int3
   20175c: cc                            int3
   20175d: cc                            int3
   20175e: cc                            int3
   20175f: cc                            int3

0000000000201760 <main>:
   201760: e9 eb ff ff ff                jmp     0x201750 <foo>

I remember that even if a call is marked as noinline, the compiler might 
still poke into the call to find some information for some optimization.
But I guess probably the callsite will be kept. Otherwise, it will be
considered as "inlining".

Joe, did you hit any issues, esp. with gcc lto?

> 
> In parallel let's figure out how to do:
> SEC("fentry/bpf_map_trace_update_elem")
> int BPF_PROG(copy_on_write__update,
>               struct bpf_map *map,
>               struct allow_reads_key__old *key,
>               void *value, u64 map_flags)
> 
> It kinda sucks that bpf_probe_read_kernel is necessary to read key/values.
> It would be much nicer to be able to specify the exact struct for the key and
> access it directly.
> The verifier does this already for map iterator.
> It's 'void *' on the kernel side while iterator prog can cast this pointer
> to specific 'struct key *' and access it directly.
> See bpf_iter_reg->ctx_arg_info and btf_ctx_access().
> 
> For fentry into bpf_map_trace_update_elem it's a bit more challenging,
> since it will be called for all maps and there is no way to statically
> check that specific_map->key_size is within prog->aux->max_rdonly_access.
> 
> May be we can do a dynamic cast helper (simlar to those that cast sockets)
> that will check for key_size at run-time?
> Another alternative is to allow 'void *' -> PTR_TO_BTF_ID conversion
> and let inlined probe_read do the job.
> 


* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-03 17:29   ` Yonghong Song
@ 2021-11-03 17:45     ` Joe Burton
  2021-11-03 17:49       ` Alexei Starovoitov
  0 siblings, 1 reply; 18+ messages in thread
From: Joe Burton @ 2021-11-03 17:45 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf, Petar Penkov,
	Stanislav Fomichev, Joe Burton

Sort of - I hit issues when defining the function in the same
compilation unit as the call site. For example:

  static noinline int bpf_array_map_trace_update(struct bpf_map *map,
                void *key, void *value, u64 map_flags)
  {
        return 0;
  }
  ALLOW_ERROR_INJECTION(bpf_array_map_trace_update, ERRNO);

  /* Called from syscall or from eBPF program */
  static int array_map_update_elem(struct bpf_map *map, void *key,
                void *value, u64 map_flags)
  {
        ...
                /* This call is elided entirely! */
                if (unlikely(err = bpf_array_map_trace_update(map, key,
                                                value, map_flags)))
                        return err;

... I observed that the call was elided with both gcc and clang. In
either case, the function itself is left behind and can be attached to
with a trampoline prog, but the function is never invoked. Putting
the function body into its own compilation unit sidesteps the issue - I
suspect that LTO isn't clever enough to elide the call.

FWIW, I saw this in the objdump of `array_map_update_elem' when I compiled
this code:

  90:   e8 6b ff ff ff          call   0 \
      <bpf_array_map_trace_update.constprop.0>

So I suspect that constant propagation is responsible for getting rid of
the call site.

The gcc docs for __attribute__((noinline)) call out the asm("") trick to
avoid inlining:

        noinline
        This function attribute prevents a function from being
        considered for inlining. If the function does not have side-
        effects, there are optimizations other than inlining that
        cause function calls to be optimized away, although the
        function call is live. To keep such calls from being optimized
        away, put
                  asm ("");

Since putting the func in its own compilation unit sidesteps the issue,
I'm content to remove the asm("").
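
For reference, the arrangement that sidesteps the issue looks roughly like
the sketch below. The signatures follow the snippet above; the file split
and the ALLOW_ERROR_INJECTION placement are illustrative rather than the
exact patch:

  /* map_trace.c: the hook lives in its own compilation unit and is
   * deliberately not static, so the caller's compilation unit cannot
   * constant-fold its always-zero return value.
   */
  noinline int bpf_map_trace_update_elem(struct bpf_map *map, void *key,
                                         void *value, u64 map_flags)
  {
          return 0;
  }
  ALLOW_ERROR_INJECTION(bpf_map_trace_update_elem, ERRNO);

  /* arraymap.c: call site in a different compilation unit */
  static int array_map_update_elem(struct bpf_map *map, void *key,
                                   void *value, u64 map_flags)
  {
          int err;

          err = bpf_map_trace_update_elem(map, key, value, map_flags);
          if (unlikely(err))
                  return err;
          /* ... perform the actual array update ... */
          return 0;
  }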

On Wed, Nov 3, 2021 at 10:29 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 11/2/21 5:12 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 02, 2021 at 02:14:29AM +0000, Joe Burton wrote:
> >> From: Joe Burton <jevburton@google.com>
> >>
> >> This is the third version of a patch series implementing map tracing.
> >>
> >> Map tracing enables executing BPF programs upon BPF map updates. This
> >> might be useful to perform upgrades of stateful programs; e.g., tracing
> >> programs can propagate changes to maps that occur during an upgrade
> >> operation.
> >>
> >> This version uses trampoline hooks to provide the capability.
> >> fentry/fexit/fmod_ret programs can attach to two new functions:
> >>          int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
> >>                  void* val, u32 flags);
> >>          int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
> >>
> >> These hooks work as intended for the following map types:
> >>          BPF_MAP_TYPE_ARRAY
> >>          BPF_MAP_TYPE_PERCPU_ARRAY
> >>          BPF_MAP_TYPE_HASH
> >>          BPF_MAP_TYPE_PERCPU_HASH
> >>          BPF_MAP_TYPE_LRU_HASH
> >>          BPF_MAP_TYPE_LRU_PERCPU_HASH
> >>
> >> The only guarantee about the semantics of these hooks is that they execute
> >> before the operation takes place. We cannot call them with locks held
> >> because the hooked program might try to acquire the same locks. Thus they
> >> may be invoked in situations where the traced map is not ultimately
> >> updated.
> >>
> >> The original proposal suggested exposing a function for each
> >> (map type) x (access type). The problem I encountered is that e.g.
> >> percpu hashtables use a custom function for some access types
> >> (htab_percpu_map_update_elem) but a common function for others
> >> (htab_map_delete_elem). Thus a userspace application would have to
> >> maintain a unique list of functions to attach to for each map type;
> >> moreover, this list could change across kernel versions. Map tracing is
> >> easier to use with fewer functions, at the cost of tracing programs
> >> being triggered more times.
> >
> > Good point about htab_percpu.
> > The patches look good to me.
> > Few minor bits:
> > - pls don't use #pragma once.
> >    There was a discussion not too long ago about it and the conclusion
> >    was that let's not use it.
> >    It slipped into a few selftests/bpf files, but let's not introduce more users.
> > - noinline is not needed in prototype.
> > - bpf_probe_read is deprecated. Pls use bpf_probe_read_kernel.
> >
> > and thanks for detailed patch 3.
> >
> >> To prevent the compiler from optimizing out the calls to my tracing
> >> functions, I use the asm("") trick described in gcc's
> >> __attribute__((noinline)) documentation. Experimentally, this trick
> >> works with clang as well.
> >
> > I think noinline is enough. I don't think you need that asm in there.
>
> I tried a simple program using clang lto and the optimization
> (optimizing away the call itself) doesn't happen.
>
> [$ ~/tmp2] cat t1.c
> __attribute__((noinline)) int foo() {
>    return 0;
> }
> [$ ~/tmp2] cat t2.c
> extern int foo();
> int main() {
>    return foo();
> }
> [$ ~/tmp2] cat run.sh
> clang -flto=full -O2 t1.c t2.c -c
> clang -flto=full -fuse-ld=lld -O2 t1.o t2.o -o a.out
> [$ ~/tmp2] ./run.sh
> [$ ~/tmp2] llvm-objdump -d a.out
> ...
> 0000000000201750 <foo>:
>    201750: 31 c0                         xorl    %eax, %eax
>    201752: c3                            retq
>    201753: cc                            int3
>    201754: cc                            int3
>    201755: cc                            int3
>    201756: cc                            int3
>    201757: cc                            int3
>    201758: cc                            int3
>    201759: cc                            int3
>    20175a: cc                            int3
>    20175b: cc                            int3
>    20175c: cc                            int3
>    20175d: cc                            int3
>    20175e: cc                            int3
>    20175f: cc                            int3
>
> 0000000000201760 <main>:
>    201760: e9 eb ff ff ff                jmp     0x201750 <foo>
>
> I remember that even if a call is marked as noinline, the compiler might
> still peek into the callee to gather information for other optimizations.
> But I would guess the call site itself is kept; otherwise it would
> effectively be considered "inlining".
>
> Joe, did you hit any issues, esp. with gcc lto?
>
> >
> > In parallel let's figure out how to do:
> > SEC("fentry/bpf_map_trace_update_elem")
> > int BPF_PROG(copy_on_write__update,
> >               struct bpf_map *map,
> >               struct allow_reads_key__old *key,
> >               void *value, u64 map_flags)
> >
> > It kinda sucks that bpf_probe_read_kernel is necessary to read key/values.
> > It would be much nicer to be able to specify the exact struct for the key and
> > access it directly.
> > The verifier does this already for map iterator.
> > It's 'void *' on the kernel side while iterator prog can cast this pointer
> > to specific 'struct key *' and access it directly.
> > See bpf_iter_reg->ctx_arg_info and btf_ctx_access().
> >
> > For fentry into bpf_map_trace_update_elem it's a bit more challenging,
> > since it will be called for all maps and there is no way to statically
> > check that specific_map->key_size is within prog->aux->max_rdonly_access.
> >
> > Maybe we can do a dynamic cast helper (similar to those that cast sockets)
> > that will check for key_size at run-time?
> > Another alternative is to allow 'void *' -> PTR_TO_BTF_ID conversion
> > and let inlined probe_read do the job.
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-03 17:45     ` Joe Burton
@ 2021-11-03 17:49       ` Alexei Starovoitov
  2021-11-04  4:23         ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2021-11-03 17:49 UTC (permalink / raw)
  To: Joe Burton
  Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, LKML, Network Development, bpf, Petar Penkov,
	Stanislav Fomichev, Joe Burton

On Wed, Nov 3, 2021 at 10:45 AM Joe Burton <jevburton.kernel@gmail.com> wrote:
>
> Sort of - I hit issues when defining the function in the same
> compilation unit as the call site. For example:
>
>   static noinline int bpf_array_map_trace_update(struct bpf_map *map,
>                 void *key, void *value, u64 map_flags)

Not quite :)
You've had this issue because of 'static noinline'.
Just 'noinline' would not have such issues even in the same file.

Reminder: please don't top post and trim your replies.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-03 17:49       ` Alexei Starovoitov
@ 2021-11-04  4:23         ` Yonghong Song
  2021-11-04  4:27           ` Alexei Starovoitov
  2021-11-04 16:14           ` Alexei Starovoitov
  0 siblings, 2 replies; 18+ messages in thread
From: Yonghong Song @ 2021-11-04  4:23 UTC (permalink / raw)
  To: Alexei Starovoitov, Joe Burton
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, LKML,
	Network Development, bpf, Petar Penkov, Stanislav Fomichev,
	Joe Burton



On 11/3/21 10:49 AM, Alexei Starovoitov wrote:
> On Wed, Nov 3, 2021 at 10:45 AM Joe Burton <jevburton.kernel@gmail.com> wrote:
>>
>> Sort of - I hit issues when defining the function in the same
>> compilation unit as the call site. For example:
>>
>>    static noinline int bpf_array_map_trace_update(struct bpf_map *map,
>>                  void *key, void *value, u64 map_flags)
> 
> Not quite :)
> You've had this issue because of 'static noinline'.
> Just 'noinline' would not have such issues even in the same file.

This does not seem to be true. With the latest trunk clang:

[$ ~/tmp2] cat t.c
int __attribute__((noinline)) foo() { return 1; }
int bar() { return foo() + foo(); }
[$ ~/tmp2] clang -O2 -c t.c
[$ ~/tmp2] llvm-objdump -d t.o

t.o:    file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
        0: b8 01 00 00 00                movl    $1, %eax
        5: c3                            retq
        6: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)

0000000000000010 <bar>:
       10: b8 02 00 00 00                movl    $2, %eax
       15: c3                            retq
[$ ~/tmp2]

The compiler did the optimization and the original noinline function is
still in the binary.

Having a single foo() call in bar() has the same effect.

asm("") indeed helped preserve the call.

[$ ~/tmp2] cat t.c
int __attribute__((noinline)) foo() { asm(""); return 1; }
int bar() { return foo() + foo(); }
[$ ~/tmp2] clang -O2 -c t.c
[$ ~/tmp2] llvm-objdump -d t.o

t.o:    file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
        0: b8 01 00 00 00                movl    $1, %eax
        5: c3                            retq
        6: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)

0000000000000010 <bar>:
       10: 50                            pushq   %rax
       11: e8 00 00 00 00                callq   0x16 <bar+0x6>
       16: e8 00 00 00 00                callq   0x1b <bar+0xb>
       1b: b8 02 00 00 00                movl    $2, %eax
       20: 59                            popq    %rcx
       21: c3                            retq
[$ ~/tmp2]

Note that with asm(""), foo() is called twice, but the compiler still knows
foo()'s return value is 1, so it does the addition at compile time,
assigns 2 to %eax, and returns.

Having a single foo() in bar() has the same effect.

[$ ~/tmp2] cat t.c
int __attribute__((noinline)) foo() { return 1; }
int bar() { return foo(); }
[$ ~/tmp2] clang -O2 -c t.c
[$ ~/tmp2] llvm-objdump -d t.o

t.o:    file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
        0: b8 01 00 00 00                movl    $1, %eax
        5: c3                            retq
        6: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)

0000000000000010 <bar>:
       10: b8 01 00 00 00                movl    $1, %eax
       15: c3                            retq
[$ ~/tmp2]

I checked with a few llvm compiler engineers at Facebook. They mentioned
that there is nothing preventing the compiler from poking inside a
noinline function and optimizing based on that knowledge.

> 
> Reminder: please don't top post and trim your replies.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-04  4:23         ` Yonghong Song
@ 2021-11-04  4:27           ` Alexei Starovoitov
  2021-11-04 16:14           ` Alexei Starovoitov
  1 sibling, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2021-11-04  4:27 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Joe Burton, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, LKML,
	Network Development, bpf, Petar Penkov, Stanislav Fomichev,
	Joe Burton

On Wed, Nov 3, 2021 at 9:23 PM Yonghong Song <yhs@fb.com> wrote:
>
> I checked with a few llvm compiler engineers at Facebook. They mentioned
> that there is nothing preventing the compiler from poking inside a
> noinline function and optimizing based on that knowledge.

Interesting. Thanks for digging in.
Good to know!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 2/3] bpf: Add selftests
  2021-11-02  2:14 ` [RFC PATCH v3 2/3] bpf: Add selftests Joe Burton
@ 2021-11-04  6:32   ` Hou Tao
  2021-11-09 18:17     ` Joe Burton
  0 siblings, 1 reply; 18+ messages in thread
From: Hou Tao @ 2021-11-04  6:32 UTC (permalink / raw)
  To: Joe Burton, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf
  Cc: Petar Penkov, Stanislav Fomichev, Joe Burton

Hi,

On 11/2/2021 10:14 AM, Joe Burton wrote:
> From: Joe Burton <jevburton@google.com>
>
> Add selftests verifying that each supported map type is traced.
>
> Signed-off-by: Joe Burton <jevburton@google.com>
> ---
>  .../selftests/bpf/prog_tests/map_trace.c      | 166 ++++++++++++++++++
>  .../selftests/bpf/progs/bpf_map_trace.c       |  95 ++++++++++
>  .../bpf/progs/bpf_map_trace_common.h          |  12 ++
>  3 files changed, 273 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/map_trace.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_map_trace_common.h
snip
> +	/*
> +	 * Invoke core BPF program.
> +	 */
> +	write_fd = open("/tmp/map_trace_test_file", O_CREAT | O_WRONLY);
> +	if (!ASSERT_GE(rc, 0, "open tmp file for writing"))
> +		goto out;
> +
> +	bytes_written = write(write_fd, &write_buf, sizeof(write_buf));
> +	if (!ASSERT_EQ(bytes_written, sizeof(write_buf), "write to tmp file"))
> +		return;
In fentry__x64_sys_write() you just trigger updates to the maps, so for
the portability of the test (e.g. making it runnable on arm64) and minimal
dependencies (e.g. not depending on /tmp), why not use nanosleep() and
replace fentry__x64_sys_write with tp/syscalls/sys_enter_nanosleep instead?
Also, it would be better if you could filter out other processes by pid.
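
A rough sketch of that shape, with a pid filter; the program name and the
my_pid global are illustrative only:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  int my_pid = 0;        /* set by the userspace test before attaching */

  SEC("tp/syscalls/sys_enter_nanosleep")
  int handle_nanosleep(void *ctx)
  {
          int pid = bpf_get_current_pid_tgid() >> 32;

          if (pid != my_pid)
                  return 0;
          /* ... trigger the map updates under test here ... */
          return 0;
  }

  char _license[] SEC("license") = "GPL";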

Thanks,
Tao


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-03 17:12   ` Joe Burton
@ 2021-11-04 10:59     ` Jamal Hadi Salim
  2021-11-04 11:08       ` Jamal Hadi Salim
  0 siblings, 1 reply; 18+ messages in thread
From: Jamal Hadi Salim @ 2021-11-04 10:59 UTC (permalink / raw)
  To: Joe Burton
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf, Petar Penkov,
	Stanislav Fomichev, Joe Burton

On 2021-11-03 13:12, Joe Burton wrote:
> That's a good point. Since the probe is invoked before the update takes
> place, it would not be possible to account for the possibility that the
> update failed.
> 
> Unless someone wants the `pre update' hook, I'll simply adjust the
> existing hooks' semantics so that they are invoked after the update.
> As discussed, this better suits the intended use case.
>

If the goal is to synchronize state between two maps (if I understood
the intent correctly), then it is more useful to go post-update.

cheers,
jamal

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-04 10:59     ` Jamal Hadi Salim
@ 2021-11-04 11:08       ` Jamal Hadi Salim
  0 siblings, 0 replies; 18+ messages in thread
From: Jamal Hadi Salim @ 2021-11-04 11:08 UTC (permalink / raw)
  To: Joe Burton
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf, Petar Penkov,
	Stanislav Fomichev, Joe Burton

On 2021-11-04 06:59, Jamal Hadi Salim wrote:
> On 2021-11-03 13:12, Joe Burton wrote:
>> That's a good point. Since the probe is invoked before the update takes
>> place, it would not be possible to account for the possibility that the
>> update failed.
>>
>> Unless someone wants the `pre update' hook, I'll simply adjust the
>> existing hooks' semantics so that they are invoked after the update.
>> As discussed, this better suits the intended use case.
>>
> 
> If the goal is to synchronize state between two maps (if I understood
> the intent correctly), then it is more useful to go post-update.
> 

To complete that thought: only positive results are interesting.
For example, if the command was to delete an entry which doesn't
exist, there is no point in reporting that (or is there?).
OTOH, a successful delete is useful...
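
In call-site terms the shape implied here is roughly the following; the
wrapper name is made up and only illustrates firing the hook post-update,
on success:

  /* Illustrative only: fire the tracing hook after the operation and
   * only when it actually succeeded.
   */
  static int delete_elem_and_trace(struct bpf_map *map, void *key)
  {
          int err = htab_map_delete_elem(map, key);

          if (!err)
                  bpf_map_trace_delete_elem(map, key);
          return err;
  }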

cheers,
jamal

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-04  4:23         ` Yonghong Song
  2021-11-04  4:27           ` Alexei Starovoitov
@ 2021-11-04 16:14           ` Alexei Starovoitov
  2021-11-04 17:11             ` Yonghong Song
  1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2021-11-04 16:14 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Joe Burton, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, LKML,
	Network Development, bpf, Petar Penkov, Stanislav Fomichev,
	Joe Burton

On Wed, Nov 3, 2021 at 9:23 PM Yonghong Song <yhs@fb.com> wrote:
>
> asm("") indeed helped preserve the call.
>
> [$ ~/tmp2] cat t.c
> int __attribute__((noinline)) foo() { asm(""); return 1; }
> int bar() { return foo() + foo(); }
> [$ ~/tmp2] clang -O2 -c t.c
> [$ ~/tmp2] llvm-objdump -d t.o
>
> t.o:    file format elf64-x86-64
>
> Disassembly of section .text:
>
> 0000000000000000 <foo>:
>         0: b8 01 00 00 00                movl    $1, %eax
>         5: c3                            retq
>         6: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
>
> 0000000000000010 <bar>:
>        10: 50                            pushq   %rax
>        11: e8 00 00 00 00                callq   0x16 <bar+0x6>
>        16: e8 00 00 00 00                callq   0x1b <bar+0xb>
>        1b: b8 02 00 00 00                movl    $2, %eax
>        20: 59                            popq    %rcx
>        21: c3                            retq
> [$ ~/tmp2]
>
> Note that with asm(""), foo() is called twice, but the compiler still knows
> foo()'s return value is 1, so it does the addition at compile time,
> assigns 2 to %eax, and returns.

Missed %eax=2 part...
That means that asm("") is not enough.
Maybe something like:
int __attribute__((noinline)) foo()
{
  int ret = 0;
  asm volatile("" : "=r"(var) : "0"(var));
  return ret;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
  2021-11-04 16:14           ` Alexei Starovoitov
@ 2021-11-04 17:11             ` Yonghong Song
  0 siblings, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2021-11-04 17:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Burton, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, LKML,
	Network Development, bpf, Petar Penkov, Stanislav Fomichev,
	Joe Burton



On 11/4/21 9:14 AM, Alexei Starovoitov wrote:
> On Wed, Nov 3, 2021 at 9:23 PM Yonghong Song <yhs@fb.com> wrote:
>>
>> asm("") indeed helped preserve the call.
>>
>> [$ ~/tmp2] cat t.c
>> int __attribute__((noinline)) foo() { asm(""); return 1; }
>> int bar() { return foo() + foo(); }
>> [$ ~/tmp2] clang -O2 -c t.c
>> [$ ~/tmp2] llvm-objdump -d t.o
>>
>> t.o:    file format elf64-x86-64
>>
>> Disassembly of section .text:
>>
>> 0000000000000000 <foo>:
>>          0: b8 01 00 00 00                movl    $1, %eax
>>          5: c3                            retq
>>          6: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
>>
>> 0000000000000010 <bar>:
>>         10: 50                            pushq   %rax
>>         11: e8 00 00 00 00                callq   0x16 <bar+0x6>
>>         16: e8 00 00 00 00                callq   0x1b <bar+0xb>
>>         1b: b8 02 00 00 00                movl    $2, %eax
>>         20: 59                            popq    %rcx
>>         21: c3                            retq
>> [$ ~/tmp2]
>>
>> Note that with asm(""), foo() is called twice, but the compiler still knows
>> foo()'s return value is 1, so it does the addition at compile time,
>> assigns 2 to %eax, and returns.
> 
> Missed %eax=2 part...
> That means that asm("") is not enough.
> Maybe something like:
> int __attribute__((noinline)) foo()
> {
>    int ret = 0;
>    asm volatile("" : "=r"(var) : "0"(var));

asm volatile("" : "=r"(ret) : "0"(ret));

>    return ret;
> }

Right. We should prevent compilers from "inlining" return values.
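
Putting the two pieces together, the hook body being converged on looks
roughly like this (illustrative sketch; signature follows the earlier
snippets in the thread):

  noinline int bpf_map_trace_update_elem(struct bpf_map *map, void *key,
                                         void *value, u64 map_flags)
  {
          int ret = 0;

          /* The volatile asm ties ret to a register constraint, so the
           * compiler can no longer assume the return value is always 0
           * and fold it away at the call site.
           */
          asm volatile ("" : "=r"(ret) : "0"(ret));
          return ret;
  }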

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v3 2/3] bpf: Add selftests
  2021-11-04  6:32   ` Hou Tao
@ 2021-11-09 18:17     ` Joe Burton
  0 siblings, 0 replies; 18+ messages in thread
From: Joe Burton @ 2021-11-09 18:17 UTC (permalink / raw)
  To: Hou Tao
  Cc: Joe Burton, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, linux-kernel, netdev, bpf, Petar Penkov,
	Stanislav Fomichev

Thu, Nov 04, 2021 at 02:32:37PM +0800, Hou Tao wrote:

> In fentry__x64_sys_write() you just trigger updates to the maps, so for
> the portability of the test (e.g. making it runnable on arm64)

Agreed that the test should be runnable on arm64. I haven't tested there
yet but I'll do that before sending out v4.

> and minimal dependencies (e.g. not depending on /tmp), why not use
> nanosleep() and replace fentry__x64_sys_write with
> tp/syscalls/sys_enter_nanosleep instead?

As written, the example actually modifies the return of write(), so I
don't think I can switch to tp/syscalls/* without significantly
reworking the example. To minimize the amount of reworking while
improving compatibility, how does this sound:

1. Add #ifdefs to support arm64
2. Instead of opening /tmp/map_trace_test_file, open /dev/null

Of course this isn't as portable as your proposal but I think it might
be an acceptable compromise.
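
For point 1, the #ifdef would look roughly like the sketch below, assuming
the __TARGET_ARCH_* defines the selftests already pass to clang; the names
are illustrative and the arm64 side is untested here:

  #if defined(__TARGET_ARCH_x86)
  #define SYS_WRITE_FENTRY "fentry/__x64_sys_write"
  #elif defined(__TARGET_ARCH_arm64)
  #define SYS_WRITE_FENTRY "fentry/__arm64_sys_write"
  #endif

  SEC(SYS_WRITE_FENTRY)
  int BPF_PROG(fentry__sys_write, struct pt_regs *regs)
  {
          /* ... same body as the existing fentry__x64_sys_write prog ... */
          return 0;
  }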

Best,
Joe


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-11-09 18:17 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-02  2:14 [RFC PATCH v3 0/3] Introduce BPF map tracing capability Joe Burton
2021-11-02  2:14 ` [RFC PATCH v3 1/3] bpf: Add map tracing functions and call sites Joe Burton
2021-11-02  2:14 ` [RFC PATCH v3 2/3] bpf: Add selftests Joe Burton
2021-11-04  6:32   ` Hou Tao
2021-11-09 18:17     ` Joe Burton
2021-11-02  2:14 ` [RFC PATCH v3 3/3] bpf: Add real world example for map tracing Joe Burton
2021-11-03  0:12 ` [RFC PATCH v3 0/3] Introduce BPF map tracing capability Alexei Starovoitov
2021-11-03 17:29   ` Yonghong Song
2021-11-03 17:45     ` Joe Burton
2021-11-03 17:49       ` Alexei Starovoitov
2021-11-04  4:23         ` Yonghong Song
2021-11-04  4:27           ` Alexei Starovoitov
2021-11-04 16:14           ` Alexei Starovoitov
2021-11-04 17:11             ` Yonghong Song
2021-11-03 10:34 ` Jamal Hadi Salim
2021-11-03 17:12   ` Joe Burton
2021-11-04 10:59     ` Jamal Hadi Salim
2021-11-04 11:08       ` Jamal Hadi Salim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).