* [PATCH bpf 0/3] BPF LRU map fix
From: Daniel Borkmann @ 2019-05-13 23:18 UTC
  To: ast; +Cc: kafai, bpf, netdev, Daniel Borkmann

This set fixes LRU map eviction in combination with map lookups from
the system call side, that is, from user space. The main patch is the
second one; test cases are adapted and added in the last one. Thanks!

Daniel Borkmann (3):
  bpf: add map_lookup_elem_sys_only for lookups from syscall side
  bpf, lru: avoid messing with eviction heuristics upon syscall lookup
  bpf: test ref bit from data path and add new tests for syscall path

 include/linux/bpf.h                        |   1 +
 kernel/bpf/hashtab.c                       |  23 ++-
 kernel/bpf/syscall.c                       |   5 +-
 tools/testing/selftests/bpf/test_lru_map.c | 288 +++++++++++++++++++++++++++--
 4 files changed, 297 insertions(+), 20 deletions(-)

-- 
2.9.5



* [PATCH bpf 1/3] bpf: add map_lookup_elem_sys_only for lookups from syscall side
From: Daniel Borkmann @ 2019-05-13 23:18 UTC
  To: ast; +Cc: kafai, bpf, netdev, Daniel Borkmann

Add a callback map_lookup_elem_sys_only() that map implementations
can use instead of map_lookup_elem() on the system call side in case
the map implementation needs to handle the latter differently than
from the BPF data path. If map_lookup_elem_sys_only() is set, it is
the preferred pick for map lookups out of user space. This hook is
used in a follow-up fix for the LRU map, but once the development
window opens, we can convert other map types from map_lookup_elem()
(meaning the one called upon the BPF_MAP_LOOKUP_ELEM cmd) over to
the callback to simplify and clean up the latter.
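
As an illustration, a map implementation opts in by setting the new
callback in its bpf_map_ops, e.g. (sketch with hypothetical names;
the actual LRU map conversion follows in the next patch):

  const struct bpf_map_ops my_map_ops = {
          [...]
          .map_lookup_elem          = my_map_lookup_elem,
          /* Preferred for lookups via the BPF_MAP_LOOKUP_ELEM command: */
          .map_lookup_elem_sys_only = my_map_lookup_elem_sys,
          [...]
  };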

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h  | 1 +
 kernel/bpf/syscall.c | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 59631dd..4fb3aa2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -36,6 +36,7 @@ struct bpf_map_ops {
 	void (*map_free)(struct bpf_map *map);
 	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
 	void (*map_release_uref)(struct bpf_map *map);
+	void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
 
 	/* funcs callable from userspace and from eBPF programs */
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ad3ccf8..cb5440b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -808,7 +808,10 @@ static int map_lookup_elem(union bpf_attr *attr)
 		err = map->ops->map_peek_elem(map, value);
 	} else {
 		rcu_read_lock();
-		ptr = map->ops->map_lookup_elem(map, key);
+		if (map->ops->map_lookup_elem_sys_only)
+			ptr = map->ops->map_lookup_elem_sys_only(map, key);
+		else
+			ptr = map->ops->map_lookup_elem(map, key);
 		if (IS_ERR(ptr)) {
 			err = PTR_ERR(ptr);
 		} else if (!ptr) {
-- 
2.9.5



* [PATCH bpf 2/3] bpf, lru: avoid messing with eviction heuristics upon syscall lookup
From: Daniel Borkmann @ 2019-05-13 23:18 UTC
  To: ast; +Cc: kafai, bpf, netdev, Daniel Borkmann

One of the biggest issues we face right now with picking the LRU map
over a regular hash table is that a map walk out of user space, for
example, to just dump the existing entries or to remove certain ones,
will completely mess up the LRU eviction heuristics, and wrong entries
such as just-created ones will get evicted instead. The reason for
this is that we mark an entry as "in use" via bpf_lru_node_set_ref()
from the system call lookup side as well. Thus, upon a walk, all
entries end up marked, so the information about the actual least
recently used ones is "lost".
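
For instance, a plain user-space dump along the lines of the below
sketch (libbpf wrappers, 8 byte values assumed, error handling
omitted) would have re-marked every single entry:

  unsigned long long key, next_key, value;
  int first = 1;

  while (!bpf_map_get_next_key(map_fd, first ? NULL : &key, &next_key)) {
          /* So far, this lookup also set the entry's ref bit. */
          bpf_map_lookup_elem(map_fd, &next_key, &value);
          key = next_key;
          first = 0;
  }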

In the case of Cilium, which can use the LRU map (among other things)
as a BPF based connection tracker, this behavior causes disruption
upon control plane changes that need to walk the map from user space
to evict certain entries. The outcome of the discussion at bpfconf [0]
was that we should simply remove the marking from the system call
side, as no good use case could be found where it is actually needed
there. Therefore, this patch removes the marking for the regular and
per-CPU LRU flavors. Should there ever be a need in the future, the
behavior could be made selectable via a map creation flag, but for
the mentioned reason we avoid this here.

  [0] http://vger.kernel.org/bpfconf.html

Fixes: 29ba732acbee ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
Fixes: 8f8449384ec3 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 192d32e..0f2708f 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -527,18 +527,30 @@ static u32 htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 	return insn - insn_buf;
 }
 
-static void *htab_lru_map_lookup_elem(struct bpf_map *map, void *key)
+static __always_inline void *__htab_lru_map_lookup_elem(struct bpf_map *map,
+							void *key, const bool mark)
 {
 	struct htab_elem *l = __htab_map_lookup_elem(map, key);
 
 	if (l) {
-		bpf_lru_node_set_ref(&l->lru_node);
+		if (mark)
+			bpf_lru_node_set_ref(&l->lru_node);
 		return l->key + round_up(map->key_size, 8);
 	}
 
 	return NULL;
 }
 
+static void *htab_lru_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return __htab_lru_map_lookup_elem(map, key, true);
+}
+
+static void *htab_lru_map_lookup_elem_sys(struct bpf_map *map, void *key)
+{
+	return __htab_lru_map_lookup_elem(map, key, false);
+}
+
 static u32 htab_lru_map_gen_lookup(struct bpf_map *map,
 				   struct bpf_insn *insn_buf)
 {
@@ -1250,6 +1262,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
+	.map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys,
 	.map_update_elem = htab_lru_map_update_elem,
 	.map_delete_elem = htab_lru_map_delete_elem,
 	.map_gen_lookup = htab_lru_map_gen_lookup,
@@ -1281,7 +1294,6 @@ static void *htab_lru_percpu_map_lookup_elem(struct bpf_map *map, void *key)
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
 {
-	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct htab_elem *l;
 	void __percpu *pptr;
 	int ret = -ENOENT;
@@ -1297,8 +1309,9 @@ int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
 	l = __htab_map_lookup_elem(map, key);
 	if (!l)
 		goto out;
-	if (htab_is_lru(htab))
-		bpf_lru_node_set_ref(&l->lru_node);
+	/* We do not mark LRU map element here in order to not mess up
+	 * eviction heuristics when user space does a map walk.
+	 */
 	pptr = htab_elem_get_ptr(l, map->key_size);
 	for_each_possible_cpu(cpu) {
 		bpf_long_memcpy(value + off,
-- 
2.9.5



* [PATCH bpf 3/3] bpf: test ref bit from data path and add new tests for syscall path
From: Daniel Borkmann @ 2019-05-13 23:18 UTC
  To: ast; +Cc: kafai, bpf, netdev, Daniel Borkmann

The test_lru_map relies on marking the LRU map entry via a regular
BPF map lookup from the system call side, basically for simplicity
reasons. Given we fixed marking entries in that case, the test needs
to be fixed as well. Here we add a small drop-in replacement that
retains the existing behavior for the tests by marking from within
a BPF program and transferring the retrieved value out via a
temporary map. This also adds new test cases to track the new
behavior where two elements are marked, one via the system call side
and one via the program side, and where the next update then evicts
the key that was looked up only from the system call side.
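
In rough C pseudo-code, the program behind the new helper
bpf_map_lookup_elem_with_ref_bit() below does the following (names
illustrative only):

  value = bpf_map_lookup_elem(&lru_map, &key); /* marks the ref bit */
  if (value) {
          tmp_array[0] = *value; /* transfer value out via temp map */
          return 42;             /* found */
  }
  return 1;                      /* not found */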

  # ./test_lru_map
  nr_cpus:8

  test_lru_sanity0 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity1 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity2 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity3 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity4 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity5 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity7 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity8 (map_type:9 map_flags:0x0): Pass

  test_lru_sanity0 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity1 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity2 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity3 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity4 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity5 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity7 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity8 (map_type:10 map_flags:0x0): Pass

  test_lru_sanity0 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity4 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity6 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity7 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity8 (map_type:9 map_flags:0x2): Pass

  test_lru_sanity0 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity4 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity6 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity7 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity8 (map_type:10 map_flags:0x2): Pass

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/test_lru_map.c | 288 +++++++++++++++++++++++++++--
 1 file changed, 274 insertions(+), 14 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_lru_map.c b/tools/testing/selftests/bpf/test_lru_map.c
index 781c7de..1b25a7e 100644
--- a/tools/testing/selftests/bpf/test_lru_map.c
+++ b/tools/testing/selftests/bpf/test_lru_map.c
@@ -18,9 +18,11 @@
 #include <sys/wait.h>
 
 #include <bpf/bpf.h>
+#include <bpf/libbpf.h>
 
 #include "bpf_util.h"
 #include "bpf_rlimit.h"
+#include "../../../include/linux/filter.h"
 
 #define LOCAL_FREE_TARGET	(128)
 #define PERCPU_FREE_TARGET	(4)
@@ -40,6 +42,68 @@ static int create_map(int map_type, int map_flags, unsigned int size)
 	return map_fd;
 }
 
+static int bpf_map_lookup_elem_with_ref_bit(int fd, unsigned long long key,
+					    void *value)
+{
+	struct bpf_load_program_attr prog;
+	struct bpf_create_map_attr map;
+	struct bpf_insn insns[] = {
+		BPF_LD_MAP_VALUE(BPF_REG_9, 0, 0),
+		BPF_LD_MAP_FD(BPF_REG_1, fd),
+		BPF_LD_IMM64(BPF_REG_3, key),
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+		BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_3, 0),
+		BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+		BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
+		BPF_STX_MEM(BPF_DW, BPF_REG_9, BPF_REG_1, 0),
+		BPF_MOV64_IMM(BPF_REG_0, 42),
+		BPF_JMP_IMM(BPF_JA, 0, 0, 1),
+		BPF_MOV64_IMM(BPF_REG_0, 1),
+		BPF_EXIT_INSN(),
+	};
+	__u8 data[64] = {};
+	int mfd, pfd, ret, zero = 0;
+	__u32 retval = 0;
+
+	memset(&map, 0, sizeof(map));
+	map.map_type = BPF_MAP_TYPE_ARRAY;
+	map.key_size = sizeof(int);
+	map.value_size = sizeof(unsigned long long);
+	map.max_entries = 1;
+
+	mfd = bpf_create_map_xattr(&map);
+	if (mfd < 0)
+		return -1;
+
+	insns[0].imm = mfd;
+
+	memset(&prog, 0, sizeof(prog));
+	prog.prog_type = BPF_PROG_TYPE_SCHED_CLS;
+	prog.insns = insns;
+	prog.insns_cnt = ARRAY_SIZE(insns);
+	prog.license = "GPL";
+
+	pfd = bpf_load_program_xattr(&prog, NULL, 0);
+	if (pfd < 0) {
+		close(mfd);
+		return -1;
+	}
+
+	ret = bpf_prog_test_run(pfd, 1, data, sizeof(data),
+				NULL, NULL, &retval, NULL);
+	if (ret < 0 || retval != 42) {
+		ret = -1;
+	} else {
+		assert(!bpf_map_lookup_elem(mfd, &zero, value));
+		ret = 0;
+	}
+	close(pfd);
+	close(mfd);
+	return ret;
+}
+
 static int map_subset(int map0, int map1)
 {
 	unsigned long long next_key = 0;
@@ -87,7 +151,7 @@ static int sched_next_online(int pid, int *next_to_try)
 	return ret;
 }
 
-/* Size of the LRU amp is 2
+/* Size of the LRU map is 2
  * Add key=1 (+1 key)
  * Add key=2 (+1 key)
  * Lookup Key=1
@@ -157,7 +221,7 @@ static void test_lru_sanity0(int map_type, int map_flags)
 	 * stop LRU from removing key=1
 	 */
 	key = 1;
-	assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+	assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
 	assert(value[0] == 1234);
 
 	key = 3;
@@ -167,7 +231,8 @@ static void test_lru_sanity0(int map_type, int map_flags)
 
 	/* key=2 has been removed from the LRU */
 	key = 2;
-	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1);
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
 
 	assert(map_equal(lru_map_fd, expected_map_fd));
 
@@ -221,7 +286,7 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
 	/* Lookup 1 to tgt_free/2 */
 	end_key = 1 + batch_size;
 	for (key = 1; key < end_key; key++) {
-		assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+		assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
 		assert(!bpf_map_update_elem(expected_map_fd, &key, value,
 					    BPF_NOEXIST));
 	}
@@ -322,10 +387,11 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
 	end_key = 1 + batch_size;
 	value[0] = 4321;
 	for (key = 1; key < end_key; key++) {
-		assert(bpf_map_lookup_elem(lru_map_fd, &key, value));
+		assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+		       errno == ENOENT);
 		assert(!bpf_map_update_elem(lru_map_fd, &key, value,
 					    BPF_NOEXIST));
-		assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+		assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
 		assert(value[0] == 4321);
 		assert(!bpf_map_update_elem(expected_map_fd, &key, value,
 					    BPF_NOEXIST));
@@ -404,7 +470,7 @@ static void test_lru_sanity3(int map_type, int map_flags, unsigned int tgt_free)
 	/* Lookup key 1 to tgt_free*3/2 */
 	end_key = tgt_free + batch_size;
 	for (key = 1; key < end_key; key++) {
-		assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+		assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
 		assert(!bpf_map_update_elem(expected_map_fd, &key, value,
 					    BPF_NOEXIST));
 	}
@@ -463,7 +529,7 @@ static void test_lru_sanity4(int map_type, int map_flags, unsigned int tgt_free)
 	assert(bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
 
 	for (key = 1; key <= tgt_free; key++) {
-		assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+		assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
 		assert(!bpf_map_update_elem(expected_map_fd, &key, value,
 					    BPF_NOEXIST));
 	}
@@ -494,16 +560,16 @@ static void do_test_lru_sanity5(unsigned long long last_key, int map_fd)
 	unsigned long long key, value[nr_cpus];
 
 	/* Ensure the last key inserted by previous CPU can be found */
-	assert(!bpf_map_lookup_elem(map_fd, &last_key, value));
-
+	assert(!bpf_map_lookup_elem_with_ref_bit(map_fd, last_key, value));
 	value[0] = 1234;
 
 	key = last_key + 1;
 	assert(!bpf_map_update_elem(map_fd, &key, value, BPF_NOEXIST));
-	assert(!bpf_map_lookup_elem(map_fd, &key, value));
+	assert(!bpf_map_lookup_elem_with_ref_bit(map_fd, key, value));
 
 	/* Cannot find the last key because it was removed by LRU */
-	assert(bpf_map_lookup_elem(map_fd, &last_key, value));
+	assert(bpf_map_lookup_elem(map_fd, &last_key, value) == -1 &&
+	       errno == ENOENT);
 }
 
 /* Test map with only one element */
@@ -590,8 +656,8 @@ static void test_lru_sanity6(int map_type, int map_flags, int tgt_free)
 		/* Make ref bit sticky for key: [1, tgt_free] */
 		for (stable_key = 1; stable_key <= tgt_free; stable_key++) {
 			/* Mark the ref bit */
-			assert(!bpf_map_lookup_elem(lru_map_fd, &stable_key,
-						    value));
+			assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd,
+								 stable_key, value));
 		}
 		assert(!bpf_map_update_elem(lru_map_fd, &key, value,
 					    BPF_NOEXIST));
@@ -612,6 +678,198 @@ static void test_lru_sanity6(int map_type, int map_flags, int tgt_free)
 	printf("Pass\n");
 }
 
+/* Size of the LRU map is 2
+ * Add key=1 (+1 key)
+ * Add key=2 (+1 key)
+ * Lookup Key=1 (datapath)
+ * Lookup Key=2 (syscall)
+ * Add Key=3
+ *   => Key=2 will be removed by LRU
+ * Iterate map.  Only found key=1 and key=3
+ */
+static void test_lru_sanity7(int map_type, int map_flags)
+{
+	unsigned long long key, value[nr_cpus];
+	int lru_map_fd, expected_map_fd;
+	int next_cpu = 0;
+
+	printf("%s (map_type:%d map_flags:0x%X): ", __func__, map_type,
+	       map_flags);
+
+	assert(sched_next_online(0, &next_cpu) != -1);
+
+	if (map_flags & BPF_F_NO_COMMON_LRU)
+		lru_map_fd = create_map(map_type, map_flags, 2 * nr_cpus);
+	else
+		lru_map_fd = create_map(map_type, map_flags, 2);
+	assert(lru_map_fd != -1);
+
+	expected_map_fd = create_map(BPF_MAP_TYPE_HASH, 0, 2);
+	assert(expected_map_fd != -1);
+
+	value[0] = 1234;
+
+	/* insert key=1 element */
+
+	key = 1;
+	assert(!bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
+	assert(!bpf_map_update_elem(expected_map_fd, &key, value,
+				    BPF_NOEXIST));
+
+	/* BPF_NOEXIST means: add new element if it doesn't exist */
+	assert(bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST) == -1
+	       /* key=1 already exists */
+	       && errno == EEXIST);
+
+	/* insert key=2 element */
+
+	/* check that key=2 is not found */
+	key = 2;
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
+
+	/* BPF_EXIST means: update existing element */
+	assert(bpf_map_update_elem(lru_map_fd, &key, value, BPF_EXIST) == -1 &&
+	       /* key=2 is not there */
+	       errno == ENOENT);
+
+	assert(!bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
+
+	/* insert key=3 element */
+
+	/* check that key=3 is not found */
+	key = 3;
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
+
+	/* check that key=1 can be found and mark the ref bit to
+	 * stop LRU from removing key=1
+	 */
+	key = 1;
+	assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
+	assert(value[0] == 1234);
+
+	/* check that key=2 can be found and do _not_ mark ref bit.
+	 * this will be evicted on next update.
+	 */
+	key = 2;
+	assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+	assert(value[0] == 1234);
+
+	key = 3;
+	assert(!bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
+	assert(!bpf_map_update_elem(expected_map_fd, &key, value,
+				    BPF_NOEXIST));
+
+	/* key=2 has been removed from the LRU */
+	key = 2;
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
+
+	assert(map_equal(lru_map_fd, expected_map_fd));
+
+	close(expected_map_fd);
+	close(lru_map_fd);
+
+	printf("Pass\n");
+}
+
+/* Size of the LRU map is 2
+ * Add key=1 (+1 key)
+ * Add key=2 (+1 key)
+ * Lookup Key=1 (syscall)
+ * Lookup Key=2 (datapath)
+ * Add Key=3
+ *   => Key=1 will be removed by LRU
+ * Iterate map.  Only found key=2 and key=3
+ */
+static void test_lru_sanity8(int map_type, int map_flags)
+{
+	unsigned long long key, value[nr_cpus];
+	int lru_map_fd, expected_map_fd;
+	int next_cpu = 0;
+
+	printf("%s (map_type:%d map_flags:0x%X): ", __func__, map_type,
+	       map_flags);
+
+	assert(sched_next_online(0, &next_cpu) != -1);
+
+	if (map_flags & BPF_F_NO_COMMON_LRU)
+		lru_map_fd = create_map(map_type, map_flags, 2 * nr_cpus);
+	else
+		lru_map_fd = create_map(map_type, map_flags, 2);
+	assert(lru_map_fd != -1);
+
+	expected_map_fd = create_map(BPF_MAP_TYPE_HASH, 0, 2);
+	assert(expected_map_fd != -1);
+
+	value[0] = 1234;
+
+	/* insert key=1 element */
+
+	key = 1;
+	assert(!bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
+
+	/* BPF_NOEXIST means: add new element if it doesn't exist */
+	assert(bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST) == -1
+	       /* key=1 already exists */
+	       && errno == EEXIST);
+
+	/* insert key=2 element */
+
+	/* check that key=2 is not found */
+	key = 2;
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
+
+	/* BPF_EXIST means: update existing element */
+	assert(bpf_map_update_elem(lru_map_fd, &key, value, BPF_EXIST) == -1 &&
+	       /* key=2 is not there */
+	       errno == ENOENT);
+
+	assert(!bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
+	assert(!bpf_map_update_elem(expected_map_fd, &key, value,
+				    BPF_NOEXIST));
+
+	/* insert key=3 element */
+
+	/* check that key=3 is not found */
+	key = 3;
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
+
+	/* check that key=1 can be found and do _not_ mark ref bit.
+	 * this will be evicted on next update.
+	 */
+	key = 1;
+	assert(!bpf_map_lookup_elem(lru_map_fd, &key, value));
+	assert(value[0] == 1234);
+
+	/* check that key=2 can be found and mark the ref bit to
+	 * stop LRU from removing key=2
+	 */
+	key = 2;
+	assert(!bpf_map_lookup_elem_with_ref_bit(lru_map_fd, key, value));
+	assert(value[0] == 1234);
+
+	key = 3;
+	assert(!bpf_map_update_elem(lru_map_fd, &key, value, BPF_NOEXIST));
+	assert(!bpf_map_update_elem(expected_map_fd, &key, value,
+				    BPF_NOEXIST));
+
+	/* key=1 has been removed from the LRU */
+	key = 1;
+	assert(bpf_map_lookup_elem(lru_map_fd, &key, value) == -1 &&
+	       errno == ENOENT);
+
+	assert(map_equal(lru_map_fd, expected_map_fd));
+
+	close(expected_map_fd);
+	close(lru_map_fd);
+
+	printf("Pass\n");
+}
+
 int main(int argc, char **argv)
 {
 	int map_types[] = {BPF_MAP_TYPE_LRU_HASH,
@@ -637,6 +895,8 @@ int main(int argc, char **argv)
 			test_lru_sanity4(map_types[t], map_flags[f], tgt_free);
 			test_lru_sanity5(map_types[t], map_flags[f]);
 			test_lru_sanity6(map_types[t], map_flags[f], tgt_free);
+			test_lru_sanity7(map_types[t], map_flags[f]);
+			test_lru_sanity8(map_types[t], map_flags[f]);
 
 			printf("\n");
 		}
-- 
2.9.5



* Re: [PATCH bpf 1/3] bpf: add map_lookup_elem_sys_only for lookups from syscall side
From: Andrii Nakryiko @ 2019-05-14  5:04 UTC
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, Martin Lau, bpf, Networking

On Mon, May 13, 2019 at 4:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Add a callback map_lookup_elem_sys_only() that map implementations
> can use instead of map_lookup_elem() on the system call side in case
> the map implementation needs to handle the latter differently than
> from the BPF data path. If map_lookup_elem_sys_only() is set, it is
> the preferred pick for map lookups out of user space. This hook is

This is kind of surprising behavior with a preferred vs. default
lookup code path. Why can't the desired behavior be achieved with an
extra flag, similar to BPF_F_LOCK? That seems like a more explicit,
more extensible and more generic approach, avoiding duplication of
lookup semantics.

E.g., for the LRU map, with a flag on lookup, one could decide
whether a lookup from inside a BPF program (not just from the syscall
side!) should modify LRU ordering or not, simply by specifying the
extra flag. Am I missing some complication that prevents us from
doing it that way?
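
I.e., roughly something like the following from the syscall side,
modeled on the bpf_map_lookup_elem_flags() wrapper we already have
for BPF_F_LOCK (whether the flag is opt-in or opt-out left open):

  bpf_map_lookup_elem_flags(map_fd, &key, value, flags);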

> used in a follow-up fix for the LRU map, but once the development
> window opens, we can convert other map types from map_lookup_elem()
> (meaning the one called upon the BPF_MAP_LOOKUP_ELEM cmd) over to
> the callback to simplify and clean up the latter.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/bpf.h  | 1 +
>  kernel/bpf/syscall.c | 5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 59631dd..4fb3aa2 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -36,6 +36,7 @@ struct bpf_map_ops {
>         void (*map_free)(struct bpf_map *map);
>         int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
>         void (*map_release_uref)(struct bpf_map *map);
> +       void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
>
>         /* funcs callable from userspace and from eBPF programs */
>         void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ad3ccf8..cb5440b 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -808,7 +808,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>                 err = map->ops->map_peek_elem(map, value);
>         } else {
>                 rcu_read_lock();
> -               ptr = map->ops->map_lookup_elem(map, key);
> +               if (map->ops->map_lookup_elem_sys_only)
> +                       ptr = map->ops->map_lookup_elem_sys_only(map, key);
> +               else
> +                       ptr = map->ops->map_lookup_elem(map, key);
>                 if (IS_ERR(ptr)) {
>                         err = PTR_ERR(ptr);
>                 } else if (!ptr) {
> --
> 2.9.5
>


* Re: [PATCH bpf 1/3] bpf: add map_lookup_elem_sys_only for lookups from syscall side
From: Daniel Borkmann @ 2019-05-14  7:59 UTC
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, Martin Lau, bpf, Networking

On 05/14/2019 07:04 AM, Andrii Nakryiko wrote:
> On Mon, May 13, 2019 at 4:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> Add a callback map_lookup_elem_sys_only() that map implementations
>> can use instead of map_lookup_elem() on the system call side in case
>> the map implementation needs to handle the latter differently than
>> from the BPF data path. If map_lookup_elem_sys_only() is set, it is
>> the preferred pick for map lookups out of user space. This hook is
> 
> This is kind of surprising behavior with a preferred vs. default
> lookup code path. Why can't the desired behavior be achieved with an
> extra flag, similar to BPF_F_LOCK? That seems like a more explicit,
> more extensible and more generic approach, avoiding duplication of
> lookup semantics.

For lookups from the syscall side, this is possible of course. Given
the current situation breaks the heuristic with any walk of the LRU
map, I presume you are saying something like an opt-in flag such as
BPF_F_MARK_USED would be more useful? I was thinking about something
like this initially, but then I couldn't come up with a concrete use
case where it's needed/useful today for user space. Given that, my
preference was to wait with adding such a flag until there is an
actual need for it; in any case, it is trivial to add it later on. Do
you have a concrete use case today that would justify such a flag?

> E.g., for the LRU map, with a flag on lookup, one could decide
> whether a lookup from inside a BPF program (not just from the syscall
> side!) should modify LRU ordering or not, simply by specifying the
> extra flag. Am I missing some complication that prevents us from
> doing it that way?

For programs it's a bit tricky. The BPF call interface is ...

  BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)

... meaning the verifier does not care what argument 3 and beyond
contain. From the BPF context/pov, those could also be uninitialized
registers. This would mean we'd need to add a
BPF_CALL_3(bpf_map_lookup_elem2, ...) interface which programs would
use instead (so as not to break existing ones), or some other new
helper call that takes a map value argument to unmark the element on
the LRU side. While all of this is doable one way or another, albeit
a bit hacky, we should probably clarify and understand the use case
for it first, which brings me back to the last question from the
paragraph above.
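
Spelled out, such an interface would roughly look like ...

  BPF_CALL_3(bpf_map_lookup_elem2, struct bpf_map *, map, void *, key,
             u64, flags)

... exposed as a new helper with its own UAPI helper ID; name and
exact semantics above are illustrative only.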

Thanks,
Daniel


* Re: [PATCH bpf 1/3] bpf: add map_lookup_elem_sys_only for lookups from syscall side
From: Andrii Nakryiko @ 2019-05-14 16:34 UTC
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, Martin Lau, bpf, Networking

On Tue, May 14, 2019 at 12:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 05/14/2019 07:04 AM, Andrii Nakryiko wrote:
> > On Mon, May 13, 2019 at 4:20 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> Add a callback map_lookup_elem_sys_only() that map implementations
> >> can use instead of map_lookup_elem() on the system call side in case
> >> the map implementation needs to handle the latter differently than
> >> from the BPF data path. If map_lookup_elem_sys_only() is set, it is
> >> the preferred pick for map lookups out of user space. This hook is
> >
> > This is kind of surprising behavior with a preferred vs. default
> > lookup code path. Why can't the desired behavior be achieved with an
> > extra flag, similar to BPF_F_LOCK? That seems like a more explicit,
> > more extensible and more generic approach, avoiding duplication of
> > lookup semantics.
>
> For lookups from the syscall side, this is possible of course. Given
> the current situation breaks the heuristic with any walk of the LRU
> map, I presume you are saying something like an opt-in flag such as
> BPF_F_MARK_USED would be more useful? I was thinking about something

To preserve existing semantics, it would be an opt-out
BPF_F_DONT_MARK_USED flag for when you don't want to update the LRU,
so that existing use cases don't break.

> like this initially, but then I couldn't come up with a concrete use
> case where it's needed/useful today for user space. Given that, my
> preference was to wait with adding such a flag until there is an
> actual need for it; in any case, it is trivial to add it later on. Do
> you have a concrete use case today that would justify such a flag?

So my concern was with having two lookup ops for maps
(map_lookup_elem() and map_lookup_elem_sys_only()) which for existing
use cases differ only in whether we are reordering the LRU on lookup
or not; this felt like it would be cleaner to solve by extending
ops->map_lookup_elem() to accept flags. But now I realize that there
are important implementation limitations preventing us from doing
this cleanly and efficiently, so I rescind my proposal.

>
> > E.g., for the LRU map, with a flag on lookup, one could decide
> > whether a lookup from inside a BPF program (not just from the syscall
> > side!) should modify LRU ordering or not, simply by specifying the
> > extra flag. Am I missing some complication that prevents us from
> > doing it that way?
>
> For programs it's a bit tricky. The BPF call interface is ...
>
>   BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
>
> ... meaning the verifier does not care what argument 3 and beyond
> contain. From the BPF context/pov, those could also be uninitialized
> registers. This would mean we'd need to add a
> BPF_CALL_3(bpf_map_lookup_elem2, ...) interface which programs would
> use instead (so as not to break existing ones), or some other new
> helper call that takes a map value argument to unmark the element on
> the LRU side. While all of this is doable one way or another, albeit
> a bit hacky, we should probably clarify and understand the use case
> for it first, which brings me back to the last question from the
> paragraph above.

Yeah, if we wanted to expose this functionality from the BPF side
right now, we'd have to add a new helper with an extra flags arg. As
I mentioned above, though, I assumed it wouldn't be too hard to make
the existing
BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
translate to map->ops->map_lookup_elem(key, 0 /* flags */), filling
in a default flags = 0 value, but apparently that's not that simple
(and would hurt performance).
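
I.e., hypothetically extending the op as in this sketch:

  /* flags would be 0 when coming from today's two-argument helper */
  void *(*map_lookup_elem)(struct bpf_map *map, void *key, u64 flags);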

>
> Thanks,
> Daniel


* Re: [PATCH bpf 0/3] BPF LRU map fix
From: Andrii Nakryiko @ 2019-05-14 17:24 UTC
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, Martin Lau, bpf, Networking

On Mon, May 13, 2019 at 4:19 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This set fixes LRU map eviction in combination with map lookups from
> the system call side, that is, from user space. The main patch is the
> second one; test cases are adapted and added in the last one. Thanks!
>
> Daniel Borkmann (3):
>   bpf: add map_lookup_elem_sys_only for lookups from syscall side
>   bpf, lru: avoid messing with eviction heuristics upon syscall lookup
>   bpf: test ref bit from data path and add new tests for syscall path
>
>  include/linux/bpf.h                        |   1 +
>  kernel/bpf/hashtab.c                       |  23 ++-
>  kernel/bpf/syscall.c                       |   5 +-
>  tools/testing/selftests/bpf/test_lru_map.c | 288 +++++++++++++++++++++++++++--
>  4 files changed, 297 insertions(+), 20 deletions(-)
>
> --
> 2.9.5
>

For the series:

Acked-by: Andrii Nakryiko <andriin@fb.com>


* Re: [PATCH bpf 0/3] BPF LRU map fix
From: Alexei Starovoitov @ 2019-05-14 17:56 UTC
  To: Andrii Nakryiko
  Cc: Daniel Borkmann, Alexei Starovoitov, Martin Lau, bpf, Networking

On Tue, May 14, 2019 at 10:24 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, May 13, 2019 at 4:19 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > This set fixes LRU map eviction in combination with map lookups from
> > the system call side, that is, from user space. The main patch is the
> > second one; test cases are adapted and added in the last one. Thanks!
> >
> > Daniel Borkmann (3):
> >   bpf: add map_lookup_elem_sys_only for lookups from syscall side
> >   bpf, lru: avoid messing with eviction heuristics upon syscall lookup
> >   bpf: test ref bit from data path and add new tests for syscall path
> >
> >  include/linux/bpf.h                        |   1 +
> >  kernel/bpf/hashtab.c                       |  23 ++-
> >  kernel/bpf/syscall.c                       |   5 +-
> >  tools/testing/selftests/bpf/test_lru_map.c | 288 +++++++++++++++++++++++++++--
> >  4 files changed, 297 insertions(+), 20 deletions(-)
> >
> > --
> > 2.9.5
> >
>
> For the series:
>
> Acked-by: Andrii Nakryiko <andriin@fb.com>

Applied. Thanks
