bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/2] add mechanism to report map pressure
@ 2023-05-31 11:05 Anton Protopopov
  2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
  2023-05-31 11:05 ` [PATCH bpf-next 2/2] selftests/bpf: test map pressure Anton Protopopov
  0 siblings, 2 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-05-31 11:05 UTC (permalink / raw)
  To: bpf; +Cc: Anton Protopopov, Joe Stringer, John Fastabend

This series adds a mechanism to report "map pressure" (read as "the current
number of elements" for most maps) to userspace.

The primary reason for adding this functionality in our case (Cilium) is to get
signals about how full some heavy-used maps are and what the actual dynamic
profile of map capacity is. In the case of RCU maps this is impossible to get
this information anyhow else. See also [1].

The first patch in the series adds the api and implements the ops for all maps
in hashtab.c + for the lpm_trie.c (the maps which are of most interest for us).
See the commit description for additional details.

The second commit adds a new map selftest map_tests/map_pressure.c.

  [1] https://lpc.events/event/16/contributions/1368/

Anton Protopopov (2):
  bpf: add new map ops ->map_pressure
  selftests/bpf: test map pressure

 include/linux/bpf.h                           |   1 +
 include/uapi/linux/bpf.h                      |   2 +-
 kernel/bpf/hashtab.c                          | 118 ++++---
 kernel/bpf/lpm_trie.c                         |   8 +
 kernel/bpf/syscall.c                          |  15 +
 tools/include/uapi/linux/bpf.h                |   2 +-
 .../selftests/bpf/map_tests/map_pressure.c    | 307 ++++++++++++++++++
 7 files changed, 406 insertions(+), 47 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/map_tests/map_pressure.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-05-31 11:05 [PATCH bpf-next 0/2] add mechanism to report map pressure Anton Protopopov
@ 2023-05-31 11:05 ` Anton Protopopov
  2023-05-31 18:24   ` Alexei Starovoitov
  2023-06-01  0:44   ` kernel test robot
  2023-05-31 11:05 ` [PATCH bpf-next 2/2] selftests/bpf: test map pressure Anton Protopopov
  1 sibling, 2 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-05-31 11:05 UTC (permalink / raw)
  To: bpf; +Cc: Anton Protopopov, Joe Stringer, John Fastabend

Add a new map ops named map_pressure to return map's "raw pressure". This value
is to be defined per map type, but in most cases this would be just the number
of elements currently present in the map: the more it grows, the more the
pressure is. (For array-based maps it seems right to set it to 0, i.e.,
"there's no pressure".)

By analogy with the 'max_entries' field the pressure value is a u32 integer.
The primary API to read it from userspace is via the map proc entry, where it
is reported in the "raw_pressure" filed, e.g., for an example map we have

    # echo -e `strace -e bpf,openat,read -s 1024 bpftool map show id 202 2>&1 | grep -A1 '/proc/self/fdinfo' | head -2 | tail -1 | cut -f2 -d' '`
    "pos:   0
    flags:  02000002
    mnt_id: 15
    ino:    18958
    map_type:       1
    key_size:       8
    value_size:     8
    max_entries:    1224
    map_flags:      0x1
    map_extra:      0x0
    memlock:        69632
    raw_pressure:   500
    map_id: 202
    frozen: 0
    ",

For old kernels and when the ->map_pressure map operation is not defined the
'raw_pressure' field is absent from the list.

The second way to get the raw_pressure is via BPF_OBJ_GET_INFO_BY_FD, where a
previously unused field in the struct bpf_map_info is now used to return this
value.

The patch adds relatively small amount of logic due to the fact that for most
maps the number of elements was already computed to provide the map memory
usage API, just not exported.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 include/linux/bpf.h            |   1 +
 include/uapi/linux/bpf.h       |   2 +-
 kernel/bpf/hashtab.c           | 118 ++++++++++++++++++++-------------
 kernel/bpf/lpm_trie.c          |   8 +++
 kernel/bpf/syscall.c           |  15 +++++
 tools/include/uapi/linux/bpf.h |   2 +-
 6 files changed, 99 insertions(+), 47 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f58895830ada..4d33fc6ed2ea 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -162,6 +162,7 @@ struct bpf_map_ops {
 				     void *callback_ctx, u64 flags);
 
 	u64 (*map_mem_usage)(const struct bpf_map *map);
+	u32 (*map_pressure)(const struct bpf_map *map);
 
 	/* BTF id of struct allocated by map_alloc */
 	int *map_btf_id;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9273c654743c..99580f2d006b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6363,7 +6363,7 @@ struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__u32 raw_pressure;
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 9901efee4339..331a923e29d5 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -133,6 +133,63 @@ static inline bool htab_is_prealloc(const struct bpf_htab *htab)
 	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
 }
 
+/* compute_batch_value() computes batch value as num_online_cpus() * 2
+ * and __percpu_counter_compare() needs
+ * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
+ * for percpu_counter to be faster than atomic_t. In practice the average bpf
+ * hash map size is 10k, which means that a system with 64 cpus will fill
+ * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
+ * define our own batch count as 32 then 10k hash map can be filled up to 80%:
+ * 10k - 8k > 32 _batch_ * 64 _cpus_
+ * and __percpu_counter_compare() will still be fast. At that point hash map
+ * collisions will dominate its performance anyway. Assume that hash map filled
+ * to 50+% isn't going to be O(1) and use the following formula to choose
+ * between percpu_counter and atomic_t.
+ *
+ * For preallocated maps we only increase/decrease counters on adding/removing
+ * an element to be later fetched by htab_map_pressure, so we always enable the
+ * per-cpu version in favor of atomic
+ */
+#define PERCPU_COUNTER_BATCH 32
+static bool htab_use_percpu_counter(union bpf_attr *attr)
+{
+	return (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH ||
+		!(attr->map_flags & BPF_F_NO_PREALLOC));
+}
+
+static bool is_map_full(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
+						PERCPU_COUNTER_BATCH) >= 0;
+	return atomic_read(&htab->count) >= htab->map.max_entries;
+}
+
+static void inc_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_inc(&htab->count);
+}
+
+static void dec_elem_count(struct bpf_htab *htab)
+{
+	if (htab->use_percpu_counter)
+		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
+	else
+		atomic_dec(&htab->count);
+}
+
+static u32 htab_map_pressure(const struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	if (htab->use_percpu_counter)
+		return __percpu_counter_sum(&htab->pcount);
+	return atomic_read(&htab->count);
+}
+
 static void htab_init_buckets(struct bpf_htab *htab)
 {
 	unsigned int i;
@@ -539,23 +596,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 
 	htab_init_buckets(htab);
 
-/* compute_batch_value() computes batch value as num_online_cpus() * 2
- * and __percpu_counter_compare() needs
- * htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
- * for percpu_counter to be faster than atomic_t. In practice the average bpf
- * hash map size is 10k, which means that a system with 64 cpus will fill
- * hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
- * define our own batch count as 32 then 10k hash map can be filled up to 80%:
- * 10k - 8k > 32 _batch_ * 64 _cpus_
- * and __percpu_counter_compare() will still be fast. At that point hash map
- * collisions will dominate its performance anyway. Assume that hash map filled
- * to 50+% isn't going to be O(1) and use the following formula to choose
- * between percpu_counter and atomic_t.
- */
-#define PERCPU_COUNTER_BATCH 32
-	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
-		htab->use_percpu_counter = true;
-
+	htab->use_percpu_counter = htab_use_percpu_counter(attr);
 	if (htab->use_percpu_counter) {
 		err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
 		if (err)
@@ -810,6 +851,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 		if (l == tgt_l) {
 			hlist_nulls_del_rcu(&l->hash_node);
 			check_and_free_fields(htab, l);
+			dec_elem_count(htab);
 			break;
 		}
 
@@ -896,40 +938,16 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 	}
 }
 
-static bool is_map_full(struct bpf_htab *htab)
-{
-	if (htab->use_percpu_counter)
-		return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
-						PERCPU_COUNTER_BATCH) >= 0;
-	return atomic_read(&htab->count) >= htab->map.max_entries;
-}
-
-static void inc_elem_count(struct bpf_htab *htab)
-{
-	if (htab->use_percpu_counter)
-		percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
-	else
-		atomic_inc(&htab->count);
-}
-
-static void dec_elem_count(struct bpf_htab *htab)
-{
-	if (htab->use_percpu_counter)
-		percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
-	else
-		atomic_dec(&htab->count);
-}
-
-
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
 
+	dec_elem_count(htab);
+
 	if (htab_is_prealloc(htab)) {
 		check_and_free_fields(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
-		dec_elem_count(htab);
 		htab_elem_free(htab, l);
 	}
 }
@@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			if (!l)
 				return ERR_PTR(-E2BIG);
 			l_new = container_of(l, struct htab_elem, fnode);
+			inc_elem_count(htab);
 		}
 	} else {
 		if (is_map_full(htab))
@@ -1227,9 +1246,11 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
 	 * concurrent search will find it before old elem
 	 */
 	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+	inc_elem_count(htab);
 	if (l_old) {
 		bpf_lru_node_set_ref(&l_new->lru_node);
 		hlist_nulls_del_rcu(&l_old->hash_node);
+		dec_elem_count(htab);
 	}
 	ret = 0;
 
@@ -1357,6 +1378,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 		pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
 				value, onallcpus);
 		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+		inc_elem_count(htab);
 		l_new = NULL;
 	}
 	ret = 0;
@@ -1443,9 +1465,10 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
-	if (l)
+	if (l) {
+		dec_elem_count(htab);
 		hlist_nulls_del_rcu(&l->hash_node);
-	else
+	} else
 		ret = -ENOENT;
 
 	htab_unlock_bucket(htab, b, hash, flags);
@@ -2249,6 +2272,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2271,6 +2295,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab_lru),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2423,6 +2448,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2443,6 +2469,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab_lru_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2581,6 +2608,7 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
 	.map_mem_usage = htab_map_mem_usage,
+	.map_pressure = htab_map_pressure,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 };
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index e0d3ddf2037a..24ff5feb07ca 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -730,6 +730,13 @@ static u64 trie_mem_usage(const struct bpf_map *map)
 	return elem_size * READ_ONCE(trie->n_entries);
 }
 
+static u32 trie_map_pressure(const struct bpf_map *map)
+{
+	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
+
+	return READ_ONCE(trie->n_entries);
+}
+
 BTF_ID_LIST_SINGLE(trie_map_btf_ids, struct, lpm_trie)
 const struct bpf_map_ops trie_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
@@ -744,5 +751,6 @@ const struct bpf_map_ops trie_map_ops = {
 	.map_delete_batch = generic_map_delete_batch,
 	.map_check_btf = trie_check_btf,
 	.map_mem_usage = trie_mem_usage,
+	.map_pressure = trie_map_pressure,
 	.map_btf_id = &trie_map_btf_ids[0],
 };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 92a57efc77de..6ea30a24f057 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -794,6 +794,13 @@ static fmode_t map_get_sys_perms(struct bpf_map *map, struct fd f)
 	return mode;
 }
 
+static u32 bpf_map_pressure(const struct bpf_map *map)
+{
+	if (map->ops->map_pressure)
+		return map->ops->map_pressure(map);
+	return 0;
+}
+
 #ifdef CONFIG_PROC_FS
 /* Show the memory usage of a bpf map */
 static u64 bpf_map_memory_usage(const struct bpf_map *map)
@@ -804,6 +811,7 @@ static u64 bpf_map_memory_usage(const struct bpf_map *map)
 static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 {
 	struct bpf_map *map = filp->private_data;
+	char map_pressure_buf[36] = "";
 	u32 type = 0, jited = 0;
 
 	if (map_type_contains_progs(map)) {
@@ -813,6 +821,10 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		spin_unlock(&map->owner.lock);
 	}
 
+	if (map->ops->map_pressure)
+		snprintf(map_pressure_buf, sizeof(map_pressure_buf),
+			 "raw_pressure:\t%u\n", map->ops->map_pressure(map));
+
 	seq_printf(m,
 		   "map_type:\t%u\n"
 		   "key_size:\t%u\n"
@@ -821,6 +833,7 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		   "map_flags:\t%#x\n"
 		   "map_extra:\t%#llx\n"
 		   "memlock:\t%llu\n"
+		   "%s"
 		   "map_id:\t%u\n"
 		   "frozen:\t%u\n",
 		   map->map_type,
@@ -830,6 +843,7 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		   map->map_flags,
 		   (unsigned long long)map->map_extra,
 		   bpf_map_memory_usage(map),
+		   map_pressure_buf,
 		   map->id,
 		   READ_ONCE(map->frozen));
 	if (type) {
@@ -4275,6 +4289,7 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	info.value_size = map->value_size;
 	info.max_entries = map->max_entries;
 	info.map_flags = map->map_flags;
+	info.raw_pressure = bpf_map_pressure(map);
 	info.map_extra = map->map_extra;
 	memcpy(info.name, map->name, sizeof(map->name));
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9273c654743c..99580f2d006b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6363,7 +6363,7 @@ struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__u32 raw_pressure;
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next 2/2] selftests/bpf: test map pressure
  2023-05-31 11:05 [PATCH bpf-next 0/2] add mechanism to report map pressure Anton Protopopov
  2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
@ 2023-05-31 11:05 ` Anton Protopopov
  1 sibling, 0 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-05-31 11:05 UTC (permalink / raw)
  To: bpf; +Cc: Anton Protopopov, Joe Stringer, John Fastabend

Add a new map test, map_pressure.c, which is checking the correctness of
map's raw_pressure values.

For maps for which the pressure equals to the current number of elements (only
such maps support the raw_pressure at the moment) the test upserts a number of
elements, checks the correctness of the pressure value, then deletes all the
elements and checks again that the pressure dropped down to zero.

The following map types are tested:

    * BPF_MAP_TYPE_HASH, BPF_F_NO_PREALLOC
    * BPF_MAP_TYPE_PERCPU_HASH, BPF_F_NO_PREALLOC
    * BPF_MAP_TYPE_HASH,
    * BPF_MAP_TYPE_PERCPU_HASH,
    * BPF_MAP_TYPE_LRU_HASH
    * BPF_MAP_TYPE_LRU_PERCPU_HASH
    * BPF_MAP_TYPE_LPM_TRIE

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
---
 .../selftests/bpf/map_tests/map_pressure.c    | 309 ++++++++++++++++++
 1 file changed, 309 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/map_tests/map_pressure.c

diff --git a/tools/testing/selftests/bpf/map_tests/map_pressure.c b/tools/testing/selftests/bpf/map_tests/map_pressure.c
new file mode 100644
index 000000000000..c34ae31aef4b
--- /dev/null
+++ b/tools/testing/selftests/bpf/map_tests/map_pressure.c
@@ -0,0 +1,309 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+
+#include <errno.h>
+#include <unistd.h>
+#include <pthread.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include <bpf_util.h>
+#include <test_maps.h>
+
+#define MAX_ENTRIES 4096
+#define N_THREADS 17
+
+#define MAX_MAP_KEY_SIZE (sizeof(struct bpf_lpm_trie_key) + 4)
+
+static void map_info(int map_fd, struct bpf_map_info *info)
+{
+	__u32 len = sizeof(*info);
+	int ret;
+
+	memset(info, 0, sizeof(*info));
+
+	ret = bpf_obj_get_info_by_fd(map_fd, info, &len);
+	CHECK(ret < 0, "bpf_obj_get_info_by_fd", "error: %s\n", strerror(errno));
+}
+
+static const char *map_type_to_s(__u32 type)
+{
+	switch (type) {
+	case BPF_MAP_TYPE_HASH:
+		return "HASH";
+	case BPF_MAP_TYPE_PERCPU_HASH:
+		return "PERCPU_HASH";
+	case BPF_MAP_TYPE_LRU_HASH:
+		return "LRU_HASH";
+	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+		return "LRU_PERCPU_HASH";
+	case BPF_MAP_TYPE_LPM_TRIE:
+		return "LPM_TRIE";
+	default:
+		return "<define-me>";
+	}
+}
+
+/* Map index -> map-type-specific-key */
+static void *map_key(__u32 type, __u32 i)
+{
+	static __thread __u8 key[MAX_MAP_KEY_SIZE];
+
+	if (type == BPF_MAP_TYPE_LPM_TRIE) {
+		/* prefixlen = 32, data[0..3] = i */
+		*(__u32 *)key = 32;
+		*(__u32 *)(key+4) = i;
+	} else {
+		*(__u32 *)key = i;
+	}
+	return key;
+}
+
+static __u32 map_count_elements(__u32 type, int map_fd)
+{
+	void *key = map_key(type, -1);
+	int n = 0;
+
+	while (!bpf_map_get_next_key(map_fd, key, key))
+		n++;
+	return n;
+}
+
+static void delete_all_elements(__u32 type, int map_fd)
+{
+	void *key = map_key(type, -1);
+	void *keys;
+	int n = 0;
+	int ret;
+
+	keys = calloc(MAX_MAP_KEY_SIZE, MAX_ENTRIES);
+	CHECK(!keys, "calloc", "error: %s\n", strerror(errno));
+
+	for (; !bpf_map_get_next_key(map_fd, key, key); n++)
+		memcpy(keys + n*MAX_MAP_KEY_SIZE, key, MAX_MAP_KEY_SIZE);
+
+	while (--n >= 0) {
+		ret = bpf_map_delete_elem(map_fd, keys + n*MAX_MAP_KEY_SIZE);
+		CHECK(ret < 0, "bpf_map_delete_elem", "error: %s\n", strerror(errno));
+	}
+}
+
+static bool is_lru(__u32 map_type)
+{
+	return map_type == BPF_MAP_TYPE_LRU_HASH ||
+	       map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
+}
+
+struct upsert_opts {
+	__u32 map_type;
+	int map_fd;
+	__u32 n;
+};
+
+static void *patch_map_thread(void *arg)
+{
+	struct upsert_opts *opts = arg;
+	void *key;
+	int val;
+	int ret;
+	int i;
+
+	for (i = 0; i < opts->n; i++) {
+		key = map_key(opts->map_type, i);
+		val = rand();
+		ret = bpf_map_update_elem(opts->map_fd, key, &val, 0);
+		CHECK(ret < 0, "bpf_map_update_elem", "error: %s\n", strerror(errno));
+	}
+	return NULL;
+}
+
+static void upsert_elements(struct upsert_opts *opts)
+{
+	pthread_t threads[N_THREADS];
+	int ret;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(threads); i++) {
+		ret = pthread_create(&i[threads], NULL, patch_map_thread, opts);
+		CHECK(ret != 0, "pthread_create", "error: %s\n", strerror(ret));
+	}
+
+	for (i = 0; i < ARRAY_SIZE(threads); i++) {
+		ret = pthread_join(i[threads], NULL);
+		CHECK(ret != 0, "pthread_join", "error: %s\n", strerror(ret));
+	}
+}
+
+static void __test_map_pressure(int map_fd)
+{
+	__u32 n = MAX_ENTRIES - 1000, current_elements;
+	struct upsert_opts opts = {
+		.map_fd = map_fd,
+		.n = n,
+	};
+	struct bpf_map_info info;
+
+	map_info(map_fd, &info);
+	opts.map_type = info.type;
+
+	/*
+	 * Upsert keys [0, n) under some competition: with random values from
+	 * N_THREADS threads
+	 */
+	upsert_elements(&opts);
+
+	/*
+	 * Raw pressure for all hashtable-based maps should be equal to the
+	 * number of elements present in the map. For non-lru maps this number
+	 * should be the number n of upserted elements. For lru maps some
+	 * elements might have been evicted. Check that all numbers make sense
+	 */
+	map_info(map_fd, &info);
+	current_elements = map_count_elements(info.type, map_fd);
+	if (!is_lru(info.type))
+		CHECK(n != current_elements, "map_count_elements",
+		      "current_elements(%u) != expected(%u)", current_elements, n);
+	CHECK(info.raw_pressure != current_elements, "map_pressure",
+	      "raw_pressure=%u, expected %u (map_type=%s,map_flags=%08x)\n",
+	      info.raw_pressure, current_elements, map_type_to_s(info.type), info.map_flags);
+
+	/*
+	 * Cleanup the map and check that all elements are actually gone and
+	 * that the map raw_pressure is back to 0 as well
+	 */
+	delete_all_elements(info.type, map_fd);
+	map_info(map_fd, &info);
+	current_elements = map_count_elements(info.type, map_fd);
+	CHECK(current_elements, "map_count_elements",
+	      "expected current_elements=0, got %u", current_elements);
+	CHECK(info.raw_pressure != 0, "map_pressure",
+	      "raw_pressure=%u, expected 0 (map_type=%s,map_flags=%08x)\n",
+	      info.raw_pressure, map_type_to_s(info.type), info.map_flags);
+
+	close(map_fd);
+}
+
+static int map_create_opts(__u32 type, const char *name,
+			   struct bpf_map_create_opts *map_opts,
+			   __u32 key_size, __u32 val_size)
+{
+	int map_fd;
+
+	map_fd = bpf_map_create(type, name, key_size, val_size, MAX_ENTRIES, map_opts);
+	CHECK(map_fd < 0, "bpf_map_create()", "error:%s (name=%s)\n",
+			strerror(errno), name);
+
+	return map_fd;
+}
+
+static int map_create(__u32 type, const char *name, struct bpf_map_create_opts *map_opts)
+{
+	return map_create_opts(type, name, map_opts, sizeof(int), sizeof(int));
+}
+
+static int create_hash(void)
+{
+	struct bpf_map_create_opts map_opts = {
+		.sz = sizeof(map_opts),
+		.map_flags = BPF_F_NO_PREALLOC,
+	};
+
+	return map_create(BPF_MAP_TYPE_HASH, "hash", &map_opts);
+}
+
+static int create_percpu_hash(void)
+{
+	struct bpf_map_create_opts map_opts = {
+		.sz = sizeof(map_opts),
+		.map_flags = BPF_F_NO_PREALLOC,
+	};
+
+	return map_create(BPF_MAP_TYPE_PERCPU_HASH, "percpu_hash", &map_opts);
+}
+
+static int create_hash_prealloc(void)
+{
+	return map_create(BPF_MAP_TYPE_HASH, "hash", NULL);
+}
+
+static int create_percpu_hash_prealloc(void)
+{
+	return map_create(BPF_MAP_TYPE_PERCPU_HASH, "percpu_hash_prealloc", NULL);
+}
+
+static int create_lru_hash(void)
+{
+	return map_create(BPF_MAP_TYPE_LRU_HASH, "lru_hash", NULL);
+}
+
+static int create_percpu_lru_hash(void)
+{
+	return map_create(BPF_MAP_TYPE_LRU_PERCPU_HASH, "lru_hash_percpu", NULL);
+}
+
+static int create_lpm_trie(void)
+{
+	struct bpf_map_create_opts map_opts = {
+		.sz = sizeof(map_opts),
+		.map_flags = BPF_F_NO_PREALLOC,
+	};
+	__u32 key_size = sizeof(struct bpf_lpm_trie_key) + 4;
+	__u32 val_size = sizeof(int);
+
+	return map_create_opts(BPF_MAP_TYPE_LPM_TRIE, "lpm_trie",
+			       &map_opts, key_size, val_size);
+}
+
+static void map_pressure_hash(void)
+{
+	__test_map_pressure(create_hash());
+	printf("test_%s:PASS\n", __func__);
+}
+
+static void map_pressure_percpu_hash(void)
+{
+	__test_map_pressure(create_percpu_hash());
+	printf("test_%s:PASS\n", __func__);
+}
+
+static void map_pressure_hash_prealloc(void)
+{
+	__test_map_pressure(create_hash_prealloc());
+	printf("test_%s:PASS\n", __func__);
+}
+
+static void map_pressure_percpu_hash_prealloc(void)
+{
+	__test_map_pressure(create_percpu_hash_prealloc());
+	printf("test_%s:PASS\n", __func__);
+}
+
+static void map_pressure_lru_hash(void)
+{
+	__test_map_pressure(create_lru_hash());
+	printf("test_%s:PASS\n", __func__);
+}
+
+static void map_pressure_percpu_lru_hash(void)
+{
+	__test_map_pressure(create_percpu_lru_hash());
+	printf("test_%s:PASS\n", __func__);
+}
+
+static void map_pressure_lpm_trie(void)
+{
+	__test_map_pressure(create_lpm_trie());
+	printf("test_%s:PASS\n", __func__);
+}
+
+void test_map_pressure(void)
+{
+	map_pressure_hash();
+	map_pressure_percpu_hash();
+	map_pressure_hash_prealloc();
+	map_pressure_percpu_hash_prealloc();
+	map_pressure_lru_hash();
+	map_pressure_percpu_lru_hash();
+	map_pressure_lpm_trie();
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
@ 2023-05-31 18:24   ` Alexei Starovoitov
  2023-06-01  7:31     ` Anton Protopopov
  2023-06-01  0:44   ` kernel test robot
  1 sibling, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2023-05-31 18:24 UTC (permalink / raw)
  To: Anton Protopopov; +Cc: bpf, Joe Stringer, John Fastabend

On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  {
>  	htab_put_fd_value(htab, l);
>  
> +	dec_elem_count(htab);
> +
>  	if (htab_is_prealloc(htab)) {
>  		check_and_free_fields(htab, l);
>  		__pcpu_freelist_push(&htab->freelist, &l->fnode);
>  	} else {
> -		dec_elem_count(htab);
>  		htab_elem_free(htab, l);
>  	}
>  }
> @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>  			if (!l)
>  				return ERR_PTR(-E2BIG);
>  			l_new = container_of(l, struct htab_elem, fnode);
> +			inc_elem_count(htab);

The current use_percpu_counter heuristic is far from perfect. It works for some cases,
but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
Hence, there is a big performance risk doing inc/dec everywhere.
Hence, this is a nack: we cannot decrease performance of various maps for few folks
who want to see map stats.

If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
No kernel patches needed.
Then bpf_prog_run such tracing prog and read all internal map info.
It's less convenient that exposing things in uapi, but not being uapi is the point.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
  2023-05-31 18:24   ` Alexei Starovoitov
@ 2023-06-01  0:44   ` kernel test robot
  2023-06-01  7:50     ` Anton Protopopov
  1 sibling, 1 reply; 19+ messages in thread
From: kernel test robot @ 2023-06-01  0:44 UTC (permalink / raw)
  To: Anton Protopopov, bpf
  Cc: oe-kbuild-all, Anton Protopopov, Joe Stringer, John Fastabend

Hi Anton,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Anton-Protopopov/bpf-add-new-map-ops-map_pressure/20230531-190704
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230531110511.64612-2-aspsk%40isovalent.com
patch subject: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20230601/202306010837.mGhA199K-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 12.3.0
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/025cc7c86c6c7e108ba5b9946a0f50e0cc082f9b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Anton-Protopopov/bpf-add-new-map-ops-map_pressure/20230531-190704
        git checkout 025cc7c86c6c7e108ba5b9946a0f50e0cc082f9b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=sh olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=sh SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/hashtab.c: In function 'htab_map_pressure':
>> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
     189 |                 return __percpu_counter_sum(&htab->pcount);
         |                        ^~~~~~~~~~~~~~~~~~~~
         |                        percpu_counter_sum
   cc1: some warnings being treated as errors


vim +189 kernel/bpf/hashtab.c

   183	
   184	static u32 htab_map_pressure(const struct bpf_map *map)
   185	{
   186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
   187	
   188		if (htab->use_percpu_counter)
 > 189			return __percpu_counter_sum(&htab->pcount);
   190		return atomic_read(&htab->count);
   191	}
   192	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-05-31 18:24   ` Alexei Starovoitov
@ 2023-06-01  7:31     ` Anton Protopopov
  2023-06-01 16:39       ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Protopopov @ 2023-06-01  7:31 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Joe Stringer, John Fastabend

On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote:
> On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
> >  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> >  {
> >  	htab_put_fd_value(htab, l);
> >  
> > +	dec_elem_count(htab);
> > +
> >  	if (htab_is_prealloc(htab)) {
> >  		check_and_free_fields(htab, l);
> >  		__pcpu_freelist_push(&htab->freelist, &l->fnode);
> >  	} else {
> > -		dec_elem_count(htab);
> >  		htab_elem_free(htab, l);
> >  	}
> >  }
> > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
> >  			if (!l)
> >  				return ERR_PTR(-E2BIG);
> >  			l_new = container_of(l, struct htab_elem, fnode);
> > +			inc_elem_count(htab);
> 
> The current use_percpu_counter heuristic is far from perfect. It works for some cases,
> but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
> Hence, there is a big performance risk doing inc/dec everywhere.
> Hence, this is a nack: we cannot decrease performance of various maps for few folks
> who want to see map stats.

This patch adds some inc/dec only for preallocated hashtabs and doesn't change
code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for
preallocated hashtabs we don't need to compare counters, so a raw (non-batch)
percpu counter may be used for this case.

> If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
> use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
> No kernel patches needed.
> Then bpf_prog_run such tracing prog and read all internal map info.
> It's less convenient that exposing things in uapi, but not being uapi is the point.

Thanks for the pointers, this makes sense. However, this doesn't work for LRU
which is always pre-allocated. Would it be ok if we add non-batch percpu
counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01  0:44   ` kernel test robot
@ 2023-06-01  7:50     ` Anton Protopopov
  2023-06-13  8:23       ` Yujie Liu
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Protopopov @ 2023-06-01  7:50 UTC (permalink / raw)
  To: kernel test robot; +Cc: bpf, oe-kbuild-all, Joe Stringer, John Fastabend

On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote:
> Hi Anton,
> 
> kernel test robot noticed the following build errors:
> 
> [...]
> 
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/

How does this apply to patches? If I send a v2, should I include these tags
there? If this patch gets rejected, is there need to do anything to close the
robot's ticket?

> All errors (new ones prefixed by >>):
> 
>    kernel/bpf/hashtab.c: In function 'htab_map_pressure':
> >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
>      189 |                 return __percpu_counter_sum(&htab->pcount);
>          |                        ^~~~~~~~~~~~~~~~~~~~
>          |                        percpu_counter_sum
>    cc1: some warnings being treated as errors
> 
> 
> vim +189 kernel/bpf/hashtab.c
> 
>    183	
>    184	static u32 htab_map_pressure(const struct bpf_map *map)
>    185	{
>    186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>    187	
>    188		if (htab->use_percpu_counter)
>  > 189			return __percpu_counter_sum(&htab->pcount);
>    190		return atomic_read(&htab->count);
>    191	}
>    192	

(This bug happens for !SMP case.)

> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01  7:31     ` Anton Protopopov
@ 2023-06-01 16:39       ` Alexei Starovoitov
  2023-06-01 18:18         ` Anton Protopopov
  0 siblings, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2023-06-01 16:39 UTC (permalink / raw)
  To: Anton Protopopov; +Cc: bpf, Joe Stringer, John Fastabend

On Thu, Jun 1, 2023 at 12:30 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote:
> > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
> > >  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> > >  {
> > >     htab_put_fd_value(htab, l);
> > >
> > > +   dec_elem_count(htab);
> > > +
> > >     if (htab_is_prealloc(htab)) {
> > >             check_and_free_fields(htab, l);
> > >             __pcpu_freelist_push(&htab->freelist, &l->fnode);
> > >     } else {
> > > -           dec_elem_count(htab);
> > >             htab_elem_free(htab, l);
> > >     }
> > >  }
> > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
> > >                     if (!l)
> > >                             return ERR_PTR(-E2BIG);
> > >                     l_new = container_of(l, struct htab_elem, fnode);
> > > +                   inc_elem_count(htab);
> >
> > The current use_percpu_counter heuristic is far from perfect. It works for some cases,
> > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
> > Hence, there is a big performance risk doing inc/dec everywhere.
> > Hence, this is a nack: we cannot decrease performance of various maps for few folks
> > who want to see map stats.
>
> This patch adds some inc/dec only for preallocated hashtabs and doesn't change
> code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for
> preallocated hashtabs we don't need to compare counters,

exactly. that's why I don't like to add inc/dec that serves no purpose
other than stats.

> so a raw (non-batch)
> percpu counter may be used for this case.

and you can do it inside your own bpf prog.

> > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
> > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
> > No kernel patches needed.
> > Then bpf_prog_run such tracing prog and read all internal map info.
> > It's less convenient that exposing things in uapi, but not being uapi is the point.
>
> Thanks for the pointers, this makes sense. However, this doesn't work for LRU
> which is always pre-allocated. Would it be ok if we add non-batch percpu
> counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace?

LRU logic doesn't kick in until the map is full.
If your LRU map is not full you shouldn't be using LRU in the first place.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01 16:39       ` Alexei Starovoitov
@ 2023-06-01 18:18         ` Anton Protopopov
  2023-06-01 18:24           ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Protopopov @ 2023-06-01 18:18 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Joe Stringer, John Fastabend

On Thu, Jun 01, 2023 at 09:39:43AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 12:30 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On Wed, May 31, 2023 at 11:24:29AM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 31, 2023 at 11:05:10AM +0000, Anton Protopopov wrote:
> > > >  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> > > >  {
> > > >     htab_put_fd_value(htab, l);
> > > >
> > > > +   dec_elem_count(htab);
> > > > +
> > > >     if (htab_is_prealloc(htab)) {
> > > >             check_and_free_fields(htab, l);
> > > >             __pcpu_freelist_push(&htab->freelist, &l->fnode);
> > > >     } else {
> > > > -           dec_elem_count(htab);
> > > >             htab_elem_free(htab, l);
> > > >     }
> > > >  }
> > > > @@ -1006,6 +1024,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
> > > >                     if (!l)
> > > >                             return ERR_PTR(-E2BIG);
> > > >                     l_new = container_of(l, struct htab_elem, fnode);
> > > > +                   inc_elem_count(htab);
> > >
> > > The current use_percpu_counter heuristic is far from perfect. It works for some cases,
> > > but will surely get bad as the comment next to PERCPU_COUNTER_BATCH is trying to say.
> > > Hence, there is a big performance risk doing inc/dec everywhere.
> > > Hence, this is a nack: we cannot decrease performance of various maps for few folks
> > > who want to see map stats.
> >
> > This patch adds some inc/dec only for preallocated hashtabs and doesn't change
> > code for BPF_F_NO_PREALLOC (they already do incs/decs where needed). And for
> > preallocated hashtabs we don't need to compare counters,
> 
> exactly. that's why I don't like to add inc/dec that serves no purpose
> other than stats.
> 
> > so a raw (non-batch)
> > percpu counter may be used for this case.
> 
> and you can do it inside your own bpf prog.
> 
> > > If you want to see "pressure", please switch cilium to use bpf_mem_alloc htab and
> > > use tracing style direct 'struct bpf_htab' access like progs/map_ptr_kern.c is demonstrating.
> > > No kernel patches needed.
> > > Then bpf_prog_run such tracing prog and read all internal map info.
> > > It's less convenient that exposing things in uapi, but not being uapi is the point.
> >
> > Thanks for the pointers, this makes sense. However, this doesn't work for LRU
> > which is always pre-allocated. Would it be ok if we add non-batch percpu
> > counter for !BPF_F_NO_PREALLOC case and won't expose it directly to userspace?
> 
> LRU logic doesn't kick in until the map is full.

In fact, it can: a reproducable example is in the self-test from this patch
series. In the test N threads try to insert random values for keys 1..3000
simultaneously. As the result, the map may contain any number of elements,
typically 100 to 1000 (never full 3000, which is also less than the map size).
So a user can't really even closely estimate the number of elements in the LRU
map based on the number of updates (with unique keys). A per-cpu counter
inc/dec'ed from the kernel side would solve this.

> If your LRU map is not full you shouldn't be using LRU in the first place.

This makes sense, yes, especially that LRU evictions may happen randomly,
without a map being full. I will step back with this patch until we investigate
if we can replace LRUs with hashes.

Thanks for the comments!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01 18:18         ` Anton Protopopov
@ 2023-06-01 18:24           ` Alexei Starovoitov
  2023-06-02  0:40             ` Alexei Starovoitov
  2023-06-28 13:17             ` Anton Protopopov
  0 siblings, 2 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2023-06-01 18:24 UTC (permalink / raw)
  To: Anton Protopopov, Martin KaFai Lau; +Cc: bpf, Joe Stringer, John Fastabend

On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > LRU logic doesn't kick in until the map is full.
>
> In fact, it can: a reproducable example is in the self-test from this patch
> series. In the test N threads try to insert random values for keys 1..3000
> simultaneously. As the result, the map may contain any number of elements,
> typically 100 to 1000 (never full 3000, which is also less than the map size).
> So a user can't really even closely estimate the number of elements in the LRU
> map based on the number of updates (with unique keys). A per-cpu counter
> inc/dec'ed from the kernel side would solve this.

That's odd and unexpected.
Definitely something to investigate and fix in the LRU map.

Pls cc Martin in the future.

> > If your LRU map is not full you shouldn't be using LRU in the first place.
>
> This makes sense, yes, especially that LRU evictions may happen randomly,
> without a map being full. I will step back with this patch until we investigate
> if we can replace LRUs with hashes.
>
> Thanks for the comments!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01 18:24           ` Alexei Starovoitov
@ 2023-06-02  0:40             ` Alexei Starovoitov
  2023-06-02 14:21               ` Anton Protopopov
  2023-06-28 13:17             ` Anton Protopopov
  1 sibling, 1 reply; 19+ messages in thread
From: Alexei Starovoitov @ 2023-06-02  0:40 UTC (permalink / raw)
  To: Anton Protopopov, Martin KaFai Lau; +Cc: bpf, Joe Stringer, John Fastabend

On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducable example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As the result, the map may contain any number of elements,
> > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > So a user can't really even closely estimate the number of elements in the LRU
> > map based on the number of updates (with unique keys). A per-cpu counter
> > inc/dec'ed from the kernel side would solve this.
>
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
>
> Pls cc Martin in the future.
>
> > > If your LRU map is not full you shouldn't be using LRU in the first place.
> >
> > This makes sense, yes, especially that LRU evictions may happen randomly,
> > without a map being full. I will step back with this patch until we investigate
> > if we can replace LRUs with hashes.
> >
> > Thanks for the comments!

Thinking about it more...
since you're proposing to use percpu counter unconditionally for prealloc
and percpu_counter_add_batch() logic is batched,
it could actually be acceptable if it's paired with non-api access.
Like another patch can add generic kfunc to do __percpu_counter_sum()
and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
for maps can be extended to print the element count, so the user can have
convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.

But additional logic of percpu_counter_add_batch() might get in the way
of debugging eventually.
If we want to have stats then we can have normal per-cpu u32 in basic
struct bpf_map that most maps, except array, will inc/dec on update/delete.
kfunc to iterate over percpu is still necessary.
This way we will be able to see not only number of elements, but detect
bad usage when one cpu is only adding and another cpu is deleting elements.
And other cpu misbalance.

but debugging and stats is a slippery slope. These simple stats won't be
enough and people will be tempted to add more and more.
So I agree that there is a need for bpf map observability,
but it is not clear whether hard coded stats is the solution.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-02  0:40             ` Alexei Starovoitov
@ 2023-06-02 14:21               ` Anton Protopopov
  2023-06-02 16:23                 ` Alexei Starovoitov
  0 siblings, 1 reply; 19+ messages in thread
From: Anton Protopopov @ 2023-06-02 14:21 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Martin KaFai Lau, bpf, Joe Stringer, John Fastabend

On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > >
> > > > LRU logic doesn't kick in until the map is full.
> > >
> > > In fact, it can: a reproducable example is in the self-test from this patch
> > > series. In the test N threads try to insert random values for keys 1..3000
> > > simultaneously. As the result, the map may contain any number of elements,
> > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > So a user can't really even closely estimate the number of elements in the LRU
> > > map based on the number of updates (with unique keys). A per-cpu counter
> > > inc/dec'ed from the kernel side would solve this.
> >
> > That's odd and unexpected.
> > Definitely something to investigate and fix in the LRU map.
> >
> > Pls cc Martin in the future.
> >
> > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > >
> > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > without a map being full. I will step back with this patch until we investigate
> > > if we can replace LRUs with hashes.
> > >
> > > Thanks for the comments!
> 
> Thinking about it more...
> since you're proposing to use percpu counter unconditionally for prealloc
> and percpu_counter_add_batch() logic is batched,
> it could actually be acceptable if it's paired with non-api access.
> Like another patch can add generic kfunc to do __percpu_counter_sum()
> and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> for maps can be extended to print the element count, so the user can have
> convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> 
> But additional logic of percpu_counter_add_batch() might get in the way
> of debugging eventually.
> If we want to have stats then we can have normal per-cpu u32 in basic
> struct bpf_map that most maps, except array, will inc/dec on update/delete.
> kfunc to iterate over percpu is still necessary.
> This way we will be able to see not only number of elements, but detect
> bad usage when one cpu is only adding and another cpu is deleting elements.
> And other cpu misbalance.

This looks for me like two different things: one is a kfunc to get the current
counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
detailed stats (e.g., per-cpu values or more).

My patch, slightly modified, addresses the first goal: most maps of interest
already have a counter in some form (sometimes just atomic_t or u64+lock). If
we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
the new kfunc can get the counter based on the map type.

If/when there's need to provide per-cpu statistics of elements or some more
sophisticated statistics, this can be done without changing the api of the
bpf_map_elements_count() kfunc.

Would this work?

> but debugging and stats is a slippery slope. These simple stats won't be
> enough and people will be tempted to add more and more.
> So I agree that there is a need for bpf map observability,
> but it is not clear whether hard coded stats is the solution.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-02 14:21               ` Anton Protopopov
@ 2023-06-02 16:23                 ` Alexei Starovoitov
  2023-06-06  7:49                   ` Anton Protopopov
  2023-06-24  0:00                   ` John Fastabend
  0 siblings, 2 replies; 19+ messages in thread
From: Alexei Starovoitov @ 2023-06-02 16:23 UTC (permalink / raw)
  To: Anton Protopopov; +Cc: Martin KaFai Lau, bpf, Joe Stringer, John Fastabend

On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
>
> On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > >
> > > > > LRU logic doesn't kick in until the map is full.
> > > >
> > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > simultaneously. As the result, the map may contain any number of elements,
> > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > inc/dec'ed from the kernel side would solve this.
> > >
> > > That's odd and unexpected.
> > > Definitely something to investigate and fix in the LRU map.
> > >
> > > Pls cc Martin in the future.
> > >
> > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > >
> > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > without a map being full. I will step back with this patch until we investigate
> > > > if we can replace LRUs with hashes.
> > > >
> > > > Thanks for the comments!
> >
> > Thinking about it more...
> > since you're proposing to use percpu counter unconditionally for prealloc
> > and percpu_counter_add_batch() logic is batched,
> > it could actually be acceptable if it's paired with non-api access.
> > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > for maps can be extended to print the element count, so the user can have
> > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> >
> > But additional logic of percpu_counter_add_batch() might get in the way
> > of debugging eventually.
> > If we want to have stats then we can have normal per-cpu u32 in basic
> > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > kfunc to iterate over percpu is still necessary.
> > This way we will be able to see not only number of elements, but detect
> > bad usage when one cpu is only adding and another cpu is deleting elements.
> > And other cpu misbalance.
>
> This looks for me like two different things: one is a kfunc to get the current
> counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> detailed stats (e.g., per-cpu values or more).
>
> My patch, slightly modified, addresses the first goal: most maps of interest
> already have a counter in some form (sometimes just atomic_t or u64+lock). If
> we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> the new kfunc can get the counter based on the map type.
>
> If/when there's need to provide per-cpu statistics of elements or some more
> sophisticated statistics, this can be done without changing the api of the
> bpf_map_elements_count() kfunc.
>
> Would this work?

No, because bpf_map_elements_count() as a building block is too big
and too specific. Nothing else can be made out of it, but counting
elements.
"for_each_cpu in per-cpu variable" would be generic that is usable beyond
this particular use case of stats collection.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-02 16:23                 ` Alexei Starovoitov
@ 2023-06-06  7:49                   ` Anton Protopopov
  2023-06-24  0:00                   ` John Fastabend
  1 sibling, 0 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-06-06  7:49 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Martin KaFai Lau, bpf, Joe Stringer, John Fastabend

On Fri, Jun 02, 2023 at 09:23:11AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > >
> > > > > > LRU logic doesn't kick in until the map is full.
> > > > >
> > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > inc/dec'ed from the kernel side would solve this.
> > > >
> > > > That's odd and unexpected.
> > > > Definitely something to investigate and fix in the LRU map.
> > > >
> > > > Pls cc Martin in the future.
> > > >
> > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > >
> > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > without a map being full. I will step back with this patch until we investigate
> > > > > if we can replace LRUs with hashes.
> > > > >
> > > > > Thanks for the comments!
> > >
> > > Thinking about it more...
> > > since you're proposing to use percpu counter unconditionally for prealloc
> > > and percpu_counter_add_batch() logic is batched,
> > > it could actually be acceptable if it's paired with non-api access.
> > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > for maps can be extended to print the element count, so the user can have
> > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > >
> > > But additional logic of percpu_counter_add_batch() might get in the way
> > > of debugging eventually.
> > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > kfunc to iterate over percpu is still necessary.
> > > This way we will be able to see not only number of elements, but detect
> > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > And other cpu misbalance.
> >
> > This looks for me like two different things: one is a kfunc to get the current
> > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > detailed stats (e.g., per-cpu values or more).
> >
> > My patch, slightly modified, addresses the first goal: most maps of interest
> > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > the new kfunc can get the counter based on the map type.
> >
> > If/when there's need to provide per-cpu statistics of elements or some more
> > sophisticated statistics, this can be done without changing the api of the
> > bpf_map_elements_count() kfunc.
> >
> > Would this work?
> 
> No, because bpf_map_elements_count() as a building block is too big
> and too specific. Nothing else can be made out of it, but counting
> elements.
> "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> this particular use case of stats collection.

Thanks. I will prepare a v2 with a "no-uapi percpu" version.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01  7:50     ` Anton Protopopov
@ 2023-06-13  8:23       ` Yujie Liu
  2023-06-13  8:30         ` Anton Protopopov
  0 siblings, 1 reply; 19+ messages in thread
From: Yujie Liu @ 2023-06-13  8:23 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: kernel test robot, bpf, oe-kbuild-all, Joe Stringer, John Fastabend

Hi Anton,

Sorry for the late reply.

On Thu, Jun 01, 2023 at 07:50:00AM +0000, Anton Protopopov wrote:
> On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote:
> > Hi Anton,
> > 
> > kernel test robot noticed the following build errors:
> > 
> > [...]
> > 
> > If you fix the issue, kindly add following tag where applicable
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/
> 
> How does this apply to patches? If I send a v2, should I include these tags
> there?

If a v2 is sent, these tags should not be included.

> If this patch gets rejected, is there need to do anything to close the
> robot's ticket?

No need to close this ticket.

Thanks for raising above concerns. We have updated the wording in our
reports as below to avoid misinterpretation:

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: ...
| Closes: ...

--
Best Regards,
Yujie

> > All errors (new ones prefixed by >>):
> > 
> >    kernel/bpf/hashtab.c: In function 'htab_map_pressure':
> > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
> >      189 |                 return __percpu_counter_sum(&htab->pcount);
> >          |                        ^~~~~~~~~~~~~~~~~~~~
> >          |                        percpu_counter_sum
> >    cc1: some warnings being treated as errors
> > 
> > 
> > vim +189 kernel/bpf/hashtab.c
> > 
> >    183	
> >    184	static u32 htab_map_pressure(const struct bpf_map *map)
> >    185	{
> >    186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> >    187	
> >    188		if (htab->use_percpu_counter)
> >  > 189			return __percpu_counter_sum(&htab->pcount);
> >    190		return atomic_read(&htab->count);
> >    191	}
> >    192	
> 
> (This bug happens for !SMP case.)
> 
> > -- 
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-13  8:23       ` Yujie Liu
@ 2023-06-13  8:30         ` Anton Protopopov
  0 siblings, 0 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-06-13  8:30 UTC (permalink / raw)
  To: Yujie Liu
  Cc: kernel test robot, bpf, oe-kbuild-all, Joe Stringer, John Fastabend

On Tue, Jun 13, 2023 at 04:23:29PM +0800, Yujie Liu wrote:
> Hi Anton,
> 
> Sorry for the late reply.
> 
> On Thu, Jun 01, 2023 at 07:50:00AM +0000, Anton Protopopov wrote:
> > On Thu, Jun 01, 2023 at 08:44:24AM +0800, kernel test robot wrote:
> > > Hi Anton,
> > > 
> > > kernel test robot noticed the following build errors:
> > > 
> > > [...]
> > > 
> > > If you fix the issue, kindly add following tag where applicable
> > > | Reported-by: kernel test robot <lkp@intel.com>
> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202306010837.mGhA199K-lkp@intel.com/
> > 
> > How does this apply to patches? If I send a v2, should I include these tags
> > there?
> 
> If a v2 is sent, these tags should not be included.
> 
> > If this patch gets rejected, is there need to do anything to close the
> > robot's ticket?
> 
> No need to close this ticket.
> 
> Thanks for raising above concerns. We have updated the wording in our
> reports as below to avoid misinterpretation:
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: ...
> | Closes: ...

Great, thanks for the explanations!

> --
> Best Regards,
> Yujie
> 
> > > All errors (new ones prefixed by >>):
> > > 
> > >    kernel/bpf/hashtab.c: In function 'htab_map_pressure':
> > > >> kernel/bpf/hashtab.c:189:24: error: implicit declaration of function '__percpu_counter_sum'; did you mean 'percpu_counter_sum'? [-Werror=implicit-function-declaration]
> > >      189 |                 return __percpu_counter_sum(&htab->pcount);
> > >          |                        ^~~~~~~~~~~~~~~~~~~~
> > >          |                        percpu_counter_sum
> > >    cc1: some warnings being treated as errors
> > > 
> > > 
> > > vim +189 kernel/bpf/hashtab.c
> > > 
> > >    183	
> > >    184	static u32 htab_map_pressure(const struct bpf_map *map)
> > >    185	{
> > >    186		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > >    187	
> > >    188		if (htab->use_percpu_counter)
> > >  > 189			return __percpu_counter_sum(&htab->pcount);
> > >    190		return atomic_read(&htab->count);
> > >    191	}
> > >    192	
> > 
> > (This bug happens for !SMP case.)
> > 
> > > -- 
> > > 0-DAY CI Kernel Test Service
> > > https://github.com/intel/lkp-tests/wiki
> > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-02 16:23                 ` Alexei Starovoitov
  2023-06-06  7:49                   ` Anton Protopopov
@ 2023-06-24  0:00                   ` John Fastabend
  2023-06-26 16:29                     ` Anton Protopopov
  1 sibling, 1 reply; 19+ messages in thread
From: John Fastabend @ 2023-06-24  0:00 UTC (permalink / raw)
  To: Alexei Starovoitov, Anton Protopopov
  Cc: Martin KaFai Lau, bpf, Joe Stringer, John Fastabend

Alexei Starovoitov wrote:
> On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> >
> > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > >
> > > > > > LRU logic doesn't kick in until the map is full.
> > > > >
> > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > inc/dec'ed from the kernel side would solve this.
> > > >
> > > > That's odd and unexpected.
> > > > Definitely something to investigate and fix in the LRU map.
> > > >
> > > > Pls cc Martin in the future.
> > > >
> > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > >
> > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > without a map being full. I will step back with this patch until we investigate
> > > > > if we can replace LRUs with hashes.
> > > > >
> > > > > Thanks for the comments!
> > >
> > > Thinking about it more...
> > > since you're proposing to use percpu counter unconditionally for prealloc
> > > and percpu_counter_add_batch() logic is batched,
> > > it could actually be acceptable if it's paired with non-api access.
> > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > for maps can be extended to print the element count, so the user can have
> > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > >
> > > But additional logic of percpu_counter_add_batch() might get in the way
> > > of debugging eventually.
> > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > kfunc to iterate over percpu is still necessary.
> > > This way we will be able to see not only number of elements, but detect
> > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > And other cpu misbalance.
> >
> > This looks for me like two different things: one is a kfunc to get the current
> > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > detailed stats (e.g., per-cpu values or more).
> >
> > My patch, slightly modified, addresses the first goal: most maps of interest
> > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > the new kfunc can get the counter based on the map type.
> >
> > If/when there's need to provide per-cpu statistics of elements or some more
> > sophisticated statistics, this can be done without changing the api of the
> > bpf_map_elements_count() kfunc.
> >
> > Would this work?
> 
> No, because bpf_map_elements_count() as a building block is too big
> and too specific. Nothing else can be made out of it, but counting
> elements.
> "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> this particular use case of stats collection.

Without much thought, could you hook the eviction logic in LRU to know
when the evict happens and even more details about what was evicted so
we could debug the random case where we evict something in a conntrack
table and then later it comes back to life and sends some data like a
long living UDP session.

For example in the cases where you build an LRU map because in 99%
cases no evictions happen and the LRU is just there as a backstop
you might even generate events to userspace to let it know evicts
are in progress and they should do something about it.

Thanks,
John

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-24  0:00                   ` John Fastabend
@ 2023-06-26 16:29                     ` Anton Protopopov
  0 siblings, 0 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-06-26 16:29 UTC (permalink / raw)
  To: John Fastabend; +Cc: Alexei Starovoitov, Martin KaFai Lau, bpf, Joe Stringer

On Fri, Jun 23, 2023 at 05:00:15PM -0700, John Fastabend wrote:
> Alexei Starovoitov wrote:
> > On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > > >
> > > > > > > LRU logic doesn't kick in until the map is full.
> > > > > >
> > > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > > inc/dec'ed from the kernel side would solve this.
> > > > >
> > > > > That's odd and unexpected.
> > > > > Definitely something to investigate and fix in the LRU map.
> > > > >
> > > > > Pls cc Martin in the future.
> > > > >
> > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > > >
> > > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > > without a map being full. I will step back with this patch until we investigate
> > > > > > if we can replace LRUs with hashes.
> > > > > >
> > > > > > Thanks for the comments!
> > > >
> > > > Thinking about it more...
> > > > since you're proposing to use percpu counter unconditionally for prealloc
> > > > and percpu_counter_add_batch() logic is batched,
> > > > it could actually be acceptable if it's paired with non-api access.
> > > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > > for maps can be extended to print the element count, so the user can have
> > > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > > >
> > > > But additional logic of percpu_counter_add_batch() might get in the way
> > > > of debugging eventually.
> > > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > > kfunc to iterate over percpu is still necessary.
> > > > This way we will be able to see not only number of elements, but detect
> > > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > > And other cpu misbalance.
> > >
> > > This looks for me like two different things: one is a kfunc to get the current
> > > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > > detailed stats (e.g., per-cpu values or more).
> > >
> > > My patch, slightly modified, addresses the first goal: most maps of interest
> > > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > > the new kfunc can get the counter based on the map type.
> > >
> > > If/when there's need to provide per-cpu statistics of elements or some more
> > > sophisticated statistics, this can be done without changing the api of the
> > > bpf_map_elements_count() kfunc.
> > >
> > > Would this work?
> > 
> > No, because bpf_map_elements_count() as a building block is too big
> > and too specific. Nothing else can be made out of it, but counting
> > elements.
> > "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> > this particular use case of stats collection.
> 
> Without much thought, could you hook the eviction logic in LRU to know
> when the evict happens and even more details about what was evicted so
> we could debug the random case where we evict something in a conntrack
> table and then later it comes back to life and sends some data like a
> long living UDP session.
> 
> For example in the cases where you build an LRU map because in 99%
> cases no evictions happen and the LRU is just there as a backstop
> you might even generate events to userspace to let it know evicts
> are in progress and they should do something about it.

Yes, one can trace evictions, as the destructor function for lru_list is
noinline. An example here: https://github.com/aspsk/bcc/tree/aspsk/lrusnoop

One problem with evictions and LRU is that evictions can [currently] happen at
random times even if map is nearly emptee, see a new map test from this series
and also https://lore.kernel.org/bpf/ZHjhBFLLnUcSy9Tt@zh-lab-node-5/ So for LRU
we really need to invest more time. For hashtabs and other maps the
bpf_map_sum_elem_count is in any case quite useful.

> Thanks,
> John

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
  2023-06-01 18:24           ` Alexei Starovoitov
  2023-06-02  0:40             ` Alexei Starovoitov
@ 2023-06-28 13:17             ` Anton Protopopov
  1 sibling, 0 replies; 19+ messages in thread
From: Anton Protopopov @ 2023-06-28 13:17 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Martin KaFai Lau, bpf, Joe Stringer, John Fastabend

Hi Alexey, hi Martin,

On Thu, Jun 01, 2023 at 11:24:20AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducable example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As the result, the map may contain any number of elements,
> > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > So a user can't really even closely estimate the number of elements in the LRU
> > map based on the number of updates (with unique keys). A per-cpu counter
> > inc/dec'ed from the kernel side would solve this.
> 
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
> 
> Pls cc Martin in the future.

I've looked into this a bit more and the problem is as follows.

LRU maps allocate MAX_ENTRIES elements and put them in the global free list.
Then each CPU will try to get memory in 128 elements chunks into their own
local free lists.

The user expectation is that evictions start when the map is full, however, on
practice we start evicting elements when capacity reaches about (MAX_ENTRIES -
NCPUS*128) elements. This happens because when one CPU have used its local
free-list, it gets to the global lists. While there could be another
(NCPUS-1)*128 free elements in local free lists of other CPUs, our CPU goes to
the global free list, which is empty, and then starts to evict elements from
active/inactive lists (a 128 elements chunk). Then this can happen for another
active CPU, etc.

This looks like not a problem for big maps, where (NCPUS*128) is not a big %%
of the total map capacity. For smaller maps this may be unexpected (I first
noticed this on a 4K map where after updating 4K keys map capacity was about
200-300 elements).

My first attempt to fix this was to just increase nr_entries allocated for the
map by NCPUS*128, which makes evictions to start happening at MAX_ENTRIES.  But
soon I've realized that in such way users can get more than MAX_ENTRIES inside
a map, which is unexpected as well (say, when dumping a map in a location of
MAX_ENTRIES size or syncing entries with another map of MAX_ENTRIES capacity).

I also briefly looked into allowing to call the prealloc_lru_pop() function
under a bucket lock (by passing the currently locked bucket to it so that this
pointer is passed all the way to the htab_lru_map_delete_node() function which
may then bypass locking the bucket if it is the same one). Looks like this
works, but I didn't have time to understand if this breaks the LRU architecture
badly or not.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-06-28 13:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-31 11:05 [PATCH bpf-next 0/2] add mechanism to report map pressure Anton Protopopov
2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
2023-05-31 18:24   ` Alexei Starovoitov
2023-06-01  7:31     ` Anton Protopopov
2023-06-01 16:39       ` Alexei Starovoitov
2023-06-01 18:18         ` Anton Protopopov
2023-06-01 18:24           ` Alexei Starovoitov
2023-06-02  0:40             ` Alexei Starovoitov
2023-06-02 14:21               ` Anton Protopopov
2023-06-02 16:23                 ` Alexei Starovoitov
2023-06-06  7:49                   ` Anton Protopopov
2023-06-24  0:00                   ` John Fastabend
2023-06-26 16:29                     ` Anton Protopopov
2023-06-28 13:17             ` Anton Protopopov
2023-06-01  0:44   ` kernel test robot
2023-06-01  7:50     ` Anton Protopopov
2023-06-13  8:23       ` Yujie Liu
2023-06-13  8:30         ` Anton Protopopov
2023-05-31 11:05 ` [PATCH bpf-next 2/2] selftests/bpf: test map pressure Anton Protopopov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).