* [Patch bpf-next 0/3] bpf: introduce timeout map
@ 2020-12-11  0:06 Cong Wang
  2020-12-11  0:06 ` [Patch bpf-next 1/3] bpf: use index instead of hash for map_locked[] Cong Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Cong Wang @ 2020-12-11  0:06 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang

From: Cong Wang <cong.wang@bytedance.com>

This patchset introduces a new bpf hash map whose elements carry a
user-specified timeout. Patch 1 is a preparation, patch 2 is the
implementation of the timeout map, and patch 3 contains a test case
for the timeout map. Please check each patch description for more
details.

---
Cong Wang (3):
  bpf: use index instead of hash for map_locked[]
  bpf: introduce timeout map
  tools: add a test case for bpf timeout map

 include/linux/bpf_types.h               |   1 +
 include/uapi/linux/bpf.h                |   3 +-
 kernel/bpf/hashtab.c                    | 296 +++++++++++++++++++++---
 kernel/bpf/syscall.c                    |   3 +-
 tools/include/uapi/linux/bpf.h          |   1 +
 tools/testing/selftests/bpf/test_maps.c |  41 ++++
 6 files changed, 314 insertions(+), 31 deletions(-)

-- 
2.25.1



* [Patch bpf-next 1/3] bpf: use index instead of hash for map_locked[]
  2020-12-11  0:06 [Patch bpf-next 0/3] bpf: introduce timeout map Cong Wang
@ 2020-12-11  0:06 ` Cong Wang
  2020-12-11  0:06 ` [Patch bpf-next 2/3] bpf: introduce timeout map Cong Wang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2020-12-11  0:06 UTC (permalink / raw)
  To: netdev
  Cc: bpf, Cong Wang, Song Liu, Alexei Starovoitov, Daniel Borkmann,
	Dongdong Wang

From: Cong Wang <cong.wang@bytedance.com>

Commit 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
introduced a percpu counter, map_locked, to bail out in the NMI case.
It uses each bucket's hash for indexing, which requires callers of
htab_lock_bucket()/htab_unlock_bucket() to pass the hash in. But the
hash value is not always available, especially when we traverse the
whole hash table and have no keys to compute a hash from. We can
instead compute each bucket's index from its address and use that
index.

This is a prerequisite for the following timeout map patch.
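
As an illustration, the whole change boils down to the following index
computation (a minimal sketch; bucket_lock_index() is a hypothetical
helper, the patch below open-codes it in htab_lock_bucket() and
htab_unlock_bucket()):

	/* Pointer subtraction on struct bucket * yields the bucket's
	 * position in htab->buckets[], which is then folded into the
	 * range of the HASHTAB_MAP_LOCK_COUNT percpu counters.
	 */
	static inline unsigned int bucket_lock_index(const struct bpf_htab *htab,
						     const struct bucket *b)
	{
		unsigned int index = b - htab->buckets;

		return index & HASHTAB_MAP_LOCK_MASK;
	}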

Cc: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 kernel/bpf/hashtab.c | 57 +++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 7e848200cd26..f0b7b54fa3a8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -156,16 +156,17 @@ static void htab_init_buckets(struct bpf_htab *htab)
 }
 
 static inline int htab_lock_bucket(const struct bpf_htab *htab,
-				   struct bucket *b, u32 hash,
-				   unsigned long *pflags)
+				   struct bucket *b, unsigned long *pflags)
 {
 	unsigned long flags;
+	unsigned int index;
 
-	hash = hash & HASHTAB_MAP_LOCK_MASK;
+	index = b - htab->buckets;
+	index &= HASHTAB_MAP_LOCK_MASK;
 
 	migrate_disable();
-	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
-		__this_cpu_dec(*(htab->map_locked[hash]));
+	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[index])) != 1)) {
+		__this_cpu_dec(*(htab->map_locked[index]));
 		migrate_enable();
 		return -EBUSY;
 	}
@@ -180,15 +181,17 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 }
 
 static inline void htab_unlock_bucket(const struct bpf_htab *htab,
-				      struct bucket *b, u32 hash,
-				      unsigned long flags)
+				      struct bucket *b, unsigned long flags)
 {
-	hash = hash & HASHTAB_MAP_LOCK_MASK;
+	unsigned int index;
+
+	index = b - htab->buckets;
+	index &= HASHTAB_MAP_LOCK_MASK;
 	if (htab_use_raw_lock(htab))
 		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
 	else
 		spin_unlock_irqrestore(&b->lock, flags);
-	__this_cpu_dec(*(htab->map_locked[hash]));
+	__this_cpu_dec(*(htab->map_locked[index]));
 	migrate_enable();
 }
 
@@ -710,7 +713,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	b = __select_bucket(htab, tgt_l->hash);
 	head = &b->head;
 
-	ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return false;
 
@@ -720,7 +723,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 			break;
 		}
 
-	htab_unlock_bucket(htab, b, tgt_l->hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 
 	return l == tgt_l;
 }
@@ -1019,7 +1022,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		 */
 	}
 
-	ret = htab_lock_bucket(htab, b, hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return ret;
 
@@ -1062,7 +1065,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	}
 	ret = 0;
 err:
-	htab_unlock_bucket(htab, b, hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1100,7 +1103,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -ENOMEM;
 	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-	ret = htab_lock_bucket(htab, b, hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return ret;
 
@@ -1121,7 +1124,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
 	ret = 0;
 
 err:
-	htab_unlock_bucket(htab, b, hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 
 	if (ret)
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
@@ -1156,7 +1159,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	ret = htab_lock_bucket(htab, b, hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return ret;
 
@@ -1181,7 +1184,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 	}
 	ret = 0;
 err:
-	htab_unlock_bucket(htab, b, hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1221,7 +1224,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 			return -ENOMEM;
 	}
 
-	ret = htab_lock_bucket(htab, b, hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return ret;
 
@@ -1245,7 +1248,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	}
 	ret = 0;
 err:
-	htab_unlock_bucket(htab, b, hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 	if (l_new)
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
 	return ret;
@@ -1283,7 +1286,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	ret = htab_lock_bucket(htab, b, hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return ret;
 
@@ -1296,7 +1299,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 		ret = -ENOENT;
 	}
 
-	htab_unlock_bucket(htab, b, hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1318,7 +1321,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	ret = htab_lock_bucket(htab, b, hash, &flags);
+	ret = htab_lock_bucket(htab, b, &flags);
 	if (ret)
 		return ret;
 
@@ -1329,7 +1332,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 	else
 		ret = -ENOENT;
 
-	htab_unlock_bucket(htab, b, hash, flags);
+	htab_unlock_bucket(htab, b, flags);
 	if (l)
 		bpf_lru_push_free(&htab->lru, &l->lru_node);
 	return ret;
@@ -1480,7 +1483,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	head = &b->head;
 	/* do not grab the lock unless need it (bucket_cnt > 0). */
 	if (locked) {
-		ret = htab_lock_bucket(htab, b, batch, &flags);
+		ret = htab_lock_bucket(htab, b, &flags);
 		if (ret)
 			goto next_batch;
 	}
@@ -1500,7 +1503,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		/* Note that since bucket_cnt > 0 here, it is implicit
 		 * that the locked was grabbed, so release it.
 		 */
-		htab_unlock_bucket(htab, b, batch, flags);
+		htab_unlock_bucket(htab, b, flags);
 		rcu_read_unlock();
 		bpf_enable_instrumentation();
 		goto after_loop;
@@ -1511,7 +1514,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		/* Note that since bucket_cnt > 0 here, it is implicit
 		 * that the locked was grabbed, so release it.
 		 */
-		htab_unlock_bucket(htab, b, batch, flags);
+		htab_unlock_bucket(htab, b, flags);
 		rcu_read_unlock();
 		bpf_enable_instrumentation();
 		kvfree(keys);
@@ -1564,7 +1567,7 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		dst_val += value_size;
 	}
 
-	htab_unlock_bucket(htab, b, batch, flags);
+	htab_unlock_bucket(htab, b, flags);
 	locked = false;
 
 	while (node_to_free) {
-- 
2.25.1



* [Patch bpf-next 2/3] bpf: introduce timeout map
  2020-12-11  0:06 [Patch bpf-next 0/3] bpf: introduce timeout map Cong Wang
  2020-12-11  0:06 ` [Patch bpf-next 1/3] bpf: use index instead of hash for map_locked[] Cong Wang
@ 2020-12-11  0:06 ` Cong Wang
  2020-12-11  0:06 ` [Patch bpf-next 3/3] tools: add a test case for bpf " Cong Wang
  2020-12-11 19:55 ` [Patch bpf-next 0/3] bpf: introduce " Andrii Nakryiko
  3 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2020-12-11  0:06 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang, Alexei Starovoitov, Daniel Borkmann, Dongdong Wang

From: Cong Wang <cong.wang@bytedance.com>

This borrows the idea from conntrack and will be used by conntrack in
bpf too. Each element in a timeout map has a user-specified timeout
in seconds; after it expires, the element is automatically removed
from the map.

There are two cases here:

1. When the timeout map is idle, that is, no one updates or accesses it,
   we rely on the idle work to scan the whole hash table and remove the
   expired elements. The idle work is scheduled every second.

2. When the timeout map is actively accessed, we may reach expired
   elements before the idle work kicks in; we simply skip them and
   schedule another work item to do the actual removal, so no locks are
   taken on the fast path (see the sketch below).
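
A sketch of the fast-path check (this mirrors the lookup path in the
patch below, where htab_sched_gc() queues the bucket on a lockless
llist and kicks the GC work item):

	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
	if (l && time_after_eq64(get_jiffies_64(), l->expires)) {
		htab_sched_gc(htab, b);	/* defer removal to the GC work */
		l = NULL;		/* treat the expired element as missing */
	}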

The timeout of each element can be set or updated via bpf_map_update_elem();
we reuse the upper 32 bits of the 64-bit flags argument for the timeout
value.
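
From user space, an update with, say, a 5-second timeout would then look
roughly like this (a sketch only; map_fd, key and value are assumed to be
set up elsewhere):

	/* Upper 32 bits: timeout in seconds; lower 32 bits: the usual
	 * BPF_ANY/BPF_NOEXIST/BPF_EXIST update semantics.
	 */
	__u64 flags = BPF_ANY | ((__u64)5 << 32);

	if (bpf_map_update_elem(map_fd, &key, &value, flags))
		/* handle error */;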

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   3 +-
 kernel/bpf/hashtab.c           | 239 ++++++++++++++++++++++++++++++++-
 kernel/bpf/syscall.c           |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 243 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..901c4e01d726 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 30b477a26482..dedb47bc3f52 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -158,6 +158,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -393,7 +394,7 @@ enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL		1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
 	BPF_NOEXIST	= 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index f0b7b54fa3a8..1f2bad82d52b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/llist.h>
+#include <linux/workqueue.h>
 #include <uapi/linux/btf.h>
 #include <linux/rcupdate_trace.h>
 #include "percpu_freelist.h"
@@ -84,6 +86,8 @@ struct bucket {
 		raw_spinlock_t raw_lock;
 		spinlock_t     lock;
 	};
+	struct llist_node gc_node;
+	atomic_t pending;
 };
 
 #define HASHTAB_MAP_LOCK_COUNT 8
@@ -104,6 +108,9 @@ struct bpf_htab {
 	u32 hashrnd;
 	struct lock_class_key lockdep_key;
 	int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+	struct llist_head gc_list;
+	struct work_struct gc_work;
+	struct delayed_work gc_idle_work;
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -124,6 +131,7 @@ struct htab_elem {
 		struct bpf_lru_node lru_node;
 	};
 	u32 hash;
+	u64 expires;
 	char key[] __aligned(8);
 };
 
@@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+		atomic_set(&htab->buckets[i].pending, 0);
 		if (htab_use_raw_lock(htab)) {
 			raw_spin_lock_init(&htab->buckets[i].raw_lock);
 			lockdep_set_class(&htab->buckets[i].raw_lock,
@@ -431,12 +440,24 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	return 0;
 }
 
+static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
+{
+	if (atomic_fetch_or(1, &b->pending))
+		return;
+	llist_add(&b->gc_node, &htab->gc_list);
+	queue_work(system_unbound_wq, &htab->gc_work);
+}
+
+static void htab_gc(struct work_struct *work);
+static void htab_gc_idle(struct work_struct *work);
+
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
 	bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		       attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
 	bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
 		    attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
+	bool timeout = attr->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	/* percpu_lru means each cpu has its own LRU list.
 	 * it is different from BPF_MAP_TYPE_PERCPU_HASH where
 	 * the map's value itself is percpu.  percpu_lru has
@@ -521,6 +542,14 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		}
 	}
 
+	if (timeout) {
+		init_llist_head(&htab->gc_list);
+		INIT_WORK(&htab->gc_work, htab_gc);
+		INIT_DEFERRABLE_WORK(&htab->gc_idle_work, htab_gc_idle);
+		queue_delayed_work(system_power_efficient_wq, &htab->gc_idle_work,
+				   HZ);
+	}
+
 	return &htab->map;
 
 free_prealloc:
@@ -732,10 +761,13 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	bool is_timeout = map->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	struct hlist_nulls_head *head;
 	struct htab_elem *l, *next_l;
 	u32 hash, key_size;
+	struct bucket *b;
 	int i = 0;
+	u64 now;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
@@ -746,7 +778,8 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 	hash = htab_map_hash(key, key_size, htab->hashrnd);
 
-	head = select_bucket(htab, hash);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
 
 	/* lookup the key */
 	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
@@ -759,6 +792,13 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 				  struct htab_elem, hash_node);
 
 	if (next_l) {
+		if (is_timeout) {
+			now = get_jiffies_64();
+			if (time_after_eq64(now, next_l->expires)) {
+				htab_sched_gc(htab, b);
+				goto find_first_elem;
+			}
+		}
 		/* if next elem in this hash list is non-zero, just return it */
 		memcpy(next_key, next_l->key, key_size);
 		return 0;
@@ -771,12 +811,20 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 find_first_elem:
 	/* iterate over buckets */
 	for (; i < htab->n_buckets; i++) {
-		head = select_bucket(htab, i);
+		b = __select_bucket(htab, i);
+		head = &b->head;
 
 		/* pick first element in the bucket */
 		next_l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
 					  struct htab_elem, hash_node);
 		if (next_l) {
+			if (is_timeout) {
+				now = get_jiffies_64();
+				if (time_after_eq64(now, next_l->expires)) {
+					htab_sched_gc(htab, b);
+					continue;
+				}
+			}
 			/* if it's not empty, just return it */
 			memcpy(next_key, next_l->key, key_size);
 			return 0;
@@ -975,18 +1023,31 @@ static int check_flags(struct bpf_htab *htab, struct htab_elem *l_old,
 	return 0;
 }
 
+static u32 fetch_timeout(u64 *map_flags)
+{
+	u32 timeout = (*map_flags) >> 32;
+
+	*map_flags = (*map_flags) & 0xffffffff;
+	return timeout;
+}
+
 /* Called from syscall or from eBPF program */
 static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 				u64 map_flags)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	bool timeout_map = map->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
 	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	unsigned long flags;
 	struct bucket *b;
 	u32 key_size, hash;
+	u32 timeout;
+	u64 now;
 	int ret;
 
+	timeout = fetch_timeout(&map_flags);
+
 	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
 		return -EINVAL;
@@ -1042,6 +1103,10 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		copy_map_value_locked(map,
 				      l_old->key + round_up(key_size, 8),
 				      value, false);
+		if (timeout_map) {
+			now = get_jiffies_64();
+			l_old->expires = now + HZ * timeout;
+		}
 		ret = 0;
 		goto err;
 	}
@@ -1054,6 +1119,13 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		goto err;
 	}
 
+	if (timeout_map) {
+		now = get_jiffies_64();
+		if (l_old && time_after_eq64(now, l_old->expires))
+			htab_sched_gc(htab, b);
+		l_new->expires = now + HZ * timeout;
+	}
+
 	/* add new element to the head of the list, so that
 	 * concurrent search will find it before old elem
 	 */
@@ -2180,3 +2252,166 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	.map_btf_name = "bpf_htab",
 	.map_btf_id = &htab_of_maps_map_btf_id,
 };
+
+static void __htab_gc_bucket(struct bpf_htab *htab, struct bucket *b)
+{
+	struct hlist_nulls_head *head = &b->head;
+	struct hlist_nulls_node *n;
+	u64 now = get_jiffies_64();
+	struct htab_elem *l;
+
+	hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
+		if (time_after_eq64(now, l->expires)) {
+			hlist_nulls_del_rcu(&l->hash_node);
+			free_htab_elem(htab, l);
+		}
+	}
+}
+
+static void htab_gc(struct work_struct *work)
+{
+	struct llist_node *head;
+	struct bpf_htab *htab;
+	struct bucket *b, *n;
+
+	htab = container_of(work, struct bpf_htab, gc_work);
+	head = llist_del_all(&htab->gc_list);
+
+	llist_for_each_entry_safe(b, n, head, gc_node) {
+		unsigned long flags;
+		int ret;
+
+		ret = htab_lock_bucket(htab, b, &flags);
+		if (ret)
+			continue;
+		__htab_gc_bucket(htab, b);
+		htab_unlock_bucket(htab, b, flags);
+
+		atomic_set(&b->pending, 0);
+
+		cond_resched();
+	}
+}
+
+static void htab_gc_idle(struct work_struct *work)
+{
+	struct bpf_htab *htab;
+	int i;
+
+	htab = container_of(work, struct bpf_htab, gc_idle_work.work);
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		unsigned long flags;
+		struct bucket *b;
+		int ret;
+
+		b = __select_bucket(htab, i);
+		if (hlist_nulls_empty(&b->head))
+			continue;
+		if (atomic_read(&b->pending))
+			continue;
+		ret = htab_lock_bucket(htab, b, &flags);
+		if (ret)
+			continue;
+		__htab_gc_bucket(htab, b);
+		htab_unlock_bucket(htab, b, flags);
+		cond_resched();
+	}
+
+	queue_delayed_work(system_power_efficient_wq, &htab->gc_idle_work, HZ);
+}
+
+static void *__htab_timeout_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_nulls_head *head;
+	struct htab_elem *l;
+	struct bucket *b;
+	u32 key_size = map->key_size;
+	u32 hash;
+
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
+	if (l && time_after_eq64(get_jiffies_64(), l->expires)) {
+		htab_sched_gc(htab, b);
+		l = NULL;
+	}
+
+	return l;
+}
+
+static void *htab_timeout_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct htab_elem *l = __htab_timeout_map_lookup_elem(map, key);
+
+	if (l)
+		return l->key + round_up(map->key_size, 8);
+	return NULL;
+}
+
+static int htab_timeout_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+	struct bpf_insn *insn = insn_buf;
+	const int ret = BPF_REG_0;
+
+	BUILD_BUG_ON(!__same_type(&__htab_timeout_map_lookup_elem,
+		     (void *(*)(struct bpf_map *map, void *key))NULL));
+	*insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_timeout_map_lookup_elem));
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
+				offsetof(struct htab_elem, key) +
+				round_up(map->key_size, 8));
+	return insn - insn_buf;
+}
+
+static void htab_timeout_map_seq_show_elem(struct bpf_map *map, void *key,
+					   struct seq_file *m)
+{
+	void *value;
+
+	rcu_read_lock();
+
+	value = htab_timeout_map_lookup_elem(map, key);
+	if (!value) {
+		rcu_read_unlock();
+		return;
+	}
+
+	btf_type_seq_show(map->btf, map->btf_key_type_id, key, m);
+	seq_puts(m, ": ");
+	btf_type_seq_show(map->btf, map->btf_value_type_id, value, m);
+	seq_puts(m, "\n");
+
+	rcu_read_unlock();
+}
+
+static void htab_timeout_map_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	cancel_work_sync(&htab->gc_work);
+	cancel_delayed_work_sync(&htab->gc_idle_work);
+
+	htab_map_free(map);
+}
+
+static int htab_timeout_map_btf_id;
+const struct bpf_map_ops htab_timeout_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = htab_map_alloc_check,
+	.map_alloc = htab_map_alloc,
+	.map_free = htab_timeout_map_free,
+	.map_get_next_key = htab_map_get_next_key,
+	.map_lookup_elem = htab_timeout_map_lookup_elem,
+	.map_update_elem = htab_map_update_elem,
+	.map_delete_elem = htab_map_delete_elem,
+	.map_gen_lookup = htab_timeout_map_gen_lookup,
+	.map_seq_show_elem = htab_timeout_map_seq_show_elem,
+	BATCH_OPS(htab),
+	.map_btf_name = "bpf_htab",
+	.map_btf_id = &htab_timeout_map_btf_id,
+	.iter_seq_info = &iter_seq_info,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 287be337d5f6..9ebd2e380a57 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -778,7 +778,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
-		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
+		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
+		    map->map_type != BPF_MAP_TYPE_TIMEOUT_HASH)
 			return -ENOTSUPP;
 		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
 		    map->value_size) {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 30b477a26482..684b8011a97a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -158,6 +158,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
+	BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
-- 
2.25.1



* [Patch bpf-next 3/3] tools: add a test case for bpf timeout map
  2020-12-11  0:06 [Patch bpf-next 0/3] bpf: introduce timeout map Cong Wang
  2020-12-11  0:06 ` [Patch bpf-next 1/3] bpf: use index instead of hash for map_locked[] Cong Wang
  2020-12-11  0:06 ` [Patch bpf-next 2/3] bpf: introduce timeout map Cong Wang
@ 2020-12-11  0:06 ` Cong Wang
  2020-12-11 19:55 ` [Patch bpf-next 0/3] bpf: introduce " Andrii Nakryiko
  3 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2020-12-11  0:06 UTC (permalink / raw)
  To: netdev; +Cc: bpf, Cong Wang, Alexei Starovoitov, Daniel Borkmann, Dongdong Wang

From: Cong Wang <cong.wang@bytedance.com>

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 tools/testing/selftests/bpf/test_maps.c | 41 +++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 0ad3e6305ff0..9fc7b5424fdf 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -1723,6 +1723,45 @@ static void test_reuseport_array(void)
 	close(map_fd);
 }
 
+static void test_timeout_map(void)
+{
+	int val1 = 1, val2 = 2, val3 = 3;
+	int key1 = 1, key2 = 2, key3 = 3;
+	int fd;
+
+	fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+			    3, map_flags);
+	if (fd < 0) {
+		printf("Failed to create timeout map '%s'!\n", strerror(errno));
+		exit(1);
+	}
+
+	/* Timeout after 1 sec */
+	assert(bpf_map_update_elem(fd, &key1, &val1, (u64)1<<32) == 0);
+	/* Timeout after 2 secs */
+	assert(bpf_map_update_elem(fd, &key2, &val2, (u64)2<<32) == 0);
+	/* Timeout after 10 secs */
+	assert(bpf_map_update_elem(fd, &key3, &val3, (u64)10<<32) == 0);
+
+	sleep(1);
+	assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+	val2 = 0;
+	assert(bpf_map_lookup_elem(fd, &key2, &val2) == 0 && val2 == 2);
+
+	sleep(1);
+	assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+	assert(bpf_map_lookup_elem(fd, &key2, &val2) != 0);
+
+	/* Modify timeout to expire it earlier */
+	val3 = 0;
+	assert(bpf_map_lookup_elem(fd, &key3, &val3) == 0 && val3 == 3);
+	assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+	sleep(1);
+	assert(bpf_map_lookup_elem(fd, &key3, &val3) != 0);
+
+	close(fd);
+}
+
 static void run_all_tests(void)
 {
 	test_hashmap(0, NULL);
@@ -1752,6 +1791,8 @@ static void run_all_tests(void)
 	test_stackmap(0, NULL);
 
 	test_map_in_map();
+
+	test_timeout_map();
 }
 
 #define DEFINE_TEST(name) extern void test_##name(void);
-- 
2.25.1



* Re: [Patch bpf-next 0/3] bpf: introduce timeout map
  2020-12-11  0:06 [Patch bpf-next 0/3] bpf: introduce timeout map Cong Wang
                   ` (2 preceding siblings ...)
  2020-12-11  0:06 ` [Patch bpf-next 3/3] tools: add a test case for bpf " Cong Wang
@ 2020-12-11 19:55 ` Andrii Nakryiko
  2020-12-12 22:25   ` Cong Wang
  3 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2020-12-11 19:55 UTC (permalink / raw)
  To: Cong Wang; +Cc: Networking, bpf, Cong Wang

On Fri, Dec 11, 2020 at 2:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cong.wang@bytedance.com>
>
> This patchset introduces a new bpf hash map which has timeout.
> Patch 1 is a preparation, patch 2 is the implementation of timeout
> map, patch 3 contains a test case for timeout map. Please check each
> patch description for more details.
>
> ---

This patch set seems to be breaking existing selftests. Please take a
look ([0]).
Also, patch #3 should have a commit message, even if a pretty trivial one.

  [0] https://travis-ci.com/github/kernel-patches/bpf/builds/207928289

> Cong Wang (3):
>   bpf: use index instead of hash for map_locked[]
>   bpf: introduce timeout map
>   tools: add a test case for bpf timeout map
>
>  include/linux/bpf_types.h               |   1 +
>  include/uapi/linux/bpf.h                |   3 +-
>  kernel/bpf/hashtab.c                    | 296 +++++++++++++++++++++---
>  kernel/bpf/syscall.c                    |   3 +-
>  tools/include/uapi/linux/bpf.h          |   1 +
>  tools/testing/selftests/bpf/test_maps.c |  41 ++++
>  6 files changed, 314 insertions(+), 31 deletions(-)
>
> --
> 2.25.1
>


* Re: [Patch bpf-next 0/3] bpf: introduce timeout map
  2020-12-11 19:55 ` [Patch bpf-next 0/3] bpf: introduce " Andrii Nakryiko
@ 2020-12-12 22:25   ` Cong Wang
  2020-12-12 23:18     ` Cong Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Cong Wang @ 2020-12-12 22:25 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Networking, bpf, Cong Wang

On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Dec 11, 2020 at 2:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > This patchset introduces a new bpf hash map which has timeout.
> > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > map, patch 3 contains a test case for timeout map. Please check each
> > patch description for more details.
> >
> > ---
>
> This patch set seems to be breaking existing selftests. Please take a
> look ([0]).

Interesting, looks unrelated to my patches but let me double check.

> Also patch #3 should have a commit message, even if pretty trivial one.

Yeah, I thought its subject was sufficient for a trivial patch.

Thanks.


* Re: [Patch bpf-next 0/3] bpf: introduce timeout map
  2020-12-12 22:25   ` Cong Wang
@ 2020-12-12 23:18     ` Cong Wang
  2020-12-14  1:33       ` Andrey Ignatov
  0 siblings, 1 reply; 9+ messages in thread
From: Cong Wang @ 2020-12-12 23:18 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Networking, bpf, Cong Wang, Andrey Ignatov

On Sat, Dec 12, 2020 at 2:25 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Dec 11, 2020 at 2:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > This patchset introduces a new bpf hash map which has timeout.
> > > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > > map, patch 3 contains a test case for timeout map. Please check each
> > > patch description for more details.
> > >
> > > ---
> >
> > This patch set seems to be breaking existing selftests. Please take a
> > look ([0]).
>
> Interesting, looks unrelated to my patches but let me double check.

Cc'ing Andrey...

Looks like the failure is due to the addition of a new member to struct
htab_elem. Any reason why it is hard-coded as 64 in check_hash()?
And what's the point of verifying its size? htab_elem should only be
visible to the kernel itself.

I can certainly change 64 to whatever its new size is, but I do wonder
why the test is there.

Thanks.


* Re: [Patch bpf-next 0/3] bpf: introduce timeout map
  2020-12-12 23:18     ` Cong Wang
@ 2020-12-14  1:33       ` Andrey Ignatov
  2020-12-14 18:40         ` Cong Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Ignatov @ 2020-12-14  1:33 UTC (permalink / raw)
  To: Cong Wang; +Cc: Andrii Nakryiko, Networking, bpf, Cong Wang

Cong Wang <xiyou.wangcong@gmail.com> [Sat, 2020-12-12 15:18 -0800]:
> On Sat, Dec 12, 2020 at 2:25 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 2:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > From: Cong Wang <cong.wang@bytedance.com>
> > > >
> > > > This patchset introduces a new bpf hash map which has timeout.
> > > > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > > > map, patch 3 contains a test case for timeout map. Please check each
> > > > patch description for more details.
> > > >
> > > > ---
> > >
> > > This patch set seems to be breaking existing selftests. Please take a
> > > look ([0]).
> >
> > Interesting, looks unrelated to my patches but let me double check.
> 
> Cc'ing Andrey...
> 
> Looks like the failure is due to the addition of a new member to struct
> htab_elem. Any reason why it is hard-coded as 64 in check_hash()?
> And what's the point of verifying its size? htab_elem should be only
> visible to the kernel itself.
> 
> I can certainly change 64 to whatever its new size is, but I do wonder
> why the test is there.

Cong, the test is there to make sure that access to map pointers from
BPF programs works.

Please see commit 41c48f3a9823 ("bpf: Support access to bpf map fields")
for more details on what "access to map pointers" means, but it's basically
a way to access any field (e.g. max_entries) of the common `struct bpf_map`
or of any type-specific struct like `struct bpf_htab` from a BPF program,
i.e. these structs are visible not only to the kernel but also to BPF
programs.

The point of the test is to access a few fields from every map struct
and make sure it works. Changing `struct htab_elem` indeed breaks the
`VERIFY(hash->elem_size == 64);` check, but it can easily be updated
(from 64 to whatever the new size is) or replaced by some other field
check. `htab->elem_size` was chosen semi-randomly since any
bpf_htab-specific field would work for the test's purposes.
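
For reference, the relevant part of the selftest looks roughly like this
(simplified from tools/testing/selftests/bpf/progs/map_ptr_kern.c; the
m_hash map, the struct bpf_map mirror and the VERIFY() macro are defined
elsewhere in that file):

	/* The BPF program declares a minimal mirror of the kernel's
	 * struct bpf_htab; preserve_access_index makes libbpf relocate
	 * the field offsets against the running kernel's BTF.
	 */
	struct bpf_htab {
		struct bpf_map map;
		__u32 n_buckets;
		__u32 elem_size;
	} __attribute__((preserve_access_index));

	static inline int check_hash(void)
	{
		struct bpf_htab *hash = (struct bpf_htab *)&m_hash;

		VERIFY(hash->elem_size == 64); /* breaks when htab_elem grows */
		return 1;
	}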

Hope it clarifies.

Also since you add a new map type it would be great to cover it in
tools/testing/selftests/bpf/progs/map_ptr_kern.c as well.

-- 
Andrey Ignatov


* Re: [Patch bpf-next 0/3] bpf: introduce timeout map
  2020-12-14  1:33       ` Andrey Ignatov
@ 2020-12-14 18:40         ` Cong Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2020-12-14 18:40 UTC (permalink / raw)
  To: Andrey Ignatov; +Cc: Andrii Nakryiko, Networking, bpf, Cong Wang

On Sun, Dec 13, 2020 at 5:33 PM Andrey Ignatov <rdna@fb.com> wrote:
>
> Cong Wang <xiyou.wangcong@gmail.com> [Sat, 2020-12-12 15:18 -0800]:
> > On Sat, Dec 12, 2020 at 2:25 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Fri, Dec 11, 2020 at 2:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > From: Cong Wang <cong.wang@bytedance.com>
> > > > >
> > > > > This patchset introduces a new bpf hash map which has timeout.
> > > > > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > > > > map, patch 3 contains a test case for timeout map. Please check each
> > > > > patch description for more details.
> > > > >
> > > > > ---
> > > >
> > > > This patch set seems to be breaking existing selftests. Please take a
> > > > look ([0]).
> > >
> > > Interesting, looks unrelated to my patches but let me double check.
> >
> > Cc'ing Andrey...
> >
> > Looks like the failure is due to the addition of a new member to struct
> > htab_elem. Any reason why it is hard-coded as 64 in check_hash()?
> > And what's the point of verifying its size? htab_elem should be only
> > visible to the kernel itself.
> >
> > I can certainly change 64 to whatever its new size is, but I do wonder
> > why the test is there.
>
> Cong, the test is there to make sure that access to map pointers from
> BPF program works.
>
> Please see (41c48f3a9823 "bpf: Support access to bpf map fields") for
> more details on what "access to map pointer" means, but it's basically a
> way to access any field (e.g. max_entries) of common `struct bpf_map` or
> any type-specific struct like `struct bpf_htab` from BPF program, i.e.
> these structs are visible to not only kernel but also to BPF programs.

I see, I was not aware of this.

>
> The point of the test is to access a few fields from every map struct
> and make sure it works. Changing `struct htab_elem` indeed breaks the
> `VERIFY(hash->elem_size == 64);` check. But it can be easily updated
> (from 64 to whatever new size is) or replaced by some other field check.
> `htab->elem_size` was chosen semi-randomly since any bpf_htab-specific
> field would work for the test's purposes.

Good to know it is useful. I will have to change 64 to 72, as I tried to
use sizeof() but struct htab_elem is not visible to that test.

>
> Hope it clarifies.
>
> Also since you add a new map type it would be great to cover it in
> tools/testing/selftests/bpf/progs/map_ptr_kern.c as well.

Yeah, will do.

Thanks.

