* [PATCH 0/6] bpf: hash: optimization
@ 2015-12-15 11:20 Ming Lei
  2015-12-15 11:20 ` [PATCH 1/6] bpf: hash: use atomic count Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:20 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev

Hi,

This patchset tries to optimize the eBPF hash map, following
two ideas:

1) Both htab_map_update_elem() and htab_map_delete_elem()
can be called from an eBPF program, and they may sit in a
kernel hot path, so it isn't efficient to take a per-hashtable
lock in these two helpers. This patchset converts the lock
into a per-bucket bit spinlock.

2) kmalloc() is called in htab_map_update_elem() to allocate
each element, together with one global counter that tracks how
many elements have been allocated. kmalloc() is often a bit slow,
and the global counter doesn't scale well. This patchset
pre-allocates an element pool and uses percpu_ida for runtime
element allocation/freeing; the global counter is removed as well.
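
In condensed form, the two changes look roughly like this in the
update path (a sketch assembled from patches 4/6 and 5/6 below,
not a standalone snippet):

	/* 1) lock only the target bucket, using the low bit of
	 *    hlist_head->first as a bit spinlock (patches 2-4)
	 */
	raw_local_irq_save(flags);
	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
	/* ... lookup/add/delete on this bucket via the _lock helpers ... */
	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
	raw_local_irq_restore(flags);

	/* 2) take elements from a pre-allocated pool via percpu_ida
	 *    instead of kmalloc() plus a global counter (patch 5)
	 */
	tag = percpu_ida_alloc(&htab->elems_pool, TASK_RUNNING);
	if (tag < 0)
		return NULL;		/* pool (max_entries) exhausted */
	l_new = htab->elems[tag];	/* slot pre-allocated in htab_map_alloc() */
	/* ... and percpu_ida_free(&htab->elems_pool, l_new->tag) on free */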

With this patchset the performance penalty from eBPF decreases
a lot, as the following test shows:

1) run 'tools/biolatency' of bcc before running the block test;

2) run fio to test block throughput over /dev/nullb0
(randread, 16 jobs, libaio, 4k block size); the test box
is a 24-core (dual socket) VM server:

	- without patchset:  607K IOPS
	- with this patchset: 1332K IOPS
	- without running eBPF prog: 1492K IOPS

 include/linux/rculist.h |  55 +++++++++++
 kernel/bpf/hashtab.c    | 247 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 252 insertions(+), 50 deletions(-)


Thanks,
Ming 



* [PATCH 1/6] bpf: hash: use atomic count
  2015-12-15 11:20 [PATCH 0/6] bpf: hash: optimization Ming Lei
@ 2015-12-15 11:20 ` Ming Lei
  2015-12-15 11:21 ` [PATCH 2/6] hlist: prepare for supporting bit spinlock Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:20 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev, Ming Lei

Prepare for removing the global per-hashtable lock: the
element counter needs to be converted to atomic_t first.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/bpf/hashtab.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 34777b3..2615388 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -18,7 +18,7 @@ struct bpf_htab {
 	struct bpf_map map;
 	struct hlist_head *buckets;
 	raw_spinlock_t lock;
-	u32 count;	/* number of elements in this hashtable */
+	atomic_t count;	/* number of elements in this hashtable */
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
 };
@@ -106,7 +106,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		INIT_HLIST_HEAD(&htab->buckets[i]);
 
 	raw_spin_lock_init(&htab->lock);
-	htab->count = 0;
+	atomic_set(&htab->count, 0);
 
 	return &htab->map;
 
@@ -256,7 +256,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
 
-	if (!l_old && unlikely(htab->count >= map->max_entries)) {
+	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
 		/* if elem with this 'key' doesn't exist and we've reached
 		 * max_entries limit, fail insertion of new elem
 		 */
@@ -284,7 +284,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		hlist_del_rcu(&l_old->hash_node);
 		kfree_rcu(l_old, rcu);
 	} else {
-		htab->count++;
+		atomic_inc(&htab->count);
 	}
 	raw_spin_unlock_irqrestore(&htab->lock, flags);
 
@@ -319,7 +319,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 
 	if (l) {
 		hlist_del_rcu(&l->hash_node);
-		htab->count--;
+		atomic_dec(&htab->count);
 		kfree_rcu(l, rcu);
 		ret = 0;
 	}
@@ -339,7 +339,7 @@ static void delete_all_elements(struct bpf_htab *htab)
 
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
 			hlist_del_rcu(&l->hash_node);
-			htab->count--;
+			atomic_dec(&htab->count);
 			kfree(l);
 		}
 	}
-- 
1.9.1



* [PATCH 2/6] hlist: prepare for supporting bit spinlock
  2015-12-15 11:20 [PATCH 0/6] bpf: hash: optimization Ming Lei
  2015-12-15 11:20 ` [PATCH 1/6] bpf: hash: use atomic count Ming Lei
@ 2015-12-15 11:21 ` Ming Lei
  2015-12-15 11:21 ` [PATCH 3/6] bpf: hash: move select_bucket() out of htab's spinlock Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:21 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev, Ming Lei

hlist_head is often used to implement the bucket head of a
hash table, and a lock is usually needed for adding/deleting
nodes in the bucket list. Using a per-bucket spinlock can
consume lots of memory, so this patch uses the 1st bit of
hlist_head->first as a bit lock for this purpose.

A bit spinlock isn't as efficient as a spinlock, but contention
on a per-bucket hlist shouldn't be very high, so a bit
spinlock should be OK.
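
As an example, patch 4/6 uses the new helpers in the htab update
path roughly like below (sketch only; the caller takes the bucket's
bit lock with interrupts disabled before touching the list):

	raw_local_irq_save(flags);
	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);

	/* walk via a temporary head with the lock bit masked off */
	l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), hash,
				key, key_size);

	hlist_add_head_rcu_lock(&l_new->hash_node, head);
	if (l_old)
		hlist_del_rcu_lock(&l_old->hash_node);

	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
	raw_local_irq_restore(flags);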

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 include/linux/rculist.h | 55 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 14ec165..9fc394a 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -435,6 +435,61 @@ static inline void hlist_replace_rcu(struct hlist_node *old,
 #define hlist_next_rcu(node)	(*((struct hlist_node __rcu **)(&(node)->next)))
 #define hlist_pprev_rcu(node)	(*((struct hlist_node __rcu **)((node)->pprev)))
 
+/*
+ * This bit of hlist_head->first can be used as per-bucket
+ * bit spin_lock in hash table use case, and the lock bit
+ * should always be kept when hlist_node is added, deleted
+ * and other update operations.
+ *
+ * When lock bit is used, only the _lock version of hlist
+ * helpers can be used for operating the hlist.
+ */
+#define	HLIST_LOCK_BIT		0
+#define	HLIST_LOCK_MASK		(1UL << HLIST_LOCK_BIT)
+#define hlist_get_first(h)	\
+	(struct hlist_node *)((unsigned long)(h)->first & ~HLIST_LOCK_MASK)
+#define hlist_get_lock(h)    ((unsigned long)(h)->first & HLIST_LOCK_MASK)
+#define hlist_make_1st_node(h, n) \
+	(struct hlist_node *)((unsigned long)(n) | hlist_get_lock(h))
+
+static inline struct hlist_head *hlist_get_head_lock(
+		struct hlist_head *head, struct hlist_head *new_head)
+{
+	new_head->first = hlist_get_first(head);
+	return new_head;
+}
+
+static inline void __hlist_del_lock(struct hlist_node *n)
+{
+	struct hlist_node *next = n->next;
+	struct hlist_node **pprev = n->pprev;
+
+	WRITE_ONCE(*pprev, (unsigned long)next |
+			((unsigned long)*pprev & HLIST_LOCK_MASK));
+	if (next)
+		next->pprev = pprev;
+}
+
+/* The bit lock should be held before calling this helper */
+static inline void hlist_del_rcu_lock(struct hlist_node *n)
+{
+	__hlist_del_lock(n);
+	n->pprev = LIST_POISON2;
+}
+
+/* The bit lock should be held before calling this helper */
+static inline void hlist_add_head_rcu_lock(struct hlist_node *n,
+					   struct hlist_head *h)
+{
+	struct hlist_node *first = hlist_get_first(h);
+
+	n->next = first;
+	n->pprev = &h->first;
+	rcu_assign_pointer(hlist_first_rcu(h), hlist_make_1st_node(h, n));
+	if (first)
+		first->pprev = &n->next;
+}
+
 /**
  * hlist_add_head_rcu
  * @n: the element to add to the hash list.
-- 
1.9.1



* [PATCH 3/6] bpf: hash: move select_bucket() out of htab's spinlock
  2015-12-15 11:20 [PATCH 0/6] bpf: hash: optimization Ming Lei
  2015-12-15 11:20 ` [PATCH 1/6] bpf: hash: use atomic count Ming Lei
  2015-12-15 11:21 ` [PATCH 2/6] hlist: prepare for supporting bit spinlock Ming Lei
@ 2015-12-15 11:21 ` Ming Lei
  2015-12-15 11:21 ` [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:21 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev, Ming Lei

The spinlock is only used for protecting the per-bucket
hlist, so it isn't needed for selecting the bucket.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/bpf/hashtab.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 2615388..d857fcb 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -248,12 +248,11 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
 
 	l_new->hash = htab_map_hash(l_new->key, key_size);
+	head = select_bucket(htab, l_new->hash);
 
 	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&htab->lock, flags);
 
-	head = select_bucket(htab, l_new->hash);
-
 	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
 
 	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
@@ -310,11 +309,10 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	key_size = map->key_size;
 
 	hash = htab_map_hash(key, key_size);
+	head = select_bucket(htab, hash);
 
 	raw_spin_lock_irqsave(&htab->lock, flags);
 
-	head = select_bucket(htab, hash);
-
 	l = lookup_elem_raw(head, hash, key, key_size);
 
 	if (l) {
-- 
1.9.1



* [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock
  2015-12-15 11:20 [PATCH 0/6] bpf: hash: optimization Ming Lei
                   ` (2 preceding siblings ...)
  2015-12-15 11:21 ` [PATCH 3/6] bpf: hash: move select_bucket() out of htab's spinlock Ming Lei
@ 2015-12-15 11:21 ` Ming Lei
  2015-12-15 22:51   ` Alexei Starovoitov
  2015-12-15 11:21 ` [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog Ming Lei
  2015-12-15 11:21 ` [PATCH 6/6] bpf: hash: reorganize 'struct htab_elem' Ming Lei
  5 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:21 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev, Ming Lei

Both htab_map_update_elem() and htab_map_delete_elem() can be
called from an eBPF program, and they may sit in a kernel hot
path, so it isn't efficient to use a per-hashtable lock in
these two helpers.

The per-hashtable spinlock is only used for protecting the
bucket's hlist, and a per-bucket lock should be enough. This
patch converts the per-hashtable lock into a per-bucket bit
spinlock, so that contention is decreased a lot and no extra
memory is consumed for these locks.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/bpf/hashtab.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d857fcb..8543fea 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,7 +17,6 @@
 struct bpf_htab {
 	struct bpf_map map;
 	struct hlist_head *buckets;
-	raw_spinlock_t lock;
 	atomic_t count;	/* number of elements in this hashtable */
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
@@ -105,7 +104,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < htab->n_buckets; i++)
 		INIT_HLIST_HEAD(&htab->buckets[i]);
 
-	raw_spin_lock_init(&htab->lock);
 	atomic_set(&htab->count, 0);
 
 	return &htab->map;
@@ -142,6 +140,7 @@ static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_head *head;
+	struct hlist_head h;
 	struct htab_elem *l;
 	u32 hash, key_size;
 
@@ -153,6 +152,7 @@ static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
 	hash = htab_map_hash(key, key_size);
 
 	head = select_bucket(htab, hash);
+	head = hlist_get_head_lock(head, &h);
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
@@ -167,6 +167,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_head *head;
+	struct hlist_head h;
 	struct htab_elem *l, *next_l;
 	u32 hash, key_size;
 	int i;
@@ -178,6 +179,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	hash = htab_map_hash(key, key_size);
 
 	head = select_bucket(htab, hash);
+	head = hlist_get_head_lock(head, &h);
 
 	/* lookup the key */
 	l = lookup_elem_raw(head, hash, key, key_size);
@@ -205,6 +207,7 @@ find_first_elem:
 	/* iterate over buckets */
 	for (; i < htab->n_buckets; i++) {
 		head = select_bucket(htab, i);
+		head = hlist_get_head_lock(head, &h);
 
 		/* pick first element in the bucket */
 		next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
@@ -227,6 +230,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct htab_elem *l_new, *l_old;
 	struct hlist_head *head;
+	struct hlist_head h;
 	unsigned long flags;
 	u32 key_size;
 	int ret;
@@ -251,9 +255,11 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	head = select_bucket(htab, l_new->hash);
 
 	/* bpf_map_update_elem() can be called in_irq() */
-	raw_spin_lock_irqsave(&htab->lock, flags);
+	raw_local_irq_save(flags);
+	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 
-	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+	l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
+			key, key_size);
 
 	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
 		/* if elem with this 'key' doesn't exist and we've reached
@@ -278,18 +284,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	/* add new element to the head of the list, so that concurrent
 	 * search will find it before old elem
 	 */
-	hlist_add_head_rcu(&l_new->hash_node, head);
+	hlist_add_head_rcu_lock(&l_new->hash_node, head);
 	if (l_old) {
-		hlist_del_rcu(&l_old->hash_node);
+		hlist_del_rcu_lock(&l_old->hash_node);
 		kfree_rcu(l_old, rcu);
 	} else {
 		atomic_inc(&htab->count);
 	}
-	raw_spin_unlock_irqrestore(&htab->lock, flags);
+	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
+	raw_local_irq_restore(flags);
 
 	return 0;
 err:
-	raw_spin_unlock_irqrestore(&htab->lock, flags);
+	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
+	raw_local_irq_restore(flags);
 	kfree(l_new);
 	return ret;
 }
@@ -299,6 +307,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct hlist_head *head;
+	struct hlist_head h;
 	struct htab_elem *l;
 	unsigned long flags;
 	u32 hash, key_size;
@@ -311,18 +320,20 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	hash = htab_map_hash(key, key_size);
 	head = select_bucket(htab, hash);
 
-	raw_spin_lock_irqsave(&htab->lock, flags);
+	raw_local_irq_save(flags);
+	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 
-	l = lookup_elem_raw(head, hash, key, key_size);
+	l = lookup_elem_raw(hlist_get_head_lock(head, &h), hash, key, key_size);
 
 	if (l) {
-		hlist_del_rcu(&l->hash_node);
+		hlist_del_rcu_lock(&l->hash_node);
 		atomic_dec(&htab->count);
 		kfree_rcu(l, rcu);
 		ret = 0;
 	}
 
-	raw_spin_unlock_irqrestore(&htab->lock, flags);
+	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
+	raw_local_irq_restore(flags);
 	return ret;
 }
 
@@ -332,9 +343,12 @@ static void delete_all_elements(struct bpf_htab *htab)
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_head h;
 		struct hlist_node *n;
 		struct htab_elem *l;
 
+		head = hlist_get_head_lock(head, &h);
+
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
 			hlist_del_rcu(&l->hash_node);
 			atomic_dec(&htab->count);
-- 
1.9.1



* [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 11:20 [PATCH 0/6] bpf: hash: optimization Ming Lei
                   ` (3 preceding siblings ...)
  2015-12-15 11:21 ` [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock Ming Lei
@ 2015-12-15 11:21 ` Ming Lei
  2015-12-15 23:10   ` Alexei Starovoitov
                     ` (2 more replies)
  2015-12-15 11:21 ` [PATCH 6/6] bpf: hash: reorganize 'struct htab_elem' Ming Lei
  5 siblings, 3 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:21 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev, Ming Lei

kmalloc() is often a bit time-consuming, and one atomic
counter has to be used to track the total number of allocated
elements, which doesn't scale well either.

This patch pre-allocates an element pool in htab_map_alloc(),
then uses percpu_ida to allocate a slot from the pool at runtime,
so the runtime allocation/freeing cost is decreased.

From my test, fio throughput improves by at least 10% in the block
I/O test when tools/biolatency of bcc (iovisor) is running.
I/O test when tools/biolatency of bcc(iovisor) is running.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/bpf/hashtab.c | 204 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 167 insertions(+), 37 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8543fea..c1600c3 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -13,23 +13,165 @@
 #include <linux/jhash.h>
 #include <linux/filter.h>
 #include <linux/vmalloc.h>
+#include <linux/percpu_ida.h>
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+	union {
+		struct hlist_node hash_node;
+
+		/* used after deleted from hash */
+		struct bpf_htab *htab;
+	};
+	struct rcu_head rcu;
+	u32 hash;
+	u32 tag;
+	char key[0] __aligned(8);
+};
 
 struct bpf_htab {
 	struct bpf_map map;
 	struct hlist_head *buckets;
-	atomic_t count;	/* number of elements in this hashtable */
 	u32 n_buckets;	/* number of hash buckets */
 	u32 elem_size;	/* size of each element in bytes */
-};
 
-/* each htab element is struct htab_elem + key + value */
-struct htab_elem {
-	struct hlist_node hash_node;
-	struct rcu_head rcu;
-	u32 hash;
-	char key[0] __aligned(8);
+	struct list_head page_list;
+	struct htab_elem **elems;
+	struct percpu_ida elems_pool;
 };
 
+static size_t order_to_size(unsigned int order)
+{
+	return (size_t)PAGE_SIZE << order;
+}
+
+/* Called from syscall, and the code is borrowed from blk_mq */
+static int htab_pre_alloc_elems(struct bpf_htab *htab)
+{
+	const unsigned max_order = 4;
+	unsigned elem_size = htab->elem_size, i;
+	unsigned nr_entries = htab->map.max_entries;
+	size_t left = nr_entries * elem_size;
+
+	htab->elems = kzalloc(nr_entries * sizeof(struct htab_elem *),
+			      GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+	if (!htab->elems)
+		goto fail;
+
+	INIT_LIST_HEAD(&htab->page_list);
+
+	for (i = 0; i < nr_entries; ) {
+		int this_order = max_order;
+		struct page *page;
+		int j, to_do;
+		void *p;
+
+		while (left < order_to_size(this_order - 1) && this_order)
+			this_order--;
+
+		do {
+			page = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+					   __GFP_NORETRY | __GFP_ZERO,
+					   this_order);
+			if (page)
+				break;
+			if (!this_order--)
+				break;
+			if (order_to_size(this_order) < elem_size)
+				break;
+		} while (1);
+
+		if (!page)
+			goto fail;
+
+		page->private = this_order;
+		list_add_tail(&page->lru, &htab->page_list);
+
+		p = page_address(page);
+
+		to_do = min_t(unsigned,
+			      order_to_size(this_order) / elem_size,
+			      nr_entries - i);
+		left -= to_do * elem_size;
+
+		for (j = 0; j < to_do; j++) {
+			htab->elems[i] = p;
+			p += elem_size;
+			i++;
+		}
+	}
+	return 0;
+
+fail:
+	kfree(htab->elems);
+	return -ENOMEM;
+}
+
+static void htab_destroy_elems(struct bpf_htab *htab)
+{
+	struct page *page;
+
+	while (!list_empty(&htab->page_list)) {
+		page = list_first_entry(&htab->page_list, struct page, lru);
+		list_del_init(&page->lru);
+		__free_pages(page, page->private);
+	}
+
+	kfree(htab->elems);
+}
+
+static int htab_init_elems_allocator(struct bpf_htab *htab)
+{
+	int ret = htab_pre_alloc_elems(htab);
+
+	if (ret)
+		return ret;
+
+	ret = percpu_ida_init(&htab->elems_pool, htab->map.max_entries);
+	if (ret)
+		htab_destroy_elems(htab);
+	return ret;
+}
+
+static void htab_deinit_elems_allocator(struct bpf_htab *htab)
+{
+	htab_destroy_elems(htab);
+	percpu_ida_destroy(&htab->elems_pool);
+}
+
+static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
+{
+	int tag = percpu_ida_alloc(&htab->elems_pool, TASK_RUNNING);
+	struct htab_elem *elem;
+
+	if (tag < 0)
+		return NULL;
+
+	elem = htab->elems[tag];
+	elem->tag = tag;
+	return elem;
+}
+
+static void htab_free_elem(struct bpf_htab *htab, struct htab_elem *elem)
+{
+	percpu_ida_free(&htab->elems_pool, elem->tag);
+}
+
+static void htab_free_elem_cb(struct rcu_head *head)
+{
+	struct htab_elem *elem = container_of(head, struct htab_elem, rcu);
+
+	htab_free_elem(elem->htab, elem);
+}
+
+static void htab_free_elem_rcu(struct bpf_htab *htab,
+			       struct htab_elem *elem)
+{
+	hlist_del_rcu_lock(&elem->hash_node);
+	elem->htab = htab;
+	call_rcu(&elem->rcu, htab_free_elem_cb);
+}
+
 /* Called from syscall */
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
@@ -72,9 +214,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 		 */
 		goto free_htab;
 
-	htab->elem_size = sizeof(struct htab_elem) +
-			  round_up(htab->map.key_size, 8) +
-			  htab->map.value_size;
+	htab->elem_size = round_up(sizeof(struct htab_elem) +
+				   round_up(htab->map.key_size, 8) +
+				   htab->map.value_size,
+				   cache_line_size());
 
 	/* prevent zero size kmalloc and check for u32 overflow */
 	if (htab->n_buckets == 0 ||
@@ -104,10 +247,14 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < htab->n_buckets; i++)
 		INIT_HLIST_HEAD(&htab->buckets[i]);
 
-	atomic_set(&htab->count, 0);
+	err = htab_init_elems_allocator(htab);
+	if (err)
+		goto free_buckets;
 
 	return &htab->map;
 
+free_buckets:
+	kvfree(htab->buckets);
 free_htab:
 	kfree(htab);
 	return ERR_PTR(err);
@@ -242,9 +389,9 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	/* allocate new element outside of lock */
-	l_new = kmalloc(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN);
+	l_new = htab_alloc_elem(htab);
 	if (!l_new)
-		return -ENOMEM;
+		return -E2BIG;
 
 	key_size = map->key_size;
 
@@ -261,14 +408,6 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
 			key, key_size);
 
-	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
-		/* if elem with this 'key' doesn't exist and we've reached
-		 * max_entries limit, fail insertion of new elem
-		 */
-		ret = -E2BIG;
-		goto err;
-	}
-
 	if (l_old && map_flags == BPF_NOEXIST) {
 		/* elem already exists */
 		ret = -EEXIST;
@@ -285,12 +424,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	 * search will find it before old elem
 	 */
 	hlist_add_head_rcu_lock(&l_new->hash_node, head);
-	if (l_old) {
-		hlist_del_rcu_lock(&l_old->hash_node);
-		kfree_rcu(l_old, rcu);
-	} else {
-		atomic_inc(&htab->count);
-	}
+	if (l_old)
+		htab_free_elem_rcu(htab, l_old);
 	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 	raw_local_irq_restore(flags);
 
@@ -298,7 +433,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 err:
 	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 	raw_local_irq_restore(flags);
-	kfree(l_new);
+	htab_free_elem(htab, l_new);
 	return ret;
 }
 
@@ -324,11 +459,8 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
 
 	l = lookup_elem_raw(hlist_get_head_lock(head, &h), hash, key, key_size);
-
 	if (l) {
-		hlist_del_rcu_lock(&l->hash_node);
-		atomic_dec(&htab->count);
-		kfree_rcu(l, rcu);
+		htab_free_elem_rcu(htab, l);
 		ret = 0;
 	}
 
@@ -349,11 +481,8 @@ static void delete_all_elements(struct bpf_htab *htab)
 
 		head = hlist_get_head_lock(head, &h);
 
-		hlist_for_each_entry_safe(l, n, head, hash_node) {
+		hlist_for_each_entry_safe(l, n, head, hash_node)
 			hlist_del_rcu(&l->hash_node);
-			atomic_dec(&htab->count);
-			kfree(l);
-		}
 	}
 }
 
@@ -373,6 +502,7 @@ static void htab_map_free(struct bpf_map *map)
 	 * executed. It's ok. Proceed to free residual elements and map itself
 	 */
 	delete_all_elements(htab);
+	htab_deinit_elems_allocator(htab);
 	kvfree(htab->buckets);
 	kfree(htab);
 }
-- 
1.9.1



* [PATCH 6/6] bpf: hash: reorganize 'struct htab_elem'
  2015-12-15 11:20 [PATCH 0/6] bpf: hash: optimization Ming Lei
                   ` (4 preceding siblings ...)
  2015-12-15 11:21 ` [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog Ming Lei
@ 2015-12-15 11:21 ` Ming Lei
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-15 11:21 UTC (permalink / raw)
  To: linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev, Ming Lei

The lifetime of the hash fields and the lifetime of the
kfree_rcu fields don't overlap, so reorganize them into a
union for better readability.

This change also saves one sizeof(void *) per element, and the
cache footprint is improved too.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 kernel/bpf/hashtab.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c1600c3..5476545 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -17,15 +17,20 @@
 
 /* each htab element is struct htab_elem + key + value */
 struct htab_elem {
+	u32 tag;
 	union {
-		struct hlist_node hash_node;
-
-		/* used after deleted from hash */
-		struct bpf_htab *htab;
+		/* won't be used after being removed from hash */
+		struct {
+			u32 hash;
+			struct hlist_node hash_node;
+		};
+
+		/* set after being deleted from hash */
+		struct {
+			struct bpf_htab *htab;
+			struct rcu_head rcu;
+		};
 	};
-	struct rcu_head rcu;
-	u32 hash;
-	u32 tag;
 	char key[0] __aligned(8);
 };
 
-- 
1.9.1



* Re: [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock
  2015-12-15 11:21 ` [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock Ming Lei
@ 2015-12-15 22:51   ` Alexei Starovoitov
  2015-12-16  2:57     ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2015-12-15 22:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Alexei Starovoitov, David S. Miller, netdev

On Tue, Dec 15, 2015 at 07:21:02PM +0800, Ming Lei wrote:
> Both htab_map_update_elem() and htab_map_delete_elem() can be
> called from eBPF program, and they may be in kernel hot path,
> so it isn't efficient to use a per-hashtable lock in this two
> helpers.
> 
> The per-hashtable spinlock is used just for protecting bucket's
> hlist, and per-bucket lock should be enough. This patch converts
> the per-hashtable lock into per-bucket bit spinlock, so that
> contention can be decreased a lot, and no extra memory can be
> consumed for these locks.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

thank you for working on this.
Interesting stuff!

>  	/* bpf_map_update_elem() can be called in_irq() */
> -	raw_spin_lock_irqsave(&htab->lock, flags);
> +	raw_local_irq_save(flags);
> +	bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);

can you add a helper for bit_spin_lock/unlock as well so that whole hlist+bit
api looks consistent?

>  
> -	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
> +	l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
> +			key, key_size);
>  
>  	if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
>  		/* if elem with this 'key' doesn't exist and we've reached
> @@ -278,18 +284,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	/* add new element to the head of the list, so that concurrent
>  	 * search will find it before old elem
>  	 */
> -	hlist_add_head_rcu(&l_new->hash_node, head);
> +	hlist_add_head_rcu_lock(&l_new->hash_node, head);

I think the new macros have confusing names:

+#define hlist_get_first(h)     \
+       (struct hlist_node *)((unsigned long)(h)->first & ~HLIST_LOCK_MASK)
+#define hlist_get_lock(h)    ((unsigned long)(h)->first & HLIST_LOCK_MASK)
+#define hlist_make_1st_node(h, n) \
+       (struct hlist_node *)((unsigned long)(n) | hlist_get_lock(h))
+
+static inline struct hlist_head *hlist_get_head_lock(
+               struct hlist_head *head, struct hlist_head *new_head)
+{
+       new_head->first = hlist_get_first(head);
+       return new_head;
+}

This new hlist_add_head_rcu_lock() adds a new element and doesn't take a new lock.
Maybe rename this new API with a 'locked_hlist' prefix?
Then all normal helpers will convert like:
hlist_add_head_rcu() -> locked_hlist_add_head_rcu()

>  	if (l_old) {
> -		hlist_del_rcu(&l_old->hash_node);
> +		hlist_del_rcu_lock(&l_old->hash_node);

and here it will be:
hlist_del_rcu() -> locked_hlist_del_rcu()

Also is there a race here ?
+static inline void hlist_add_head_rcu_lock(struct hlist_node *n,
+                                          struct hlist_head *h)
+{
+       struct hlist_node *first = hlist_get_first(h);
+
+       n->next = first;
+       n->pprev = &h->first;
+       rcu_assign_pointer(hlist_first_rcu(h), hlist_make_1st_node(h, n));
+       if (first)
+               first->pprev = &n->next;
+}

Do you need cmpxchg when updating hlist_head->first?

Overall this looks like a very interesting idea, but I'm not sure about the
trade-off of saving 8 bytes for a rounded-up spinlock per bucket.
The loss of lockdep is concerning.
Have you compared the performance of this bit-lock embedded inside hlist_head vs
just adding a spinlock_t for every hlist_head?
Another concern is that hlist walking may be more costly due to the
requirement of having to clear that bit while doing the walk in lookup().
I guess I would prefer a normal spinlock per bucket. The additional 8 * n_buckets
bytes don't seem worth optimizing away.



* Re: [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 11:21 ` [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog Ming Lei
@ 2015-12-15 23:10   ` Alexei Starovoitov
  2015-12-15 23:42     ` Daniel Borkmann
  2015-12-16  7:13     ` Ming Lei
  2015-12-15 23:21   ` Daniel Borkmann
  2015-12-15 23:35   ` Daniel Borkmann
  2 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2015-12-15 23:10 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Alexei Starovoitov, David S. Miller, netdev

On Tue, Dec 15, 2015 at 07:21:03PM +0800, Ming Lei wrote:
> kmalloc() is often a bit time-consuming, also
> one atomic counter has to be used to track the total
> allocated elements, which is also not good.
> 
> This patch pre-allocates element pool in htab_map_alloc(),
> then use percpu_ida to allocate one slot from the pool,
> then the runtime allocation/freeing cost can be decreased.
> 
> From my test, at least 10% fio throughput is improved in block
> I/O test when tools/biolatency of bcc(iovisor) is running.
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

Looks very interesting as well.
The approach looks good.
If you can make a common allocation helper for this map and
for blk-mq, that would be even better.

> -	htab->elem_size = sizeof(struct htab_elem) +
> -			  round_up(htab->map.key_size, 8) +
> -			  htab->map.value_size;
> +	htab->elem_size = round_up(sizeof(struct htab_elem) +
> +				   round_up(htab->map.key_size, 8) +
> +				   htab->map.value_size,
> +				   cache_line_size());

This rounding to cache line size is great for performance, but it's extra
memory upfront which may not be needed. Pre-allocation is a classic
performance vs memory trade-off; in other cases it may hurt.
So could you change the patch to do pre-allocation only when
requested by user space, via an extra flag for the hash map or via a
new BPF_MAP_TYPE_HASH_PREALLOC type? Not sure yet whether a flag or a
new type is better. I guess the implementation will dictate.
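
For illustration only, the flag variant could look something like the
sketch below (BPF_F_PREALLOC, the map_flags attribute field and the
use_prealloc member are made-up names here, not existing uapi or fields):

	/* hypothetical: the flag, attr->map_flags and htab->use_prealloc
	 * are all assumptions for the sake of the example
	 */
	#define BPF_F_PREALLOC	(1U << 0)

	static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
	{
		...
		htab->use_prealloc = attr->map_flags & BPF_F_PREALLOC;
		if (htab->use_prealloc) {
			err = htab_init_elems_allocator(htab);
			if (err)
				goto free_buckets;
		}
		...
	}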

PS
Glad that you found iovisor/tools/biolatency useful.
It's indeed pretty helpful to analyze real-time block io latency.



* Re: [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 11:21 ` [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog Ming Lei
  2015-12-15 23:10   ` Alexei Starovoitov
@ 2015-12-15 23:21   ` Daniel Borkmann
  2015-12-15 23:35   ` Daniel Borkmann
  2 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2015-12-15 23:21 UTC (permalink / raw)
  To: Ming Lei, linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev

On 12/15/2015 12:21 PM, Ming Lei wrote:
...
> +/* Called from syscall, and the code is borrowed from blk_mq */
> +static int htab_pre_alloc_elems(struct bpf_htab *htab)
> +{
> +	const unsigned max_order = 4;
> +	unsigned elem_size = htab->elem_size, i;
> +	unsigned nr_entries = htab->map.max_entries;
> +	size_t left = nr_entries * elem_size;
> +
> +	htab->elems = kzalloc(nr_entries * sizeof(struct htab_elem *),
> +			      GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

Should this use GFP_USER (same below)?

Also, when having a large number of elements, e.g. > 1 million, should we fall
back to vzalloc()?

> +	if (!htab->elems)
> +		goto fail;
> +
> +	INIT_LIST_HEAD(&htab->page_list);
> +
> +	for (i = 0; i < nr_entries; ) {
> +		int this_order = max_order;
> +		struct page *page;
> +		int j, to_do;
> +		void *p;
> +
> +		while (left < order_to_size(this_order - 1) && this_order)
> +			this_order--;
> +
> +		do {
> +			page = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> +					   __GFP_NORETRY | __GFP_ZERO,
> +					   this_order);
> +			if (page)
> +				break;
...
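
For illustration, the fall-back being suggested would look roughly like
the sketch below; it mirrors what htab_map_alloc() already does for the
bucket array (which is freed with kvfree()):

	size_t size = nr_entries * sizeof(struct htab_elem *);

	/* GFP_USER per the comment above; fall back to vzalloc() for
	 * large allocations instead of failing outright
	 */
	htab->elems = kzalloc(size, GFP_USER | __GFP_NOWARN | __GFP_NORETRY);
	if (!htab->elems)
		htab->elems = vzalloc(size);
	if (!htab->elems)
		goto fail;

The corresponding kfree(htab->elems) would then need to become
kvfree(htab->elems).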


* Re: [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 11:21 ` [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog Ming Lei
  2015-12-15 23:10   ` Alexei Starovoitov
  2015-12-15 23:21   ` Daniel Borkmann
@ 2015-12-15 23:35   ` Daniel Borkmann
  2015-12-16  0:12     ` Daniel Borkmann
  2 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2015-12-15 23:35 UTC (permalink / raw)
  To: Ming Lei, linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev

On 12/15/2015 12:21 PM, Ming Lei wrote:
...
> +static int htab_init_elems_allocator(struct bpf_htab *htab)
> +{
> +	int ret = htab_pre_alloc_elems(htab);
> +
> +	if (ret)
> +		return ret;
> +
> +	ret = percpu_ida_init(&htab->elems_pool, htab->map.max_entries);
> +	if (ret)
> +		htab_destroy_elems(htab);
> +	return ret;
> +}
> +
> +static void htab_deinit_elems_allocator(struct bpf_htab *htab)
> +{
> +	htab_destroy_elems(htab);
> +	percpu_ida_destroy(&htab->elems_pool);
> +}
> +
> +static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
> +{
> +	int tag = percpu_ida_alloc(&htab->elems_pool, TASK_RUNNING);
> +	struct htab_elem *elem;
> +
> +	if (tag < 0)
> +		return NULL;
> +
> +	elem = htab->elems[tag];
> +	elem->tag = tag;
> +	return elem;
> +}
....
> @@ -285,12 +424,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>   	 * search will find it before old elem
>   	 */
>   	hlist_add_head_rcu_lock(&l_new->hash_node, head);
> -	if (l_old) {
> -		hlist_del_rcu_lock(&l_old->hash_node);
> -		kfree_rcu(l_old, rcu);
> -	} else {
> -		atomic_inc(&htab->count);
> -	}
> +	if (l_old)
> +		htab_free_elem_rcu(htab, l_old);
>   	bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
>   	raw_local_irq_restore(flags);

On a quick look, you are using the ida to keep track of elements, right? What happens
if you have a hash table with max_entries of 1, fill that one slot and later on try to
replace it with a different element?

The old behaviour (htab->count) doesn't increase the htab count and would allow the
replacement of that element to happen.

Looks like in your case we'd get -E2BIG from htab_alloc_elem(), no? ... as the
preallocated pool is already used up then?


* Re: [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 23:10   ` Alexei Starovoitov
@ 2015-12-15 23:42     ` Daniel Borkmann
  2015-12-16  7:13     ` Ming Lei
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2015-12-15 23:42 UTC (permalink / raw)
  To: Alexei Starovoitov, Ming Lei
  Cc: linux-kernel, Alexei Starovoitov, David S. Miller, netdev

On 12/16/2015 12:10 AM, Alexei Starovoitov wrote:
...
> this rounding to cache line is great for performance, but it's extra
> memory upfront which may not be needed. The per-allocation is a classic
> performance vs memory trade-off. In other cases it may hurt.
> So could you change the patch to do pre-allocation only when
> requested by user space via extra flag for hash map or via
> new BPF_MAP_TYPE_HASH_PREALLOC type? Not sure yet whether flag or
> new type is better. I guess implementation will dictate.

I was also thinking about this; probably a new map type makes sense.


* Re: [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 23:35   ` Daniel Borkmann
@ 2015-12-16  0:12     ` Daniel Borkmann
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2015-12-16  0:12 UTC (permalink / raw)
  To: Ming Lei, linux-kernel, Alexei Starovoitov; +Cc: David S. Miller, netdev

On 12/16/2015 12:35 AM, Daniel Borkmann wrote:
> On 12/15/2015 12:21 PM, Ming Lei wrote:
> ...
>> +static int htab_init_elems_allocator(struct bpf_htab *htab)
>> +{
>> +    int ret = htab_pre_alloc_elems(htab);
>> +
>> +    if (ret)
>> +        return ret;
>> +
>> +    ret = percpu_ida_init(&htab->elems_pool, htab->map.max_entries);
>> +    if (ret)
>> +        htab_destroy_elems(htab);
>> +    return ret;
>> +}
>> +
>> +static void htab_deinit_elems_allocator(struct bpf_htab *htab)
>> +{
>> +    htab_destroy_elems(htab);
>> +    percpu_ida_destroy(&htab->elems_pool);
>> +}
>> +
>> +static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
>> +{
>> +    int tag = percpu_ida_alloc(&htab->elems_pool, TASK_RUNNING);
>> +    struct htab_elem *elem;
>> +
>> +    if (tag < 0)
>> +        return NULL;
>> +
>> +    elem = htab->elems[tag];
>> +    elem->tag = tag;
>> +    return elem;
>> +}
> ....
>> @@ -285,12 +424,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>        * search will find it before old elem
>>        */
>>       hlist_add_head_rcu_lock(&l_new->hash_node, head);
>> -    if (l_old) {
>> -        hlist_del_rcu_lock(&l_old->hash_node);
>> -        kfree_rcu(l_old, rcu);
>> -    } else {
>> -        atomic_inc(&htab->count);
>> -    }
>> +    if (l_old)
>> +        htab_free_elem_rcu(htab, l_old);
>>       bit_spin_unlock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
>>       raw_local_irq_restore(flags);
>
> On a quick look, you are using the ida to keep track of elements, right? What happens
> if you have a hash-table of max_entry size 1, fill that one slot and later on try to
> replace it with a different element.
>
> Old behaviour (htab->count) doesn't increase htab count and would allow the replacement
> of that element to happen.
>
> Looks like in your case, we'd get -E2BIG from htab_alloc_elem(), no? ... as preallocated
> pool is already used up then?

Btw, taking that further: when htab elem replacements happen in parallel (e.g. from one
or multiple user space applications via bpf(2) and/or one or multiple eBPF programs)
on the same shared map, the current behavior allows setup of the new elements to
happen first (outside of the htab lock) and then serializes the replacement via the lock.

So such a map type would probably need to overcommit the pool preallocations beyond
max_entries.


* Re: [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock
  2015-12-15 22:51   ` Alexei Starovoitov
@ 2015-12-16  2:57     ` Ming Lei
       [not found]       ` <CAHbLzkpW_seTrs+WgFqsriH5uhG4LMY_jiYC0iRQ-LAdJFiUjw@mail.gmail.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2015-12-16  2:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linux Kernel Mailing List, Alexei Starovoitov, David S. Miller,
	Network Development

Hi Alexei,

On Wed, Dec 16, 2015 at 6:51 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Dec 15, 2015 at 07:21:02PM +0800, Ming Lei wrote:
>> Both htab_map_update_elem() and htab_map_delete_elem() can be
>> called from eBPF program, and they may be in kernel hot path,
>> so it isn't efficient to use a per-hashtable lock in this two
>> helpers.
>>
>> The per-hashtable spinlock is used just for protecting bucket's
>> hlist, and per-bucket lock should be enough. This patch converts
>> the per-hashtable lock into per-bucket bit spinlock, so that
>> contention can be decreased a lot, and no extra memory can be
>> consumed for these locks.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>
> thank you for working on this.
> Interesting stuff!

Thanks for your review!

>
>>       /* bpf_map_update_elem() can be called in_irq() */
>> -     raw_spin_lock_irqsave(&htab->lock, flags);
>> +     raw_local_irq_save(flags);
>> +     bit_spin_lock(HLIST_LOCK_BIT, (unsigned long *)&head->first);
>
> can you add a helper for bit_spin_lock/unlock as well so that whole hlist+bit
> api looks consistent?

OK, I will try to add it in V1.

>
>>
>> -     l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
>> +     l_old = lookup_elem_raw(hlist_get_head_lock(head, &h), l_new->hash,
>> +                     key, key_size);
>>
>>       if (!l_old && unlikely(atomic_read(&htab->count) >= map->max_entries)) {
>>               /* if elem with this 'key' doesn't exist and we've reached
>> @@ -278,18 +284,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>       /* add new element to the head of the list, so that concurrent
>>        * search will find it before old elem
>>        */
>> -     hlist_add_head_rcu(&l_new->hash_node, head);
>> +     hlist_add_head_rcu_lock(&l_new->hash_node, head);
>
> I think the new macros have confusing names:
>
> +#define hlist_get_first(h)     \
> +       (struct hlist_node *)((unsigned long)(h)->first & ~HLIST_LOCK_MASK)
> +#define hlist_get_lock(h)    ((unsigned long)(h)->first & HLIST_LOCK_MASK)
> +#define hlist_make_1st_node(h, n) \
> +       (struct hlist_node *)((unsigned long)(n) | hlist_get_lock(h))
> +
> +static inline struct hlist_head *hlist_get_head_lock(
> +               struct hlist_head *head, struct hlist_head *new_head)
> +{
> +       new_head->first = hlist_get_first(head);
> +       return new_head;
> +}
>
> This new hlist_add_head_rcu_lock() adds new element and it doesn't take new lock.
> May be rename this new api as 'locked_hlist' ?

Looks much better, :-)

> Then all normal helpers will convert like:
> hlist_add_head_rcu() -> locked_hlist_add_head_rcu()
>
>>       if (l_old) {
>> -             hlist_del_rcu(&l_old->hash_node);
>> +             hlist_del_rcu_lock(&l_old->hash_node);
>
> and here it will be:
> hlist_del_rcu() -> locked_hlist_del_rcu()
>
> Also is there a race here ?

Both locked_hlist_add_head_rcu() and locked_hlist_del_rcu()
won't change the lock bit of hlist_head->first, and the lock has
to be held before calling the two helpers.

So I don't think there is a race.

> +static inline void hlist_add_head_rcu_lock(struct hlist_node *n,
> +                                          struct hlist_head *h)
> +{
> +       struct hlist_node *first = hlist_get_first(h);
> +
> +       n->next = first;
> +       n->pprev = &h->first;
> +       rcu_assign_pointer(hlist_first_rcu(h), hlist_make_1st_node(h, n));
> +       if (first)
> +               first->pprev = &n->next;
> +}
>
> Do you need cmpxchg when updatding hlist_head->first ?

Firstly, it is just one pointer update, and updating the pointer is
implicitly guaranteed to be atomic.

Secondly, the bit spinlock is held before calling the helper, so no
other concurrent hlist add/del is possible.

So cmpxchg isn't needed here.

>
> Overall looks like a very interesting idea, but I'm not sure about
> trade-off of saving 8 bytes for rounded-up spinlock per bucket.

The hashtab can be very big, and one reason why hlist_head is
defined as a single pointer is to save memory.

> The loss of lockdep is concerning.

Currently we have been using raw_spin_lock/unlock, so lockdep
is bypassed already; also, no other lock is nested inside
the bit lock, so lockdep isn't needed.

> Have you compared performance of this bit-lock embedded inside hlist_head vs
> just adding spinlock_t for every hlist_head?

Yes, I did, and no difference was observed.

> Another concern is that hlist walking may be more costly due to
> requirement of having to clear that bit while doing the walk in lookup().

No matter whether the bit is cleared or not, the head variable needs to be
loaded into cache and a register, and clearing the bit is just one
instruction (or zero), so I think the cost is a nop or close to one.

> I guess I would prefer normal spinlock per bucket. Additional 8 * n_buckets bytes
> don't seem to be worth optimizing.

As I mentioned, the hashtable can be very big, and that is why
hlist_head is defined as a single pointer.  Also, a bit spinlock is
enough for this very low contention case, :-)

In the future, we may even extend this into a generic hashtable
usage pattern, :-)

Thanks,
Ming Lei


* Re: [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock
       [not found]       ` <CAHbLzkpW_seTrs+WgFqsriH5uhG4LMY_jiYC0iRQ-LAdJFiUjw@mail.gmail.com>
@ 2015-12-16  6:58         ` Ming Lei
  2015-12-18  6:20           ` Alexei Starovoitov
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2015-12-16  6:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Alexei Starovoitov, Linux Kernel Mailing List,
	Alexei Starovoitov, David S. Miller, Network Development,
	Steven Rostedt

On Wed, Dec 16, 2015 at 1:01 PM, Yang Shi <shy828301@gmail.com> wrote:

>
> I recalled Steven confirmed raw_spin_lock has the lockdep benefit too in the
> patch review for changing to raw lock.
>
> Please check this thread out
>  http://lists.openwall.net/netdev/2015/10/31/7

OK, looks like I was wrong about the lockdep benefit, :-(

But for this lock, I think lockdep isn't that important, because it is the
innermost lock and it is used just for protecting the bucket list;
nothing else needs to be covered.


Thanks,
Ming Lei


* Re: [PATCH 5/6] bpf: hash: avoid to call kmalloc() in eBPF prog
  2015-12-15 23:10   ` Alexei Starovoitov
  2015-12-15 23:42     ` Daniel Borkmann
@ 2015-12-16  7:13     ` Ming Lei
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-16  7:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Linux Kernel Mailing List, Alexei Starovoitov, David S. Miller,
	Network Development

On Wed, Dec 16, 2015 at 7:10 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Dec 15, 2015 at 07:21:03PM +0800, Ming Lei wrote:
>> kmalloc() is often a bit time-consuming, also
>> one atomic counter has to be used to track the total
>> allocated elements, which is also not good.
>>
>> This patch pre-allocates element pool in htab_map_alloc(),
>> then use percpu_ida to allocate one slot from the pool,
>> then the runtime allocation/freeing cost can be decreased.
>>
>> From my test, at least 10% fio throughput is improved in block
>> I/O test when tools/biolatency of bcc(iovisor) is running.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>
> Looks very intersting as well.
> Approach looks good.
> If you can make a common allocation helper for this map and
> for blk-mq would be even better.

OK, I will see if it is doable.

>
>> -     htab->elem_size = sizeof(struct htab_elem) +
>> -                       round_up(htab->map.key_size, 8) +
>> -                       htab->map.value_size;
>> +     htab->elem_size = round_up(sizeof(struct htab_elem) +
>> +                                round_up(htab->map.key_size, 8) +
>> +                                htab->map.value_size,
>> +                                cache_line_size());
>
> this rounding to cache line is great for performance, but it's extra
> memory upfront which may not be needed. The per-allocation is a classic
> performance vs memory trade-off. In other cases it may hurt.

The current kmalloc() allocation of 'struct htab_elem' is still cache line
aligned, which is one reason why I chose to do it, but we can change
it too.

> So could you change the patch to do pre-allocation only when
> requested by user space via extra flag for hash map or via
> new BPF_MAP_TYPE_HASH_PREALLOC type? Not sure yet whether flag or
> new type is better. I guess implementation will dictate.

That looks like a better idea; then we can let the user make the choice.

>
> PS
> Glad that you found iovisor/tools/biolatency useful.
> It's indeed pretty helpful to analyze real-time block io latency.

This tool is great and I have played with it for a while, :-)



Thanks,
Ming Lei


* Re: [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock
  2015-12-16  6:58         ` Ming Lei
@ 2015-12-18  6:20           ` Alexei Starovoitov
  2015-12-26  8:58             ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2015-12-18  6:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Yang Shi, Linux Kernel Mailing List, Alexei Starovoitov,
	David S. Miller, Network Development, Steven Rostedt

On Wed, Dec 16, 2015 at 02:58:08PM +0800, Ming Lei wrote:
> On Wed, Dec 16, 2015 at 1:01 PM, Yang Shi <shy828301@gmail.com> wrote:
> 
> >
> > I recalled Steven confirmed raw_spin_lock has the lockdep benefit too in the
> > patch review for changing to raw lock.
> >
> > Please check this thread out
> >  http://lists.openwall.net/netdev/2015/10/31/7
> 
> OK, looks I was wrong about the lockdep benifit, :-(
> 
> But for this lock, I think lockdep isn't such important, because it is the
> intermost lock, and it can be used just for protecting the bucket list
> and nothing else need to be covered.

I still think that the overhead of a normal spinlock per bucket is acceptable.
It makes the whole thing much easier to read.



* Re: [PATCH 4/6] bpf: hash: convert per-hashtable lock into per-bucket bit spinlock
  2015-12-18  6:20           ` Alexei Starovoitov
@ 2015-12-26  8:58             ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2015-12-26  8:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yang Shi, Linux Kernel Mailing List, Alexei Starovoitov,
	David S. Miller, Network Development, Steven Rostedt

On Fri, Dec 18, 2015 at 2:20 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Dec 16, 2015 at 02:58:08PM +0800, Ming Lei wrote:
>> On Wed, Dec 16, 2015 at 1:01 PM, Yang Shi <shy828301@gmail.com> wrote:
>>
>> >
>> > I recalled Steven confirmed raw_spin_lock has the lockdep benefit too in the
>> > patch review for changing to raw lock.
>> >
>> > Please check this thread out
>> >  http://lists.openwall.net/netdev/2015/10/31/7
>>
>> OK, looks I was wrong about the lockdep benifit, :-(
>>
>> But for this lock, I think lockdep isn't such important, because it is the
>> intermost lock, and it can be used just for protecting the bucket list
>> and nothing else need to be covered.
>
> I still think that overhead of normal spinlock per bucket is acceptable.
> Makes the whole thing much easier to read.

OK, let's use per-bucket spinlock first.

--
Ming Lei

