netdev.vger.kernel.org archive mirror
* [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
@ 2022-05-18  6:27 Feng zhou
  2022-05-18  6:32 ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Feng zhou @ 2022-05-18  6:27 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh
  Cc: netdev, bpf, linux-kernel, duanxiongchun, songmuchun,
	wangdongdong.6, cong.wang, zhouchengming, zhoufeng.zf

From: Feng Zhou <zhoufeng.zf@bytedance.com>

We encountered a bad case on a big system with 96 CPUs where
alloc_htab_elem() could take up to 1ms. The reason is that once the
prealloc hashtab has no free elements, every update attempt still
grabs the spin_locks of all CPUs. With multiple concurrent updaters,
the lock contention becomes severe.

So this patch adds an is_empty flag to pcpu_freelist_head to record
whether the freelist has any free elements. If it does, grab the
spin_lock; otherwise, move on and check the next cpu's freelist.
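
The check-before-lock idea can be sketched in userspace. This is an
illustrative model only, not the kernel code: the struct layout, the
pthread locking, and the helper names are assumptions made for the
sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the patched pcpu freelist: each
 * per-CPU head carries an is_empty flag that is checked before the
 * lock is taken, so fully drained lists are skipped without locking. */
struct node { struct node *next; };

struct head {
	pthread_mutex_t lock;
	struct node *first;
	bool is_empty;
};

static int locks_taken; /* counts how often a lock was actually taken */

static void head_push(struct head *h, struct node *n)
{
	pthread_mutex_lock(&h->lock);
	locks_taken++;
	n->next = h->first;
	h->first = n;
	h->is_empty = false;
	pthread_mutex_unlock(&h->lock);
}

static struct node *freelist_pop(struct head *heads, int nheads, int start)
{
	for (int i = 0; i < nheads; i++) {
		struct head *h = &heads[(start + i) % nheads];

		if (h->is_empty)
			continue;	/* skip without touching the lock */
		pthread_mutex_lock(&h->lock);
		locks_taken++;
		struct node *n = h->first;
		if (n) {
			h->first = n->next;
			if (!h->first)
				h->is_empty = true;	/* list just drained */
			pthread_mutex_unlock(&h->lock);
			return n;
		}
		pthread_mutex_unlock(&h->lock);
	}
	return NULL;
}
```

When every list is empty, the pop scans all heads without a single lock
acquisition, which is exactly the contention the patch removes.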

Before patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 975345 events per sec
4:hash_map_perf pre-alloc 855367 events per sec
12:hash_map_perf pre-alloc 860862 events per sec
8:hash_map_perf pre-alloc 849561 events per sec
3:hash_map_perf pre-alloc 849074 events per sec
6:hash_map_perf pre-alloc 847120 events per sec
10:hash_map_perf pre-alloc 845047 events per sec
5:hash_map_perf pre-alloc 841266 events per sec
14:hash_map_perf pre-alloc 849740 events per sec
2:hash_map_perf pre-alloc 839598 events per sec
9:hash_map_perf pre-alloc 838695 events per sec
11:hash_map_perf pre-alloc 845390 events per sec
7:hash_map_perf pre-alloc 834865 events per sec
13:hash_map_perf pre-alloc 842619 events per sec
1:hash_map_perf pre-alloc 804231 events per sec
15:hash_map_perf pre-alloc 795314 events per sec

hash_map worst case: no free elems
./map_perf_test 2048
6:worse hash_map_perf pre-alloc 28628 events per sec
5:worse hash_map_perf pre-alloc 28553 events per sec
11:worse hash_map_perf pre-alloc 28543 events per sec
3:worse hash_map_perf pre-alloc 28444 events per sec
1:worse hash_map_perf pre-alloc 28418 events per sec
7:worse hash_map_perf pre-alloc 28427 events per sec
13:worse hash_map_perf pre-alloc 28330 events per sec
14:worse hash_map_perf pre-alloc 28263 events per sec
9:worse hash_map_perf pre-alloc 28211 events per sec
15:worse hash_map_perf pre-alloc 28193 events per sec
12:worse hash_map_perf pre-alloc 28190 events per sec
10:worse hash_map_perf pre-alloc 28129 events per sec
8:worse hash_map_perf pre-alloc 28116 events per sec
4:worse hash_map_perf pre-alloc 27906 events per sec
2:worse hash_map_perf pre-alloc 27801 events per sec
0:worse hash_map_perf pre-alloc 27416 events per sec
3:worse hash_map_perf pre-alloc 28188 events per sec

ftrace trace

0)               |  htab_map_update_elem() {
0)   0.198 us    |    migrate_disable();
0)               |    _raw_spin_lock_irqsave() {
0)   0.157 us    |      preempt_count_add();
0)   0.538 us    |    }
0)   0.260 us    |    lookup_elem_raw();
0)               |    alloc_htab_elem() {
0)               |      __pcpu_freelist_pop() {
0)               |        _raw_spin_lock() {
0)   0.152 us    |          preempt_count_add();
0)   0.352 us    |          native_queued_spin_lock_slowpath();
0)   1.065 us    |        }
                 |        ...
0)               |        _raw_spin_unlock() {
0)   0.254 us    |          preempt_count_sub();
0)   0.555 us    |        }
0) + 25.188 us   |      }
0) + 25.486 us   |    }
0)               |    _raw_spin_unlock_irqrestore() {
0)   0.155 us    |      preempt_count_sub();
0)   0.454 us    |    }
0)   0.148 us    |    migrate_enable();
0) + 28.439 us   |  }

The test machine has 16 CPUs, so a failed pop tries to grab a spin_lock
17 times: once for each of the 16 per-cpu freelists, plus once for the
extralist.
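
The 17-attempt worst case follows directly from the scan order of the
pop loop. A small sketch (the function name is hypothetical; this only
counts acquisitions, it does not lock anything):

```c
#include <assert.h>

/* Model of the pre-patch worst case: when every freelist is empty,
 * ___pcpu_freelist_pop() locks each of the nr_cpus per-CPU lists once
 * (wrapping around from the starting CPU) and finally the extralist,
 * so a 16-CPU machine performs 16 + 1 = 17 lock acquisitions per
 * failed pop. */
static int worst_case_lock_attempts(int nr_cpus, int start_cpu)
{
	int attempts = 0;
	int cpu = start_cpu;

	do {
		attempts++;			/* raw_spin_lock on this cpu's list */
		cpu = (cpu + 1) % nr_cpus;	/* cpumask_next() with wraparound */
	} while (cpu != start_cpu);

	return attempts + 1;			/* plus one for the extralist lock */
}
```

On the 96-CPU machine from the report, the same formula gives 97 lock
acquisitions per failed update.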

after patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 969348 events per sec
10:hash_map_perf pre-alloc 906526 events per sec
11:hash_map_perf pre-alloc 904557 events per sec
9:hash_map_perf pre-alloc 902384 events per sec
15:hash_map_perf pre-alloc 912287 events per sec
14:hash_map_perf pre-alloc 905689 events per sec
12:hash_map_perf pre-alloc 903680 events per sec
13:hash_map_perf pre-alloc 902631 events per sec
8:hash_map_perf pre-alloc 875369 events per sec
4:hash_map_perf pre-alloc 862808 events per sec
1:hash_map_perf pre-alloc 857218 events per sec
2:hash_map_perf pre-alloc 852875 events per sec
5:hash_map_perf pre-alloc 846497 events per sec
6:hash_map_perf pre-alloc 828467 events per sec
3:hash_map_perf pre-alloc 812542 events per sec
7:hash_map_perf pre-alloc 805336 events per sec

hash_map worst case: no free elems
./map_perf_test 2048
7:worse hash_map_perf pre-alloc 391104 events per sec
4:worse hash_map_perf pre-alloc 388073 events per sec
5:worse hash_map_perf pre-alloc 387038 events per sec
1:worse hash_map_perf pre-alloc 386546 events per sec
0:worse hash_map_perf pre-alloc 384590 events per sec
11:worse hash_map_perf pre-alloc 379378 events per sec
10:worse hash_map_perf pre-alloc 375480 events per sec
12:worse hash_map_perf pre-alloc 372394 events per sec
6:worse hash_map_perf pre-alloc 367692 events per sec
3:worse hash_map_perf pre-alloc 363970 events per sec
9:worse hash_map_perf pre-alloc 364008 events per sec
8:worse hash_map_perf pre-alloc 363759 events per sec
2:worse hash_map_perf pre-alloc 360743 events per sec
14:worse hash_map_perf pre-alloc 361195 events per sec
13:worse hash_map_perf pre-alloc 360276 events per sec
15:worse hash_map_perf pre-alloc 360057 events per sec
0:worse hash_map_perf pre-alloc 378177 events per sec

ftrace trace
0)               |  htab_map_update_elem() {
0)   0.317 us    |    migrate_disable();
0)               |    _raw_spin_lock_irqsave() {
0)   0.260 us    |      preempt_count_add();
0)   1.803 us    |    }
0)   0.276 us    |    lookup_elem_raw();
0)               |    alloc_htab_elem() {
0)   0.586 us    |      __pcpu_freelist_pop();
0)   0.945 us    |    }
0)               |    _raw_spin_unlock_irqrestore() {
0)   0.160 us    |      preempt_count_sub();
0)   0.972 us    |    }
0)   0.657 us    |    migrate_enable();
0)   8.669 us    |  }

It can be seen that with this patch applied, map performance in the
normal case is almost unchanged, and when free=0 we first check
is_empty instead of unconditionally acquiring the spin_lock.

As for why we add is_empty instead of checking head->first directly: my
understanding is that head->first is modified frequently while the map
is being updated, which keeps invalidating the cache line on the other
cpus, whereas is_empty only changes when the freelist transitions
between empty and non-empty, so performance is better.
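
That cache argument can be made concrete with a rough write-counting
model (the field names, the depth counter, and the workload below are
assumptions for illustration, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often each field would be written over a push/pop
 * workload: "first" changes on every push and pop, while is_empty is
 * written only on the empty <-> non-empty transitions. */
static int first_writes, is_empty_writes;

struct mhead { int depth; bool is_empty; };

static void model_push(struct mhead *h)
{
	h->depth++;
	first_writes++;			/* head->first changes on every push */
	if (h->is_empty) {
		h->is_empty = false;	/* only on the empty -> non-empty edge */
		is_empty_writes++;
	}
}

static bool model_pop(struct mhead *h)
{
	if (h->is_empty)
		return false;
	h->depth--;
	first_writes++;			/* head->first changes on every pop */
	if (h->depth == 0) {
		h->is_empty = true;	/* only when the list fully drains */
		is_empty_writes++;
	}
	return true;
}
```

With any workload that batches pushes and pops, is_empty is written
orders of magnitude less often than first, so readers polling is_empty
rarely take a cache miss caused by a remote write.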

Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
---
 kernel/bpf/percpu_freelist.c | 68 ++++++++++++++++++++++++------------
 kernel/bpf/percpu_freelist.h |  1 +
 2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
index 3d897de89061..abc148c8cae1 100644
--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -16,9 +16,11 @@ int pcpu_freelist_init(struct pcpu_freelist *s)
 
 		raw_spin_lock_init(&head->lock);
 		head->first = NULL;
+		head->is_empty = true;
 	}
 	raw_spin_lock_init(&s->extralist.lock);
 	s->extralist.first = NULL;
+	s->extralist.is_empty = true;
 	return 0;
 }
 
@@ -32,6 +34,8 @@ static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
 {
 	node->next = head->first;
 	head->first = node;
+	if (head->is_empty)
+		head->is_empty = false;
 }
 
 static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
@@ -130,14 +134,18 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
 	orig_cpu = cpu = raw_smp_processor_id();
 	while (1) {
 		head = per_cpu_ptr(s->freelist, cpu);
-		raw_spin_lock(&head->lock);
-		node = head->first;
-		if (node) {
-			head->first = node->next;
+		if (!head->is_empty) {
+			raw_spin_lock(&head->lock);
+			node = head->first;
+			if (node) {
+				head->first = node->next;
+				if (!head->first)
+					head->is_empty = true;
+				raw_spin_unlock(&head->lock);
+				return node;
+			}
 			raw_spin_unlock(&head->lock);
-			return node;
 		}
-		raw_spin_unlock(&head->lock);
 		cpu = cpumask_next(cpu, cpu_possible_mask);
 		if (cpu >= nr_cpu_ids)
 			cpu = 0;
@@ -146,11 +154,16 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
 	}
 
 	/* per cpu lists are all empty, try extralist */
-	raw_spin_lock(&s->extralist.lock);
-	node = s->extralist.first;
-	if (node)
-		s->extralist.first = node->next;
-	raw_spin_unlock(&s->extralist.lock);
+	if (!s->extralist.is_empty) {
+		raw_spin_lock(&s->extralist.lock);
+		node = s->extralist.first;
+		if (node) {
+			s->extralist.first = node->next;
+			if (!s->extralist.first)
+				s->extralist.is_empty = true;
+		}
+		raw_spin_unlock(&s->extralist.lock);
+	}
 	return node;
 }
 
@@ -164,14 +177,18 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
 	orig_cpu = cpu = raw_smp_processor_id();
 	while (1) {
 		head = per_cpu_ptr(s->freelist, cpu);
-		if (raw_spin_trylock(&head->lock)) {
-			node = head->first;
-			if (node) {
-				head->first = node->next;
+		if (!head->is_empty) {
+			if (raw_spin_trylock(&head->lock)) {
+				node = head->first;
+				if (node) {
+					head->first = node->next;
+					if (!head->first)
+						head->is_empty = true;
+					raw_spin_unlock(&head->lock);
+					return node;
+				}
 				raw_spin_unlock(&head->lock);
-				return node;
 			}
-			raw_spin_unlock(&head->lock);
 		}
 		cpu = cpumask_next(cpu, cpu_possible_mask);
 		if (cpu >= nr_cpu_ids)
@@ -181,12 +198,17 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
 	}
 
 	/* cannot pop from per cpu lists, try extralist */
-	if (!raw_spin_trylock(&s->extralist.lock))
-		return NULL;
-	node = s->extralist.first;
-	if (node)
-		s->extralist.first = node->next;
-	raw_spin_unlock(&s->extralist.lock);
+	if (!s->extralist.is_empty) {
+		if (!raw_spin_trylock(&s->extralist.lock))
+			return NULL;
+		node = s->extralist.first;
+		if (node) {
+			s->extralist.first = node->next;
+			if (!s->extralist.first)
+				s->extralist.is_empty = true;
+		}
+		raw_spin_unlock(&s->extralist.lock);
+	}
 	return node;
 }
 
diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h
index 3c76553cfe57..9e4545631ed5 100644
--- a/kernel/bpf/percpu_freelist.h
+++ b/kernel/bpf/percpu_freelist.h
@@ -9,6 +9,7 @@
 struct pcpu_freelist_head {
 	struct pcpu_freelist_node *first;
 	raw_spinlock_t lock;
+	bool is_empty;
 };
 
 struct pcpu_freelist {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-18  6:27 [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems Feng zhou
@ 2022-05-18  6:32 ` Alexei Starovoitov
  2022-05-18  6:57   ` [External] " Feng Zhou
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2022-05-18  6:32 UTC (permalink / raw)
  To: Feng zhou
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Network Development, bpf, LKML, Xiongchun Duan,
	Muchun Song, Dongdong Wang, Cong Wang, Chengming Zhou

On Tue, May 17, 2022 at 11:27 PM Feng zhou <zhoufeng.zf@bytedance.com> wrote:
>
> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>
> We encountered bad case on big system with 96 CPUs that
> alloc_htab_elem() would last for 1ms. The reason is that after the
> prealloc hashtab has no free elems, when trying to update, it will still
> grab spin_locks of all cpus. If there are multiple update users, the
> competition is very serious.
>
> So this patch add is_empty in pcpu_freelist_head to check freelist
> having free or not. If having, grab spin_lock, or check next cpu's
> freelist.
>
> Before patch: hash_map performance
> ./map_perf_test 1
> 0:hash_map_perf pre-alloc 975345 events per sec
> 4:hash_map_perf pre-alloc 855367 events per sec
> 12:hash_map_perf pre-alloc 860862 events per sec
> 8:hash_map_perf pre-alloc 849561 events per sec
> 3:hash_map_perf pre-alloc 849074 events per sec
> 6:hash_map_perf pre-alloc 847120 events per sec
> 10:hash_map_perf pre-alloc 845047 events per sec
> 5:hash_map_perf pre-alloc 841266 events per sec
> 14:hash_map_perf pre-alloc 849740 events per sec
> 2:hash_map_perf pre-alloc 839598 events per sec
> 9:hash_map_perf pre-alloc 838695 events per sec
> 11:hash_map_perf pre-alloc 845390 events per sec
> 7:hash_map_perf pre-alloc 834865 events per sec
> 13:hash_map_perf pre-alloc 842619 events per sec
> 1:hash_map_perf pre-alloc 804231 events per sec
> 15:hash_map_perf pre-alloc 795314 events per sec
>
> hash_map the worst: no free
> ./map_perf_test 2048
> 6:worse hash_map_perf pre-alloc 28628 events per sec
> 5:worse hash_map_perf pre-alloc 28553 events per sec
> 11:worse hash_map_perf pre-alloc 28543 events per sec
> 3:worse hash_map_perf pre-alloc 28444 events per sec
> 1:worse hash_map_perf pre-alloc 28418 events per sec
> 7:worse hash_map_perf pre-alloc 28427 events per sec
> 13:worse hash_map_perf pre-alloc 28330 events per sec
> 14:worse hash_map_perf pre-alloc 28263 events per sec
> 9:worse hash_map_perf pre-alloc 28211 events per sec
> 15:worse hash_map_perf pre-alloc 28193 events per sec
> 12:worse hash_map_perf pre-alloc 28190 events per sec
> 10:worse hash_map_perf pre-alloc 28129 events per sec
> 8:worse hash_map_perf pre-alloc 28116 events per sec
> 4:worse hash_map_perf pre-alloc 27906 events per sec
> 2:worse hash_map_perf pre-alloc 27801 events per sec
> 0:worse hash_map_perf pre-alloc 27416 events per sec
> 3:worse hash_map_perf pre-alloc 28188 events per sec
>
> ftrace trace
>
> 0)               |  htab_map_update_elem() {
> 0)   0.198 us    |    migrate_disable();
> 0)               |    _raw_spin_lock_irqsave() {
> 0)   0.157 us    |      preempt_count_add();
> 0)   0.538 us    |    }
> 0)   0.260 us    |    lookup_elem_raw();
> 0)               |    alloc_htab_elem() {
> 0)               |      __pcpu_freelist_pop() {
> 0)               |        _raw_spin_lock() {
> 0)   0.152 us    |          preempt_count_add();
> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
> 0)   1.065 us    |        }
>                  |        ...
> 0)               |        _raw_spin_unlock() {
> 0)   0.254 us    |          preempt_count_sub();
> 0)   0.555 us    |        }
> 0) + 25.188 us   |      }
> 0) + 25.486 us   |    }
> 0)               |    _raw_spin_unlock_irqrestore() {
> 0)   0.155 us    |      preempt_count_sub();
> 0)   0.454 us    |    }
> 0)   0.148 us    |    migrate_enable();
> 0) + 28.439 us   |  }
>
> The test machine is 16C, trying to get spin_lock 17 times, in addition
> to 16c, there is an extralist.

Is this with small max_entries and a large number of cpus?

If so, probably better to fix would be to artificially
bump max_entries to be 4x of num_cpus.
Racy is_empty check still wastes the loop.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-18  6:32 ` Alexei Starovoitov
@ 2022-05-18  6:57   ` Feng Zhou
  2022-05-18 20:39     ` Yonghong Song
  0 siblings, 1 reply; 9+ messages in thread
From: Feng Zhou @ 2022-05-18  6:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Network Development, bpf, LKML, Xiongchun Duan,
	Muchun Song, Dongdong Wang, Cong Wang, Chengming Zhou

On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
> On Tue, May 17, 2022 at 11:27 PM Feng zhou <zhoufeng.zf@bytedance.com> wrote:
>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>
>> We encountered bad case on big system with 96 CPUs that
>> alloc_htab_elem() would last for 1ms. The reason is that after the
>> prealloc hashtab has no free elems, when trying to update, it will still
>> grab spin_locks of all cpus. If there are multiple update users, the
>> competition is very serious.
>>
>> So this patch add is_empty in pcpu_freelist_head to check freelist
>> having free or not. If having, grab spin_lock, or check next cpu's
>> freelist.
>>
>> Before patch: hash_map performance
>> ./map_perf_test 1
>> 0:hash_map_perf pre-alloc 975345 events per sec
>> 4:hash_map_perf pre-alloc 855367 events per sec
>> 12:hash_map_perf pre-alloc 860862 events per sec
>> 8:hash_map_perf pre-alloc 849561 events per sec
>> 3:hash_map_perf pre-alloc 849074 events per sec
>> 6:hash_map_perf pre-alloc 847120 events per sec
>> 10:hash_map_perf pre-alloc 845047 events per sec
>> 5:hash_map_perf pre-alloc 841266 events per sec
>> 14:hash_map_perf pre-alloc 849740 events per sec
>> 2:hash_map_perf pre-alloc 839598 events per sec
>> 9:hash_map_perf pre-alloc 838695 events per sec
>> 11:hash_map_perf pre-alloc 845390 events per sec
>> 7:hash_map_perf pre-alloc 834865 events per sec
>> 13:hash_map_perf pre-alloc 842619 events per sec
>> 1:hash_map_perf pre-alloc 804231 events per sec
>> 15:hash_map_perf pre-alloc 795314 events per sec
>>
>> hash_map the worst: no free
>> ./map_perf_test 2048
>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>
>> ftrace trace
>>
>> 0)               |  htab_map_update_elem() {
>> 0)   0.198 us    |    migrate_disable();
>> 0)               |    _raw_spin_lock_irqsave() {
>> 0)   0.157 us    |      preempt_count_add();
>> 0)   0.538 us    |    }
>> 0)   0.260 us    |    lookup_elem_raw();
>> 0)               |    alloc_htab_elem() {
>> 0)               |      __pcpu_freelist_pop() {
>> 0)               |        _raw_spin_lock() {
>> 0)   0.152 us    |          preempt_count_add();
>> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
>> 0)   1.065 us    |        }
>>                   |        ...
>> 0)               |        _raw_spin_unlock() {
>> 0)   0.254 us    |          preempt_count_sub();
>> 0)   0.555 us    |        }
>> 0) + 25.188 us   |      }
>> 0) + 25.486 us   |    }
>> 0)               |    _raw_spin_unlock_irqrestore() {
>> 0)   0.155 us    |      preempt_count_sub();
>> 0)   0.454 us    |    }
>> 0)   0.148 us    |    migrate_enable();
>> 0) + 28.439 us   |  }
>>
>> The test machine is 16C, trying to get spin_lock 17 times, in addition
>> to 16c, there is an extralist.
> Is this with small max_entries and a large number of cpus?
>
> If so, probably better to fix would be to artificially
> bump max_entries to be 4x of num_cpus.
> Racy is_empty check still wastes the loop.

This hash_map worst-case test ran on 16 CPUs with a map max_entries of
1000.

This is a test case I constructed: it deliberately fills the map and
then keeps updating it, just to reproduce the problem.

The bad case we encountered was with 96 CPUs and a map max_entries of
10240.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-18  6:57   ` [External] " Feng Zhou
@ 2022-05-18 20:39     ` Yonghong Song
  2022-05-19  3:12       ` Feng Zhou
  0 siblings, 1 reply; 9+ messages in thread
From: Yonghong Song @ 2022-05-18 20:39 UTC (permalink / raw)
  To: Feng Zhou, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Network Development, bpf, LKML, Xiongchun Duan, Muchun Song,
	Dongdong Wang, Cong Wang, Chengming Zhou



On 5/17/22 11:57 PM, Feng Zhou wrote:
> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>> On Tue, May 17, 2022 at 11:27 PM Feng zhou <zhoufeng.zf@bytedance.com> 
>> wrote:
>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>
>>> We encountered bad case on big system with 96 CPUs that
>>> alloc_htab_elem() would last for 1ms. The reason is that after the
>>> prealloc hashtab has no free elems, when trying to update, it will still
>>> grab spin_locks of all cpus. If there are multiple update users, the
>>> competition is very serious.
>>>
>>> So this patch add is_empty in pcpu_freelist_head to check freelist
>>> having free or not. If having, grab spin_lock, or check next cpu's
>>> freelist.
>>>
>>> Before patch: hash_map performance
>>> ./map_perf_test 1

could you explain what parameter '1' means here?

>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>
>>> hash_map the worst: no free
>>> ./map_perf_test 2048
>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>
>>> ftrace trace
>>>
>>> 0)               |  htab_map_update_elem() {
>>> 0)   0.198 us    |    migrate_disable();
>>> 0)               |    _raw_spin_lock_irqsave() {
>>> 0)   0.157 us    |      preempt_count_add();
>>> 0)   0.538 us    |    }
>>> 0)   0.260 us    |    lookup_elem_raw();
>>> 0)               |    alloc_htab_elem() {
>>> 0)               |      __pcpu_freelist_pop() {
>>> 0)               |        _raw_spin_lock() {
>>> 0)   0.152 us    |          preempt_count_add();
>>> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
>>> 0)   1.065 us    |        }
>>>                   |        ...
>>> 0)               |        _raw_spin_unlock() {
>>> 0)   0.254 us    |          preempt_count_sub();
>>> 0)   0.555 us    |        }
>>> 0) + 25.188 us   |      }
>>> 0) + 25.486 us   |    }
>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>> 0)   0.155 us    |      preempt_count_sub();
>>> 0)   0.454 us    |    }
>>> 0)   0.148 us    |    migrate_enable();
>>> 0) + 28.439 us   |  }
>>>
>>> The test machine is 16C, trying to get spin_lock 17 times, in addition
>>> to 16c, there is an extralist.
>> Is this with small max_entries and a large number of cpus?
>>
>> If so, probably better to fix would be to artificially
>> bump max_entries to be 4x of num_cpus.
>> Racy is_empty check still wastes the loop.
> 
> This hash_map worst testcase with 16 CPUs, map's max_entries is 1000.
> 
>> This is the test case I constructed, it is to fill the map on purpose,
>> and then continue to update, just to reproduce the problem phenomenon.
> 
> The bad case we encountered with 96 CPUs, map's max_entries is 10240.

For such cases, most likely the map is *almost* full. What is the 
performance if we increase map size, e.g., from 10240 to 16K(16192)?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-18 20:39     ` Yonghong Song
@ 2022-05-19  3:12       ` Feng Zhou
  2022-05-19 16:12         ` Alexei Starovoitov
  2022-05-19 16:45         ` Yonghong Song
  0 siblings, 2 replies; 9+ messages in thread
From: Feng Zhou @ 2022-05-19  3:12 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Network Development, bpf, LKML, Xiongchun Duan, Muchun Song,
	Dongdong Wang, Cong Wang, Chengming Zhou

On 2022/5/19 4:39 AM, Yonghong Song wrote:
>
>
> On 5/17/22 11:57 PM, Feng Zhou wrote:
>> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>>> On Tue, May 17, 2022 at 11:27 PM Feng zhou 
>>> <zhoufeng.zf@bytedance.com> wrote:
>>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>>
>>>> We encountered bad case on big system with 96 CPUs that
>>>> alloc_htab_elem() would last for 1ms. The reason is that after the
>>>> prealloc hashtab has no free elems, when trying to update, it will 
>>>> still
>>>> grab spin_locks of all cpus. If there are multiple update users, the
>>>> competition is very serious.
>>>>
>>>> So this patch add is_empty in pcpu_freelist_head to check freelist
>>>> having free or not. If having, grab spin_lock, or check next cpu's
>>>> freelist.
>>>>
>>>> Before patch: hash_map performance
>>>> ./map_perf_test 1
>
> could you explain what parameter '1' means here?

This code is here:
samples/bpf/map_perf_test_user.c
samples/bpf/map_perf_test_kern.c
Parameter '1' is a testcase flag that tests hash_map performance.
Parameter '2048' tests hash_map performance when free=0; that testcase
flag was added by myself to reproduce the problem.

>
>>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>>
>>>> hash_map the worst: no free
>>>> ./map_perf_test 2048
>>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>>
>>>> ftrace trace
>>>>
>>>> 0)               |  htab_map_update_elem() {
>>>> 0)   0.198 us    |    migrate_disable();
>>>> 0)               |    _raw_spin_lock_irqsave() {
>>>> 0)   0.157 us    |      preempt_count_add();
>>>> 0)   0.538 us    |    }
>>>> 0)   0.260 us    |    lookup_elem_raw();
>>>> 0)               |    alloc_htab_elem() {
>>>> 0)               |      __pcpu_freelist_pop() {
>>>> 0)               |        _raw_spin_lock() {
>>>> 0)   0.152 us    |          preempt_count_add();
>>>> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
>>>> 0)   1.065 us    |        }
>>>>                   |        ...
>>>> 0)               |        _raw_spin_unlock() {
>>>> 0)   0.254 us    |          preempt_count_sub();
>>>> 0)   0.555 us    |        }
>>>> 0) + 25.188 us   |      }
>>>> 0) + 25.486 us   |    }
>>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>>> 0)   0.155 us    |      preempt_count_sub();
>>>> 0)   0.454 us    |    }
>>>> 0)   0.148 us    |    migrate_enable();
>>>> 0) + 28.439 us   |  }
>>>>
>>>> The test machine is 16C, trying to get spin_lock 17 times, in addition
>>>> to 16c, there is an extralist.
>>> Is this with small max_entries and a large number of cpus?
>>>
>>> If so, probably better to fix would be to artificially
>>> bump max_entries to be 4x of num_cpus.
>>> Racy is_empty check still wastes the loop.
>>
>> This hash_map worst testcase with 16 CPUs, map's max_entries is 1000.
>>
>> This is the test case I constructed, it is to fill the map on
>> purpose, and then continue to update, just to reproduce the problem
>> phenomenon.
>>
>> The bad case we encountered with 96 CPUs, map's max_entries is 10240.
>
> For such cases, most likely the map is *almost* full. What is the 
> performance if we increase map size, e.g., from 10240 to 16K(16192)?

Yes, increasing max_entries can solve this problem temporarily, but
once the 16k entries are used up, the problem comes back. This patch
tries to fix this corner case.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-19  3:12       ` Feng Zhou
@ 2022-05-19 16:12         ` Alexei Starovoitov
  2022-05-20  3:02           ` Feng Zhou
  2022-05-19 16:45         ` Yonghong Song
  1 sibling, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2022-05-19 16:12 UTC (permalink / raw)
  To: Feng Zhou
  Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Network Development, bpf, LKML, Xiongchun Duan,
	Muchun Song, Dongdong Wang, Cong Wang, Chengming Zhou

On Wed, May 18, 2022 at 8:12 PM Feng Zhou <zhoufeng.zf@bytedance.com> wrote:
>
> On 2022/5/19 4:39 AM, Yonghong Song wrote:
> >
> >
> > On 5/17/22 11:57 PM, Feng Zhou wrote:
> >> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
> >>> On Tue, May 17, 2022 at 11:27 PM Feng zhou
> >>> <zhoufeng.zf@bytedance.com> wrote:
> >>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
> >>>>
> >>>> We encountered bad case on big system with 96 CPUs that
> >>>> alloc_htab_elem() would last for 1ms. The reason is that after the
> >>>> prealloc hashtab has no free elems, when trying to update, it will
> >>>> still
> >>>> grab spin_locks of all cpus. If there are multiple update users, the
> >>>> competition is very serious.
> >>>>
> >>>> So this patch add is_empty in pcpu_freelist_head to check freelist
> >>>> having free or not. If having, grab spin_lock, or check next cpu's
> >>>> freelist.
> >>>>
> >>>> Before patch: hash_map performance
> >>>> ./map_perf_test 1
> >
> > could you explain what parameter '1' means here?
>
> This code is here:
> samples/bpf/map_perf_test_user.c
> samples/bpf/map_perf_test_kern.c
> parameter '1' is the testcase flag for testing hash_map's performance;
> parameter '2048' tests hash_map's performance when free=0.
> I added testcase flag '2048' myself to reproduce the problem.

Please convert it to selftests/bpf/bench,
so that everyone can reproduce the issue you're seeing
and can assess whether it's a real issue or a corner case.

Also please avoid adding indent in the patch.
Instead of
 if (!s->extralist.is_empty) {
  .. churn

do

 if (s->extralist.is_empty)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-19  3:12       ` Feng Zhou
  2022-05-19 16:12         ` Alexei Starovoitov
@ 2022-05-19 16:45         ` Yonghong Song
  2022-05-23  2:24           ` Feng Zhou
  1 sibling, 1 reply; 9+ messages in thread
From: Yonghong Song @ 2022-05-19 16:45 UTC (permalink / raw)
  To: Feng Zhou, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Network Development, bpf, LKML, Xiongchun Duan, Muchun Song,
	Dongdong Wang, Cong Wang, Chengming Zhou



On 5/18/22 8:12 PM, Feng Zhou wrote:
> On 2022/5/19 4:39 AM, Yonghong Song wrote:
>>
>>
>> On 5/17/22 11:57 PM, Feng Zhou wrote:
>>> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>>>> On Tue, May 17, 2022 at 11:27 PM Feng zhou 
>>>> <zhoufeng.zf@bytedance.com> wrote:
>>>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>>>
>>>>> We encountered bad case on big system with 96 CPUs that
>>>>> alloc_htab_elem() would last for 1ms. The reason is that after the
>>>>> prealloc hashtab has no free elems, when trying to update, it will 
>>>>> still
>>>>> grab spin_locks of all cpus. If there are multiple update users, the
>>>>> competition is very serious.
>>>>>
>>>>> So this patch add is_empty in pcpu_freelist_head to check freelist
>>>>> having free or not. If having, grab spin_lock, or check next cpu's
>>>>> freelist.
>>>>>
>>>>> Before patch: hash_map performance
>>>>> ./map_perf_test 1
>>
>> could you explain what parameter '1' means here?
> 
> This code is here:
> samples/bpf/map_perf_test_user.c
> samples/bpf/map_perf_test_kern.c
> parameter '1' is the testcase flag for testing hash_map's performance;
> parameter '2048' tests hash_map's performance when free=0.
> I added testcase flag '2048' myself to reproduce the problem.
> 
>>
>>>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>>>
>>>>> hash_map the worst: no free
>>>>> ./map_perf_test 2048
>>>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>>>
>>>>> ftrace trace
>>>>>
>>>>> 0)               |  htab_map_update_elem() {
>>>>> 0)   0.198 us    |    migrate_disable();
>>>>> 0)               |    _raw_spin_lock_irqsave() {
>>>>> 0)   0.157 us    |      preempt_count_add();
>>>>> 0)   0.538 us    |    }
>>>>> 0)   0.260 us    |    lookup_elem_raw();
>>>>> 0)               |    alloc_htab_elem() {
>>>>> 0)               |      __pcpu_freelist_pop() {
>>>>> 0)               |        _raw_spin_lock() {
>>>>> 0)   0.152 us    |          preempt_count_add();
>>>>> 0)   0.352 us    | native_queued_spin_lock_slowpath();
>>>>> 0)   1.065 us    |        }
>>>>>                   |        ...
>>>>> 0)               |        _raw_spin_unlock() {
>>>>> 0)   0.254 us    |          preempt_count_sub();
>>>>> 0)   0.555 us    |        }
>>>>> 0) + 25.188 us   |      }
>>>>> 0) + 25.486 us   |    }
>>>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>>>> 0)   0.155 us    |      preempt_count_sub();
>>>>> 0)   0.454 us    |    }
>>>>> 0)   0.148 us    |    migrate_enable();
>>>>> 0) + 28.439 us   |  }
>>>>>
>>>>> The test machine is 16C, trying to get spin_lock 17 times, in addition
>>>>> to 16c, there is an extralist.
>>>> Is this with small max_entries and a large number of cpus?
>>>>
>>>> If so, probably better to fix would be to artificially
>>>> bump max_entries to be 4x of num_cpus.
>>>> Racy is_empty check still wastes the loop.
>>>
>>> This hash_map worst-case test was with 16 CPUs and a map max_entries of 1000.
>>>
>>> This is the test case I constructed: it fills the map on purpose and then
>>> keeps updating, just to reproduce the problem.
>>>
>>> The bad case we encountered was with 96 CPUs and a map max_entries of 10240.
>>
>> For such cases, most likely the map is *almost* full. What is the 
>> performance if we increase map size, e.g., from 10240 to 16K(16192)?
> 
> Yes, increasing max_entries can mitigate the problem temporarily, but once
> the 16k entries are used up the problem recurs. This patch tries to fix
> this corner case.

Okay, if I understand correctly, in your use case you have lots of
different keys and your intention is NOT to capture all the keys in
the hash table. So given a hash table, it is possible that the table
will become full even if you increase its size.

Maybe you will occasionally delete some keys, which will free some
space, but that space will quickly be occupied by new updates.

For such cases, yes, checking whether the free list is empty before
taking the lock should be helpful. But I am wondering what the
rationale behind your use case is.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-19 16:12         ` Alexei Starovoitov
@ 2022-05-20  3:02           ` Feng Zhou
  0 siblings, 0 replies; 9+ messages in thread
From: Feng Zhou @ 2022-05-20  3:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Network Development, bpf, LKML, Xiongchun Duan,
	Muchun Song, Dongdong Wang, Cong Wang, Chengming Zhou

On 2022/5/20 12:12 AM, Alexei Starovoitov wrote:
> On Wed, May 18, 2022 at 8:12 PM Feng Zhou <zhoufeng.zf@bytedance.com> wrote:
>> On 2022/5/19 4:39 AM, Yonghong Song wrote:
>>>
>>> On 5/17/22 11:57 PM, Feng Zhou wrote:
>>>> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>>>>> On Tue, May 17, 2022 at 11:27 PM Feng zhou
>>>>> <zhoufeng.zf@bytedance.com> wrote:
>>>>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>>>>
>>>>>> We encountered bad case on big system with 96 CPUs that
>>>>>> alloc_htab_elem() would last for 1ms. The reason is that after the
>>>>>> prealloc hashtab has no free elems, when trying to update, it will
>>>>>> still
>>>>>> grab spin_locks of all cpus. If there are multiple update users, the
>>>>>> competition is very serious.
>>>>>>
>>>>>> So this patch add is_empty in pcpu_freelist_head to check freelist
>>>>>> having free or not. If having, grab spin_lock, or check next cpu's
>>>>>> freelist.
>>>>>>
>>>>>> Before patch: hash_map performance
>>>>>> ./map_perf_test 1
>>> could you explain what parameter '1' means here?
>> This code is here:
>> samples/bpf/map_perf_test_user.c
>> samples/bpf/map_perf_test_kern.c
>> parameter '1' is the testcase flag for testing hash_map's performance;
>> parameter '2048' tests hash_map's performance when free=0.
>> I added testcase flag '2048' myself to reproduce the problem.
> Please convert it to selftests/bpf/bench,
> so that everyone can reproduce the issue you're seeing
> and can assess whether it's a real issue or a corner case.
>
> Also please avoid adding indent in the patch.
> Instead of
>   if (!s->extralist.is_empty) {
>    .. churn
>
> do
>
>   if (s->extralist.is_empty)

Ok, will do. Thanks.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems
  2022-05-19 16:45         ` Yonghong Song
@ 2022-05-23  2:24           ` Feng Zhou
  0 siblings, 0 replies; 9+ messages in thread
From: Feng Zhou @ 2022-05-23  2:24 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Network Development, bpf, LKML, Xiongchun Duan, Muchun Song,
	Dongdong Wang, Cong Wang, Chengming Zhou

On 2022/5/20 12:45 AM, Yonghong Song wrote:
>
>
> On 5/18/22 8:12 PM, Feng Zhou wrote:
>> On 2022/5/19 4:39 AM, Yonghong Song wrote:
>>>
>>>
>>> On 5/17/22 11:57 PM, Feng Zhou wrote:
>>>> On 2022/5/18 2:32 PM, Alexei Starovoitov wrote:
>>>>> On Tue, May 17, 2022 at 11:27 PM Feng zhou 
>>>>> <zhoufeng.zf@bytedance.com> wrote:
>>>>>> From: Feng Zhou <zhoufeng.zf@bytedance.com>
>>>>>>
>>>>>> We encountered bad case on big system with 96 CPUs that
>>>>>> alloc_htab_elem() would last for 1ms. The reason is that after the
>>>>>> prealloc hashtab has no free elems, when trying to update, it 
>>>>>> will still
>>>>>> grab spin_locks of all cpus. If there are multiple update users, the
>>>>>> competition is very serious.
>>>>>>
>>>>>> So this patch add is_empty in pcpu_freelist_head to check freelist
>>>>>> having free or not. If having, grab spin_lock, or check next cpu's
>>>>>> freelist.
>>>>>>
>>>>>> Before patch: hash_map performance
>>>>>> ./map_perf_test 1
>>>
>>> could you explain what parameter '1' means here?
>>
>> This code is here:
>> samples/bpf/map_perf_test_user.c
>> samples/bpf/map_perf_test_kern.c
>> parameter '1' is the testcase flag for testing hash_map's performance;
>> parameter '2048' tests hash_map's performance when free=0.
>> I added testcase flag '2048' myself to reproduce the problem.
>>
>>>
>>>>>> 0:hash_map_perf pre-alloc 975345 events per sec
>>>>>> 4:hash_map_perf pre-alloc 855367 events per sec
>>>>>> 12:hash_map_perf pre-alloc 860862 events per sec
>>>>>> 8:hash_map_perf pre-alloc 849561 events per sec
>>>>>> 3:hash_map_perf pre-alloc 849074 events per sec
>>>>>> 6:hash_map_perf pre-alloc 847120 events per sec
>>>>>> 10:hash_map_perf pre-alloc 845047 events per sec
>>>>>> 5:hash_map_perf pre-alloc 841266 events per sec
>>>>>> 14:hash_map_perf pre-alloc 849740 events per sec
>>>>>> 2:hash_map_perf pre-alloc 839598 events per sec
>>>>>> 9:hash_map_perf pre-alloc 838695 events per sec
>>>>>> 11:hash_map_perf pre-alloc 845390 events per sec
>>>>>> 7:hash_map_perf pre-alloc 834865 events per sec
>>>>>> 13:hash_map_perf pre-alloc 842619 events per sec
>>>>>> 1:hash_map_perf pre-alloc 804231 events per sec
>>>>>> 15:hash_map_perf pre-alloc 795314 events per sec
>>>>>>
>>>>>> hash_map the worst: no free
>>>>>> ./map_perf_test 2048
>>>>>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>>>>>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>>>>>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>>>>>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>>>>>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>>>>>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>>>>>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>>>>>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>>>>>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>>>>>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>>>>>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>>>>>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>>>>>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>>>>>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>>>>>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>>>>>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>>>>>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>>>>>
>>>>>> ftrace trace
>>>>>>
>>>>>> 0)               |  htab_map_update_elem() {
>>>>>> 0)   0.198 us    |    migrate_disable();
>>>>>> 0)               |    _raw_spin_lock_irqsave() {
>>>>>> 0)   0.157 us    |      preempt_count_add();
>>>>>> 0)   0.538 us    |    }
>>>>>> 0)   0.260 us    |    lookup_elem_raw();
>>>>>> 0)               |    alloc_htab_elem() {
>>>>>> 0)               |      __pcpu_freelist_pop() {
>>>>>> 0)               |        _raw_spin_lock() {
>>>>>> 0)   0.152 us    |          preempt_count_add();
>>>>>> 0)   0.352 us    | native_queued_spin_lock_slowpath();
>>>>>> 0)   1.065 us    |        }
>>>>>>                   |        ...
>>>>>> 0)               |        _raw_spin_unlock() {
>>>>>> 0)   0.254 us    |          preempt_count_sub();
>>>>>> 0)   0.555 us    |        }
>>>>>> 0) + 25.188 us   |      }
>>>>>> 0) + 25.486 us   |    }
>>>>>> 0)               |    _raw_spin_unlock_irqrestore() {
>>>>>> 0)   0.155 us    |      preempt_count_sub();
>>>>>> 0)   0.454 us    |    }
>>>>>> 0)   0.148 us    |    migrate_enable();
>>>>>> 0) + 28.439 us   |  }
>>>>>>
>>>>>> The test machine is 16C, trying to get spin_lock 17 times, in 
>>>>>> addition
>>>>>> to 16c, there is an extralist.
>>>>> Is this with small max_entries and a large number of cpus?
>>>>>
>>>>> If so, probably better to fix would be to artificially
>>>>> bump max_entries to be 4x of num_cpus.
>>>>> Racy is_empty check still wastes the loop.
>>>>
>>>> This hash_map worst-case test was with 16 CPUs and a map max_entries of 1000.
>>>>
>>>> This is the test case I constructed: it fills the map on purpose and then
>>>> keeps updating, just to reproduce the problem.
>>>>
>>>> The bad case we encountered was with 96 CPUs and a map max_entries of 10240.
>>>
>>> For such cases, most likely the map is *almost* full. What is the 
>>> performance if we increase map size, e.g., from 10240 to 16K(16192)?
>>
>> Yes, increasing max_entries can mitigate the problem temporarily, but once
>> the 16k entries are used up the problem recurs. This patch tries to fix
>> this corner case.
>
> Okay, if I understand correctly, in your use case you have lots of
> different keys and your intention is NOT to capture all the keys in
> the hash table. So given a hash table, it is possible that the table
> will become full even if you increase its size.
>
> Maybe you will occasionally delete some keys, which will free some
> space, but that space will quickly be occupied by new updates.
>
> For such cases, yes, checking whether the free list is empty before
> taking the lock should be helpful. But I am wondering what the
> rationale behind your use case is.

My use case is monitoring the server's network traffic, using the
five-tuple as the key. When there is a surge in network traffic, the
hash_map can become full.





^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-05-23  2:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-18  6:27 [PATCH] bpf: avoid grabbing spin_locks of all cpus when no free elems Feng zhou
2022-05-18  6:32 ` Alexei Starovoitov
2022-05-18  6:57   ` [External] " Feng Zhou
2022-05-18 20:39     ` Yonghong Song
2022-05-19  3:12       ` Feng Zhou
2022-05-19 16:12         ` Alexei Starovoitov
2022-05-20  3:02           ` Feng Zhou
2022-05-19 16:45         ` Yonghong Song
2022-05-23  2:24           ` Feng Zhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).