From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: bpf@vger.kernel.org, Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org, "houtao1@huawei.com" <houtao1@huawei.com>
Subject: Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
Date: Wed, 7 Jun 2023 10:52:24 -0700
Message-ID: <20230607175224.oqezpaztsb5hln2s@MacBook-Pro-8.local>
In-Reply-To: <9d17ed7f-1726-d894-9f74-75ec9702ca7e@huaweicloud.com>

On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> As said in the commit message, the command line for the test is
> "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> using the default max_entries and the number of CPUs is greater than 15,
> use_percpu_counter will be false.

Right, whether it uses a percpu counter or not depends on the number of CPUs.
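
(For reference, the check at map creation time is roughly the following.
This is a paraphrased sketch of the heuristic in kernel/bpf/hashtab.c from
memory, so treat the exact constant as an assumption:)

/* With the default max_entries=1000 and a batch of 32, the condition below
 * is false once num_online_cpus() > 15, matching the observation above,
 * and the map falls back to a plain atomic counter.
 */
if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
	htab->use_percpu_counter = true;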

> 
> I have double-checked my local VM setup (8 CPUs + 16GB) and reran the
> test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> there is obvious performance degradation.
...
> [root@hello bpf]# ./map_perf_test 4 8 16384
> 2:hash_map_perf kmalloc 359201 events per sec
..
> [root@hello bpf]# ./map_perf_test 4 8 16384
> 4:hash_map_perf kmalloc 203983 events per sec

This is indeed a degradation in a VM.

> I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
> performance for "./map_perf_test 4 8" is similar, but there is obvious
> performance degradation for "./map_perf_test 4 8 16384"

but... is that really a degradation?

> Before reuse-after-rcu-gp:
>
> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> 1:hash_map_perf kmalloc 388088 events per sec
...
> After reuse-after-rcu-gp:
> [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> 5:hash_map_perf kmalloc 655628 events per sec

This is a big improvement :) Not a degradation.
You always have to double-check the numbers with perf report.

> So could you please double-check your setup and rerun map_perf_test? If
> there is no performance degradation, could you please share your setup
> and your kernel config file?

I'm testing on a normal non-debug kernel: no KASAN, no lockdep, HZ=1000.
Playing with it a bit more I found something interesting:
map_perf_test 4 8 16384
before/after has too much noise to be conclusive.

So I did
map_perf_test 4 8 16384 1000000

and now I see a significant degradation from patch 3.
It drops from 800k to 200k.
And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
The following hack addresses most of the perf degradation:

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index fea1cb0c78bb..eeadc9359097 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
        alloc = 0;
        head = NULL;
        tail = NULL;
-       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
+       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
        while (alloc < cnt) {
                obj = __llist_del_first(&sc->reuse_ready_head);
                if (obj) {
@@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
                alloc++;
        }
        raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
+       }

        if (alloc) {
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
@@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
                sc->reuse_ready_tail = NULL;
                WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
                __llist_add_batch(head, tail, &sc->wait_for_free);
+               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
                call_rcu_tasks_trace(&sc->rcu, free_rcu);
+       } else {
+               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
        }
-       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
 }

It now drops from 800k to 450k.
And perf report shows that both reuse is happening and slab is working hard to satisfy kmalloc/kfree.
So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
For now I still prefer to see v5 with per-bpf-ma and no _tail optimization.
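
A rough sketch of that layout (the struct and field names below are
illustrative, not taken from the patch):

/* Illustrative only: objects freed on a cpu wait for a normal RCU GP in a
 * per-cpu list (no cross-cpu lock contention), while objects that also need
 * the slower RCU tasks trace GP are funneled into one per-bpf-ma list.
 */
struct bpf_mem_cache {
	/* ... existing per-cpu fields ... */
	struct llist_head waiting_for_rcu_gp;
};

struct bpf_mem_alloc {
	/* ... existing fields ... */
	raw_spinlock_t wait_lock;
	struct llist_head waiting_for_rcu_task_trace_gp;
};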

Answering your other email:

> I see your point. I will continue to debug the memory usage difference
> between v3 and v4.

IMO it's a waste of time to continue analyzing performance based on the bench in patch 2.

> I don't think so. Let's consider the per-cpu list first. Assume the
> normal RCU grace period is about 30ms and we are tracing the IO latency
> of a normal SSD. The IOPS is about 176K per second, so before one RCU
> GP has passed, we will need to allocate about 176 * 30 ≈ 5.3K elements.
> For the per-ma list, when the number of CPUs increases, it is easy to
> make the list contain thousands of elements.

That would be true only if there were no scheduling events in all of those 176K ops,
which is not the case.
I'm not sure why you're saying that RCU GP is 30ms.
In CONFIG_PREEMPT_NONE rcu_read_lock/unlock are true nops.
Every sched event is sort-of implicit rcu_read_lock/unlock.
Network and block IO don't process 176K ops without a resched.
I don't know how the block layer does it, but in networking NAPI will process 64 packets and then yield the softirq.

For small-size buckets low_watermark=32 and high_watermark=96.
We typically move 32 elements at a time from one list to another.
A bunch of elements may be sitting in free_by_rcu, and moving them to waiting_for_gp
is not instant, but once __free_rcu_tasks_trace is called we need to take
the elements from waiting_for_gp and kfree them one at a time.
So optimizing the move from free_by_rcu into waiting_for_gp is not worth the code complexity.
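
(For context, the watermark-driven refill in upstream memalloc.c, before this
series, looks roughly like the sketch below; simplified and from memory, so
details may differ:)

/* Refill in batches when below the low watermark, flush in batches when
 * above the high watermark; the irq_work runs on the cpu owning the cache.
 */
static void bpf_mem_refill(struct irq_work *work)
{
	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
	int cnt;

	/* Racy read of free_cnt is fine, it only steers the heuristic */
	cnt = c->free_cnt;
	if (cnt < c->low_watermark)
		alloc_bulk(c, c->batch, NUMA_NO_NODE);
	else if (cnt > c->high_watermark)
		free_bulk(c);
}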

> Before I post v5, I want to know the reason why the per-bpf-ma list is
> introduced. Previously, I thought it was used to handle the case in which
> allocation and freeing are done on different CPUs.

Correct. The per-bpf-ma list is necessary to avoid OOM-ing due to the slow rcu_tasks_trace GP.

> And as we can see
> from the benchmark results above and in v3, the performance and the
> memory usage of v4 for add_del_on_diff_cpu are better than v3's.

The bench from patch 2 is invalid, hence no conclusion can be drawn from it.

So far the only bench we can trust and analyze is map_perf_test.
Please make the bench in patch 2 yield the cpu after a few updates.
Earlier I suggested sticking to 10, but since NAPI can do 64 at a time,
64 updates is realistic too. A thousand is not.
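
Something along these lines in the producer would do. This is only an
illustrative userspace sketch, not the actual bench code; the map fd, the
key/value layout and the stop flag are assumptions:

#include <sched.h>
#include <bpf/bpf.h>

/* Do a NAPI-sized batch of update/delete pairs, then yield the cpu so that
 * RCU grace periods and other tasks can make progress between batches.
 */
static void producer(int map_fd, volatile int *stop)
{
	unsigned long key = 0, value = 42;

	while (!*stop) {
		for (int i = 0; i < 64; i++, key++) {
			bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
			bpf_map_delete_elem(map_fd, &key);
		}
		sched_yield();
	}
}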

