From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Hou Tao <houtao@huaweicloud.com>
Cc: bpf <bpf@vger.kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	 Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>,  Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	 Stanislav Fomichev <sdf@google.com>,
	Jiri Olsa <jolsa@kernel.org>,
	 John Fastabend <john.fastabend@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	rcu@vger.kernel.org,  "houtao1@huawei.com" <houtao1@huawei.com>
Subject: Re: [RFC PATCH bpf-next v4 0/3] Handle immediate reuse in bpf memory allocator
Date: Wed, 7 Jun 2023 16:23:20 -0700
Message-ID: <CAADnVQJ1njnHb96HfO4k48XDY9L3YXqQW1iUW=ti5iBNKKcE9A@mail.gmail.com>
In-Reply-To: <CAADnVQJMM2ueRoDMmmBsxb_chPFr_WCH34tyiYQiwphnDhyuGw@mail.gmail.com>

On Wed, Jun 7, 2023 at 1:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 7, 2023 at 10:52 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jun 07, 2023 at 04:42:11PM +0800, Hou Tao wrote:
> > > As said in the commit message, the command line for the test is
> > > "./map_perf_test 4 8 16384", because the default max_entries is 1000. If
> > > using the default max_entries and the number of CPUs is greater than 15,
> > > use_percpu_counter will be false.
> >
> > Right. percpu or not depends on number of cpus.
> >
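For reference, the heuristic in htab_map_alloc() (kernel/bpf/hashtab.c) is
roughly the following, paraphrased from memory, so double check the exact
form; with PERCPU_COUNTER_BATCH being 32 and the default 1000 max_entries it
turns false once num_online_cpus() exceeds ~15, matching the observation
above:

	if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
		htab->use_percpu_counter = true;
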
> > >
> > > I have double-checked my local VM setup (8 CPUs + 16GB) and reran the
> > > test.  For both "./map_perf_test 4 8" and "./map_perf_test 4 8 16384"
> > > there is an obvious performance degradation.
> > ...
> > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > 2:hash_map_perf kmalloc 359201 events per sec
> > ..
> > > [root@hello bpf]# ./map_perf_test 4 8 16384
> > > 4:hash_map_perf kmalloc 203983 events per sec
> >
> > this is indeed a degradation in a VM.
> >
> > > I also ran map_perf_test on a physical x86-64 host with 72 CPUs. The
> > > performance for "./map_perf_test 4 8" is similar, but there is an obvious
> > > performance degradation for "./map_perf_test 4 8 16384"
> >
> > but... a degradation?
> >
> > > Before reuse-after-rcu-gp:
> > >
> > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > 1:hash_map_perf kmalloc 388088 events per sec
> > ...
> > > After reuse-after-rcu-gp:
> > > [houtao@fedora bpf]$ sudo ./map_perf_test 4 8 16384
> > > 5:hash_map_perf kmalloc 655628 events per sec
> >
> > This is a big improvement :) Not a degradation.
> > You always have to double-check the numbers with perf report.
> >
> > > So could you please double-check your setup and rerun map_perf_test? If
> > > there is no performance degradation, could you please share your setup
> > > and your kernel config file?
> >
> > I'm testing on a normal no-debug kernel. No kasan. No lockdep. HZ=1000
> > Playing with it a bit more I found something interesting:
> > map_perf_test 4 8 16348
> > the before/after numbers have too much noise to be conclusive.
> >
> > So I did
> > map_perf_test 4 8 16348 1000000
> >
> > and now I see a significant degradation from patch 3.
> > It drops from 800k to 200k.
> > And perf report confirms that heavy contention on sc->reuse_lock is the culprit.
> > The following hack addresses most of the perf degradation:
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index fea1cb0c78bb..eeadc9359097 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -188,7 +188,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >         alloc = 0;
> >         head = NULL;
> >         tail = NULL;
> > -       raw_spin_lock_irqsave(&sc->reuse_lock, flags);
> > +       if (raw_spin_trylock_irqsave(&sc->reuse_lock, flags)) {
> >         while (alloc < cnt) {
> >                 obj = __llist_del_first(&sc->reuse_ready_head);
> >                 if (obj) {
> > @@ -206,6 +206,7 @@ static int bpf_ma_get_reusable_obj(struct bpf_mem_cache *c, int cnt)
> >                 alloc++;
> >         }
> >         raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> > +       }
> >
> >         if (alloc) {
> >                 if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > @@ -334,9 +335,11 @@ static void bpf_ma_add_to_reuse_ready_or_free(struct bpf_mem_cache *c)
> >                 sc->reuse_ready_tail = NULL;
> >                 WARN_ON_ONCE(!llist_empty(&sc->wait_for_free));
> >                 __llist_add_batch(head, tail, &sc->wait_for_free);
> > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >                 call_rcu_tasks_trace(&sc->rcu, free_rcu);
> > +       } else {
> > +               raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >         }
> > -       raw_spin_unlock_irqrestore(&sc->reuse_lock, flags);
> >  }
> >
> > It now drops from 800k to 450k instead of 200k.
> > And perf report shows that reuse is happening, but slab is still working hard to satisfy kmalloc/kfree.
> > So we may consider per-cpu waiting_for_rcu_gp and per-bpf-ma waiting_for_rcu_task_trace_gp lists.
>
> Sorry. per-cpu waiting_for_rcu_gp is what patch 3 does already.
> I meant per-cpu reuse_ready and per-bpf-ma waiting_for_rcu_task_trace_gp.

An update..

I tweaked patch 3 to do per-cpu reuse_ready and it addressed
the lock contention, but the cache miss rate on
__llist_del_first(&c->reuse_ready_head)
was still very high and performance was still at 450k, the same
as with the simple trylock hack above.
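
Roughly, the tweak moves the shared lists into the per-cpu bpf_mem_cache along
these lines (a sketch only; field placement and names are illustrative and not
the exact patch, and the wait_for_free list could instead live per-bpf-ma as
suggested above):

struct bpf_mem_cache {
	/* ... existing free_llist, etc. ... */

	/* elements whose regular RCU GP has already elapsed; only this cpu
	 * allocates from here, so the alloc path needs no cross-cpu lock
	 */
	struct llist_head reuse_ready_head;

	/* elements still waiting for the RCU tasks trace GP before they
	 * can be freed back to slab
	 */
	struct llist_head wait_for_free;
};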

Then I removed some of the _tail optimizations and added counters
to these llists.
To my surprise
map_perf_test 4 1 16348 1000000
was showing ~200k elements on average in waiting_for_gp when reuse_rcu() is called
and ~400k elements sitting in reuse_ready_head.
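
The counters were nothing fancy; a sketch of the kind of instrumentation meant
here, where the *_cnt field is hypothetical and not in the posted series:

static void reuse_ready_push(struct bpf_mem_cache *c, struct llist_node *llnode)
{
	__llist_add(llnode, &c->reuse_ready_head);
	atomic_inc(&c->reuse_ready_cnt);	/* debug counter only */
}

static struct llist_node *reuse_ready_pop(struct bpf_mem_cache *c)
{
	struct llist_node *llnode = __llist_del_first(&c->reuse_ready_head);

	if (llnode)
		atomic_dec(&c->reuse_ready_cnt);
	return llnode;
}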

Then I noticed that we should be doing:
call_rcu_hurry(&c->rcu, reuse_rcu);
instead of call_rcu(),
but my config didn't have RCU_LAZY enabled, so that didn't help.
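
I.e. just this one-line change against patch 3 (hunk context is approximate):

-	call_rcu(&c->rcu, reuse_rcu);
+	call_rcu_hurry(&c->rcu, reuse_rcu);

call_rcu_hurry() only differs from call_rcu() when CONFIG_RCU_LAZY=y, where
plain call_rcu() may batch lazy callbacks for a long time and keep even more
elements parked in waiting_for_gp.
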
Obviously we cannot allow such a huge number of elements to sit
in these linked lists.
The whole "reuse-after-rcu-gp" idea for bpf_mem_alloc may not work.
To unblock qp-trie work I suggest to add rcu_head to each inner node
and do call_rcu() on them before free-ing them to bpf_mem_alloc.
Explicit call_rcu would disqualify qp-tree from tracing programs though :(
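
Something along these lines (a sketch only; the struct and function names are
made up for illustration, and bpf_mem_cache_free() assumes the fixed-size
flavour of the allocator):

struct qptrie_inode {
	struct rcu_head rcu;
	struct bpf_mem_alloc *ma;	/* allocator the node came from */
	/* ... branch bitmap, child pointers ... */
};

static void qptrie_inode_free_rcu(struct rcu_head *rcu)
{
	struct qptrie_inode *node = container_of(rcu, struct qptrie_inode, rcu);

	/* a regular RCU GP has elapsed; bpf_mem_alloc may reuse the memory */
	bpf_mem_cache_free(node->ma, node);
}

static void qptrie_inode_free(struct qptrie_inode *node)
{
	/* lookups under rcu_read_lock() may still be walking this node */
	call_rcu(&node->rcu, qptrie_inode_free_rcu);
}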
