linux-kernel.vger.kernel.org archive mirror
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too
Date: Mon, 18 Oct 2021 21:43:27 -0700
Message-ID: <CAADnVQ+ijmng_s1EP_qTG3Xsvg6v5EWLvP9PTFEH0vLnyJUtRg@mail.gmail.com>
In-Reply-To: <36b27bba-e20b-8fd4-1436-d2d4c0e86896@bytedance.com>

On Mon, Oct 18, 2021 at 9:31 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2021/10/19 11:45 AM, Alexei Starovoitov wrote:
> > On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
> >>> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
> >>> <zhouchengming@bytedance.com> wrote:
> >>>>
> >>>> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
> >>>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
> >>>>> <zhouchengming@bytedance.com> wrote:
> >>>>>>
> >>>>>> We only use count for kmalloc hashtab, not for prealloc hashtab, because
> >>>>>> __pcpu_freelist_pop() returns NULL when there are no more elements in
> >>>>>> the pcpu freelist.
> >>>>>>
> >>>>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
> >>>>>> take the spin_lock of every CPU, only to find at the end that there are
> >>>>>> no more elements.
> >>>>>>
> >>>>>> We encountered a bad case on a big system with 96 CPUs where
> >>>>>> alloc_htab_elem() would take 1ms. This patch uses count for prealloc
> >>>>>> hashtab too, to avoid traversing and spin_locking all CPUs in this case.
> >>>>>>
> >>>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >>>>>
> >>>>> It's not clear from the commit log what you're solving.
> >>>>> The atomic inc/dec in critical path of prealloc maps hurts performance.
> >>>>> That's why it's not used.
> >>>>>
> >>>> Thanks for the explanation. What I'm solving is that when the hash table
> >>>> has no free elements, we don't need to call __pcpu_freelist_pop() to
> >>>> traverse and spin_lock all CPUs. The ftrace output of this bad case is below:
> >>>>
> >>>>  50)               |  htab_map_update_elem() {
> >>>>  50)   0.329 us    |    _raw_spin_lock_irqsave();
> >>>>  50)   0.063 us    |    lookup_elem_raw();
> >>>>  50)               |    alloc_htab_elem() {
> >>>>  50)               |      pcpu_freelist_pop() {
> >>>>  50)   0.209 us    |        _raw_spin_lock();
> >>>>  50)   0.264 us    |        _raw_spin_lock();
> >>>
> >>> This is LRU map. Not hash map.
> >>> It will grab spin_locks of other cpus
> >>> only if all previous cpus don't have free elements.
> >>> Most likely your map is actually full and doesn't have any free elems.
> >>> Since it's an lru it will force free an elem eventually.
> >>>
> >>
> >> Maybe I missed something, but the map_update_elem function of the LRU map is
> >> htab_lru_map_update_elem(), and the htab_map_update_elem() above is the
> >> map_update_elem function of the hash map.
> >> Because of the implementation of the percpu freelist used in the hash map, it
> >> will spin_lock all other CPUs when there are no free elements.
> >
> > Ahh. Right. Then what's the point of optimizing the error case
> > at the expense of the fast path?
> >
>
> Yes, this patch only optimizes the very bad case where no free elements are left,
> and adds an atomic operation to the fast path. Maybe the better workaround is
> to not allow a full hash map in our case, or to use an LRU map?

No idea, since you haven't explained your use case.

> But the problem of spinlock contention also exists even when the map is not full:
> if some CPUs run out of their freelist while other CPUs are seldom used, they have to
> grab those CPUs' spinlocks to get a free element.

In theory that would be correct. Do you see it in practice?
Please describe the use case.

> Should we change the current implementation of percpu freelist to percpu lockless freelist?

Like llist.h ? That was tried already and for typical hash map usage
it's slower than percpu free list.
Many progs still do a lot of hash map update/delete on all cpus at once.
That is the use case the hash map is optimized for.
Please see commit 6c9059817432 ("bpf: pre-allocate hash map elements")
that also lists different alternative algorithms that were benchmarked.
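
The trade-off discussed above can be sketched in userspace C. This is a
simplified model under stated assumptions, not the kernel's code: the names
(struct map_model, alloc_elem_scan, alloc_elem_counted) are hypothetical, a
plain counter stands in for each per-CPU spinlock, and NR_CPUS is fixed at 96
to match the system mentioned in the thread. It shows that a pop on an empty
freelist touches every CPU's lock, while a count check fails early but costs
an atomic operation on every allocation, including the fast path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 96 /* assumption: the 96-CPU system from the thread */

struct node { struct node *next; };

struct cpu_list {
	struct node *first;
	unsigned long lock_acquisitions; /* stands in for a per-CPU spinlock */
};

struct freelist { struct cpu_list cpu[NR_CPUS]; };

/* Hypothetical model of a preallocated map: freelist plus optional count. */
struct map_model {
	struct freelist fl;
	atomic_long count;
	long max_entries;
};

static void freelist_push(struct freelist *fl, int cpu, struct node *n)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	n->next = fl->cpu[cpu].first;
	fl->cpu[cpu].first = n;
}

/* Start at this_cpu, then visit every other CPU, "locking" each one. */
static struct node *freelist_pop(struct freelist *fl, int this_cpu)
{
	for (int i = 0; i < NR_CPUS; i++) {
		struct cpu_list *head = &fl->cpu[(this_cpu + i) % NR_CPUS];
		struct node *n;

		head->lock_acquisitions++; /* where the lock would be taken */
		n = head->first;
		if (n)
			head->first = n->next;
		if (n)
			return n;
	}
	return NULL; /* every list was locked and found empty */
}

/* Without a count: a full map is only detected after the whole scan. */
static struct node *alloc_elem_scan(struct map_model *m, int cpu)
{
	return freelist_pop(&m->fl, cpu);
}

/* With a count: a full map fails early, but every call pays for an atomic. */
static struct node *alloc_elem_counted(struct map_model *m, int cpu)
{
	if (atomic_fetch_add(&m->count, 1) + 1 > m->max_entries) {
		atomic_fetch_sub(&m->count, 1);
		return NULL;
	}
	return freelist_pop(&m->fl, cpu);
}

static unsigned long total_locks(const struct freelist *fl)
{
	unsigned long sum = 0;

	for (int i = 0; i < NR_CPUS; i++)
		sum += fl->cpu[i].lock_acquisitions;
	return sum;
}
```

In this model, calling alloc_elem_scan() on a map with no free elements
raises total_locks() by NR_CPUS, while alloc_elem_counted() leaves it
unchanged. The model says nothing about the cost of the shared atomic under
contention, which is the fast-path concern raised earlier in the thread.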

Thread overview: 9+ messages
2021-10-15  9:03 [PATCH] bpf: use count for prealloc hashtab too Chengming Zhou
2021-10-15 19:58 ` Alexei Starovoitov
2021-10-18  5:49   ` [External] " Chengming Zhou
2021-10-19  1:57     ` Alexei Starovoitov
2021-10-19  3:14       ` Chengming Zhou
2021-10-19  3:45         ` Alexei Starovoitov
2021-10-19  4:31           ` Chengming Zhou
2021-10-19  4:43             ` Alexei Starovoitov [this message]
2021-10-19  5:11               ` Chengming Zhou