bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Anton Protopopov <aspsk@isovalent.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>,
	bpf <bpf@vger.kernel.org>, Joe Stringer <joe@isovalent.com>,
	John Fastabend <john.fastabend@gmail.com>
Subject: Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
Date: Wed, 28 Jun 2023 13:17:23 +0000	[thread overview]
Message-ID: <ZJwy478jHkxYNVMc@zh-lab-node-5> (raw)
In-Reply-To: <CAADnVQLXFyhACfZP3bze8PUa43Fnc-Nn_PDGYX2vOq3i8FqKbA@mail.gmail.com>

Hi Alexey, hi Martin,

On Thu, Jun 01, 2023 at 11:24:20AM -0700, Alexei Starovoitov wrote:
> On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > LRU logic doesn't kick in until the map is full.
> >
> > In fact, it can: a reproducable example is in the self-test from this patch
> > series. In the test N threads try to insert random values for keys 1..3000
> > simultaneously. As the result, the map may contain any number of elements,
> > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > So a user can't really even closely estimate the number of elements in the LRU
> > map based on the number of updates (with unique keys). A per-cpu counter
> > inc/dec'ed from the kernel side would solve this.
> That's odd and unexpected.
> Definitely something to investigate and fix in the LRU map.
> Pls cc Martin in the future.

I've looked into this a bit more and the problem is as follows.

LRU maps allocate MAX_ENTRIES elements and put them in the global free list.
Then each CPU will try to get memory in 128 elements chunks into their own
local free lists.

The user expectation is that evictions start when the map is full, however, on
practice we start evicting elements when capacity reaches about (MAX_ENTRIES -
NCPUS*128) elements. This happens because when one CPU have used its local
free-list, it gets to the global lists. While there could be another
(NCPUS-1)*128 free elements in local free lists of other CPUs, our CPU goes to
the global free list, which is empty, and then starts to evict elements from
active/inactive lists (a 128 elements chunk). Then this can happen for another
active CPU, etc.

This looks like not a problem for big maps, where (NCPUS*128) is not a big %%
of the total map capacity. For smaller maps this may be unexpected (I first
noticed this on a 4K map where after updating 4K keys map capacity was about
200-300 elements).

My first attempt to fix this was to just increase nr_entries allocated for the
map by NCPUS*128, which makes evictions to start happening at MAX_ENTRIES.  But
soon I've realized that in such way users can get more than MAX_ENTRIES inside
a map, which is unexpected as well (say, when dumping a map in a location of
MAX_ENTRIES size or syncing entries with another map of MAX_ENTRIES capacity).

I also briefly looked into allowing to call the prealloc_lru_pop() function
under a bucket lock (by passing the currently locked bucket to it so that this
pointer is passed all the way to the htab_lru_map_delete_node() function which
may then bypass locking the bucket if it is the same one). Looks like this
works, but I didn't have time to understand if this breaks the LRU architecture
badly or not.

  parent reply	other threads:[~2023-06-28 13:16 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-31 11:05 [PATCH bpf-next 0/2] add mechanism to report map pressure Anton Protopopov
2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
2023-05-31 18:24   ` Alexei Starovoitov
2023-06-01  7:31     ` Anton Protopopov
2023-06-01 16:39       ` Alexei Starovoitov
2023-06-01 18:18         ` Anton Protopopov
2023-06-01 18:24           ` Alexei Starovoitov
2023-06-02  0:40             ` Alexei Starovoitov
2023-06-02 14:21               ` Anton Protopopov
2023-06-02 16:23                 ` Alexei Starovoitov
2023-06-06  7:49                   ` Anton Protopopov
2023-06-24  0:00                   ` John Fastabend
2023-06-26 16:29                     ` Anton Protopopov
2023-06-28 13:17             ` Anton Protopopov [this message]
2023-06-01  0:44   ` kernel test robot
2023-06-01  7:50     ` Anton Protopopov
2023-06-13  8:23       ` Yujie Liu
2023-06-13  8:30         ` Anton Protopopov
2023-05-31 11:05 ` [PATCH bpf-next 2/2] selftests/bpf: test map pressure Anton Protopopov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZJwy478jHkxYNVMc@zh-lab-node-5 \
    --to=aspsk@isovalent.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=joe@isovalent.com \
    --cc=john.fastabend@gmail.com \
    --cc=martin.lau@kernel.org \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).