From: Mateusz Guzik <mjguzik@gmail.com>
To: Dennis Zhou <dennis@kernel.org>
Cc: linux-kernel@vger.kernel.org, tj@kernel.org, cl@linux.com,
	akpm@linux-foundation.org, shakeelb@google.com,
	linux-mm@kvack.org, jack@suse.cz
Subject: Re: [PATCH 0/2] execve scalability issues, part 1
Date: Tue, 22 Aug 2023 00:29:49 +0200	[thread overview]
Message-ID: <CAGudoHHJECp2-DfSr5hudooAdV6mivvSO+4mC9kwUrWnSiob5g@mail.gmail.com> (raw)
In-Reply-To: <20230821213951.bx3yyqh7omdvpyae@f>

On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
>> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
>> > With this out of the way I'll be looking at some form of caching to
>> > eliminate these allocs as a problem.
>> >
>>
>> I'm not against caching, this is just my first thought. Caching will
>> have an impact on the backing pages of percpu. All it takes is 1
>> allocation on a page for the current allocator to pin n pages of memory.
>> A few years ago percpu depopulation was implemented, which limits the
>> amount of resident backing pages.
>>
>
> I'm painfully aware.
>
>> Maybe the right thing to do is preallocate pools of common-sized
>> allocations so they can be recycled, so that we don't have to think
>> too hard about the fragmentation that can occur if we populate these
>> pools over time?
>>
>
> This is what I was going to suggest :)
>
> FreeBSD has a per-cpu allocator which pretends to be the same as the
> slab allocator, except it hands out per-cpu bufs. So far it has size
> classes of 4, 8, 16, 32 and 64 bytes, and you can malloc as if in any
> of those sizes.
>
> Scales perfectly fine of course since it caches objs per-CPU, but there
> is some waste and I have 0 idea how it compares to what Linux is doing
> on that front.
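
To illustrate, here is a minimal C sketch of such a size-class front
end. All names are made up and this is neither the FreeBSD nor the
Linux code, just the shape of the idea (lock initialization at boot is
omitted):

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

#define PCPU_NR_CLASSES 5

static const size_t pcpu_class_size[PCPU_NR_CLASSES] = { 4, 8, 16, 32, 64 };

struct pcpu_buf_cache {
	spinlock_t lock;		/* one global lock per size class */
	int nr_free;
	void __percpu *free[32];	/* recycled per-cpu allocations */
};

static struct pcpu_buf_cache pcpu_cache[PCPU_NR_CLASSES];

/* Map a request to the smallest size class that fits, -1 if too big. */
static int pcpu_size_to_class(size_t size)
{
	int i;

	for (i = 0; i < PCPU_NR_CLASSES; i++)
		if (size <= pcpu_class_size[i])
			return i;
	return -1;
}

static void __percpu *pcpu_buf_alloc(size_t size)
{
	int class = pcpu_size_to_class(size);
	struct pcpu_buf_cache *c;
	void __percpu *p = NULL;

	if (class < 0)	/* too big to cache, fall back */
		return __alloc_percpu(size, __alignof__(unsigned long));

	c = &pcpu_cache[class];
	spin_lock(&c->lock);
	if (c->nr_free > 0)
		p = c->free[--c->nr_free];
	spin_unlock(&c->lock);

	return p ? p : __alloc_percpu(pcpu_class_size[class],
				      __alignof__(unsigned long));
}

static void pcpu_buf_free(void __percpu *p, size_t size)
{
	int class = pcpu_size_to_class(size);
	struct pcpu_buf_cache *c;

	if (class < 0) {
		free_percpu(p);
		return;
	}

	c = &pcpu_cache[class];
	spin_lock(&c->lock);
	if (c->nr_free < ARRAY_SIZE(c->free)) {
		c->free[c->nr_free++] = p;
		p = NULL;	/* cached, nothing left to free */
	}
	spin_unlock(&c->lock);
	free_percpu(p);		/* nop when p was cached (NULL) */
}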
>
> I stress though that even if you were to carve out certain sizes, a
> global lock to handle ops will still kill scalability.
>
> Perhaps granularity finer than global, but coarser than per-CPU, would
> be a sweet spot for scalability vs memory waste.
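
Sharding the cache sketched above would be the obvious middle ground,
something like this (shard count pulled out of a hat):

#define PCPU_CACHE_SHARDS 16

static struct pcpu_buf_cache pcpu_cache[PCPU_NR_CLASSES][PCPU_CACHE_SHARDS];

static struct pcpu_buf_cache *pcpu_cache_shard(int class)
{
	/* hash on the CPU id; per-NUMA-node shards would also work */
	return &pcpu_cache[class][raw_smp_processor_id() % PCPU_CACHE_SHARDS];
}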
>
> That said...
>
>> Also as you've pointed out, it wasn't just the percpu allocation being
>> the bottleneck, but percpu_counter's global lock too for hotplug
>> support. I'm hazarding a guess that most percpu use cases have
>> additional locking requirements too, such as percpu_counter.
>>
>
> True Fix(tm) is a longer story.
>
> Maybe let's sort out this patchset first, whichever way. :)
>

So I found the discussion around the original patch, including a perf
regression report:

https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/

The reporter suggests dodging the problem by only allocating per-cpu
counters once the process goes multithreaded. Given that there are
still plenty of forever-single-threaded procs out there, I think that
does sound like a great plan regardless of what happens with this
patchset.

Almost all access already goes through dedicated routines, so this
should be an afternoon of churn to sort out, unless I missed a
showstopper. (Maybe there is no good place to stuff a flag or some
other indicator of the state of the counters?)
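
For the record, a rough sketch of the accessor side, with entirely
hypothetical names; the plain -> percpu handover would need more care
against concurrent updaters than shown here:

#include <linux/atomic.h>
#include <linux/gfp.h>
#include <linux/percpu_counter.h>

struct lazy_rss_counter {
	atomic_long_t plain;		/* used while single-threaded */
	struct percpu_counter pcpu;	/* valid once 'active' is set */
	bool active;
};

static inline void lazy_rss_add(struct lazy_rss_counter *c, long v)
{
	if (smp_load_acquire(&c->active))
		percpu_counter_add(&c->pcpu, v);
	else
		atomic_long_add(v, &c->plain);
}

/* Upgrade on the first CLONE_VM fork; still single-threaded here. */
static int lazy_rss_upgrade(struct lazy_rss_counter *c)
{
	int err = percpu_counter_init(&c->pcpu,
				      atomic_long_read(&c->plain),
				      GFP_KERNEL);
	if (err)
		return err;
	smp_store_release(&c->active, true);
	return 0;
}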

That said I'll look into it some time this or next week.

>> Thanks,
>> Dennis
>>
>> > Thoughts?
>> >
>> > Mateusz Guzik (2):
>> >   pcpcntr: add group allocation/free
>> >   fork: group allocation of per-cpu counters for mm struct
>> >
>> >  include/linux/percpu_counter.h | 19 ++++++++---
>> >  kernel/fork.c                  | 13 ++------
>> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>> >  3 files changed, 60 insertions(+), 33 deletions(-)
>> >
>> > --
>> > 2.39.2
>> >
>


-- 
Mateusz Guzik <mjguzik gmail.com>
