linux-kernel.vger.kernel.org archive mirror
From: Mateusz Guzik <mjguzik@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Dennis Zhou <dennis@kernel.org>,
	linux-kernel@vger.kernel.org, tj@kernel.org, cl@linux.com,
	akpm@linux-foundation.org, shakeelb@google.com,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/2] execve scalability issues, part 1
Date: Wed, 23 Aug 2023 14:13:20 +0200	[thread overview]
Message-ID: <CAGudoHFFt5wvYWrwNkz813KaXBmROJ7YJ67s1h3_CBgcoV2fCA@mail.gmail.com> (raw)
In-Reply-To: <20230823094915.ggv3spzevgyoov6i@quack3>

On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
>> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >> > True Fix(tm) is a longer story.
>> >> >
>> >> > Maybe let's sort out this patchset first, whichever way. :)
>> >> >
>> >>
>> >> So I found the discussion around the original patch with a perf
>> >> regression report.
>> >>
>> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>> >>
>> >> The reporter suggests dodging the problem by only allocating per-cpu
>> >> counters when the process goes multithreaded. Given that there are
>> >> still plenty of forever-single-threaded procs out there, I think that
>> >> does sound like a great plan regardless of what happens with this
>> >> patchset.
>> >>
>> >> Almost all access is already done using dedicated routines, so this
>> >> should be an afternoon churn to sort out, unless I missed a
>> >> showstopper. (maybe there is no good place to stuff a flag/whatever
>> >> other indicator about the state of counters?)
>> >>
>> >> That said I'll look into it some time this or next week.
>> >
>> > Good, just let me know how it went, I also wanted to start looking into
>> > this to come up with some concrete patches :). What I had in mind was
>> > that
>> > we could use 'counters == NULL' as an indication that the counter is
>> > still
>> > in 'single counter mode'.
>> >
>>
>> In the current state there are only pointers to counters in mm_struct
>> and there is no storage for them in task_struct. So I don't think
>> merely null-checking the per-cpu stuff is going to cut it -- where
>> should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do it to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to real
> percpu thing. Because I think such counters would be useful also on other
> occasions than as rss counters.
>

Indeed I did -- I had tunnel vision on dodging atomics for current in
the face of remote modifications, which won't happen in your proposal.

I concede your idea solves the problem at hand; I question whether it
is the right thing to do, though. Not my call to make.

>> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
>> counters * 2 -- the first set updated without any synchro by the current
>> thread. The second set is only to be modified by others and protected with
>> mm->arg_lock. The lock protects remote access to the union to begin
>> with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>

arg_lock would not stay as is, it would have to be renamed to something
more generic.

Atomics on x86-64 are very expensive to this very day. Here is a
sample measurement by someone else where 2 atomics show up:
https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u

tl;dr it is *really* bad.

> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>

The spinlock would be there to synchronize against the transition to
per-cpu -- any trickery is avoided and we trivially know for a fact
the remote party either sees the per-cpu state if transitioned, or
the local one if not. Then one easily knows no updates have been lost
and the buffer for the 2 sets of counters can be safely freed.

While writing down the idea previously I did not realize the per-cpu
counter ops disable interrupts around the op. That's already very slow
and the trip should be comparable to paying for an atomic (as in, the
patch which introduced percpu counters here slowed things down for
single-threaded processes).

With your proposal the atomic would be there, but the interrupt trip
could be avoided. This would roughly maintain the current cost of doing
the op (as in, it would not get /worse/). My patch would make it lower.

All that said, I'm going to refrain from writing a patch for the time
being. If the powers that be decide on your approach, I'm not going to
argue -- I don't think either is a clear winner over the other.

-- 
Mateusz Guzik <mjguzik gmail.com>

Thread overview: 31+ messages
2023-08-21 20:28 [PATCH 0/2] execve scalability issues, part 1 Mateusz Guzik
2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
2023-08-22 13:37   ` Vegard Nossum
2023-08-22 14:06     ` Mateusz Guzik
2023-08-22 17:02   ` Dennis Zhou
2023-08-21 20:28 ` [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct Mateusz Guzik
2023-08-21 21:20   ` Matthew Wilcox
2023-08-21 20:42 ` [PATCH 0/2] execve scalability issues, part 1 Matthew Wilcox
2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 2/7] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 3/7] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 4/7] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 5/7] memcg: Add mem_cgroup_uncharge_batch() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 6/7] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 7/7] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
2023-08-21 21:07 ` [PATCH 0/2] execve scalability issues, part 1 Dennis Zhou
2023-08-21 21:39   ` Mateusz Guzik
2023-08-21 22:29     ` Mateusz Guzik
2023-08-22  9:51       ` Jan Kara
2023-08-22 14:24         ` Mateusz Guzik
2023-08-23  9:49           ` Jan Kara
2023-08-23 10:49             ` David Laight
2023-08-23 12:01               ` Mateusz Guzik
2023-08-23 12:13             ` Mateusz Guzik [this message]
2023-08-23 15:47               ` Jan Kara
2023-08-23 16:10                 ` Mateusz Guzik
2023-08-23 16:41                   ` Jan Kara
2023-08-23 17:12                     ` Mateusz Guzik
2023-08-23 20:27             ` Dennis Zhou
2023-08-24  9:19               ` Jan Kara
2023-08-26 18:33 ` Mateusz Guzik
