From: Frank Rowand <frowand.list@gmail.com>
To: Rob Herring <robh+dt@kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>,
	devicetree@vger.kernel.org,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
Date: Tue, 10 Dec 2019 02:17:30 -0600
Message-ID: <c2334575-fa38-eb73-bb56-21a530e773bf@gmail.com>
In-Reply-To: <CAL_JsqLVJi_v9yD5khLUvJHDEXFj=eXFf-CHXz30y0mYojt-Yg@mail.gmail.com>

On 12/9/19 7:51 PM, Rob Herring wrote:
> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
>>> Is there a memory usage issue for the systems that led to this thread?
>>
>> No, no memory issue led to this thread. I was just testing my patch and
>> assumed I had done something wrong in the counting/lock drop/lock
>> acquire/allocate path, because the array was hardly used. So I started to
>> look deeper…
>> Once I figured out everything was fine, I was curious whether everyone
>> was aware that dtc and POWER (SLOF) create phandles differently. So I
>> posted the mail in the thread.
>> Once you confirmed that everything is "known / not an issue", I was
>> ready to take off [0].
>>
>> Later more replies came in, such as one mail [1] from Rob describing the
>> original motivation (a system with 814 phandles). _Here_ I was just
>> surprised that 1024 cache entries were used rather than 64, for a benefit
>> of 60ms. I understand that this is of low concern for you because that
>> memory is released if module support is not enabled, but in my experience
>> module support is usually left enabled.
>>
>> However, Rob suggested / asked about the fixed-size array (this is how I
>> understood it):
>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>> |from the start and I'm still of the opinion we should have a
>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>> |phandles). That would solve the RT memory allocation and locking issue
>> |too.
>>
>> So I attempted to ask whether we should have the fixed-size array, maybe
>> with hash_32() instead of the mask. This would make my other patch
>> obsolete, because the fixed-size array should not have an RT issue. The
>> hash_32() part would address the POWER issue, where the cache is
>> currently not used efficiently.
>>
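For anyone following along, here is a minimal sketch of the two indexing
schemes being compared. This is illustrative only, not the in-tree code;
the cache size and helper names are made up.

#include <linux/hash.h>
#include <linux/log2.h>

#define PHANDLE_CACHE_SZ 64	/* hypothetical fixed size, power of two */

/*
 * Masking suits dtc-generated phandles, which are small sequential
 * integers (1, 2, 3, ...).  SLOF on POWER hands out large, widely
 * spaced values, so many of them collide in the low bits and most
 * of the cache goes unused.
 */
static unsigned int cache_index_mask(u32 handle)
{
	return handle & (PHANDLE_CACHE_SZ - 1);
}

/*
 * hash_32() mixes all bits of the value, so SLOF-style phandles
 * spread across the cache about as well as dtc-style ones do.
 */
static unsigned int cache_index_hash(u32 handle)
{
	return hash_32(handle, ilog2(PHANDLE_CACHE_SZ));
}
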
>> If you would rather keep things as-is, then this is okay from my side.
>> If you want to keep this cache disabled on POWER, then I could contribute
>> a patch doing so.
> 
> It turns out there's actually a bug in the current implementation. If
> multiple phandles map to the same masked cache index, then we leak node
> references when we miss in the cache and re-assign the cache entry.

Aaargh.  Patch sent.
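
The rough shape of the fix, for reference (a sketch, not the exact patch;
phandle_cache and the size constant are assumed from the sketch above):
drop the reference held for the slot's old occupant before reusing it.

#include <linux/of.h>

static struct device_node *phandle_cache[PHANDLE_CACHE_SZ];

static void cache_insert(struct device_node *np, u32 handle)
{
	unsigned int idx = handle & (PHANDLE_CACHE_SZ - 1);

	of_node_put(phandle_cache[idx]);	/* NULL-safe; drops the old ref */
	phandle_cache[idx] = of_node_get(np);	/* hold a ref for the new entry */
}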

> Easily fixed I suppose, but holding a ref count for a cached entry
> seems wrong. That means the ref count never reaches 0 on any node
> with a phandle.

It will go to zero when the cache is freed.

My memory is that we free the cache as part of removing an overlay.  I'll
verify whether my memory is correct.
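
The teardown I have in mind looks roughly like this, reusing the names
from the sketches above (an assumed shape, not verbatim from the tree):
every reference the cache holds is dropped when the cache is freed, so
the counts can reach zero again.

static void free_phandle_cache(void)
{
	unsigned int i;

	for (i = 0; i < PHANDLE_CACHE_SZ; i++) {
		of_node_put(phandle_cache[i]);	/* NULL-safe */
		phandle_cache[i] = NULL;
	}
}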

-Frank


> 
> I've done some more experiments with the performance. I've come to the
> conclusion that just measuring boot time is not a good method, at least
> when the lookup time is not a significant percentage of the total; I
> couldn't get any measurable data that way. I'm using a RK3399 Rock960
> board. It has 190 phandles, and I get about 1500 calls to
> of_find_node_by_phandle() during boot. Note that roughly the first 300
> of those happen before we have any timekeeping (the prior measurements
> also would not account for this). Those calls have no cache in the
> current implementation and are cached in my implementation.
> 
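(One plausible way to take such measurements, purely illustrative and not
necessarily what was done here: wrap the lookup and accumulate its cost,
bearing in mind that ktime_get() only returns meaningful values once
timekeeping is initialized.)

#include <linux/ktime.h>
#include <linux/of.h>

static u64 lookup_ns;
static unsigned long lookup_calls;

static struct device_node *timed_find_node_by_phandle(phandle handle)
{
	ktime_t t0 = ktime_get();	/* not meaningful before timekeeping init */
	struct device_node *np = of_find_node_by_phandle(handle);

	lookup_ns += ktime_to_ns(ktime_sub(ktime_get(), t0));
	lookup_calls++;
	return np;
}
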
> no cache:  20124 us
> current cache: 819 us
> 
> new cache (64 entry): 4342 us
> new cache (128 entry): 2875 us
> new cache (256 entry): 1235 us
> 
> To get some idea of the time spent before timekeeping is up: if we take
> the average miss time as ~17us (20124us / 1200 calls), then we're
> spending about ~5ms on lookups before the cache is enabled. I'd estimate
> the new cache takes ~400us before timekeeping is up, as there are 11
> early misses.
> 
> From these numbers, it seems the miss rate has a significant impact on
> performance for the different sizes. But compared to the original 20+
> ms, they all look like a good improvement.
> 
> Rob
> 

