Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

From: Frank Rowand <frowand.list@gmail.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	devicetree@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	Rob Herring <robh+dt@kernel.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF
Date: Sat, 7 Dec 2019 22:30:04 -0600	[thread overview]
Message-ID: <051776be-9247-1c00-3131-a68c3c7794b2@gmail.com> (raw)
In-Reply-To: <20191206234034.GW3152@gate.crashing.org>

On 12/6/19 5:40 PM, Segher Boessenkool wrote:
> Hi,
> 
> On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote:
>> On 12/3/19 12:35 PM, Segher Boessenkool wrote:
>>> Btw.  Some OFs mangle the phandles some way, to make it easier to catch
>>> people using it as an address (and similarly, mangle ihandles differently,
>>> so you catch confusion between ihandles and phandles as well).  Like a
>>> simple xor, with some odd number preferably.  You should assume *nothing*
>>> about phandles, they are opaque identifiers.
>>
>> For arm32 machines that use dtc to generate the devicetree, which is a
>> very large user base, we certainly can make assumptions about phandles.
> 
> I was talking about OF.  Phandles are explicitly defined to be opaque
> tokens.  If there is an extra meaning to them in flattened device trees,
> well, the kernel should then only depend on that there, not for more
> general phandles.  Where is this documented btw?

And dtc generated devicetrees are a huge proportion of the OF systems.

It is not documented.

As an aside, overlays also depend upon the current dtc implementation.
If an extremely large value is used for a phandle then overlay
application will fail.

> 
>> Especially because the complaints about the overhead of phandle based
>> lookups have been voiced by users of this specific set of machines.
>>
>> For systems with a devicetree that does not follow the assumptions, the
>> phandle cache should not measurably increase the overhead of phandle
>> based lookups.
> 
> It's an extra memory access and extra code to execute, for not much gain
> (if anything).  While with a reasonable hash function it will be good
> for everyone.
> 
>> If you have measurements of a system where implementing the phandle
>> cache increased the overhead,
> 
> Are you seriously saying you think this code can run in zero time and
> space on most systems?

No.  I made no such claim.  Note the additional words in the following
sentences.

>> and the additional overhead is a concern
>> (such as significantly increasing boot time) then please share that
>> information with us.  Otherwise this is just a theoretical exercise.
> 
> The point is that this code could be easily beneficial for most (or all)
> users, not just those that use dtc-constructed device trees.  It is

The point is that the cache was implemented to solve a specific problem
for certain specific systems.  There had been a few reports of various
machines having the same issue, but finally someone measures a **significant**
improvement in boot time for a specific machine.  The boot time with
the cache was **measured** to be much shorter.  The boot time for all
systems with a dtc generated devicetree is expected to be faster.  No
one responded to the implementation when it was proposed with a
**measurement** that showed increased boot time.  A concern of using
more memory was raised and discussed, with at least on feature added
as a result (freeing the cache in late init if modules are not
enabled).

Being "beneficial for most (or all) users" has to be balanced against
whether the change would remove the benefit for the system that the
feature was originally implemented to solve a problem for.  There
was no performance data supplied to answer this question.  (Though
eventually Rob did some measurements of the impact on hash efficiency
for such a system.)

> completely obvious that having a worse cache hash function results in
> many more lookups.  Whether that results in something expressed as
> milliseconds on tiny systems or microseconds on bigger systems is
> completely beside the point.

There was no performance data accompanying the proposed change that
started this thread.  There was no data showing whether the systems
that this feature was created for would suffer.  There was no data
showing that the boot time of the pseries systems would improve.
There was no assertion made that too much memory was being used
by the cache (there was an implied assertion that a large
percentage of the memory used for the cache was not used, thus
the performance benefit of the cache could be improved by
changing to using a hash instead of mask).  We had rejected
creating a cache for several years until finally some solid
data was provided showing an actual need for it.

It is not a question of "milliseconds on tiny systems or
microseconds on bigger systems".  I agree with that.  But it
does matter whether the performance impact of various
implementations is large enough to either solve a problem
or create a problem.  On the other hand, if the amount of
memory used by the cache is a problem (which is _not_
what was asserted by the patch submitter) then we can have
a conversation about how to resolve that.

-Frank

> 
> 
> Segher
>