Re: [PATCH v3 0/3] arm64: Add optimized memset/memcpy/memove functions

From: Stefan Roese <sr@denx.de>
To: Tom Rini <trini@konsulko.com>
Cc: u-boot@lists.denx.de,
	Rasmus Villemoes <rasmus.villemoes@prevas.dk>,
	sjg@chromium.org, Wolfgang Denk <wd@denx.de>
Subject: Re: [PATCH v3 0/3] arm64: Add optimized memset/memcpy/memove functions
Date: Thu, 12 Aug 2021 10:43:56 +0200
Message-ID: <b7eb3113-418f-b668-6f95-30e46cdbb5bf@denx.de>
In-Reply-To: <bf556e0e-158f-c23e-ea40-7bf9f0b370d6@denx.de>

On 11.08.21 16:28, Stefan Roese wrote:
> On 11.08.21 16:25, Tom Rini wrote:
>> On Wed, Aug 11, 2021 at 04:02:39PM +0200, Stefan Roese wrote:
>>>
>>> On an NXP LX2160 based platform it has been noticed that the currently
>>> implemented memset/memcpy functions for aarch64 are suboptimal.
>>> Especially the memset() for clearing the NXP MC firmware memory is very
>>> expensive (time-wise).
>>>
>>> By using optimized functions, a speedup of roughly a factor of 6 has been
>>> measured.
>>
>> To be clear, you re-measured with the cache check code added, and this
>> is the speed up?
> 
> I forgot to do this. BTW: I was wrong about the factor of ~6. From my
> notes, it is a factor of ~4 using the optimized memset() version.
> 
> I'll follow up on this mail with some measurements for all affected
> functions, using small and large sizes. Hopefully tomorrow.

Here are the numbers:

Current original version:
-------------------------
memset() 32 Bytes, 16M times:
time: 0.446 seconds

memset() 16MiB, 256 times:
time: 1.076 seconds

memcpy() 512MiB:
time: 0.224 seconds

New optimized version:
----------------------
memset() 32 Bytes, 16M times:
time: 0.287 seconds

memset() 16MiB, 256 times:
time: 0.292 seconds

memcpy() 512MiB:
time: 0.222 seconds
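
For anyone who wants to try this kind of measurement on a host machine,
the pattern above (a fixed-size memset() repeated N times) can be
sketched roughly as follows. This is only a hosted-C illustration, not
the code used to produce the numbers in this mail: bench_memset() is a
hypothetical helper, it times the C library's memset() rather than the
U-Boot implementation, and it uses clock(), which reports CPU time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Time 'iterations' calls of memset() on a 'size'-byte buffer.
 * Returns the elapsed CPU time in seconds, or a negative value if
 * 'size' is zero or the allocation fails.
 * Hypothetical helper for illustration, not part of the patch series.
 */
static double bench_memset(size_t size, unsigned long iterations)
{
	/* Call through a volatile pointer so the compiler cannot elide the calls. */
	void *(*volatile fill)(void *, int, size_t) = memset;
	unsigned char *buf;
	unsigned long i;
	clock_t start, end;

	if (size == 0)
		return -1.0;

	buf = malloc(size);
	if (!buf)
		return -1.0;

	start = clock();
	for (i = 0; i < iterations; i++)
		fill(buf, (int)(i & 0xff), size);
	end = clock();

	/* Sanity check: the last call must have filled the end of the buffer. */
	assert(iterations == 0 ||
	       buf[size - 1] == (unsigned char)((iterations - 1) & 0xff));

	free(buf);
	return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The two memset() cases above would correspond to
bench_memset(32, 16UL << 20) and bench_memset(16UL << 20, 256).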

Summary:
The optimized memcpy() is nearly identical in speed to the original one
(0.222 vs. 0.224 seconds). But the optimized memset() is much faster, for
both small and big sizes: a factor of ~1.6 for small sizes and a factor of
~3.7 for big sizes.
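
As a cross-check of these factors, here is a minimal sketch in hosted C
that derives them from the times listed above. The helper names
speedup() and throughput_mib_s() are made up for this illustration and
do not exist in the patch series.

```c
#include <assert.h>

/* Speedup factor: time of the original divided by time of the optimized version. */
static double speedup(double t_orig_s, double t_opt_s)
{
	assert(t_opt_s > 0.0);
	return t_orig_s / t_opt_s;
}

/* Throughput in MiB/s for 'count' operations of 'bytes' bytes each. */
static double throughput_mib_s(double bytes, double count, double seconds)
{
	assert(seconds > 0.0);
	return bytes * count / (1024.0 * 1024.0) / seconds;
}
```

With the measured times, speedup(0.446, 0.287) covers the 32-byte
memset() case, speedup(1.076, 0.292) the 16MiB memset() case, and
speedup(0.224, 0.222) the memcpy() case.
throughput_mib_s(16.0 * 1024 * 1024, 256, 0.292) converts the 16MiB
result into MiB/s.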

Note: These measurements were done on the NXP LX2160ARDB board.

Thanks,
Stefan