* Call for testing/opinions: Optimized memset/memcpy
@ 2013-07-13 15:51 Harm Hanemaaijer
  2013-07-13 16:48 ` Dr. David Alan Gilbert
  2013-07-13 17:24 ` Willy Tarreau
  0 siblings, 2 replies; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-13 15:51 UTC
  To: linux-arm-kernel

Hello,

I've been doing some work on optimizing the memset/memcpy family of
functions for modern ARM platforms, including copy_page, memset,
memzero, memcpy, copy_from_user and copy_to_user. It appears that
there is room for improvement, especially with regard to using an
optimal preload strategy for armv6/v7 architectures as well as
aligning the write target. For example, on an armv6-based platform
(RPi) I am seeing an 80% speed-up in copy_page and large sized
memcpy. Gains in the range 10-25% are seen on a Cortex A8 device.
These optimizations use the regular register file, like the
previous implementation, and do not use any NEON or vfp registers.
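
To illustrate what a tunable preload (prefetch) distance means, here is a
minimal C sketch. It is not taken from the patch set, whose routines are
written in ARM assembly. The names copy_with_prefetch, LINE_SIZE and
PREFETCH_DISTANCE, and the values given to them, are assumptions chosen
for illustration; __builtin_prefetch is the GCC/Clang builtin, which is
usually compiled to a pld instruction on ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Both values are assumptions for illustration; the best prefetch
 * distance is platform dependent and is what a benchmark would tune. */
#define LINE_SIZE          32u   /* assumed cache line size in bytes */
#define PREFETCH_DISTANCE  96u   /* hypothetical: bytes to prefetch ahead */
#define WORDS_PER_LINE     (LINE_SIZE / sizeof(uint32_t))

/* Copy n bytes between word-aligned buffers; n must be a multiple of
 * LINE_SIZE. One prefetch hint is issued per cache line copied. */
static void copy_with_prefetch(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert(n % LINE_SIZE == 0);
    size_t words = n / sizeof(uint32_t);

    for (size_t i = 0; i < words; i += WORDS_PER_LINE) {
        /* Hint the line PREFETCH_DISTANCE bytes ahead of the current
         * read position, but never past the end of the source. */
        size_t ahead = i * sizeof(uint32_t) + PREFETCH_DISTANCE;
        if (ahead < n)
            __builtin_prefetch((const unsigned char *)src + ahead, 0, 0);

        /* Copy one cache line with plain word loads and stores. */
        for (size_t j = 0; j < WORDS_PER_LINE; j++)
            dst[i + j] = src[i + j];
    }
}
```

Too small a distance means the data has not arrived when the loads
execute; too large a distance risks the line being evicted again or
wastes bandwidth at the end of the buffer, which is why the value is
worth measuring per platform.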

To properly benchmark and test these new implementations, I've
created a userspace testing utility that can be used to compare
and validate exact copies of the original and optimized kernel
versions of the functions in userspace. The repository is
available at https://github.com/hglm/test-arm-kernel-memcpy.git.
It would be useful to compare the results on different
platforms and to check whether changes in the prefetch distance
or write alignment result in optimized performance.
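
The general shape of such a validate-then-measure harness can be sketched
as follows. This is not code from the repository above; validate_copy,
bench_mb_per_s and the memcpy_fn type are hypothetical names, the guard
size is arbitrary, and MB is assumed here to mean 10^6 bytes, which may
differ from what the actual utility reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

/* Compare a candidate against libc memcpy, including guard bytes on
 * both sides to catch writes outside the destination. Returns 1 on a
 * match, 0 on a mismatch or if n exceeds the test buffer size. */
static int validate_copy(memcpy_fn candidate, size_t n)
{
    enum { MAX = 4096, GUARD = 16 };
    static unsigned char src[MAX];
    static unsigned char expect[MAX + 2 * GUARD], got[MAX + 2 * GUARD];

    if (n > MAX)
        return 0;
    for (size_t i = 0; i < n; i++)
        src[i] = (unsigned char)(i * 31 + 7);
    memset(expect, 0xAA, sizeof expect);
    memset(got, 0xAA, sizeof got);
    memcpy(expect + GUARD, src, n);
    candidate(got + GUARD, src, n);
    return memcmp(expect, got, sizeof expect) == 0;
}

/* Time `iterations` calls of fn and return throughput in MB/s, or 0
 * if the run was too short for clock() to resolve. */
static double bench_mb_per_s(memcpy_fn fn, void *dst, const void *src,
                             size_t n, int iterations)
{
    assert(iterations > 0);
    clock_t t0 = clock();
    for (int i = 0; i < iterations; i++)
        fn(dst, src, n);
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return (double)n * iterations / secs / 1e6;
}
```

A candidate would first be checked with validate_copy over a range of
sizes and only then timed with bench_mb_per_s. A real harness also has
to keep the compiler from optimizing the repeated calls away.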

I've created a preliminary patch set that replaces the copy_page,
memset and memzero functions for all ARM platforms. Features
include use of a configurable prefetch distance in copy_page,
translation to 16-bit Thumb2 instructions whenever possible,
optimization for the common word-aligned case in memset/memzero,
and application of a predefined write alignment in memset/memzero.
In order to safely use unified ARM assembler syntax, which appears
to be desirable going forward, the first patch in the set renames
all references of the "push" macro so that it no longer conflicts
with the "push" instruction defined in unified syntax. The new
memset/memzero functions use the unified syntax. The patch set
is available at
https://github.com/hglm/patches/tree/master/arm-mem-funcs.
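
The idea of applying a predefined write alignment can be sketched in C
as below. This only illustrates the technique and is not the code in
the patch set, which is ARM assembly; memset_aligned and the WRITE_ALIGN
value of 32 are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical write alignment in bytes: a power of two and a
 * multiple of the word size. The useful value is platform dependent. */
#define WRITE_ALIGN 32u

static void *memset_aligned(void *s, int c, size_t n)
{
    unsigned char *p = s;
    unsigned char b = (unsigned char)c;

    /* Head: byte stores until the write pointer reaches the boundary. */
    while (n > 0 && ((uintptr_t)p & (WRITE_ALIGN - 1)) != 0) {
        *p++ = b;
        n--;
    }

    /* Body: aligned blocks written with word stores. The cast assumes
     * the kernel's -fno-strict-aliasing convention. */
    uint32_t w = b * 0x01010101u;
    uint32_t *wp = (uint32_t *)p;
    while (n >= WRITE_ALIGN) {
        for (size_t i = 0; i < WRITE_ALIGN / sizeof(uint32_t); i++)
            wp[i] = w;
        wp += WRITE_ALIGN / sizeof(uint32_t);
        n -= WRITE_ALIGN;
    }

    /* Tail: remaining bytes after the last full block. */
    p = (unsigned char *)wp;
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return s;
}
```

The point of the head loop is that every store in the body then starts
on a WRITE_ALIGN boundary, so a full block never straddles two
alignment units regardless of where the caller's buffer begins.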

Optimization of memcpy/copy_from_user/copy_to_user is more
complicated, and although I've created optimized versions that
provide better results in benchmarks, we have to be careful that
increased code size and branch prediction burden do not result
in lower performance in real-world use, especially on older
platforms. Therefore it might be desirable to only enable them
on newer platforms like armv6/v7.

So in short, I am looking for opinions, and test results especially
from the userspace benchmark, to see the relative merit of these
optimizations on different platforms.


* Call for testing/opinions: Optimized memset/memcpy
From: Dr. David Alan Gilbert @ 2013-07-13 16:48 UTC
  To: linux-arm-kernel

* Harm Hanemaaijer (fgenfb at yahoo.com) wrote:
> Hello,
> 
> I've been doing some work on optimizing the memset/memcpy family of
> functions for modern ARM platforms, including copy_page, memset,
> memzero, memcpy, copy_from_user and copy_to_user. It appears that
> there is room for improvement, especially with regard to using an
> optimal preload strategy for armv6/v7 architectures as well as
> aligning the write target. For example, on an armv6-based platform
> (RPi) I am seeing an 80% speed-up in copy_page and large sized
> memcpy. Gains in the range 10-25% are seen on a Cortex A8 device.
> These optimizations use the regular register file, like the
> previous implementation, and do not use any NEON or vfp registers.

You might like to compare with some of the routines at:
https://launchpad.net/cortex-strings
and some of the numbers at:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/

(I'm sure Michael Hope who owns that set of stuff would be
interested in seeing your stuff as well).

> To properly benchmark and test these new implementations, I've
> created a userspace testing utility that can be used to compare
> and validate exact copies of the original and optimized kernel
> versions of the functions in userspace. The repository is
> available at https://github.com/hglm/test-arm-kernel-memcpy.git.
> It would be useful to compare the results on different
> platforms and to check whether changes in the prefetch distance
> or write alignment result in optimized performance.

It's quite tricky to compare results across different machines, or
even on the same machine in different setups;

http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html

is an interesting article on one machine being screwed over by
video bandwidth.

I've only had a brief scan through your code; one thing I remember
from a couple of years ago was a theory that ldrd/strd was supposed
to be faster on A15s (but I never had a chance to try it out).

<snip>

> So in short, I am looking for opinions, and test results especially
> from the userspace benchmark, to see the relative merit of these
> optimizations on different platforms.

Maybe NEON is worth a try these days (although be careful of platforms
like Tegra 2 that don't have it); there was a recent patch that enabled
its use in the kernel (I think for some RAID use). The downside is it's
supposed to be quite power hungry.

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\ gro.gilbert @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Call for testing/opinions: Optimized memset/memcpy
From: Willy Tarreau @ 2013-07-13 17:24 UTC
  To: linux-arm-kernel

Hello Harm,

On Sat, Jul 13, 2013 at 03:51:07PM +0000, Harm Hanemaaijer wrote:
> Hello,
> 
> I've been doing some work on optimizing the memset/memcpy family of
> functions for modern ARM platforms, including copy_page, memset,
> memzero, memcpy, copy_from_user and copy_to_user. It appears that
> there is room for improvement, especially with regard to using an
> optimal preload strategy for armv6/v7 architectures as well as
> aligning the write target. For example, on an armv6-based platform
> (RPi) I am seeing an 80% speed-up in copy_page and large sized
> memcpy. Gains in the range 10-25% are seen on a Cortex A8 device.

Interesting, especially for devices that have a narrow DDR bus where
we want to shave every possible bus cycle!

(...)
> So in short, I am looking for opinions, and test results especially
> from the userspace benchmark, to see the relative merit of these
> optimizations on different platforms.

OK, I've run bench.script on the following platforms:

  - Snowball board : it is a dual-core 1GHz cortex-a9 from STE (A9500).
    It has some 32-bit LPDDR2 soldered on the CPU (package on package).
    The test ran only in ARMv7 mode.

    root@snowball:tmp# cat /proc/cpuinfo
    processor       : 0
    model name      : ARMv7 Processor rev 1 (v7l)
    BogoMIPS        : 4.80
    Features        : swp half thumb fastmult vfp edsp neon vfpv3 tls 
    CPU implementer : 0x41
    CPU architecture: 7
    CPU variant     : 0x2
    CPU part        : 0xc09
    CPU revision    : 1

  - Armada XP-GP board : it's a quad-core 1.6 GHz Marvell Armada-XP (PJ4Bv2)
    CPU. It has 64-bit DDR3-1600 RAM on a DIMM. The tests were run in ARMv7
    and Thumb2 modes. The difference was not impressive between the two
    modes.

    root@xpgp:tmp# cat /proc/cpuinfo
    processor       : 0
    model name      : ARMv7 Processor rev 2 (v7l)
    BogoMIPS        : 1594.16
    Features        : swp half thumb fastmult vfp edsp vfpv3 tls idiva idivt 
    CPU implementer : 0x56
    CPU architecture: 7
    CPU variant     : 0x2
    CPU part        : 0x584
    CPU revision    : 2

  - Mirabox : single-core 1.2 GHz Marvell Armada370 (PJ4B) CPU. It uses
    16-bit DDR3-1200 soldered onboard. The tests were run in ARMv7 and
    Thumb2 modes. It can be useful to compare with the xp-gp above because
    its CPU can be seen as a scaled down version of the previous one, with
    1/4 of the DRAM bus width, and both have the DRAM at half CPU frequency.

    root@mirabox:tmp# cat /proc/cpuinfo
    processor       : 0
    model name      : ARMv7 Processor rev 1 (v7l)
    BogoMIPS        : 597.60
    Features        : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls idivt 
    CPU implementer : 0x56
    CPU architecture: 7
    CPU variant     : 0x1
    CPU part        : 0x581
    CPU revision    : 1
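
As a rough aid when comparing the two Marvell boards, the theoretical
peak DRAM bandwidth follows from the data rate and the bus width. The
sketch below assumes that "DDR3-N" denotes N mega-transfers per second;
peak_mb_per_s is a hypothetical helper and MB means 10^6 bytes here.
The Snowball is left out because its LPDDR2 data rate is not stated.

```c
/* Theoretical peak DRAM bandwidth in MB/s (10^6 bytes per second):
 * transfers per second multiplied by bytes moved per transfer. */
static double peak_mb_per_s(unsigned mega_transfers_per_s, unsigned bus_bits)
{
    return (double)mega_transfers_per_s * (bus_bits / 8.0);
}

/* Armada XP-GP: 64-bit DDR3-1600 -> peak_mb_per_s(1600, 64) = 12800 MB/s
 * Mirabox:      16-bit DDR3-1200 -> peak_mb_per_s(1200, 16) =  2400 MB/s */
```

These are upper bounds on raw transfer rate only; a copy reads and
writes every byte, and latency and controller behaviour reduce the
achievable figure further, so measured copy throughput sits well below
them.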

I'm attaching all the results.

Hoping this helps,
Willy

-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 599.89 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 600.57 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 597.81 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 598.70 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 595.39 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 618.28 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 615.10 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 618.15 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 615.02 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 621.19 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 618.03 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 612.97 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 614.82 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 611.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 616.50 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.92 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 365.71 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.92 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 365.73 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 365.63 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 381.35 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 383.49 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 381.49 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 383.32 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 381.47 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 426.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 426.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 426.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 426.69 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 424.72 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 311.75 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 310.30 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 311.74 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 310.22 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 311.76 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 327.84 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 327.89 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 327.87 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 326.25 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 327.87 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 364.50 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 366.29 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 364.51 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 366.24 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 366.31 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 361.11 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 362.86 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 361.10 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 362.86 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 361.13 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 366.61 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 364.79 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 366.56 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 366.60 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 364.84 MB/s
libc memcpy:
4096 bytes page aligned: 356.71 MB/s
4096 bytes page aligned: 355.04 MB/s
4096 bytes page aligned: 356.67 MB/s
4096 bytes page aligned: 354.98 MB/s
4096 bytes page aligned: 356.68 MB/s
kernel memcpy (original):
4096 bytes page aligned: 355.32 MB/s
4096 bytes page aligned: 356.96 MB/s
4096 bytes page aligned: 355.31 MB/s
4096 bytes page aligned: 357.01 MB/s
4096 bytes page aligned: 355.30 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 341.05 MB/s
4096 bytes page aligned: 339.37 MB/s
4096 bytes page aligned: 341.04 MB/s
4096 bytes page aligned: 339.37 MB/s
4096 bytes page aligned: 341.03 MB/s
kernel copy_page (original):
4096 bytes page aligned: 382.31 MB/s
4096 bytes page aligned: 384.19 MB/s
4096 bytes page aligned: 382.29 MB/s
4096 bytes page aligned: 384.25 MB/s
4096 bytes page aligned: 382.30 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 340.55 MB/s
4096 bytes page aligned: 338.96 MB/s
4096 bytes page aligned: 340.60 MB/s
4096 bytes page aligned: 338.96 MB/s
4096 bytes page aligned: 340.56 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 513.06 MB/s
Mixed from 1 to 1023 (power law), unaligned: 513.02 MB/s
Mixed from 1 to 1023 (power law), unaligned: 512.94 MB/s
Mixed from 1 to 1023 (power law), unaligned: 510.37 MB/s
Mixed from 1 to 1023 (power law), unaligned: 513.35 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 532.66 MB/s
Mixed from 1 to 1023 (power law), unaligned: 535.20 MB/s
Mixed from 1 to 1023 (power law), unaligned: 532.29 MB/s
Mixed from 1 to 1023 (power law), unaligned: 535.41 MB/s
Mixed from 1 to 1023 (power law), unaligned: 535.59 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 528.33 MB/s
Mixed from 1 to 1023 (power law), unaligned: 531.12 MB/s
Mixed from 1 to 1023 (power law), unaligned: 527.64 MB/s
Mixed from 1 to 1023 (power law), unaligned: 530.72 MB/s
Mixed from 1 to 1023 (power law), unaligned: 528.05 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 888.47 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 884.25 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 888.42 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 888.49 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 884.05 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 962.84 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 958.71 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 963.20 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 958.83 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 962.86 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1004.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.61 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1004.49 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.43 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1004.46 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 922.59 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 926.98 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 926.99 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 922.46 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 927.07 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 930.00 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 934.53 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 930.89 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 935.60 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 935.32 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 520.37 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 520.42 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 517.93 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 520.36 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 517.84 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 594.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 591.54 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 594.39 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 594.45 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 591.58 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 658.84 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 655.68 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 658.78 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 655.58 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 658.85 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.21 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.92 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.08 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.93 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 586.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 588.64 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 585.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 588.86 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 588.66 MB/s
libc memset:
4096 bytes page aligned: 2052.77 MB/s
4096 bytes page aligned: 2052.69 MB/s
4096 bytes page aligned: 2042.84 MB/s
4096 bytes page aligned: 2052.72 MB/s
4096 bytes page aligned: 2042.30 MB/s
kernel memset (original):
4096 bytes page aligned: 1920.98 MB/s
4096 bytes page aligned: 1911.66 MB/s
4096 bytes page aligned: 1921.13 MB/s
4096 bytes page aligned: 1921.17 MB/s
4096 bytes page aligned: 1911.92 MB/s
kernel memset (optimized):
4096 bytes page aligned: 1900.46 MB/s
4096 bytes page aligned: 1891.21 MB/s
4096 bytes page aligned: 1900.52 MB/s
4096 bytes page aligned: 1891.16 MB/s
4096 bytes page aligned: 1900.64 MB/s
kernel memzero (original):
4096 bytes page aligned: 1910.57 MB/s
4096 bytes page aligned: 1920.05 MB/s
4096 bytes page aligned: 1920.02 MB/s
4096 bytes page aligned: 1910.87 MB/s
4096 bytes page aligned: 1920.06 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 1917.74 MB/s
4096 bytes page aligned: 1927.05 MB/s
4096 bytes page aligned: 1917.28 MB/s
4096 bytes page aligned: 1927.11 MB/s
4096 bytes page aligned: 1926.87 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 759.37 MB/s
Mixed from 1 to 1023 (power law), unaligned: 759.42 MB/s
Mixed from 1 to 1023 (power law), unaligned: 755.88 MB/s
Mixed from 1 to 1023 (power law), unaligned: 759.32 MB/s
Mixed from 1 to 1023 (power law), unaligned: 756.04 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 802.77 MB/s
Mixed from 1 to 1023 (power law), unaligned: 798.89 MB/s
Mixed from 1 to 1023 (power law), unaligned: 801.62 MB/s
Mixed from 1 to 1023 (power law), unaligned: 802.67 MB/s
Mixed from 1 to 1023 (power law), unaligned: 798.07 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 862.50 MB/s
Mixed from 1 to 1023 (power law), unaligned: 857.72 MB/s
Mixed from 1 to 1023 (power law), unaligned: 862.52 MB/s
Mixed from 1 to 1023 (power law), unaligned: 857.00 MB/s
Mixed from 1 to 1023 (power law), unaligned: 860.71 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 784.48 MB/s
Mixed from 1 to 1023 (power law), unaligned: 780.41 MB/s
Mixed from 1 to 1023 (power law), unaligned: 784.97 MB/s
Mixed from 1 to 1023 (power law), unaligned: 781.14 MB/s
Mixed from 1 to 1023 (power law), unaligned: 783.99 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 793.48 MB/s
Mixed from 1 to 1023 (power law), unaligned: 796.39 MB/s
Mixed from 1 to 1023 (power law), unaligned: 792.86 MB/s
Mixed from 1 to 1023 (power law), unaligned: 796.20 MB/s
Mixed from 1 to 1023 (power law), unaligned: 796.68 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 614.78 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 618.39 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 614.90 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 618.16 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 614.83 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 654.11 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 650.60 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 653.49 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 653.81 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 649.56 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 653.09 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 650.86 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 653.72 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 650.74 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 653.71 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 332.22 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 333.86 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 332.22 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 333.86 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 333.77 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 365.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 365.65 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.96 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 365.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.95 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 403.08 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 401.21 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 403.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 401.23 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 403.02 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 293.84 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 293.87 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 293.79 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 292.46 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 293.78 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 312.63 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 314.11 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 312.64 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 314.05 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 312.63 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 347.08 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 345.40 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 347.01 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 347.06 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 347.05 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 338.99 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 337.42 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 338.96 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 337.42 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 339.07 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 336.61 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 338.16 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.61 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 338.21 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.58 MB/s
libc memcpy:
4096 bytes page aligned: 358.08 MB/s
4096 bytes page aligned: 356.32 MB/s
4096 bytes page aligned: 358.07 MB/s
4096 bytes page aligned: 356.39 MB/s
4096 bytes page aligned: 358.08 MB/s
kernel memcpy (original):
4096 bytes page aligned: 356.76 MB/s
4096 bytes page aligned: 358.47 MB/s
4096 bytes page aligned: 356.76 MB/s
4096 bytes page aligned: 358.47 MB/s
4096 bytes page aligned: 356.86 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 342.33 MB/s
4096 bytes page aligned: 340.66 MB/s
4096 bytes page aligned: 342.32 MB/s
4096 bytes page aligned: 340.70 MB/s
4096 bytes page aligned: 342.31 MB/s
kernel copy_page (original):
4096 bytes page aligned: 381.93 MB/s
4096 bytes page aligned: 383.87 MB/s
4096 bytes page aligned: 381.97 MB/s
4096 bytes page aligned: 383.86 MB/s
4096 bytes page aligned: 381.98 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 341.86 MB/s
4096 bytes page aligned: 341.83 MB/s
4096 bytes page aligned: 341.86 MB/s
4096 bytes page aligned: 341.80 MB/s
4096 bytes page aligned: 341.85 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 484.57 MB/s
Mixed from 1 to 1023 (power law), unaligned: 482.42 MB/s
Mixed from 1 to 1023 (power law), unaligned: 484.45 MB/s
Mixed from 1 to 1023 (power law), unaligned: 482.49 MB/s
Mixed from 1 to 1023 (power law), unaligned: 484.27 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 503.45 MB/s
Mixed from 1 to 1023 (power law), unaligned: 505.11 MB/s
Mixed from 1 to 1023 (power law), unaligned: 502.65 MB/s
Mixed from 1 to 1023 (power law), unaligned: 505.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 502.69 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 490.07 MB/s
Mixed from 1 to 1023 (power law), unaligned: 490.26 MB/s
Mixed from 1 to 1023 (power law), unaligned: 486.98 MB/s
Mixed from 1 to 1023 (power law), unaligned: 489.95 MB/s
Mixed from 1 to 1023 (power law), unaligned: 487.95 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 844.51 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 840.39 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 844.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 840.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 844.55 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 886.05 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 890.19 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 890.11 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 885.76 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 889.84 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 930.57 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 934.93 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 930.50 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 934.75 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 930.35 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 860.46 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 860.40 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 860.34 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 860.40 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 856.31 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.67 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 877.42 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.60 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 877.48 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.70 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 496.66 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 499.04 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 498.98 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 496.62 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 498.96 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 551.78 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 554.33 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 551.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 554.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 551.60 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 601.07 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 597.87 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 601.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 601.08 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 598.38 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 525.40 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 522.99 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 525.42 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 522.74 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 525.28 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 556.46 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 559.02 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 559.16 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 559.00 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 559.13 MB/s
libc memset:
4096 bytes page aligned: 2029.13 MB/s
4096 bytes page aligned: 2038.87 MB/s
4096 bytes page aligned: 2029.11 MB/s
4096 bytes page aligned: 2038.82 MB/s
4096 bytes page aligned: 2028.82 MB/s
kernel memset (original):
4096 bytes page aligned: 1918.99 MB/s
4096 bytes page aligned: 1909.79 MB/s
4096 bytes page aligned: 1919.03 MB/s
4096 bytes page aligned: 1918.82 MB/s
4096 bytes page aligned: 1918.96 MB/s
kernel memset (optimized):
4096 bytes page aligned: 1920.02 MB/s
4096 bytes page aligned: 1910.71 MB/s
4096 bytes page aligned: 1920.03 MB/s
4096 bytes page aligned: 1910.58 MB/s
4096 bytes page aligned: 1919.89 MB/s
kernel memzero (original):
4096 bytes page aligned: 1885.37 MB/s
4096 bytes page aligned: 1894.53 MB/s
4096 bytes page aligned: 1885.11 MB/s
4096 bytes page aligned: 1894.52 MB/s
4096 bytes page aligned: 1894.52 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 1895.10 MB/s
4096 bytes page aligned: 1894.72 MB/s
4096 bytes page aligned: 1885.82 MB/s
4096 bytes page aligned: 1895.08 MB/s
4096 bytes page aligned: 1885.86 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 737.90 MB/s
Mixed from 1 to 1023 (power law), unaligned: 734.13 MB/s
Mixed from 1 to 1023 (power law), unaligned: 737.61 MB/s
Mixed from 1 to 1023 (power law), unaligned: 734.18 MB/s
Mixed from 1 to 1023 (power law), unaligned: 737.53 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 786.00 MB/s
Mixed from 1 to 1023 (power law), unaligned: 786.00 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.98 MB/s
Mixed from 1 to 1023 (power law), unaligned: 782.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.96 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 813.68 MB/s
Mixed from 1 to 1023 (power law), unaligned: 817.65 MB/s
Mixed from 1 to 1023 (power law), unaligned: 813.22 MB/s
Mixed from 1 to 1023 (power law), unaligned: 817.10 MB/s
Mixed from 1 to 1023 (power law), unaligned: 813.94 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 746.57 MB/s
Mixed from 1 to 1023 (power law), unaligned: 746.77 MB/s
Mixed from 1 to 1023 (power law), unaligned: 742.82 MB/s
Mixed from 1 to 1023 (power law), unaligned: 746.56 MB/s
Mixed from 1 to 1023 (power law), unaligned: 743.25 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 785.01 MB/s
Mixed from 1 to 1023 (power law), unaligned: 781.21 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.10 MB/s
Mixed from 1 to 1023 (power law), unaligned: 781.19 MB/s
Mixed from 1 to 1023 (power law), unaligned: 784.99 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 944.06 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 939.55 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 936.32 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 938.91 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 935.52 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 921.58 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 918.61 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 915.82 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 915.27 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 911.62 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 908.06 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 905.13 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 907.52 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 906.64 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 907.89 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 547.23 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 547.29 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 546.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 547.24 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 547.50 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.90 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.91 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.93 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 542.91 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.95 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 615.08 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 614.48 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 615.11 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 615.07 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 614.90 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 459.28 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 459.87 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 459.40 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 459.62 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 459.40 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 457.91 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.35 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 457.98 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.22 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 457.85 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 545.62 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 544.90 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 545.52 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 545.42 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 545.54 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 485.72 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 484.69 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 484.78 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.02 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.64 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 489.08 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 491.05 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 492.40 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 493.27 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 491.08 MB/s
libc memcpy:
4096 bytes page aligned: 1027.53 MB/s
4096 bytes page aligned: 1020.33 MB/s
4096 bytes page aligned: 1026.20 MB/s
4096 bytes page aligned: 1025.76 MB/s
4096 bytes page aligned: 1024.70 MB/s
kernel memcpy (original):
4096 bytes page aligned: 1026.80 MB/s
4096 bytes page aligned: 1027.25 MB/s
4096 bytes page aligned: 1026.46 MB/s
4096 bytes page aligned: 1020.09 MB/s
4096 bytes page aligned: 1027.83 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 841.49 MB/s
4096 bytes page aligned: 847.07 MB/s
4096 bytes page aligned: 840.32 MB/s
4096 bytes page aligned: 847.07 MB/s
4096 bytes page aligned: 841.32 MB/s
kernel copy_page (original):
4096 bytes page aligned: 948.27 MB/s
4096 bytes page aligned: 940.34 MB/s
4096 bytes page aligned: 946.30 MB/s
4096 bytes page aligned: 942.02 MB/s
4096 bytes page aligned: 948.32 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 850.59 MB/s
4096 bytes page aligned: 857.73 MB/s
4096 bytes page aligned: 851.24 MB/s
4096 bytes page aligned: 858.75 MB/s
4096 bytes page aligned: 851.73 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 715.47 MB/s
Mixed from 1 to 1023 (power law), unaligned: 714.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 715.65 MB/s
Mixed from 1 to 1023 (power law), unaligned: 714.83 MB/s
Mixed from 1 to 1023 (power law), unaligned: 712.47 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 721.70 MB/s
Mixed from 1 to 1023 (power law), unaligned: 719.15 MB/s
Mixed from 1 to 1023 (power law), unaligned: 721.34 MB/s
Mixed from 1 to 1023 (power law), unaligned: 718.81 MB/s
Mixed from 1 to 1023 (power law), unaligned: 721.02 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 635.79 MB/s
Mixed from 1 to 1023 (power law), unaligned: 636.97 MB/s
Mixed from 1 to 1023 (power law), unaligned: 635.52 MB/s
Mixed from 1 to 1023 (power law), unaligned: 636.23 MB/s
Mixed from 1 to 1023 (power law), unaligned: 636.05 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1323.49 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1326.82 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1348.12 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1328.57 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1324.56 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1786.48 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1782.46 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1776.21 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1745.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1771.53 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1770.77 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1759.21 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1721.21 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1782.98 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1762.74 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1745.20 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1763.23 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1743.48 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1766.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1728.34 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1682.73 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1660.62 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1695.76 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1703.42 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1766.86 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 901.11 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 901.81 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 889.89 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 886.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 899.02 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1142.87 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1145.74 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1141.91 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1142.41 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1143.23 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1129.60 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1132.20 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1131.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1131.37 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1128.10 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1110.96 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1105.10 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1106.56 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1107.89 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1105.29 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1081.12 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1086.37 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1086.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1086.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1085.48 MB/s
libc memset:
4096 bytes page aligned: 1371.96 MB/s
4096 bytes page aligned: 1362.53 MB/s
4096 bytes page aligned: 1383.10 MB/s
4096 bytes page aligned: 1356.89 MB/s
4096 bytes page aligned: 1367.61 MB/s
kernel memset (original):
4096 bytes page aligned: 1321.56 MB/s
4096 bytes page aligned: 1337.12 MB/s
4096 bytes page aligned: 1318.98 MB/s
4096 bytes page aligned: 1330.80 MB/s
4096 bytes page aligned: 1324.66 MB/s
kernel memset (optimized):
4096 bytes page aligned: 1317.07 MB/s
4096 bytes page aligned: 1305.07 MB/s
4096 bytes page aligned: 1311.78 MB/s
4096 bytes page aligned: 1301.32 MB/s
4096 bytes page aligned: 1305.47 MB/s
kernel memzero (original):
4096 bytes page aligned: 1320.70 MB/s
4096 bytes page aligned: 1317.15 MB/s
4096 bytes page aligned: 1380.78 MB/s
4096 bytes page aligned: 1316.34 MB/s
4096 bytes page aligned: 1363.25 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 1302.89 MB/s
4096 bytes page aligned: 1349.68 MB/s
4096 bytes page aligned: 1305.33 MB/s
4096 bytes page aligned: 1338.91 MB/s
4096 bytes page aligned: 1304.71 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 1296.85 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1281.93 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1284.15 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1303.82 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1289.72 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 1635.98 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1631.05 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1630.50 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1629.33 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1640.34 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1674.27 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1661.84 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1670.77 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1656.26 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1664.30 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 1583.12 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1576.78 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1579.13 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1571.27 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1554.87 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1613.16 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1624.66 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1613.26 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1624.16 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1611.64 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 938.28 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 938.13 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 938.22 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 937.87 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 938.26 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 992.48 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 992.77 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 992.53 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 992.82 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 992.45 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.57 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.32 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.57 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.32 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.65 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 506.25 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 506.18 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 506.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 506.16 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 506.19 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 542.36 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 542.08 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.74 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 542.09 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 542.71 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 568.31 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.96 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.96 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.81 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.88 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 425.27 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.41 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.29 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 426.54 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.58 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 458.17 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.13 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.73 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.32 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.95 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 503.75 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 503.23 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 503.38 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 502.87 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 503.40 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 486.47 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.02 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.65 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.20 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.11 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 456.43 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.72 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.60 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.58 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 456.06 MB/s
libc memcpy:
4096 bytes page aligned: 2733.85 MB/s
4096 bytes page aligned: 2734.82 MB/s
4096 bytes page aligned: 2735.47 MB/s
4096 bytes page aligned: 2733.74 MB/s
4096 bytes page aligned: 2735.10 MB/s
kernel memcpy (original):
4096 bytes page aligned: 2763.15 MB/s
4096 bytes page aligned: 2764.57 MB/s
4096 bytes page aligned: 2762.87 MB/s
4096 bytes page aligned: 2764.31 MB/s
4096 bytes page aligned: 2763.97 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 2021.61 MB/s
4096 bytes page aligned: 2022.85 MB/s
4096 bytes page aligned: 2021.30 MB/s
4096 bytes page aligned: 2022.75 MB/s
4096 bytes page aligned: 2021.18 MB/s
kernel copy_page (original):
4096 bytes page aligned: 1536.64 MB/s
4096 bytes page aligned: 1536.07 MB/s
4096 bytes page aligned: 1536.62 MB/s
4096 bytes page aligned: 1536.44 MB/s
4096 bytes page aligned: 1536.04 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 2029.46 MB/s
4096 bytes page aligned: 2028.46 MB/s
4096 bytes page aligned: 2029.26 MB/s
4096 bytes page aligned: 2028.49 MB/s
4096 bytes page aligned: 2029.51 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 677.42 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.45 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.43 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.49 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.55 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 705.91 MB/s
Mixed from 1 to 1023 (power law), unaligned: 705.96 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.14 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.18 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.32 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 671.04 MB/s
Mixed from 1 to 1023 (power law), unaligned: 671.49 MB/s
Mixed from 1 to 1023 (power law), unaligned: 671.19 MB/s
Mixed from 1 to 1023 (power law), unaligned: 671.87 MB/s
Mixed from 1 to 1023 (power law), unaligned: 671.50 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1288.97 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1288.99 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1288.74 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1288.95 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1288.51 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1698.82 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1695.12 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1695.28 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1699.55 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1698.91 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1826.35 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1826.33 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1833.66 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1833.25 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1834.97 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1608.61 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1603.63 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1606.36 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1608.51 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1607.49 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1654.00 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1653.34 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1653.09 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1647.16 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1653.98 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 779.98 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 780.05 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 779.98 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 780.09 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 779.82 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 971.07 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 969.65 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 969.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 969.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 969.45 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1166.68 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1166.31 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1166.68 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1166.41 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1166.45 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 915.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 915.88 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 916.08 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 915.77 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 915.94 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 980.79 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 981.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 981.46 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 981.44 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 981.17 MB/s
libc memset:
4096 bytes page aligned: 2808.48 MB/s
4096 bytes page aligned: 2809.23 MB/s
4096 bytes page aligned: 2809.10 MB/s
4096 bytes page aligned: 2808.32 MB/s
4096 bytes page aligned: 2808.85 MB/s
kernel memset (original):
4096 bytes page aligned: 4285.77 MB/s
4096 bytes page aligned: 4286.95 MB/s
4096 bytes page aligned: 4285.80 MB/s
4096 bytes page aligned: 4287.03 MB/s
4096 bytes page aligned: 4286.30 MB/s
kernel memset (optimized):
4096 bytes page aligned: 4332.88 MB/s
4096 bytes page aligned: 4333.13 MB/s
4096 bytes page aligned: 4332.22 MB/s
4096 bytes page aligned: 4333.00 MB/s
4096 bytes page aligned: 4331.64 MB/s
kernel memzero (original):
4096 bytes page aligned: 4286.68 MB/s
4096 bytes page aligned: 4286.68 MB/s
4096 bytes page aligned: 4286.96 MB/s
4096 bytes page aligned: 4286.31 MB/s
4096 bytes page aligned: 4285.41 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 4307.47 MB/s
4096 bytes page aligned: 4306.33 MB/s
4096 bytes page aligned: 4307.97 MB/s
4096 bytes page aligned: 4305.94 MB/s
4096 bytes page aligned: 4307.61 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 1150.12 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1149.80 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1150.06 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1149.76 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1149.91 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 1482.23 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1483.26 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1483.42 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1482.48 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1483.19 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1683.39 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1680.19 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1681.58 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1680.15 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1680.06 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 1357.13 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1357.31 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1356.41 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1357.16 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1356.60 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1469.08 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1470.31 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1469.47 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1468.80 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1469.37 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.54 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.27 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.78 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.52 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.50 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.22 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.17 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.16 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.08 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.19 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 852.17 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 852.53 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 852.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 852.44 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 852.45 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 455.51 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 457.69 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 455.01 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 455.30 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 455.68 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 512.36 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 512.02 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 512.47 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 512.47 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 512.66 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 538.32 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 537.83 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 538.36 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 538.29 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 539.25 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 392.90 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 388.25 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 388.67 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 392.51 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 392.09 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 433.21 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 433.73 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 433.34 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 433.91 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 433.43 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 474.10 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 474.06 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 474.29 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 474.10 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 473.95 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 455.22 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.10 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 454.55 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 454.71 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 454.86 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 429.08 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 429.08 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 429.42 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 429.12 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 429.59 MB/s
libc memcpy:
4096 bytes page aligned: 2698.97 MB/s
4096 bytes page aligned: 2703.85 MB/s
4096 bytes page aligned: 2706.42 MB/s
4096 bytes page aligned: 2701.26 MB/s
4096 bytes page aligned: 2699.65 MB/s
kernel memcpy (original):
4096 bytes page aligned: 2735.92 MB/s
4096 bytes page aligned: 2735.76 MB/s
4096 bytes page aligned: 2739.53 MB/s
4096 bytes page aligned: 2737.95 MB/s
4096 bytes page aligned: 2735.23 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 2016.76 MB/s
4096 bytes page aligned: 2015.85 MB/s
4096 bytes page aligned: 2016.87 MB/s
4096 bytes page aligned: 2015.99 MB/s
4096 bytes page aligned: 2018.49 MB/s
kernel copy_page (original):
4096 bytes page aligned: 1533.05 MB/s
4096 bytes page aligned: 1533.36 MB/s
4096 bytes page aligned: 1533.81 MB/s
4096 bytes page aligned: 1533.62 MB/s
4096 bytes page aligned: 1533.05 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 2016.48 MB/s
4096 bytes page aligned: 2019.79 MB/s
4096 bytes page aligned: 2016.49 MB/s
4096 bytes page aligned: 2017.68 MB/s
4096 bytes page aligned: 2018.23 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 640.12 MB/s
Mixed from 1 to 1023 (power law), unaligned: 640.23 MB/s
Mixed from 1 to 1023 (power law), unaligned: 640.13 MB/s
Mixed from 1 to 1023 (power law), unaligned: 640.34 MB/s
Mixed from 1 to 1023 (power law), unaligned: 640.36 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 681.11 MB/s
Mixed from 1 to 1023 (power law), unaligned: 680.79 MB/s
Mixed from 1 to 1023 (power law), unaligned: 681.19 MB/s
Mixed from 1 to 1023 (power law), unaligned: 680.93 MB/s
Mixed from 1 to 1023 (power law), unaligned: 681.05 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 645.50 MB/s
Mixed from 1 to 1023 (power law), unaligned: 644.98 MB/s
Mixed from 1 to 1023 (power law), unaligned: 645.10 MB/s
Mixed from 1 to 1023 (power law), unaligned: 644.91 MB/s
Mixed from 1 to 1023 (power law), unaligned: 645.03 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1246.47 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1246.77 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1246.49 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1246.87 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1246.58 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1609.02 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1612.50 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1612.66 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1614.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1609.93 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1744.85 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1747.18 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1748.65 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1745.03 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1745.42 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1509.51 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1510.41 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1509.70 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1508.00 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1508.73 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1615.44 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1617.76 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1612.05 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1616.54 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1610.91 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 735.51 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 735.65 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 735.62 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 735.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 735.83 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 884.22 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 884.39 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 884.11 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 885.90 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 884.09 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1025.79 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1025.70 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1025.98 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1025.56 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1025.59 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 831.09 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 830.34 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 830.77 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 830.50 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 830.64 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 919.83 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 920.16 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 919.50 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 919.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 920.02 MB/s
libc memset:
4096 bytes page aligned: 2789.85 MB/s
4096 bytes page aligned: 2790.47 MB/s
4096 bytes page aligned: 2789.64 MB/s
4096 bytes page aligned: 2790.60 MB/s
4096 bytes page aligned: 2789.42 MB/s
kernel memset (original):
4096 bytes page aligned: 4292.31 MB/s
4096 bytes page aligned: 4292.19 MB/s
4096 bytes page aligned: 4291.39 MB/s
4096 bytes page aligned: 4291.91 MB/s
4096 bytes page aligned: 4291.29 MB/s
kernel memset (optimized):
4096 bytes page aligned: 4321.51 MB/s
4096 bytes page aligned: 4319.98 MB/s
4096 bytes page aligned: 4321.53 MB/s
4096 bytes page aligned: 4319.93 MB/s
4096 bytes page aligned: 4321.46 MB/s
kernel memzero (original):
4096 bytes page aligned: 4243.19 MB/s
4096 bytes page aligned: 4242.35 MB/s
4096 bytes page aligned: 4243.32 MB/s
4096 bytes page aligned: 4242.29 MB/s
4096 bytes page aligned: 4243.34 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 4261.67 MB/s
4096 bytes page aligned: 4262.59 MB/s
4096 bytes page aligned: 4262.13 MB/s
4096 bytes page aligned: 4262.75 MB/s
4096 bytes page aligned: 4262.62 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 1084.53 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1084.89 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1084.61 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1084.71 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1084.43 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 1364.45 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1363.67 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1364.87 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1364.47 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1364.17 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1508.02 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1510.44 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1508.57 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1508.86 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1510.14 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 1261.52 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1261.24 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1262.57 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1260.26 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1261.35 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1412.76 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1412.17 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1413.32 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1412.77 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1413.13 MB/s


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-13 16:48 ` Dr. David Alan Gilbert
@ 2013-07-13 21:13   ` Harm Hanemaaijer
  2013-07-15 13:15     ` Catalin Marinas
  2013-07-14 11:19   ` Harm Hanemaaijer
  1 sibling, 1 reply; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-13 21:13 UTC (permalink / raw)
  To: linux-arm-kernel

Dr. David Alan Gilbert <gilbertd <at> treblig.org> writes:

> 
> You might like to compare with some of the routines at:
> https://launchpad.net/cortex-strings
> and some of the numbers at:
> https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/

That's interesting. I had looked at cortex-strings before but didn't
dig into it, partly because its benchmark program seemed to be limited
in scope. From the Linaro numbers it seems NEON isn't always a win,
especially on newer Cortex platforms, and there is large variability
across different platforms/cores.

> 
> http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html
> 
> is an interesting article on one machine being screwed over by
> video bandwidth.

I have the same type of device (the Cortex A8 which I've tested on).
Running a 1920x1080 screen at 32bpp does indeed cost a lot of
bandwidth (it's 500MB/s of scanout bandwidth). I think this applies to
most devices except higher-end ones with a 64-bit DRAM interface.
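
As a rough check of that figure, the arithmetic can be written as a small C sketch. The 60 Hz refresh rate is an assumption (it is not stated above), and the function name is hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second the display controller must read from DRAM to refresh
 * the screen. refresh_hz is an assumed parameter; the mail does not give it. */
static uint64_t scanout_bytes_per_sec(uint32_t width, uint32_t height,
                                      uint32_t bits_per_pixel,
                                      uint32_t refresh_hz)
{
    /* Only whole-byte pixel formats are handled in this sketch. */
    assert(bits_per_pixel % 8 == 0);
    return (uint64_t)width * height * (bits_per_pixel / 8) * refresh_hz;
}
```

With 1920x1080 at 32bpp and the assumed 60 Hz, this gives 1920 * 1080 * 4 * 60 = 497,664,000 bytes per second, which is consistent with the 500MB/s quoted above.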

> I've only had a brief scan through your code, one thing I remember
> from a couple of years ago was a theory that ldrd/strd was supposed
> to be faster on A15's (but I never had a chance to try it out).

I briefly experimented with ldrd/strd; it seemed to be fast but
highly dependent on proper (64-bit) alignment. In my current code
it is only used in Thumb2 mode in one spot.

> Maybe neon is worth a try these days (although be careful of platforms
> like Tegra 2 that don't have it); there was a recent patch that enabled
> use in the kernel (I think for some RAID use). The downside is it's
> supposed to be quite power hungry.

Although I don't have experience with NEON, there seems to be a lot of
variability across platforms/cores when using it for memcpy, and it may
have extra overhead when used in the kernel. I will look at it in more
detail, but not using NEON does make things easier (not having to detect
NEON, being compatible with older platforms etc).

Thanks for the comments.


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-13 17:24 ` Willy Tarreau
@ 2013-07-13 21:51   ` Harm Hanemaaijer
  2013-07-14  6:13     ` Willy Tarreau
  0 siblings, 1 reply; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-13 21:51 UTC (permalink / raw)
  To: linux-arm-kernel

Willy Tarreau <w <at> 1wt.eu> writes:

> OK I've run bench.script on the following platforms :

Thanks, that's incredibly helpful!

Note that Thumb2 mode usually doesn't do much in synthetic benchmarks,
because the benchmark code will fit into the L1 instruction cache; the
benefit of Thumb2 happens in real-world usage when the active code
footprint becomes larger.

To summarize, memset seems to be in good shape and also the "fast path"
for common word-aligned memcpy of size <= 256 seems to be working well.

However, the copy_page and memcpy results for larger sizes seem to suggest
that the prefetch strategy isn't working well on these platforms. Note also
that on the quad core the existing copy_page is also highly sub-optimal.

Fixing the preload strategy for these platforms may simply be a case of
changing the configurable constant PREFETCH_DISTANCE from 3 to 2 (from an
offset of 192 bytes to 128 bytes), which more closely mimics the original
kernel memcpy. I have added PREFETCH_DISTANCE as a configurable parameter
in the Makefile in the latest version of test-arm-kernel-memcpy. It will
be interesting to see the results of testing with a PREFETCH_DISTANCE
of 2 especially on the quad-core platform or a similar one.
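
To make the meaning of the parameter concrete, here is a minimal C sketch of a copy loop with a configurable prefetch distance. It is not the assembly from the patch set. The 64-byte unit is inferred from the numbers above (3 corresponds to 192 bytes), and PREFETCH_UNIT and copy_with_prefetch are hypothetical names. It relies on the GCC/Clang builtin __builtin_prefetch, which is only a hint to the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 3   /* in 64-byte units: 3 -> 192 bytes ahead */
#endif
#define PREFETCH_UNIT 64      /* hypothetical name; implied by 192 / 3 */

/* Copy n bytes in PREFETCH_UNIT-sized chunks, issuing a prefetch hint
 * PREFETCH_DISTANCE chunks ahead of the current read position.
 * n must be a multiple of PREFETCH_UNIT in this sketch. */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t ahead = (size_t)PREFETCH_DISTANCE * PREFETCH_UNIT;

    assert(n % PREFETCH_UNIT == 0);
    for (size_t off = 0; off < n; off += PREFETCH_UNIT) {
        /* Skip the hint near the end so the address stays inside the source. */
        if (n - off > ahead)
            __builtin_prefetch(s + off + ahead);
        memcpy(d + off, s + off, PREFETCH_UNIT);
    }
}
```

In this sketch, building with -DPREFETCH_DISTANCE=2 moves the hint from 192 to 128 bytes ahead of the read position, which is the change proposed above.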


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-13 21:51   ` Harm Hanemaaijer
@ 2013-07-14  6:13     ` Willy Tarreau
  2013-07-14 11:00       ` Harm Hanemaaijer
  0 siblings, 1 reply; 18+ messages in thread
From: Willy Tarreau @ 2013-07-14  6:13 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Sat, Jul 13, 2013 at 09:51:18PM +0000, Harm Hanemaaijer wrote:
> Willy Tarreau <w <at> 1wt.eu> writes:
> 
> > OK I've run bench.script on the following platforms :
> 
> Thanks, that's incredibly helpful!
> 
> Note that Thumb2 mode usually doesn't do much in synthetic benchmarks,
> because the benchmark code will fit into the L1 instruction cache; the
> benefit of Thumb2 happens in real-world usage when the active code
> footprint becomes larger.
> 
> To summarize, memset seems to be in good shape and also the "fast path"
> for common word-aligned memcpy of size <= 256 seems to be working well.
> 
> However, the copy_page and memcpy results for larger sizes seem to suggest
> that the prefetch strategy isn't working well on these platforms. Note also
> that on the quad core the existing copy_page is also highly sub-optimal.
> 
> Fixing the preload strategy for these platforms may simply be a case of
> changing the configurable constant PREFETCH_DISTANCE from 3 to 2 (from an
> offset of 192 bytes to 128 bytes), which more closely mimics the original
> kernel memcpy. I have added PREFETCH_DISTANCE as a configurable parameter
> in the Makefile in the latest version of test-arm-kernel-memcpy. It will
> be interesting to see the results of testing with a PREFETCH_DISTANCE
> of 2 especially on the quad-core platform or a similar one.

No problem, I ran it on the two in armv7+thumb mode again.

Please find the results attached. It seems that memcpy improved by 0.8%
though that's not even certain.

Regards,
Willy

-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.97 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.98 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.96 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.88 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.63 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 955.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 955.36 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 955.71 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 955.41 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 955.66 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 850.25 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 850.26 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 850.16 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 849.91 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 850.27 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 454.00 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 457.50 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 453.22 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 456.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 454.23 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 508.77 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 508.95 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 509.26 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 509.19 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 509.46 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 523.20 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 523.22 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 523.31 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 523.09 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 523.62 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 389.04 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 388.08 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 387.82 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 387.74 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 387.92 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 429.52 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 430.19 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 430.10 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 430.02 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 429.45 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 473.75 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 474.00 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 473.59 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 473.24 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 473.65 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 452.37 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 452.11 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 452.91 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 451.84 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 452.71 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 427.17 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 427.11 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 426.57 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 426.67 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 427.11 MB/s
libc memcpy:
4096 bytes page aligned: 2703.64 MB/s
4096 bytes page aligned: 2702.35 MB/s
4096 bytes page aligned: 2705.23 MB/s
4096 bytes page aligned: 2702.31 MB/s
4096 bytes page aligned: 2703.18 MB/s
kernel memcpy (original):
4096 bytes page aligned: 2735.75 MB/s
4096 bytes page aligned: 2736.98 MB/s
4096 bytes page aligned: 2739.54 MB/s
4096 bytes page aligned: 2736.56 MB/s
4096 bytes page aligned: 2735.81 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 2019.77 MB/s
4096 bytes page aligned: 2019.01 MB/s
4096 bytes page aligned: 2019.78 MB/s
4096 bytes page aligned: 2019.88 MB/s
4096 bytes page aligned: 2018.68 MB/s
kernel copy_page (original):
4096 bytes page aligned: 1533.13 MB/s
4096 bytes page aligned: 1532.51 MB/s
4096 bytes page aligned: 1534.12 MB/s
4096 bytes page aligned: 1532.53 MB/s
4096 bytes page aligned: 1533.16 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 2012.66 MB/s
4096 bytes page aligned: 2013.76 MB/s
4096 bytes page aligned: 2013.53 MB/s
4096 bytes page aligned: 2013.34 MB/s
4096 bytes page aligned: 2013.62 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 641.26 MB/s
Mixed from 1 to 1023 (power law), unaligned: 641.16 MB/s
Mixed from 1 to 1023 (power law), unaligned: 640.95 MB/s
Mixed from 1 to 1023 (power law), unaligned: 641.30 MB/s
Mixed from 1 to 1023 (power law), unaligned: 640.65 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 677.55 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.50 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.51 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 676.69 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 660.80 MB/s
Mixed from 1 to 1023 (power law), unaligned: 660.89 MB/s
Mixed from 1 to 1023 (power law), unaligned: 660.50 MB/s
Mixed from 1 to 1023 (power law), unaligned: 660.72 MB/s
Mixed from 1 to 1023 (power law), unaligned: 661.12 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1241.64 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1242.02 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1241.66 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1241.32 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1241.57 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1603.86 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1608.36 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1605.22 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1606.88 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1606.02 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1733.22 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1729.46 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1737.01 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1734.14 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1733.59 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1509.90 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1507.44 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1508.64 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1508.11 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1505.42 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1616.59 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1616.74 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1617.85 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1613.74 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1621.71 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 742.55 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 742.68 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 742.64 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 742.52 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 742.60 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 893.16 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 893.35 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 893.18 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 893.45 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 893.39 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1028.50 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1028.49 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1028.30 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1028.37 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1028.22 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 839.00 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 838.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 839.01 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 838.93 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 838.96 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 930.07 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 930.04 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 930.11 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 930.09 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 930.08 MB/s
libc memset:
4096 bytes page aligned: 2787.64 MB/s
4096 bytes page aligned: 2788.50 MB/s
4096 bytes page aligned: 2788.44 MB/s
4096 bytes page aligned: 2788.39 MB/s
4096 bytes page aligned: 2788.18 MB/s
kernel memset (original):
4096 bytes page aligned: 4285.78 MB/s
4096 bytes page aligned: 4286.76 MB/s
4096 bytes page aligned: 4285.85 MB/s
4096 bytes page aligned: 4286.59 MB/s
4096 bytes page aligned: 4285.58 MB/s
kernel memset (optimized):
4096 bytes page aligned: 4314.98 MB/s
4096 bytes page aligned: 4314.69 MB/s
4096 bytes page aligned: 4314.15 MB/s
4096 bytes page aligned: 4314.67 MB/s
4096 bytes page aligned: 4313.65 MB/s
kernel memzero (original):
4096 bytes page aligned: 4242.90 MB/s
4096 bytes page aligned: 4241.60 MB/s
4096 bytes page aligned: 4242.77 MB/s
4096 bytes page aligned: 4241.56 MB/s
4096 bytes page aligned: 4243.05 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 4265.52 MB/s
4096 bytes page aligned: 4264.31 MB/s
4096 bytes page aligned: 4265.14 MB/s
4096 bytes page aligned: 4264.22 MB/s
4096 bytes page aligned: 4265.74 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 1083.33 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1083.76 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1083.22 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1083.63 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1083.44 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 1361.29 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1362.14 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1361.44 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1362.91 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1361.52 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1511.68 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1511.65 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1512.21 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1512.55 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1512.37 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 1259.19 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1259.69 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1260.27 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1259.07 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1260.15 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1410.53 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1410.31 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1410.48 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1408.95 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1412.63 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 944.18 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 943.83 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 944.12 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 943.90 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 944.20 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.62 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.90 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.98 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.64 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1000.03 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 869.93 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.49 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.24 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.35 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 870.49 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 505.38 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 505.22 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 505.65 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 505.57 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 505.54 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.00 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 540.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.01 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 541.03 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 549.25 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 549.45 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 549.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 549.20 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 549.48 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 425.16 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.82 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.51 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.70 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 425.59 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 458.28 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.62 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 459.25 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 458.18 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 459.43 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 501.98 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 502.06 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 501.65 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 502.31 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 502.14 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 484.64 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 484.08 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 483.97 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.09 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 485.96 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 455.69 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.98 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.98 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 455.97 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 457.07 MB/s
libc memcpy:
4096 bytes page aligned: 2739.85 MB/s
4096 bytes page aligned: 2738.74 MB/s
4096 bytes page aligned: 2739.70 MB/s
4096 bytes page aligned: 2738.93 MB/s
4096 bytes page aligned: 2739.83 MB/s
kernel memcpy (original):
4096 bytes page aligned: 2770.15 MB/s
4096 bytes page aligned: 2772.07 MB/s
4096 bytes page aligned: 2771.84 MB/s
4096 bytes page aligned: 2770.57 MB/s
4096 bytes page aligned: 2771.75 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 2016.25 MB/s
4096 bytes page aligned: 2017.41 MB/s
4096 bytes page aligned: 2017.92 MB/s
4096 bytes page aligned: 2019.81 MB/s
4096 bytes page aligned: 2016.19 MB/s
kernel copy_page (original):
4096 bytes page aligned: 1537.52 MB/s
4096 bytes page aligned: 1537.46 MB/s
4096 bytes page aligned: 1536.99 MB/s
4096 bytes page aligned: 1537.60 MB/s
4096 bytes page aligned: 1536.97 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 2032.28 MB/s
4096 bytes page aligned: 2031.33 MB/s
4096 bytes page aligned: 2032.23 MB/s
4096 bytes page aligned: 2032.35 MB/s
4096 bytes page aligned: 2031.26 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 678.17 MB/s
Mixed from 1 to 1023 (power law), unaligned: 677.84 MB/s
Mixed from 1 to 1023 (power law), unaligned: 678.13 MB/s
Mixed from 1 to 1023 (power law), unaligned: 678.03 MB/s
Mixed from 1 to 1023 (power law), unaligned: 678.14 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 706.55 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.16 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.71 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 706.90 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 691.01 MB/s
Mixed from 1 to 1023 (power law), unaligned: 691.40 MB/s
Mixed from 1 to 1023 (power law), unaligned: 691.07 MB/s
Mixed from 1 to 1023 (power law), unaligned: 691.55 MB/s
Mixed from 1 to 1023 (power law), unaligned: 691.35 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1279.54 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1280.04 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1279.75 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1279.82 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1279.46 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1700.89 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1699.79 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1699.45 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1699.46 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1699.12 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1859.00 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1855.05 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1857.88 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1858.97 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1855.57 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1603.50 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1603.51 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1602.76 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1603.89 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1604.60 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1653.52 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1652.73 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1654.63 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1652.44 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1654.76 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 777.78 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 777.85 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 777.78 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 777.86 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 777.86 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 966.31 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 966.26 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 966.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 966.31 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 966.12 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1161.60 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1161.58 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1161.33 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1161.54 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 1161.27 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 912.78 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 912.68 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 912.72 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 912.83 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 912.75 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 978.47 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 978.58 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 978.63 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 978.51 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 977.65 MB/s
libc memset:
4096 bytes page aligned: 2809.19 MB/s
4096 bytes page aligned: 2809.15 MB/s
4096 bytes page aligned: 2809.19 MB/s
4096 bytes page aligned: 2808.39 MB/s
4096 bytes page aligned: 2809.20 MB/s
kernel memset (original):
4096 bytes page aligned: 4286.67 MB/s
4096 bytes page aligned: 4287.73 MB/s
4096 bytes page aligned: 4287.69 MB/s
4096 bytes page aligned: 4287.50 MB/s
4096 bytes page aligned: 4287.77 MB/s
kernel memset (optimized):
4096 bytes page aligned: 4332.86 MB/s
4096 bytes page aligned: 4333.92 MB/s
4096 bytes page aligned: 4332.87 MB/s
4096 bytes page aligned: 4333.86 MB/s
4096 bytes page aligned: 4332.81 MB/s
kernel memzero (original):
4096 bytes page aligned: 4286.77 MB/s
4096 bytes page aligned: 4286.73 MB/s
4096 bytes page aligned: 4285.68 MB/s
4096 bytes page aligned: 4286.65 MB/s
4096 bytes page aligned: 4285.85 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 4308.08 MB/s
4096 bytes page aligned: 4307.07 MB/s
4096 bytes page aligned: 4308.18 MB/s
4096 bytes page aligned: 4307.95 MB/s
4096 bytes page aligned: 4306.85 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 1156.13 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1156.08 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1156.25 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1156.23 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1156.31 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 1491.20 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1491.11 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1491.80 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1491.44 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1491.66 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1690.43 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1691.03 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1693.37 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1691.31 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1691.96 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 1364.67 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1365.10 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1364.98 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1365.15 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1365.25 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 1475.90 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1476.30 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1476.07 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1476.49 MB/s
Mixed from 1 to 1023 (power law), unaligned: 1476.28 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 652.61 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 649.67 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 652.72 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 649.61 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 652.57 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 673.87 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 677.13 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 677.32 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 677.41 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 677.17 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 662.60 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 663.56 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 659.15 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 664.26 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 659.52 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 364.58 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 364.71 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 362.93 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 364.58 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.00 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 382.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 380.45 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 382.24 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 380.23 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 382.24 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 424.01 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 421.91 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 423.94 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 421.65 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 423.90 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 311.50 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 312.98 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 311.42 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 312.96 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 312.97 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 327.64 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 329.20 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 327.67 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 329.21 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 327.65 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 367.15 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 365.31 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 367.18 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 367.12 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 365.37 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 365.11 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 363.52 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 365.17 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 363.37 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 365.18 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 368.24 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 368.29 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 368.23 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 366.48 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 368.24 MB/s
libc memcpy:
4096 bytes page aligned: 358.42 MB/s
4096 bytes page aligned: 360.12 MB/s
4096 bytes page aligned: 358.39 MB/s
4096 bytes page aligned: 360.09 MB/s
4096 bytes page aligned: 358.45 MB/s
kernel memcpy (original):
4096 bytes page aligned: 360.40 MB/s
4096 bytes page aligned: 358.72 MB/s
4096 bytes page aligned: 360.39 MB/s
4096 bytes page aligned: 358.79 MB/s
4096 bytes page aligned: 360.46 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 342.08 MB/s
4096 bytes page aligned: 343.69 MB/s
4096 bytes page aligned: 341.96 MB/s
4096 bytes page aligned: 343.70 MB/s
4096 bytes page aligned: 342.10 MB/s
kernel copy_page (original):
4096 bytes page aligned: 386.91 MB/s
4096 bytes page aligned: 385.04 MB/s
4096 bytes page aligned: 386.90 MB/s
4096 bytes page aligned: 385.13 MB/s
4096 bytes page aligned: 386.90 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 341.49 MB/s
4096 bytes page aligned: 343.25 MB/s
4096 bytes page aligned: 343.26 MB/s
4096 bytes page aligned: 343.20 MB/s
4096 bytes page aligned: 343.12 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 514.14 MB/s
Mixed from 1 to 1023 (power law), unaligned: 515.74 MB/s
Mixed from 1 to 1023 (power law), unaligned: 514.14 MB/s
Mixed from 1 to 1023 (power law), unaligned: 515.79 MB/s
Mixed from 1 to 1023 (power law), unaligned: 514.18 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 540.90 MB/s
Mixed from 1 to 1023 (power law), unaligned: 537.63 MB/s
Mixed from 1 to 1023 (power law), unaligned: 539.82 MB/s
Mixed from 1 to 1023 (power law), unaligned: 540.33 MB/s
Mixed from 1 to 1023 (power law), unaligned: 537.00 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 540.31 MB/s
Mixed from 1 to 1023 (power law), unaligned: 537.17 MB/s
Mixed from 1 to 1023 (power law), unaligned: 540.38 MB/s
Mixed from 1 to 1023 (power law), unaligned: 539.03 MB/s
Mixed from 1 to 1023 (power law), unaligned: 542.41 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.70 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.56 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 877.40 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 881.52 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.65 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 958.99 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 954.36 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 959.20 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 958.94 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.30 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1004.01 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.36 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 1004.03 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 999.32 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 925.38 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 925.25 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 920.83 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 925.23 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 920.99 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 933.68 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 929.32 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 933.83 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 933.73 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 933.68 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 521.29 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 518.76 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 521.32 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 518.80 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 521.31 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 588.12 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 590.97 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 591.00 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 588.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 590.94 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 645.02 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 648.18 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 645.16 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 648.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 648.04 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.18 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.19 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 566.41 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.04 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 566.44 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 587.84 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 585.04 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 587.75 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 587.79 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 585.07 MB/s
libc memset:
4096 bytes page aligned: 2052.96 MB/s
4096 bytes page aligned: 2042.84 MB/s
4096 bytes page aligned: 2052.52 MB/s
4096 bytes page aligned: 2043.01 MB/s
4096 bytes page aligned: 2052.58 MB/s
kernel memset (original):
4096 bytes page aligned: 1912.63 MB/s
4096 bytes page aligned: 1922.23 MB/s
4096 bytes page aligned: 1921.84 MB/s
4096 bytes page aligned: 1912.60 MB/s
4096 bytes page aligned: 1921.86 MB/s
kernel memset (optimized):
4096 bytes page aligned: 1892.39 MB/s
4096 bytes page aligned: 1901.32 MB/s
4096 bytes page aligned: 1892.51 MB/s
4096 bytes page aligned: 1901.22 MB/s
4096 bytes page aligned: 1901.58 MB/s
kernel memzero (original):
4096 bytes page aligned: 1920.75 MB/s
4096 bytes page aligned: 1920.38 MB/s
4096 bytes page aligned: 1911.56 MB/s
4096 bytes page aligned: 1920.81 MB/s
4096 bytes page aligned: 1911.45 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 1928.78 MB/s
4096 bytes page aligned: 1919.76 MB/s
4096 bytes page aligned: 1928.75 MB/s
4096 bytes page aligned: 1929.09 MB/s
4096 bytes page aligned: 1919.61 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 785.51 MB/s
Mixed from 1 to 1023 (power law), unaligned: 781.66 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.54 MB/s
Mixed from 1 to 1023 (power law), unaligned: 781.71 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.41 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 816.79 MB/s
Mixed from 1 to 1023 (power law), unaligned: 820.37 MB/s
Mixed from 1 to 1023 (power law), unaligned: 820.29 MB/s
Mixed from 1 to 1023 (power law), unaligned: 817.25 MB/s
Mixed from 1 to 1023 (power law), unaligned: 820.35 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 880.18 MB/s
Mixed from 1 to 1023 (power law), unaligned: 884.47 MB/s
Mixed from 1 to 1023 (power law), unaligned: 880.03 MB/s
Mixed from 1 to 1023 (power law), unaligned: 884.15 MB/s
Mixed from 1 to 1023 (power law), unaligned: 884.00 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 797.30 MB/s
Mixed from 1 to 1023 (power law), unaligned: 800.99 MB/s
Mixed from 1 to 1023 (power law), unaligned: 797.06 MB/s
Mixed from 1 to 1023 (power law), unaligned: 800.49 MB/s
Mixed from 1 to 1023 (power law), unaligned: 797.08 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 813.62 MB/s
Mixed from 1 to 1023 (power law), unaligned: 813.55 MB/s
Mixed from 1 to 1023 (power law), unaligned: 813.41 MB/s
Mixed from 1 to 1023 (power law), unaligned: 813.81 MB/s
Mixed from 1 to 1023 (power law), unaligned: 809.52 MB/s
-------------- next part --------------
libc memcpy:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 628.06 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 623.94 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 626.71 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 623.43 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 627.13 MB/s
kernel memcpy (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 657.41 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 661.00 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 660.91 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 659.46 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 661.87 MB/s
kernel memcpy (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 657.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 661.33 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 659.10 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 662.16 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 658.66 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 332.21 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 330.70 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 332.24 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 332.27 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 330.55 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.62 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 361.89 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.65 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 361.77 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 363.54 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 397.26 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 399.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 397.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 399.11 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 399.11 MB/s
libc memcpy:
Mixed multiples of 4 from 4 to 130, word aligned: 292.31 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 292.31 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 290.92 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 292.26 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 290.86 MB/s
kernel memcpy (original):
Mixed multiples of 4 from 4 to 130, word aligned: 311.41 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 309.88 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 311.35 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 309.86 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 311.41 MB/s
kernel memcpy (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 343.87 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 343.89 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 343.85 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 342.24 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 343.91 MB/s
kernel copy_from_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 336.13 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 337.70 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.16 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 337.76 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.12 MB/s
kernel copy_to_user (optimized):
Mixed multiples of 4 from 4 to 130, word aligned: 336.24 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 334.60 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.29 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.30 MB/s
Mixed multiples of 4 from 4 to 130, word aligned: 336.28 MB/s
libc memcpy:
4096 bytes page aligned: 350.93 MB/s
4096 bytes page aligned: 350.87 MB/s
4096 bytes page aligned: 350.86 MB/s
4096 bytes page aligned: 349.12 MB/s
4096 bytes page aligned: 350.82 MB/s
kernel memcpy (original):
4096 bytes page aligned: 349.41 MB/s
4096 bytes page aligned: 351.20 MB/s
4096 bytes page aligned: 349.45 MB/s
4096 bytes page aligned: 351.11 MB/s
4096 bytes page aligned: 349.44 MB/s
kernel memcpy (optimized):
4096 bytes page aligned: 335.77 MB/s
4096 bytes page aligned: 334.08 MB/s
4096 bytes page aligned: 335.69 MB/s
4096 bytes page aligned: 334.18 MB/s
4096 bytes page aligned: 335.80 MB/s
kernel copy_page (original):
4096 bytes page aligned: 376.23 MB/s
4096 bytes page aligned: 377.99 MB/s
4096 bytes page aligned: 376.22 MB/s
4096 bytes page aligned: 378.12 MB/s
4096 bytes page aligned: 376.26 MB/s
kernel copy_page (optimized):
4096 bytes page aligned: 335.23 MB/s
4096 bytes page aligned: 333.74 MB/s
4096 bytes page aligned: 335.35 MB/s
4096 bytes page aligned: 333.73 MB/s
4096 bytes page aligned: 335.24 MB/s
libc memcpy:
Mixed from 1 to 1023 (power law), unaligned: 491.15 MB/s
Mixed from 1 to 1023 (power law), unaligned: 494.03 MB/s
Mixed from 1 to 1023 (power law), unaligned: 491.42 MB/s
Mixed from 1 to 1023 (power law), unaligned: 493.73 MB/s
Mixed from 1 to 1023 (power law), unaligned: 493.67 MB/s
kernel memcpy (original):
Mixed from 1 to 1023 (power law), unaligned: 511.36 MB/s
Mixed from 1 to 1023 (power law), unaligned: 511.31 MB/s
Mixed from 1 to 1023 (power law), unaligned: 508.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 510.07 MB/s
Mixed from 1 to 1023 (power law), unaligned: 508.48 MB/s
kernel memcpy (optimized):
Mixed from 1 to 1023 (power law), unaligned: 504.81 MB/s
Mixed from 1 to 1023 (power law), unaligned: 502.20 MB/s
Mixed from 1 to 1023 (power law), unaligned: 504.56 MB/s
Mixed from 1 to 1023 (power law), unaligned: 502.11 MB/s
Mixed from 1 to 1023 (power law), unaligned: 504.76 MB/s
libc memset:
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 848.27 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 848.05 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 848.22 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 844.06 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 848.15 MB/s
kernel memset (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 904.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 908.54 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 904.19 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 908.48 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 903.71 MB/s
kernel memset (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 950.89 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 951.03 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 946.37 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 950.95 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 946.38 MB/s
kernel memzero (original):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 861.66 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 857.97 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 861.77 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 857.91 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 861.79 MB/s
kernel memzero (optimized):
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 895.24 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 895.20 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 895.13 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 890.91 MB/s
Mixed powers of 2 from 4 to 4096 (power law), word aligned: 895.07 MB/s
libc memset:
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 501.37 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 503.81 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 501.35 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 503.73 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 501.30 MB/s
kernel memset (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.17 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.07 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.06 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 566.40 MB/s
kernel memset (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 621.23 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 618.26 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 621.15 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 618.15 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 621.22 MB/s
kernel memzero (original):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 535.10 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 537.69 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 537.67 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 535.13 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 537.73 MB/s
kernel memzero (optimized):
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 566.99 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.74 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.10 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 569.83 MB/s
Mixed multiples of 4 from 4 to 1024 (power law), word aligned: 567.03 MB/s
libc memset:
4096 bytes page aligned: 2041.83 MB/s
4096 bytes page aligned: 2032.34 MB/s
4096 bytes page aligned: 2042.07 MB/s
4096 bytes page aligned: 2042.09 MB/s
4096 bytes page aligned: 2031.88 MB/s
kernel memset (original):
4096 bytes page aligned: 1922.09 MB/s
4096 bytes page aligned: 1912.70 MB/s
4096 bytes page aligned: 1922.13 MB/s
4096 bytes page aligned: 1912.52 MB/s
4096 bytes page aligned: 1921.78 MB/s
kernel memset (optimized):
4096 bytes page aligned: 1913.71 MB/s
4096 bytes page aligned: 1923.03 MB/s
4096 bytes page aligned: 1913.67 MB/s
4096 bytes page aligned: 1922.56 MB/s
4096 bytes page aligned: 1923.01 MB/s
kernel memzero (original):
4096 bytes page aligned: 1888.00 MB/s
4096 bytes page aligned: 1897.21 MB/s
4096 bytes page aligned: 1887.74 MB/s
4096 bytes page aligned: 1896.99 MB/s
4096 bytes page aligned: 1887.97 MB/s
kernel memzero (optimized):
4096 bytes page aligned: 1898.35 MB/s
4096 bytes page aligned: 1888.97 MB/s
4096 bytes page aligned: 1897.97 MB/s
4096 bytes page aligned: 1889.20 MB/s
4096 bytes page aligned: 1898.33 MB/s
libc memset:
Mixed from 1 to 1023 (power law), unaligned: 735.51 MB/s
Mixed from 1 to 1023 (power law), unaligned: 732.16 MB/s
Mixed from 1 to 1023 (power law), unaligned: 735.44 MB/s
Mixed from 1 to 1023 (power law), unaligned: 731.94 MB/s
Mixed from 1 to 1023 (power law), unaligned: 735.37 MB/s
kernel memset (original):
Mixed from 1 to 1023 (power law), unaligned: 782.22 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.91 MB/s
Mixed from 1 to 1023 (power law), unaligned: 782.22 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.91 MB/s
Mixed from 1 to 1023 (power law), unaligned: 785.99 MB/s
kernel memset (optimized):
Mixed from 1 to 1023 (power law), unaligned: 818.63 MB/s
Mixed from 1 to 1023 (power law), unaligned: 818.80 MB/s
Mixed from 1 to 1023 (power law), unaligned: 815.12 MB/s
Mixed from 1 to 1023 (power law), unaligned: 818.64 MB/s
Mixed from 1 to 1023 (power law), unaligned: 814.92 MB/s
kernel memzero (original):
Mixed from 1 to 1023 (power law), unaligned: 748.04 MB/s
Mixed from 1 to 1023 (power law), unaligned: 745.01 MB/s
Mixed from 1 to 1023 (power law), unaligned: 748.67 MB/s
Mixed from 1 to 1023 (power law), unaligned: 744.85 MB/s
Mixed from 1 to 1023 (power law), unaligned: 748.90 MB/s
kernel memzero (optimized):
Mixed from 1 to 1023 (power law), unaligned: 784.81 MB/s
Mixed from 1 to 1023 (power law), unaligned: 781.09 MB/s
Mixed from 1 to 1023 (power law), unaligned: 784.40 MB/s
Mixed from 1 to 1023 (power law), unaligned: 780.62 MB/s
Mixed from 1 to 1023 (power law), unaligned: 784.59 MB/s


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14  6:13     ` Willy Tarreau
@ 2013-07-14 11:00       ` Harm Hanemaaijer
  2013-07-14 13:09         ` Russell King - ARM Linux
  2013-07-14 15:21         ` Siarhei Siamashka
  0 siblings, 2 replies; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-14 11:00 UTC (permalink / raw)
  To: linux-arm-kernel

Willy Tarreau <w <at> 1wt.eu> writes:

> 
> Please find the results attached. It seems that memcpy improved by 0.8%
> though that's not even certain.
> 

What is interesting is that
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Caccifbd.html,
and several other sources (such as other
optimized memcpy implementations) document the cache line size of the Cortex
A9 as 32 bytes, which is an anomaly in the armv7 family. However, it looks
like the kernel is defining L1_CACHE_BYTES as 64 (L1_CACHE_SHIFT == 6) for
all armv7 platforms, which looks like a serious configuring error for Cortex
A9.

This explains why the large size memcpy results that you posted are not
optimal, and also explains the below-par copy_page performance in the current
kernel implementation, because copy_page uses L1_CACHE_BYTES to determine the
preload strategy, while the current memcpy doesn't (it is hardcoded for
L1_CACHE_BYTES of 32).
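
To make the dependency concrete, here is a minimal userspace sketch of a copy loop whose preload stride is tied to a build-time line size. It is an illustration only, not the kernel's copy_page or memcpy: LINE_BYTES, PRELOAD_LINES and copy_with_preload are hypothetical names, and __builtin_prefetch is the GCC/Clang builtin, used here as a stand-in for a pld instruction. With one hint per assumed line, setting LINE_BYTES to twice the real line size would leave every other real line without a hint.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tuning knobs for this sketch; they are not kernel symbols. */
#define LINE_BYTES     32u  /* line size the loop assumes at build time */
#define PRELOAD_LINES  3u   /* how many lines ahead of the copy to preload */

/* Copy n bytes, issuing one preload hint per assumed cache line. */
static void *copy_with_preload(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= LINE_BYTES) {
        /* Only hint at addresses that are still inside the source buffer. */
        if (n > PRELOAD_LINES * LINE_BYTES)
            __builtin_prefetch(s + PRELOAD_LINES * LINE_BYTES, 0, 0);
        memcpy(d, s, LINE_BYTES);   /* stands in for a load/store burst */
        d += LINE_BYTES;
        s += LINE_BYTES;
        n -= LINE_BYTES;
    }
    memcpy(d, s, n);                /* tail shorter than one line */
    return dst;
}
```

Varying PRELOAD_LINES and LINE_BYTES in a benchmark is one way to check which preload distance suits a given core.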

This merits further investigation, and there might potentially be other
kernel issues for Cortex A9 (including performance) related to this.

To confirm, does running 'zcat /proc/config.gz| grep L1_CACHE_SHIFT' on a
Cortex A9 show CONFIG_ARM_L1_CACHE_SHIFT defined as 6?


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-13 16:48 ` Dr. David Alan Gilbert
  2013-07-13 21:13   ` Harm Hanemaaijer
@ 2013-07-14 11:19   ` Harm Hanemaaijer
  2013-07-14 11:32     ` Dr. David Alan Gilbert
  2013-07-14 11:37     ` Ard Biesheuvel
  1 sibling, 2 replies; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-14 11:19 UTC (permalink / raw)
  To: linux-arm-kernel

Dr. David Alan Gilbert <gilbertd <at> treblig.org> writes:
> 
> Maybe neon is worth a try these days (although be careful of platforms
> like Tegra 2 that doesn't have it); there was a recent patch that enabled
> use in the kernel (I think for some RAID use). The downside is it's
> supposed to be quite power hungry.
> 

As it turns out, NEON isn't too hard to implement. I have added NEON support
to copy_page, memset, memzero, and memcpy (both for the aligned and unaligned
case) in my userspace testing environment. It gives a nice boost (ranging
from 10% for copy_page to >30% for unaligned memcpy on a Cortex A8), which
can potentially be more on other cores. Although I have not tested a live
kernel yet, it looks like NEON can be used fairly transparently #ifdefed on
the CONFIG_NEON kernel definition as long as only the lower end of the
NEON/vfp register file is clobbered (although this needs verification).


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 11:19   ` Harm Hanemaaijer
@ 2013-07-14 11:32     ` Dr. David Alan Gilbert
  2013-07-14 11:37     ` Ard Biesheuvel
  1 sibling, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2013-07-14 11:32 UTC (permalink / raw)
  To: linux-arm-kernel

* Harm Hanemaaijer (fgenfb at yahoo.com) wrote:
> Dr. David Alan Gilbert <gilbertd <at> treblig.org> writes:
> > 
> > Maybe neon is worth a try these days (although be careful of platforms
> > like Tegra 2 that doesn't have it); there was a recent patch that enabled
> > use in the kernel (I think for some RAID use). The downside is it's
> > supposed to be quite power hungry.
> > 
> 
> As it turns out, NEON isn't too hard to implement. I have added NEON support
> to copy_page, memset, memzero, and memcpy (both for the aligned and unaligned
> case) in my userspace testing environment. It gives a nice boost (ranging
> from 10% for copy_page to >30% for unaligned memcpy on a Cortex A8), which
> can potentially be more on other cores.

What size memcpys is that on?  If I remember correctly, the A8 happens to be
able to do very fast Neon accesses to its cache, but that doesn't help outside
the cache, and it doesn't give any benefit on the A9.

> Although I have not tested a live
> kernel yet, it looks like NEON can be used fairly transparently #ifdefed on
> the CONFIG_NEON kernel definition as long as only the lower end of the
> NEON/vfp register file is clobbered (although this needs verification).

Hmm, I'd assumed some save/restore work would be needed, and given that
copy_to_ etc. get used everywhere, I'd be careful.

Dave
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\ gro.gilbert @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 11:19   ` Harm Hanemaaijer
  2013-07-14 11:32     ` Dr. David Alan Gilbert
@ 2013-07-14 11:37     ` Ard Biesheuvel
  2013-07-14 13:13       ` Russell King - ARM Linux
  2013-07-14 13:33       ` Harm Hanemaaijer
  1 sibling, 2 replies; 18+ messages in thread
From: Ard Biesheuvel @ 2013-07-14 11:37 UTC (permalink / raw)
  To: linux-arm-kernel

On 14 July 2013 13:19, Harm Hanemaaijer <fgenfb@yahoo.com> wrote:
> Dr. David Alan Gilbert <gilbertd <at> treblig.org> writes:
>>
>> Maybe neon is worth a try these days (although be careful of platforms
>> like Tegra 2 that doesn't have it); there was a recent patch that enabled
>> use in the kernel (I think for some RAID use). The downside is it's
>> supposed to be quite power hungry.
>>
>
> As it turns out, NEON isn't too hard to implement. I have added NEON support
> to copy_page, memset, memzero, and memcpy (both for the aligned and unaligned
> case) in my userspace testing environment. It gives a nice boost (ranging
> from 10% for copy_page to >30% for unaligned memcpy on a Cortex A8), which
> can potentially be more on other cores. Although I have not tested a live
> kernel yet, it looks like NEON can be used fairly transparently #ifdefed on
> the CONFIG_NEON kernel definition as long as only the lower end of the
> NEON/vfp register file is clobbered (although this needs verification).
>

You will clobber the userland NEON contents of the register file if
you don't preserve them properly. Also, kernel preemption (if enabled)
may put your task to sleep at any time, and the context switching
machinery is totally oblivious of NEON being used in the kernel, so
the kernel side will get corrupted as well in this case.

I have a patch series pending (i.e., accepted but not pulled yet by
Russell) which addresses these issues.



--
Ard.


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 11:00       ` Harm Hanemaaijer
@ 2013-07-14 13:09         ` Russell King - ARM Linux
  2013-07-14 13:59           ` Harm Hanemaaijer
  2013-07-14 15:21         ` Siarhei Siamashka
  1 sibling, 1 reply; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-07-14 13:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jul 14, 2013 at 11:00:50AM +0000, Harm Hanemaaijer wrote:
> What is interesting is that
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Caccifbd.html,
> and several other sources (such as other
> optimized memcpy implementations) document the cache line size of the Cortex
> A9 as 32 bytes, which is an anomaly in the armv7 family. However, it looks
> like the kernel is defining L1_CACHE_BYTES as 64 (L1_CACHE_SHIFT == 6) for
> all armv7 platforms, which looks like a serious configuring error for Cortex
> A9.

You're making wrong assumptions about what L1_CACHE_BYTES is.

Firstly, L1_CACHE_BYTES is not dynamic - it's a build time constant.
You have to make a decision what value it is to be set to when you
build the kernel.  This is because it gets used to determine the
alignment of structures built into the kernel image, amongst other
things, and we can't dynamically relink the kernel at boot time.

So please, get out of your mind any idea that L1_CACHE_BYTES somehow
relates to the exact CPU you're running on.  It doesn't.

What it relates to is the *maximum* cache line size of *any* CPU that
we will run on.

Take a moment to think about why given the above.  If you're booting on
a 32 byte cache line CPU, will a structure aligned for a 64 byte cache
line also be aligned for a 32-byte cache line?  How about the reverse
case?

Now, there are various ARMv7 Cortex CPUs that have 64 byte cache lines
out there in the wild - Cortex A8 and Cortex A15 are two examples, both
of them are ARMv7 CPUs.

As we can't distinguish at run time between these, and we are working towards
a single zImage kernel, we have to assume that ARMv7 means a 64 byte cache
line as far as the L1_CACHE_* constants are concerned.  Yes, we used to
set it for OMAP3 and some Samsung SoCs too, but then others came along, and
single zImage too - and all of that makes trying to reduce it down to the
minimum rather pointless.

So, no, this is *not* a "serious configuring error" at all.  It is totally
intended to be this way.
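
The alignment question above can be checked with a few lines of userspace C. This is only a sketch: BUILD_LINE_BYTES is a hypothetical stand-in for L1_CACHE_BYTES, sample_counters is an invented structure, and the aligned attribute is the GCC/Clang extension. An object aligned to 64 bytes is always aligned to 32 bytes as well, while a 32-byte-aligned address such as 96 is not 64-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the build-time L1_CACHE_BYTES constant. */
#define BUILD_LINE_BYTES 64

/* Invented structure, aligned to the build-time line size. */
struct sample_counters {
    long hits;
    long misses;
} __attribute__((aligned(BUILD_LINE_BYTES)));

/* True if addr is a multiple of line; line must be a power of two. */
static bool is_line_aligned(uintptr_t addr, uintptr_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (addr & (line - 1)) == 0;
}
```

For any object `c` of type `struct sample_counters`, `is_line_aligned((uintptr_t)&c, 32)` and `is_line_aligned((uintptr_t)&c, 64)` both hold, whereas `is_line_aligned(96, 64)` is false even though `is_line_aligned(96, 32)` is true.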


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 11:37     ` Ard Biesheuvel
@ 2013-07-14 13:13       ` Russell King - ARM Linux
  2013-07-14 13:33       ` Harm Hanemaaijer
  1 sibling, 0 replies; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-07-14 13:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jul 14, 2013 at 01:37:44PM +0200, Ard Biesheuvel wrote:
> On 14 July 2013 13:19, Harm Hanemaaijer <fgenfb@yahoo.com> wrote:
> > Dr. David Alan Gilbert <gilbertd <at> treblig.org> writes:
> >>
> >> Maybe neon is worth a try these days (although be careful of platforms
> >> like Tegra 2 that doesn't have it); there was a recent patch that enabled
> >> use in the kernel (I think for some RAID use). The downside is it's
> >> supposed to be quite power hungry.
> >>
> >
> > As it turns out, NEON isn't too hard to implement. I have added NEON support
> > to copy_page, memset, memzero, and memcpy (both for the aligned and unaligned
> > case) in my userspace testing environment. It gives a nice boost (ranging
> > from 10% for copy_page to >30% for unaligned memcpy on a Cortex A8), which
> > can potentially be more on other cores. Although I have not tested a live
> > kernel yet, it looks like NEON can be used fairly transparently #ifdefed on
> > the CONFIG_NEON kernel definition as long as only the lower end of the
> > NEON/vfp register file is clobbered (although this needs verification).
> >
> 
> You will clobber the userland NEON contents of the register file if
> you don't preserve them properly. Also, kernel preemption (if enabled)
> may put your task to sleep at any time, and the context switching
> machinery is totally oblivious of NEON being used in the kernel, so
> the kernel side will get corrupted as well in this case.

The other issue is - not every ARMv7 core has Neon, so this is going
to have to be something that is selected at runtime - which means
indirecting every memcpy/memset through a function pointer.

The final point is, don't forget that gcc will generate implicit calls
to memset/memcpy, and neon won't be available early in the kernel boot,
so you can't optimize those function pointers away.
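
A rough userspace sketch of what that indirection implies, with all names hypothetical (copy_generic, copy_simd_standin, select_copy and dispatch_copy are not kernel functions). The pointer has to start out at a routine that is safe on every core and before any feature detection has run, and it can only be switched later. The "SIMD" routine here simply calls memcpy as a placeholder, since real NEON code would not run on every machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline routine: usable on every core and from the start of boot. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Placeholder for a NEON routine; it just defers to memcpy here. */
static void *copy_simd_standin(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Starts at the safe default, so early callers never see a NULL pointer. */
static copy_fn active_copy = copy_generic;

/* Called once feature detection has run (hypothetical hook). */
static void select_copy(int cpu_has_simd)
{
    active_copy = cpu_has_simd ? copy_simd_standin : copy_generic;
}

/* Every caller pays for this indirect call, including compiler-made ones. */
static void *dispatch_copy(void *dst, const void *src, size_t n)
{
    assert(active_copy != NULL);
    return active_copy(dst, src, n);
}
```

The cost being pointed at is the indirect call in dispatch_copy, which is paid on every copy regardless of which routine ends up selected.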


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 11:37     ` Ard Biesheuvel
  2013-07-14 13:13       ` Russell King - ARM Linux
@ 2013-07-14 13:33       ` Harm Hanemaaijer
  2013-07-14 14:09         ` Ard Biesheuvel
  1 sibling, 1 reply; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-14 13:33 UTC (permalink / raw)
  To: linux-arm-kernel

Ard Biesheuvel <ard.biesheuvel <at> linaro.org> writes:

> 
> You will clobber the userland NEON contents of the register file if
> you don't preserve them properly. Also, kernel preemption (if enabled)
> may put your task to sleep at any time, and the context switching
> machinery is totally oblivious of NEON being used in the kernel, so
> the kernel side will get corrupted as well in this case.
> 
> I have a patch series pending (i.e., accepted but not pulled yet by
> Russell) which addresses these issues.
> 

That was what I was afraid of concerning NEON. It must be tricky to solve
without sacrificing performance, since saving/restoring the entire NEON
register file would obviously seriously impact context switch performance.
For memcpy-like applications, basically only four dword registers are
required (d0-d3) which could possibly be optimized for.


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 13:09         ` Russell King - ARM Linux
@ 2013-07-14 13:59           ` Harm Hanemaaijer
  0 siblings, 0 replies; 18+ messages in thread
From: Harm Hanemaaijer @ 2013-07-14 13:59 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux <linux <at> arm.linux.org.uk> writes:

> 
> You're making wrong assumptions about what L1_CACHE_BYTES is.

Thanks for the clarification. I have been focused too much on the concept
of a kernel image customized for a single device.

I can see how having to support multiple platforms with a single kernel
image makes things more difficult, especially when trying to optimize for
something. I will have to think about how to manage this when trying to
optimize memcpy-related functions.


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 13:33       ` Harm Hanemaaijer
@ 2013-07-14 14:09         ` Ard Biesheuvel
  2013-07-14 14:32           ` Russell King - ARM Linux
  0 siblings, 1 reply; 18+ messages in thread
From: Ard Biesheuvel @ 2013-07-14 14:09 UTC (permalink / raw)
  To: linux-arm-kernel

On 14 July 2013 15:33, Harm Hanemaaijer <fgenfb@yahoo.com> wrote:
> Ard Biesheuvel <ard.biesheuvel <at> linaro.org> writes:
>
>>
>> You will clobber the userland NEON contents of the register file if
>> you don't preserve them properly. Also, kernel preemption (if enabled)
>> may put your task to sleep at any time, and the context switching
>> machinery is totally oblivious of NEON being used in the kernel, so
>> the kernel side will get corrupted as well in this case.
>>
>> I have a patch series pending (i.e., accepted but not pulled yet by
>> Russell) which addresses these issues.
>>
>
> That was what I was afraid of concerning NEON. It must be tricky to solve
> without sacrificing performance, since saving/restoring the entire NEON
> register file would obviously seriously impact context switch performance.
> For memcpy-like applications, basically only four dword registers are
> required (d0-d3) which could possibly be optimized for.
>

Well, the whole lazy preserve/restore mechanism is based on the
premise that preserve/restore is only required when multiple users are
contending for the NEON (or in the SMP case, when a task gets migrated
to another CPU). As we will not be allowing NEON in interrupt context
nor in a preemptible section, the burden of the more costly context
switches should not grow disproportionately, even if tasks may be
contending for the NEON with themselves in a way (userland vs kernel).
However, it also means that a NEON based memcpy() is going to be
problematic, not only for the reasons pointed out by Russell, but also
because you will need a fallback to use from interrupt context.

Perhaps for sufficiently large sizes, it makes sense to take the hit
of testing whether NEON is allowable at that particular moment, and
doing the preserve in that case. In the end, the numbers should speak
for themselves: if you manage a considerable speedup in a real-world
case, and no deterioration in others, people are usually quite
receptive.
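
As a sketch of that decision only, assuming a made-up break-even size and a made-up context descriptor (SIMD_MIN_BYTES, struct exec_ctx and choose_path are hypothetical, and the threshold would have to come from measurement):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical break-even size; the real value would have to be measured. */
#define SIMD_MIN_BYTES 512u

enum copy_path { PATH_INTEGER, PATH_SIMD };

/* Hypothetical description of where the copy is being called from. */
struct exec_ctx {
    bool in_interrupt;   /* NEON is not allowed from interrupt context */
    bool cpu_has_simd;   /* not every ARMv7 core has NEON */
};

/* Decide which routine a copy of n bytes should take. */
static enum copy_path choose_path(const struct exec_ctx *ctx, size_t n)
{
    assert(ctx != NULL);

    /* Small copies: the check and the register preserve cost more than they save. */
    if (n < SIMD_MIN_BYTES)
        return PATH_INTEGER;

    /* Fall back whenever NEON is absent or not allowed right now. */
    if (!ctx->cpu_has_simd || ctx->in_interrupt)
        return PATH_INTEGER;

    /* A kernel version would preserve the NEON state around this path. */
    return PATH_SIMD;
}
```

The integer path stays mandatory in this scheme, because it is the only one that is valid from every context.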

-- 
Ard.


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 14:09         ` Ard Biesheuvel
@ 2013-07-14 14:32           ` Russell King - ARM Linux
  0 siblings, 0 replies; 18+ messages in thread
From: Russell King - ARM Linux @ 2013-07-14 14:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jul 14, 2013 at 04:09:20PM +0200, Ard Biesheuvel wrote:
> Well, the whole lazy preserve/restore mechanism is based on the
> premise that preserve/restore is only required when multiple users are
> contending for the NEON (or in the SMP case, when a task gets migrated
> to another CPU). As we will not be allowing NEON in interrupt context
> nor in a preemptible section, the burden of the more costly context
> switches should not grow disproportionately, even if tasks may be
> contending for the NEON with themselves in a way (userland vs kernel).
> However, it also means that a NEON based memcpy() is going to be
> problematic, not only for the reasons pointed out by Russell, but also
> because you will need a fallback to use from interrupt context.

There's another reason too: it would make memcpy() et al. non-preemptible
as well - that's probably fine for very small copies, but not for larger ones.
The acceptability threshold depends on how RT orientated you are and what
your application demands in terms of accuracy from the RT implementation.


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-14 11:00       ` Harm Hanemaaijer
  2013-07-14 13:09         ` Russell King - ARM Linux
@ 2013-07-14 15:21         ` Siarhei Siamashka
  1 sibling, 0 replies; 18+ messages in thread
From: Siarhei Siamashka @ 2013-07-14 15:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 14 Jul 2013 11:00:50 +0000 (UTC)
Harm Hanemaaijer <fgenfb@yahoo.com> wrote:

> Willy Tarreau <w <at> 1wt.eu> writes:
> 
> > 
> > Please find the results attached. It seems that memcpy improved by 0.8%
> > though that's not even certain.
> > 
> 
> What is interesting is that
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Caccifbd.html,
> and several other sources (such as other
> optimized memcpy implementations) document the cache line size of the Cortex
> A9 as 32 bytes, which is an anomaly in the armv7 family.

Yes, the cache line size is 32 bytes in Cortex-A9. However, in order to
mitigate poor memory bandwidth utilization, the L2 cache
controller implements a 'double linefill' feature:

    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/CHDHIECI.html

But 'double linefill' first appeared in the r3p0 revision of the L2C-310
L2 cache controller (also known as PL310) and was a bit buggy in
revisions older than r3p2, according to the errata list:

    http://infocenter.arm.com/help/topic/com.arm.doc.uan0013b/index.html

This makes double linefill usable only in modern Cortex-A9 based SoCs
such as Exynos4412, and unfortunately not in the older Cortex-A9 based
systems.

When double linefill is enabled, two cache lines are allocated at once
in L2, so for memcpy-like workloads it looks somewhat similar to a
real 64 byte cache line size. Welcome to the diverse world of ARM
hardware :)

-- 
Best regards,
Siarhei Siamashka


* Call for testing/opinions: Optimized memset/memcpy
  2013-07-13 21:13   ` Harm Hanemaaijer
@ 2013-07-15 13:15     ` Catalin Marinas
  0 siblings, 0 replies; 18+ messages in thread
From: Catalin Marinas @ 2013-07-15 13:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jul 13, 2013 at 10:13:12PM +0100, Harm Hanemaaijer wrote:
> Dr. David Alan Gilbert <gilbertd <at> treblig.org> writes:
> 
> > 
> > You might like to compare with some of the routines at:
> > https://launchpad.net/cortex-strings
> > and some of the numbers at:
> > https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/
> 
> That's interesting. I had looked at cortex-strings before but didn't
> dig into it, also because its benchmark program seemed to be limited in
> scope. From the Linaro numbers it seems NEON isn't always a win
> especially on newer Cortex platforms, with large variability across
> different platforms/cores.

As has been stated in this thread, we shouldn't use Neon for memcpy.
Saving/restoring the Neon registers carries significant overhead, and
there is the preemptibility problem as well.

But Cortex Strings is a good starting point and Linaro is going to port
some of these functions to the Linux kernel for ARMv8 (AArch64).

-- 
Catalin


Thread overview: 18+ messages
2013-07-13 15:51 Call for testing/opinions: Optimized memset/memcpy Harm Hanemaaijer
2013-07-13 16:48 ` Dr. David Alan Gilbert
2013-07-13 21:13   ` Harm Hanemaaijer
2013-07-15 13:15     ` Catalin Marinas
2013-07-14 11:19   ` Harm Hanemaaijer
2013-07-14 11:32     ` Dr. David Alan Gilbert
2013-07-14 11:37     ` Ard Biesheuvel
2013-07-14 13:13       ` Russell King - ARM Linux
2013-07-14 13:33       ` Harm Hanemaaijer
2013-07-14 14:09         ` Ard Biesheuvel
2013-07-14 14:32           ` Russell King - ARM Linux
2013-07-13 17:24 ` Willy Tarreau
2013-07-13 21:51   ` Harm Hanemaaijer
2013-07-14  6:13     ` Willy Tarreau
2013-07-14 11:00       ` Harm Hanemaaijer
2013-07-14 13:09         ` Russell King - ARM Linux
2013-07-14 13:59           ` Harm Hanemaaijer
2013-07-14 15:21         ` Siarhei Siamashka
