linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] riscv: optimize memcpy/memmove/memset
@ 2024-01-28 11:10 Jisheng Zhang
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou; +Cc: linux-riscv, linux-kernel

This series renews Matteo's "riscv: optimized mem* functions"
series.

Compared with Matteo's original series, Jisheng made the following changes:
1. adopt Emil's change to fix a boot failure when building with clang
2. add corresponding changes to purgatory
3. always build the optimized string.c rather than building it only when
optimizing for performance
4. implement unroll support when src & dst are both aligned, to keep
the same performance as the assembly version. After disassembling, I found
that the unrolled version looks something like below, so it achieves
the same "unroll" effect as the asm version, but in C:
	ld	t2,0(a5)
	ld	t0,8(a5)
	ld	t6,16(a5)
	ld	t5,24(a5)
	ld	t4,32(a5)
	ld	t3,40(a5)
	ld	t1,48(a5)
	ld	a1,56(a5)
	sd	t2,0(a6)
	sd	t0,8(a6)
	sd	t6,16(a6)
	sd	t5,24(a6)
	sd	t4,32(a6)
	sd	t3,40(a6)
	sd	t1,48(a6)
	sd	a1,56(a6)
And per my testing, unrolling more doesn't help performance, so
the C version only unrolls using 8 GP regs rather than the 16 used
by the asm version (see the C sketch after this list).
5. Add proper __pi_memcpy and __pi___memcpy aliases
6. more performance numbers.
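
For reference, the C loop behind item 4 looks roughly like the sketch
below (it mirrors __memcpy_aligned() added in patch 1; the name and the
standalone form here are only for illustration, with count assumed to
be a multiple of 8 longs):

	static void copy_aligned(unsigned long *dest,
				 const unsigned long *src, size_t count)
	{
		for (; count > 0; count -= 8 * sizeof(long)) {
			unsigned long d0 = src[0], d1 = src[1];
			unsigned long d2 = src[2], d3 = src[3];
			unsigned long d4 = src[4], d5 = src[5];
			unsigned long d6 = src[6], d7 = src[7];

			dest[0] = d0; dest[1] = d1; dest[2] = d2; dest[3] = d3;
			dest[4] = d4; dest[5] = d5; dest[6] = d6; dest[7] = d7;
			dest += 8;
			src += 8;
		}
	}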

Per my benchmarking with [1] on the TH1520, CV1800B and JH7110 platforms,
unaligned medium memcpy runs at about 3.5x ~ 8.6x the speed of the
unpatched version! Check patch 1 for more details and performance
numbers.

Link: https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1]

Here is the original cover letter msg from Matteo:
Replace the assembly mem{cpy,move,set} with C equivalents.

Try to access RAM with the largest bit width possible, but without
doing unaligned accesses.

A further improvement could be to use multiple reads and writes, as the
assembly version was trying to do.

Tested on a BeagleV Starlight with a SiFive U74 core, where the
improvement is noticeable.


Matteo Croce (3):
  riscv: optimized memcpy
  riscv: optimized memmove
  riscv: optimized memset

 arch/riscv/include/asm/string.h |  14 +-
 arch/riscv/kernel/riscv_ksyms.c |   6 -
 arch/riscv/lib/Makefile         |   9 +-
 arch/riscv/lib/memcpy.S         | 110 -----------
 arch/riscv/lib/memmove.S        | 317 --------------------------------
 arch/riscv/lib/memset.S         | 113 ------------
 arch/riscv/lib/string.c         | 187 +++++++++++++++++++
 arch/riscv/purgatory/Makefile   |  13 +-
 8 files changed, 206 insertions(+), 563 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 delete mode 100644 arch/riscv/lib/memmove.S
 delete mode 100644 arch/riscv/lib/memset.S
 create mode 100644 arch/riscv/lib/string.c

-- 
2.43.0



* [PATCH 1/3] riscv: optimized memcpy
  2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
@ 2024-01-28 11:10 ` Jisheng Zhang
  2024-01-28 12:35   ` David Laight
  2024-01-30 12:11   ` Nick Kossifidis
  2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Matteo Croce <mcroce@microsoft.com>

Write a C version of memcpy() which uses the biggest data size allowed,
without generating unaligned accesses.

The procedure is made of three steps:
First, copy data one byte at a time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at a time, shifting the current and the next
long to compose an aligned long at every cycle.
Finally, copy the remainder one byte at a time.
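
A worked example of the second step (based on the code added below;
64-bit little-endian, with the source 2 bytes past a long boundary,
so distance = 2 and BYTES_LONG = 8; the byte values are made up):

	last = 0x0807060504030201	/* source bytes 0..7  */
	next = 0x100f0e0d0c0b0a09	/* source bytes 8..15 */

	last >> (distance * 8) | next << ((BYTES_LONG - distance) * 8)
		= 0x0a09080706050403	/* source bytes 2..9  */

so every iteration composes one aligned destination long from the tail
of the current source long and the head of the next one.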

On a BeagleV, the TCP RX throughput increased by 45%:

before:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44840 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  76.4 MBytes   641 Mbits/sec   27    624 KBytes
[  5]   1.00-2.00   sec  72.5 MBytes   608 Mbits/sec    0    708 KBytes
[  5]   2.00-3.00   sec  73.8 MBytes   619 Mbits/sec   10    451 KBytes
[  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0    564 KBytes
[  5]   4.00-5.00   sec  73.8 MBytes   619 Mbits/sec    0    658 KBytes
[  5]   5.00-6.00   sec  73.8 MBytes   619 Mbits/sec   14    522 KBytes
[  5]   6.00-7.00   sec  73.8 MBytes   619 Mbits/sec    0    621 KBytes
[  5]   7.00-8.00   sec  72.5 MBytes   608 Mbits/sec    0    706 KBytes
[  5]   8.00-9.00   sec  73.8 MBytes   619 Mbits/sec   20    580 KBytes
[  5]   9.00-10.00  sec  73.8 MBytes   619 Mbits/sec    0    672 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   736 MBytes   618 Mbits/sec   71             sender
[  5]   0.00-10.01  sec   733 MBytes   615 Mbits/sec                  receiver

after:

$ iperf3 -c beaglev
Connecting to host beaglev, port 5201
[  5] local 192.168.85.6 port 44864 connected to 192.168.85.48 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   109 MBytes   912 Mbits/sec   48    559 KBytes
[  5]   1.00-2.00   sec   108 MBytes   902 Mbits/sec    0    690 KBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec   36    396 KBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0    567 KBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    0    699 KBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec   32    414 KBytes
[  5]   6.00-7.00   sec   106 MBytes   891 Mbits/sec    0    583 KBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0    708 KBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec   28    433 KBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   897 Mbits/sec  144             sender
[  5]   0.00-10.01  sec  1.04 GBytes   894 Mbits/sec                  receiver

The decrease in CPU time spent in memcpy() is observable with perf top.
This is the `perf top -Ue task-clock` output while running the test:

before:

Overhead  Shared O  Symbol
  42.22%  [kernel]  [k] memcpy
  35.00%  [kernel]  [k] __asm_copy_to_user
   3.50%  [kernel]  [k] sifive_l2_flush64_range
   2.30%  [kernel]  [k] stmmac_napi_poll_rx
   1.11%  [kernel]  [k] memset

after:

Overhead  Shared O  Symbol
  45.69%  [kernel]  [k] __asm_copy_to_user
  29.06%  [kernel]  [k] memcpy
   4.09%  [kernel]  [k] sifive_l2_flush64_range
   2.77%  [kernel]  [k] stmmac_napi_poll_rx
   1.24%  [kernel]  [k] memset

Compared with Matteo's original series, Jisheng made the following changes:
1. adopt Emil's change to fix a boot failure when building with clang
2. add corresponding changes to purgatory
3. always build the optimized string.c rather than building it only when
optimizing for performance
4. implement unroll support when src & dst are both aligned, to keep
the same performance as the assembly version. After disassembling, I found
that the unrolled version looks something like below, so it achieves
the same "unroll" effect as the asm version, but in C:
	ld	t2,0(a5)
	ld	t0,8(a5)
	ld	t6,16(a5)
	ld	t5,24(a5)
	ld	t4,32(a5)
	ld	t3,40(a5)
	ld	t1,48(a5)
	ld	a1,56(a5)
	sd	t2,0(a6)
	sd	t0,8(a6)
	sd	t6,16(a6)
	sd	t5,24(a6)
	sd	t4,32(a6)
	sd	t3,40(a6)
	sd	t1,48(a6)
	sd	a1,56(a6)
And per my testing, unrolling more doesn't help performance, so
the C version only unrolls using 8 GP regs rather than the 16 used
by the asm version.
5. Add proper __pi_memcpy and __pi___memcpy aliases
6. more performance numbers.

Jisheng's commit msg:
Using the benchmark program from [1], I got the results below on the
TH1520, CV1800B and JH7110 platforms.

*TH1520 platform (I fixed the cpu freq at 750MHz):

Before the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.52 64K: 0.43 128K: 0.38 256K: 0.35 512K: 0.31 1024K: 0.22 avg 0.34
memcpy_call 32K: 0.41 64K: 0.35 128K: 0.33 256K: 0.31 512K: 0.28 1024K: 0.20 avg 0.30

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.46 16B: 0.61 32B: 0.84 64B: 0.89 128B: 3.31 256B: 3.44 512B: 3.51
memcpy_call 8B: 0.18 16B: 0.26 32B: 0.50 64B: 0.90 128B: 1.57 256B: 2.31 512B: 2.92

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.19 16B: 0.18 32B: 0.25 64B: 0.30 128B: 0.33 256B: 0.35 512B: 0.36
memcpy_call 8B: 0.16 16B: 0.22 32B: 0.39 64B: 0.70 128B: 1.11 256B: 1.46 512B: 1.81

Large memcpy (bytes/ns):
   __memcpy 1K: 3.57 2K: 3.85 4K: 3.75 8K: 3.98 16K: 4.03 32K: 4.06 64K: 4.40
memcpy_call 1K: 3.13 2K: 3.75 4K: 3.99 8K: 4.29 16K: 4.40 32K: 4.46 64K: 4.63

After the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.32 64K: 0.28 128K: 0.26 256K: 0.24 512K: 0.22 1024K: 0.17 avg 0.24
memcpy_call 32K: 0.39 64K: 0.34 128K: 0.32 256K: 0.30 512K: 0.27 1024K: 0.20 avg 0.29

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.20 16B: 0.22 32B: 0.25 64B: 2.43 128B: 3.19 256B: 3.36 512B: 3.55
memcpy_call 8B: 0.18 16B: 0.24 32B: 0.46 64B: 0.88 128B: 1.53 256B: 2.30 512B: 2.92

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.22 16B: 0.29 32B: 0.49 64B: 0.51 128B: 0.87 256B: 1.08 512B: 1.27
memcpy_call 8B: 0.12 16B: 0.21 32B: 0.40 64B: 0.70 128B: 1.10 256B: 1.46 512B: 1.80

Large memcpy (bytes/ns):
   __memcpy 1K: 3.63 2K: 3.66 4K: 3.78 8K: 3.87 16K: 3.96 32K: 4.11 64K: 4.40
memcpy_call 1K: 3.32 2K: 3.68 4K: 3.99 8K: 4.17 16K: 4.25 32K: 4.48 64K: 4.60

As can be seen, the unaligned medium memcpy performance is improved by
about 252%, i.e. it runs at 3.5x the speed of the original. The
performance of the other memcpy cases is kept the same as the original's.

And since the TH1520 supports HAVE_EFFICIENT_UNALIGNED_ACCESS, we can
optimize memcpy further without taking care of alignment at all:
Random memcpy (bytes/ns):
   __memcpy 32K: 0.35 64K: 0.31 128K: 0.28 256K: 0.25 512K: 0.23 1024K: 0.17 avg 0.25
memcpy_call 32K: 0.40 64K: 0.35 128K: 0.33 256K: 0.31 512K: 0.27 1024K: 0.20 avg 0.30

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.21 16B: 0.23 32B: 0.27 64B: 3.34 128B: 3.42 256B: 3.50 512B: 3.58
memcpy_call 8B: 0.18 16B: 0.24 32B: 0.46 64B: 0.88 128B: 1.53 256B: 2.31 512B: 2.92

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.20 16B: 0.23 32B: 0.28 64B: 3.05 128B: 2.70 256B: 2.82 512B: 2.88
memcpy_call 8B: 0.16 16B: 0.21 32B: 0.38 64B: 0.70 128B: 1.11 256B: 1.50 512B: 1.81

Large memcpy (bytes/ns):
   __memcpy 1K: 3.62 2K: 3.71 4K: 3.76 8K: 3.92 16K: 3.96 32K: 4.12 64K: 4.40
memcpy_call 1K: 3.11 2K: 3.66 4K: 4.02 8K: 4.16 16K: 4.34 32K: 4.47 64K: 4.62

As can be seen, the unaligned medium memcpy is improved by 700%, i.e. it
runs at 8x the speed of the original.

*CV1800B platform:

Before the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.21 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08
memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.26 16B: 0.36 32B: 0.48 64B: 0.51 128B: 2.01 256B: 2.44 512B: 2.73
memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.11 16B: 0.12 32B: 0.15 64B: 0.16 128B: 0.16 256B: 0.17 512B: 0.17
memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77

Large memcpy (bytes/ns):
   __memcpy 1K: 2.90 2K: 2.91 4K: 3.00 8K: 3.04 16K: 3.03 32K: 2.89 64K: 2.52
memcpy_call 1K: 1.62 2K: 1.74 4K: 1.80 8K: 1.83 16K: 1.84 32K: 1.78 64K: 1.54

After the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.15 64K: 0.08 128K: 0.06 256K: 0.06 512K: 0.05 1024K: 0.05 avg 0.07
memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.11 16B: 0.11 32B: 0.14 64B: 1.15 128B: 1.62 256B: 2.06 512B: 2.40
memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.11 16B: 0.12 32B: 0.21 64B: 0.32 128B: 0.43 256B: 0.52 512B: 0.59
memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77

Large memcpy (bytes/ns):
   __memcpy 1K: 2.56 2K: 2.71 4K: 2.78 8K: 2.81 16K: 2.80 32K: 2.68 64K: 2.51
memcpy_call 1K: 1.62 2K: 1.74 4K: 1.80 8K: 1.83 16K: 1.84 32K: 1.78 64K: 1.54

We get a similar performance improvement as on the TH1520. And since the
CV1800B also supports HAVE_EFFICIENT_UNALIGNED_ACCESS, the performance
can be improved further:
Random memcpy (bytes/ns):
   __memcpy 32K: 0.15 64K: 0.08 128K: 0.07 256K: 0.06 512K: 0.05 1024K: 0.05 avg 0.07
memcpy_call 32K: 0.19 64K: 0.10 128K: 0.08 256K: 0.07 512K: 0.06 1024K: 0.06 avg 0.08

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.13 16B: 0.14 32B: 0.15 64B: 1.55 128B: 2.01 256B: 2.36 512B: 2.58
memcpy_call 8B: 0.10 16B: 0.18 32B: 0.33 64B: 0.59 128B: 0.90 256B: 1.21 512B: 1.47

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.13 16B: 0.14 32B: 0.15 64B: 1.06 128B: 1.26 256B: 1.39 512B: 1.46
memcpy_call 8B: 0.10 16B: 0.12 32B: 0.21 64B: 0.34 128B: 0.50 256B: 0.66 512B: 0.77

Large memcpy (bytes/ns):
   __memcpy 1K: 2.65 2K: 2.76 4K: 2.80 8K: 2.82 16K: 2.81 32K: 2.68 64K: 2.51
memcpy_call 1K: 1.63 2K: 1.74 4K: 1.80 8K: 1.84 16K: 1.84 32K: 1.78 64K: 1.54

Now the unaligned medium memcpy runs at 8.6x the speed of the original!

*JH7110 platform (I fixed cpufreq at 1.5GHz):

Before the patch:
Random memcpy (bytes/ns):
   __memcpy 32K: 0.45 64K: 0.40 128K: 0.36 256K: 0.33 512K: 0.33 1024K: 0.31 avg 0.36
memcpy_call 32K: 0.43 64K: 0.38 128K: 0.34 256K: 0.31 512K: 0.31 1024K: 0.29 avg 0.34

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.42 16B: 0.55 32B: 0.65 64B: 0.72 128B: 2.91 256B: 3.36 512B: 3.65
memcpy_call 8B: 0.16 16B: 0.36 32B: 0.67 64B: 1.14 128B: 1.70 256B: 2.26 512B: 2.71

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.17 16B: 0.18 32B: 0.19 64B: 0.19 128B: 0.19 256B: 0.20 512B: 0.20
memcpy_call 8B: 0.16 16B: 0.20 32B: 0.36 64B: 0.62 128B: 0.94 256B: 1.28 512B: 1.52

Large memcpy (bytes/ns):
   __memcpy 1K: 3.62 2K: 3.82 4K: 3.90 8K: 3.95 16K: 3.97 32K: 1.33 64K: 1.33
memcpy_call 1K: 2.93 2K: 3.14 4K: 3.25 8K: 3.31 16K: 3.19 32K: 1.27 64K: 1.28

After the patch:

Random memcpy (bytes/ns):
   __memcpy 32K: 0.26 64K: 0.24 128K: 0.23 256K: 0.22 512K: 0.22 1024K: 0.21 avg 0.23
memcpy_call 32K: 0.42 64K: 0.38 128K: 0.34 256K: 0.31 512K: 0.31 1024K: 0.29 avg 0.34

Aligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.17 16B: 0.17 32B: 0.18 64B: 1.94 128B: 2.56 256B: 3.04 512B: 3.36
memcpy_call 8B: 0.17 16B: 0.36 32B: 0.65 64B: 1.12 128B: 1.73 256B: 2.37 512B: 2.91

Unaligned medium memcpy (bytes/ns):
   __memcpy 8B: 0.17 16B: 0.24 32B: 0.41 64B: 0.63 128B: 0.85 256B: 1.00 512B: 1.14
memcpy_call 8B: 0.16 16B: 0.22 32B: 0.38 64B: 0.65 128B: 0.99 256B: 1.35 512B: 1.61

Large memcpy (bytes/ns):
   __memcpy 1K: 3.43 2K: 3.59 4K: 3.67 8K: 3.72 16K: 3.73 32K: 1.28 64K: 1.28
memcpy_call 1K: 3.21 2K: 3.46 4K: 3.60 8K: 3.68 16K: 3.51 32K: 1.27 64K: 1.28

As can be seen, the unaligned medium memcpy performance is improved by
about 470%, i.e. it runs at 5.7x the speed of the original. The
performance of the other memcpy cases is kept the same as the original's.

Link: https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1]

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Co-developed-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/riscv/include/asm/string.h |   6 +-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   7 +-
 arch/riscv/lib/memcpy.S         | 110 -----------------------------
 arch/riscv/lib/string.c         | 121 ++++++++++++++++++++++++++++++++
 arch/riscv/purgatory/Makefile   |  10 +--
 6 files changed, 136 insertions(+), 120 deletions(-)
 delete mode 100644 arch/riscv/lib/memcpy.S
 create mode 100644 arch/riscv/lib/string.c

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index a96b1fea24fe..edf1d56e4f13 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -12,9 +12,11 @@
 #define __HAVE_ARCH_MEMSET
 extern asmlinkage void *memset(void *, int, size_t);
 extern asmlinkage void *__memset(void *, int, size_t);
+
 #define __HAVE_ARCH_MEMCPY
-extern asmlinkage void *memcpy(void *, const void *, size_t);
-extern asmlinkage void *__memcpy(void *, const void *, size_t);
+extern void *memcpy(void *dest, const void *src, size_t count);
+extern void *__memcpy(void *dest, const void *src, size_t count);
+
 #define __HAVE_ARCH_MEMMOVE
 extern asmlinkage void *memmove(void *, const void *, size_t);
 extern asmlinkage void *__memmove(void *, const void *, size_t);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index a72879b4249a..c69dc74e0a27 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,11 +10,9 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strlen);
 EXPORT_SYMBOL(strncmp);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memcpy);
 EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index bd6e6c1b0497..5f2f94f6db17 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,10 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memcpy.o
 lib-y			+= memset.o
 lib-y			+= memmove.o
 lib-y			+= strcmp.o
 lib-y			+= strlen.o
+lib-y			+= string.o
 lib-y			+= strncmp.o
 lib-y			+= csum.o
 ifeq ($(CONFIG_MMU), y)
@@ -14,6 +14,11 @@ lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
 
+# string.o implements standard library functions like memset/memcpy etc.
+# Use -ffreestanding to ensure that the compiler does not try to "optimize"
+# them into calls to themselves.
+CFLAGS_string.o := -ffreestanding
+
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 lib-$(CONFIG_RISCV_ISA_V)	+= xor.o
 lib-$(CONFIG_RISCV_ISA_V)	+= riscv_v_helpers.o
diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
deleted file mode 100644
index 44e009ec5fef..000000000000
--- a/arch/riscv/lib/memcpy.S
+++ /dev/null
@@ -1,110 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memcpy(void *, const void *, size_t) */
-SYM_FUNC_START(__memcpy)
-	move t6, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
-
-	beqz a3, 2f  /* Skip if already aligned */
-	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
-	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
-1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
-
-	ret
-
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
-	ret
-SYM_FUNC_END(__memcpy)
-SYM_FUNC_ALIAS_WEAK(memcpy, __memcpy)
-SYM_FUNC_ALIAS(__pi_memcpy, __memcpy)
-SYM_FUNC_ALIAS(__pi___memcpy, __memcpy)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
new file mode 100644
index 000000000000..5f9c83ec548d
--- /dev/null
+++ b/arch/riscv/lib/string.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * String functions optimized for hardware which doesn't
+ * handle unaligned memory accesses efficiently.
+ *
+ * Copyright (C) 2021 Matteo Croce
+ */
+
+#include <linux/types.h>
+#include <linux/module.h>
+
+/* Minimum size for a word copy to be convenient */
+#define BYTES_LONG	sizeof(long)
+#define WORD_MASK	(BYTES_LONG - 1)
+#define MIN_THRESHOLD	(BYTES_LONG * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	const unsigned long *as_ulong;
+	const uintptr_t as_uptr;
+};
+
+static void __memcpy_aligned(unsigned long *dest, const unsigned long *src, size_t count)
+{
+	for (; count > 0; count -= BYTES_LONG * 8) {
+		register unsigned long d0, d1, d2, d3, d4, d5, d6, d7;
+		d0 = src[0];
+		d1 = src[1];
+		d2 = src[2];
+		d3 = src[3];
+		d4 = src[4];
+		d5 = src[5];
+		d6 = src[6];
+		d7 = src[7];
+		dest[0] = d0;
+		dest[1] = d1;
+		dest[2] = d2;
+		dest[3] = d3;
+		dest[4] = d4;
+		dest[5] = d5;
+		dest[6] = d6;
+		dest[7] = d7;
+		dest += 8;
+		src += 8;
+	}
+}
+
+void *__memcpy(void *dest, const void *src, size_t count)
+{
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (count < MIN_THRESHOLD)
+		goto copy_remainder;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		/* Copy a byte at time until destination is aligned. */
+		for (; d.as_uptr & WORD_MASK; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & WORD_MASK;
+	}
+
+	if (distance) {
+		unsigned long last, next;
+
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word align it
+		 * and shift data to compensate for distance, in order to do
+		 * word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = last >> (distance * 8) |
+					next << ((BYTES_LONG - distance) * 8);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * aligned copy.
+		 */
+		size_t aligned_count = count & ~(BYTES_LONG * 8 - 1);
+
+		__memcpy_aligned(d.as_ulong, s.as_ulong, aligned_count);
+		d.as_u8 += aligned_count;
+		s.as_u8 += aligned_count;
+		count &= BYTES_LONG * 8 - 1;
+	}
+
+copy_remainder:
+	while (count--)
+		*d.as_u8++ = *s.as_u8++;
+
+	return dest;
+}
+EXPORT_SYMBOL(__memcpy);
+
+void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy);
+EXPORT_SYMBOL(memcpy);
+void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
+void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
diff --git a/arch/riscv/purgatory/Makefile b/arch/riscv/purgatory/Makefile
index 280b0eb352b8..8b940ff04895 100644
--- a/arch/riscv/purgatory/Makefile
+++ b/arch/riscv/purgatory/Makefile
@@ -1,21 +1,21 @@
 # SPDX-License-Identifier: GPL-2.0
 OBJECT_FILES_NON_STANDARD := y
 
-purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memcpy.o memset.o
-purgatory-y += strcmp.o strlen.o strncmp.o
+purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memset.o
+purgatory-y += strcmp.o strlen.o strncmp.o riscv_string.o
 
 targets += $(purgatory-y)
 PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
 
+$(obj)/riscv_string.o: $(srctree)/arch/riscv/lib/string.c FORCE
+	$(call if_changed_rule,cc_o_c)
+
 $(obj)/string.o: $(srctree)/lib/string.c FORCE
 	$(call if_changed_rule,cc_o_c)
 
 $(obj)/ctype.o: $(srctree)/lib/ctype.c FORCE
 	$(call if_changed_rule,cc_o_c)
 
-$(obj)/memcpy.o: $(srctree)/arch/riscv/lib/memcpy.S FORCE
-	$(call if_changed_rule,as_o_S)
-
 $(obj)/memset.o: $(srctree)/arch/riscv/lib/memset.S FORCE
 	$(call if_changed_rule,as_o_S)
 
-- 
2.43.0



* [PATCH 2/3] riscv: optimized memmove
  2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
@ 2024-01-28 11:10 ` Jisheng Zhang
  2024-01-28 12:47   ` David Laight
  2024-01-30 11:39   ` Nick Kossifidis
  2024-01-28 11:10 ` [PATCH 3/3] riscv: optimized memset Jisheng Zhang
  2024-01-29 18:16 ` [PATCH 0/3] riscv: optimize memcpy/memmove/memset Conor Dooley
  3 siblings, 2 replies; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Matteo Croce <mcroce@microsoft.com>

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest data size possible.
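
A quick illustration of the overlapping case that still needs the
backward byte copy (hypothetical buffer, not taken from the patch):

	char buf[9] = "abcdefgh";

	/* dest > src and the regions overlap, so the bytes must be copied
	 * backwards and buf becomes "aabcdefg"; a naive forward byte copy
	 * would instead smear 'a' over the whole range.
	 */
	memmove(buf + 1, buf, 7);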

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
 arch/riscv/include/asm/string.h |   4 +-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memmove.S        | 317 --------------------------------
 arch/riscv/lib/string.c         |  25 +++
 5 files changed, 27 insertions(+), 322 deletions(-)
 delete mode 100644 arch/riscv/lib/memmove.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index edf1d56e4f13..17c3b40382e1 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -18,8 +18,8 @@ extern void *memcpy(void *dest, const void *src, size_t count);
 extern void *__memcpy(void *dest, const void *src, size_t count);
 
 #define __HAVE_ARCH_MEMMOVE
-extern asmlinkage void *memmove(void *, const void *, size_t);
-extern asmlinkage void *__memmove(void *, const void *, size_t);
+extern void *memmove(void *dest, const void *src, size_t count);
+extern void *__memmove(void *dest, const void *src, size_t count);
 
 #define __HAVE_ARCH_STRCMP
 extern asmlinkage int strcmp(const char *cs, const char *ct);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index c69dc74e0a27..76849d0906ef 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -10,9 +10,7 @@
  * Assembly functions that may be used (directly or indirectly) by modules
  */
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strlen);
 EXPORT_SYMBOL(strncmp);
 EXPORT_SYMBOL(__memset);
-EXPORT_SYMBOL(__memmove);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 5f2f94f6db17..5fa88c5a601c 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
 lib-y			+= memset.o
-lib-y			+= memmove.o
 lib-y			+= strcmp.o
 lib-y			+= strlen.o
 lib-y			+= string.o
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
deleted file mode 100644
index cb3e2e7ef0ba..000000000000
--- a/arch/riscv/lib/memmove.S
+++ /dev/null
@@ -1,317 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com>
- */
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-SYM_FUNC_START(__memmove)
-	/*
-	 * Returns
-	 *   a0 - dest
-	 *
-	 * Parameters
-	 *   a0 - Inclusive first byte of dest
-	 *   a1 - Inclusive first byte of src
-	 *   a2 - Length of copy n
-	 *
-	 * Because the return matches the parameter register a0,
-	 * we will not clobber or modify that register.
-	 *
-	 * Note: This currently only works on little-endian.
-	 * To port to big-endian, reverse the direction of shifts
-	 * in the 2 misaligned fixup copy loops.
-	 */
-
-	/* Return if nothing to do */
-	beq a0, a1, .Lreturn_from_memmove
-	beqz a2, .Lreturn_from_memmove
-
-	/*
-	 * Register Uses
-	 *      Forward Copy: a1 - Index counter of src
-	 *      Reverse Copy: a4 - Index counter of src
-	 *      Forward Copy: t3 - Index counter of dest
-	 *      Reverse Copy: t4 - Index counter of dest
-	 *   Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest
-	 *   Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest
-	 *   Both Copy Modes: t0 - Link / Temporary for load-store
-	 *   Both Copy Modes: t1 - Temporary for load-store
-	 *   Both Copy Modes: t2 - Temporary for load-store
-	 *   Both Copy Modes: a5 - dest to src alignment offset
-	 *   Both Copy Modes: a6 - Shift ammount
-	 *   Both Copy Modes: a7 - Inverse Shift ammount
-	 *   Both Copy Modes: a2 - Alternate breakpoint for unrolled loops
-	 */
-
-	/*
-	 * Solve for some register values now.
-	 * Byte copy does not need t5 or t6.
-	 */
-	mv   t3, a0
-	add  t4, a0, a2
-	add  a4, a1, a2
-
-	/*
-	 * Byte copy if copying less than (2 * SZREG) bytes. This can
-	 * cause problems with the bulk copy implementation and is
-	 * small enough not to bother.
-	 */
-	andi t0, a2, -(2 * SZREG)
-	beqz t0, .Lbyte_copy
-
-	/*
-	 * Now solve for t5 and t6.
-	 */
-	andi t5, t3, -SZREG
-	andi t6, t4, -SZREG
-	/*
-	 * If dest(Register t3) rounded down to the nearest naturally
-	 * aligned SZREG address, does not equal dest, then add SZREG
-	 * to find the low-bound of SZREG alignment in the dest memory
-	 * region.  Note that this could overshoot the dest memory
-	 * region if n is less than SZREG.  This is one reason why
-	 * we always byte copy if n is less than SZREG.
-	 * Otherwise, dest is already naturally aligned to SZREG.
-	 */
-	beq  t5, t3, 1f
-		addi t5, t5, SZREG
-	1:
-
-	/*
-	 * If the dest and src are co-aligned to SZREG, then there is
-	 * no need for the full rigmarole of a full misaligned fixup copy.
-	 * Instead, do a simpler co-aligned copy.
-	 */
-	xor  t0, a0, a1
-	andi t1, t0, (SZREG - 1)
-	beqz t1, .Lcoaligned_copy
-	/* Fall through to misaligned fixup copy */
-
-.Lmisaligned_fixup_copy:
-	bltu a1, a0, .Lmisaligned_fixup_copy_reverse
-
-.Lmisaligned_fixup_copy_forward:
-	jal  t0, .Lbyte_copy_until_aligned_forward
-
-	andi a5, a1, (SZREG - 1) /* Find the alignment offset of src (a1) */
-	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
-	sub  a5, a1, t3 /* Find the difference between src and dest */
-	andi a1, a1, -SZREG /* Align the src pointer */
-	addi a2, t6, SZREG /* The other breakpoint for the unrolled loop*/
-
-	/*
-	 * Compute The Inverse Shift
-	 * a7 = XLEN - a6 = XLEN + -a6
-	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
-	 * Add that to XLEN.  XLEN = SZREG * 8.
-	 */
-	not  a7, a6
-	addi a7, a7, (SZREG * 8 + 1)
-
-	/*
-	 * Fix Misalignment Copy Loop - Forward
-	 * load_val0 = load_ptr[0];
-	 * do {
-	 * 	load_val1 = load_ptr[1];
-	 * 	store_ptr += 2;
-	 * 	store_ptr[0 - 2] = (load_val0 >> {a6}) | (load_val1 << {a7});
-	 *
-	 * 	if (store_ptr == {a2})
-	 * 		break;
-	 *
-	 * 	load_val0 = load_ptr[2];
-	 * 	load_ptr += 2;
-	 * 	store_ptr[1 - 2] = (load_val1 >> {a6}) | (load_val0 << {a7});
-	 *
-	 * } while (store_ptr != store_ptr_end);
-	 * store_ptr = store_ptr_end;
-	 */
-
-	REG_L t0, (0 * SZREG)(a1)
-	1:
-	REG_L t1, (1 * SZREG)(a1)
-	addi  t3, t3, (2 * SZREG)
-	srl   t0, t0, a6
-	sll   t2, t1, a7
-	or    t2, t0, t2
-	REG_S t2, ((0 * SZREG) - (2 * SZREG))(t3)
-
-	beq   t3, a2, 2f
-
-	REG_L t0, (2 * SZREG)(a1)
-	addi  a1, a1, (2 * SZREG)
-	srl   t1, t1, a6
-	sll   t2, t0, a7
-	or    t2, t1, t2
-	REG_S t2, ((1 * SZREG) - (2 * SZREG))(t3)
-
-	bne   t3, t6, 1b
-	2:
-	mv    t3, t6 /* Fix the dest pointer in case the loop was broken */
-
-	add  a1, t3, a5 /* Restore the src pointer */
-	j .Lbyte_copy_forward /* Copy any remaining bytes */
-
-.Lmisaligned_fixup_copy_reverse:
-	jal  t0, .Lbyte_copy_until_aligned_reverse
-
-	andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */
-	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
-	sub  a5, a4, t4 /* Find the difference between src and dest */
-	andi a4, a4, -SZREG /* Align the src pointer */
-	addi a2, t5, -SZREG /* The other breakpoint for the unrolled loop*/
-
-	/*
-	 * Compute The Inverse Shift
-	 * a7 = XLEN - a6 = XLEN + -a6
-	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
-	 * Add that to XLEN.  XLEN = SZREG * 8.
-	 */
-	not  a7, a6
-	addi a7, a7, (SZREG * 8 + 1)
-
-	/*
-	 * Fix Misalignment Copy Loop - Reverse
-	 * load_val1 = load_ptr[0];
-	 * do {
-	 * 	load_val0 = load_ptr[-1];
-	 * 	store_ptr -= 2;
-	 * 	store_ptr[1] = (load_val0 >> {a6}) | (load_val1 << {a7});
-	 *
-	 * 	if (store_ptr == {a2})
-	 * 		break;
-	 *
-	 * 	load_val1 = load_ptr[-2];
-	 * 	load_ptr -= 2;
-	 * 	store_ptr[0] = (load_val1 >> {a6}) | (load_val0 << {a7});
-	 *
-	 * } while (store_ptr != store_ptr_end);
-	 * store_ptr = store_ptr_end;
-	 */
-
-	REG_L t1, ( 0 * SZREG)(a4)
-	1:
-	REG_L t0, (-1 * SZREG)(a4)
-	addi  t4, t4, (-2 * SZREG)
-	sll   t1, t1, a7
-	srl   t2, t0, a6
-	or    t2, t1, t2
-	REG_S t2, ( 1 * SZREG)(t4)
-
-	beq   t4, a2, 2f
-
-	REG_L t1, (-2 * SZREG)(a4)
-	addi  a4, a4, (-2 * SZREG)
-	sll   t0, t0, a7
-	srl   t2, t1, a6
-	or    t2, t0, t2
-	REG_S t2, ( 0 * SZREG)(t4)
-
-	bne   t4, t5, 1b
-	2:
-	mv    t4, t5 /* Fix the dest pointer in case the loop was broken */
-
-	add  a4, t4, a5 /* Restore the src pointer */
-	j .Lbyte_copy_reverse /* Copy any remaining bytes */
-
-/*
- * Simple copy loops for SZREG co-aligned memory locations.
- * These also make calls to do byte copies for any unaligned
- * data at their terminations.
- */
-.Lcoaligned_copy:
-	bltu a1, a0, .Lcoaligned_copy_reverse
-
-.Lcoaligned_copy_forward:
-	jal t0, .Lbyte_copy_until_aligned_forward
-
-	1:
-	REG_L t1, ( 0 * SZREG)(a1)
-	addi  a1, a1, SZREG
-	addi  t3, t3, SZREG
-	REG_S t1, (-1 * SZREG)(t3)
-	bne   t3, t6, 1b
-
-	j .Lbyte_copy_forward /* Copy any remaining bytes */
-
-.Lcoaligned_copy_reverse:
-	jal t0, .Lbyte_copy_until_aligned_reverse
-
-	1:
-	REG_L t1, (-1 * SZREG)(a4)
-	addi  a4, a4, -SZREG
-	addi  t4, t4, -SZREG
-	REG_S t1, ( 0 * SZREG)(t4)
-	bne   t4, t5, 1b
-
-	j .Lbyte_copy_reverse /* Copy any remaining bytes */
-
-/*
- * These are basically sub-functions within the function.  They
- * are used to byte copy until the dest pointer is in alignment.
- * At which point, a bulk copy method can be used by the
- * calling code.  These work on the same registers as the bulk
- * copy loops.  Therefore, the register values can be picked
- * up from where they were left and we avoid code duplication
- * without any overhead except the call in and return jumps.
- */
-.Lbyte_copy_until_aligned_forward:
-	beq  t3, t5, 2f
-	1:
-	lb   t1,  0(a1)
-	addi a1, a1, 1
-	addi t3, t3, 1
-	sb   t1, -1(t3)
-	bne  t3, t5, 1b
-	2:
-	jalr zero, 0x0(t0) /* Return to multibyte copy loop */
-
-.Lbyte_copy_until_aligned_reverse:
-	beq  t4, t6, 2f
-	1:
-	lb   t1, -1(a4)
-	addi a4, a4, -1
-	addi t4, t4, -1
-	sb   t1,  0(t4)
-	bne  t4, t6, 1b
-	2:
-	jalr zero, 0x0(t0) /* Return to multibyte copy loop */
-
-/*
- * Simple byte copy loops.
- * These will byte copy until they reach the end of data to copy.
- * At that point, they will call to return from memmove.
- */
-.Lbyte_copy:
-	bltu a1, a0, .Lbyte_copy_reverse
-
-.Lbyte_copy_forward:
-	beq  t3, t4, 2f
-	1:
-	lb   t1,  0(a1)
-	addi a1, a1, 1
-	addi t3, t3, 1
-	sb   t1, -1(t3)
-	bne  t3, t4, 1b
-	2:
-	ret
-
-.Lbyte_copy_reverse:
-	beq  t4, t3, 2f
-	1:
-	lb   t1, -1(a4)
-	addi a4, a4, -1
-	addi t4, t4, -1
-	sb   t1,  0(t4)
-	bne  t4, t3, 1b
-	2:
-
-.Lreturn_from_memmove:
-	ret
-
-SYM_FUNC_END(__memmove)
-SYM_FUNC_ALIAS_WEAK(memmove, __memmove)
-SYM_FUNC_ALIAS(__pi_memmove, __memmove)
-SYM_FUNC_ALIAS(__pi___memmove, __memmove)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 5f9c83ec548d..20677c8067da 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy)
 EXPORT_SYMBOL(memcpy);
 void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
 void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
+
+/*
+ * Simply check if the buffer overlaps an call memcpy() in case,
+ * otherwise do a simple one byte at time backward copy.
+ */
+void *__memmove(void *dest, const void *src, size_t count)
+{
+	if (dest < src || src + count <= dest)
+		return __memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
+
+		while (count--)
+			*--tmp = *--s;
+	}
+	return dest;
+}
+EXPORT_SYMBOL(__memmove);
+
+void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmove);
+EXPORT_SYMBOL(memmove);
+void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove);
+void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove);
-- 
2.43.0



* [PATCH 3/3] riscv: optimized memset
  2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
  2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
@ 2024-01-28 11:10 ` Jisheng Zhang
  2024-01-30 12:07   ` Nick Kossifidis
  2024-01-29 18:16 ` [PATCH 0/3] riscv: optimize memcpy/memmove/memset Conor Dooley
  3 siblings, 1 reply; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-28 11:10 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce

From: Matteo Croce <mcroce@microsoft.com>

The generic memset is defined as a byte-at-a-time write. This is always
safe, but it's slower than a 4-byte or even 8-byte write.

Write a generic memset() which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed, and
finally fills the remaining data one byte at a time.
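
For example, on 64-bit the fill byte is first broadcast into a whole long,
which the word loop then stores (a sketch of the arithmetic used in the
patch below, with c = 0xab):

	unsigned long cu = 0xab;

	cu |= cu << 8;			/* 0x000000000000abab */
	cu |= cu << 16;			/* 0x00000000abababab */
	cu |= (cu << 16) << 16;		/* 0xabababababababab */
	/* or, with a fast multiplier: cu = 0xab * 0x0101010101010101UL */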

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
 arch/riscv/include/asm/string.h |   4 +-
 arch/riscv/kernel/riscv_ksyms.c |   2 -
 arch/riscv/lib/Makefile         |   1 -
 arch/riscv/lib/memset.S         | 113 --------------------------------
 arch/riscv/lib/string.c         |  41 ++++++++++++
 arch/riscv/purgatory/Makefile   |   5 +-
 6 files changed, 44 insertions(+), 122 deletions(-)
 delete mode 100644 arch/riscv/lib/memset.S

diff --git a/arch/riscv/include/asm/string.h b/arch/riscv/include/asm/string.h
index 17c3b40382e1..3d45d61ddab9 100644
--- a/arch/riscv/include/asm/string.h
+++ b/arch/riscv/include/asm/string.h
@@ -10,8 +10,8 @@
 #include <linux/linkage.h>
 
 #define __HAVE_ARCH_MEMSET
-extern asmlinkage void *memset(void *, int, size_t);
-extern asmlinkage void *__memset(void *, int, size_t);
+extern void *memset(void *s, int c, size_t count);
+extern void *__memset(void *s, int c, size_t count);
 
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *dest, const void *src, size_t count);
diff --git a/arch/riscv/kernel/riscv_ksyms.c b/arch/riscv/kernel/riscv_ksyms.c
index 76849d0906ef..9a07cbf67095 100644
--- a/arch/riscv/kernel/riscv_ksyms.c
+++ b/arch/riscv/kernel/riscv_ksyms.c
@@ -9,8 +9,6 @@
 /*
  * Assembly functions that may be used (directly or indirectly) by modules
  */
-EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strlen);
 EXPORT_SYMBOL(strncmp);
-EXPORT_SYMBOL(__memset);
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 5fa88c5a601c..b20cadccff78 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -1,6 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 lib-y			+= delay.o
-lib-y			+= memset.o
 lib-y			+= strcmp.o
 lib-y			+= strlen.o
 lib-y			+= string.o
diff --git a/arch/riscv/lib/memset.S b/arch/riscv/lib/memset.S
deleted file mode 100644
index 35f358e70bdb..000000000000
--- a/arch/riscv/lib/memset.S
+++ /dev/null
@@ -1,113 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Regents of the University of California
- */
-
-
-#include <linux/linkage.h>
-#include <asm/asm.h>
-
-/* void *memset(void *, int, size_t) */
-SYM_FUNC_START(__memset)
-	move t0, a0  /* Preserve return value */
-
-	/* Defer to byte-oriented fill for small sizes */
-	sltiu a3, a2, 16
-	bnez a3, 4f
-
-	/*
-	 * Round to nearest XLEN-aligned address
-	 * greater than or equal to start address
-	 */
-	addi a3, t0, SZREG-1
-	andi a3, a3, ~(SZREG-1)
-	beq a3, t0, 2f  /* Skip if already aligned */
-	/* Handle initial misalignment */
-	sub a4, a3, t0
-1:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 1b
-	sub a2, a2, a4  /* Update count */
-
-2: /* Duff's device with 32 XLEN stores per iteration */
-	/* Broadcast value into all bytes */
-	andi a1, a1, 0xff
-	slli a3, a1, 8
-	or a1, a3, a1
-	slli a3, a1, 16
-	or a1, a3, a1
-#ifdef CONFIG_64BIT
-	slli a3, a1, 32
-	or a1, a3, a1
-#endif
-
-	/* Calculate end address */
-	andi a4, a2, ~(SZREG-1)
-	add a3, t0, a4
-
-	andi a4, a4, 31*SZREG  /* Calculate remainder */
-	beqz a4, 3f            /* Shortcut if no remainder */
-	neg a4, a4
-	addi a4, a4, 32*SZREG  /* Calculate initial offset */
-
-	/* Adjust start address with offset */
-	sub t0, t0, a4
-
-	/* Jump into loop body */
-	/* Assumes 32-bit instruction lengths */
-	la a5, 3f
-#ifdef CONFIG_64BIT
-	srli a4, a4, 1
-#endif
-	add a5, a5, a4
-	jr a5
-3:
-	REG_S a1,        0(t0)
-	REG_S a1,    SZREG(t0)
-	REG_S a1,  2*SZREG(t0)
-	REG_S a1,  3*SZREG(t0)
-	REG_S a1,  4*SZREG(t0)
-	REG_S a1,  5*SZREG(t0)
-	REG_S a1,  6*SZREG(t0)
-	REG_S a1,  7*SZREG(t0)
-	REG_S a1,  8*SZREG(t0)
-	REG_S a1,  9*SZREG(t0)
-	REG_S a1, 10*SZREG(t0)
-	REG_S a1, 11*SZREG(t0)
-	REG_S a1, 12*SZREG(t0)
-	REG_S a1, 13*SZREG(t0)
-	REG_S a1, 14*SZREG(t0)
-	REG_S a1, 15*SZREG(t0)
-	REG_S a1, 16*SZREG(t0)
-	REG_S a1, 17*SZREG(t0)
-	REG_S a1, 18*SZREG(t0)
-	REG_S a1, 19*SZREG(t0)
-	REG_S a1, 20*SZREG(t0)
-	REG_S a1, 21*SZREG(t0)
-	REG_S a1, 22*SZREG(t0)
-	REG_S a1, 23*SZREG(t0)
-	REG_S a1, 24*SZREG(t0)
-	REG_S a1, 25*SZREG(t0)
-	REG_S a1, 26*SZREG(t0)
-	REG_S a1, 27*SZREG(t0)
-	REG_S a1, 28*SZREG(t0)
-	REG_S a1, 29*SZREG(t0)
-	REG_S a1, 30*SZREG(t0)
-	REG_S a1, 31*SZREG(t0)
-	addi t0, t0, 32*SZREG
-	bltu t0, a3, 3b
-	andi a2, a2, SZREG-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, t0, a2
-5:
-	sb a1, 0(t0)
-	addi t0, t0, 1
-	bltu t0, a3, 5b
-6:
-	ret
-SYM_FUNC_END(__memset)
-SYM_FUNC_ALIAS_WEAK(memset, __memset)
diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 20677c8067da..022edda68f1c 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -144,3 +144,44 @@ void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmov
 EXPORT_SYMBOL(memmove);
 void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove);
 void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove);
+
+void *__memset(void *s, int c, size_t count)
+{
+	union types dest = { .as_u8 = s };
+
+	if (count >= MIN_THRESHOLD) {
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+		cu *= 0x0101010101010101UL;
+#else
+		cu |= cu << 8;
+		cu |= cu << 16;
+		/* Suppress warning on 32 bit machines */
+		cu |= (cu << 16) << 16;
+#endif
+		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+			/*
+			 * Fill the buffer one byte at time until
+			 * the destination is word aligned.
+			 */
+			for (; count && dest.as_uptr & WORD_MASK; count--)
+				*dest.as_u8++ = c;
+		}
+
+		/* Copy using the largest size allowed */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*dest.as_ulong++ = cu;
+	}
+
+	/* copy the remainder */
+	while (count--)
+		*dest.as_u8++ = c;
+
+	return s;
+}
+EXPORT_SYMBOL(__memset);
+
+void *memset(void *s, int c, size_t count) __weak __alias(__memset);
+EXPORT_SYMBOL(memset);
diff --git a/arch/riscv/purgatory/Makefile b/arch/riscv/purgatory/Makefile
index 8b940ff04895..a31513346951 100644
--- a/arch/riscv/purgatory/Makefile
+++ b/arch/riscv/purgatory/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 OBJECT_FILES_NON_STANDARD := y
 
-purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o memset.o
+purgatory-y := purgatory.o sha256.o entry.o string.o ctype.o
 purgatory-y += strcmp.o strlen.o strncmp.o riscv_string.o
 
 targets += $(purgatory-y)
@@ -16,9 +16,6 @@ $(obj)/string.o: $(srctree)/lib/string.c FORCE
 $(obj)/ctype.o: $(srctree)/lib/ctype.c FORCE
 	$(call if_changed_rule,cc_o_c)
 
-$(obj)/memset.o: $(srctree)/arch/riscv/lib/memset.S FORCE
-	$(call if_changed_rule,as_o_S)
-
 $(obj)/strcmp.o: $(srctree)/arch/riscv/lib/strcmp.S FORCE
 	$(call if_changed_rule,as_o_S)
 
-- 
2.43.0



* RE: [PATCH 1/3] riscv: optimized memcpy
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
@ 2024-01-28 12:35   ` David Laight
  2024-01-30 12:11   ` Nick Kossifidis
  1 sibling, 0 replies; 19+ messages in thread
From: David Laight @ 2024-01-28 12:35 UTC (permalink / raw)
  To: 'Jisheng Zhang', Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Jisheng Zhang
> Sent: 28 January 2024 11:10
> 
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Write a C version of memcpy() which uses the biggest data size allowed,
> without generating unaligned accesses.
> 
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next u8
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
> 
> On a BeagleV, the TCP RX throughput increased by 45%:
...
> +static void __memcpy_aligned(unsigned long *dest, const unsigned long *src, size_t count)
> +{

You should be able to remove an instruction from the loop by using:
	const unsigned long *src_lim = src + count;
	for (; src < src_lim; ) {

> +	for (; count > 0; count -= BYTES_LONG * 8) {
> +		register unsigned long d0, d1, d2, d3, d4, d5, d6, d7;

The 'register' keyword is completely ignored and pointless.
(More annoyingly, 'auto' is also ignored.)

> +		d0 = src[0];
> +		d1 = src[1];
> +		d2 = src[2];
> +		d3 = src[3];
> +		d4 = src[4];
> +		d5 = src[5];
> +		d6 = src[6];
> +		d7 = src[7];
> +		dest[0] = d0;
> +		dest[1] = d1;
> +		dest[2] = d2;
> +		dest[3] = d3;
> +		dest[4] = d4;
> +		dest[5] = d5;
> +		dest[6] = d6;
> +		dest[7] = d7;
> +		dest += 8;
> +		src += 8;

These two lines belong in the for (...) statement.

> +	}
> +}

If you __always_inline the function you can pass &src and &dest
and use the updated pointers following the loop.
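
Something along these lines (a rough sketch; the helper name here is
invented for illustration):

	static __always_inline void copy_block(unsigned long **dest,
					       const unsigned long **src)
	{
		unsigned long *d = *dest;
		const unsigned long *s = *src;
		unsigned long d0 = s[0], d1 = s[1], d2 = s[2], d3 = s[3];
		unsigned long d4 = s[4], d5 = s[5], d6 = s[6], d7 = s[7];

		d[0] = d0; d[1] = d1; d[2] = d2; d[3] = d3;
		d[4] = d4; d[5] = d5; d[6] = d6; d[7] = d7;

		/* the caller sees the advanced pointers after the call */
		*dest = d + 8;
		*src = s + 8;
	}

so the caller can carry on from the updated *src/*dest for the tail copy.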

I don't believe that RISC-V supports 'reg+reg+(imm5<<3)' addressing
(although there is probably space in the instruction for it).
Actually, 'reg+reg' addressing could be supported for loads but
not stores, since the latter would require 3 registers to be read.

We use the Nios-II cpu in some FPGAs. Intel are removing support
in favour of RISC-V - we are thinking of re-implementing Nios-II
ourselves!
I don't think they understand what the cpu gets used for!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* RE: [PATCH 2/3] riscv: optimized memmove
  2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
@ 2024-01-28 12:47   ` David Laight
  2024-01-30 11:30     ` Jisheng Zhang
  2024-01-30 11:39   ` Nick Kossifidis
  1 sibling, 1 reply; 19+ messages in thread
From: David Laight @ 2024-01-28 12:47 UTC (permalink / raw)
  To: 'Jisheng Zhang', Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

From: Jisheng Zhang
> Sent: 28 January 2024 11:10
> 
> When the destination buffer is before the source one, or when the
> buffers doesn't overlap, it's safe to use memcpy() instead, which is
> optimized to use a bigger data size possible.
> 
...
> + * Simply check if the buffer overlaps an call memcpy() in case,
> + * otherwise do a simple one byte at time backward copy.

I'd at least do a 64-bit copy loop if the addresses are aligned.

Thinking about it a bit more....

Put the copy-64-bytes code (the body of the memcpy() loop)
into an inline function and call it with increasing addresses
in memcpy() and decreasing addresses in memmove().

So memcpy() contains:
	src_lim = src + count;
	... alignment copy
	for (; src + 64 <= src_lim; src += 64, dest += 64)
		copy_64_bytes(dest, src);
	... tail copy

Then you can do something very similar for backwards copies.
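
For the backward direction that might look something like (rough sketch,
reusing the same copy_64_bytes() idea):

	src_lim = src + count;
	dest += count;
	... tail (unaligned) copy from the top end
	for (; src + 64 <= src_lim; ) {
		src_lim -= 64; dest -= 64;
		copy_64_bytes(dest, src_lim);
	}
	... head copy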

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 0/3] riscv: optimize memcpy/memmove/memset
  2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
                   ` (2 preceding siblings ...)
  2024-01-28 11:10 ` [PATCH 3/3] riscv: optimized memset Jisheng Zhang
@ 2024-01-29 18:16 ` Conor Dooley
  2024-01-30  2:28   ` Jisheng Zhang
  3 siblings, 1 reply; 19+ messages in thread
From: Conor Dooley @ 2024-01-29 18:16 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel


On Sun, Jan 28, 2024 at 07:10:10PM +0800, Jisheng Zhang wrote:
> This series is to renew Matteo's "riscv: optimized mem* functions"
> sereies.
> 
> Compared with Matteo's original series, Jisheng made below changes:
> 1. adopt Emil's change to fix boot failure when build with clang
> 2. add corresponding changes to purgatory
> 3. always build optimized string.c rather than only build when optimize
> for performance
> 4. implement unroll support when src & dst are both aligned to keep
> the same performance as assembly version. After disassembling, I found
> that the unroll version looks something like below, so it acchieves
> the "unroll" effect as asm version but in C programming language:
> 	ld	t2,0(a5)
> 	ld	t0,8(a5)
> 	ld	t6,16(a5)
> 	ld	t5,24(a5)
> 	ld	t4,32(a5)
> 	ld	t3,40(a5)
> 	ld	t1,48(a5)
> 	ld	a1,56(a5)
> 	sd	t2,0(a6)
> 	sd	t0,8(a6)
> 	sd	t6,16(a6)
> 	sd	t5,24(a6)
> 	sd	t4,32(a6)
> 	sd	t3,40(a6)
> 	sd	t1,48(a6)
> 	sd	a1,56(a6)
> And per my testing, unrolling more doesn't help performance, so
> the "c" version only unrolls by using 8 GP regs rather than 16
> ones as asm version.
> 5. Add proper __pi_memcpy and __pi___memcpy alias
> 6. more performance numbers.
> 
> Per my benchmark with [1] on TH1520, CV1800B and JH7110 platforms,
> the unaligned medium memcpy performance is running about 3.5x ~ 8.6x
> speed of the unpatched versions's! Check patch1 for more details and
> performance numbers.
> 
> Link:https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1]
> 
> Here is the original cover letter msg from Matteo:
> Replace the assembly mem{cpy,move,set} with C equivalent.
> 
> Try to access RAM with the largest bit width possible, but without
> doing unaligned accesses.
> 
> A further improvement could be to use multiple read and writes as the
> assembly version was trying to do.
> 
> Tested on a BeagleV Starlight with a SiFive U74 core, where the
> improvement is noticeable.

However, with allmodconfig it doesn't compile:
  Redirect to /build/tmp.zzMIlhgQQo and /build/tmp.vxnoxu8G5e
  Tree base:
  0c526539d432 ("riscv: optimized memcpy")
  Building the whole tree with the patch
  ../arch/riscv/lib/string.c:118:7: error: expected identifier or '('
  ../arch/riscv/lib/string.c:118:7: error: expected ')'
  ../arch/riscv/lib/string.c:143:7: error: expected identifier or '('
  ../arch/riscv/lib/string.c:143:7: error: expected ')'
  ../arch/riscv/lib/string.c:118:7: error: expected identifier or '('
  ../arch/riscv/lib/string.c:118:7: error: expected ')'
  ../arch/riscv/lib/string.c:143:7: error: expected identifier or '('
  ../arch/riscv/lib/string.c:143:7: error: expected ')'

Seems to be the case both with llvm and gcc.

Cheers,
Conor.




* Re: [PATCH 0/3] riscv: optimize memcpy/memmove/memset
  2024-01-29 18:16 ` [PATCH 0/3] riscv: optimize memcpy/memmove/memset Conor Dooley
@ 2024-01-30  2:28   ` Jisheng Zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-30  2:28 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv, linux-kernel

On Mon, Jan 29, 2024 at 06:16:13PM +0000, Conor Dooley wrote:
> On Sun, Jan 28, 2024 at 07:10:10PM +0800, Jisheng Zhang wrote:
> > This series is to renew Matteo's "riscv: optimized mem* functions"
> > sereies.
> > 
> > Compared with Matteo's original series, Jisheng made below changes:
> > 1. adopt Emil's change to fix boot failure when build with clang
> > 2. add corresponding changes to purgatory
> > 3. always build optimized string.c rather than only build when optimize
> > for performance
> > 4. implement unroll support when src & dst are both aligned to keep
> > the same performance as assembly version. After disassembling, I found
> > that the unroll version looks something like below, so it acchieves
> > the "unroll" effect as asm version but in C programming language:
> > 	ld	t2,0(a5)
> > 	ld	t0,8(a5)
> > 	ld	t6,16(a5)
> > 	ld	t5,24(a5)
> > 	ld	t4,32(a5)
> > 	ld	t3,40(a5)
> > 	ld	t1,48(a5)
> > 	ld	a1,56(a5)
> > 	sd	t2,0(a6)
> > 	sd	t0,8(a6)
> > 	sd	t6,16(a6)
> > 	sd	t5,24(a6)
> > 	sd	t4,32(a6)
> > 	sd	t3,40(a6)
> > 	sd	t1,48(a6)
> > 	sd	a1,56(a6)
> > And per my testing, unrolling more doesn't help performance, so
> > the "c" version only unrolls by using 8 GP regs rather than 16
> > ones as asm version.
> > 5. Add proper __pi_memcpy and __pi___memcpy alias
> > 6. more performance numbers.
> > 
> > Per my benchmark with [1] on TH1520, CV1800B and JH7110 platforms,
> > the unaligned medium memcpy performance is running about 3.5x ~ 8.6x
> > speed of the unpatched versions's! Check patch1 for more details and
> > performance numbers.
> > 
> > Link:https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memcpy.c [1]
> > 
> > Here is the original cover letter msg from Matteo:
> > Replace the assembly mem{cpy,move,set} with C equivalent.
> > 
> > Try to access RAM with the largest bit width possible, but without
> > doing unaligned accesses.
> > 
> > A further improvement could be to use multiple read and writes as the
> > assembly version was trying to do.
> > 
> > Tested on a BeagleV Starlight with a SiFive U74 core, where the
> > improvement is noticeable.
> 
> However, with allmodconfig it doesn't compile:
>   Redirect to /build/tmp.zzMIlhgQQo and /build/tmp.vxnoxu8G5e
>   Tree base:
>   0c526539d432 ("riscv: optimized memcpy")
>   Building the whole tree with the patch
>   ../arch/riscv/lib/string.c:118:7: error: expected identifier or '('
>   ../arch/riscv/lib/string.c:118:7: error: expected ')'
>   ../arch/riscv/lib/string.c:143:7: error: expected identifier or '('
>   ../arch/riscv/lib/string.c:143:7: error: expected ')'
>   ../arch/riscv/lib/string.c:118:7: error: expected identifier or '('
>   ../arch/riscv/lib/string.c:118:7: error: expected ')'
>   ../arch/riscv/lib/string.c:143:7: error: expected identifier or '('
>   ../arch/riscv/lib/string.c:143:7: error: expected ')'
> 
> Seems to be the case both with llvm and gcc.

Hi Conor,

This is due to missing FORTIFY_SOURCE handling: string.c provides the real
mem*() definitions and aliases, so the fortify wrappers from the generic
string headers need to be disabled when building it. The trivial patch below
fixes it :)
I'm waiting for more comments before sending out v2.

diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
index 022edda68f1c..bfaab058f2cb 100644
--- a/arch/riscv/lib/string.c
+++ b/arch/riscv/lib/string.c
@@ -6,6 +6,7 @@
  * Copyright (C) 2021 Matteo Croce
  */

+#define __NO_FORTIFY
 #include <linux/types.h>
 #include <linux/module.h>
> 
> Cheers,
> Conor.
> 



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-28 12:47   ` David Laight
@ 2024-01-30 11:30     ` Jisheng Zhang
  2024-01-30 11:51       ` David Laight
  0 siblings, 1 reply; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-30 11:30 UTC (permalink / raw)
  To: David Laight
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On Sun, Jan 28, 2024 at 12:47:00PM +0000, David Laight wrote:
> From: Jisheng Zhang
> > Sent: 28 January 2024 11:10
> > 
> > When the destination buffer is before the source one, or when the
> > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > optimized to use a bigger data size possible.
> > 
> ...
> > + * Simply check if the buffer overlaps an call memcpy() in case,
> > + * otherwise do a simple one byte at time backward copy.
> 
> I'd at least do a 64bit copy loop if the addresses are aligned.
> 
> Thinks a bit more....
> 
> Put the copy-64-bytes code (the body of the memcpy() loop)
> into an inline function and call it with increasing addresses
> in memcpy() or decrementing addresses in memmove().

Hi David,

Besides the 64-byte copy, there's another optimization in __memcpy:
the word-by-word copy used even when s and d are not mutually aligned.
If we turn both optimized copies into inline functions and call them
from memmove(), we end up duplicating almost all of the __memcpy code,
so I think calling __memcpy directly is a bit better.
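
For context, a rough sketch of the shared 64-byte block helper under
discussion might look like the following; the name copy_block64() and its
exact shape are illustrative assumptions, not code from this patch:

static inline void copy_block64(unsigned long *d, const unsigned long *s)
{
	/*
	 * Load all eight words first so the compiler can keep them in GP
	 * registers, then store them, mirroring the unrolled asm version.
	 * With 8-byte longs (RV64) this moves 64 bytes per call.
	 */
	unsigned long r0 = s[0], r1 = s[1], r2 = s[2], r3 = s[3];
	unsigned long r4 = s[4], r5 = s[5], r6 = s[6], r7 = s[7];

	d[0] = r0; d[1] = r1; d[2] = r2; d[3] = r3;
	d[4] = r4; d[5] = r5; d[6] = r6; d[7] = r7;
}

A forward memcpy loop would call it with incrementing pointers and a
backward memmove loop with decrementing ones; the misaligned shifting path
is the part that would still be duplicated, which is the concern above.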

Thanks
> 
> So memcpy() contains:
> 	src_lim = src + count;
> 	... alignment copy
> 	for (; src + 64 <= src_lim; src += 64, dest += 64)
> 		copy_64_bytes(dest, src);
> 	... tail copy
> 
> Then you can do something very similar for backwards copies.
> 
> 	David
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
  2024-01-28 12:47   ` David Laight
@ 2024-01-30 11:39   ` Nick Kossifidis
  2024-01-30 13:12     ` Jisheng Zhang
  1 sibling, 1 reply; 19+ messages in thread
From: Nick Kossifidis @ 2024-01-30 11:39 UTC (permalink / raw)
  To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

On 1/28/24 13:10, Jisheng Zhang wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> When the destination buffer is before the source one, or when the
> buffers doesn't overlap, it's safe to use memcpy() instead, which is
> optimized to use a bigger data size possible.
> 
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>

I'd expect to have memmove handle both fw/bw copying and then memcpy 
being an alias to memmove, to also take care when regions overlap and 
avoid undefined behavior.


> --- a/arch/riscv/lib/string.c
> +++ b/arch/riscv/lib/string.c
> @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy)
>   EXPORT_SYMBOL(memcpy);
>   void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
>   void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
> +
> +/*
> + * Simply check if the buffer overlaps an call memcpy() in case,
> + * otherwise do a simple one byte at time backward copy.
> + */
> +void *__memmove(void *dest, const void *src, size_t count)
> +{
> +	if (dest < src || src + count <= dest)
> +		return __memcpy(dest, src, count);
> +
> +	if (dest > src) {
> +		const char *s = src + count;
> +		char *tmp = dest + count;
> +
> +		while (count--)
> +			*--tmp = *--s;
> +	}
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memmove);
> +

Here is an approach for the backwards case to get things started...

static void
copy_bw(void *dst_ptr, const void *src_ptr, size_t len)
{
	union const_data src = { .as_bytes = src_ptr + len };
	union data dst = { .as_bytes = dst_ptr + len };
	size_t remaining = len;
	size_t src_offt = 0;

	if (len < 2 * WORD_SIZE)
		goto trailing_bw;

	for(; dst.as_uptr & WORD_MASK; remaining--)
		*--dst.as_bytes = *--src.as_bytes;

	src_offt = src.as_uptr & WORD_MASK;
	if (!src_offt) {
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE)
			*--dst.as_ulong = *--src.as_ulong;
	} else {
		unsigned long cur, prev;
		src.as_bytes -= src_offt;
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
			cur = *src.as_ulong;
			prev = *--src.as_ulong;
			*--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) |
					  prev >> (src_offt * 8);
		}
		src.as_bytes += src_offt;
	}

  trailing_bw:
	while (remaining-- > 0)
		*--dst.as_bytes = *--src.as_bytes;
}
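
For completeness, a minimal sketch of how __memmove() could dispatch between
the existing forward __memcpy() and a backward helper like the copy_bw()
above; the dispatch wrapper itself is an illustrative assumption, not part
of the patch:

void *__memmove(void *dest, const void *src, size_t count)
{
	/* Non-overlapping, or destination below source: forward copy is safe. */
	if (dest < src || src + count <= dest)
		return __memcpy(dest, src, count);

	/* Overlapping with dest above src: copy backwards. */
	if (dest > src)
		copy_bw(dest, src, count);

	return dest;
}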

Regards,
Nick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 11:30     ` Jisheng Zhang
@ 2024-01-30 11:51       ` David Laight
  0 siblings, 0 replies; 19+ messages in thread
From: David Laight @ 2024-01-30 11:51 UTC (permalink / raw)
  To: 'Jisheng Zhang'
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

From: Jisheng Zhang
> Sent: 30 January 2024 11:31
> 
> On Sun, Jan 28, 2024 at 12:47:00PM +0000, David Laight wrote:
> > From: Jisheng Zhang
> > > Sent: 28 January 2024 11:10
> > >
> > > When the destination buffer is before the source one, or when the
> > > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > > optimized to use a bigger data size possible.
> > >
> > ...
> > > + * Simply check if the buffer overlaps an call memcpy() in case,
> > > + * otherwise do a simple one byte at time backward copy.
> >
> > I'd at least do a 64bit copy loop if the addresses are aligned.
> >
> > Thinks a bit more....
> >
> > Put the copy-64-bytes code (the body of the memcpy() loop)
> > into an inline function and call it with increasing addresses
> > in memcpy() or decrementing addresses in memmove().
> 
> Hi David,
> 
> Besides the 64 bytes copy, there's another optimization in __memcpy:
> word-by-word copy even if s and d are not aligned.
> So if we make the two optimizd copy as inline functions and call them
> in memmove(), we almost duplicate the __memcpy code, so I think
> directly calling __memcpy is a bit better.

If a forwards copy is valid, call memcpy() - which I think you do.
If not, you can still use the same 'copy 8 registers' code
that memcpy() uses - just with a decrementing block address.
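
As a minimal sketch of that suggestion (copy_block64() here is only a
placeholder for the shared 8-register block helper, and the alignment and
small-size handling are omitted):

static void copy_blocks_bw(void *dst, const void *src, size_t count)
{
	unsigned char *d = (unsigned char *)dst + count;
	const unsigned char *s = (const unsigned char *)src + count;

	/* Walk backwards one 64-byte block at a time, then mop up the rest. */
	for (; count >= 64; count -= 64) {
		d -= 64;
		s -= 64;
		copy_block64((unsigned long *)d, (const unsigned long *)s);
	}
	while (count--)
		*--d = *--s;
}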

	David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] riscv: optimized memset
  2024-01-28 11:10 ` [PATCH 3/3] riscv: optimized memset Jisheng Zhang
@ 2024-01-30 12:07   ` Nick Kossifidis
  2024-01-30 13:25     ` Jisheng Zhang
  2024-02-01 23:04     ` David Laight
  0 siblings, 2 replies; 19+ messages in thread
From: Nick Kossifidis @ 2024-01-30 12:07 UTC (permalink / raw)
  To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce

On 1/28/24 13:10, Jisheng Zhang wrote:
> diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
> index 20677c8067da..022edda68f1c 100644
> --- a/arch/riscv/lib/string.c
> +++ b/arch/riscv/lib/string.c
> @@ -144,3 +144,44 @@ void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmov
>   EXPORT_SYMBOL(memmove);
>   void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove);
>   void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove);
> +
> +void *__memset(void *s, int c, size_t count)
> +{
> +	union types dest = { .as_u8 = s };
> +
> +	if (count >= MIN_THRESHOLD) {
> +		unsigned long cu = (unsigned long)c;
> +
> +		/* Compose an ulong with 'c' repeated 4/8 times */
> +#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
> +		cu *= 0x0101010101010101UL;
> +#else
> +		cu |= cu << 8;
> +		cu |= cu << 16;
> +		/* Suppress warning on 32 bit machines */
> +		cu |= (cu << 16) << 16;
> +#endif

I guess you could check against __SIZEOF_LONG__ here.

> +		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> +			/*
> +			 * Fill the buffer one byte at time until
> +			 * the destination is word aligned.
> +			 */
> +			for (; count && dest.as_uptr & WORD_MASK; count--)
> +				*dest.as_u8++ = c;
> +		}
> +
> +		/* Copy using the largest size allowed */
> +		for (; count >= BYTES_LONG; count -= BYTES_LONG)
> +			*dest.as_ulong++ = cu;
> +	}
> +
> +	/* copy the remainder */
> +	while (count--)
> +		*dest.as_u8++ = c;
> +
> +	return s;
> +}
> +EXPORT_SYMBOL(__memset);

BTW a similar approach could be used for memchr, e.g.:

#if __SIZEOF_LONG__ == 8
#define HAS_ZERO(_x) (((_x) - 0x0101010101010101ULL) & ~(_x) & 
0x8080808080808080ULL)
#else
#define HAS_ZERO(_x) (((_x) - 0x01010101UL) & ~(_x) & 0x80808080UL)
#endif

void *
memchr(const void *src_ptr, int c, size_t len)
{
	union const_data src = { .as_bytes = src_ptr };
	unsigned char byte = (unsigned char) c;
	unsigned long mask = (unsigned long) c;
	size_t remaining = len;

	/* Nothing to do */
	if (!src_ptr || !len)
		return NULL;

	if (len < 2 * WORD_SIZE)
		goto trailing;

	mask |= mask << 8;
	mask |= mask << 16;
#if __SIZEOF_LONG__ == 8
	mask |= mask << 32;
#endif

	/* Search by byte up to the src's alignment boundary */
	for(; src.as_uptr & WORD_MASK; remaining--, src.as_bytes++) {
		if (*src.as_bytes == byte)
			return (void*) src.as_bytes;
	}

	/* Search word by word using the mask */
	for(; remaining >= WORD_SIZE; remaining -= WORD_SIZE, src.as_ulong++) {
		unsigned long check = *src.as_ulong ^ mask;
		if(HAS_ZERO(check))
			break;
	}

  trailing:
	for(; remaining > 0; remaining--, src.as_bytes++) {
		if (*src.as_bytes == byte)
			return (void*) src.as_bytes;
	}

	return NULL;
}
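
As a quick sanity check of the HAS_ZERO() trick above, a tiny standalone
userspace program (assuming a 64-bit long) behaves as expected:

#include <assert.h>

#define HAS_ZERO64(_x) (((_x) - 0x0101010101010101UL) & ~(_x) & 0x8080808080808080UL)

int main(void)
{
	unsigned long no_zero  = 0x1122334455667788UL;	/* no zero byte */
	unsigned long one_zero = 0x1122004455667788UL;	/* one zero byte */

	assert(!HAS_ZERO64(no_zero));	/* no byte of the word is 0x00 */
	assert(HAS_ZERO64(one_zero));	/* the 0x00 byte is detected */
	return 0;
}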

Regards,
Nick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] riscv: optimized memcpy
  2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
  2024-01-28 12:35   ` David Laight
@ 2024-01-30 12:11   ` Nick Kossifidis
  1 sibling, 0 replies; 19+ messages in thread
From: Nick Kossifidis @ 2024-01-30 12:11 UTC (permalink / raw)
  To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce, kernel test robot

On 1/28/24 13:10, Jisheng Zhang wrote:
> +
> +void *__memcpy(void *dest, const void *src, size_t count)
> +{
> +	union const_types s = { .as_u8 = src };
> +	union types d = { .as_u8 = dest };
> +	int distance = 0;
> +
> +	if (count < MIN_THRESHOLD)
> +		goto copy_remainder;
> +
> +	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> +		/* Copy a byte at time until destination is aligned. */
> +		for (; d.as_uptr & WORD_MASK; count--)
> +			*d.as_u8++ = *s.as_u8++;
> +
> +		distance = s.as_uptr & WORD_MASK;
> +	}
> +
> +	if (distance) {
> +		unsigned long last, next;
> +
> +		/*
> +		 * s is distance bytes ahead of d, and d just reached
> +		 * the alignment boundary. Move s backward to word align it
> +		 * and shift data to compensate for distance, in order to do
> +		 * word-by-word copy.
> +		 */
> +		s.as_u8 -= distance;
> +
> +		next = s.as_ulong[0];
> +		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
> +			last = next;
> +			next = s.as_ulong[1];
> +
> +			d.as_ulong[0] = last >> (distance * 8) |
> +					next << ((BYTES_LONG - distance) * 8);
> +
> +			d.as_ulong++;
> +			s.as_ulong++;
> +		}
> +
> +		/* Restore s with the original offset. */
> +		s.as_u8 += distance;
> +	} else {
> +		/*
> +		 * If the source and dest lower bits are the same, do a simple
> +		 * aligned copy.
> +		 */
> +		size_t aligned_count = count & ~(BYTES_LONG * 8 - 1);
> +
> +		__memcpy_aligned(d.as_ulong, s.as_ulong, aligned_count);
> +		d.as_u8 += aligned_count;
> +		s.as_u8 += aligned_count;
> +		count &= BYTES_LONG * 8 - 1;
> +	}
> +
> +copy_remainder:
> +	while (count--)
> +		*d.as_u8++ = *s.as_u8++;
> +
> +	return dest;
> +}
> +EXPORT_SYMBOL(__memcpy);
> +

We could also implement memcmp this way, e.g.:

int
memcmp(const void *s1, const void *s2, size_t len)
{
	union const_data a = { .as_bytes = s1 };
	union const_data b = { .as_bytes = s2 };
	unsigned long a_val = 0;
	unsigned long b_val = 0;
	size_t remaining = len;
	size_t a_offt = 0;

	/* Nothing to do */
	if (!s1 || !s2 || s1 == s2 || !len)
		return 0;

	if (len < 2 * WORD_SIZE)
		goto trailing_fw;

	for(; b.as_uptr & WORD_MASK; remaining--) {
		a_val = *a.as_bytes++;
		b_val = *b.as_bytes++;
		if (a_val != b_val)
			goto done;
	}

	a_offt = a.as_uptr & WORD_MASK;
	if (!a_offt) {
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
			a_val = *a.as_ulong++;
			b_val = *b.as_ulong++;
			if (a_val != b_val)
				break;

		}
	} else {
		unsigned long a_cur, a_next;
		a.as_bytes -= a_offt;
		a_next = *a.as_ulong;
		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE, b.as_ulong++) {
			a_cur = a_next;
			a_next = *++a.as_ulong;
			a_val = a_cur >> (a_offt * 8) |
				a_next << ((WORD_SIZE - a_offt) * 8);
			b_val = *b.as_ulong;
			if (a_val != b_val) {
				a.as_bytes += a_offt;
				break;
			}
		}
		a.as_bytes += a_offt;
	}

  trailing_fw:
	while (remaining-- > 0) {
		a_val = *a.as_bytes++;
		b_val = *b.as_bytes++;
		if (a_val != b_val)
			break;
	}

  done:
	if (!remaining)
		return 0;

	return (int) (a_val - b_val);
}

Regards,
Nick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 11:39   ` Nick Kossifidis
@ 2024-01-30 13:12     ` Jisheng Zhang
  2024-01-30 16:52       ` Nick Kossifidis
  0 siblings, 1 reply; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-30 13:12 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote:
> On 1/28/24 13:10, Jisheng Zhang wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> > 
> > When the destination buffer is before the source one, or when the
> > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > optimized to use a bigger data size possible.
> > 
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> 
> I'd expect to have memmove handle both fw/bw copying and then memcpy being
> an alias to memmove, to also take care when regions overlap and avoid
> undefined behavior.

Hi Nick,

Here is something from man memcpy:

"void *memcpy(void dest[restrict .n], const void src[restrict .n],
                    size_t n);

The  memcpy()  function copies n bytes from memory area src to memory area dest.
The memory areas must not overlap.  Use memmove(3) if the memory areas do  over‐
lap."

IMHO, the "restrict" implies that there's no overlap. If overlap
happens, the manual doesn't say what will happen.

From another side, I have a concern: currently, other arches don't have
this alias behavior; IIUC (at least per my understanding of the arm and arm64
memcpy implementations) they just copy forward. I want to keep similar
behavior for riscv.

So I want to hear more before going towards alias-memcpy-to-memmove direction.

Thanks
> 
> 
> > --- a/arch/riscv/lib/string.c
> > +++ b/arch/riscv/lib/string.c
> > @@ -119,3 +119,28 @@ void *memcpy(void *dest, const void *src, size_t count) __weak __alias(__memcpy)
> >   EXPORT_SYMBOL(memcpy);
> >   void *__pi_memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
> >   void *__pi___memcpy(void *dest, const void *src, size_t count) __alias(__memcpy);
> > +
> > +/*
> > + * Simply check if the buffer overlaps an call memcpy() in case,
> > + * otherwise do a simple one byte at time backward copy.
> > + */
> > +void *__memmove(void *dest, const void *src, size_t count)
> > +{
> > +	if (dest < src || src + count <= dest)
> > +		return __memcpy(dest, src, count);
> > +
> > +	if (dest > src) {
> > +		const char *s = src + count;
> > +		char *tmp = dest + count;
> > +
> > +		while (count--)
> > +			*--tmp = *--s;
> > +	}
> > +	return dest;
> > +}
> > +EXPORT_SYMBOL(__memmove);
> > +
> 
> Here is an approach for the backwards case to get things started...
> 
> static void
> copy_bw(void *dst_ptr, const void *src_ptr, size_t len)
> {
> 	union const_data src = { .as_bytes = src_ptr + len };
> 	union data dst = { .as_bytes = dst_ptr + len };
> 	size_t remaining = len;
> 	size_t src_offt = 0;
> 
> 	if (len < 2 * WORD_SIZE)
> 		goto trailing_bw;
> 
> 	for(; dst.as_uptr & WORD_MASK; remaining--)
> 		*--dst.as_bytes = *--src.as_bytes;
> 
> 	src_offt = src.as_uptr & WORD_MASK;
> 	if (!src_offt) {
> 		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE)
> 			*--dst.as_ulong = *--src.as_ulong;
> 	} else {
> 		unsigned long cur, prev;
> 		src.as_bytes -= src_offt;
> 		for (; remaining >= WORD_SIZE; remaining -= WORD_SIZE) {
> 			cur = *src.as_ulong;
> 			prev = *--src.as_ulong;
> 			*--dst.as_ulong = cur << ((WORD_SIZE - src_offt) * 8) |
> 					  prev >> (src_offt * 8);
> 		}
> 		src.as_bytes += src_offt;
> 	}
> 
>  trailing_bw:
> 	while (remaining-- > 0)
> 		*--dst.as_bytes = *--src.as_bytes;
> }
> 
> Regards,
> Nick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] riscv: optimized memset
  2024-01-30 12:07   ` Nick Kossifidis
@ 2024-01-30 13:25     ` Jisheng Zhang
  2024-02-01 23:04     ` David Laight
  1 sibling, 0 replies; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-30 13:25 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce

On Tue, Jan 30, 2024 at 02:07:37PM +0200, Nick Kossifidis wrote:
> On 1/28/24 13:10, Jisheng Zhang wrote:
> > diff --git a/arch/riscv/lib/string.c b/arch/riscv/lib/string.c
> > index 20677c8067da..022edda68f1c 100644
> > --- a/arch/riscv/lib/string.c
> > +++ b/arch/riscv/lib/string.c
> > @@ -144,3 +144,44 @@ void *memmove(void *dest, const void *src, size_t count) __weak __alias(__memmov
> >   EXPORT_SYMBOL(memmove);
> >   void *__pi_memmove(void *dest, const void *src, size_t count) __alias(__memmove);
> >   void *__pi___memmove(void *dest, const void *src, size_t count) __alias(__memmove);
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > +	union types dest = { .as_u8 = s };
> > +
> > +	if (count >= MIN_THRESHOLD) {
> > +		unsigned long cu = (unsigned long)c;
> > +
> > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > +#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
> > +		cu *= 0x0101010101010101UL;

Here we need to check BITS_PER_LONG and use 0x01010101UL for rv32.

> > +#else
> > +		cu |= cu << 8;
> > +		cu |= cu << 16;
> > +		/* Suppress warning on 32 bit machines */
> > +		cu |= (cu << 16) << 16;
> > +#endif
> 
> I guess you could check against __SIZEOF_LONG__ here.

Hmm I believe we can remove the | and shift totally, and fall
back to ARCH_HAS_FAST_MULTIPLIER, see
https://lore.kernel.org/linux-riscv/20240125145703.913-1-jszhang@kernel.org/

> 
> > +		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
> > +			/*
> > +			 * Fill the buffer one byte at time until
> > +			 * the destination is word aligned.
> > +			 */
> > +			for (; count && dest.as_uptr & WORD_MASK; count--)
> > +				*dest.as_u8++ = c;
> > +		}
> > +
> > +		/* Copy using the largest size allowed */
> > +		for (; count >= BYTES_LONG; count -= BYTES_LONG)
> > +			*dest.as_ulong++ = cu;
> > +	}
> > +
> > +	/* copy the remainder */
> > +	while (count--)
> > +		*dest.as_u8++ = c;
> > +
> > +	return s;
> > +}
> > +EXPORT_SYMBOL(__memset);
> 
> BTW a similar approach could be used for memchr, e.g.:
> 
> #if __SIZEOF_LONG__ == 8
> #define HAS_ZERO(_x) (((_x) - 0x0101010101010101ULL) & ~(_x) &
> 0x8080808080808080ULL)
> #else
> #define HAS_ZERO(_x) (((_x) - 0x01010101UL) & ~(_x) & 0x80808080UL)
> #endif
> 
> void *
> memchr(const void *src_ptr, int c, size_t len)
> {
> 	union const_data src = { .as_bytes = src_ptr };
> 	unsigned char byte = (unsigned char) c;
> 	unsigned long mask = (unsigned long) c;
> 	size_t remaining = len;
> 
> 	/* Nothing to do */
> 	if (!src_ptr || !len)
> 		return NULL;
> 
> 	if (len < 2 * WORD_SIZE)
> 		goto trailing;
> 
> 	mask |= mask << 8;
> 	mask |= mask << 16;
> #if __SIZEOF_LONG__ == 8
> 	mask |= mask << 32;
> #endif
> 
> 	/* Search by byte up to the src's alignment boundary */
> 	for(; src.as_uptr & WORD_MASK; remaining--, src.as_bytes++) {
> 		if (*src.as_bytes == byte)
> 			return (void*) src.as_bytes;
> 	}
> 
> 	/* Search word by word using the mask */
> 	for(; remaining >= WORD_SIZE; remaining -= WORD_SIZE, src.as_ulong++) {
> 		unsigned long check = *src.as_ulong ^ mask;
> 		if(HAS_ZERO(check))
> 			break;
> 	}
> 
>  trailing:
> 	for(; remaining > 0; remaining--, src.as_bytes++) {
> 		if (*src.as_bytes == byte)
> 			return (void*) src.as_bytes;
> 	}
> 
> 	return NULL;
> }
> 
> Regards,
> Nick

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 13:12     ` Jisheng Zhang
@ 2024-01-30 16:52       ` Nick Kossifidis
  2024-01-31  5:25         ` Jisheng Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Kossifidis @ 2024-01-30 16:52 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On 1/30/24 15:12, Jisheng Zhang wrote:
> On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote:
>> On 1/28/24 13:10, Jisheng Zhang wrote:
>>> From: Matteo Croce <mcroce@microsoft.com>
>>>
>>> When the destination buffer is before the source one, or when the
>>> buffers doesn't overlap, it's safe to use memcpy() instead, which is
>>> optimized to use a bigger data size possible.
>>>
>>> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
>>> Reported-by: kernel test robot <lkp@intel.com>
>>> Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
>>
>> I'd expect to have memmove handle both fw/bw copying and then memcpy being
>> an alias to memmove, to also take care when regions overlap and avoid
>> undefined behavior.
> 
> Hi Nick,
> 
> Here is somthing from man memcpy:
> 
> "void *memcpy(void dest[restrict .n], const void src[restrict .n],
>                      size_t n);
> 
> The  memcpy()  function copies n bytes from memory area src to memory area dest.
> The memory areas must not overlap.  Use memmove(3) if the memory areas do  over‐
> lap."
> 
> IMHO, the "restrict" implies that there's no overlap. If overlap
> happens, the manual doesn't say what will happen.
> 
>  From another side, I have a concern: currently, other arch don't have
> this alias behavior, IIUC(at least, per my understanding of arm and arm64
> memcpy implementations)they just copy forward. I want to keep similar behavior
> for riscv.
> 
> So I want to hear more before going towards alias-memcpy-to-memmove direction.
> 
> Thanks

If you read Matteo's original post, that was also his suggestion, and
Linus has also commented on it. In general it's better to handle the case
where the regions provided to memcpy() overlap than to resort to
undefined behavior. I provided a backwards-copy example that you can
use so that we can have both fw and bw copying for memmove(), and use
memmove() in any case. The [restrict .n] in the prototype is just there
to say that the size of src is restricted by n (the next argument). If
someone uses memcpy() with overlapping regions, which is always a
possibility, in your case it'll result in corrupted data, and we won't
even get a warning about it (it still counts as undefined behavior).

Regards,
Nick


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-30 16:52       ` Nick Kossifidis
@ 2024-01-31  5:25         ` Jisheng Zhang
  2024-01-31  9:13           ` Nick Kossifidis
  0 siblings, 1 reply; 19+ messages in thread
From: Jisheng Zhang @ 2024-01-31  5:25 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On Tue, Jan 30, 2024 at 06:52:24PM +0200, Nick Kossifidis wrote:
> On 1/30/24 15:12, Jisheng Zhang wrote:
> > On Tue, Jan 30, 2024 at 01:39:10PM +0200, Nick Kossifidis wrote:
> > > On 1/28/24 13:10, Jisheng Zhang wrote:
> > > > From: Matteo Croce <mcroce@microsoft.com>
> > > > 
> > > > When the destination buffer is before the source one, or when the
> > > > buffers doesn't overlap, it's safe to use memcpy() instead, which is
> > > > optimized to use a bigger data size possible.
> > > > 
> > > > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
> > > 
> > > I'd expect to have memmove handle both fw/bw copying and then memcpy being
> > > an alias to memmove, to also take care when regions overlap and avoid
> > > undefined behavior.
> > 
> > Hi Nick,
> > 
> > Here is somthing from man memcpy:
> > 
> > "void *memcpy(void dest[restrict .n], const void src[restrict .n],
> >                      size_t n);
> > 
> > The  memcpy()  function copies n bytes from memory area src to memory area dest.
> > The memory areas must not overlap.  Use memmove(3) if the memory areas do  over‐
> > lap."
> > 
> > IMHO, the "restrict" implies that there's no overlap. If overlap
> > happens, the manual doesn't say what will happen.
> > 
> >  From another side, I have a concern: currently, other arch don't have
> > this alias behavior, IIUC(at least, per my understanding of arm and arm64
> > memcpy implementations)they just copy forward. I want to keep similar behavior
> > for riscv.
> > 
> > So I want to hear more before going towards alias-memcpy-to-memmove direction.
> > 
> > Thanks
> 

Hi Nick,

> If you read Matteo's original post that was also his suggestion, and Linus

I did read all the discussions in Matteo's v1 ~ v5 before renewing this
series. Per my understanding, Matteo also had the concern that other arches'
implementations have no such memcpy-alias-memmove behavior.

> has also commented on that. In general it's better to handle the case where

Linus commented on https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132
about glibc aliasing memcpy to memmove, rather than on this patch series.

> the regions provided to memcpy() overlap than to resort to "undefined
> behavior", I provided a backwards copy example that you can use so that we
> can have both fw and bw copying for memmove(), and use memmove() in any
> case. The [restrict .n] in the prototype is just there to say that the size
> of src is restricted by n (the next argument). If someone uses memcpy() with

I don't have the C99 spec at hand, but I found GCC's explanation of the
restrict keyword at [1]:

"the restrict declaration promises that the code will not access that
object in any other way--only through p."

So if there's overlap in memcpy, then it contradicts the restrict
implication.

[1] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/restrict-Pointers.html

And per the manual, memcpy() users must ensure that "The memory areas
must not overlap." So I think all of the Linux kernel's memcpy
implementations (which only copy forward and don't take overlap into
consideration) are correct.

I did see the alias-memcpy-as-memmove approach in some libc implementations,
but this is not the style of the current kernel implementations.

Given that the current riscv asm implementation also doesn't do the alias
and only copies forward, and that this series improves performance without
introducing that behavior change, is it better to divide this into two
steps? Firstly, merge this series if there's no obvious bug; secondly, do
the alias as you suggested. Since you already have a basic implementation,
you could even submit your own patch ;) What do you think about this
two-step solution?

Thanks
> overlapping regions, which is always a possibility, in your case it'll
> result corrupted data, we won't even get a warning (still counts as
> undefined behavior) about it.
> 
> Regards,
> Nick
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] riscv: optimized memmove
  2024-01-31  5:25         ` Jisheng Zhang
@ 2024-01-31  9:13           ` Nick Kossifidis
  0 siblings, 0 replies; 19+ messages in thread
From: Nick Kossifidis @ 2024-01-31  9:13 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv,
	linux-kernel, Matteo Croce, kernel test robot

On 1/31/24 07:25, Jisheng Zhang wrote:
> 
> I didn't have c99 spec in hand, but I found gcc explanations about
> restrict keyword from [1]:
> 
> "the restrict declaration promises that the code will not access that
> object in any other way--only through p."
> 
> So if there's overlap in memcpy, then it contradicts the restrict
> implication.
> 
> [1] https://www.gnu.org/software/c-intro-and-ref/manual/html_node/restrict-Pointers.html
> 
The union used in the code also contradicts this. BTW the restrict 
qualifier isn't used in kernel's lib/string.c nor in the current 
implementation 
(https://elixir.bootlin.com/linux/latest/source/arch/riscv/include/asm/string.h#L16).

> And from the manual, if the memcpy users must ensure "The memory areas
> must not overlap." So I think all linux kernel's memcpy implementations(only copy
> fw and don't take overlap into consideration) are right.
> 
> I did see the alias-memcpy-as-memmove in some libc implementations, but
> this is not the style in current kernel's implementations.
> 
> Given current riscv asm implementation also doesn't do the alias and
> copy-fw only, and this series improves performance and doesn't introduce the
> Is it better to divide this into two steps: Firstly, merge this series
> if there's no obvious bug; secondly, do the alias as you suggested,
> since you have a basic implementation, you could even submit your patch
> ;) What do you think about this two steps solution?
> 

I still don't understand why you prefer undefined behavior over just 
aliasing memcpy to memmove. Anyway, do as you wish; unfortunately I don't 
have time to work on this. Feel free to use the code I shared for the bw 
copy, etc.

Regards,
Nick


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH 3/3] riscv: optimized memset
  2024-01-30 12:07   ` Nick Kossifidis
  2024-01-30 13:25     ` Jisheng Zhang
@ 2024-02-01 23:04     ` David Laight
  1 sibling, 0 replies; 19+ messages in thread
From: David Laight @ 2024-02-01 23:04 UTC (permalink / raw)
  To: 'Nick Kossifidis',
	Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: linux-riscv, linux-kernel, Matteo Croce

...
> > +		/* Compose an ulong with 'c' repeated 4/8 times */
> > +#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
> > +		cu *= 0x0101010101010101UL;

That is likely to generate a compile error on 32-bit.
Maybe:
		cu *= (unsigned long)0x0101010101010101ULL;
> > +#else
> > +		cu |= cu << 8;
> > +		cu |= cu << 16;
> > +		/* Suppress warning on 32 bit machines */
> > +		cu |= (cu << 16) << 16;
> > +#endif
> 
> I guess you could check against __SIZEOF_LONG__ here.

Or even sizeof(cu), possibly as:
		cu |= cu << (sizeof (cu) == 8 ? 32 : 0);
which I'm pretty sure a modern compiler will optimize away for 32-bit.
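
Putting the two variants side by side, a standalone sketch (the helper
name repeat_byte() is illustrative, not from the patch) could be:

static inline unsigned long repeat_byte(unsigned char c)
{
	unsigned long cu = c;

#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
	/* Truncates cleanly to 0x01010101 when long is 32-bit. */
	cu *= (unsigned long)0x0101010101010101ULL;
#else
	cu |= cu << 8;
	cu |= cu << 16;
	/* Widens once more on 64-bit; the shift amount is 0 on 32-bit. */
	cu |= cu << (sizeof(cu) == 8 ? 32 : 0);
#endif
	return cu;
}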

I do wonder whether CONFIG_ARCH_HAS_FAST_MULTIPLIER is worth
testing - you'd really want to know whether there is a RISC-V CPU
with a multiply that is slower than the shift-and-or version.
I actually doubt it.
Multiply is used so often (all array indexing) that you
really do need something better than a '1 bit per clock' loop.

It is worth remembering that you can implement an n*n multiply
with n*n 'full adders' (3 input bits, 2 output bits) with a
latency of 2*n adders.
So the latency is only twice that of the corresponding add.
For a modern chip that is not much logic at all.

	David


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-02-01 23:05 UTC | newest]

Thread overview: 19+ messages
2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
2024-01-28 12:35   ` David Laight
2024-01-30 12:11   ` Nick Kossifidis
2024-01-28 11:10 ` [PATCH 2/3] riscv: optimized memmove Jisheng Zhang
2024-01-28 12:47   ` David Laight
2024-01-30 11:30     ` Jisheng Zhang
2024-01-30 11:51       ` David Laight
2024-01-30 11:39   ` Nick Kossifidis
2024-01-30 13:12     ` Jisheng Zhang
2024-01-30 16:52       ` Nick Kossifidis
2024-01-31  5:25         ` Jisheng Zhang
2024-01-31  9:13           ` Nick Kossifidis
2024-01-28 11:10 ` [PATCH 3/3] riscv: optimized memset Jisheng Zhang
2024-01-30 12:07   ` Nick Kossifidis
2024-01-30 13:25     ` Jisheng Zhang
2024-02-01 23:04     ` David Laight
2024-01-29 18:16 ` [PATCH 0/3] riscv: optimize memcpy/memmove/memset Conor Dooley
2024-01-30  2:28   ` Jisheng Zhang
