* [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
@ 2014-07-25 15:22 Ian Campbell
From: Ian Campbell @ 2014-07-25 15:22 UTC (permalink / raw)
To: xen-devel; +Cc: julien.grall, tim, Ian Campbell, stefano.stabellini
The only really interesting changes here are the updates to mem*, which replace
the simple routines with genuinely optimised versions and introduce an optimised memcmp.
bitops: No change to the bits we import. Record new baseline.
cmpxchg: Import:
60010e5 arm64: cmpxchg: update macros to prevent warnings
Author: Mark Hambleton <mahamble@broadcom.com>
Signed-off-by: Mark Hambleton <mahamble@broadcom.com>
Signed-off-by: Mark Brown <broonie@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
e1dfda9 arm64: xchg: prevent warning if return value is unused
Author: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
e1dfda9 resolves the warning which previously caused us to skip 60010e508111.
Since arm32 and arm64 now differ here (as do Linux arm and arm64), the
existing definition in asm/system.h is moved to asm/arm32/cmpxchg.h.
Previously it shadowed the arm64 one, but the two happened to be identical.
atomics: Import:
8715466 arch,arm64: Convert smp_mb__*()
Author: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
This just drops the smp_mb__{before,after}_atomic_{dec,inc} macros, which we do not use.
spinlocks: No change. Record new baseline.
mem*: Import:
808dbac arm64: lib: Implement optimized memcpy routine
Author: zhichang.yuan <zhichang.yuan@linaro.org>
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
280adc1 arm64: lib: Implement optimized memmove routine
Author: zhichang.yuan <zhichang.yuan@linaro.org>
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
b29a51f arm64: lib: Implement optimized memset routine
Author: zhichang.yuan <zhichang.yuan@linaro.org>
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
d875c9b arm64: lib: Implement optimized memcmp routine
Author: zhichang.yuan <zhichang.yuan@linaro.org>
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
These import various routines from Linaro's Cortex Strings library.
Added an assembler.h, similar to the one on arm32, to define the various magic
symbols which these imported routines depend on (e.g. CPU_LE() and CPU_BE()).
str*: No changes. Record new baseline.
Correct the paths in the README.
*_page: No changes. Record new baseline.
The README previously said clear_page was unused while copy_page was used,
which was backwards.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
xen/arch/arm/README.LinuxPrimitives | 36 +++--
xen/arch/arm/arm64/lib/Makefile | 2 +-
xen/arch/arm/arm64/lib/assembler.h | 13 ++
xen/arch/arm/arm64/lib/memchr.S | 1 +
xen/arch/arm/arm64/lib/memcmp.S | 258 +++++++++++++++++++++++++++++++++++
xen/arch/arm/arm64/lib/memcpy.S | 193 +++++++++++++++++++++++---
xen/arch/arm/arm64/lib/memmove.S | 191 ++++++++++++++++++++++----
xen/arch/arm/arm64/lib/memset.S | 208 +++++++++++++++++++++++++---
xen/include/asm-arm/arm32/cmpxchg.h | 3 +
xen/include/asm-arm/arm64/atomic.h | 5 -
xen/include/asm-arm/arm64/cmpxchg.h | 35 +++--
xen/include/asm-arm/string.h | 5 +
xen/include/asm-arm/system.h | 3 -
13 files changed, 844 insertions(+), 109 deletions(-)
create mode 100644 xen/arch/arm/arm64/lib/assembler.h
create mode 100644 xen/arch/arm/arm64/lib/memcmp.S
diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
index 6cd03ca..69eeb70 100644
--- a/xen/arch/arm/README.LinuxPrimitives
+++ b/xen/arch/arm/README.LinuxPrimitives
@@ -6,29 +6,26 @@ were last updated.
arm64:
=====================================================================
-bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b)
+bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027)
linux/arch/arm64/lib/bitops.S xen/arch/arm/arm64/lib/bitops.S
linux/arch/arm64/include/asm/bitops.h xen/include/asm-arm/arm64/bitops.h
---------------------------------------------------------------------
-cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189)
+cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b)
linux/arch/arm64/include/asm/cmpxchg.h xen/include/asm-arm/arm64/cmpxchg.h
-Skipped:
- 60010e5 arm64: cmpxchg: update macros to prevent warnings
-
---------------------------------------------------------------------
-atomics: last sync @ v3.14-rc7 (last commit: 95c4189)
+atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027)
linux/arch/arm64/include/asm/atomic.h xen/include/asm-arm/arm64/atomic.h
---------------------------------------------------------------------
-spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189)
+spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9)
linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h
@@ -38,30 +35,31 @@ Skipped:
---------------------------------------------------------------------
-mem*: last sync @ v3.14-rc7 (last commit: 4a89922)
+mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240)
-linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S
-linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S
-linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S
-linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S
+linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S
+linux/arch/arm64/lib/memcmp.S xen/arch/arm/arm64/lib/memcmp.S
+linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S
+linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S
+linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S
-for i in memchr.S memcpy.S memmove.S memset.S ; do
+for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do
diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
done
---------------------------------------------------------------------
-str*: last sync @ v3.14-rc7 (last commit: 2b8cac8)
+str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
-linux/arch/arm/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S
-linux/arch/arm/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S
+linux/arch/arm64/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S
+linux/arch/arm64/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S
---------------------------------------------------------------------
-{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13)
+{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
-linux/arch/arm64/lib/clear_page.S unused in Xen
-linux/arch/arm64/lib/copy_page.S xen/arch/arm/arm64/lib/copy_page.S
+linux/arch/arm64/lib/clear_page.S xen/arch/arm/arm64/lib/clear_page.S
+linux/arch/arm64/lib/copy_page.S unused in Xen
=====================================================================
arm32
diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
index b895afa..2e7fb64 100644
--- a/xen/arch/arm/arm64/lib/Makefile
+++ b/xen/arch/arm/arm64/lib/Makefile
@@ -1,4 +1,4 @@
-obj-y += memcpy.o memmove.o memset.o memchr.o
+obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
obj-y += clear_page.o
obj-y += bitops.o find_next_bit.o
obj-y += strchr.o strrchr.o
diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h
new file mode 100644
index 0000000..84669d1
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/assembler.h
@@ -0,0 +1,13 @@
+#ifndef __ASM_ASSEMBLER_H__
+#define __ASM_ASSEMBLER_H__
+
+#ifndef __ASSEMBLY__
+#error "Only include this from assembly code"
+#endif
+
+/* Only LE support so far */
+#define CPU_BE(x...)
+#define CPU_LE(x...) x
+
+#endif /* __ASM_ASSEMBLER_H__ */
+
diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S
index 3cc1b01..b04590c 100644
--- a/xen/arch/arm/arm64/lib/memchr.S
+++ b/xen/arch/arm/arm64/lib/memchr.S
@@ -18,6 +18,7 @@
*/
#include <xen/config.h>
+#include "assembler.h"
/*
* Find a character in an area of memory.
diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S
new file mode 100644
index 0000000..9aad925
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/memcmp.S
@@ -0,0 +1,258 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+#include "assembler.h"
+
+/*
+* compare memory areas(when two memory areas' offset are different,
+* alignment handled by the hardware)
+*
+* Parameters:
+* x0 - const memory area 1 pointer
+* x1 - const memory area 2 pointer
+* x2 - the maximal compare byte length
+* Returns:
+* x0 - a compare result, maybe less than, equal to, or greater than ZERO
+*/
+
+/* Parameters and result. */
+src1 .req x0
+src2 .req x1
+limit .req x2
+result .req x0
+
+/* Internal variables. */
+data1 .req x3
+data1w .req w3
+data2 .req x4
+data2w .req w4
+has_nul .req x5
+diff .req x6
+endloop .req x7
+tmp1 .req x8
+tmp2 .req x9
+tmp3 .req x10
+pos .req x11
+limit_wd .req x12
+mask .req x13
+
+ENTRY(memcmp)
+ cbz limit, .Lret0
+ eor tmp1, src1, src2
+ tst tmp1, #7
+ b.ne .Lmisaligned8
+ ands tmp1, src1, #7
+ b.ne .Lmutual_align
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
+ /*
+ * The input source addresses are at alignment boundary.
+ * Directly compare eight bytes each time.
+ */
+.Lloop_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+.Lstart_realigned:
+ subs limit_wd, limit_wd, #1
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, cs /* Last Dword or differences. */
+ cbz endloop, .Lloop_aligned
+
+ /* Not reached the limit, must have found a diff. */
+ tbz limit_wd, #63, .Lnot_limit
+
+ /* Limit % 8 == 0 => the diff is in the last 8 bytes. */
+ ands limit, limit, #7
+ b.eq .Lnot_limit
+ /*
+ * The remained bytes less than 8. It is needed to extract valid data
+ * from last eight bytes of the intended memory range.
+ */
+ lsl limit, limit, #3 /* bytes-> bits. */
+ mov mask, #~0
+CPU_BE( lsr mask, mask, limit )
+CPU_LE( lsl mask, mask, limit )
+ bic data1, data1, mask
+ bic data2, data2, mask
+
+ orr diff, diff, mask
+ b .Lnot_limit
+
+.Lmutual_align:
+ /*
+ * Sources are mutually aligned, but are not currently at an
+ * alignment boundary. Round down the addresses and then mask off
+ * the bytes that precede the start point.
+ */
+ bic src1, src1, #7
+ bic src2, src2, #7
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ /*
+ * We can not add limit with alignment offset(tmp1) here. Since the
+ * addition probably make the limit overflown.
+ */
+ sub limit_wd, limit, #1/*limit != 0, so no underflow.*/
+ and tmp3, limit_wd, #7
+ lsr limit_wd, limit_wd, #3
+ add tmp3, tmp3, tmp1
+ add limit_wd, limit_wd, tmp3, lsr #3
+ add limit, limit, tmp1/* Adjust the limit for the extra. */
+
+ lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
+ neg tmp1, tmp1/* Bits to alignment -64. */
+ mov tmp2, #~0
+ /*mask off the non-intended bytes before the start address.*/
+CPU_BE( lsl tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
+ /* Little-endian. Early bytes are at LSB. */
+CPU_LE( lsr tmp2, tmp2, tmp1 )
+
+ orr data1, data1, tmp2
+ orr data2, data2, tmp2
+ b .Lstart_realigned
+
+ /*src1 and src2 have different alignment offset.*/
+.Lmisaligned8:
+ cmp limit, #8
+ b.lo .Ltiny8proc /*limit < 8: compare byte by byte*/
+
+ and tmp1, src1, #7
+ neg tmp1, tmp1
+ add tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
+ and tmp2, src2, #7
+ neg tmp2, tmp2
+ add tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
+ subs tmp3, tmp1, tmp2
+ csel pos, tmp1, tmp2, hi /*Choose the maximum.*/
+
+ sub limit, limit, pos
+ /*compare the proceeding bytes in the first 8 byte segment.*/
+.Ltinycmp:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs pos, pos, #1
+ ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
+ b.eq .Ltinycmp
+ cbnz pos, 1f /*diff occurred before the last byte.*/
+ cmp data1w, data2w
+ b.eq .Lstart_align
+1:
+ sub result, data1, data2
+ ret
+
+.Lstart_align:
+ lsr limit_wd, limit, #3
+ cbz limit_wd, .Lremain8
+
+ ands xzr, src1, #7
+ b.eq .Lrecal_offset
+ /*process more leading bytes to make src1 aligned...*/
+ add src1, src1, tmp3 /*backwards src1 to alignment boundary*/
+ add src2, src2, tmp3
+ sub limit, limit, tmp3
+ lsr limit_wd, limit, #3
+ cbz limit_wd, .Lremain8
+ /*load 8 bytes from aligned SRC1..*/
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+
+ subs limit_wd, limit_wd, #1
+ eor diff, data1, data2 /*Non-zero if differences found.*/
+ csinv endloop, diff, xzr, ne
+ cbnz endloop, .Lunequal_proc
+ /*How far is the current SRC2 from the alignment boundary...*/
+ and tmp3, tmp3, #7
+
+.Lrecal_offset:/*src1 is aligned now..*/
+ neg pos, tmp3
+.Lloopcmp_proc:
+ /*
+ * Divide the eight bytes into two parts. First,backwards the src2
+ * to an alignment boundary,load eight bytes and compare from
+ * the SRC2 alignment boundary. If all 8 bytes are equal,then start
+ * the second part's comparison. Otherwise finish the comparison.
+ * This special handle can garantee all the accesses are in the
+ * thread/task space in avoid to overrange access.
+ */
+ ldr data1, [src1,pos]
+ ldr data2, [src2,pos]
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ cbnz diff, .Lnot_limit
+
+ /*The second part process*/
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ subs limit_wd, limit_wd, #1
+ csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+ cbz endloop, .Lloopcmp_proc
+.Lunequal_proc:
+ cbz diff, .Lremain8
+
+/*There is differnence occured in the latest comparison.*/
+.Lnot_limit:
+/*
+* For little endian,reverse the low significant equal bits into MSB,then
+* following CLZ can find how many equal bits exist.
+*/
+CPU_LE( rev diff, diff )
+CPU_LE( rev data1, data1 )
+CPU_LE( rev data2, data2 )
+
+ /*
+ * The MS-non-zero bit of DIFF marks either the first bit
+ * that is different, or the end of the significant data.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ clz pos, diff
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /*
+ * We need to zero-extend (char is unsigned) the value and then
+ * perform a signed subtraction.
+ */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+
+.Lremain8:
+ /* Limit % 8 == 0 =>. all data are equal.*/
+ ands limit, limit, #7
+ b.eq .Lret0
+
+.Ltiny8proc:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs limit, limit, #1
+
+ ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
+ b.eq .Ltiny8proc
+ sub result, data1, data2
+ ret
+.Lret0:
+ mov result, #0
+ ret
+ENDPROC(memcmp)
diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S
index c8197c6..7cc885d 100644
--- a/xen/arch/arm/arm64/lib/memcpy.S
+++ b/xen/arch/arm/arm64/lib/memcpy.S
@@ -1,5 +1,13 @@
/*
* Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
*/
#include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
/*
* Copy a buffer from src to dest (alignment handled by the hardware)
@@ -26,27 +36,166 @@
* Returns:
* x0 - dest
*/
+dstin .req x0
+src .req x1
+count .req x2
+tmp1 .req x3
+tmp1w .req w3
+tmp2 .req x4
+tmp2w .req w4
+tmp3 .req x5
+tmp3w .req w5
+dst .req x6
+
+A_l .req x7
+A_h .req x8
+B_l .req x9
+B_h .req x10
+C_l .req x11
+C_h .req x12
+D_l .req x13
+D_h .req x14
+
ENTRY(memcpy)
- mov x4, x0
- subs x2, x2, #8
- b.mi 2f
-1: ldr x3, [x1], #8
- subs x2, x2, #8
- str x3, [x4], #8
- b.pl 1b
-2: adds x2, x2, #4
- b.mi 3f
- ldr w3, [x1], #4
- sub x2, x2, #4
- str w3, [x4], #4
-3: adds x2, x2, #2
- b.mi 4f
- ldrh w3, [x1], #2
- sub x2, x2, #2
- strh w3, [x4], #2
-4: adds x2, x2, #1
- b.mi 5f
- ldrb w3, [x1]
- strb w3, [x4]
-5: ret
+ mov dst, dstin
+ cmp count, #16
+ /*When memory length is less than 16, the accessed are not aligned.*/
+ b.lo .Ltiny15
+
+ neg tmp2, src
+ ands tmp2, tmp2, #15/* Bytes to reach alignment. */
+ b.eq .LSrcAligned
+ sub count, count, tmp2
+ /*
+ * Copy the leading memory data from src to dst in an increasing
+ * address order.By this way,the risk of overwritting the source
+ * memory data is eliminated when the distance between src and
+ * dst is less than 16. The memory accesses here are alignment.
+ */
+ tbz tmp2, #0, 1f
+ ldrb tmp1w, [src], #1
+ strb tmp1w, [dst], #1
+1:
+ tbz tmp2, #1, 2f
+ ldrh tmp1w, [src], #2
+ strh tmp1w, [dst], #2
+2:
+ tbz tmp2, #2, 3f
+ ldr tmp1w, [src], #4
+ str tmp1w, [dst], #4
+3:
+ tbz tmp2, #3, .LSrcAligned
+ ldr tmp1, [src],#8
+ str tmp1, [dst],#8
+
+.LSrcAligned:
+ cmp count, #64
+ b.ge .Lcpy_over64
+ /*
+ * Deal with small copies quickly by dropping straight into the
+ * exit block.
+ */
+.Ltail63:
+ /*
+ * Copy up to 48 bytes of data. At this point we only need the
+ * bottom 6 bits of count to be accurate.
+ */
+ ands tmp1, count, #0x30
+ b.eq .Ltiny15
+ cmp tmp1w, #0x20
+ b.eq 1f
+ b.lt 2f
+ ldp A_l, A_h, [src], #16
+ stp A_l, A_h, [dst], #16
+1:
+ ldp A_l, A_h, [src], #16
+ stp A_l, A_h, [dst], #16
+2:
+ ldp A_l, A_h, [src], #16
+ stp A_l, A_h, [dst], #16
+.Ltiny15:
+ /*
+ * Prefer to break one ldp/stp into several load/store to access
+ * memory in an increasing address order,rather than to load/store 16
+ * bytes from (src-16) to (dst-16) and to backward the src to aligned
+ * address,which way is used in original cortex memcpy. If keeping
+ * the original memcpy process here, memmove need to satisfy the
+ * precondition that src address is at least 16 bytes bigger than dst
+ * address,otherwise some source data will be overwritten when memove
+ * call memcpy directly. To make memmove simpler and decouple the
+ * memcpy's dependency on memmove, withdrew the original process.
+ */
+ tbz count, #3, 1f
+ ldr tmp1, [src], #8
+ str tmp1, [dst], #8
+1:
+ tbz count, #2, 2f
+ ldr tmp1w, [src], #4
+ str tmp1w, [dst], #4
+2:
+ tbz count, #1, 3f
+ ldrh tmp1w, [src], #2
+ strh tmp1w, [dst], #2
+3:
+ tbz count, #0, .Lexitfunc
+ ldrb tmp1w, [src]
+ strb tmp1w, [dst]
+
+.Lexitfunc:
+ ret
+
+.Lcpy_over64:
+ subs count, count, #128
+ b.ge .Lcpy_body_large
+ /*
+ * Less than 128 bytes to copy, so handle 64 here and then jump
+ * to the tail.
+ */
+ ldp A_l, A_h, [src],#16
+ stp A_l, A_h, [dst],#16
+ ldp B_l, B_h, [src],#16
+ ldp C_l, C_h, [src],#16
+ stp B_l, B_h, [dst],#16
+ stp C_l, C_h, [dst],#16
+ ldp D_l, D_h, [src],#16
+ stp D_l, D_h, [dst],#16
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
+
+ /*
+ * Critical loop. Start at a new cache line boundary. Assuming
+ * 64 bytes per line this ensures the entire loop is in one line.
+ */
+ .p2align L1_CACHE_SHIFT
+.Lcpy_body_large:
+ /* pre-get 64 bytes data. */
+ ldp A_l, A_h, [src],#16
+ ldp B_l, B_h, [src],#16
+ ldp C_l, C_h, [src],#16
+ ldp D_l, D_h, [src],#16
+1:
+ /*
+ * interlace the load of next 64 bytes data block with store of the last
+ * loaded 64 bytes data.
+ */
+ stp A_l, A_h, [dst],#16
+ ldp A_l, A_h, [src],#16
+ stp B_l, B_h, [dst],#16
+ ldp B_l, B_h, [src],#16
+ stp C_l, C_h, [dst],#16
+ ldp C_l, C_h, [src],#16
+ stp D_l, D_h, [dst],#16
+ ldp D_l, D_h, [src],#16
+ subs count, count, #64
+ b.ge 1b
+ stp A_l, A_h, [dst],#16
+ stp B_l, B_h, [dst],#16
+ stp C_l, C_h, [dst],#16
+ stp D_l, D_h, [dst],#16
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
ENDPROC(memcpy)
diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S
index 1bf0936..f4065b9 100644
--- a/xen/arch/arm/arm64/lib/memmove.S
+++ b/xen/arch/arm/arm64/lib/memmove.S
@@ -1,5 +1,13 @@
/*
* Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
*/
#include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
/*
* Move a buffer from src to test (alignment handled by the hardware).
@@ -27,30 +37,161 @@
* Returns:
* x0 - dest
*/
+dstin .req x0
+src .req x1
+count .req x2
+tmp1 .req x3
+tmp1w .req w3
+tmp2 .req x4
+tmp2w .req w4
+tmp3 .req x5
+tmp3w .req w5
+dst .req x6
+
+A_l .req x7
+A_h .req x8
+B_l .req x9
+B_h .req x10
+C_l .req x11
+C_h .req x12
+D_l .req x13
+D_h .req x14
+
ENTRY(memmove)
- cmp x0, x1
- b.ls memcpy
- add x4, x0, x2
- add x1, x1, x2
- subs x2, x2, #8
- b.mi 2f
-1: ldr x3, [x1, #-8]!
- subs x2, x2, #8
- str x3, [x4, #-8]!
- b.pl 1b
-2: adds x2, x2, #4
- b.mi 3f
- ldr w3, [x1, #-4]!
- sub x2, x2, #4
- str w3, [x4, #-4]!
-3: adds x2, x2, #2
- b.mi 4f
- ldrh w3, [x1, #-2]!
- sub x2, x2, #2
- strh w3, [x4, #-2]!
-4: adds x2, x2, #1
- b.mi 5f
- ldrb w3, [x1, #-1]
- strb w3, [x4, #-1]
-5: ret
+ cmp dstin, src
+ b.lo memcpy
+ add tmp1, src, count
+ cmp dstin, tmp1
+ b.hs memcpy /* No overlap. */
+
+ add dst, dstin, count
+ add src, src, count
+ cmp count, #16
+ b.lo .Ltail15 /*probably non-alignment accesses.*/
+
+ ands tmp2, src, #15 /* Bytes to reach alignment. */
+ b.eq .LSrcAligned
+ sub count, count, tmp2
+ /*
+ * process the aligned offset length to make the src aligned firstly.
+ * those extra instructions' cost is acceptable. It also make the
+ * coming accesses are based on aligned address.
+ */
+ tbz tmp2, #0, 1f
+ ldrb tmp1w, [src, #-1]!
+ strb tmp1w, [dst, #-1]!
+1:
+ tbz tmp2, #1, 2f
+ ldrh tmp1w, [src, #-2]!
+ strh tmp1w, [dst, #-2]!
+2:
+ tbz tmp2, #2, 3f
+ ldr tmp1w, [src, #-4]!
+ str tmp1w, [dst, #-4]!
+3:
+ tbz tmp2, #3, .LSrcAligned
+ ldr tmp1, [src, #-8]!
+ str tmp1, [dst, #-8]!
+
+.LSrcAligned:
+ cmp count, #64
+ b.ge .Lcpy_over64
+
+ /*
+ * Deal with small copies quickly by dropping straight into the
+ * exit block.
+ */
+.Ltail63:
+ /*
+ * Copy up to 48 bytes of data. At this point we only need the
+ * bottom 6 bits of count to be accurate.
+ */
+ ands tmp1, count, #0x30
+ b.eq .Ltail15
+ cmp tmp1w, #0x20
+ b.eq 1f
+ b.lt 2f
+ ldp A_l, A_h, [src, #-16]!
+ stp A_l, A_h, [dst, #-16]!
+1:
+ ldp A_l, A_h, [src, #-16]!
+ stp A_l, A_h, [dst, #-16]!
+2:
+ ldp A_l, A_h, [src, #-16]!
+ stp A_l, A_h, [dst, #-16]!
+
+.Ltail15:
+ tbz count, #3, 1f
+ ldr tmp1, [src, #-8]!
+ str tmp1, [dst, #-8]!
+1:
+ tbz count, #2, 2f
+ ldr tmp1w, [src, #-4]!
+ str tmp1w, [dst, #-4]!
+2:
+ tbz count, #1, 3f
+ ldrh tmp1w, [src, #-2]!
+ strh tmp1w, [dst, #-2]!
+3:
+ tbz count, #0, .Lexitfunc
+ ldrb tmp1w, [src, #-1]
+ strb tmp1w, [dst, #-1]
+
+.Lexitfunc:
+ ret
+
+.Lcpy_over64:
+ subs count, count, #128
+ b.ge .Lcpy_body_large
+ /*
+ * Less than 128 bytes to copy, so handle 64 bytes here and then jump
+ * to the tail.
+ */
+ ldp A_l, A_h, [src, #-16]
+ stp A_l, A_h, [dst, #-16]
+ ldp B_l, B_h, [src, #-32]
+ ldp C_l, C_h, [src, #-48]
+ stp B_l, B_h, [dst, #-32]
+ stp C_l, C_h, [dst, #-48]
+ ldp D_l, D_h, [src, #-64]!
+ stp D_l, D_h, [dst, #-64]!
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
+
+ /*
+ * Critical loop. Start at a new cache line boundary. Assuming
+ * 64 bytes per line this ensures the entire loop is in one line.
+ */
+ .p2align L1_CACHE_SHIFT
+.Lcpy_body_large:
+ /* pre-load 64 bytes data. */
+ ldp A_l, A_h, [src, #-16]
+ ldp B_l, B_h, [src, #-32]
+ ldp C_l, C_h, [src, #-48]
+ ldp D_l, D_h, [src, #-64]!
+1:
+ /*
+ * interlace the load of next 64 bytes data block with store of the last
+ * loaded 64 bytes data.
+ */
+ stp A_l, A_h, [dst, #-16]
+ ldp A_l, A_h, [src, #-16]
+ stp B_l, B_h, [dst, #-32]
+ ldp B_l, B_h, [src, #-32]
+ stp C_l, C_h, [dst, #-48]
+ ldp C_l, C_h, [src, #-48]
+ stp D_l, D_h, [dst, #-64]!
+ ldp D_l, D_h, [src, #-64]!
+ subs count, count, #64
+ b.ge 1b
+ stp A_l, A_h, [dst, #-16]
+ stp B_l, B_h, [dst, #-32]
+ stp C_l, C_h, [dst, #-48]
+ stp D_l, D_h, [dst, #-64]!
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
ENDPROC(memmove)
diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
index 25a4fb6..4ee714d 100644
--- a/xen/arch/arm/arm64/lib/memset.S
+++ b/xen/arch/arm/arm64/lib/memset.S
@@ -1,5 +1,13 @@
/*
* Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
*/
#include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
/*
* Fill in the buffer with character c (alignment handled by the hardware)
@@ -26,27 +36,181 @@
* Returns:
* x0 - buf
*/
+
+dstin .req x0
+val .req w1
+count .req x2
+tmp1 .req x3
+tmp1w .req w3
+tmp2 .req x4
+tmp2w .req w4
+zva_len_x .req x5
+zva_len .req w5
+zva_bits_x .req x6
+
+A_l .req x7
+A_lw .req w7
+dst .req x8
+tmp3w .req w9
+tmp3 .req x9
+
ENTRY(memset)
- mov x4, x0
- and w1, w1, #0xff
- orr w1, w1, w1, lsl #8
- orr w1, w1, w1, lsl #16
- orr x1, x1, x1, lsl #32
- subs x2, x2, #8
- b.mi 2f
-1: str x1, [x4], #8
- subs x2, x2, #8
- b.pl 1b
-2: adds x2, x2, #4
- b.mi 3f
- sub x2, x2, #4
- str w1, [x4], #4
-3: adds x2, x2, #2
- b.mi 4f
- sub x2, x2, #2
- strh w1, [x4], #2
-4: adds x2, x2, #1
- b.mi 5f
- strb w1, [x4]
-5: ret
+ mov dst, dstin /* Preserve return value. */
+ and A_lw, val, #255
+ orr A_lw, A_lw, A_lw, lsl #8
+ orr A_lw, A_lw, A_lw, lsl #16
+ orr A_l, A_l, A_l, lsl #32
+
+ cmp count, #15
+ b.hi .Lover16_proc
+ /*All store maybe are non-aligned..*/
+ tbz count, #3, 1f
+ str A_l, [dst], #8
+1:
+ tbz count, #2, 2f
+ str A_lw, [dst], #4
+2:
+ tbz count, #1, 3f
+ strh A_lw, [dst], #2
+3:
+ tbz count, #0, 4f
+ strb A_lw, [dst]
+4:
+ ret
+
+.Lover16_proc:
+ /*Whether the start address is aligned with 16.*/
+ neg tmp2, dst
+ ands tmp2, tmp2, #15
+ b.eq .Laligned
+/*
+* The count is not less than 16, we can use stp to store the start 16 bytes,
+* then adjust the dst aligned with 16.This process will make the current
+* memory address at alignment boundary.
+*/
+ stp A_l, A_l, [dst] /*non-aligned store..*/
+ /*make the dst aligned..*/
+ sub count, count, tmp2
+ add dst, dst, tmp2
+
+.Laligned:
+ cbz A_l, .Lzero_mem
+
+.Ltail_maybe_long:
+ cmp count, #64
+ b.ge .Lnot_short
+.Ltail63:
+ ands tmp1, count, #0x30
+ b.eq 3f
+ cmp tmp1w, #0x20
+ b.eq 1f
+ b.lt 2f
+ stp A_l, A_l, [dst], #16
+1:
+ stp A_l, A_l, [dst], #16
+2:
+ stp A_l, A_l, [dst], #16
+/*
+* The last store length is less than 16,use stp to write last 16 bytes.
+* It will lead some bytes written twice and the access is non-aligned.
+*/
+3:
+ ands count, count, #15
+ cbz count, 4f
+ add dst, dst, count
+ stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */
+4:
+ ret
+
+ /*
+ * Critical loop. Start at a new cache line boundary. Assuming
+ * 64 bytes per line, this ensures the entire loop is in one line.
+ */
+ .p2align L1_CACHE_SHIFT
+.Lnot_short:
+ sub dst, dst, #16/* Pre-bias. */
+ sub count, count, #64
+1:
+ stp A_l, A_l, [dst, #16]
+ stp A_l, A_l, [dst, #32]
+ stp A_l, A_l, [dst, #48]
+ stp A_l, A_l, [dst, #64]!
+ subs count, count, #64
+ b.ge 1b
+ tst count, #0x3f
+ add dst, dst, #16
+ b.ne .Ltail63
+.Lexitfunc:
+ ret
+
+ /*
+ * For zeroing memory, check to see if we can use the ZVA feature to
+ * zero entire 'cache' lines.
+ */
+.Lzero_mem:
+ cmp count, #63
+ b.le .Ltail63
+ /*
+ * For zeroing small amounts of memory, it's not worth setting up
+ * the line-clear code.
+ */
+ cmp count, #128
+ b.lt .Lnot_short /*count is at least 128 bytes*/
+
+ mrs tmp1, dczid_el0
+ tbnz tmp1, #4, .Lnot_short
+ mov tmp3w, #4
+ and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
+ lsl zva_len, tmp3w, zva_len
+
+ ands tmp3w, zva_len, #63
+ /*
+ * ensure the zva_len is not less than 64.
+ * It is not meaningful to use ZVA if the block size is less than 64.
+ */
+ b.ne .Lnot_short
+.Lzero_by_line:
+ /*
+ * Compute how far we need to go to become suitably aligned. We're
+ * already at quad-word alignment.
+ */
+ cmp count, zva_len_x
+ b.lt .Lnot_short /* Not enough to reach alignment. */
+ sub zva_bits_x, zva_len_x, #1
+ neg tmp2, dst
+ ands tmp2, tmp2, zva_bits_x
+ b.eq 2f /* Already aligned. */
+ /* Not aligned, check that there's enough to copy after alignment.*/
+ sub tmp1, count, tmp2
+ /*
+ * grantee the remain length to be ZVA is bigger than 64,
+ * avoid to make the 2f's process over mem range.*/
+ cmp tmp1, #64
+ ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */
+ b.lt .Lnot_short
+ /*
+ * We know that there's at least 64 bytes to zero and that it's safe
+ * to overrun by 64 bytes.
+ */
+ mov count, tmp1
+1:
+ stp A_l, A_l, [dst]
+ stp A_l, A_l, [dst, #16]
+ stp A_l, A_l, [dst, #32]
+ subs tmp2, tmp2, #64
+ stp A_l, A_l, [dst, #48]
+ add dst, dst, #64
+ b.ge 1b
+ /* We've overrun a bit, so adjust dst downwards.*/
+ add dst, dst, tmp2
+2:
+ sub count, count, zva_len_x
+3:
+ dc zva, dst
+ add dst, dst, zva_len_x
+ subs count, count, zva_len_x
+ b.ge 3b
+ ands count, count, zva_bits_x
+ b.ne .Ltail_maybe_long
+ ret
ENDPROC(memset)
diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
index 3f4e7a1..9a511f2 100644
--- a/xen/include/asm-arm/arm32/cmpxchg.h
+++ b/xen/include/asm-arm/arm32/cmpxchg.h
@@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
return ret;
}
+#define xchg(ptr,x) \
+ ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
+
/*
* Atomic compare and exchange. Compare OLD with MEM, if identical,
* store NEW in MEM. Return the initial value in MEM. Success is
diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
index b5d50f2..b49219e 100644
--- a/xen/include/asm-arm/arm64/atomic.h
+++ b/xen/include/asm-arm/arm64/atomic.h
@@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
#define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
-#define smp_mb__before_atomic_dec() smp_mb()
-#define smp_mb__after_atomic_dec() smp_mb()
-#define smp_mb__before_atomic_inc() smp_mb()
-#define smp_mb__after_atomic_inc() smp_mb()
-
#endif
/*
* Local variables:
diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
index 4e930ce..ae42b2f 100644
--- a/xen/include/asm-arm/arm64/cmpxchg.h
+++ b/xen/include/asm-arm/arm64/cmpxchg.h
@@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
}
#define xchg(ptr,x) \
- ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
+({ \
+ __typeof__(*(ptr)) __ret; \
+ __ret = (__typeof__(*(ptr))) \
+ __xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
+ __ret; \
+})
extern void __bad_cmpxchg(volatile void *ptr, int size);
@@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
return ret;
}
-#define cmpxchg(ptr,o,n) \
- ((__typeof__(*(ptr)))__cmpxchg_mb((ptr), \
- (unsigned long)(o), \
- (unsigned long)(n), \
- sizeof(*(ptr))))
-
-#define cmpxchg_local(ptr,o,n) \
- ((__typeof__(*(ptr)))__cmpxchg((ptr), \
- (unsigned long)(o), \
- (unsigned long)(n), \
- sizeof(*(ptr))))
+#define cmpxchg(ptr, o, n) \
+({ \
+ __typeof__(*(ptr)) __ret; \
+ __ret = (__typeof__(*(ptr))) \
+ __cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
+ sizeof(*(ptr))); \
+ __ret; \
+})
+
+#define cmpxchg_local(ptr, o, n) \
+({ \
+ __typeof__(*(ptr)) __ret; \
+ __ret = (__typeof__(*(ptr))) \
+ __cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), sizeof(*(ptr))); \
+ __ret; \
+})
#endif
/*
diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
index 3242762..dfad1fe 100644
--- a/xen/include/asm-arm/string.h
+++ b/xen/include/asm-arm/string.h
@@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
#define __HAVE_ARCH_MEMCPY
extern void * memcpy(void *, const void *, __kernel_size_t);
+#if defined(CONFIG_ARM_64)
+#define __HAVE_ARCH_MEMCMP
+extern int memcmp(const void *, const void *, __kernel_size_t);
+#endif
+
/* Some versions of gcc don't have this builtin. It's non-critical anyway. */
#define __HAVE_ARCH_MEMMOVE
extern void *memmove(void *dest, const void *src, size_t n);
diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
index 7aaaf50..ce3d38a 100644
--- a/xen/include/asm-arm/system.h
+++ b/xen/include/asm-arm/system.h
@@ -33,9 +33,6 @@
#define smp_wmb() dmb(ishst)
-#define xchg(ptr,x) \
- ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
-
/*
* This is used to ensure the compiler did actually allocate the register we
* asked it for some inline assembly sequences. Apparently we can't trust
--
1.7.10.4
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
@ 2014-07-25 15:22 ` Ian Campbell
2014-07-25 15:42 ` Julien Grall
2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
2014-07-25 15:43 ` Ian Campbell
2 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:22 UTC (permalink / raw)
To: xen-devel; +Cc: julien.grall, tim, Ian Campbell, stefano.stabellini
bitops, cmpxchg, atomics: Import:
c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
Author: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
atomics: In addition to the above import:
db38ee8 ARM: 7983/1: atomics: implement a better __atomic_add_unless for v6+
Author: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
spinlocks: We have diverged from Linux, so no updates but note this in the README.
mem* and str*: Import:
d98b90e ARM: 7990/1: asm: rename logical shift macros push pull into lspush lspull
Author: Victor Kamensky <victor.kamensky@linaro.org>
Suggested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
For some reason str* were listed under mem* in the README; fix that.
libgcc: No changes, update baseline
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
xen/arch/arm/README.LinuxPrimitives | 17 +++++++--------
xen/arch/arm/arm32/lib/assembler.h | 8 +++----
xen/arch/arm/arm32/lib/bitops.h | 5 +++++
xen/arch/arm/arm32/lib/copy_template.S | 36 ++++++++++++++++----------------
xen/arch/arm/arm32/lib/memmove.S | 36 ++++++++++++++++----------------
xen/include/asm-arm/arm32/atomic.h | 32 ++++++++++++++++++++++++++++
xen/include/asm-arm/arm32/cmpxchg.h | 5 +++++
7 files changed, 90 insertions(+), 49 deletions(-)
diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
index 69eeb70..7e15b04 100644
--- a/xen/arch/arm/README.LinuxPrimitives
+++ b/xen/arch/arm/README.LinuxPrimitives
@@ -65,7 +65,7 @@ linux/arch/arm64/lib/copy_page.S unused in Xen
arm32
=====================================================================
-bitops: last sync @ v3.14-rc7 (last commit: b7ec699)
+bitops: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
linux/arch/arm/lib/bitops.h xen/arch/arm/arm32/lib/bitops.h
linux/arch/arm/lib/changebit.S xen/arch/arm/arm32/lib/changebit.S
@@ -83,13 +83,13 @@ done
---------------------------------------------------------------------
-cmpxchg: last sync @ v3.14-rc7 (last commit: 775ebcc)
+cmpxchg: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
linux/arch/arm/include/asm/cmpxchg.h xen/include/asm-arm/arm32/cmpxchg.h
---------------------------------------------------------------------
-atomics: last sync @ v3.14-rc7 (last commit: aed3a4e)
+atomics: last sync @ v3.16-rc6 (last commit: 030d0178bdbd)
linux/arch/arm/include/asm/atomic.h xen/include/asm-arm/arm32/atomic.h
@@ -99,6 +99,8 @@ spinlocks: last sync: 15e7e5c1ebf5
linux/arch/arm/include/asm/spinlock.h xen/include/asm-arm/arm32/spinlock.h
+*** Linux has switched to ticket locks but we still use bitlocks.
+
resync to v3.14-rc7:
7c8746a ARM: 7955/1: spinlock: ensure we have a compiler barrier before sev
@@ -111,7 +113,7 @@ resync to v3.14-rc7:
---------------------------------------------------------------------
-mem*: last sync @ v3.14-rc7 (last commit: 418df63a)
+mem*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
linux/arch/arm/lib/copy_template.S xen/arch/arm/arm32/lib/copy_template.S
linux/arch/arm/lib/memchr.S xen/arch/arm/arm32/lib/memchr.S
@@ -120,9 +122,6 @@ linux/arch/arm/lib/memmove.S xen/arch/arm/arm32/lib/memmove.S
linux/arch/arm/lib/memset.S xen/arch/arm/arm32/lib/memset.S
linux/arch/arm/lib/memzero.S xen/arch/arm/arm32/lib/memzero.S
-linux/arch/arm/lib/strchr.S xen/arch/arm/arm32/lib/strchr.S
-linux/arch/arm/lib/strrchr.S xen/arch/arm/arm32/lib/strrchr.S
-
for i in copy_template.S memchr.S memcpy.S memmove.S memset.S \
memzero.S ; do
diff -u linux/arch/arm/lib/$i xen/arch/arm/arm32/lib/$i
@@ -130,7 +129,7 @@ done
---------------------------------------------------------------------
-str*: last sync @ v3.13-rc7 (last commit: 93ed397)
+str*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
linux/arch/arm/lib/strchr.S xen/arch/arm/arm32/lib/strchr.S
linux/arch/arm/lib/strrchr.S xen/arch/arm/arm32/lib/strrchr.S
@@ -145,7 +144,7 @@ clear_page == memset
---------------------------------------------------------------------
-libgcc: last sync @ v3.14-rc7 (last commit: 01885bc)
+libgcc: last sync @ v3.16-rc6 (last commit: 01885bc)
linux/arch/arm/lib/lib1funcs.S xen/arch/arm/arm32/lib/lib1funcs.S
linux/arch/arm/lib/lshrdi3.S xen/arch/arm/arm32/lib/lshrdi3.S
diff --git a/xen/arch/arm/arm32/lib/assembler.h b/xen/arch/arm/arm32/lib/assembler.h
index f8d4b3a..6de2638 100644
--- a/xen/arch/arm/arm32/lib/assembler.h
+++ b/xen/arch/arm/arm32/lib/assembler.h
@@ -36,8 +36,8 @@
* Endian independent macros for shifting bytes within registers.
*/
#ifndef __ARMEB__
-#define pull lsr
-#define push lsl
+#define lspull lsr
+#define lspush lsl
#define get_byte_0 lsl #0
#define get_byte_1 lsr #8
#define get_byte_2 lsr #16
@@ -47,8 +47,8 @@
#define put_byte_2 lsl #16
#define put_byte_3 lsl #24
#else
-#define pull lsl
-#define push lsr
+#define lspull lsl
+#define lspush lsr
#define get_byte_0 lsr #24
#define get_byte_1 lsr #16
#define get_byte_2 lsr #8
diff --git a/xen/arch/arm/arm32/lib/bitops.h b/xen/arch/arm/arm32/lib/bitops.h
index 25784c3..a167c2d 100644
--- a/xen/arch/arm/arm32/lib/bitops.h
+++ b/xen/arch/arm/arm32/lib/bitops.h
@@ -37,6 +37,11 @@ UNWIND( .fnstart )
add r1, r1, r0, lsl #2 @ Get word offset
mov r3, r2, lsl r3 @ create mask
smp_dmb
+#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
+ .arch_extension mp
+ ALT_SMP(W(pldw) [r1])
+ ALT_UP(W(nop))
+#endif
1: ldrex r2, [r1]
ands r0, r2, r3 @ save old value of bit
\instr r2, r2, r3 @ toggle bit
diff --git a/xen/arch/arm/arm32/lib/copy_template.S b/xen/arch/arm/arm32/lib/copy_template.S
index 805e3f8..3bc8eb8 100644
--- a/xen/arch/arm/arm32/lib/copy_template.S
+++ b/xen/arch/arm/arm32/lib/copy_template.S
@@ -197,24 +197,24 @@
12: PLD( pld [r1, #124] )
13: ldr4w r1, r4, r5, r6, r7, abort=19f
- mov r3, lr, pull #\pull
+ mov r3, lr, lspull #\pull
subs r2, r2, #32
ldr4w r1, r8, r9, ip, lr, abort=19f
- orr r3, r3, r4, push #\push
- mov r4, r4, pull #\pull
- orr r4, r4, r5, push #\push
- mov r5, r5, pull #\pull
- orr r5, r5, r6, push #\push
- mov r6, r6, pull #\pull
- orr r6, r6, r7, push #\push
- mov r7, r7, pull #\pull
- orr r7, r7, r8, push #\push
- mov r8, r8, pull #\pull
- orr r8, r8, r9, push #\push
- mov r9, r9, pull #\pull
- orr r9, r9, ip, push #\push
- mov ip, ip, pull #\pull
- orr ip, ip, lr, push #\push
+ orr r3, r3, r4, lspush #\push
+ mov r4, r4, lspull #\pull
+ orr r4, r4, r5, lspush #\push
+ mov r5, r5, lspull #\pull
+ orr r5, r5, r6, lspush #\push
+ mov r6, r6, lspull #\pull
+ orr r6, r6, r7, lspush #\push
+ mov r7, r7, lspull #\pull
+ orr r7, r7, r8, lspush #\push
+ mov r8, r8, lspull #\pull
+ orr r8, r8, r9, lspush #\push
+ mov r9, r9, lspull #\pull
+ orr r9, r9, ip, lspush #\push
+ mov ip, ip, lspull #\pull
+ orr ip, ip, lr, lspush #\push
str8w r0, r3, r4, r5, r6, r7, r8, r9, ip, , abort=19f
bge 12b
PLD( cmn r2, #96 )
@@ -225,10 +225,10 @@
14: ands ip, r2, #28
beq 16f
-15: mov r3, lr, pull #\pull
+15: mov r3, lr, lspull #\pull
ldr1w r1, lr, abort=21f
subs ip, ip, #4
- orr r3, r3, lr, push #\push
+ orr r3, r3, lr, lspush #\push
str1w r0, r3, abort=21f
bgt 15b
CALGN( cmp r2, #0 )
diff --git a/xen/arch/arm/arm32/lib/memmove.S b/xen/arch/arm/arm32/lib/memmove.S
index 4e142b8..18634c3 100644
--- a/xen/arch/arm/arm32/lib/memmove.S
+++ b/xen/arch/arm/arm32/lib/memmove.S
@@ -148,24 +148,24 @@ ENTRY(memmove)
12: PLD( pld [r1, #-128] )
13: ldmdb r1!, {r7, r8, r9, ip}
- mov lr, r3, push #\push
+ mov lr, r3, lspush #\push
subs r2, r2, #32
ldmdb r1!, {r3, r4, r5, r6}
- orr lr, lr, ip, pull #\pull
- mov ip, ip, push #\push
- orr ip, ip, r9, pull #\pull
- mov r9, r9, push #\push
- orr r9, r9, r8, pull #\pull
- mov r8, r8, push #\push
- orr r8, r8, r7, pull #\pull
- mov r7, r7, push #\push
- orr r7, r7, r6, pull #\pull
- mov r6, r6, push #\push
- orr r6, r6, r5, pull #\pull
- mov r5, r5, push #\push
- orr r5, r5, r4, pull #\pull
- mov r4, r4, push #\push
- orr r4, r4, r3, pull #\pull
+ orr lr, lr, ip, lspull #\pull
+ mov ip, ip, lspush #\push
+ orr ip, ip, r9, lspull #\pull
+ mov r9, r9, lspush #\push
+ orr r9, r9, r8, lspull #\pull
+ mov r8, r8, lspush #\push
+ orr r8, r8, r7, lspull #\pull
+ mov r7, r7, lspush #\push
+ orr r7, r7, r6, lspull #\pull
+ mov r6, r6, lspush #\push
+ orr r6, r6, r5, lspull #\pull
+ mov r5, r5, lspush #\push
+ orr r5, r5, r4, lspull #\pull
+ mov r4, r4, lspush #\push
+ orr r4, r4, r3, lspull #\pull
stmdb r0!, {r4 - r9, ip, lr}
bge 12b
PLD( cmn r2, #96 )
@@ -176,10 +176,10 @@ ENTRY(memmove)
14: ands ip, r2, #28
beq 16f
-15: mov lr, r3, push #\push
+15: mov lr, r3, lspush #\push
ldr r3, [r1, #-4]!
subs ip, ip, #4
- orr lr, lr, r3, pull #\pull
+ orr lr, lr, r3, lspull #\pull
str lr, [r0, #-4]!
bgt 15b
CALGN( cmp r2, #0 )
diff --git a/xen/include/asm-arm/arm32/atomic.h b/xen/include/asm-arm/arm32/atomic.h
index 3d601d1..7ec712f 100644
--- a/xen/include/asm-arm/arm32/atomic.h
+++ b/xen/include/asm-arm/arm32/atomic.h
@@ -39,6 +39,7 @@ static inline int atomic_add_return(int i, atomic_t *v)
int result;
smp_mb();
+ prefetchw(&v->counter);
__asm__ __volatile__("@ atomic_add_return\n"
"1: ldrex %0, [%3]\n"
@@ -78,6 +79,7 @@ static inline int atomic_sub_return(int i, atomic_t *v)
int result;
smp_mb();
+ prefetchw(&v->counter);
__asm__ __volatile__("@ atomic_sub_return\n"
"1: ldrex %0, [%3]\n"
@@ -100,6 +102,7 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
unsigned long res;
smp_mb();
+ prefetchw(&ptr->counter);
do {
__asm__ __volatile__("@ atomic_cmpxchg\n"
@@ -117,6 +120,35 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
return oldval;
}
+static inline int __atomic_add_unless(atomic_t *v, int a, int u)
+{
+ int oldval, newval;
+ unsigned long tmp;
+
+ smp_mb();
+ prefetchw(&v->counter);
+
+ __asm__ __volatile__ ("@ atomic_add_unless\n"
+"1: ldrex %0, [%4]\n"
+" teq %0, %5\n"
+" beq 2f\n"
+" add %1, %0, %6\n"
+" strex %2, %1, [%4]\n"
+" teq %2, #0\n"
+" bne 1b\n"
+"2:"
+ : "=&r" (oldval), "=&r" (newval), "=&r" (tmp), "+Qo" (v->counter)
+ : "r" (&v->counter), "r" (u), "r" (a)
+ : "cc");
+
+ if (oldval != u)
+ smp_mb();
+
+ return oldval;
+}
+
+#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+
#define atomic_inc(v) atomic_add(1, v)
#define atomic_dec(v) atomic_sub(1, v)
diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
index 9a511f2..03e0bed 100644
--- a/xen/include/asm-arm/arm32/cmpxchg.h
+++ b/xen/include/asm-arm/arm32/cmpxchg.h
@@ -1,6 +1,8 @@
#ifndef __ASM_ARM32_CMPXCHG_H
#define __ASM_ARM32_CMPXCHG_H
+#include <xen/prefetch.h>
+
extern void __bad_xchg(volatile void *, int);
static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
@@ -9,6 +11,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
unsigned int tmp;
smp_mb();
+ prefetchw((const void *)ptr);
switch (size) {
case 1:
@@ -56,6 +59,8 @@ static always_inline unsigned long __cmpxchg(
{
unsigned long oldval, res;
+ prefetchw((const void *)ptr);
+
switch (size) {
case 1:
do {
--
1.7.10.4
* Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
@ 2014-07-25 15:36 ` Julien Grall
2014-08-04 16:16 ` Ian Campbell
2014-07-25 15:43 ` Ian Campbell
2 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 15:36 UTC (permalink / raw)
To: Ian Campbell, xen-devel; +Cc: tim, stefano.stabellini
Hi Ian,
On 07/25/2014 04:22 PM, Ian Campbell wrote:
> The only really interesting changes here are the updates to mem* which update
> to actually optimised versions and introduce an optimised memcmp.
I didn't read the whole code as I assume it's just a copy with a few
changes from Linux.
Acked-by: Julien Grall <julien.grall@linaro.org>
Regards,
> bitops: No change to the bits we import. Record new baseline.
>
> cmpxchg: Import:
> 60010e5 arm64: cmpxchg: update macros to prevent warnings
> Author: Mark Hambleton <mahamble@broadcom.com>
> Signed-off-by: Mark Hambleton <mahamble@broadcom.com>
> Signed-off-by: Mark Brown <broonie@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>
> e1dfda9 arm64: xchg: prevent warning if return value is unused
> Author: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>
> e1dfda9 resolves the warning which previous caused us to skip 60010e508111.
>
> Since arm32 and arm64 now differ (as do Linux arm and arm64) here the
> existing definition in asm/system.h gets moved to asm/arm32/cmpxchg.h.
> Previously this was shadowing the arm64 one but they happened to be identical.
>
> atomics: Import:
> 8715466 arch,arm64: Convert smp_mb__*()
> Author: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>
> This just drops some unused (by us) smp_mb__*_atomic_*.
>
> spinlocks: No change. Record new baseline.
>
> mem*: Import:
> 808dbac arm64: lib: Implement optimized memcpy routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 280adc1 arm64: lib: Implement optimized memmove routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> b29a51f arm64: lib: Implement optimized memset routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> d875c9b arm64: lib: Implement optimized memcmp routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>
> These import various routines from Linaro's Cortex Strings library.
>
> Added assembler.h similar to on arm32 to define the various magic symbols
> which these imported routines depend on (e.g. CPU_LE() and CPU_BE())
>
> str*: No changes. Record new baseline.
>
> Correct the paths in the README.
>
> *_page: No changes. Record new baseline.
>
> README previous said clear_page was unused while clear page was, which was
> backwards.
>
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
> xen/arch/arm/README.LinuxPrimitives | 36 +++--
> xen/arch/arm/arm64/lib/Makefile | 2 +-
> xen/arch/arm/arm64/lib/assembler.h | 13 ++
> xen/arch/arm/arm64/lib/memchr.S | 1 +
> xen/arch/arm/arm64/lib/memcmp.S | 258 +++++++++++++++++++++++++++++++++++
> xen/arch/arm/arm64/lib/memcpy.S | 193 +++++++++++++++++++++++---
> xen/arch/arm/arm64/lib/memmove.S | 191 ++++++++++++++++++++++----
> xen/arch/arm/arm64/lib/memset.S | 208 +++++++++++++++++++++++++---
> xen/include/asm-arm/arm32/cmpxchg.h | 3 +
> xen/include/asm-arm/arm64/atomic.h | 5 -
> xen/include/asm-arm/arm64/cmpxchg.h | 35 +++--
> xen/include/asm-arm/string.h | 5 +
> xen/include/asm-arm/system.h | 3 -
> 13 files changed, 844 insertions(+), 109 deletions(-)
> create mode 100644 xen/arch/arm/arm64/lib/assembler.h
> create mode 100644 xen/arch/arm/arm64/lib/memcmp.S
>
> diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
> index 6cd03ca..69eeb70 100644
> --- a/xen/arch/arm/README.LinuxPrimitives
> +++ b/xen/arch/arm/README.LinuxPrimitives
> @@ -6,29 +6,26 @@ were last updated.
> arm64:
> =====================================================================
>
> -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b)
> +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>
> linux/arch/arm64/lib/bitops.S xen/arch/arm/arm64/lib/bitops.S
> linux/arch/arm64/include/asm/bitops.h xen/include/asm-arm/arm64/bitops.h
>
> ---------------------------------------------------------------------
>
> -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189)
> +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b)
>
> linux/arch/arm64/include/asm/cmpxchg.h xen/include/asm-arm/arm64/cmpxchg.h
>
> -Skipped:
> - 60010e5 arm64: cmpxchg: update macros to prevent warnings
> -
> ---------------------------------------------------------------------
>
> -atomics: last sync @ v3.14-rc7 (last commit: 95c4189)
> +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>
> linux/arch/arm64/include/asm/atomic.h xen/include/asm-arm/arm64/atomic.h
>
> ---------------------------------------------------------------------
>
> -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189)
> +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9)
>
> linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h
>
> @@ -38,30 +35,31 @@ Skipped:
>
> ---------------------------------------------------------------------
>
> -mem*: last sync @ v3.14-rc7 (last commit: 4a89922)
> +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240)
>
> -linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S
> -linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S
> -linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S
> -linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S
> +linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S
> +linux/arch/arm64/lib/memcmp.S xen/arch/arm/arm64/lib/memcmp.S
> +linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S
> +linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S
> +linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S
>
> -for i in memchr.S memcpy.S memmove.S memset.S ; do
> +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do
> diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
> done
>
> ---------------------------------------------------------------------
>
> -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8)
> +str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
>
> -linux/arch/arm/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S
> -linux/arch/arm/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S
> +linux/arch/arm64/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S
> +linux/arch/arm64/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S
>
> ---------------------------------------------------------------------
>
> -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13)
> +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
>
> -linux/arch/arm64/lib/clear_page.S unused in Xen
> -linux/arch/arm64/lib/copy_page.S xen/arch/arm/arm64/lib/copy_page.S
> +linux/arch/arm64/lib/clear_page.S xen/arch/arm/arm64/lib/clear_page.S
> +linux/arch/arm64/lib/copy_page.S unused in Xen
>
> =====================================================================
> arm32
> diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
> index b895afa..2e7fb64 100644
> --- a/xen/arch/arm/arm64/lib/Makefile
> +++ b/xen/arch/arm/arm64/lib/Makefile
> @@ -1,4 +1,4 @@
> -obj-y += memcpy.o memmove.o memset.o memchr.o
> +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
> obj-y += clear_page.o
> obj-y += bitops.o find_next_bit.o
> obj-y += strchr.o strrchr.o
> diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h
> new file mode 100644
> index 0000000..84669d1
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/assembler.h
> @@ -0,0 +1,13 @@
> +#ifndef __ASM_ASSEMBLER_H__
> +#define __ASM_ASSEMBLER_H__
> +
> +#ifndef __ASSEMBLY__
> +#error "Only include this from assembly code"
> +#endif
> +
> +/* Only LE support so far */
> +#define CPU_BE(x...)
> +#define CPU_LE(x...) x
> +
> +#endif /* __ASM_ASSEMBLER_H__ */
> +
> diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S
> index 3cc1b01..b04590c 100644
> --- a/xen/arch/arm/arm64/lib/memchr.S
> +++ b/xen/arch/arm/arm64/lib/memchr.S
> @@ -18,6 +18,7 @@
> */
>
> #include <xen/config.h>
> +#include "assembler.h"
>
> /*
> * Find a character in an area of memory.
> diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S
> new file mode 100644
> index 0000000..9aad925
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/memcmp.S
> @@ -0,0 +1,258 @@
> +/*
> + * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/config.h>
> +#include "assembler.h"
> +
> +/*
> +* compare memory areas(when two memory areas' offset are different,
> +* alignment handled by the hardware)
> +*
> +* Parameters:
> +* x0 - const memory area 1 pointer
> +* x1 - const memory area 2 pointer
> +* x2 - the maximal compare byte length
> +* Returns:
> +* x0 - a compare result, maybe less than, equal to, or greater than ZERO
> +*/
> +
> +/* Parameters and result. */
> +src1 .req x0
> +src2 .req x1
> +limit .req x2
> +result .req x0
> +
> +/* Internal variables. */
> +data1 .req x3
> +data1w .req w3
> +data2 .req x4
> +data2w .req w4
> +has_nul .req x5
> +diff .req x6
> +endloop .req x7
> +tmp1 .req x8
> +tmp2 .req x9
> +tmp3 .req x10
> +pos .req x11
> +limit_wd .req x12
> +mask .req x13
> +
> +ENTRY(memcmp)
> + cbz limit, .Lret0
> + eor tmp1, src1, src2
> + tst tmp1, #7
> + b.ne .Lmisaligned8
> + ands tmp1, src1, #7
> + b.ne .Lmutual_align
> + sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
> + lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
> + /*
> + * The input source addresses are at alignment boundary.
> + * Directly compare eight bytes each time.
> + */
> +.Lloop_aligned:
> + ldr data1, [src1], #8
> + ldr data2, [src2], #8
> +.Lstart_realigned:
> + subs limit_wd, limit_wd, #1
> + eor diff, data1, data2 /* Non-zero if differences found. */
> + csinv endloop, diff, xzr, cs /* Last Dword or differences. */
> + cbz endloop, .Lloop_aligned
> +
> + /* Not reached the limit, must have found a diff. */
> + tbz limit_wd, #63, .Lnot_limit
> +
> + /* Limit % 8 == 0 => the diff is in the last 8 bytes. */
> + ands limit, limit, #7
> + b.eq .Lnot_limit
> + /*
> + * The remained bytes less than 8. It is needed to extract valid data
> + * from last eight bytes of the intended memory range.
> + */
> + lsl limit, limit, #3 /* bytes-> bits. */
> + mov mask, #~0
> +CPU_BE( lsr mask, mask, limit )
> +CPU_LE( lsl mask, mask, limit )
> + bic data1, data1, mask
> + bic data2, data2, mask
> +
> + orr diff, diff, mask
> + b .Lnot_limit
> +
> +.Lmutual_align:
> + /*
> + * Sources are mutually aligned, but are not currently at an
> + * alignment boundary. Round down the addresses and then mask off
> + * the bytes that precede the start point.
> + */
> + bic src1, src1, #7
> + bic src2, src2, #7
> + ldr data1, [src1], #8
> + ldr data2, [src2], #8
> + /*
> + * We can not add limit with alignment offset(tmp1) here. Since the
> + * addition probably make the limit overflown.
> + */
> + sub limit_wd, limit, #1/*limit != 0, so no underflow.*/
> + and tmp3, limit_wd, #7
> + lsr limit_wd, limit_wd, #3
> + add tmp3, tmp3, tmp1
> + add limit_wd, limit_wd, tmp3, lsr #3
> + add limit, limit, tmp1/* Adjust the limit for the extra. */
> +
> + lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
> + neg tmp1, tmp1/* Bits to alignment -64. */
> + mov tmp2, #~0
> + /*mask off the non-intended bytes before the start address.*/
> +CPU_BE( lsl tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
> + /* Little-endian. Early bytes are at LSB. */
> +CPU_LE( lsr tmp2, tmp2, tmp1 )
> +
> + orr data1, data1, tmp2
> + orr data2, data2, tmp2
> + b .Lstart_realigned
> +
> + /*src1 and src2 have different alignment offset.*/
> +.Lmisaligned8:
> + cmp limit, #8
> + b.lo .Ltiny8proc /*limit < 8: compare byte by byte*/
> +
> + and tmp1, src1, #7
> + neg tmp1, tmp1
> + add tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
> + and tmp2, src2, #7
> + neg tmp2, tmp2
> + add tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
> + subs tmp3, tmp1, tmp2
> + csel pos, tmp1, tmp2, hi /*Choose the maximum.*/
> +
> + sub limit, limit, pos
> + /*compare the proceeding bytes in the first 8 byte segment.*/
> +.Ltinycmp:
> + ldrb data1w, [src1], #1
> + ldrb data2w, [src2], #1
> + subs pos, pos, #1
> + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
> + b.eq .Ltinycmp
> + cbnz pos, 1f /*diff occurred before the last byte.*/
> + cmp data1w, data2w
> + b.eq .Lstart_align
> +1:
> + sub result, data1, data2
> + ret
> +
> +.Lstart_align:
> + lsr limit_wd, limit, #3
> + cbz limit_wd, .Lremain8
> +
> + ands xzr, src1, #7
> + b.eq .Lrecal_offset
> + /*process more leading bytes to make src1 aligned...*/
> + add src1, src1, tmp3 /*backwards src1 to alignment boundary*/
> + add src2, src2, tmp3
> + sub limit, limit, tmp3
> + lsr limit_wd, limit, #3
> + cbz limit_wd, .Lremain8
> + /*load 8 bytes from aligned SRC1..*/
> + ldr data1, [src1], #8
> + ldr data2, [src2], #8
> +
> + subs limit_wd, limit_wd, #1
> + eor diff, data1, data2 /*Non-zero if differences found.*/
> + csinv endloop, diff, xzr, ne
> + cbnz endloop, .Lunequal_proc
> + /*How far is the current SRC2 from the alignment boundary...*/
> + and tmp3, tmp3, #7
> +
> +.Lrecal_offset:/*src1 is aligned now..*/
> + neg pos, tmp3
> +.Lloopcmp_proc:
> + /*
> + * Divide the eight bytes into two parts. First,backwards the src2
> + * to an alignment boundary,load eight bytes and compare from
> + * the SRC2 alignment boundary. If all 8 bytes are equal,then start
> + * the second part's comparison. Otherwise finish the comparison.
> + * This special handle can garantee all the accesses are in the
> + * thread/task space in avoid to overrange access.
> + */
> + ldr data1, [src1,pos]
> + ldr data2, [src2,pos]
> + eor diff, data1, data2 /* Non-zero if differences found. */
> + cbnz diff, .Lnot_limit
> +
> + /*The second part process*/
> + ldr data1, [src1], #8
> + ldr data2, [src2], #8
> + eor diff, data1, data2 /* Non-zero if differences found. */
> + subs limit_wd, limit_wd, #1
> + csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
> + cbz endloop, .Lloopcmp_proc
> +.Lunequal_proc:
> + cbz diff, .Lremain8
> +
> +/*There is differnence occured in the latest comparison.*/
> +.Lnot_limit:
> +/*
> +* For little endian,reverse the low significant equal bits into MSB,then
> +* following CLZ can find how many equal bits exist.
> +*/
> +CPU_LE( rev diff, diff )
> +CPU_LE( rev data1, data1 )
> +CPU_LE( rev data2, data2 )
> +
> + /*
> + * The MS-non-zero bit of DIFF marks either the first bit
> + * that is different, or the end of the significant data.
> + * Shifting left now will bring the critical information into the
> + * top bits.
> + */
> + clz pos, diff
> + lsl data1, data1, pos
> + lsl data2, data2, pos
> + /*
> + * We need to zero-extend (char is unsigned) the value and then
> + * perform a signed subtraction.
> + */
> + lsr data1, data1, #56
> + sub result, data1, data2, lsr #56
> + ret
> +
> +.Lremain8:
> + /* Limit % 8 == 0 =>. all data are equal.*/
> + ands limit, limit, #7
> + b.eq .Lret0
> +
> +.Ltiny8proc:
> + ldrb data1w, [src1], #1
> + ldrb data2w, [src2], #1
> + subs limit, limit, #1
> +
> + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
> + b.eq .Ltiny8proc
> + sub result, data1, data2
> + ret
> +.Lret0:
> + mov result, #0
> + ret
> +ENDPROC(memcmp)
> diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S
> index c8197c6..7cc885d 100644
> --- a/xen/arch/arm/arm64/lib/memcpy.S
> +++ b/xen/arch/arm/arm64/lib/memcpy.S
> @@ -1,5 +1,13 @@
> /*
> * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
> */
>
> #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>
> /*
> * Copy a buffer from src to dest (alignment handled by the hardware)
> @@ -26,27 +36,166 @@
> * Returns:
> * x0 - dest
> */
> +dstin .req x0
> +src .req x1
> +count .req x2
> +tmp1 .req x3
> +tmp1w .req w3
> +tmp2 .req x4
> +tmp2w .req w4
> +tmp3 .req x5
> +tmp3w .req w5
> +dst .req x6
> +
> +A_l .req x7
> +A_h .req x8
> +B_l .req x9
> +B_h .req x10
> +C_l .req x11
> +C_h .req x12
> +D_l .req x13
> +D_h .req x14
> +
> ENTRY(memcpy)
> - mov x4, x0
> - subs x2, x2, #8
> - b.mi 2f
> -1: ldr x3, [x1], #8
> - subs x2, x2, #8
> - str x3, [x4], #8
> - b.pl 1b
> -2: adds x2, x2, #4
> - b.mi 3f
> - ldr w3, [x1], #4
> - sub x2, x2, #4
> - str w3, [x4], #4
> -3: adds x2, x2, #2
> - b.mi 4f
> - ldrh w3, [x1], #2
> - sub x2, x2, #2
> - strh w3, [x4], #2
> -4: adds x2, x2, #1
> - b.mi 5f
> - ldrb w3, [x1]
> - strb w3, [x4]
> -5: ret
> + mov dst, dstin
> + cmp count, #16
> + /* When the length is less than 16, the accesses may be unaligned. */
> + b.lo .Ltiny15
> +
> + neg tmp2, src
> + ands tmp2, tmp2, #15/* Bytes to reach alignment. */
> + b.eq .LSrcAligned
> + sub count, count, tmp2
> + /*
> + * Copy the leading memory data from src to dst in increasing
> + * address order. This eliminates the risk of overwriting the
> + * source data when the distance between src and dst is less
> + * than 16. The memory accesses here are aligned.
> + */
> + tbz tmp2, #0, 1f
> + ldrb tmp1w, [src], #1
> + strb tmp1w, [dst], #1
> +1:
> + tbz tmp2, #1, 2f
> + ldrh tmp1w, [src], #2
> + strh tmp1w, [dst], #2
> +2:
> + tbz tmp2, #2, 3f
> + ldr tmp1w, [src], #4
> + str tmp1w, [dst], #4
> +3:
> + tbz tmp2, #3, .LSrcAligned
> + ldr tmp1, [src],#8
> + str tmp1, [dst],#8
> +
> +.LSrcAligned:
> + cmp count, #64
> + b.ge .Lcpy_over64
> + /*
> + * Deal with small copies quickly by dropping straight into the
> + * exit block.
> + */
> +.Ltail63:
> + /*
> + * Copy up to 48 bytes of data. At this point we only need the
> + * bottom 6 bits of count to be accurate.
> + */
> + ands tmp1, count, #0x30
> + b.eq .Ltiny15
> + cmp tmp1w, #0x20
> + b.eq 1f
> + b.lt 2f
> + ldp A_l, A_h, [src], #16
> + stp A_l, A_h, [dst], #16
> +1:
> + ldp A_l, A_h, [src], #16
> + stp A_l, A_h, [dst], #16
> +2:
> + ldp A_l, A_h, [src], #16
> + stp A_l, A_h, [dst], #16
> +.Ltiny15:
> + /*
> + * Prefer to break one ldp/stp into several loads/stores that access
> + * memory in increasing address order, rather than loading/storing 16
> + * bytes from (src-16) to (dst-16) with src wound back to an aligned
> + * address, as the original cortex memcpy does. If the original
> + * memcpy scheme were kept here, memmove would need to satisfy the
> + * precondition that the src address is at least 16 bytes above the
> + * dst address, otherwise some source data would be overwritten when
> + * memmove called memcpy directly. To keep memmove simple and to
> + * decouple memcpy from memmove, that scheme was dropped.
> + */
> + tbz count, #3, 1f
> + ldr tmp1, [src], #8
> + str tmp1, [dst], #8
> +1:
> + tbz count, #2, 2f
> + ldr tmp1w, [src], #4
> + str tmp1w, [dst], #4
> +2:
> + tbz count, #1, 3f
> + ldrh tmp1w, [src], #2
> + strh tmp1w, [dst], #2
> +3:
> + tbz count, #0, .Lexitfunc
> + ldrb tmp1w, [src]
> + strb tmp1w, [dst]
> +
> +.Lexitfunc:
> + ret
> +
> +.Lcpy_over64:
> + subs count, count, #128
> + b.ge .Lcpy_body_large
> + /*
> + * Less than 128 bytes to copy, so handle 64 here and then jump
> + * to the tail.
> + */
> + ldp A_l, A_h, [src],#16
> + stp A_l, A_h, [dst],#16
> + ldp B_l, B_h, [src],#16
> + ldp C_l, C_h, [src],#16
> + stp B_l, B_h, [dst],#16
> + stp C_l, C_h, [dst],#16
> + ldp D_l, D_h, [src],#16
> + stp D_l, D_h, [dst],#16
> +
> + tst count, #0x3f
> + b.ne .Ltail63
> + ret
> +
> + /*
> + * Critical loop. Start at a new cache line boundary. Assuming
> + * 64 bytes per line this ensures the entire loop is in one line.
> + */
> + .p2align L1_CACHE_SHIFT
> +.Lcpy_body_large:
> + /* pre-load 64 bytes of data. */
> + ldp A_l, A_h, [src],#16
> + ldp B_l, B_h, [src],#16
> + ldp C_l, C_h, [src],#16
> + ldp D_l, D_h, [src],#16
> +1:
> + /*
> + * Interleave the load of the next 64-byte block with the store of
> + * the previously loaded 64 bytes.
> + */
> + stp A_l, A_h, [dst],#16
> + ldp A_l, A_h, [src],#16
> + stp B_l, B_h, [dst],#16
> + ldp B_l, B_h, [src],#16
> + stp C_l, C_h, [dst],#16
> + ldp C_l, C_h, [src],#16
> + stp D_l, D_h, [dst],#16
> + ldp D_l, D_h, [src],#16
> + subs count, count, #64
> + b.ge 1b
> + stp A_l, A_h, [dst],#16
> + stp B_l, B_h, [dst],#16
> + stp C_l, C_h, [dst],#16
> + stp D_l, D_h, [dst],#16
> +
> + tst count, #0x3f
> + b.ne .Ltail63
> + ret
> ENDPROC(memcpy)
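The head-alignment step in the new memcpy copies 1, 2, 4 and 8 bytes according to the individual bits of the misalignment (one tbz per bit), in increasing address order. The same idea sketched in C (illustrative; the helper name and pointer-pointer interface are invented, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the head-alignment step: copy 1, 2, 4 and 8 bytes as
 * dictated by the bits of the misalignment, in increasing address
 * order, so that *src ends up 16-byte aligned. Returns the number of
 * head bytes copied. */
static size_t copy_head(unsigned char **dst, const unsigned char **src,
                        size_t count)
{
    size_t head = (-(uintptr_t)*src) & 15;  /* bytes to reach alignment */
    if (head > count)
        return 0;  /* tiny copy: handled separately (.Ltiny15) */
    /* Each tbz in the assembly corresponds to one of these bit tests. */
    if (head & 1) { **dst = **src; *dst += 1; *src += 1; }
    if (head & 2) { memcpy(*dst, *src, 2); *dst += 2; *src += 2; }
    if (head & 4) { memcpy(*dst, *src, 4); *dst += 4; *src += 4; }
    if (head & 8) { memcpy(*dst, *src, 8); *dst += 8; *src += 8; }
    return head;
}
```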
> diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S
> index 1bf0936..f4065b9 100644
> --- a/xen/arch/arm/arm64/lib/memmove.S
> +++ b/xen/arch/arm/arm64/lib/memmove.S
> @@ -1,5 +1,13 @@
> /*
> * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
> */
>
> #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>
> /*
> * Move a buffer from src to dest (alignment handled by the hardware).
> @@ -27,30 +37,161 @@
> * Returns:
> * x0 - dest
> */
> +dstin .req x0
> +src .req x1
> +count .req x2
> +tmp1 .req x3
> +tmp1w .req w3
> +tmp2 .req x4
> +tmp2w .req w4
> +tmp3 .req x5
> +tmp3w .req w5
> +dst .req x6
> +
> +A_l .req x7
> +A_h .req x8
> +B_l .req x9
> +B_h .req x10
> +C_l .req x11
> +C_h .req x12
> +D_l .req x13
> +D_h .req x14
> +
> ENTRY(memmove)
> - cmp x0, x1
> - b.ls memcpy
> - add x4, x0, x2
> - add x1, x1, x2
> - subs x2, x2, #8
> - b.mi 2f
> -1: ldr x3, [x1, #-8]!
> - subs x2, x2, #8
> - str x3, [x4, #-8]!
> - b.pl 1b
> -2: adds x2, x2, #4
> - b.mi 3f
> - ldr w3, [x1, #-4]!
> - sub x2, x2, #4
> - str w3, [x4, #-4]!
> -3: adds x2, x2, #2
> - b.mi 4f
> - ldrh w3, [x1, #-2]!
> - sub x2, x2, #2
> - strh w3, [x4, #-2]!
> -4: adds x2, x2, #1
> - b.mi 5f
> - ldrb w3, [x1, #-1]
> - strb w3, [x4, #-1]
> -5: ret
> + cmp dstin, src
> + b.lo memcpy
> + add tmp1, src, count
> + cmp dstin, tmp1
> + b.hs memcpy /* No overlap. */
> +
> + add dst, dstin, count
> + add src, src, count
> + cmp count, #16
> + b.lo .Ltail15 /* probably unaligned accesses */
> +
> + ands tmp2, src, #15 /* Bytes to reach alignment. */
> + b.eq .LSrcAligned
> + sub count, count, tmp2
> + /*
> + * Process the unaligned head first so that src becomes aligned.
> + * The cost of these extra instructions is acceptable, and it means
> + * the subsequent accesses use aligned addresses.
> + */
> + tbz tmp2, #0, 1f
> + ldrb tmp1w, [src, #-1]!
> + strb tmp1w, [dst, #-1]!
> +1:
> + tbz tmp2, #1, 2f
> + ldrh tmp1w, [src, #-2]!
> + strh tmp1w, [dst, #-2]!
> +2:
> + tbz tmp2, #2, 3f
> + ldr tmp1w, [src, #-4]!
> + str tmp1w, [dst, #-4]!
> +3:
> + tbz tmp2, #3, .LSrcAligned
> + ldr tmp1, [src, #-8]!
> + str tmp1, [dst, #-8]!
> +
> +.LSrcAligned:
> + cmp count, #64
> + b.ge .Lcpy_over64
> +
> + /*
> + * Deal with small copies quickly by dropping straight into the
> + * exit block.
> + */
> +.Ltail63:
> + /*
> + * Copy up to 48 bytes of data. At this point we only need the
> + * bottom 6 bits of count to be accurate.
> + */
> + ands tmp1, count, #0x30
> + b.eq .Ltail15
> + cmp tmp1w, #0x20
> + b.eq 1f
> + b.lt 2f
> + ldp A_l, A_h, [src, #-16]!
> + stp A_l, A_h, [dst, #-16]!
> +1:
> + ldp A_l, A_h, [src, #-16]!
> + stp A_l, A_h, [dst, #-16]!
> +2:
> + ldp A_l, A_h, [src, #-16]!
> + stp A_l, A_h, [dst, #-16]!
> +
> +.Ltail15:
> + tbz count, #3, 1f
> + ldr tmp1, [src, #-8]!
> + str tmp1, [dst, #-8]!
> +1:
> + tbz count, #2, 2f
> + ldr tmp1w, [src, #-4]!
> + str tmp1w, [dst, #-4]!
> +2:
> + tbz count, #1, 3f
> + ldrh tmp1w, [src, #-2]!
> + strh tmp1w, [dst, #-2]!
> +3:
> + tbz count, #0, .Lexitfunc
> + ldrb tmp1w, [src, #-1]
> + strb tmp1w, [dst, #-1]
> +
> +.Lexitfunc:
> + ret
> +
> +.Lcpy_over64:
> + subs count, count, #128
> + b.ge .Lcpy_body_large
> + /*
> + * Less than 128 bytes to copy, so handle 64 bytes here and then jump
> + * to the tail.
> + */
> + ldp A_l, A_h, [src, #-16]
> + stp A_l, A_h, [dst, #-16]
> + ldp B_l, B_h, [src, #-32]
> + ldp C_l, C_h, [src, #-48]
> + stp B_l, B_h, [dst, #-32]
> + stp C_l, C_h, [dst, #-48]
> + ldp D_l, D_h, [src, #-64]!
> + stp D_l, D_h, [dst, #-64]!
> +
> + tst count, #0x3f
> + b.ne .Ltail63
> + ret
> +
> + /*
> + * Critical loop. Start at a new cache line boundary. Assuming
> + * 64 bytes per line this ensures the entire loop is in one line.
> + */
> + .p2align L1_CACHE_SHIFT
> +.Lcpy_body_large:
> + /* pre-load 64 bytes of data. */
> + ldp A_l, A_h, [src, #-16]
> + ldp B_l, B_h, [src, #-32]
> + ldp C_l, C_h, [src, #-48]
> + ldp D_l, D_h, [src, #-64]!
> +1:
> + /*
> + * Interleave the load of the next 64-byte block with the store of
> + * the previously loaded 64 bytes.
> + */
> + stp A_l, A_h, [dst, #-16]
> + ldp A_l, A_h, [src, #-16]
> + stp B_l, B_h, [dst, #-32]
> + ldp B_l, B_h, [src, #-32]
> + stp C_l, C_h, [dst, #-48]
> + ldp C_l, C_h, [src, #-48]
> + stp D_l, D_h, [dst, #-64]!
> + ldp D_l, D_h, [src, #-64]!
> + subs count, count, #64
> + b.ge 1b
> + stp A_l, A_h, [dst, #-16]
> + stp B_l, B_h, [dst, #-32]
> + stp C_l, C_h, [dst, #-48]
> + stp D_l, D_h, [dst, #-64]!
> +
> + tst count, #0x3f
> + b.ne .Ltail63
> + ret
> ENDPROC(memmove)
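The new memmove only falls back to the backward copy when the destination actually starts inside the source region; both the "dst below src" (b.lo) and the fully disjoint (b.hs) cases branch straight to memcpy. The dispatch condition, sketched in C (illustrative only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the overlap test at the top of memmove: a forward copy
 * (memcpy) is safe unless the destination starts inside the source
 * region, in which case the copy must run backwards. */
static int must_copy_backwards(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    /* b.lo memcpy: dst below src -- forward copy is safe.
     * b.hs memcpy: dst at or above src+n -- no overlap at all. */
    return d >= s && d < s + n;
}
```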
> diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
> index 25a4fb6..4ee714d 100644
> --- a/xen/arch/arm/arm64/lib/memset.S
> +++ b/xen/arch/arm/arm64/lib/memset.S
> @@ -1,5 +1,13 @@
> /*
> * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
> */
>
> #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>
> /*
> * Fill in the buffer with character c (alignment handled by the hardware)
> @@ -26,27 +36,181 @@
> * Returns:
> * x0 - buf
> */
> +
> +dstin .req x0
> +val .req w1
> +count .req x2
> +tmp1 .req x3
> +tmp1w .req w3
> +tmp2 .req x4
> +tmp2w .req w4
> +zva_len_x .req x5
> +zva_len .req w5
> +zva_bits_x .req x6
> +
> +A_l .req x7
> +A_lw .req w7
> +dst .req x8
> +tmp3w .req w9
> +tmp3 .req x9
> +
> ENTRY(memset)
> - mov x4, x0
> - and w1, w1, #0xff
> - orr w1, w1, w1, lsl #8
> - orr w1, w1, w1, lsl #16
> - orr x1, x1, x1, lsl #32
> - subs x2, x2, #8
> - b.mi 2f
> -1: str x1, [x4], #8
> - subs x2, x2, #8
> - b.pl 1b
> -2: adds x2, x2, #4
> - b.mi 3f
> - sub x2, x2, #4
> - str w1, [x4], #4
> -3: adds x2, x2, #2
> - b.mi 4f
> - sub x2, x2, #2
> - strh w1, [x4], #2
> -4: adds x2, x2, #1
> - b.mi 5f
> - strb w1, [x4]
> -5: ret
> + mov dst, dstin /* Preserve return value. */
> + and A_lw, val, #255
> + orr A_lw, A_lw, A_lw, lsl #8
> + orr A_lw, A_lw, A_lw, lsl #16
> + orr A_l, A_l, A_l, lsl #32
> +
> + cmp count, #15
> + b.hi .Lover16_proc
> + /* All stores may be unaligned. */
> + tbz count, #3, 1f
> + str A_l, [dst], #8
> +1:
> + tbz count, #2, 2f
> + str A_lw, [dst], #4
> +2:
> + tbz count, #1, 3f
> + strh A_lw, [dst], #2
> +3:
> + tbz count, #0, 4f
> + strb A_lw, [dst]
> +4:
> + ret
> +
> +.Lover16_proc:
> + /* Check whether the start address is 16-byte aligned. */
> + neg tmp2, dst
> + ands tmp2, tmp2, #15
> + b.eq .Laligned
> +/*
> +* The count is at least 16, so we can use stp to store the first 16 bytes,
> +* then advance dst to a 16-byte boundary. After this the current memory
> +* address is aligned.
> +*/
> + stp A_l, A_l, [dst] /* unaligned store */
> + /* make dst aligned */
> + sub count, count, tmp2
> + add dst, dst, tmp2
> +
> +.Laligned:
> + cbz A_l, .Lzero_mem
> +
> +.Ltail_maybe_long:
> + cmp count, #64
> + b.ge .Lnot_short
> +.Ltail63:
> + ands tmp1, count, #0x30
> + b.eq 3f
> + cmp tmp1w, #0x20
> + b.eq 1f
> + b.lt 2f
> + stp A_l, A_l, [dst], #16
> +1:
> + stp A_l, A_l, [dst], #16
> +2:
> + stp A_l, A_l, [dst], #16
> +/*
> +* The remaining length is less than 16; use stp to write the last 16 bytes.
> +* This writes some bytes twice and the access is unaligned.
> +*/
> +3:
> + ands count, count, #15
> + cbz count, 4f
> + add dst, dst, count
> + stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */
> +4:
> + ret
> +
> + /*
> + * Critical loop. Start at a new cache line boundary. Assuming
> + * 64 bytes per line, this ensures the entire loop is in one line.
> + */
> + .p2align L1_CACHE_SHIFT
> +.Lnot_short:
> + sub dst, dst, #16/* Pre-bias. */
> + sub count, count, #64
> +1:
> + stp A_l, A_l, [dst, #16]
> + stp A_l, A_l, [dst, #32]
> + stp A_l, A_l, [dst, #48]
> + stp A_l, A_l, [dst, #64]!
> + subs count, count, #64
> + b.ge 1b
> + tst count, #0x3f
> + add dst, dst, #16
> + b.ne .Ltail63
> +.Lexitfunc:
> + ret
> +
> + /*
> + * For zeroing memory, check to see if we can use the ZVA feature to
> + * zero entire 'cache' lines.
> + */
> +.Lzero_mem:
> + cmp count, #63
> + b.le .Ltail63
> + /*
> + * For zeroing small amounts of memory, it's not worth setting up
> + * the line-clear code.
> + */
> + cmp count, #128
> + b.lt .Lnot_short /*count is at least 128 bytes*/
> +
> + mrs tmp1, dczid_el0
> + tbnz tmp1, #4, .Lnot_short
> + mov tmp3w, #4
> + and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
> + lsl zva_len, tmp3w, zva_len
> +
> + ands tmp3w, zva_len, #63
> + /*
> + * Ensure that zva_len is not less than 64:
> + * using ZVA is not worthwhile if the block size is less than 64.
> + */
> + b.ne .Lnot_short
> +.Lzero_by_line:
> + /*
> + * Compute how far we need to go to become suitably aligned. We're
> + * already at quad-word alignment.
> + */
> + cmp count, zva_len_x
> + b.lt .Lnot_short /* Not enough to reach alignment. */
> + sub zva_bits_x, zva_len_x, #1
> + neg tmp2, dst
> + ands tmp2, tmp2, zva_bits_x
> + b.eq 2f /* Already aligned. */
> + /* Not aligned, check that there's enough to copy after alignment.*/
> + sub tmp1, count, tmp2
> + /*
> + * Guarantee that the length remaining for ZVA is at least 64,
> + * so the loop at 2f cannot run past the end of the range.
> + */
> + cmp tmp1, #64
> + ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */
> + b.lt .Lnot_short
> + /*
> + * We know that there's at least 64 bytes to zero and that it's safe
> + * to overrun by 64 bytes.
> + */
> + mov count, tmp1
> +1:
> + stp A_l, A_l, [dst]
> + stp A_l, A_l, [dst, #16]
> + stp A_l, A_l, [dst, #32]
> + subs tmp2, tmp2, #64
> + stp A_l, A_l, [dst, #48]
> + add dst, dst, #64
> + b.ge 1b
> + /* We've overrun a bit, so adjust dst downwards.*/
> + add dst, dst, tmp2
> +2:
> + sub count, count, zva_len_x
> +3:
> + dc zva, dst
> + add dst, dst, zva_len_x
> + subs count, count, zva_len_x
> + b.ge 3b
> + ands count, count, zva_bits_x
> + b.ne .Ltail_maybe_long
> + ret
> ENDPROC(memset)
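The .Lzero_mem path reads DCZID_EL0 to decide whether "dc zva" can be used to zero whole blocks: bits [3:0] give log2 of the block size in 4-byte words, and bit 4 set means ZVA is prohibited. A sketch of that decision in C (the register value is passed in as a plain integer here purely for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of memset's ZVA setup: derive the "dc zva" block size from a
 * DCZID_EL0 value. Block size = 4 << DCZID_EL0[3:0] bytes; bit 4 set
 * prohibits ZVA; the code also refuses blocks that are not a multiple
 * of 64 bytes ('ands tmp3w, zva_len, #63; b.ne .Lnot_short'). */
static unsigned int zva_block_size(uint64_t dczid_el0, int *usable)
{
    unsigned int len = 4u << (unsigned int)(dczid_el0 & 15);
    /* 'tbnz tmp1, #4, .Lnot_short' skips ZVA when bit 4 is set. */
    *usable = !(dczid_el0 & (1u << 4)) && (len & 63) == 0;
    return len;
}
```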
> diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
> index 3f4e7a1..9a511f2 100644
> --- a/xen/include/asm-arm/arm32/cmpxchg.h
> +++ b/xen/include/asm-arm/arm32/cmpxchg.h
> @@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> return ret;
> }
>
> +#define xchg(ptr,x) \
> + ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +
> /*
> * Atomic compare and exchange. Compare OLD with MEM, if identical,
> * store NEW in MEM. Return the initial value in MEM. Success is
> diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
> index b5d50f2..b49219e 100644
> --- a/xen/include/asm-arm/arm64/atomic.h
> +++ b/xen/include/asm-arm/arm64/atomic.h
> @@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
>
> #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
>
> -#define smp_mb__before_atomic_dec() smp_mb()
> -#define smp_mb__after_atomic_dec() smp_mb()
> -#define smp_mb__before_atomic_inc() smp_mb()
> -#define smp_mb__after_atomic_inc() smp_mb()
> -
> #endif
> /*
> * Local variables:
> diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
> index 4e930ce..ae42b2f 100644
> --- a/xen/include/asm-arm/arm64/cmpxchg.h
> +++ b/xen/include/asm-arm/arm64/cmpxchg.h
> @@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> }
>
> #define xchg(ptr,x) \
> - ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +({ \
> + __typeof__(*(ptr)) __ret; \
> + __ret = (__typeof__(*(ptr))) \
> + __xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
> + __ret; \
> +})
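The statement-expression form above (imported via e1dfda9) exists so that callers may discard the result without tripping "value computed is not used" warnings, which the old bare-cast form produced. A minimal sketch of the pattern, using a compiler builtin as a stand-in for the real ldxr/stxr-based __xchg (the macro name here is invented):

```c
#include <assert.h>

/* Sketch of the warning-safe xchg macro shape: a bare cast such as
 * ((T)__xchg(...)) warns under -Wunused-value when the result is
 * ignored, while assigning through a temporary inside a GNU statement
 * expression does not. __sync_lock_test_and_set (write new value,
 * return old) stands in for the real exclusive-access implementation. */
#define xchg_sketch(ptr, x)                                \
({                                                         \
        __typeof__(*(ptr)) __ret;                          \
        __ret = (__typeof__(*(ptr)))                       \
                __sync_lock_test_and_set((ptr), (x));      \
        __ret;                                             \
})
```

Note __sync_lock_test_and_set only has acquire semantics, so this is a shape-of-the-macro sketch, not a drop-in replacement for the fully barriered __xchg.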
>
> extern void __bad_cmpxchg(volatile void *ptr, int size);
>
> @@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
> return ret;
> }
>
> -#define cmpxchg(ptr,o,n) \
> - ((__typeof__(*(ptr)))__cmpxchg_mb((ptr), \
> - (unsigned long)(o), \
> - (unsigned long)(n), \
> - sizeof(*(ptr))))
> -
> -#define cmpxchg_local(ptr,o,n) \
> - ((__typeof__(*(ptr)))__cmpxchg((ptr), \
> - (unsigned long)(o), \
> - (unsigned long)(n), \
> - sizeof(*(ptr))))
> +#define cmpxchg(ptr, o, n) \
> +({ \
> + __typeof__(*(ptr)) __ret; \
> + __ret = (__typeof__(*(ptr))) \
> + __cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
> + sizeof(*(ptr))); \
> + __ret; \
> +})
> +
> +#define cmpxchg_local(ptr, o, n) \
> +({ \
> + __typeof__(*(ptr)) __ret; \
> + __ret = (__typeof__(*(ptr))) \
> + __cmpxchg((ptr), (unsigned long)(o), \
> + (unsigned long)(n), sizeof(*(ptr))); \
> + __ret; \
> +})
>
> #endif
> /*
> diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
> index 3242762..dfad1fe 100644
> --- a/xen/include/asm-arm/string.h
> +++ b/xen/include/asm-arm/string.h
> @@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
> #define __HAVE_ARCH_MEMCPY
> extern void * memcpy(void *, const void *, __kernel_size_t);
>
> +#if defined(CONFIG_ARM_64)
> +#define __HAVE_ARCH_MEMCMP
> +extern int memcmp(const void *, const void *, __kernel_size_t);
> +#endif
> +
> /* Some versions of gcc don't have this builtin. It's non-critical anyway. */
> #define __HAVE_ARCH_MEMMOVE
> extern void *memmove(void *dest, const void *src, size_t n);
> diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
> index 7aaaf50..ce3d38a 100644
> --- a/xen/include/asm-arm/system.h
> +++ b/xen/include/asm-arm/system.h
> @@ -33,9 +33,6 @@
>
> #define smp_wmb() dmb(ishst)
>
> -#define xchg(ptr,x) \
> - ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> -
> /*
> * This is used to ensure the compiler did actually allocate the register we
> * asked it for some inline assembly sequences. Apparently we can't trust
>
--
Julien Grall
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
@ 2014-07-25 15:42 ` Julien Grall
2014-07-25 15:48 ` Ian Campbell
0 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 15:42 UTC (permalink / raw)
To: Ian Campbell, xen-devel; +Cc: tim, stefano.stabellini
Hi Ian,
On 07/25/2014 04:22 PM, Ian Campbell wrote:
> bitops, cmpxchg, atomics: Import:
> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
Compared to Linux, we don't have specific prefetch* helpers; we directly
use the compiler builtin ones. Shouldn't we import the ARM-specific
helpers to gain performance?
Regards,
> Author: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>
> atomics: In addition to the above import:
> db38ee8 ARM: 7983/1: atomics: implement a better __atomic_add_unless for v6+
> Author: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>
> spinlocks: We have diverged from Linux, so no updates but note this in the README.
>
> mem* and str*: Import:
> d98b90e ARM: 7990/1: asm: rename logical shift macros push pull into lspush lspull
> Author: Victor Kamensky <victor.kamensky@linaro.org>
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
> Acked-by: Nicolas Pitre <nico@linaro.org>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>
> For some reason str* were mentioned under mem* in the README, fix.
>
> libgcc: No changes, update baseline
>
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
> xen/arch/arm/README.LinuxPrimitives | 17 +++++++--------
> xen/arch/arm/arm32/lib/assembler.h | 8 +++----
> xen/arch/arm/arm32/lib/bitops.h | 5 +++++
> xen/arch/arm/arm32/lib/copy_template.S | 36 ++++++++++++++++----------------
> xen/arch/arm/arm32/lib/memmove.S | 36 ++++++++++++++++----------------
> xen/include/asm-arm/arm32/atomic.h | 32 ++++++++++++++++++++++++++++
> xen/include/asm-arm/arm32/cmpxchg.h | 5 +++++
> 7 files changed, 90 insertions(+), 49 deletions(-)
>
> diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
> index 69eeb70..7e15b04 100644
> --- a/xen/arch/arm/README.LinuxPrimitives
> +++ b/xen/arch/arm/README.LinuxPrimitives
> @@ -65,7 +65,7 @@ linux/arch/arm64/lib/copy_page.S unused in Xen
> arm32
> =====================================================================
>
> -bitops: last sync @ v3.14-rc7 (last commit: b7ec699)
> +bitops: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
>
> linux/arch/arm/lib/bitops.h xen/arch/arm/arm32/lib/bitops.h
> linux/arch/arm/lib/changebit.S xen/arch/arm/arm32/lib/changebit.S
> @@ -83,13 +83,13 @@ done
>
> ---------------------------------------------------------------------
>
> -cmpxchg: last sync @ v3.14-rc7 (last commit: 775ebcc)
> +cmpxchg: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
>
> linux/arch/arm/include/asm/cmpxchg.h xen/include/asm-arm/arm32/cmpxchg.h
>
> ---------------------------------------------------------------------
>
> -atomics: last sync @ v3.14-rc7 (last commit: aed3a4e)
> +atomics: last sync @ v3.16-rc6 (last commit: 030d0178bdbd)
>
> linux/arch/arm/include/asm/atomic.h xen/include/asm-arm/arm32/atomic.h
>
> @@ -99,6 +99,8 @@ spinlocks: last sync: 15e7e5c1ebf5
>
> linux/arch/arm/include/asm/spinlock.h xen/include/asm-arm/arm32/spinlock.h
>
> +*** Linux has switched to ticket locks but we still use bitlocks.
> +
> resync to v3.14-rc7:
>
> 7c8746a ARM: 7955/1: spinlock: ensure we have a compiler barrier before sev
> @@ -111,7 +113,7 @@ resync to v3.14-rc7:
>
> ---------------------------------------------------------------------
>
> -mem*: last sync @ v3.14-rc7 (last commit: 418df63a)
> +mem*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
>
> linux/arch/arm/lib/copy_template.S xen/arch/arm/arm32/lib/copy_template.S
> linux/arch/arm/lib/memchr.S xen/arch/arm/arm32/lib/memchr.S
> @@ -120,9 +122,6 @@ linux/arch/arm/lib/memmove.S xen/arch/arm/arm32/lib/memmove.S
> linux/arch/arm/lib/memset.S xen/arch/arm/arm32/lib/memset.S
> linux/arch/arm/lib/memzero.S xen/arch/arm/arm32/lib/memzero.S
>
> -linux/arch/arm/lib/strchr.S xen/arch/arm/arm32/lib/strchr.S
> -linux/arch/arm/lib/strrchr.S xen/arch/arm/arm32/lib/strrchr.S
> -
> for i in copy_template.S memchr.S memcpy.S memmove.S memset.S \
> memzero.S ; do
> diff -u linux/arch/arm/lib/$i xen/arch/arm/arm32/lib/$i
> @@ -130,7 +129,7 @@ done
>
> ---------------------------------------------------------------------
>
> -str*: last sync @ v3.13-rc7 (last commit: 93ed397)
> +str*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
>
> linux/arch/arm/lib/strchr.S xen/arch/arm/arm32/lib/strchr.S
> linux/arch/arm/lib/strrchr.S xen/arch/arm/arm32/lib/strrchr.S
> @@ -145,7 +144,7 @@ clear_page == memset
>
> ---------------------------------------------------------------------
>
> -libgcc: last sync @ v3.14-rc7 (last commit: 01885bc)
> +libgcc: last sync @ v3.16-rc6 (last commit: 01885bc)
>
> linux/arch/arm/lib/lib1funcs.S xen/arch/arm/arm32/lib/lib1funcs.S
> linux/arch/arm/lib/lshrdi3.S xen/arch/arm/arm32/lib/lshrdi3.S
> diff --git a/xen/arch/arm/arm32/lib/assembler.h b/xen/arch/arm/arm32/lib/assembler.h
> index f8d4b3a..6de2638 100644
> --- a/xen/arch/arm/arm32/lib/assembler.h
> +++ b/xen/arch/arm/arm32/lib/assembler.h
> @@ -36,8 +36,8 @@
> * Endian independent macros for shifting bytes within registers.
> */
> #ifndef __ARMEB__
> -#define pull lsr
> -#define push lsl
> +#define lspull lsr
> +#define lspush lsl
> #define get_byte_0 lsl #0
> #define get_byte_1 lsr #8
> #define get_byte_2 lsr #16
> @@ -47,8 +47,8 @@
> #define put_byte_2 lsl #16
> #define put_byte_3 lsl #24
> #else
> -#define pull lsl
> -#define push lsr
> +#define lspull lsl
> +#define lspush lsr
> #define get_byte_0 lsr #24
> #define get_byte_1 lsr #16
> #define get_byte_2 lsr #8
> diff --git a/xen/arch/arm/arm32/lib/bitops.h b/xen/arch/arm/arm32/lib/bitops.h
> index 25784c3..a167c2d 100644
> --- a/xen/arch/arm/arm32/lib/bitops.h
> +++ b/xen/arch/arm/arm32/lib/bitops.h
> @@ -37,6 +37,11 @@ UNWIND( .fnstart )
> add r1, r1, r0, lsl #2 @ Get word offset
> mov r3, r2, lsl r3 @ create mask
> smp_dmb
> +#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
> + .arch_extension mp
> + ALT_SMP(W(pldw) [r1])
> + ALT_UP(W(nop))
> +#endif
> 1: ldrex r2, [r1]
> ands r0, r2, r3 @ save old value of bit
> \instr r2, r2, r3 @ toggle bit
> diff --git a/xen/arch/arm/arm32/lib/copy_template.S b/xen/arch/arm/arm32/lib/copy_template.S
> index 805e3f8..3bc8eb8 100644
> --- a/xen/arch/arm/arm32/lib/copy_template.S
> +++ b/xen/arch/arm/arm32/lib/copy_template.S
> @@ -197,24 +197,24 @@
>
> 12: PLD( pld [r1, #124] )
> 13: ldr4w r1, r4, r5, r6, r7, abort=19f
> - mov r3, lr, pull #\pull
> + mov r3, lr, lspull #\pull
> subs r2, r2, #32
> ldr4w r1, r8, r9, ip, lr, abort=19f
> - orr r3, r3, r4, push #\push
> - mov r4, r4, pull #\pull
> - orr r4, r4, r5, push #\push
> - mov r5, r5, pull #\pull
> - orr r5, r5, r6, push #\push
> - mov r6, r6, pull #\pull
> - orr r6, r6, r7, push #\push
> - mov r7, r7, pull #\pull
> - orr r7, r7, r8, push #\push
> - mov r8, r8, pull #\pull
> - orr r8, r8, r9, push #\push
> - mov r9, r9, pull #\pull
> - orr r9, r9, ip, push #\push
> - mov ip, ip, pull #\pull
> - orr ip, ip, lr, push #\push
> + orr r3, r3, r4, lspush #\push
> + mov r4, r4, lspull #\pull
> + orr r4, r4, r5, lspush #\push
> + mov r5, r5, lspull #\pull
> + orr r5, r5, r6, lspush #\push
> + mov r6, r6, lspull #\pull
> + orr r6, r6, r7, lspush #\push
> + mov r7, r7, lspull #\pull
> + orr r7, r7, r8, lspush #\push
> + mov r8, r8, lspull #\pull
> + orr r8, r8, r9, lspush #\push
> + mov r9, r9, lspull #\pull
> + orr r9, r9, ip, lspush #\push
> + mov ip, ip, lspull #\pull
> + orr ip, ip, lr, lspush #\push
> str8w r0, r3, r4, r5, r6, r7, r8, r9, ip, , abort=19f
> bge 12b
> PLD( cmn r2, #96 )
> @@ -225,10 +225,10 @@
> 14: ands ip, r2, #28
> beq 16f
>
> -15: mov r3, lr, pull #\pull
> +15: mov r3, lr, lspull #\pull
> ldr1w r1, lr, abort=21f
> subs ip, ip, #4
> - orr r3, r3, lr, push #\push
> + orr r3, r3, lr, lspush #\push
> str1w r0, r3, abort=21f
> bgt 15b
> CALGN( cmp r2, #0 )
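The push/pull to lspush/lspull rename above (d98b90e) avoids clashing with the push mnemonic; the macros themselves implement the classic idiom for forming a misaligned word from two aligned loads with opposite shifts (lspull is a right shift on little-endian, a left shift on big-endian, which is why the endian-dependent macros exist). The little-endian case sketched in C (illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the lspull/lspush idiom from the arm32 copy template:
 * build the misaligned 32-bit word starting offset_bytes into lo_word
 * by combining two aligned loads with opposite shifts. offset_bytes
 * must be 1..3; the aligned case is handled by a separate path. */
static uint32_t combine_le(uint32_t lo_word, uint32_t hi_word,
                           unsigned int offset_bytes)
{
    unsigned int pull = 8 * offset_bytes;   /* lspull amount */
    unsigned int push = 32 - pull;          /* lspush amount */
    return (lo_word >> pull) | (hi_word << push);
}
```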
> diff --git a/xen/arch/arm/arm32/lib/memmove.S b/xen/arch/arm/arm32/lib/memmove.S
> index 4e142b8..18634c3 100644
> --- a/xen/arch/arm/arm32/lib/memmove.S
> +++ b/xen/arch/arm/arm32/lib/memmove.S
> @@ -148,24 +148,24 @@ ENTRY(memmove)
>
> 12: PLD( pld [r1, #-128] )
> 13: ldmdb r1!, {r7, r8, r9, ip}
> - mov lr, r3, push #\push
> + mov lr, r3, lspush #\push
> subs r2, r2, #32
> ldmdb r1!, {r3, r4, r5, r6}
> - orr lr, lr, ip, pull #\pull
> - mov ip, ip, push #\push
> - orr ip, ip, r9, pull #\pull
> - mov r9, r9, push #\push
> - orr r9, r9, r8, pull #\pull
> - mov r8, r8, push #\push
> - orr r8, r8, r7, pull #\pull
> - mov r7, r7, push #\push
> - orr r7, r7, r6, pull #\pull
> - mov r6, r6, push #\push
> - orr r6, r6, r5, pull #\pull
> - mov r5, r5, push #\push
> - orr r5, r5, r4, pull #\pull
> - mov r4, r4, push #\push
> - orr r4, r4, r3, pull #\pull
> + orr lr, lr, ip, lspull #\pull
> + mov ip, ip, lspush #\push
> + orr ip, ip, r9, lspull #\pull
> + mov r9, r9, lspush #\push
> + orr r9, r9, r8, lspull #\pull
> + mov r8, r8, lspush #\push
> + orr r8, r8, r7, lspull #\pull
> + mov r7, r7, lspush #\push
> + orr r7, r7, r6, lspull #\pull
> + mov r6, r6, lspush #\push
> + orr r6, r6, r5, lspull #\pull
> + mov r5, r5, lspush #\push
> + orr r5, r5, r4, lspull #\pull
> + mov r4, r4, lspush #\push
> + orr r4, r4, r3, lspull #\pull
> stmdb r0!, {r4 - r9, ip, lr}
> bge 12b
> PLD( cmn r2, #96 )
> @@ -176,10 +176,10 @@ ENTRY(memmove)
> 14: ands ip, r2, #28
> beq 16f
>
> -15: mov lr, r3, push #\push
> +15: mov lr, r3, lspush #\push
> ldr r3, [r1, #-4]!
> subs ip, ip, #4
> - orr lr, lr, r3, pull #\pull
> + orr lr, lr, r3, lspull #\pull
> str lr, [r0, #-4]!
> bgt 15b
> CALGN( cmp r2, #0 )
> diff --git a/xen/include/asm-arm/arm32/atomic.h b/xen/include/asm-arm/arm32/atomic.h
> index 3d601d1..7ec712f 100644
> --- a/xen/include/asm-arm/arm32/atomic.h
> +++ b/xen/include/asm-arm/arm32/atomic.h
> @@ -39,6 +39,7 @@ static inline int atomic_add_return(int i, atomic_t *v)
> int result;
>
> smp_mb();
> + prefetchw(&v->counter);
>
> __asm__ __volatile__("@ atomic_add_return\n"
> "1: ldrex %0, [%3]\n"
> @@ -78,6 +79,7 @@ static inline int atomic_sub_return(int i, atomic_t *v)
> int result;
>
> smp_mb();
> + prefetchw(&v->counter);
>
> __asm__ __volatile__("@ atomic_sub_return\n"
> "1: ldrex %0, [%3]\n"
> @@ -100,6 +102,7 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
> unsigned long res;
>
> smp_mb();
> + prefetchw(&ptr->counter);
>
> do {
> __asm__ __volatile__("@ atomic_cmpxchg\n"
> @@ -117,6 +120,35 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
> return oldval;
> }
>
> +static inline int __atomic_add_unless(atomic_t *v, int a, int u)
> +{
> + int oldval, newval;
> + unsigned long tmp;
> +
> + smp_mb();
> + prefetchw(&v->counter);
> +
> + __asm__ __volatile__ ("@ atomic_add_unless\n"
> +"1: ldrex %0, [%4]\n"
> +" teq %0, %5\n"
> +" beq 2f\n"
> +" add %1, %0, %6\n"
> +" strex %2, %1, [%4]\n"
> +" teq %2, #0\n"
> +" bne 1b\n"
> +"2:"
> + : "=&r" (oldval), "=&r" (newval), "=&r" (tmp), "+Qo" (v->counter)
> + : "r" (&v->counter), "r" (u), "r" (a)
> + : "cc");
> +
> + if (oldval != u)
> + smp_mb();
> +
> + return oldval;
> +}
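The imported __atomic_add_unless atomically adds `a` to `*v` unless `*v` already equals `u`, returning the old value; the ldrex/strex retry loop maps onto a C11-style compare-exchange loop. A behavioral sketch (illustrative, not the ldrex/strex code itself):

```c
#include <assert.h>

/* Sketch of __atomic_add_unless semantics: add 'a' to '*v' unless
 * '*v' equals 'u'; return the value observed before any change. The
 * CAS retries when another CPU modified *v, just as a failed strex
 * loops back to the ldrex. */
static int atomic_add_unless_sketch(int *v, int a, int u)
{
    int old = __atomic_load_n(v, __ATOMIC_RELAXED);
    do {
        if (old == u)
            break;          /* 'teq %0, %5; beq 2f' in the assembly */
    } while (!__atomic_compare_exchange_n(v, &old, old + a, 0,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_RELAXED));
    return old;
}
```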
> +
> +#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
> +
> #define atomic_inc(v) atomic_add(1, v)
> #define atomic_dec(v) atomic_sub(1, v)
>
> diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
> index 9a511f2..03e0bed 100644
> --- a/xen/include/asm-arm/arm32/cmpxchg.h
> +++ b/xen/include/asm-arm/arm32/cmpxchg.h
> @@ -1,6 +1,8 @@
> #ifndef __ASM_ARM32_CMPXCHG_H
> #define __ASM_ARM32_CMPXCHG_H
>
> +#include <xen/prefetch.h>
> +
> extern void __bad_xchg(volatile void *, int);
>
> static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
> @@ -9,6 +11,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> unsigned int tmp;
>
> smp_mb();
> + prefetchw((const void *)ptr);
>
> switch (size) {
> case 1:
> @@ -56,6 +59,8 @@ static always_inline unsigned long __cmpxchg(
> {
> unsigned long oldval, res;
>
> + prefetchw((const void *)ptr);
> +
> switch (size) {
> case 1:
> do {
>
--
Julien Grall
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
@ 2014-07-25 15:43 ` Ian Campbell
2 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:43 UTC (permalink / raw)
To: xen-devel; +Cc: julien.grall, tim, stefano.stabellini
On Fri, 2014-07-25 at 16:22 +0100, Ian Campbell wrote:
> str*: No changes. Record new baseline.
I missed that there were some new primitives (str[n]len and str[n]cmp).
Rather than respin this big patch, here is a follow-up:
8<-------------------
>From 66c115115122ca21035d55f486ea2eed1e284dd7 Mon Sep 17 00:00:00 2001
Message-Id: <66c115115122ca21035d55f486ea2eed1e284dd7.1406302952.git.ian.campbell@citrix.com>
From: Ian Campbell <ian.campbell@citrix.com>
Date: Fri, 25 Jul 2014 16:31:46 +0100
Subject: [PATCH] xen: arm: Add new str* primitives from Linux v3.16-rc6.
Imports:
0a42cb0 arm64: lib: Implement optimized string length routines
Author: zhichang.yuan <zhichang.yuan@linaro.org>
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
192c4d9 arm64: lib: Implement optimized string compare routines
Author: zhichang.yuan <zhichang.yuan@linaro.org>
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
xen/arch/arm/README.LinuxPrimitives | 10 +-
xen/arch/arm/arm64/lib/Makefile | 2 +-
xen/arch/arm/arm64/lib/strcmp.S | 235 ++++++++++++++++++++++++++
xen/arch/arm/arm64/lib/strlen.S | 128 ++++++++++++++
xen/arch/arm/arm64/lib/strncmp.S | 311 +++++++++++++++++++++++++++++++++++
xen/arch/arm/arm64/lib/strnlen.S | 172 +++++++++++++++++++
xen/include/asm-arm/string.h | 14 ++
7 files changed, 870 insertions(+), 2 deletions(-)
create mode 100644 xen/arch/arm/arm64/lib/strcmp.S
create mode 100644 xen/arch/arm/arm64/lib/strlen.S
create mode 100644 xen/arch/arm/arm64/lib/strncmp.S
create mode 100644 xen/arch/arm/arm64/lib/strnlen.S
diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
index 7e15b04..7f33fc7 100644
--- a/xen/arch/arm/README.LinuxPrimitives
+++ b/xen/arch/arm/README.LinuxPrimitives
@@ -49,11 +49,19 @@ done
---------------------------------------------------------------------
-str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
+str*: last sync @ v3.16-rc6 (last commit: 0a42cb0a6fa6)
linux/arch/arm64/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S
+linux/arch/arm64/lib/strcmp.S xen/arch/arm/arm64/lib/strcmp.S
+linux/arch/arm64/lib/strlen.S xen/arch/arm/arm64/lib/strlen.S
+linux/arch/arm64/lib/strncmp.S xen/arch/arm/arm64/lib/strncmp.S
+linux/arch/arm64/lib/strnlen.S xen/arch/arm/arm64/lib/strnlen.S
linux/arch/arm64/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S
+for i in strchr.S strcmp.S strlen.S strncmp.S strnlen.S strrchr.S ; do
+ diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
+done
+
---------------------------------------------------------------------
{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
index 2e7fb64..1b9c7a9 100644
--- a/xen/arch/arm/arm64/lib/Makefile
+++ b/xen/arch/arm/arm64/lib/Makefile
@@ -1,4 +1,4 @@
obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
obj-y += clear_page.o
obj-y += bitops.o find_next_bit.o
-obj-y += strchr.o strrchr.o
+obj-y += strchr.o strcmp.o strlen.o strncmp.o strnlen.o strrchr.o
diff --git a/xen/arch/arm/arm64/lib/strcmp.S b/xen/arch/arm/arm64/lib/strcmp.S
new file mode 100644
index 0000000..bdcf7b0
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strcmp.S
@@ -0,0 +1,235 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ * x0 - const string 1 pointer
+ * x1 - const string 2 pointer
+ * Returns:
+ * x0 - an integer less than, equal to, or greater than zero
+ * if s1 is found, respectively, to be less than, to match,
+ * or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result. */
+src1 .req x0
+src2 .req x1
+result .req x0
+
+/* Internal variables. */
+data1 .req x2
+data1w .req w2
+data2 .req x3
+data2w .req w3
+has_nul .req x4
+diff .req x5
+syndrome .req x6
+tmp1 .req x7
+tmp2 .req x8
+tmp3 .req x9
+zeroones .req x10
+pos .req x11
+
+ENTRY(strcmp)
+ eor tmp1, src1, src2
+ mov zeroones, #REP8_01
+ tst tmp1, #7
+ b.ne .Lmisaligned8
+ ands tmp1, src1, #7
+ b.ne .Lmutual_align
+
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+.Lloop_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+.Lstart_realigned:
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ orr syndrome, diff, has_nul
+ cbz syndrome, .Lloop_aligned
+ b .Lcal_cmpresult
+
+.Lmutual_align:
+ /*
+ * Sources are mutually aligned, but are not currently at an
+ * alignment boundary. Round down the addresses and then mask off
+	 * the bytes that precede the start point.
+ */
+ bic src1, src1, #7
+ bic src2, src2, #7
+ lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */
+ ldr data1, [src1], #8
+ neg tmp1, tmp1 /* Bits to alignment -64. */
+ ldr data2, [src2], #8
+ mov tmp2, #~0
+ /* Big-endian. Early bytes are at MSB. */
+CPU_BE( lsl tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */
+ /* Little-endian. Early bytes are at LSB. */
+CPU_LE( lsr tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */
+
+ orr data1, data1, tmp2
+ orr data2, data2, tmp2
+ b .Lstart_realigned
+
+.Lmisaligned8:
+ /*
+ * Get the align offset length to compare per byte first.
+ * After this process, one string's address will be aligned.
+ */
+ and tmp1, src1, #7
+ neg tmp1, tmp1
+ add tmp1, tmp1, #8
+ and tmp2, src2, #7
+ neg tmp2, tmp2
+ add tmp2, tmp2, #8
+ subs tmp3, tmp1, tmp2
+ csel pos, tmp1, tmp2, hi /*Choose the maximum. */
+.Ltinycmp:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs pos, pos, #1
+ ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Ltinycmp
+ cbnz pos, 1f /*find the null or unequal...*/
+ cmp data1w, #1
+ ccmp data1w, data2w, #0, cs
+ b.eq .Lstart_align /*the last bytes are equal....*/
+1:
+ sub result, data1, data2
+ ret
+
+.Lstart_align:
+ ands xzr, src1, #7
+ b.eq .Lrecal_offset
+ /*process more leading bytes to make str1 aligned...*/
+ add src1, src1, tmp3
+ add src2, src2, tmp3
+ /*load 8 bytes from aligned str1 and non-aligned str2..*/
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ orr syndrome, diff, has_nul
+ cbnz syndrome, .Lcal_cmpresult
+ /*How far is the current str2 from the alignment boundary...*/
+ and tmp3, tmp3, #7
+.Lrecal_offset:
+ neg pos, tmp3
+.Lloopcmp_proc:
+	/*
+	 * Divide the eight bytes into two parts. First, move src2 back to
+	 * an alignment boundary, load eight bytes from that SRC2 boundary,
+	 * then compare them with the corresponding bytes from SRC1. If all
+	 * 8 bytes are equal, start the second part's comparison; otherwise
+	 * finish the comparison. This special handling guarantees that all
+	 * accesses stay within the thread/task address space, avoiding any
+	 * out-of-range access.
+	 */
+ ldr data1, [src1,pos]
+ ldr data2, [src2,pos]
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ orr syndrome, diff, has_nul
+ cbnz syndrome, .Lcal_cmpresult
+
+ /*The second part process*/
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ orr syndrome, diff, has_nul
+ cbz syndrome, .Lloopcmp_proc
+
+.Lcal_cmpresult:
+	/*
+	 * Reverse the byte order (as for big-endian), so that CLZ can find
+	 * the most significant zero bits.
+	 */
+CPU_LE( rev syndrome, syndrome )
+CPU_LE( rev data1, data1 )
+CPU_LE( rev data2, data2 )
+
+ /*
+ * For big-endian we cannot use the trick with the syndrome value
+ * as carry-propagation can corrupt the upper bits if the trailing
+ * bytes in the string contain 0x01.
+ * However, if there is no NUL byte in the dword, we can generate
+	 * the result directly. We cannot just subtract the bytes as the
+ * MSB might be significant.
+ */
+CPU_BE( cbnz has_nul, 1f )
+CPU_BE( cmp data1, data2 )
+CPU_BE( cset result, ne )
+CPU_BE( cneg result, result, lo )
+CPU_BE( ret )
+CPU_BE( 1: )
+ /*Re-compute the NUL-byte detection, using a byte-reversed value. */
+CPU_BE( rev tmp3, data1 )
+CPU_BE( sub tmp1, tmp3, zeroones )
+CPU_BE( orr tmp2, tmp3, #REP8_7f )
+CPU_BE( bic has_nul, tmp1, tmp2 )
+CPU_BE( rev has_nul, has_nul )
+CPU_BE( orr syndrome, diff, has_nul )
+
+ clz pos, syndrome
+ /*
+ * The MS-non-zero bit of the syndrome marks either the first bit
+ * that is different, or the top bit of the first zero byte.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /*
+ * But we need to zero-extend (char is unsigned) the value and then
+ * perform a signed 32-bit subtraction.
+ */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+ENDPROC(strcmp)
diff --git a/xen/arch/arm/arm64/lib/strlen.S b/xen/arch/arm/arm64/lib/strlen.S
new file mode 100644
index 0000000..ee055a2
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strlen.S
@@ -0,0 +1,128 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+
+/*
+ * calculate the length of a string
+ *
+ * Parameters:
+ * x0 - const string pointer
+ * Returns:
+ * x0 - the return length of specific string
+ */
+
+/* Arguments and results. */
+srcin .req x0
+len .req x0
+
+/* Locals and temporaries. */
+src .req x1
+data1 .req x2
+data2 .req x3
+data2a .req x4
+has_nul1 .req x5
+has_nul2 .req x6
+tmp1 .req x7
+tmp2 .req x8
+tmp3 .req x9
+tmp4 .req x10
+zeroones .req x11
+pos .req x12
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strlen)
+ mov zeroones, #REP8_01
+ bic src, srcin, #15
+ ands tmp1, srcin, #15
+ b.ne .Lmisaligned
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+ /*
+ * The inner loop deals with two Dwords at a time. This has a
+ * slightly higher start-up cost, but we should win quite quickly,
+ * especially on cores with a high number of issue slots per
+ * cycle, as we get much better parallelism out of the operations.
+ */
+.Lloop:
+ ldp data1, data2, [src], #16
+.Lrealigned:
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ sub tmp3, data2, zeroones
+ orr tmp4, data2, #REP8_7f
+ bic has_nul1, tmp1, tmp2
+ bics has_nul2, tmp3, tmp4
+ ccmp has_nul1, #0, #0, eq /* NZCV = 0000 */
+ b.eq .Lloop
+
+ sub len, src, srcin
+ cbz has_nul1, .Lnul_in_data2
+CPU_BE( mov data2, data1 ) /*prepare data to re-calculate the syndrome*/
+ sub len, len, #8
+ mov has_nul2, has_nul1
+.Lnul_in_data2:
+ /*
+ * For big-endian, carry propagation (if the final byte in the
+ * string is 0x01) means we cannot use has_nul directly. The
+ * easiest way to get the correct byte is to byte-swap the data
+ * and calculate the syndrome a second time.
+ */
+CPU_BE( rev data2, data2 )
+CPU_BE( sub tmp1, data2, zeroones )
+CPU_BE( orr tmp2, data2, #REP8_7f )
+CPU_BE( bic has_nul2, tmp1, tmp2 )
+
+ sub len, len, #8
+ rev has_nul2, has_nul2
+ clz pos, has_nul2
+ add len, len, pos, lsr #3 /* Bits to bytes. */
+ ret
+
+.Lmisaligned:
+ cmp tmp1, #8
+ neg tmp1, tmp1
+ ldp data1, data2, [src], #16
+ lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */
+ mov tmp2, #~0
+ /* Big-endian. Early bytes are at MSB. */
+CPU_BE( lsl tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */
+ /* Little-endian. Early bytes are at LSB. */
+CPU_LE( lsr tmp2, tmp2, tmp1 ) /* Shift (tmp1 & 63). */
+
+ orr data1, data1, tmp2
+ orr data2a, data2, tmp2
+ csinv data1, data1, xzr, le
+ csel data2, data2, data2a, le
+ b .Lrealigned
+ENDPROC(strlen)
diff --git a/xen/arch/arm/arm64/lib/strncmp.S b/xen/arch/arm/arm64/lib/strncmp.S
new file mode 100644
index 0000000..ca2e4a6
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strncmp.S
@@ -0,0 +1,311 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ * x0 - const string 1 pointer
+ * x1 - const string 2 pointer
+ * x2 - the maximal length to be compared
+ * Returns:
+ * x0 - an integer less than, equal to, or greater than zero if s1 is found,
+ * respectively, to be less than, to match, or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result. */
+src1 .req x0
+src2 .req x1
+limit .req x2
+result .req x0
+
+/* Internal variables. */
+data1 .req x3
+data1w .req w3
+data2 .req x4
+data2w .req w4
+has_nul .req x5
+diff .req x6
+syndrome .req x7
+tmp1 .req x8
+tmp2 .req x9
+tmp3 .req x10
+zeroones .req x11
+pos .req x12
+limit_wd .req x13
+mask .req x14
+endloop .req x15
+
+ENTRY(strncmp)
+ cbz limit, .Lret0
+ eor tmp1, src1, src2
+ mov zeroones, #REP8_01
+ tst tmp1, #7
+ b.ne .Lmisaligned8
+ ands tmp1, src1, #7
+ b.ne .Lmutual_align
+ /* Calculate the number of full and partial words -1. */
+ /*
+	 * When limit is a multiple of 8, if we do not subtract 1,
+	 * the check for the last dword will be wrong.
+ */
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
+
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+.Lloop_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+.Lstart_realigned:
+ subs limit_wd, limit_wd, #1
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, pl /* Last Dword or differences.*/
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ ccmp endloop, #0, #0, eq
+ b.eq .Lloop_aligned
+
+ /*Not reached the limit, must have found the end or a diff. */
+ tbz limit_wd, #63, .Lnot_limit
+
+ /* Limit % 8 == 0 => all bytes significant. */
+ ands limit, limit, #7
+ b.eq .Lnot_limit
+
+ lsl limit, limit, #3 /* Bits -> bytes. */
+ mov mask, #~0
+CPU_BE( lsr mask, mask, limit )
+CPU_LE( lsl mask, mask, limit )
+ bic data1, data1, mask
+ bic data2, data2, mask
+
+ /* Make sure that the NUL byte is marked in the syndrome. */
+ orr has_nul, has_nul, mask
+
+.Lnot_limit:
+ orr syndrome, diff, has_nul
+ b .Lcal_cmpresult
+
+.Lmutual_align:
+ /*
+ * Sources are mutually aligned, but are not currently at an
+ * alignment boundary. Round down the addresses and then mask off
+ * the bytes that precede the start point.
+ * We also need to adjust the limit calculations, but without
+ * overflowing if the limit is near ULONG_MAX.
+ */
+ bic src1, src1, #7
+ bic src2, src2, #7
+ ldr data1, [src1], #8
+ neg tmp3, tmp1, lsl #3 /* 64 - bits(bytes beyond align). */
+ ldr data2, [src2], #8
+ mov tmp2, #~0
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ /* Big-endian. Early bytes are at MSB. */
+CPU_BE( lsl tmp2, tmp2, tmp3 ) /* Shift (tmp1 & 63). */
+ /* Little-endian. Early bytes are at LSB. */
+CPU_LE( lsr tmp2, tmp2, tmp3 ) /* Shift (tmp1 & 63). */
+
+ and tmp3, limit_wd, #7
+ lsr limit_wd, limit_wd, #3
+ /* Adjust the limit. Only low 3 bits used, so overflow irrelevant.*/
+ add limit, limit, tmp1
+ add tmp3, tmp3, tmp1
+ orr data1, data1, tmp2
+ orr data2, data2, tmp2
+ add limit_wd, limit_wd, tmp3, lsr #3
+ b .Lstart_realigned
+
+/*when src1 offset is not equal to src2 offset...*/
+.Lmisaligned8:
+ cmp limit, #8
+ b.lo .Ltiny8proc /*limit < 8... */
+ /*
+ * Get the align offset length to compare per byte first.
+ * After this process, one string's address will be aligned.*/
+ and tmp1, src1, #7
+ neg tmp1, tmp1
+ add tmp1, tmp1, #8
+ and tmp2, src2, #7
+ neg tmp2, tmp2
+ add tmp2, tmp2, #8
+ subs tmp3, tmp1, tmp2
+ csel pos, tmp1, tmp2, hi /*Choose the maximum. */
+ /*
+ * Here, limit is not less than 8, so directly run .Ltinycmp
+ * without checking the limit.*/
+ sub limit, limit, pos
+.Ltinycmp:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs pos, pos, #1
+ ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Ltinycmp
+ cbnz pos, 1f /*find the null or unequal...*/
+ cmp data1w, #1
+ ccmp data1w, data2w, #0, cs
+ b.eq .Lstart_align /*the last bytes are equal....*/
+1:
+ sub result, data1, data2
+ ret
+
+.Lstart_align:
+ lsr limit_wd, limit, #3
+ cbz limit_wd, .Lremain8
+ /*process more leading bytes to make str1 aligned...*/
+ ands xzr, src1, #7
+ b.eq .Lrecal_offset
+ add src1, src1, tmp3 /*tmp3 is positive in this branch.*/
+ add src2, src2, tmp3
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+
+ sub limit, limit, tmp3
+ lsr limit_wd, limit, #3
+ subs limit_wd, limit_wd, #1
+
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+ bics has_nul, tmp1, tmp2
+ ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+ b.ne .Lunequal_proc
+ /*How far is the current str2 from the alignment boundary...*/
+ and tmp3, tmp3, #7
+.Lrecal_offset:
+ neg pos, tmp3
+.Lloopcmp_proc:
+	/*
+	 * Divide the eight bytes into two parts. First, move src2 back to
+	 * an alignment boundary, load eight bytes from that SRC2 boundary,
+	 * then compare them with the corresponding bytes from SRC1. If all
+	 * 8 bytes are equal, start the second part's comparison; otherwise
+	 * finish the comparison. This special handling guarantees that all
+	 * accesses stay within the thread/task address space, avoiding any
+	 * out-of-range access.
+	 */
+ ldr data1, [src1,pos]
+ ldr data2, [src2,pos]
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, eq
+ cbnz endloop, .Lunequal_proc
+
+ /*The second part process*/
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ subs limit_wd, limit_wd, #1
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+ bics has_nul, tmp1, tmp2
+ ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+ b.eq .Lloopcmp_proc
+
+.Lunequal_proc:
+ orr syndrome, diff, has_nul
+ cbz syndrome, .Lremain8
+.Lcal_cmpresult:
+	/*
+	 * Reverse the byte order (as for big-endian), so that CLZ can find
+	 * the most significant zero bits.
+	 */
+CPU_LE( rev syndrome, syndrome )
+CPU_LE( rev data1, data1 )
+CPU_LE( rev data2, data2 )
+ /*
+ * For big-endian we cannot use the trick with the syndrome value
+ * as carry-propagation can corrupt the upper bits if the trailing
+ * bytes in the string contain 0x01.
+ * However, if there is no NUL byte in the dword, we can generate
+ * the result directly. We can't just subtract the bytes as the
+ * MSB might be significant.
+ */
+CPU_BE( cbnz has_nul, 1f )
+CPU_BE( cmp data1, data2 )
+CPU_BE( cset result, ne )
+CPU_BE( cneg result, result, lo )
+CPU_BE( ret )
+CPU_BE( 1: )
+ /* Re-compute the NUL-byte detection, using a byte-reversed value.*/
+CPU_BE( rev tmp3, data1 )
+CPU_BE( sub tmp1, tmp3, zeroones )
+CPU_BE( orr tmp2, tmp3, #REP8_7f )
+CPU_BE( bic has_nul, tmp1, tmp2 )
+CPU_BE( rev has_nul, has_nul )
+CPU_BE( orr syndrome, diff, has_nul )
+ /*
+ * The MS-non-zero bit of the syndrome marks either the first bit
+ * that is different, or the top bit of the first zero byte.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ clz pos, syndrome
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /*
+ * But we need to zero-extend (char is unsigned) the value and then
+ * perform a signed 32-bit subtraction.
+ */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+
+.Lremain8:
+ /* Limit % 8 == 0 => all bytes significant. */
+ ands limit, limit, #7
+ b.eq .Lret0
+.Ltiny8proc:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs limit, limit, #1
+
+ ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Ltiny8proc
+ sub result, data1, data2
+ ret
+
+.Lret0:
+ mov result, #0
+ ret
+ENDPROC(strncmp)
diff --git a/xen/arch/arm/arm64/lib/strnlen.S b/xen/arch/arm/arm64/lib/strnlen.S
new file mode 100644
index 0000000..8aa5bbf
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strnlen.S
@@ -0,0 +1,172 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+/*
+ * determine the length of a fixed-size string
+ *
+ * Parameters:
+ * x0 - const string pointer
+ * x1 - maximal string length
+ * Returns:
+ * x0 - the return length of specific string
+ */
+
+/* Arguments and results. */
+srcin .req x0
+len .req x0
+limit .req x1
+
+/* Locals and temporaries. */
+src .req x2
+data1 .req x3
+data2 .req x4
+data2a .req x5
+has_nul1 .req x6
+has_nul2 .req x7
+tmp1 .req x8
+tmp2 .req x9
+tmp3 .req x10
+tmp4 .req x11
+zeroones .req x12
+pos .req x13
+limit_wd .req x14
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strnlen)
+ cbz limit, .Lhit_limit
+ mov zeroones, #REP8_01
+ bic src, srcin, #15
+ ands tmp1, srcin, #15
+ b.ne .Lmisaligned
+ /* Calculate the number of full and partial words -1. */
+ sub limit_wd, limit, #1 /* Limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #4 /* Convert to Qwords. */
+
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+ /*
+ * The inner loop deals with two Dwords at a time. This has a
+ * slightly higher start-up cost, but we should win quite quickly,
+ * especially on cores with a high number of issue slots per
+ * cycle, as we get much better parallelism out of the operations.
+ */
+.Lloop:
+ ldp data1, data2, [src], #16
+.Lrealigned:
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ sub tmp3, data2, zeroones
+ orr tmp4, data2, #REP8_7f
+ bic has_nul1, tmp1, tmp2
+ bic has_nul2, tmp3, tmp4
+ subs limit_wd, limit_wd, #1
+ orr tmp1, has_nul1, has_nul2
+ ccmp tmp1, #0, #0, pl /* NZCV = 0000 */
+ b.eq .Lloop
+
+ cbz tmp1, .Lhit_limit /* No null in final Qword. */
+
+ /*
+ * We know there's a null in the final Qword. The easiest thing
+ * to do now is work out the length of the string and return
+ * MIN (len, limit).
+ */
+ sub len, src, srcin
+ cbz has_nul1, .Lnul_in_data2
+CPU_BE( mov	data2, data1 )	/*prepare data to re-calculate the syndrome*/
+
+ sub len, len, #8
+ mov has_nul2, has_nul1
+.Lnul_in_data2:
+ /*
+ * For big-endian, carry propagation (if the final byte in the
+ * string is 0x01) means we cannot use has_nul directly. The
+ * easiest way to get the correct byte is to byte-swap the data
+ * and calculate the syndrome a second time.
+ */
+CPU_BE( rev data2, data2 )
+CPU_BE( sub tmp1, data2, zeroones )
+CPU_BE( orr tmp2, data2, #REP8_7f )
+CPU_BE( bic has_nul2, tmp1, tmp2 )
+
+ sub len, len, #8
+ rev has_nul2, has_nul2
+ clz pos, has_nul2
+ add len, len, pos, lsr #3 /* Bits to bytes. */
+ cmp len, limit
+ csel len, len, limit, ls /* Return the lower value. */
+ ret
+
+.Lmisaligned:
+ /*
+ * Deal with a partial first word.
+ * We're doing two things in parallel here;
+ * 1) Calculate the number of words (but avoiding overflow if
+ * limit is near ULONG_MAX) - to do this we need to work out
+ * limit + tmp1 - 1 as a 65-bit value before shifting it;
+ * 2) Load and mask the initial data words - we force the bytes
+ * before the ones we are interested in to 0xff - this ensures
+ * early bytes will not hit any zero detection.
+ */
+ ldp data1, data2, [src], #16
+
+ sub limit_wd, limit, #1
+ and tmp3, limit_wd, #15
+ lsr limit_wd, limit_wd, #4
+
+ add tmp3, tmp3, tmp1
+ add limit_wd, limit_wd, tmp3, lsr #4
+
+ neg tmp4, tmp1
+ lsl tmp4, tmp4, #3 /* Bytes beyond alignment -> bits. */
+
+ mov tmp2, #~0
+ /* Big-endian. Early bytes are at MSB. */
+CPU_BE( lsl tmp2, tmp2, tmp4 ) /* Shift (tmp1 & 63). */
+ /* Little-endian. Early bytes are at LSB. */
+CPU_LE( lsr tmp2, tmp2, tmp4 ) /* Shift (tmp1 & 63). */
+
+ cmp tmp1, #8
+
+ orr data1, data1, tmp2
+ orr data2a, data2, tmp2
+
+ csinv data1, data1, xzr, le
+ csel data2, data2, data2a, le
+ b .Lrealigned
+
+.Lhit_limit:
+ mov len, limit
+ ret
+ENDPROC(strnlen)
diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
index dfad1fe..e4b4469 100644
--- a/xen/include/asm-arm/string.h
+++ b/xen/include/asm-arm/string.h
@@ -14,6 +14,20 @@ extern char * strrchr(const char * s, int c);
#define __HAVE_ARCH_STRCHR
extern char * strchr(const char * s, int c);
+#if defined(CONFIG_ARM_64)
+#define __HAVE_ARCH_STRCMP
+extern int strcmp(const char *, const char *);
+
+#define __HAVE_ARCH_STRNCMP
+extern int strncmp(const char *, const char *, __kernel_size_t);
+
+#define __HAVE_ARCH_STRLEN
+extern __kernel_size_t strlen(const char *);
+
+#define __HAVE_ARCH_STRNLEN
+extern __kernel_size_t strnlen(const char *, __kernel_size_t);
+#endif
+
#define __HAVE_ARCH_MEMCPY
extern void * memcpy(void *, const void *, __kernel_size_t);
--
1.7.10.4
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 15:42 ` Julien Grall
@ 2014-07-25 15:48 ` Ian Campbell
2014-07-25 15:48 ` Julien Grall
0 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:48 UTC (permalink / raw)
To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel
On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> Hi Ian,
>
> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > bitops, cmpxchg, atomics: Import:
> > c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>
> Compared to Linux, we don't have specific prefetch* helpers; we
> directly use the compiler builtins. Shouldn't we import the
> ARM-specific helpers to gain performance?
My binaries are full of pld instructions where I think I would expect
them, so it seems like the compiler builtin ones are sufficient.
I suspect the Linux define is there to cope with older compilers or
something.
Ian.
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 15:48 ` Ian Campbell
@ 2014-07-25 15:48 ` Julien Grall
2014-07-25 16:03 ` Ian Campbell
0 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 15:48 UTC (permalink / raw)
To: Ian Campbell; +Cc: stefano.stabellini, tim, xen-devel
On 07/25/2014 04:48 PM, Ian Campbell wrote:
> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
>> Hi Ian,
>>
>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
>>> bitops, cmpxchg, atomics: Import:
>>> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>>
>> Compare to Linux we don't have specific prefetch* helpers. We directly
>> use the compiler builtin ones. Shouldn't we import the ARM specific
>> helpers to gain in performance?
>
> My binaries are full of pld instructions where I think I would expect
> them, so it seems like the compiler builtin ones are sufficient.
>
> I suspect the Linux define is there to cope with older compilers or
> something.
If so:
Acked-by: Julien Grall <julien.grall@linaro.org>
Regards,
--
Julien Grall
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 15:48 ` Julien Grall
@ 2014-07-25 16:03 ` Ian Campbell
2014-07-25 16:13 ` Ian Campbell
2014-07-25 16:17 ` Julien Grall
0 siblings, 2 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 16:03 UTC (permalink / raw)
To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel
On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
> On 07/25/2014 04:48 PM, Ian Campbell wrote:
> > On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> >> Hi Ian,
> >>
> >> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> >>> bitops, cmpxchg, atomics: Import:
> >>> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> >>
> >> Compare to Linux we don't have specific prefetch* helpers. We directly
> >> use the compiler builtin ones. Shouldn't we import the ARM specific
> >> helpers to gain in performance?
> >
> > My binaries are full of pld instructions where I think I would expect
> > them, so it seems like the compiler builtin ones are sufficient.
> >
> > I suspect the Linux define is there to cope with older compilers or
> > something.
>
> If so:
The compiled output is very different if I use the arch specific
explicit variants. The explicit variant generates (lots) more pldw and
(somewhat) fewer pld. I've no idea what this means...
Note that the builtins presumably let gcc reason about whether preloads
are needed, whereas the explicit variants do not. I'm not sure how that
results in fewer pld with the explicit variant though! (unless it's
doing some sort of peephole optimisation and throwing them away?)
I've no idea what the right answer is.
How about we take the updates for now and revisit the question of
builtin vs explicit prefetches some other time?
> Acked-by: Julien Grall <julien.grall@linaro.org>
>
> Regards,
>
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 16:03 ` Ian Campbell
@ 2014-07-25 16:13 ` Ian Campbell
2014-07-25 16:20 ` Julien Grall
2014-07-25 16:17 ` Julien Grall
1 sibling, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 16:13 UTC (permalink / raw)
To: Julien Grall; +Cc: xen-devel, tim, stefano.stabellini
On Fri, 2014-07-25 at 17:03 +0100, Ian Campbell wrote:
> On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
> > On 07/25/2014 04:48 PM, Ian Campbell wrote:
> > > On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> > >> Hi Ian,
> > >>
> > >> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > >>> bitops, cmpxchg, atomics: Import:
> > >>> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> > >>
> > >> Compare to Linux we don't have specific prefetch* helpers. We directly
> > >> use the compiler builtin ones. Shouldn't we import the ARM specific
> > >> helpers to gain in performance?
> > >
> > > My binaries are full of pld instructions where I think I would expect
> > > them, so it seems like the compiler builtin ones are sufficient.
> > >
> > > I suspect the Linux define is there to cope with older compilers or
> > > something.
> >
> > If so:
>
> The compiled output is very different if I use the arch specific
> explicit variants. The explicit variant generates (lots) more pldw and
> (somewhat) fewer pld. I've no idea what this means...
It's a bit more obvious for aarch64 where gcc 4.8 doesn't generate any
prefetches at all via the builtins...
Here's what I've got in my tree. I've no idea if we should take some or
all of it...
Ian.
8<-----------------
From feb516fee01a0af60f54337b323975154eb466d8 Mon Sep 17 00:00:00 2001
Message-Id: <feb516fee01a0af60f54337b323975154eb466d8.1406304807.git.ian.campbell@citrix.com>
From: Ian Campbell <ian.campbell@citrix.com>
Date: Fri, 25 Jul 2014 17:08:42 +0100
Subject: [PATCH] xen: arm: Use explicit prefetch instructions.
On ARM32 these certainly generate *different* sets of prefetches.
I've no clue if that is a good thing...
On ARM64 the builtin variants seem to be non-functional (at least
with gcc 4.8).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
xen/include/asm-arm/arm32/processor.h | 17 +++++++++++++++++
xen/include/asm-arm/arm64/processor.h | 22 ++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/xen/include/asm-arm/arm32/processor.h b/xen/include/asm-arm/arm32/processor.h
index f41644d..6feacc9 100644
--- a/xen/include/asm-arm/arm32/processor.h
+++ b/xen/include/asm-arm/arm32/processor.h
@@ -119,6 +119,23 @@ struct cpu_user_regs
#define cpu_has_erratum_766422() \
(unlikely(current_cpu_data.midr.bits == 0x410fc0f4))
+#define ARCH_HAS_PREFETCH
+static inline void prefetch(const void *ptr)
+{
+ __asm__ __volatile__(
+ "pld\t%a0"
+ :: "p" (ptr));
+}
+
+#define ARCH_HAS_PREFETCHW
+static inline void prefetchw(const void *ptr)
+{
+ __asm__ __volatile__(
+ ".arch_extension mp\n"
+ "pldw\t%a0"
+ :: "p" (ptr));
+}
+
#endif /* __ASSEMBLY__ */
#endif /* __ASM_ARM_ARM32_PROCESSOR_H */
diff --git a/xen/include/asm-arm/arm64/processor.h b/xen/include/asm-arm/arm64/processor.h
index 5bf0867..56b1002 100644
--- a/xen/include/asm-arm/arm64/processor.h
+++ b/xen/include/asm-arm/arm64/processor.h
@@ -106,6 +106,28 @@ struct cpu_user_regs
#define cpu_has_erratum_766422() 0
+/*
+ * Prefetching support
+ */
+#define ARCH_HAS_PREFETCH
+static inline void prefetch(const void *ptr)
+{
+ asm volatile("prfm pldl1keep, %a0\n" : : "p" (ptr));
+}
+
+#define ARCH_HAS_PREFETCHW
+static inline void prefetchw(const void *ptr)
+{
+ asm volatile("prfm pstl1keep, %a0\n" : : "p" (ptr));
+}
+
+#define ARCH_HAS_SPINLOCK_PREFETCH
+static inline void spin_lock_prefetch(const void *x)
+{
+ prefetchw(x);
+}
+
+
#endif /* __ASSEMBLY__ */
#endif /* __ASM_ARM_ARM64_PROCESSOR_H */
--
1.7.10.4
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 16:03 ` Ian Campbell
2014-07-25 16:13 ` Ian Campbell
@ 2014-07-25 16:17 ` Julien Grall
2014-07-25 16:23 ` Ian Campbell
1 sibling, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 16:17 UTC (permalink / raw)
To: Ian Campbell; +Cc: stefano.stabellini, tim, xen-devel
On 07/25/2014 05:03 PM, Ian Campbell wrote:
> On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
>> On 07/25/2014 04:48 PM, Ian Campbell wrote:
>>> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
>>>> Hi Ian,
>>>>
>>>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
>>>>> bitops, cmpxchg, atomics: Import:
>>>>> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>>>>
>>>> Compare to Linux we don't have specific prefetch* helpers. We directly
>>>> use the compiler builtin ones. Shouldn't we import the ARM specific
>>>> helpers to gain in performance?
>>>
>>> My binaries are full of pld instructions where I think I would expect
>>> them, so it seems like the compiler builtin ones are sufficient.
>>>
>>> I suspect the Linux define is there to cope with older compilers or
>>> something.
>>
>> If so:
>
> The compiled output is very different if I use the arch specific
> explicit variants. The explicit variant generates (lots) more pldw and
> (somewhat) fewer pld. I've no idea what this means...
It looks like pldw was defined for ARMv7 with the MP extensions.
AFAIU, pldw is used to signal that we will likely write to this address.
I guess we use the prefetch* helpers more often for writes to memory.
>
> Note that the builtins presumably let gcc reason about whether preloads
> are needed, whereas the explicit variants do not. I'm not sure how that
> results in fewer pld with the explicit variant though! (unless it's
> doing some sort of peephole optimisation and throwing them away?)
>
> I've no idea what the right answer is.
>
> How about we take the updates for now and revisit the question of
> builtin vs explicit prefetches some other time?
I'm fine with it. You can keep the ack for this patch.
Regards,
--
Julien Grall
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 16:13 ` Ian Campbell
@ 2014-07-25 16:20 ` Julien Grall
0 siblings, 0 replies; 13+ messages in thread
From: Julien Grall @ 2014-07-25 16:20 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, stefano.stabellini
On 07/25/2014 05:13 PM, Ian Campbell wrote:
> On Fri, 2014-07-25 at 17:03 +0100, Ian Campbell wrote:
>> On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
>>> On 07/25/2014 04:48 PM, Ian Campbell wrote:
>>>> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
>>>>> Hi Ian,
>>>>>
>>>>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
>>>>>> bitops, cmpxchg, atomics: Import:
>>>>>> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>>>>>
>>>>> Compare to Linux we don't have specific prefetch* helpers. We directly
>>>>> use the compiler builtin ones. Shouldn't we import the ARM specific
>>>>> helpers to gain in performance?
>>>>
>>>> My binaries are full of pld instructions where I think I would expect
>>>> them, so it seems like the compiler builtin ones are sufficient.
>>>>
>>>> I suspect the Linux define is there to cope with older compilers or
>>>> something.
>>>
>>> If so:
>>
>> The compiled output is very different if I use the arch specific
>> explicit variants. The explicit variant generates (lots) more pldw and
>> (somewhat) fewer pld. I've no idea what this means...
>
> It's a bit more obvious for aarch64 where gcc 4.8 doesn't generate any
> prefetches at all via the builtins...
>
> Here's what I've got in my tree. I've no idea if we should take some or
> all of it...
I don't think it will be harmful for ARMv7 to use specific prefetch*
helpers.
[..]
> +/*
> + * Prefetching support
> + */
> +#define ARCH_HAS_PREFETCH
> +static inline void prefetch(const void *ptr)
> +{
> + asm volatile("prfm pldl1keep, %a0\n" : : "p" (ptr));
> +}
> +
> +#define ARCH_HAS_PREFETCHW
> +static inline void prefetchw(const void *ptr)
> +{
> + asm volatile("prfm pstl1keep, %a0\n" : : "p" (ptr));
> +}
> +
> +#define ARCH_HAS_SPINLOCK_PREFETCH
> +static inline void spin_lock_prefetch(const void *x)
> +{
> + prefetchw(x);
> +}
Looking at the code, spin_lock_prefetch is never called in the tree. I'm
not sure we should keep this helper.
Regards,
--
Julien Grall
* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
2014-07-25 16:17 ` Julien Grall
@ 2014-07-25 16:23 ` Ian Campbell
0 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 16:23 UTC (permalink / raw)
To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel
On Fri, 2014-07-25 at 17:17 +0100, Julien Grall wrote:
> On 07/25/2014 05:03 PM, Ian Campbell wrote:
> > On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
> >> On 07/25/2014 04:48 PM, Ian Campbell wrote:
> >>> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> >>>> Hi Ian,
> >>>>
> >>>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> >>>>> bitops, cmpxchg, atomics: Import:
> >>>>> c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> >>>>
> >>>> Compare to Linux we don't have specific prefetch* helpers. We directly
> >>>> use the compiler builtin ones. Shouldn't we import the ARM specific
> >>>> helpers to gain in performance?
> >>>
> >>> My binaries are full of pld instructions where I think I would expect
> >>> them, so it seems like the compiler builtin ones are sufficient.
> >>>
> >>> I suspect the Linux define is there to cope with older compilers or
> >>> something.
> >>
> >> If so:
> >
> > The compiled output is very different if I use the arch specific
> > explicit variants. The explicit variant generates (lots) more pldw and
> > (somewhat) fewer pld. I've no idea what this means...
>
> It looks like that pldw has been defined for ARMv7 with MP extensions.
>
> AFAIU, pldw is used to signal we will likely write on this address.
Oh, I know *that*.
What I couldn't explain is why the builtins should generate 181 pld's
and 6 pldw's (total 187) while the explicit ones generate 127 pld's and
93 pldw's (total 220) for the exact same code base.
Perhaps we simply use prefetchw too often in our code in gcc's opinion
so it elides some of them. Or perhaps the volatile in the explicit
version stops gcc from making other optimisations so there's simply more
occasions where the prefetching is needed.
The difference in the write prefetches is pretty stark though, 6 vs 93.
Ian.
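[Editorial note: the pld/pldw counts quoted above can be reproduced with objdump and grep. The sketch below runs against an embedded disassembly sample so it is self-contained; the xen-syms path and cross-toolchain prefix in the final comment are illustrative:]

```shell
#!/bin/sh
# Sample disassembly fragment standing in for real objdump output.
cat > /tmp/disas.txt <<'EOF'
    pld     [r0]
    pldw    [r1]
    pld     [r2, #64]
EOF

# -w matches whole words, so "pld" does not also count "pldw" lines.
grep -cw pld  /tmp/disas.txt    # read preloads
grep -cw pldw /tmp/disas.txt    # write preloads

# Against a real build it would be something like:
#   arm-linux-gnueabihf-objdump -d xen-syms | grep -cw pldw
```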
* Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
@ 2014-08-04 16:16 ` Ian Campbell
0 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2014-08-04 16:16 UTC (permalink / raw)
To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel
On Fri, 2014-07-25 at 16:36 +0100, Julien Grall wrote:
> Hi Ian,
>
> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > The only really interesting changes here are the updates to mem* which update
> > to actually optimised versions and introduce an optimised memcmp.
>
> I didn't read the whole code as I assume it's just a copy with few
> changes from Linux.
>
> Acked-by: Julien Grall <julien.grall@linaro.org>
Thanks.
Julien also acked the other two patches via IRC, so I have applied.
Ian.