* [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
@ 2014-07-25 15:22 Ian Campbell
  2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: julien.grall, tim, Ian Campbell, stefano.stabellini

The only really interesting changes here are the updates to mem*, which bring
in actually optimised versions and introduce an optimised memcmp.

bitops: No change to the bits we import. Record new baseline.

cmpxchg: Import:
  60010e5 arm64: cmpxchg: update macros to prevent warnings
    Author: Mark Hambleton <mahamble@broadcom.com>
    Signed-off-by: Mark Hambleton <mahamble@broadcom.com>
    Signed-off-by: Mark Brown <broonie@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

  e1dfda9 arm64: xchg: prevent warning if return value is unused
    Author: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

  e1dfda9 resolves the warning which previously caused us to skip 60010e508111.

  Since arm32 and arm64 now differ here (as do Linux arm and arm64), the
  existing definition in asm/system.h gets moved to asm/arm32/cmpxchg.h.
  Previously it shadowed the arm64 one, but the two happened to be identical.
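
  For reference, the new xchg() (shown in full in the arm64 cmpxchg.h hunk
  below) uses the usual GNU statement-expression form; assigning the result
  to a local __ret first is what avoids gcc's "value computed is not used"
  warning when a caller discards the return value:

	#define xchg(ptr,x) \
	({ \
		__typeof__(*(ptr)) __ret; \
		__ret = (__typeof__(*(ptr))) \
			__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
		__ret; \
	})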

atomics: Import:
  8715466 arch,arm64: Convert smp_mb__*()
    Author: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>

  This just drops some unused (by us) smp_mb__*_atomic_*.

spinlocks: No change. Record new baseline.

mem*: Import:
  808dbac arm64: lib: Implement optimized memcpy routine
    Author: zhichang.yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  280adc1 arm64: lib: Implement optimized memmove routine
    Author: zhichang.yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  b29a51f arm64: lib: Implement optimized memset routine
    Author: zhichang.yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  d875c9b arm64: lib: Implement optimized memcmp routine
    Author: zhichang.yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

  These import various routines from Linaro's Cortex Strings library.

  Added assembler.h, similar to the arm32 one, to define the various magic
  symbols which these imported routines depend on (e.g. CPU_LE() and CPU_BE()).
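
  As a rough illustration, the new header simply keeps or drops the wrapped
  text depending on endianness (only little-endian is supported so far):

	#define CPU_BE(x...)
	#define CPU_LE(x...) x

  so a line such as "CPU_LE( rev data1, data1 )" in the imported memcmp.S
  assembles the rev instruction, while CPU_BE() lines expand to nothing.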

str*: No changes. Record new baseline.

  Correct the paths in the README.

*_page: No changes. Record new baseline.

  README previously said clear_page was unused while copy_page was, which was
  backwards.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 xen/arch/arm/README.LinuxPrimitives |   36 +++--
 xen/arch/arm/arm64/lib/Makefile     |    2 +-
 xen/arch/arm/arm64/lib/assembler.h  |   13 ++
 xen/arch/arm/arm64/lib/memchr.S     |    1 +
 xen/arch/arm/arm64/lib/memcmp.S     |  258 +++++++++++++++++++++++++++++++++++
 xen/arch/arm/arm64/lib/memcpy.S     |  193 +++++++++++++++++++++++---
 xen/arch/arm/arm64/lib/memmove.S    |  191 ++++++++++++++++++++++----
 xen/arch/arm/arm64/lib/memset.S     |  208 +++++++++++++++++++++++++---
 xen/include/asm-arm/arm32/cmpxchg.h |    3 +
 xen/include/asm-arm/arm64/atomic.h  |    5 -
 xen/include/asm-arm/arm64/cmpxchg.h |   35 +++--
 xen/include/asm-arm/string.h        |    5 +
 xen/include/asm-arm/system.h        |    3 -
 13 files changed, 844 insertions(+), 109 deletions(-)
 create mode 100644 xen/arch/arm/arm64/lib/assembler.h
 create mode 100644 xen/arch/arm/arm64/lib/memcmp.S

diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
index 6cd03ca..69eeb70 100644
--- a/xen/arch/arm/README.LinuxPrimitives
+++ b/xen/arch/arm/README.LinuxPrimitives
@@ -6,29 +6,26 @@ were last updated.
 arm64:
 =====================================================================
 
-bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b)
+bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027)
 
 linux/arch/arm64/lib/bitops.S           xen/arch/arm/arm64/lib/bitops.S
 linux/arch/arm64/include/asm/bitops.h   xen/include/asm-arm/arm64/bitops.h
 
 ---------------------------------------------------------------------
 
-cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189)
+cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b)
 
 linux/arch/arm64/include/asm/cmpxchg.h  xen/include/asm-arm/arm64/cmpxchg.h
 
-Skipped:
-  60010e5 arm64: cmpxchg: update macros to prevent warnings
-
 ---------------------------------------------------------------------
 
-atomics: last sync @ v3.14-rc7 (last commit: 95c4189)
+atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027)
 
 linux/arch/arm64/include/asm/atomic.h   xen/include/asm-arm/arm64/atomic.h
 
 ---------------------------------------------------------------------
 
-spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189)
+spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9)
 
 linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h
 
@@ -38,30 +35,31 @@ Skipped:
 
 ---------------------------------------------------------------------
 
-mem*: last sync @ v3.14-rc7 (last commit: 4a89922)
+mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240)
 
-linux/arch/arm64/lib/memchr.S             xen/arch/arm/arm64/lib/memchr.S
-linux/arch/arm64/lib/memcpy.S             xen/arch/arm/arm64/lib/memcpy.S
-linux/arch/arm64/lib/memmove.S            xen/arch/arm/arm64/lib/memmove.S
-linux/arch/arm64/lib/memset.S             xen/arch/arm/arm64/lib/memset.S
+linux/arch/arm64/lib/memchr.S           xen/arch/arm/arm64/lib/memchr.S
+linux/arch/arm64/lib/memcmp.S           xen/arch/arm/arm64/lib/memcmp.S
+linux/arch/arm64/lib/memcpy.S           xen/arch/arm/arm64/lib/memcpy.S
+linux/arch/arm64/lib/memmove.S          xen/arch/arm/arm64/lib/memmove.S
+linux/arch/arm64/lib/memset.S           xen/arch/arm/arm64/lib/memset.S
 
-for i in memchr.S memcpy.S memmove.S memset.S ; do
+for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do
     diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
 done
 
 ---------------------------------------------------------------------
 
-str*: last sync @ v3.14-rc7 (last commit: 2b8cac8)
+str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
 
-linux/arch/arm/lib/strchr.S             xen/arch/arm/arm64/lib/strchr.S
-linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm64/lib/strrchr.S
+linux/arch/arm64/lib/strchr.S           xen/arch/arm/arm64/lib/strchr.S
+linux/arch/arm64/lib/strrchr.S          xen/arch/arm/arm64/lib/strrchr.S
 
 ---------------------------------------------------------------------
 
-{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13)
+{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
 
-linux/arch/arm64/lib/clear_page.S       unused in Xen
-linux/arch/arm64/lib/copy_page.S        xen/arch/arm/arm64/lib/copy_page.S
+linux/arch/arm64/lib/clear_page.S       xen/arch/arm/arm64/lib/clear_page.S
+linux/arch/arm64/lib/copy_page.S        unused in Xen
 
 =====================================================================
 arm32
diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
index b895afa..2e7fb64 100644
--- a/xen/arch/arm/arm64/lib/Makefile
+++ b/xen/arch/arm/arm64/lib/Makefile
@@ -1,4 +1,4 @@
-obj-y += memcpy.o memmove.o memset.o memchr.o
+obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
 obj-y += clear_page.o
 obj-y += bitops.o find_next_bit.o
 obj-y += strchr.o strrchr.o
diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h
new file mode 100644
index 0000000..84669d1
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/assembler.h
@@ -0,0 +1,13 @@
+#ifndef __ASM_ASSEMBLER_H__
+#define __ASM_ASSEMBLER_H__
+
+#ifndef __ASSEMBLY__
+#error "Only include this from assembly code"
+#endif
+
+/* Only LE support so far */
+#define CPU_BE(x...)
+#define CPU_LE(x...) x
+
+#endif /* __ASM_ASSEMBLER_H__ */
+
diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S
index 3cc1b01..b04590c 100644
--- a/xen/arch/arm/arm64/lib/memchr.S
+++ b/xen/arch/arm/arm64/lib/memchr.S
@@ -18,6 +18,7 @@
  */
 
 #include <xen/config.h>
+#include "assembler.h"
 
 /*
  * Find a character in an area of memory.
diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S
new file mode 100644
index 0000000..9aad925
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/memcmp.S
@@ -0,0 +1,258 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+#include "assembler.h"
+
+/*
+* compare memory areas(when two memory areas' offset are different,
+* alignment handled by the hardware)
+*
+* Parameters:
+*  x0 - const memory area 1 pointer
+*  x1 - const memory area 2 pointer
+*  x2 - the maximal compare byte length
+* Returns:
+*  x0 - a compare result, maybe less than, equal to, or greater than ZERO
+*/
+
+/* Parameters and result.  */
+src1		.req	x0
+src2		.req	x1
+limit		.req	x2
+result		.req	x0
+
+/* Internal variables.  */
+data1		.req	x3
+data1w		.req	w3
+data2		.req	x4
+data2w		.req	w4
+has_nul		.req	x5
+diff		.req	x6
+endloop		.req	x7
+tmp1		.req	x8
+tmp2		.req	x9
+tmp3		.req	x10
+pos		.req	x11
+limit_wd	.req	x12
+mask		.req	x13
+
+ENTRY(memcmp)
+	cbz	limit, .Lret0
+	eor	tmp1, src1, src2
+	tst	tmp1, #7
+	b.ne	.Lmisaligned8
+	ands	tmp1, src1, #7
+	b.ne	.Lmutual_align
+	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
+	lsr	limit_wd, limit_wd, #3 /* Convert to Dwords.  */
+	/*
+	* The input source addresses are at alignment boundary.
+	* Directly compare eight bytes each time.
+	*/
+.Lloop_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+.Lstart_realigned:
+	subs	limit_wd, limit_wd, #1
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, cs	/* Last Dword or differences.  */
+	cbz	endloop, .Lloop_aligned
+
+	/* Not reached the limit, must have found a diff.  */
+	tbz	limit_wd, #63, .Lnot_limit
+
+	/* Limit % 8 == 0 => the diff is in the last 8 bytes. */
+	ands	limit, limit, #7
+	b.eq	.Lnot_limit
+	/*
+	* The remained bytes less than 8. It is needed to extract valid data
+	* from last eight bytes of the intended memory range.
+	*/
+	lsl	limit, limit, #3	/* bytes-> bits.  */
+	mov	mask, #~0
+CPU_BE( lsr	mask, mask, limit )
+CPU_LE( lsl	mask, mask, limit )
+	bic	data1, data1, mask
+	bic	data2, data2, mask
+
+	orr	diff, diff, mask
+	b	.Lnot_limit
+
+.Lmutual_align:
+	/*
+	* Sources are mutually aligned, but are not currently at an
+	* alignment boundary. Round down the addresses and then mask off
+	* the bytes that precede the start point.
+	*/
+	bic	src1, src1, #7
+	bic	src2, src2, #7
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	/*
+	* We can not add limit with alignment offset(tmp1) here. Since the
+	* addition probably make the limit overflown.
+	*/
+	sub	limit_wd, limit, #1/*limit != 0, so no underflow.*/
+	and	tmp3, limit_wd, #7
+	lsr	limit_wd, limit_wd, #3
+	add	tmp3, tmp3, tmp1
+	add	limit_wd, limit_wd, tmp3, lsr #3
+	add	limit, limit, tmp1/* Adjust the limit for the extra.  */
+
+	lsl	tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
+	neg	tmp1, tmp1/* Bits to alignment -64.  */
+	mov	tmp2, #~0
+	/*mask off the non-intended bytes before the start address.*/
+CPU_BE( lsl	tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
+	/* Little-endian.  Early bytes are at LSB.  */
+CPU_LE( lsr	tmp2, tmp2, tmp1 )
+
+	orr	data1, data1, tmp2
+	orr	data2, data2, tmp2
+	b	.Lstart_realigned
+
+	/*src1 and src2 have different alignment offset.*/
+.Lmisaligned8:
+	cmp	limit, #8
+	b.lo	.Ltiny8proc /*limit < 8: compare byte by byte*/
+
+	and	tmp1, src1, #7
+	neg	tmp1, tmp1
+	add	tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
+	and	tmp2, src2, #7
+	neg	tmp2, tmp2
+	add	tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
+	subs	tmp3, tmp1, tmp2
+	csel	pos, tmp1, tmp2, hi /*Choose the maximum.*/
+
+	sub	limit, limit, pos
+	/*compare the proceeding bytes in the first 8 byte segment.*/
+.Ltinycmp:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	pos, pos, #1
+	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000.  */
+	b.eq	.Ltinycmp
+	cbnz	pos, 1f /*diff occurred before the last byte.*/
+	cmp	data1w, data2w
+	b.eq	.Lstart_align
+1:
+	sub	result, data1, data2
+	ret
+
+.Lstart_align:
+	lsr	limit_wd, limit, #3
+	cbz	limit_wd, .Lremain8
+
+	ands	xzr, src1, #7
+	b.eq	.Lrecal_offset
+	/*process more leading bytes to make src1 aligned...*/
+	add	src1, src1, tmp3 /*backwards src1 to alignment boundary*/
+	add	src2, src2, tmp3
+	sub	limit, limit, tmp3
+	lsr	limit_wd, limit, #3
+	cbz	limit_wd, .Lremain8
+	/*load 8 bytes from aligned SRC1..*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	subs	limit_wd, limit_wd, #1
+	eor	diff, data1, data2  /*Non-zero if differences found.*/
+	csinv	endloop, diff, xzr, ne
+	cbnz	endloop, .Lunequal_proc
+	/*How far is the current SRC2 from the alignment boundary...*/
+	and	tmp3, tmp3, #7
+
+.Lrecal_offset:/*src1 is aligned now..*/
+	neg	pos, tmp3
+.Lloopcmp_proc:
+	/*
+	* Divide the eight bytes into two parts. First,backwards the src2
+	* to an alignment boundary,load eight bytes and compare from
+	* the SRC2 alignment boundary. If all 8 bytes are equal,then start
+	* the second part's comparison. Otherwise finish the comparison.
+	* This special handle can garantee all the accesses are in the
+	* thread/task space in avoid to overrange access.
+	*/
+	ldr	data1, [src1,pos]
+	ldr	data2, [src2,pos]
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	cbnz	diff, .Lnot_limit
+
+	/*The second part process*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	subs	limit_wd, limit_wd, #1
+	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+	cbz	endloop, .Lloopcmp_proc
+.Lunequal_proc:
+	cbz	diff, .Lremain8
+
+/*There is differnence occured in the latest comparison.*/
+.Lnot_limit:
+/*
+* For little endian,reverse the low significant equal bits into MSB,then
+* following CLZ can find how many equal bits exist.
+*/
+CPU_LE( rev	diff, diff )
+CPU_LE( rev	data1, data1 )
+CPU_LE( rev	data2, data2 )
+
+	/*
+	* The MS-non-zero bit of DIFF marks either the first bit
+	* that is different, or the end of the significant data.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	clz	pos, diff
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/*
+	* We need to zero-extend (char is unsigned) the value and then
+	* perform a signed subtraction.
+	*/
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+
+.Lremain8:
+	/* Limit % 8 == 0 =>. all data are equal.*/
+	ands	limit, limit, #7
+	b.eq	.Lret0
+
+.Ltiny8proc:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	limit, limit, #1
+
+	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000. */
+	b.eq	.Ltiny8proc
+	sub	result, data1, data2
+	ret
+.Lret0:
+	mov	result, #0
+	ret
+ENDPROC(memcmp)
diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S
index c8197c6..7cc885d 100644
--- a/xen/arch/arm/arm64/lib/memcpy.S
+++ b/xen/arch/arm/arm64/lib/memcpy.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
  */
 
 #include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
 
 /*
  * Copy a buffer from src to dest (alignment handled by the hardware)
@@ -26,27 +36,166 @@
  * Returns:
  *	x0 - dest
  */
+dstin	.req	x0
+src	.req	x1
+count	.req	x2
+tmp1	.req	x3
+tmp1w	.req	w3
+tmp2	.req	x4
+tmp2w	.req	w4
+tmp3	.req	x5
+tmp3w	.req	w5
+dst	.req	x6
+
+A_l	.req	x7
+A_h	.req	x8
+B_l	.req	x9
+B_h	.req	x10
+C_l	.req	x11
+C_h	.req	x12
+D_l	.req	x13
+D_h	.req	x14
+
 ENTRY(memcpy)
-	mov	x4, x0
-	subs	x2, x2, #8
-	b.mi	2f
-1:	ldr	x3, [x1], #8
-	subs	x2, x2, #8
-	str	x3, [x4], #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	ldr	w3, [x1], #4
-	sub	x2, x2, #4
-	str	w3, [x4], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-	ldrh	w3, [x1], #2
-	sub	x2, x2, #2
-	strh	w3, [x4], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-	ldrb	w3, [x1]
-	strb	w3, [x4]
-5:	ret
+	mov	dst, dstin
+	cmp	count, #16
+	/*When memory length is less than 16, the accessed are not aligned.*/
+	b.lo	.Ltiny15
+
+	neg	tmp2, src
+	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
+	b.eq	.LSrcAligned
+	sub	count, count, tmp2
+	/*
+	* Copy the leading memory data from src to dst in an increasing
+	* address order.By this way,the risk of overwritting the source
+	* memory data is eliminated when the distance between src and
+	* dst is less than 16. The memory accesses here are alignment.
+	*/
+	tbz	tmp2, #0, 1f
+	ldrb	tmp1w, [src], #1
+	strb	tmp1w, [dst], #1
+1:
+	tbz	tmp2, #1, 2f
+	ldrh	tmp1w, [src], #2
+	strh	tmp1w, [dst], #2
+2:
+	tbz	tmp2, #2, 3f
+	ldr	tmp1w, [src], #4
+	str	tmp1w, [dst], #4
+3:
+	tbz	tmp2, #3, .LSrcAligned
+	ldr	tmp1, [src],#8
+	str	tmp1, [dst],#8
+
+.LSrcAligned:
+	cmp	count, #64
+	b.ge	.Lcpy_over64
+	/*
+	* Deal with small copies quickly by dropping straight into the
+	* exit block.
+	*/
+.Ltail63:
+	/*
+	* Copy up to 48 bytes of data. At this point we only need the
+	* bottom 6 bits of count to be accurate.
+	*/
+	ands	tmp1, count, #0x30
+	b.eq	.Ltiny15
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	ldp	A_l, A_h, [src], #16
+	stp	A_l, A_h, [dst], #16
+1:
+	ldp	A_l, A_h, [src], #16
+	stp	A_l, A_h, [dst], #16
+2:
+	ldp	A_l, A_h, [src], #16
+	stp	A_l, A_h, [dst], #16
+.Ltiny15:
+	/*
+	* Prefer to break one ldp/stp into several load/store to access
+	* memory in an increasing address order,rather than to load/store 16
+	* bytes from (src-16) to (dst-16) and to backward the src to aligned
+	* address,which way is used in original cortex memcpy. If keeping
+	* the original memcpy process here, memmove need to satisfy the
+	* precondition that src address is at least 16 bytes bigger than dst
+	* address,otherwise some source data will be overwritten when memove
+	* call memcpy directly. To make memmove simpler and decouple the
+	* memcpy's dependency on memmove, withdrew the original process.
+	*/
+	tbz	count, #3, 1f
+	ldr	tmp1, [src], #8
+	str	tmp1, [dst], #8
+1:
+	tbz	count, #2, 2f
+	ldr	tmp1w, [src], #4
+	str	tmp1w, [dst], #4
+2:
+	tbz	count, #1, 3f
+	ldrh	tmp1w, [src], #2
+	strh	tmp1w, [dst], #2
+3:
+	tbz	count, #0, .Lexitfunc
+	ldrb	tmp1w, [src]
+	strb	tmp1w, [dst]
+
+.Lexitfunc:
+	ret
+
+.Lcpy_over64:
+	subs	count, count, #128
+	b.ge	.Lcpy_body_large
+	/*
+	* Less than 128 bytes to copy, so handle 64 here and then jump
+	* to the tail.
+	*/
+	ldp	A_l, A_h, [src],#16
+	stp	A_l, A_h, [dst],#16
+	ldp	B_l, B_h, [src],#16
+	ldp	C_l, C_h, [src],#16
+	stp	B_l, B_h, [dst],#16
+	stp	C_l, C_h, [dst],#16
+	ldp	D_l, D_h, [src],#16
+	stp	D_l, D_h, [dst],#16
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
+
+	/*
+	* Critical loop.  Start at a new cache line boundary.  Assuming
+	* 64 bytes per line this ensures the entire loop is in one line.
+	*/
+	.p2align	L1_CACHE_SHIFT
+.Lcpy_body_large:
+	/* pre-get 64 bytes data. */
+	ldp	A_l, A_h, [src],#16
+	ldp	B_l, B_h, [src],#16
+	ldp	C_l, C_h, [src],#16
+	ldp	D_l, D_h, [src],#16
+1:
+	/*
+	* interlace the load of next 64 bytes data block with store of the last
+	* loaded 64 bytes data.
+	*/
+	stp	A_l, A_h, [dst],#16
+	ldp	A_l, A_h, [src],#16
+	stp	B_l, B_h, [dst],#16
+	ldp	B_l, B_h, [src],#16
+	stp	C_l, C_h, [dst],#16
+	ldp	C_l, C_h, [src],#16
+	stp	D_l, D_h, [dst],#16
+	ldp	D_l, D_h, [src],#16
+	subs	count, count, #64
+	b.ge	1b
+	stp	A_l, A_h, [dst],#16
+	stp	B_l, B_h, [dst],#16
+	stp	C_l, C_h, [dst],#16
+	stp	D_l, D_h, [dst],#16
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
 ENDPROC(memcpy)
diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S
index 1bf0936..f4065b9 100644
--- a/xen/arch/arm/arm64/lib/memmove.S
+++ b/xen/arch/arm/arm64/lib/memmove.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
  */
 
 #include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
 
 /*
  * Move a buffer from src to test (alignment handled by the hardware).
@@ -27,30 +37,161 @@
  * Returns:
  *	x0 - dest
  */
+dstin	.req	x0
+src	.req	x1
+count	.req	x2
+tmp1	.req	x3
+tmp1w	.req	w3
+tmp2	.req	x4
+tmp2w	.req	w4
+tmp3	.req	x5
+tmp3w	.req	w5
+dst	.req	x6
+
+A_l	.req	x7
+A_h	.req	x8
+B_l	.req	x9
+B_h	.req	x10
+C_l	.req	x11
+C_h	.req	x12
+D_l	.req	x13
+D_h	.req	x14
+
 ENTRY(memmove)
-	cmp	x0, x1
-	b.ls	memcpy
-	add	x4, x0, x2
-	add	x1, x1, x2
-	subs	x2, x2, #8
-	b.mi	2f
-1:	ldr	x3, [x1, #-8]!
-	subs	x2, x2, #8
-	str	x3, [x4, #-8]!
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	ldr	w3, [x1, #-4]!
-	sub	x2, x2, #4
-	str	w3, [x4, #-4]!
-3:	adds	x2, x2, #2
-	b.mi	4f
-	ldrh	w3, [x1, #-2]!
-	sub	x2, x2, #2
-	strh	w3, [x4, #-2]!
-4:	adds	x2, x2, #1
-	b.mi	5f
-	ldrb	w3, [x1, #-1]
-	strb	w3, [x4, #-1]
-5:	ret
+	cmp	dstin, src
+	b.lo	memcpy
+	add	tmp1, src, count
+	cmp	dstin, tmp1
+	b.hs	memcpy		/* No overlap.  */
+
+	add	dst, dstin, count
+	add	src, src, count
+	cmp	count, #16
+	b.lo	.Ltail15  /*probably non-alignment accesses.*/
+
+	ands	tmp2, src, #15     /* Bytes to reach alignment.  */
+	b.eq	.LSrcAligned
+	sub	count, count, tmp2
+	/*
+	* process the aligned offset length to make the src aligned firstly.
+	* those extra instructions' cost is acceptable. It also make the
+	* coming accesses are based on aligned address.
+	*/
+	tbz	tmp2, #0, 1f
+	ldrb	tmp1w, [src, #-1]!
+	strb	tmp1w, [dst, #-1]!
+1:
+	tbz	tmp2, #1, 2f
+	ldrh	tmp1w, [src, #-2]!
+	strh	tmp1w, [dst, #-2]!
+2:
+	tbz	tmp2, #2, 3f
+	ldr	tmp1w, [src, #-4]!
+	str	tmp1w, [dst, #-4]!
+3:
+	tbz	tmp2, #3, .LSrcAligned
+	ldr	tmp1, [src, #-8]!
+	str	tmp1, [dst, #-8]!
+
+.LSrcAligned:
+	cmp	count, #64
+	b.ge	.Lcpy_over64
+
+	/*
+	* Deal with small copies quickly by dropping straight into the
+	* exit block.
+	*/
+.Ltail63:
+	/*
+	* Copy up to 48 bytes of data. At this point we only need the
+	* bottom 6 bits of count to be accurate.
+	*/
+	ands	tmp1, count, #0x30
+	b.eq	.Ltail15
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+1:
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+2:
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+
+.Ltail15:
+	tbz	count, #3, 1f
+	ldr	tmp1, [src, #-8]!
+	str	tmp1, [dst, #-8]!
+1:
+	tbz	count, #2, 2f
+	ldr	tmp1w, [src, #-4]!
+	str	tmp1w, [dst, #-4]!
+2:
+	tbz	count, #1, 3f
+	ldrh	tmp1w, [src, #-2]!
+	strh	tmp1w, [dst, #-2]!
+3:
+	tbz	count, #0, .Lexitfunc
+	ldrb	tmp1w, [src, #-1]
+	strb	tmp1w, [dst, #-1]
+
+.Lexitfunc:
+	ret
+
+.Lcpy_over64:
+	subs	count, count, #128
+	b.ge	.Lcpy_body_large
+	/*
+	* Less than 128 bytes to copy, so handle 64 bytes here and then jump
+	* to the tail.
+	*/
+	ldp	A_l, A_h, [src, #-16]
+	stp	A_l, A_h, [dst, #-16]
+	ldp	B_l, B_h, [src, #-32]
+	ldp	C_l, C_h, [src, #-48]
+	stp	B_l, B_h, [dst, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	ldp	D_l, D_h, [src, #-64]!
+	stp	D_l, D_h, [dst, #-64]!
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
+
+	/*
+	* Critical loop. Start at a new cache line boundary. Assuming
+	* 64 bytes per line this ensures the entire loop is in one line.
+	*/
+	.p2align	L1_CACHE_SHIFT
+.Lcpy_body_large:
+	/* pre-load 64 bytes data. */
+	ldp	A_l, A_h, [src, #-16]
+	ldp	B_l, B_h, [src, #-32]
+	ldp	C_l, C_h, [src, #-48]
+	ldp	D_l, D_h, [src, #-64]!
+1:
+	/*
+	* interlace the load of next 64 bytes data block with store of the last
+	* loaded 64 bytes data.
+	*/
+	stp	A_l, A_h, [dst, #-16]
+	ldp	A_l, A_h, [src, #-16]
+	stp	B_l, B_h, [dst, #-32]
+	ldp	B_l, B_h, [src, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	ldp	C_l, C_h, [src, #-48]
+	stp	D_l, D_h, [dst, #-64]!
+	ldp	D_l, D_h, [src, #-64]!
+	subs	count, count, #64
+	b.ge	1b
+	stp	A_l, A_h, [dst, #-16]
+	stp	B_l, B_h, [dst, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	stp	D_l, D_h, [dst, #-64]!
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
 ENDPROC(memmove)
diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
index 25a4fb6..4ee714d 100644
--- a/xen/arch/arm/arm64/lib/memset.S
+++ b/xen/arch/arm/arm64/lib/memset.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
  */
 
 #include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
 
 /*
  * Fill in the buffer with character c (alignment handled by the hardware)
@@ -26,27 +36,181 @@
  * Returns:
  *	x0 - buf
  */
+
+dstin		.req	x0
+val		.req	w1
+count		.req	x2
+tmp1		.req	x3
+tmp1w		.req	w3
+tmp2		.req	x4
+tmp2w		.req	w4
+zva_len_x	.req	x5
+zva_len		.req	w5
+zva_bits_x	.req	x6
+
+A_l		.req	x7
+A_lw		.req	w7
+dst		.req	x8
+tmp3w		.req	w9
+tmp3		.req	x9
+
 ENTRY(memset)
-	mov	x4, x0
-	and	w1, w1, #0xff
-	orr	w1, w1, w1, lsl #8
-	orr	w1, w1, w1, lsl #16
-	orr	x1, x1, x1, lsl #32
-	subs	x2, x2, #8
-	b.mi	2f
-1:	str	x1, [x4], #8
-	subs	x2, x2, #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	sub	x2, x2, #4
-	str	w1, [x4], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-	sub	x2, x2, #2
-	strh	w1, [x4], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-	strb	w1, [x4]
-5:	ret
+	mov	dst, dstin	/* Preserve return value.  */
+	and	A_lw, val, #255
+	orr	A_lw, A_lw, A_lw, lsl #8
+	orr	A_lw, A_lw, A_lw, lsl #16
+	orr	A_l, A_l, A_l, lsl #32
+
+	cmp	count, #15
+	b.hi	.Lover16_proc
+	/*All store maybe are non-aligned..*/
+	tbz	count, #3, 1f
+	str	A_l, [dst], #8
+1:
+	tbz	count, #2, 2f
+	str	A_lw, [dst], #4
+2:
+	tbz	count, #1, 3f
+	strh	A_lw, [dst], #2
+3:
+	tbz	count, #0, 4f
+	strb	A_lw, [dst]
+4:
+	ret
+
+.Lover16_proc:
+	/*Whether  the start address is aligned with 16.*/
+	neg	tmp2, dst
+	ands	tmp2, tmp2, #15
+	b.eq	.Laligned
+/*
+* The count is not less than 16, we can use stp to store the start 16 bytes,
+* then adjust the dst aligned with 16.This process will make the current
+* memory address at alignment boundary.
+*/
+	stp	A_l, A_l, [dst] /*non-aligned store..*/
+	/*make the dst aligned..*/
+	sub	count, count, tmp2
+	add	dst, dst, tmp2
+
+.Laligned:
+	cbz	A_l, .Lzero_mem
+
+.Ltail_maybe_long:
+	cmp	count, #64
+	b.ge	.Lnot_short
+.Ltail63:
+	ands	tmp1, count, #0x30
+	b.eq	3f
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	stp	A_l, A_l, [dst], #16
+1:
+	stp	A_l, A_l, [dst], #16
+2:
+	stp	A_l, A_l, [dst], #16
+/*
+* The last store length is less than 16,use stp to write last 16 bytes.
+* It will lead some bytes written twice and the access is non-aligned.
+*/
+3:
+	ands	count, count, #15
+	cbz	count, 4f
+	add	dst, dst, count
+	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
+4:
+	ret
+
+	/*
+	* Critical loop. Start at a new cache line boundary. Assuming
+	* 64 bytes per line, this ensures the entire loop is in one line.
+	*/
+	.p2align	L1_CACHE_SHIFT
+.Lnot_short:
+	sub	dst, dst, #16/* Pre-bias.  */
+	sub	count, count, #64
+1:
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	stp	A_l, A_l, [dst, #48]
+	stp	A_l, A_l, [dst, #64]!
+	subs	count, count, #64
+	b.ge	1b
+	tst	count, #0x3f
+	add	dst, dst, #16
+	b.ne	.Ltail63
+.Lexitfunc:
+	ret
+
+	/*
+	* For zeroing memory, check to see if we can use the ZVA feature to
+	* zero entire 'cache' lines.
+	*/
+.Lzero_mem:
+	cmp	count, #63
+	b.le	.Ltail63
+	/*
+	* For zeroing small amounts of memory, it's not worth setting up
+	* the line-clear code.
+	*/
+	cmp	count, #128
+	b.lt	.Lnot_short /*count is at least  128 bytes*/
+
+	mrs	tmp1, dczid_el0
+	tbnz	tmp1, #4, .Lnot_short
+	mov	tmp3w, #4
+	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
+	lsl	zva_len, tmp3w, zva_len
+
+	ands	tmp3w, zva_len, #63
+	/*
+	* ensure the zva_len is not less than 64.
+	* It is not meaningful to use ZVA if the block size is less than 64.
+	*/
+	b.ne	.Lnot_short
+.Lzero_by_line:
+	/*
+	* Compute how far we need to go to become suitably aligned. We're
+	* already at quad-word alignment.
+	*/
+	cmp	count, zva_len_x
+	b.lt	.Lnot_short		/* Not enough to reach alignment.  */
+	sub	zva_bits_x, zva_len_x, #1
+	neg	tmp2, dst
+	ands	tmp2, tmp2, zva_bits_x
+	b.eq	2f			/* Already aligned.  */
+	/* Not aligned, check that there's enough to copy after alignment.*/
+	sub	tmp1, count, tmp2
+	/*
+	* grantee the remain length to be ZVA is bigger than 64,
+	* avoid to make the 2f's process over mem range.*/
+	cmp	tmp1, #64
+	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
+	b.lt	.Lnot_short
+	/*
+	* We know that there's at least 64 bytes to zero and that it's safe
+	* to overrun by 64 bytes.
+	*/
+	mov	count, tmp1
+1:
+	stp	A_l, A_l, [dst]
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	subs	tmp2, tmp2, #64
+	stp	A_l, A_l, [dst, #48]
+	add	dst, dst, #64
+	b.ge	1b
+	/* We've overrun a bit, so adjust dst downwards.*/
+	add	dst, dst, tmp2
+2:
+	sub	count, count, zva_len_x
+3:
+	dc	zva, dst
+	add	dst, dst, zva_len_x
+	subs	count, count, zva_len_x
+	b.ge	3b
+	ands	count, count, zva_bits_x
+	b.ne	.Ltail_maybe_long
+	ret
 ENDPROC(memset)
diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
index 3f4e7a1..9a511f2 100644
--- a/xen/include/asm-arm/arm32/cmpxchg.h
+++ b/xen/include/asm-arm/arm32/cmpxchg.h
@@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 	return ret;
 }
 
+#define xchg(ptr,x) \
+	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
+
 /*
  * Atomic compare and exchange.  Compare OLD with MEM, if identical,
  * store NEW in MEM.  Return the initial value in MEM.  Success is
diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
index b5d50f2..b49219e 100644
--- a/xen/include/asm-arm/arm64/atomic.h
+++ b/xen/include/asm-arm/arm64/atomic.h
@@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
 
 #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #endif
 /*
  * Local variables:
diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
index 4e930ce..ae42b2f 100644
--- a/xen/include/asm-arm/arm64/cmpxchg.h
+++ b/xen/include/asm-arm/arm64/cmpxchg.h
@@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 }
 
 #define xchg(ptr,x) \
-	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
+({ \
+	__typeof__(*(ptr)) __ret; \
+	__ret = (__typeof__(*(ptr))) \
+		__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
+	__ret; \
+})
 
 extern void __bad_cmpxchg(volatile void *ptr, int size);
 
@@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
 	return ret;
 }
 
-#define cmpxchg(ptr,o,n)						\
-	((__typeof__(*(ptr)))__cmpxchg_mb((ptr),			\
-					  (unsigned long)(o),		\
-					  (unsigned long)(n),		\
-					  sizeof(*(ptr))))
-
-#define cmpxchg_local(ptr,o,n)						\
-	((__typeof__(*(ptr)))__cmpxchg((ptr),				\
-				       (unsigned long)(o),		\
-				       (unsigned long)(n),		\
-				       sizeof(*(ptr))))
+#define cmpxchg(ptr, o, n) \
+({ \
+	__typeof__(*(ptr)) __ret; \
+	__ret = (__typeof__(*(ptr))) \
+		__cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
+			     sizeof(*(ptr))); \
+	__ret; \
+})
+
+#define cmpxchg_local(ptr, o, n) \
+({ \
+	__typeof__(*(ptr)) __ret; \
+	__ret = (__typeof__(*(ptr))) \
+		__cmpxchg((ptr), (unsigned long)(o), \
+			  (unsigned long)(n), sizeof(*(ptr))); \
+	__ret; \
+})
 
 #endif
 /*
diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
index 3242762..dfad1fe 100644
--- a/xen/include/asm-arm/string.h
+++ b/xen/include/asm-arm/string.h
@@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
 #define __HAVE_ARCH_MEMCPY
 extern void * memcpy(void *, const void *, __kernel_size_t);
 
+#if defined(CONFIG_ARM_64)
+#define __HAVE_ARCH_MEMCMP
+extern int memcmp(const void *, const void *, __kernel_size_t);
+#endif
+
 /* Some versions of gcc don't have this builtin. It's non-critical anyway. */
 #define __HAVE_ARCH_MEMMOVE
 extern void *memmove(void *dest, const void *src, size_t n);
diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
index 7aaaf50..ce3d38a 100644
--- a/xen/include/asm-arm/system.h
+++ b/xen/include/asm-arm/system.h
@@ -33,9 +33,6 @@
 
 #define smp_wmb()       dmb(ishst)
 
-#define xchg(ptr,x) \
-        ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
-
 /*
  * This is used to ensure the compiler did actually allocate the register we
  * asked it for some inline assembly sequences.  Apparently we can't trust
-- 
1.7.10.4


* [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
@ 2014-07-25 15:22 ` Ian Campbell
  2014-07-25 15:42   ` Julien Grall
  2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
  2014-07-25 15:43 ` Ian Campbell
  2 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: julien.grall, tim, Ian Campbell, stefano.stabellini

bitops, cmpxchg, atomics: Import:
  c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
    Author: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

atomics: In addition to the above import:
  db38ee8 ARM: 7983/1: atomics: implement a better __atomic_add_unless for v6+
    Author: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
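
  As a usage sketch (illustrative only, not part of this patch):
  __atomic_add_unless(v, a, u) atomically adds a to *v unless *v is already
  u and returns the old value, so a hypothetical caller wanting an
  "increment unless zero" helper could do:

	static inline int example_inc_not_zero(atomic_t *v)
	{
		/* non-zero iff the counter was actually incremented */
		return __atomic_add_unless(v, 1, 0) != 0;
	}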

spinlocks: We have diverged from Linux, so no updates but note this in the README.

mem* and str*: Import:
  d98b90e ARM: 7990/1: asm: rename logical shift macros push pull into lspush lspull
    Author: Victor Kamensky <victor.kamensky@linaro.org>
    Suggested-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
    Acked-by: Nicolas Pitre <nico@linaro.org>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

  For some reason str* were mentioned under mem* in the README, fix.

libgcc: No changes, update baseline

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 xen/arch/arm/README.LinuxPrimitives    |   17 +++++++--------
 xen/arch/arm/arm32/lib/assembler.h     |    8 +++----
 xen/arch/arm/arm32/lib/bitops.h        |    5 +++++
 xen/arch/arm/arm32/lib/copy_template.S |   36 ++++++++++++++++----------------
 xen/arch/arm/arm32/lib/memmove.S       |   36 ++++++++++++++++----------------
 xen/include/asm-arm/arm32/atomic.h     |   32 ++++++++++++++++++++++++++++
 xen/include/asm-arm/arm32/cmpxchg.h    |    5 +++++
 7 files changed, 90 insertions(+), 49 deletions(-)

diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
index 69eeb70..7e15b04 100644
--- a/xen/arch/arm/README.LinuxPrimitives
+++ b/xen/arch/arm/README.LinuxPrimitives
@@ -65,7 +65,7 @@ linux/arch/arm64/lib/copy_page.S        unused in Xen
 arm32
 =====================================================================
 
-bitops: last sync @ v3.14-rc7 (last commit: b7ec699)
+bitops: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
 
 linux/arch/arm/lib/bitops.h             xen/arch/arm/arm32/lib/bitops.h
 linux/arch/arm/lib/changebit.S          xen/arch/arm/arm32/lib/changebit.S
@@ -83,13 +83,13 @@ done
 
 ---------------------------------------------------------------------
 
-cmpxchg: last sync @ v3.14-rc7 (last commit: 775ebcc)
+cmpxchg: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
 
 linux/arch/arm/include/asm/cmpxchg.h    xen/include/asm-arm/arm32/cmpxchg.h
 
 ---------------------------------------------------------------------
 
-atomics: last sync @ v3.14-rc7 (last commit: aed3a4e)
+atomics: last sync @ v3.16-rc6 (last commit: 030d0178bdbd)
 
 linux/arch/arm/include/asm/atomic.h     xen/include/asm-arm/arm32/atomic.h
 
@@ -99,6 +99,8 @@ spinlocks: last sync: 15e7e5c1ebf5
 
 linux/arch/arm/include/asm/spinlock.h   xen/include/asm-arm/arm32/spinlock.h
 
+*** Linux has switched to ticket locks but we still use bitlocks.
+
 resync to v3.14-rc7:
 
   7c8746a ARM: 7955/1: spinlock: ensure we have a compiler barrier before sev
@@ -111,7 +113,7 @@ resync to v3.14-rc7:
 
 ---------------------------------------------------------------------
 
-mem*: last sync @ v3.14-rc7 (last commit: 418df63a)
+mem*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
 
 linux/arch/arm/lib/copy_template.S      xen/arch/arm/arm32/lib/copy_template.S
 linux/arch/arm/lib/memchr.S             xen/arch/arm/arm32/lib/memchr.S
@@ -120,9 +122,6 @@ linux/arch/arm/lib/memmove.S            xen/arch/arm/arm32/lib/memmove.S
 linux/arch/arm/lib/memset.S             xen/arch/arm/arm32/lib/memset.S
 linux/arch/arm/lib/memzero.S            xen/arch/arm/arm32/lib/memzero.S
 
-linux/arch/arm/lib/strchr.S             xen/arch/arm/arm32/lib/strchr.S
-linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm32/lib/strrchr.S
-
 for i in copy_template.S memchr.S memcpy.S memmove.S memset.S \
          memzero.S ; do
     diff -u linux/arch/arm/lib/$i xen/arch/arm/arm32/lib/$i
@@ -130,7 +129,7 @@ done
 
 ---------------------------------------------------------------------
 
-str*: last sync @ v3.13-rc7 (last commit: 93ed397)
+str*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
 
 linux/arch/arm/lib/strchr.S             xen/arch/arm/arm32/lib/strchr.S
 linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm32/lib/strrchr.S
@@ -145,7 +144,7 @@ clear_page == memset
 
 ---------------------------------------------------------------------
 
-libgcc: last sync @ v3.14-rc7 (last commit: 01885bc)
+libgcc: last sync @ v3.16-rc6 (last commit: 01885bc)
 
 linux/arch/arm/lib/lib1funcs.S          xen/arch/arm/arm32/lib/lib1funcs.S
 linux/arch/arm/lib/lshrdi3.S            xen/arch/arm/arm32/lib/lshrdi3.S
diff --git a/xen/arch/arm/arm32/lib/assembler.h b/xen/arch/arm/arm32/lib/assembler.h
index f8d4b3a..6de2638 100644
--- a/xen/arch/arm/arm32/lib/assembler.h
+++ b/xen/arch/arm/arm32/lib/assembler.h
@@ -36,8 +36,8 @@
  * Endian independent macros for shifting bytes within registers.
  */
 #ifndef __ARMEB__
-#define pull            lsr
-#define push            lsl
+#define lspull          lsr
+#define lspush          lsl
 #define get_byte_0      lsl #0
 #define get_byte_1	lsr #8
 #define get_byte_2	lsr #16
@@ -47,8 +47,8 @@
 #define put_byte_2	lsl #16
 #define put_byte_3	lsl #24
 #else
-#define pull            lsl
-#define push            lsr
+#define lspull          lsl
+#define lspush          lsr
 #define get_byte_0	lsr #24
 #define get_byte_1	lsr #16
 #define get_byte_2	lsr #8
diff --git a/xen/arch/arm/arm32/lib/bitops.h b/xen/arch/arm/arm32/lib/bitops.h
index 25784c3..a167c2d 100644
--- a/xen/arch/arm/arm32/lib/bitops.h
+++ b/xen/arch/arm/arm32/lib/bitops.h
@@ -37,6 +37,11 @@ UNWIND(	.fnstart	)
 	add	r1, r1, r0, lsl #2	@ Get word offset
 	mov	r3, r2, lsl r3		@ create mask
 	smp_dmb
+#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
+	.arch_extension	mp
+	ALT_SMP(W(pldw)	[r1])
+	ALT_UP(W(nop))
+#endif
 1:	ldrex	r2, [r1]
 	ands	r0, r2, r3		@ save old value of bit
 	\instr	r2, r2, r3		@ toggle bit
diff --git a/xen/arch/arm/arm32/lib/copy_template.S b/xen/arch/arm/arm32/lib/copy_template.S
index 805e3f8..3bc8eb8 100644
--- a/xen/arch/arm/arm32/lib/copy_template.S
+++ b/xen/arch/arm/arm32/lib/copy_template.S
@@ -197,24 +197,24 @@
 
 12:	PLD(	pld	[r1, #124]		)
 13:		ldr4w	r1, r4, r5, r6, r7, abort=19f
-		mov	r3, lr, pull #\pull
+		mov	r3, lr, lspull #\pull
 		subs	r2, r2, #32
 		ldr4w	r1, r8, r9, ip, lr, abort=19f
-		orr	r3, r3, r4, push #\push
-		mov	r4, r4, pull #\pull
-		orr	r4, r4, r5, push #\push
-		mov	r5, r5, pull #\pull
-		orr	r5, r5, r6, push #\push
-		mov	r6, r6, pull #\pull
-		orr	r6, r6, r7, push #\push
-		mov	r7, r7, pull #\pull
-		orr	r7, r7, r8, push #\push
-		mov	r8, r8, pull #\pull
-		orr	r8, r8, r9, push #\push
-		mov	r9, r9, pull #\pull
-		orr	r9, r9, ip, push #\push
-		mov	ip, ip, pull #\pull
-		orr	ip, ip, lr, push #\push
+		orr	r3, r3, r4, lspush #\push
+		mov	r4, r4, lspull #\pull
+		orr	r4, r4, r5, lspush #\push
+		mov	r5, r5, lspull #\pull
+		orr	r5, r5, r6, lspush #\push
+		mov	r6, r6, lspull #\pull
+		orr	r6, r6, r7, lspush #\push
+		mov	r7, r7, lspull #\pull
+		orr	r7, r7, r8, lspush #\push
+		mov	r8, r8, lspull #\pull
+		orr	r8, r8, r9, lspush #\push
+		mov	r9, r9, lspull #\pull
+		orr	r9, r9, ip, lspush #\push
+		mov	ip, ip, lspull #\pull
+		orr	ip, ip, lr, lspush #\push
 		str8w	r0, r3, r4, r5, r6, r7, r8, r9, ip, , abort=19f
 		bge	12b
 	PLD(	cmn	r2, #96			)
@@ -225,10 +225,10 @@
 14:		ands	ip, r2, #28
 		beq	16f
 
-15:		mov	r3, lr, pull #\pull
+15:		mov	r3, lr, lspull #\pull
 		ldr1w	r1, lr, abort=21f
 		subs	ip, ip, #4
-		orr	r3, r3, lr, push #\push
+		orr	r3, r3, lr, lspush #\push
 		str1w	r0, r3, abort=21f
 		bgt	15b
 	CALGN(	cmp	r2, #0			)
diff --git a/xen/arch/arm/arm32/lib/memmove.S b/xen/arch/arm/arm32/lib/memmove.S
index 4e142b8..18634c3 100644
--- a/xen/arch/arm/arm32/lib/memmove.S
+++ b/xen/arch/arm/arm32/lib/memmove.S
@@ -148,24 +148,24 @@ ENTRY(memmove)
 
 12:	PLD(	pld	[r1, #-128]		)
 13:		ldmdb   r1!, {r7, r8, r9, ip}
-		mov     lr, r3, push #\push
+		mov     lr, r3, lspush #\push
 		subs    r2, r2, #32
 		ldmdb   r1!, {r3, r4, r5, r6}
-		orr     lr, lr, ip, pull #\pull
-		mov     ip, ip, push #\push
-		orr     ip, ip, r9, pull #\pull
-		mov     r9, r9, push #\push
-		orr     r9, r9, r8, pull #\pull
-		mov     r8, r8, push #\push
-		orr     r8, r8, r7, pull #\pull
-		mov     r7, r7, push #\push
-		orr     r7, r7, r6, pull #\pull
-		mov     r6, r6, push #\push
-		orr     r6, r6, r5, pull #\pull
-		mov     r5, r5, push #\push
-		orr     r5, r5, r4, pull #\pull
-		mov     r4, r4, push #\push
-		orr     r4, r4, r3, pull #\pull
+		orr     lr, lr, ip, lspull #\pull
+		mov     ip, ip, lspush #\push
+		orr     ip, ip, r9, lspull #\pull
+		mov     r9, r9, lspush #\push
+		orr     r9, r9, r8, lspull #\pull
+		mov     r8, r8, lspush #\push
+		orr     r8, r8, r7, lspull #\pull
+		mov     r7, r7, lspush #\push
+		orr     r7, r7, r6, lspull #\pull
+		mov     r6, r6, lspush #\push
+		orr     r6, r6, r5, lspull #\pull
+		mov     r5, r5, lspush #\push
+		orr     r5, r5, r4, lspull #\pull
+		mov     r4, r4, lspush #\push
+		orr     r4, r4, r3, lspull #\pull
 		stmdb   r0!, {r4 - r9, ip, lr}
 		bge	12b
 	PLD(	cmn	r2, #96			)
@@ -176,10 +176,10 @@ ENTRY(memmove)
 14:		ands	ip, r2, #28
 		beq	16f
 
-15:		mov     lr, r3, push #\push
+15:		mov     lr, r3, lspush #\push
 		ldr	r3, [r1, #-4]!
 		subs	ip, ip, #4
-		orr	lr, lr, r3, pull #\pull
+		orr	lr, lr, r3, lspull #\pull
 		str	lr, [r0, #-4]!
 		bgt	15b
 	CALGN(	cmp	r2, #0			)
diff --git a/xen/include/asm-arm/arm32/atomic.h b/xen/include/asm-arm/arm32/atomic.h
index 3d601d1..7ec712f 100644
--- a/xen/include/asm-arm/arm32/atomic.h
+++ b/xen/include/asm-arm/arm32/atomic.h
@@ -39,6 +39,7 @@ static inline int atomic_add_return(int i, atomic_t *v)
 	int result;
 
 	smp_mb();
+	prefetchw(&v->counter);
 
 	__asm__ __volatile__("@ atomic_add_return\n"
 "1:	ldrex	%0, [%3]\n"
@@ -78,6 +79,7 @@ static inline int atomic_sub_return(int i, atomic_t *v)
 	int result;
 
 	smp_mb();
+	prefetchw(&v->counter);
 
 	__asm__ __volatile__("@ atomic_sub_return\n"
 "1:	ldrex	%0, [%3]\n"
@@ -100,6 +102,7 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 	unsigned long res;
 
 	smp_mb();
+	prefetchw(&ptr->counter);
 
 	do {
 		__asm__ __volatile__("@ atomic_cmpxchg\n"
@@ -117,6 +120,35 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 	return oldval;
 }
 
+static inline int __atomic_add_unless(atomic_t *v, int a, int u)
+{
+	int oldval, newval;
+	unsigned long tmp;
+
+	smp_mb();
+	prefetchw(&v->counter);
+
+	__asm__ __volatile__ ("@ atomic_add_unless\n"
+"1:	ldrex	%0, [%4]\n"
+"	teq	%0, %5\n"
+"	beq	2f\n"
+"	add	%1, %0, %6\n"
+"	strex	%2, %1, [%4]\n"
+"	teq	%2, #0\n"
+"	bne	1b\n"
+"2:"
+	: "=&r" (oldval), "=&r" (newval), "=&r" (tmp), "+Qo" (v->counter)
+	: "r" (&v->counter), "r" (u), "r" (a)
+	: "cc");
+
+	if (oldval != u)
+		smp_mb();
+
+	return oldval;
+}
+
+#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+
 #define atomic_inc(v)		atomic_add(1, v)
 #define atomic_dec(v)		atomic_sub(1, v)
 
diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
index 9a511f2..03e0bed 100644
--- a/xen/include/asm-arm/arm32/cmpxchg.h
+++ b/xen/include/asm-arm/arm32/cmpxchg.h
@@ -1,6 +1,8 @@
 #ifndef __ASM_ARM32_CMPXCHG_H
 #define __ASM_ARM32_CMPXCHG_H
 
+#include <xen/prefetch.h>
+
 extern void __bad_xchg(volatile void *, int);
 
 static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
@@ -9,6 +11,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 	unsigned int tmp;
 
 	smp_mb();
+	prefetchw((const void *)ptr);
 
 	switch (size) {
 	case 1:
@@ -56,6 +59,8 @@ static always_inline unsigned long __cmpxchg(
 {
 	unsigned long oldval, res;
 
+	prefetchw((const void *)ptr);
+
 	switch (size) {
 	case 1:
 		do {
-- 
1.7.10.4


* Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
  2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
@ 2014-07-25 15:36 ` Julien Grall
  2014-08-04 16:16   ` Ian Campbell
  2014-07-25 15:43 ` Ian Campbell
  2 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 15:36 UTC (permalink / raw)
  To: Ian Campbell, xen-devel; +Cc: tim, stefano.stabellini

Hi Ian,

On 07/25/2014 04:22 PM, Ian Campbell wrote:
> The only really interesting changes here are the updates to mem*, which bring
> in actually optimised versions and introduce an optimised memcmp.

I didn't read the whole code as I assume it's just a copy with few
changes from Linux.

Acked-by: Julien Grall <julien.grall@linaro.org>

Regards,

> bitops: No change to the bits we import. Record new baseline.
> 
> cmpxchg: Import:
>   60010e5 arm64: cmpxchg: update macros to prevent warnings
>     Author: Mark Hambleton <mahamble@broadcom.com>
>     Signed-off-by: Mark Hambleton <mahamble@broadcom.com>
>     Signed-off-by: Mark Brown <broonie@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>   e1dfda9 arm64: xchg: prevent warning if return value is unused
>     Author: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>   e1dfda9 resolves the warning which previously caused us to skip 60010e508111.
> 
>   Since arm32 and arm64 now differ here (as do Linux arm and arm64), the
>   existing definition in asm/system.h gets moved to asm/arm32/cmpxchg.h.
>   Previously it shadowed the arm64 one, but the two happened to be identical.
> 
> atomics: Import:
>   8715466 arch,arm64: Convert smp_mb__*()
>     Author: Peter Zijlstra <peterz@infradead.org>
>     Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> 
>   This just drops some unused (by us) smp_mb__*_atomic_*.
> 
> spinlocks: No change. Record new baseline.
> 
> mem*: Import:
>   808dbac arm64: lib: Implement optimized memcpy routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>   280adc1 arm64: lib: Implement optimized memmove routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>   b29a51f arm64: lib: Implement optimized memset routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>   d875c9b arm64: lib: Implement optimized memcmp routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>   These import various routines from Linaro's Cortex Strings library.
> 
>   Added assembler.h, similar to the arm32 one, to define the various magic
>   symbols which these imported routines depend on (e.g. CPU_LE() and CPU_BE()).
> 
> str*: No changes. Record new baseline.
> 
>   Correct the paths in the README.
> 
> *_page: No changes. Record new baseline.
> 
>   README previously said clear_page was unused while copy_page was, which was
>   backwards.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  xen/arch/arm/README.LinuxPrimitives |   36 +++--
>  xen/arch/arm/arm64/lib/Makefile     |    2 +-
>  xen/arch/arm/arm64/lib/assembler.h  |   13 ++
>  xen/arch/arm/arm64/lib/memchr.S     |    1 +
>  xen/arch/arm/arm64/lib/memcmp.S     |  258 +++++++++++++++++++++++++++++++++++
>  xen/arch/arm/arm64/lib/memcpy.S     |  193 +++++++++++++++++++++++---
>  xen/arch/arm/arm64/lib/memmove.S    |  191 ++++++++++++++++++++++----
>  xen/arch/arm/arm64/lib/memset.S     |  208 +++++++++++++++++++++++++---
>  xen/include/asm-arm/arm32/cmpxchg.h |    3 +
>  xen/include/asm-arm/arm64/atomic.h  |    5 -
>  xen/include/asm-arm/arm64/cmpxchg.h |   35 +++--
>  xen/include/asm-arm/string.h        |    5 +
>  xen/include/asm-arm/system.h        |    3 -
>  13 files changed, 844 insertions(+), 109 deletions(-)
>  create mode 100644 xen/arch/arm/arm64/lib/assembler.h
>  create mode 100644 xen/arch/arm/arm64/lib/memcmp.S
> 
> diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
> index 6cd03ca..69eeb70 100644
> --- a/xen/arch/arm/README.LinuxPrimitives
> +++ b/xen/arch/arm/README.LinuxPrimitives
> @@ -6,29 +6,26 @@ were last updated.
>  arm64:
>  =====================================================================
>  
> -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b)
> +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>  
>  linux/arch/arm64/lib/bitops.S           xen/arch/arm/arm64/lib/bitops.S
>  linux/arch/arm64/include/asm/bitops.h   xen/include/asm-arm/arm64/bitops.h
>  
>  ---------------------------------------------------------------------
>  
> -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189)
> +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b)
>  
>  linux/arch/arm64/include/asm/cmpxchg.h  xen/include/asm-arm/arm64/cmpxchg.h
>  
> -Skipped:
> -  60010e5 arm64: cmpxchg: update macros to prevent warnings
> -
>  ---------------------------------------------------------------------
>  
> -atomics: last sync @ v3.14-rc7 (last commit: 95c4189)
> +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>  
>  linux/arch/arm64/include/asm/atomic.h   xen/include/asm-arm/arm64/atomic.h
>  
>  ---------------------------------------------------------------------
>  
> -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189)
> +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9)
>  
>  linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h
>  
> @@ -38,30 +35,31 @@ Skipped:
>  
>  ---------------------------------------------------------------------
>  
> -mem*: last sync @ v3.14-rc7 (last commit: 4a89922)
> +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240)
>  
> -linux/arch/arm64/lib/memchr.S             xen/arch/arm/arm64/lib/memchr.S
> -linux/arch/arm64/lib/memcpy.S             xen/arch/arm/arm64/lib/memcpy.S
> -linux/arch/arm64/lib/memmove.S            xen/arch/arm/arm64/lib/memmove.S
> -linux/arch/arm64/lib/memset.S             xen/arch/arm/arm64/lib/memset.S
> +linux/arch/arm64/lib/memchr.S           xen/arch/arm/arm64/lib/memchr.S
> +linux/arch/arm64/lib/memcmp.S           xen/arch/arm/arm64/lib/memcmp.S
> +linux/arch/arm64/lib/memcpy.S           xen/arch/arm/arm64/lib/memcpy.S
> +linux/arch/arm64/lib/memmove.S          xen/arch/arm/arm64/lib/memmove.S
> +linux/arch/arm64/lib/memset.S           xen/arch/arm/arm64/lib/memset.S
>  
> -for i in memchr.S memcpy.S memmove.S memset.S ; do
> +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do
>      diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
>  done
>  
>  ---------------------------------------------------------------------
>  
> -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8)
> +str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
>  
> -linux/arch/arm/lib/strchr.S             xen/arch/arm/arm64/lib/strchr.S
> -linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm64/lib/strrchr.S
> +linux/arch/arm64/lib/strchr.S           xen/arch/arm/arm64/lib/strchr.S
> +linux/arch/arm64/lib/strrchr.S          xen/arch/arm/arm64/lib/strrchr.S
>  
>  ---------------------------------------------------------------------
>  
> -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13)
> +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
>  
> -linux/arch/arm64/lib/clear_page.S       unused in Xen
> -linux/arch/arm64/lib/copy_page.S        xen/arch/arm/arm64/lib/copy_page.S
> +linux/arch/arm64/lib/clear_page.S       xen/arch/arm/arm64/lib/clear_page.S
> +linux/arch/arm64/lib/copy_page.S        unused in Xen
>  
>  =====================================================================
>  arm32
> diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
> index b895afa..2e7fb64 100644
> --- a/xen/arch/arm/arm64/lib/Makefile
> +++ b/xen/arch/arm/arm64/lib/Makefile
> @@ -1,4 +1,4 @@
> -obj-y += memcpy.o memmove.o memset.o memchr.o
> +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
>  obj-y += clear_page.o
>  obj-y += bitops.o find_next_bit.o
>  obj-y += strchr.o strrchr.o
> diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h
> new file mode 100644
> index 0000000..84669d1
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/assembler.h
> @@ -0,0 +1,13 @@
> +#ifndef __ASM_ASSEMBLER_H__
> +#define __ASM_ASSEMBLER_H__
> +
> +#ifndef __ASSEMBLY__
> +#error "Only include this from assembly code"
> +#endif
> +
> +/* Only LE support so far */
> +#define CPU_BE(x...)
> +#define CPU_LE(x...) x
> +
> +#endif /* __ASM_ASSEMBLER_H__ */
> +
> diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S
> index 3cc1b01..b04590c 100644
> --- a/xen/arch/arm/arm64/lib/memchr.S
> +++ b/xen/arch/arm/arm64/lib/memchr.S
> @@ -18,6 +18,7 @@
>   */
>  
>  #include <xen/config.h>
> +#include "assembler.h"
>  
>  /*
>   * Find a character in an area of memory.
> diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S
> new file mode 100644
> index 0000000..9aad925
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/memcmp.S
> @@ -0,0 +1,258 @@
> +/*
> + * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/config.h>
> +#include "assembler.h"
> +
> +/*
> +* Compare memory areas (when the two areas' alignment offsets differ,
> +* alignment is handled by the hardware).
> +*
> +* Parameters:
> +*  x0 - const memory area 1 pointer
> +*  x1 - const memory area 2 pointer
> +*  x2 - the maximal compare byte length
> +* Returns:
> +*  x0 - a compare result, which may be less than, equal to, or greater than ZERO
> +*/
> +
> +/* Parameters and result.  */
> +src1		.req	x0
> +src2		.req	x1
> +limit		.req	x2
> +result		.req	x0
> +
> +/* Internal variables.  */
> +data1		.req	x3
> +data1w		.req	w3
> +data2		.req	x4
> +data2w		.req	w4
> +has_nul		.req	x5
> +diff		.req	x6
> +endloop		.req	x7
> +tmp1		.req	x8
> +tmp2		.req	x9
> +tmp3		.req	x10
> +pos		.req	x11
> +limit_wd	.req	x12
> +mask		.req	x13
> +
> +ENTRY(memcmp)
> +	cbz	limit, .Lret0
> +	eor	tmp1, src1, src2
> +	tst	tmp1, #7
> +	b.ne	.Lmisaligned8
> +	ands	tmp1, src1, #7
> +	b.ne	.Lmutual_align
> +	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
> +	lsr	limit_wd, limit_wd, #3 /* Convert to Dwords.  */
> +	/*
> +	* The input source addresses are at alignment boundary.
> +	* Directly compare eight bytes each time.
> +	*/
> +.Lloop_aligned:
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +.Lstart_realigned:
> +	subs	limit_wd, limit_wd, #1
> +	eor	diff, data1, data2	/* Non-zero if differences found.  */
> +	csinv	endloop, diff, xzr, cs	/* Last Dword or differences.  */
> +	cbz	endloop, .Lloop_aligned
> +
> +	/* Not reached the limit, must have found a diff.  */
> +	tbz	limit_wd, #63, .Lnot_limit
> +
> +	/* Limit % 8 == 0 => the diff is in the last 8 bytes. */
> +	ands	limit, limit, #7
> +	b.eq	.Lnot_limit
> +	/*
> +	* Fewer than 8 bytes remain. We need to extract the valid data from
> +	* the last eight bytes of the intended memory range.
> +	*/
> +	lsl	limit, limit, #3	/* bytes-> bits.  */
> +	mov	mask, #~0
> +CPU_BE( lsr	mask, mask, limit )
> +CPU_LE( lsl	mask, mask, limit )
> +	bic	data1, data1, mask
> +	bic	data2, data2, mask
> +
> +	orr	diff, diff, mask
> +	b	.Lnot_limit
> +
> +.Lmutual_align:
> +	/*
> +	* Sources are mutually aligned, but are not currently at an
> +	* alignment boundary. Round down the addresses and then mask off
> +	* the bytes that precede the start point.
> +	*/
> +	bic	src1, src1, #7
> +	bic	src2, src2, #7
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +	/*
> +	* We cannot add the alignment offset (tmp1) to limit here, since the
> +	* addition could overflow the limit.
> +	*/
> +	sub	limit_wd, limit, #1/*limit != 0, so no underflow.*/
> +	and	tmp3, limit_wd, #7
> +	lsr	limit_wd, limit_wd, #3
> +	add	tmp3, tmp3, tmp1
> +	add	limit_wd, limit_wd, tmp3, lsr #3
> +	add	limit, limit, tmp1/* Adjust the limit for the extra.  */
> +
> +	lsl	tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
> +	neg	tmp1, tmp1/* Bits to alignment -64.  */
> +	mov	tmp2, #~0
> +	/*mask off the non-intended bytes before the start address.*/
> +CPU_BE( lsl	tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
> +	/* Little-endian.  Early bytes are at LSB.  */
> +CPU_LE( lsr	tmp2, tmp2, tmp1 )
> +
> +	orr	data1, data1, tmp2
> +	orr	data2, data2, tmp2
> +	b	.Lstart_realigned
> +
> +	/*src1 and src2 have different alignment offset.*/
> +.Lmisaligned8:
> +	cmp	limit, #8
> +	b.lo	.Ltiny8proc /*limit < 8: compare byte by byte*/
> +
> +	and	tmp1, src1, #7
> +	neg	tmp1, tmp1
> +	add	tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
> +	and	tmp2, src2, #7
> +	neg	tmp2, tmp2
> +	add	tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
> +	subs	tmp3, tmp1, tmp2
> +	csel	pos, tmp1, tmp2, hi /*Choose the maximum.*/
> +
> +	sub	limit, limit, pos
> +	/* Compare the leading bytes in the first 8-byte segment. */
> +.Ltinycmp:
> +	ldrb	data1w, [src1], #1
> +	ldrb	data2w, [src2], #1
> +	subs	pos, pos, #1
> +	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000.  */
> +	b.eq	.Ltinycmp
> +	cbnz	pos, 1f /*diff occurred before the last byte.*/
> +	cmp	data1w, data2w
> +	b.eq	.Lstart_align
> +1:
> +	sub	result, data1, data2
> +	ret
> +
> +.Lstart_align:
> +	lsr	limit_wd, limit, #3
> +	cbz	limit_wd, .Lremain8
> +
> +	ands	xzr, src1, #7
> +	b.eq	.Lrecal_offset
> +	/*process more leading bytes to make src1 aligned...*/
> +	add	src1, src1, tmp3 /*backwards src1 to alignment boundary*/
> +	add	src2, src2, tmp3
> +	sub	limit, limit, tmp3
> +	lsr	limit_wd, limit, #3
> +	cbz	limit_wd, .Lremain8
> +	/*load 8 bytes from aligned SRC1..*/
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +
> +	subs	limit_wd, limit_wd, #1
> +	eor	diff, data1, data2  /*Non-zero if differences found.*/
> +	csinv	endloop, diff, xzr, ne
> +	cbnz	endloop, .Lunequal_proc
> +	/*How far is the current SRC2 from the alignment boundary...*/
> +	and	tmp3, tmp3, #7
> +
> +.Lrecal_offset:/*src1 is aligned now..*/
> +	neg	pos, tmp3
> +.Lloopcmp_proc:
> +	/*
> +	* Divide the eight bytes into two parts. First, step src2 back to
> +	* an alignment boundary, load eight bytes and compare from the
> +	* SRC2 alignment boundary. If all 8 bytes are equal, then start
> +	* the second part's comparison. Otherwise finish the comparison.
> +	* This special handling guarantees that all accesses stay within
> +	* the thread/task space, avoiding out-of-range accesses.
> +	*/
> +	ldr	data1, [src1,pos]
> +	ldr	data2, [src2,pos]
> +	eor	diff, data1, data2  /* Non-zero if differences found.  */
> +	cbnz	diff, .Lnot_limit
> +
> +	/*The second part process*/
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +	eor	diff, data1, data2  /* Non-zero if differences found.  */
> +	subs	limit_wd, limit_wd, #1
> +	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
> +	cbz	endloop, .Lloopcmp_proc
> +.Lunequal_proc:
> +	cbz	diff, .Lremain8
> +
> +/* A difference occurred in the latest comparison. */
> +.Lnot_limit:
> +/*
> +* For little-endian, byte-reverse the data so the low-order equal bits move
> +* to the MSB end; the following CLZ can then count how many equal bits exist.
> +*/
> +CPU_LE( rev	diff, diff )
> +CPU_LE( rev	data1, data1 )
> +CPU_LE( rev	data2, data2 )
> +
> +	/*
> +	* The MS-non-zero bit of DIFF marks either the first bit
> +	* that is different, or the end of the significant data.
> +	* Shifting left now will bring the critical information into the
> +	* top bits.
> +	*/
> +	clz	pos, diff
> +	lsl	data1, data1, pos
> +	lsl	data2, data2, pos
> +	/*
> +	* We need to zero-extend (char is unsigned) the value and then
> +	* perform a signed subtraction.
> +	*/
> +	lsr	data1, data1, #56
> +	sub	result, data1, data2, lsr #56
> +	ret
> +
> +.Lremain8:
> +	/* Limit % 8 == 0 => all data are equal. */
> +	ands	limit, limit, #7
> +	b.eq	.Lret0
> +
> +.Ltiny8proc:
> +	ldrb	data1w, [src1], #1
> +	ldrb	data2w, [src2], #1
> +	subs	limit, limit, #1
> +
> +	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000. */
> +	b.eq	.Ltiny8proc
> +	sub	result, data1, data2
> +	ret
> +.Lret0:
> +	mov	result, #0
> +	ret
> +ENDPROC(memcmp)
> diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S
> index c8197c6..7cc885d 100644
> --- a/xen/arch/arm/arm64/lib/memcpy.S
> +++ b/xen/arch/arm/arm64/lib/memcpy.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Copy a buffer from src to dest (alignment handled by the hardware)
> @@ -26,27 +36,166 @@
>   * Returns:
>   *	x0 - dest
>   */
> +dstin	.req	x0
> +src	.req	x1
> +count	.req	x2
> +tmp1	.req	x3
> +tmp1w	.req	w3
> +tmp2	.req	x4
> +tmp2w	.req	w4
> +tmp3	.req	x5
> +tmp3w	.req	w5
> +dst	.req	x6
> +
> +A_l	.req	x7
> +A_h	.req	x8
> +B_l	.req	x9
> +B_h	.req	x10
> +C_l	.req	x11
> +C_h	.req	x12
> +D_l	.req	x13
> +D_h	.req	x14
> +
>  ENTRY(memcpy)
> -	mov	x4, x0
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	ldr	x3, [x1], #8
> -	subs	x2, x2, #8
> -	str	x3, [x4], #8
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	ldr	w3, [x1], #4
> -	sub	x2, x2, #4
> -	str	w3, [x4], #4
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	ldrh	w3, [x1], #2
> -	sub	x2, x2, #2
> -	strh	w3, [x4], #2
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	ldrb	w3, [x1]
> -	strb	w3, [x4]
> -5:	ret
> +	mov	dst, dstin
> +	cmp	count, #16
> +	/* When the length is less than 16, the accesses are not aligned. */
> +	b.lo	.Ltiny15
> +
> +	neg	tmp2, src
> +	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
> +	b.eq	.LSrcAligned
> +	sub	count, count, tmp2
> +	/*
> +	* Copy the leading data from src to dst in increasing address
> +	* order. This eliminates the risk of overwriting the source
> +	* data when the distance between src and dst is less than 16.
> +	* The memory accesses here are aligned.
> +	*/
> +	tbz	tmp2, #0, 1f
> +	ldrb	tmp1w, [src], #1
> +	strb	tmp1w, [dst], #1
> +1:
> +	tbz	tmp2, #1, 2f
> +	ldrh	tmp1w, [src], #2
> +	strh	tmp1w, [dst], #2
> +2:
> +	tbz	tmp2, #2, 3f
> +	ldr	tmp1w, [src], #4
> +	str	tmp1w, [dst], #4
> +3:
> +	tbz	tmp2, #3, .LSrcAligned
> +	ldr	tmp1, [src],#8
> +	str	tmp1, [dst],#8
> +
> +.LSrcAligned:
> +	cmp	count, #64
> +	b.ge	.Lcpy_over64
> +	/*
> +	* Deal with small copies quickly by dropping straight into the
> +	* exit block.
> +	*/
> +.Ltail63:
> +	/*
> +	* Copy up to 48 bytes of data. At this point we only need the
> +	* bottom 6 bits of count to be accurate.
> +	*/
> +	ands	tmp1, count, #0x30
> +	b.eq	.Ltiny15
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +1:
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +2:
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +.Ltiny15:
> +	/*
> +	* Prefer to break one ldp/stp into several loads/stores that access
> +	* memory in increasing address order, rather than to load/store 16
> +	* bytes from (src-16) to (dst-16) and step src back to an aligned
> +	* address, as the original cortex memcpy does. If the original memcpy
> +	* scheme were kept here, memmove would need to satisfy the
> +	* precondition that the src address is at least 16 bytes above the
> +	* dst address, otherwise some source data would be overwritten when
> +	* memmove calls memcpy directly. To keep memmove simpler and decouple
> +	* memcpy from memmove, the original scheme was dropped.
> +	*/
> +	tbz	count, #3, 1f
> +	ldr	tmp1, [src], #8
> +	str	tmp1, [dst], #8
> +1:
> +	tbz	count, #2, 2f
> +	ldr	tmp1w, [src], #4
> +	str	tmp1w, [dst], #4
> +2:
> +	tbz	count, #1, 3f
> +	ldrh	tmp1w, [src], #2
> +	strh	tmp1w, [dst], #2
> +3:
> +	tbz	count, #0, .Lexitfunc
> +	ldrb	tmp1w, [src]
> +	strb	tmp1w, [dst]
> +
> +.Lexitfunc:
> +	ret
> +
> +.Lcpy_over64:
> +	subs	count, count, #128
> +	b.ge	.Lcpy_body_large
> +	/*
> +	* Less than 128 bytes to copy, so handle 64 here and then jump
> +	* to the tail.
> +	*/
> +	ldp	A_l, A_h, [src],#16
> +	stp	A_l, A_h, [dst],#16
> +	ldp	B_l, B_h, [src],#16
> +	ldp	C_l, C_h, [src],#16
> +	stp	B_l, B_h, [dst],#16
> +	stp	C_l, C_h, [dst],#16
> +	ldp	D_l, D_h, [src],#16
> +	stp	D_l, D_h, [dst],#16
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
> +
> +	/*
> +	* Critical loop.  Start at a new cache line boundary.  Assuming
> +	* 64 bytes per line this ensures the entire loop is in one line.
> +	*/
> +	.p2align	L1_CACHE_SHIFT
> +.Lcpy_body_large:
> +	/* pre-get 64 bytes data. */
> +	ldp	A_l, A_h, [src],#16
> +	ldp	B_l, B_h, [src],#16
> +	ldp	C_l, C_h, [src],#16
> +	ldp	D_l, D_h, [src],#16
> +1:
> +	/*
> +	* Interleave the load of the next 64-byte block with the store of
> +	* the previously loaded 64 bytes.
> +	*/
> +	stp	A_l, A_h, [dst],#16
> +	ldp	A_l, A_h, [src],#16
> +	stp	B_l, B_h, [dst],#16
> +	ldp	B_l, B_h, [src],#16
> +	stp	C_l, C_h, [dst],#16
> +	ldp	C_l, C_h, [src],#16
> +	stp	D_l, D_h, [dst],#16
> +	ldp	D_l, D_h, [src],#16
> +	subs	count, count, #64
> +	b.ge	1b
> +	stp	A_l, A_h, [dst],#16
> +	stp	B_l, B_h, [dst],#16
> +	stp	C_l, C_h, [dst],#16
> +	stp	D_l, D_h, [dst],#16
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
>  ENDPROC(memcpy)
> diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S
> index 1bf0936..f4065b9 100644
> --- a/xen/arch/arm/arm64/lib/memmove.S
> +++ b/xen/arch/arm/arm64/lib/memmove.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Move a buffer from src to dest (alignment handled by the hardware).
> @@ -27,30 +37,161 @@
>   * Returns:
>   *	x0 - dest
>   */
> +dstin	.req	x0
> +src	.req	x1
> +count	.req	x2
> +tmp1	.req	x3
> +tmp1w	.req	w3
> +tmp2	.req	x4
> +tmp2w	.req	w4
> +tmp3	.req	x5
> +tmp3w	.req	w5
> +dst	.req	x6
> +
> +A_l	.req	x7
> +A_h	.req	x8
> +B_l	.req	x9
> +B_h	.req	x10
> +C_l	.req	x11
> +C_h	.req	x12
> +D_l	.req	x13
> +D_h	.req	x14
> +
>  ENTRY(memmove)
> -	cmp	x0, x1
> -	b.ls	memcpy
> -	add	x4, x0, x2
> -	add	x1, x1, x2
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	ldr	x3, [x1, #-8]!
> -	subs	x2, x2, #8
> -	str	x3, [x4, #-8]!
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	ldr	w3, [x1, #-4]!
> -	sub	x2, x2, #4
> -	str	w3, [x4, #-4]!
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	ldrh	w3, [x1, #-2]!
> -	sub	x2, x2, #2
> -	strh	w3, [x4, #-2]!
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	ldrb	w3, [x1, #-1]
> -	strb	w3, [x4, #-1]
> -5:	ret
> +	cmp	dstin, src
> +	b.lo	memcpy
> +	add	tmp1, src, count
> +	cmp	dstin, tmp1
> +	b.hs	memcpy		/* No overlap.  */
> +
> +	add	dst, dstin, count
> +	add	src, src, count
> +	cmp	count, #16
> +	b.lo	.Ltail15  /*probably non-alignment accesses.*/
> +
> +	ands	tmp2, src, #15     /* Bytes to reach alignment.  */
> +	b.eq	.LSrcAligned
> +	sub	count, count, tmp2
> +	/*
> +	* Process the unaligned offset first so that src becomes aligned.
> +	* The cost of these extra instructions is acceptable, and it makes
> +	* the subsequent accesses use aligned addresses.
> +	*/
> +	tbz	tmp2, #0, 1f
> +	ldrb	tmp1w, [src, #-1]!
> +	strb	tmp1w, [dst, #-1]!
> +1:
> +	tbz	tmp2, #1, 2f
> +	ldrh	tmp1w, [src, #-2]!
> +	strh	tmp1w, [dst, #-2]!
> +2:
> +	tbz	tmp2, #2, 3f
> +	ldr	tmp1w, [src, #-4]!
> +	str	tmp1w, [dst, #-4]!
> +3:
> +	tbz	tmp2, #3, .LSrcAligned
> +	ldr	tmp1, [src, #-8]!
> +	str	tmp1, [dst, #-8]!
> +
> +.LSrcAligned:
> +	cmp	count, #64
> +	b.ge	.Lcpy_over64
> +
> +	/*
> +	* Deal with small copies quickly by dropping straight into the
> +	* exit block.
> +	*/
> +.Ltail63:
> +	/*
> +	* Copy up to 48 bytes of data. At this point we only need the
> +	* bottom 6 bits of count to be accurate.
> +	*/
> +	ands	tmp1, count, #0x30
> +	b.eq	.Ltail15
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	ldp	A_l, A_h, [src, #-16]!
> +	stp	A_l, A_h, [dst, #-16]!
> +1:
> +	ldp	A_l, A_h, [src, #-16]!
> +	stp	A_l, A_h, [dst, #-16]!
> +2:
> +	ldp	A_l, A_h, [src, #-16]!
> +	stp	A_l, A_h, [dst, #-16]!
> +
> +.Ltail15:
> +	tbz	count, #3, 1f
> +	ldr	tmp1, [src, #-8]!
> +	str	tmp1, [dst, #-8]!
> +1:
> +	tbz	count, #2, 2f
> +	ldr	tmp1w, [src, #-4]!
> +	str	tmp1w, [dst, #-4]!
> +2:
> +	tbz	count, #1, 3f
> +	ldrh	tmp1w, [src, #-2]!
> +	strh	tmp1w, [dst, #-2]!
> +3:
> +	tbz	count, #0, .Lexitfunc
> +	ldrb	tmp1w, [src, #-1]
> +	strb	tmp1w, [dst, #-1]
> +
> +.Lexitfunc:
> +	ret
> +
> +.Lcpy_over64:
> +	subs	count, count, #128
> +	b.ge	.Lcpy_body_large
> +	/*
> +	* Less than 128 bytes to copy, so handle 64 bytes here and then jump
> +	* to the tail.
> +	*/
> +	ldp	A_l, A_h, [src, #-16]
> +	stp	A_l, A_h, [dst, #-16]
> +	ldp	B_l, B_h, [src, #-32]
> +	ldp	C_l, C_h, [src, #-48]
> +	stp	B_l, B_h, [dst, #-32]
> +	stp	C_l, C_h, [dst, #-48]
> +	ldp	D_l, D_h, [src, #-64]!
> +	stp	D_l, D_h, [dst, #-64]!
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
> +
> +	/*
> +	* Critical loop. Start at a new cache line boundary. Assuming
> +	* 64 bytes per line this ensures the entire loop is in one line.
> +	*/
> +	.p2align	L1_CACHE_SHIFT
> +.Lcpy_body_large:
> +	/* pre-load 64 bytes data. */
> +	ldp	A_l, A_h, [src, #-16]
> +	ldp	B_l, B_h, [src, #-32]
> +	ldp	C_l, C_h, [src, #-48]
> +	ldp	D_l, D_h, [src, #-64]!
> +1:
> +	/*
> +	* Interleave the load of the next 64-byte block with the store of
> +	* the previously loaded 64 bytes.
> +	*/
> +	stp	A_l, A_h, [dst, #-16]
> +	ldp	A_l, A_h, [src, #-16]
> +	stp	B_l, B_h, [dst, #-32]
> +	ldp	B_l, B_h, [src, #-32]
> +	stp	C_l, C_h, [dst, #-48]
> +	ldp	C_l, C_h, [src, #-48]
> +	stp	D_l, D_h, [dst, #-64]!
> +	ldp	D_l, D_h, [src, #-64]!
> +	subs	count, count, #64
> +	b.ge	1b
> +	stp	A_l, A_h, [dst, #-16]
> +	stp	B_l, B_h, [dst, #-32]
> +	stp	C_l, C_h, [dst, #-48]
> +	stp	D_l, D_h, [dst, #-64]!
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
>  ENDPROC(memmove)
> diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
> index 25a4fb6..4ee714d 100644
> --- a/xen/arch/arm/arm64/lib/memset.S
> +++ b/xen/arch/arm/arm64/lib/memset.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Fill in the buffer with character c (alignment handled by the hardware)
> @@ -26,27 +36,181 @@
>   * Returns:
>   *	x0 - buf
>   */
> +
> +dstin		.req	x0
> +val		.req	w1
> +count		.req	x2
> +tmp1		.req	x3
> +tmp1w		.req	w3
> +tmp2		.req	x4
> +tmp2w		.req	w4
> +zva_len_x	.req	x5
> +zva_len		.req	w5
> +zva_bits_x	.req	x6
> +
> +A_l		.req	x7
> +A_lw		.req	w7
> +dst		.req	x8
> +tmp3w		.req	w9
> +tmp3		.req	x9
> +
>  ENTRY(memset)
> -	mov	x4, x0
> -	and	w1, w1, #0xff
> -	orr	w1, w1, w1, lsl #8
> -	orr	w1, w1, w1, lsl #16
> -	orr	x1, x1, x1, lsl #32
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	str	x1, [x4], #8
> -	subs	x2, x2, #8
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	sub	x2, x2, #4
> -	str	w1, [x4], #4
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	sub	x2, x2, #2
> -	strh	w1, [x4], #2
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	strb	w1, [x4]
> -5:	ret
> +	mov	dst, dstin	/* Preserve return value.  */
> +	and	A_lw, val, #255
> +	orr	A_lw, A_lw, A_lw, lsl #8
> +	orr	A_lw, A_lw, A_lw, lsl #16
> +	orr	A_l, A_l, A_l, lsl #32
> +
> +	cmp	count, #15
> +	b.hi	.Lover16_proc
> +	/* All stores may be unaligned. */
> +	tbz	count, #3, 1f
> +	str	A_l, [dst], #8
> +1:
> +	tbz	count, #2, 2f
> +	str	A_lw, [dst], #4
> +2:
> +	tbz	count, #1, 3f
> +	strh	A_lw, [dst], #2
> +3:
> +	tbz	count, #0, 4f
> +	strb	A_lw, [dst]
> +4:
> +	ret
> +
> +.Lover16_proc:
> +	/* Check whether the start address is 16-byte aligned. */
> +	neg	tmp2, dst
> +	ands	tmp2, tmp2, #15
> +	b.eq	.Laligned
> +/*
> +* The count is at least 16, so we can use stp to store the first 16 bytes
> +* and then advance dst to a 16-byte boundary. This leaves the current
> +* memory address on an alignment boundary.
> +*/
> +	stp	A_l, A_l, [dst] /*non-aligned store..*/
> +	/*make the dst aligned..*/
> +	sub	count, count, tmp2
> +	add	dst, dst, tmp2
> +
> +.Laligned:
> +	cbz	A_l, .Lzero_mem
> +
> +.Ltail_maybe_long:
> +	cmp	count, #64
> +	b.ge	.Lnot_short
> +.Ltail63:
> +	ands	tmp1, count, #0x30
> +	b.eq	3f
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	stp	A_l, A_l, [dst], #16
> +1:
> +	stp	A_l, A_l, [dst], #16
> +2:
> +	stp	A_l, A_l, [dst], #16
> +/*
> +* The remaining length is less than 16; use stp to write the last 16 bytes.
> +* Some bytes are written twice and the access is unaligned.
> +*/
> +3:
> +	ands	count, count, #15
> +	cbz	count, 4f
> +	add	dst, dst, count
> +	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
> +4:
> +	ret
> +
> +	/*
> +	* Critical loop. Start at a new cache line boundary. Assuming
> +	* 64 bytes per line, this ensures the entire loop is in one line.
> +	*/
> +	.p2align	L1_CACHE_SHIFT
> +.Lnot_short:
> +	sub	dst, dst, #16/* Pre-bias.  */
> +	sub	count, count, #64
> +1:
> +	stp	A_l, A_l, [dst, #16]
> +	stp	A_l, A_l, [dst, #32]
> +	stp	A_l, A_l, [dst, #48]
> +	stp	A_l, A_l, [dst, #64]!
> +	subs	count, count, #64
> +	b.ge	1b
> +	tst	count, #0x3f
> +	add	dst, dst, #16
> +	b.ne	.Ltail63
> +.Lexitfunc:
> +	ret
> +
> +	/*
> +	* For zeroing memory, check to see if we can use the ZVA feature to
> +	* zero entire 'cache' lines.
> +	*/
> +.Lzero_mem:
> +	cmp	count, #63
> +	b.le	.Ltail63
> +	/*
> +	* For zeroing small amounts of memory, it's not worth setting up
> +	* the line-clear code.
> +	*/
> +	cmp	count, #128
> +	b.lt	.Lnot_short /*count is at least  128 bytes*/
> +
> +	mrs	tmp1, dczid_el0
> +	tbnz	tmp1, #4, .Lnot_short
> +	mov	tmp3w, #4
> +	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
> +	lsl	zva_len, tmp3w, zva_len
> +
> +	ands	tmp3w, zva_len, #63
> +	/*
> +	* ensure the zva_len is not less than 64.
> +	* It is not meaningful to use ZVA if the block size is less than 64.
> +	*/
> +	b.ne	.Lnot_short
> +.Lzero_by_line:
> +	/*
> +	* Compute how far we need to go to become suitably aligned. We're
> +	* already at quad-word alignment.
> +	*/
> +	cmp	count, zva_len_x
> +	b.lt	.Lnot_short		/* Not enough to reach alignment.  */
> +	sub	zva_bits_x, zva_len_x, #1
> +	neg	tmp2, dst
> +	ands	tmp2, tmp2, zva_bits_x
> +	b.eq	2f			/* Already aligned.  */
> +	/* Not aligned, check that there's enough to copy after alignment.*/
> +	sub	tmp1, count, tmp2
> +	/*
> +	* Guarantee that the remaining length to be zeroed with ZVA is at
> +	* least 64, so the processing at 2f cannot run past the end of the range. */
> +	cmp	tmp1, #64
> +	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
> +	b.lt	.Lnot_short
> +	/*
> +	* We know that there's at least 64 bytes to zero and that it's safe
> +	* to overrun by 64 bytes.
> +	*/
> +	mov	count, tmp1
> +1:
> +	stp	A_l, A_l, [dst]
> +	stp	A_l, A_l, [dst, #16]
> +	stp	A_l, A_l, [dst, #32]
> +	subs	tmp2, tmp2, #64
> +	stp	A_l, A_l, [dst, #48]
> +	add	dst, dst, #64
> +	b.ge	1b
> +	/* We've overrun a bit, so adjust dst downwards.*/
> +	add	dst, dst, tmp2
> +2:
> +	sub	count, count, zva_len_x
> +3:
> +	dc	zva, dst
> +	add	dst, dst, zva_len_x
> +	subs	count, count, zva_len_x
> +	b.ge	3b
> +	ands	count, count, zva_bits_x
> +	b.ne	.Ltail_maybe_long
> +	ret
>  ENDPROC(memset)
> diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
> index 3f4e7a1..9a511f2 100644
> --- a/xen/include/asm-arm/arm32/cmpxchg.h
> +++ b/xen/include/asm-arm/arm32/cmpxchg.h
> @@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>  	return ret;
>  }
>  
> +#define xchg(ptr,x) \
> +	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +
>  /*
>   * Atomic compare and exchange.  Compare OLD with MEM, if identical,
>   * store NEW in MEM.  Return the initial value in MEM.  Success is
> diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
> index b5d50f2..b49219e 100644
> --- a/xen/include/asm-arm/arm64/atomic.h
> +++ b/xen/include/asm-arm/arm64/atomic.h
> @@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
>  
>  #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
>  
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #endif
>  /*
>   * Local variables:
> diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
> index 4e930ce..ae42b2f 100644
> --- a/xen/include/asm-arm/arm64/cmpxchg.h
> +++ b/xen/include/asm-arm/arm64/cmpxchg.h
> @@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>  }
>  
>  #define xchg(ptr,x) \
> -	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +({ \
> +	__typeof__(*(ptr)) __ret; \
> +	__ret = (__typeof__(*(ptr))) \
> +		__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
> +	__ret; \
> +})
>  
>  extern void __bad_cmpxchg(volatile void *ptr, int size);
>  
> @@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
>  	return ret;
>  }
>  
> -#define cmpxchg(ptr,o,n)						\
> -	((__typeof__(*(ptr)))__cmpxchg_mb((ptr),			\
> -					  (unsigned long)(o),		\
> -					  (unsigned long)(n),		\
> -					  sizeof(*(ptr))))
> -
> -#define cmpxchg_local(ptr,o,n)						\
> -	((__typeof__(*(ptr)))__cmpxchg((ptr),				\
> -				       (unsigned long)(o),		\
> -				       (unsigned long)(n),		\
> -				       sizeof(*(ptr))))
> +#define cmpxchg(ptr, o, n) \
> +({ \
> +	__typeof__(*(ptr)) __ret; \
> +	__ret = (__typeof__(*(ptr))) \
> +		__cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
> +			     sizeof(*(ptr))); \
> +	__ret; \
> +})
> +
> +#define cmpxchg_local(ptr, o, n) \
> +({ \
> +	__typeof__(*(ptr)) __ret; \
> +	__ret = (__typeof__(*(ptr))) \
> +		__cmpxchg((ptr), (unsigned long)(o), \
> +			  (unsigned long)(n), sizeof(*(ptr))); \
> +	__ret; \
> +})
>  
>  #endif
>  /*
> diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
> index 3242762..dfad1fe 100644
> --- a/xen/include/asm-arm/string.h
> +++ b/xen/include/asm-arm/string.h
> @@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
>  #define __HAVE_ARCH_MEMCPY
>  extern void * memcpy(void *, const void *, __kernel_size_t);
>  
> +#if defined(CONFIG_ARM_64)
> +#define __HAVE_ARCH_MEMCMP
> +extern int memcmp(const void *, const void *, __kernel_size_t);
> +#endif
> +
>  /* Some versions of gcc don't have this builtin. It's non-critical anyway. */
>  #define __HAVE_ARCH_MEMMOVE
>  extern void *memmove(void *dest, const void *src, size_t n);
> diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
> index 7aaaf50..ce3d38a 100644
> --- a/xen/include/asm-arm/system.h
> +++ b/xen/include/asm-arm/system.h
> @@ -33,9 +33,6 @@
>  
>  #define smp_wmb()       dmb(ishst)
>  
> -#define xchg(ptr,x) \
> -        ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> -
>  /*
>   * This is used to ensure the compiler did actually allocate the register we
>   * asked it for some inline assembly sequences.  Apparently we can't trust
> 


-- 
Julien Grall

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
@ 2014-07-25 15:42   ` Julien Grall
  2014-07-25 15:48     ` Ian Campbell
  0 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 15:42 UTC (permalink / raw)
  To: Ian Campbell, xen-devel; +Cc: tim, stefano.stabellini

Hi Ian,

On 07/25/2014 04:22 PM, Ian Campbell wrote:
> bitops, cmpxchg, atomics: Import:
>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics

Compared to Linux we don't have specific prefetch* helpers; we directly
use the compiler built-in ones. Shouldn't we import the ARM-specific
helpers to gain some performance?
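
Something like the following is what I have in mind -- only a rough sketch,
not a verbatim import (the Linux helper also falls back to a plain pld on
!SMP builds, and the exact form below is illustrative):

    /* Hypothetical arm32 prefetchw using the ARMv7 MP extension PLDW hint. */
    static inline void prefetchw(const void *ptr)
    {
        __asm__ __volatile__(
            ".arch_extension mp\n"
            "	pldw	%a0"
            :: "p" (ptr));
    }

The read-side prefetch() would have the same shape, using "pld %a0".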

Regards,

>     Author: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
> 
> atomics: In addition to the above import:
>   db38ee8 ARM: 7983/1: atomics: implement a better __atomic_add_unless for v6+
>     Author: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
> 
> spinlocks: We have diverged from Linux, so no updates but note this in the README.
> 
> mem* and str*: Import:
>   d98b90e ARM: 7990/1: asm: rename logical shift macros push pull into lspush lspull
>     Author: Victor Kamensky <victor.kamensky@linaro.org>
>     Suggested-by: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
>     Acked-by: Nicolas Pitre <nico@linaro.org>
>     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
> 
>   For some reason str* were mentioned under mem* in the README, fix.
> 
> libgcc: No changes, update baseline
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  xen/arch/arm/README.LinuxPrimitives    |   17 +++++++--------
>  xen/arch/arm/arm32/lib/assembler.h     |    8 +++----
>  xen/arch/arm/arm32/lib/bitops.h        |    5 +++++
>  xen/arch/arm/arm32/lib/copy_template.S |   36 ++++++++++++++++----------------
>  xen/arch/arm/arm32/lib/memmove.S       |   36 ++++++++++++++++----------------
>  xen/include/asm-arm/arm32/atomic.h     |   32 ++++++++++++++++++++++++++++
>  xen/include/asm-arm/arm32/cmpxchg.h    |    5 +++++
>  7 files changed, 90 insertions(+), 49 deletions(-)
> 
> diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
> index 69eeb70..7e15b04 100644
> --- a/xen/arch/arm/README.LinuxPrimitives
> +++ b/xen/arch/arm/README.LinuxPrimitives
> @@ -65,7 +65,7 @@ linux/arch/arm64/lib/copy_page.S        unused in Xen
>  arm32
>  =====================================================================
>  
> -bitops: last sync @ v3.14-rc7 (last commit: b7ec699)
> +bitops: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
>  
>  linux/arch/arm/lib/bitops.h             xen/arch/arm/arm32/lib/bitops.h
>  linux/arch/arm/lib/changebit.S          xen/arch/arm/arm32/lib/changebit.S
> @@ -83,13 +83,13 @@ done
>  
>  ---------------------------------------------------------------------
>  
> -cmpxchg: last sync @ v3.14-rc7 (last commit: 775ebcc)
> +cmpxchg: last sync @ v3.16-rc6 (last commit: c32ffce0f66e)
>  
>  linux/arch/arm/include/asm/cmpxchg.h    xen/include/asm-arm/arm32/cmpxchg.h
>  
>  ---------------------------------------------------------------------
>  
> -atomics: last sync @ v3.14-rc7 (last commit: aed3a4e)
> +atomics: last sync @ v3.16-rc6 (last commit: 030d0178bdbd)
>  
>  linux/arch/arm/include/asm/atomic.h     xen/include/asm-arm/arm32/atomic.h
>  
> @@ -99,6 +99,8 @@ spinlocks: last sync: 15e7e5c1ebf5
>  
>  linux/arch/arm/include/asm/spinlock.h   xen/include/asm-arm/arm32/spinlock.h
>  
> +*** Linux has switched to ticket locks but we still use bitlocks.
> +
>  resync to v3.14-rc7:
>  
>    7c8746a ARM: 7955/1: spinlock: ensure we have a compiler barrier before sev
> @@ -111,7 +113,7 @@ resync to v3.14-rc7:
>  
>  ---------------------------------------------------------------------
>  
> -mem*: last sync @ v3.14-rc7 (last commit: 418df63a)
> +mem*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
>  
>  linux/arch/arm/lib/copy_template.S      xen/arch/arm/arm32/lib/copy_template.S
>  linux/arch/arm/lib/memchr.S             xen/arch/arm/arm32/lib/memchr.S
> @@ -120,9 +122,6 @@ linux/arch/arm/lib/memmove.S            xen/arch/arm/arm32/lib/memmove.S
>  linux/arch/arm/lib/memset.S             xen/arch/arm/arm32/lib/memset.S
>  linux/arch/arm/lib/memzero.S            xen/arch/arm/arm32/lib/memzero.S
>  
> -linux/arch/arm/lib/strchr.S             xen/arch/arm/arm32/lib/strchr.S
> -linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm32/lib/strrchr.S
> -
>  for i in copy_template.S memchr.S memcpy.S memmove.S memset.S \
>           memzero.S ; do
>      diff -u linux/arch/arm/lib/$i xen/arch/arm/arm32/lib/$i
> @@ -130,7 +129,7 @@ done
>  
>  ---------------------------------------------------------------------
>  
> -str*: last sync @ v3.13-rc7 (last commit: 93ed397)
> +str*: last sync @ v3.16-rc6 (last commit: d98b90ea22b0)
>  
>  linux/arch/arm/lib/strchr.S             xen/arch/arm/arm32/lib/strchr.S
>  linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm32/lib/strrchr.S
> @@ -145,7 +144,7 @@ clear_page == memset
>  
>  ---------------------------------------------------------------------
>  
> -libgcc: last sync @ v3.14-rc7 (last commit: 01885bc)
> +libgcc: last sync @ v3.16-rc6 (last commit: 01885bc)
>  
>  linux/arch/arm/lib/lib1funcs.S          xen/arch/arm/arm32/lib/lib1funcs.S
>  linux/arch/arm/lib/lshrdi3.S            xen/arch/arm/arm32/lib/lshrdi3.S
> diff --git a/xen/arch/arm/arm32/lib/assembler.h b/xen/arch/arm/arm32/lib/assembler.h
> index f8d4b3a..6de2638 100644
> --- a/xen/arch/arm/arm32/lib/assembler.h
> +++ b/xen/arch/arm/arm32/lib/assembler.h
> @@ -36,8 +36,8 @@
>   * Endian independent macros for shifting bytes within registers.
>   */
>  #ifndef __ARMEB__
> -#define pull            lsr
> -#define push            lsl
> +#define lspull          lsr
> +#define lspush          lsl
>  #define get_byte_0      lsl #0
>  #define get_byte_1	lsr #8
>  #define get_byte_2	lsr #16
> @@ -47,8 +47,8 @@
>  #define put_byte_2	lsl #16
>  #define put_byte_3	lsl #24
>  #else
> -#define pull            lsl
> -#define push            lsr
> +#define lspull          lsl
> +#define lspush          lsr
>  #define get_byte_0	lsr #24
>  #define get_byte_1	lsr #16
>  #define get_byte_2	lsr #8
> diff --git a/xen/arch/arm/arm32/lib/bitops.h b/xen/arch/arm/arm32/lib/bitops.h
> index 25784c3..a167c2d 100644
> --- a/xen/arch/arm/arm32/lib/bitops.h
> +++ b/xen/arch/arm/arm32/lib/bitops.h
> @@ -37,6 +37,11 @@ UNWIND(	.fnstart	)
>  	add	r1, r1, r0, lsl #2	@ Get word offset
>  	mov	r3, r2, lsl r3		@ create mask
>  	smp_dmb
> +#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
> +	.arch_extension	mp
> +	ALT_SMP(W(pldw)	[r1])
> +	ALT_UP(W(nop))
> +#endif
>  1:	ldrex	r2, [r1]
>  	ands	r0, r2, r3		@ save old value of bit
>  	\instr	r2, r2, r3		@ toggle bit
> diff --git a/xen/arch/arm/arm32/lib/copy_template.S b/xen/arch/arm/arm32/lib/copy_template.S
> index 805e3f8..3bc8eb8 100644
> --- a/xen/arch/arm/arm32/lib/copy_template.S
> +++ b/xen/arch/arm/arm32/lib/copy_template.S
> @@ -197,24 +197,24 @@
>  
>  12:	PLD(	pld	[r1, #124]		)
>  13:		ldr4w	r1, r4, r5, r6, r7, abort=19f
> -		mov	r3, lr, pull #\pull
> +		mov	r3, lr, lspull #\pull
>  		subs	r2, r2, #32
>  		ldr4w	r1, r8, r9, ip, lr, abort=19f
> -		orr	r3, r3, r4, push #\push
> -		mov	r4, r4, pull #\pull
> -		orr	r4, r4, r5, push #\push
> -		mov	r5, r5, pull #\pull
> -		orr	r5, r5, r6, push #\push
> -		mov	r6, r6, pull #\pull
> -		orr	r6, r6, r7, push #\push
> -		mov	r7, r7, pull #\pull
> -		orr	r7, r7, r8, push #\push
> -		mov	r8, r8, pull #\pull
> -		orr	r8, r8, r9, push #\push
> -		mov	r9, r9, pull #\pull
> -		orr	r9, r9, ip, push #\push
> -		mov	ip, ip, pull #\pull
> -		orr	ip, ip, lr, push #\push
> +		orr	r3, r3, r4, lspush #\push
> +		mov	r4, r4, lspull #\pull
> +		orr	r4, r4, r5, lspush #\push
> +		mov	r5, r5, lspull #\pull
> +		orr	r5, r5, r6, lspush #\push
> +		mov	r6, r6, lspull #\pull
> +		orr	r6, r6, r7, lspush #\push
> +		mov	r7, r7, lspull #\pull
> +		orr	r7, r7, r8, lspush #\push
> +		mov	r8, r8, lspull #\pull
> +		orr	r8, r8, r9, lspush #\push
> +		mov	r9, r9, lspull #\pull
> +		orr	r9, r9, ip, lspush #\push
> +		mov	ip, ip, lspull #\pull
> +		orr	ip, ip, lr, lspush #\push
>  		str8w	r0, r3, r4, r5, r6, r7, r8, r9, ip, , abort=19f
>  		bge	12b
>  	PLD(	cmn	r2, #96			)
> @@ -225,10 +225,10 @@
>  14:		ands	ip, r2, #28
>  		beq	16f
>  
> -15:		mov	r3, lr, pull #\pull
> +15:		mov	r3, lr, lspull #\pull
>  		ldr1w	r1, lr, abort=21f
>  		subs	ip, ip, #4
> -		orr	r3, r3, lr, push #\push
> +		orr	r3, r3, lr, lspush #\push
>  		str1w	r0, r3, abort=21f
>  		bgt	15b
>  	CALGN(	cmp	r2, #0			)
> diff --git a/xen/arch/arm/arm32/lib/memmove.S b/xen/arch/arm/arm32/lib/memmove.S
> index 4e142b8..18634c3 100644
> --- a/xen/arch/arm/arm32/lib/memmove.S
> +++ b/xen/arch/arm/arm32/lib/memmove.S
> @@ -148,24 +148,24 @@ ENTRY(memmove)
>  
>  12:	PLD(	pld	[r1, #-128]		)
>  13:		ldmdb   r1!, {r7, r8, r9, ip}
> -		mov     lr, r3, push #\push
> +		mov     lr, r3, lspush #\push
>  		subs    r2, r2, #32
>  		ldmdb   r1!, {r3, r4, r5, r6}
> -		orr     lr, lr, ip, pull #\pull
> -		mov     ip, ip, push #\push
> -		orr     ip, ip, r9, pull #\pull
> -		mov     r9, r9, push #\push
> -		orr     r9, r9, r8, pull #\pull
> -		mov     r8, r8, push #\push
> -		orr     r8, r8, r7, pull #\pull
> -		mov     r7, r7, push #\push
> -		orr     r7, r7, r6, pull #\pull
> -		mov     r6, r6, push #\push
> -		orr     r6, r6, r5, pull #\pull
> -		mov     r5, r5, push #\push
> -		orr     r5, r5, r4, pull #\pull
> -		mov     r4, r4, push #\push
> -		orr     r4, r4, r3, pull #\pull
> +		orr     lr, lr, ip, lspull #\pull
> +		mov     ip, ip, lspush #\push
> +		orr     ip, ip, r9, lspull #\pull
> +		mov     r9, r9, lspush #\push
> +		orr     r9, r9, r8, lspull #\pull
> +		mov     r8, r8, lspush #\push
> +		orr     r8, r8, r7, lspull #\pull
> +		mov     r7, r7, lspush #\push
> +		orr     r7, r7, r6, lspull #\pull
> +		mov     r6, r6, lspush #\push
> +		orr     r6, r6, r5, lspull #\pull
> +		mov     r5, r5, lspush #\push
> +		orr     r5, r5, r4, lspull #\pull
> +		mov     r4, r4, lspush #\push
> +		orr     r4, r4, r3, lspull #\pull
>  		stmdb   r0!, {r4 - r9, ip, lr}
>  		bge	12b
>  	PLD(	cmn	r2, #96			)
> @@ -176,10 +176,10 @@ ENTRY(memmove)
>  14:		ands	ip, r2, #28
>  		beq	16f
>  
> -15:		mov     lr, r3, push #\push
> +15:		mov     lr, r3, lspush #\push
>  		ldr	r3, [r1, #-4]!
>  		subs	ip, ip, #4
> -		orr	lr, lr, r3, pull #\pull
> +		orr	lr, lr, r3, lspull #\pull
>  		str	lr, [r0, #-4]!
>  		bgt	15b
>  	CALGN(	cmp	r2, #0			)
> diff --git a/xen/include/asm-arm/arm32/atomic.h b/xen/include/asm-arm/arm32/atomic.h
> index 3d601d1..7ec712f 100644
> --- a/xen/include/asm-arm/arm32/atomic.h
> +++ b/xen/include/asm-arm/arm32/atomic.h
> @@ -39,6 +39,7 @@ static inline int atomic_add_return(int i, atomic_t *v)
>  	int result;
>  
>  	smp_mb();
> +	prefetchw(&v->counter);
>  
>  	__asm__ __volatile__("@ atomic_add_return\n"
>  "1:	ldrex	%0, [%3]\n"
> @@ -78,6 +79,7 @@ static inline int atomic_sub_return(int i, atomic_t *v)
>  	int result;
>  
>  	smp_mb();
> +	prefetchw(&v->counter);
>  
>  	__asm__ __volatile__("@ atomic_sub_return\n"
>  "1:	ldrex	%0, [%3]\n"
> @@ -100,6 +102,7 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
>  	unsigned long res;
>  
>  	smp_mb();
> +	prefetchw(&ptr->counter);
>  
>  	do {
>  		__asm__ __volatile__("@ atomic_cmpxchg\n"
> @@ -117,6 +120,35 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
>  	return oldval;
>  }
>  
> +static inline int __atomic_add_unless(atomic_t *v, int a, int u)
> +{
> +	int oldval, newval;
> +	unsigned long tmp;
> +
> +	smp_mb();
> +	prefetchw(&v->counter);
> +
> +	__asm__ __volatile__ ("@ atomic_add_unless\n"
> +"1:	ldrex	%0, [%4]\n"
> +"	teq	%0, %5\n"
> +"	beq	2f\n"
> +"	add	%1, %0, %6\n"
> +"	strex	%2, %1, [%4]\n"
> +"	teq	%2, #0\n"
> +"	bne	1b\n"
> +"2:"
> +	: "=&r" (oldval), "=&r" (newval), "=&r" (tmp), "+Qo" (v->counter)
> +	: "r" (&v->counter), "r" (u), "r" (a)
> +	: "cc");
> +
> +	if (oldval != u)
> +		smp_mb();
> +
> +	return oldval;
> +}
> +
> +#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
> +
>  #define atomic_inc(v)		atomic_add(1, v)
>  #define atomic_dec(v)		atomic_sub(1, v)
>  
> diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
> index 9a511f2..03e0bed 100644
> --- a/xen/include/asm-arm/arm32/cmpxchg.h
> +++ b/xen/include/asm-arm/arm32/cmpxchg.h
> @@ -1,6 +1,8 @@
>  #ifndef __ASM_ARM32_CMPXCHG_H
>  #define __ASM_ARM32_CMPXCHG_H
>  
> +#include <xen/prefetch.h>
> +
>  extern void __bad_xchg(volatile void *, int);
>  
>  static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
> @@ -9,6 +11,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>  	unsigned int tmp;
>  
>  	smp_mb();
> +	prefetchw((const void *)ptr);
>  
>  	switch (size) {
>  	case 1:
> @@ -56,6 +59,8 @@ static always_inline unsigned long __cmpxchg(
>  {
>  	unsigned long oldval, res;
>  
> +	prefetchw((const void *)ptr);
> +
>  	switch (size) {
>  	case 1:
>  		do {
> 


-- 
Julien Grall

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
  2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
  2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
@ 2014-07-25 15:43 ` Ian Campbell
  2 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:43 UTC (permalink / raw)
  To: xen-devel; +Cc: julien.grall, tim, stefano.stabellini

On Fri, 2014-07-25 at 16:22 +0100, Ian Campbell wrote:
> str*: No changes. Record new baseline.

I missed that there were some new primitives (str[n]len and str[n]cmp).
Rather than respin this big patch, here is a follow-up:

8<-------------------

>From 66c115115122ca21035d55f486ea2eed1e284dd7 Mon Sep 17 00:00:00 2001
Message-Id: <66c115115122ca21035d55f486ea2eed1e284dd7.1406302952.git.ian.campbell@citrix.com>
From: Ian Campbell <ian.campbell@citrix.com>
Date: Fri, 25 Jul 2014 16:31:46 +0100
Subject: [PATCH] xen: arm: Add new str* primitives from Linux v3.16-rc6.

Imports:
  0a42cb0 arm64: lib: Implement optimized string length routines
    Author: zhichang.yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  192c4d9 arm64: lib: Implement optimized string compare routines
    Author: zhichang.yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
    Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 xen/arch/arm/README.LinuxPrimitives |   10 +-
 xen/arch/arm/arm64/lib/Makefile     |    2 +-
 xen/arch/arm/arm64/lib/strcmp.S     |  235 ++++++++++++++++++++++++++
 xen/arch/arm/arm64/lib/strlen.S     |  128 ++++++++++++++
 xen/arch/arm/arm64/lib/strncmp.S    |  311 +++++++++++++++++++++++++++++++++++
 xen/arch/arm/arm64/lib/strnlen.S    |  172 +++++++++++++++++++
 xen/include/asm-arm/string.h        |   14 ++
 7 files changed, 870 insertions(+), 2 deletions(-)
 create mode 100644 xen/arch/arm/arm64/lib/strcmp.S
 create mode 100644 xen/arch/arm/arm64/lib/strlen.S
 create mode 100644 xen/arch/arm/arm64/lib/strncmp.S
 create mode 100644 xen/arch/arm/arm64/lib/strnlen.S

diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
index 7e15b04..7f33fc7 100644
--- a/xen/arch/arm/README.LinuxPrimitives
+++ b/xen/arch/arm/README.LinuxPrimitives
@@ -49,11 +49,19 @@ done
 
 ---------------------------------------------------------------------
 
-str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
+str*: last sync @ v3.16-rc6 (last commit: 0a42cb0a6fa6)
 
 linux/arch/arm64/lib/strchr.S           xen/arch/arm/arm64/lib/strchr.S
+linux/arch/arm64/lib/strcmp.S           xen/arch/arm/arm64/lib/strcmp.S
+linux/arch/arm64/lib/strlen.S           xen/arch/arm/arm64/lib/strlen.S
+linux/arch/arm64/lib/strncmp.S          xen/arch/arm/arm64/lib/strncmp.S
+linux/arch/arm64/lib/strnlen.S          xen/arch/arm/arm64/lib/strnlen.S
 linux/arch/arm64/lib/strrchr.S          xen/arch/arm/arm64/lib/strrchr.S
 
+for i in strchr.S strcmp.S strlen.S strncmp.S strnlen.S strrchr.S ; do
+    diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
+done
+
 ---------------------------------------------------------------------
 
 {clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
index 2e7fb64..1b9c7a9 100644
--- a/xen/arch/arm/arm64/lib/Makefile
+++ b/xen/arch/arm/arm64/lib/Makefile
@@ -1,4 +1,4 @@
 obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
 obj-y += clear_page.o
 obj-y += bitops.o find_next_bit.o
-obj-y += strchr.o strrchr.o
+obj-y += strchr.o strcmp.o strlen.o strncmp.o strnlen.o strrchr.o
diff --git a/xen/arch/arm/arm64/lib/strcmp.S b/xen/arch/arm/arm64/lib/strcmp.S
new file mode 100644
index 0000000..bdcf7b0
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strcmp.S
@@ -0,0 +1,235 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ *	x0 - const string 1 pointer
+ *	x1 - const string 2 pointer
+ * Returns:
+ *	x0 - an integer less than, equal to, or greater than zero
+ *	if s1 is found, respectively, to be less than, to match,
+ *	or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result.  */
+src1		.req	x0
+src2		.req	x1
+result		.req	x0
+
+/* Internal variables.  */
+data1		.req	x2
+data1w		.req	w2
+data2		.req	x3
+data2w		.req	w3
+has_nul		.req	x4
+diff		.req	x5
+syndrome	.req	x6
+tmp1		.req	x7
+tmp2		.req	x8
+tmp3		.req	x9
+zeroones	.req	x10
+pos		.req	x11
+
+ENTRY(strcmp)
+	eor	tmp1, src1, src2
+	mov	zeroones, #REP8_01
+	tst	tmp1, #7
+	b.ne	.Lmisaligned8
+	ands	tmp1, src1, #7
+	b.ne	.Lmutual_align
+
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+.Lloop_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+.Lstart_realigned:
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	bic	has_nul, tmp1, tmp2	/* Non-zero if NUL terminator.  */
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, .Lloop_aligned
+	b	.Lcal_cmpresult
+
+.Lmutual_align:
+	/*
+	* Sources are mutually aligned, but are not currently at an
+	* alignment boundary.  Round down the addresses and then mask off
+	* the bytes that precede the start point.
+	*/
+	bic	src1, src1, #7
+	bic	src2, src2, #7
+	lsl	tmp1, tmp1, #3		/* Bytes beyond alignment -> bits.  */
+	ldr	data1, [src1], #8
+	neg	tmp1, tmp1		/* Bits to alignment -64.  */
+	ldr	data2, [src2], #8
+	mov	tmp2, #~0
+	/* Big-endian.  Early bytes are at MSB.  */
+CPU_BE( lsl	tmp2, tmp2, tmp1 )	/* Shift (tmp1 & 63).  */
+	/* Little-endian.  Early bytes are at LSB.  */
+CPU_LE( lsr	tmp2, tmp2, tmp1 )	/* Shift (tmp1 & 63).  */
+
+	orr	data1, data1, tmp2
+	orr	data2, data2, tmp2
+	b	.Lstart_realigned
+
+.Lmisaligned8:
+	/*
+	* Get the align offset length to compare per byte first.
+	* After this process, one string's address will be aligned.
+	*/
+	and	tmp1, src1, #7
+	neg	tmp1, tmp1
+	add	tmp1, tmp1, #8
+	and	tmp2, src2, #7
+	neg	tmp2, tmp2
+	add	tmp2, tmp2, #8
+	subs	tmp3, tmp1, tmp2
+	csel	pos, tmp1, tmp2, hi /*Choose the maximum. */
+.Ltinycmp:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	pos, pos, #1
+	ccmp	data1w, #1, #0, ne  /* NZCV = 0b0000.  */
+	ccmp	data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
+	b.eq	.Ltinycmp
+	cbnz	pos, 1f /*find the null or unequal...*/
+	cmp	data1w, #1
+	ccmp	data1w, data2w, #0, cs
+	b.eq	.Lstart_align /*the last bytes are equal....*/
+1:
+	sub	result, data1, data2
+	ret
+
+.Lstart_align:
+	ands	xzr, src1, #7
+	b.eq	.Lrecal_offset
+	/*process more leading bytes to make str1 aligned...*/
+	add	src1, src1, tmp3
+	add	src2, src2, tmp3
+	/*load 8 bytes from aligned str1 and non-aligned str2..*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	eor	diff, data1, data2 /* Non-zero if differences found.  */
+	orr	syndrome, diff, has_nul
+	cbnz	syndrome, .Lcal_cmpresult
+	/*How far is the current str2 from the alignment boundary...*/
+	and	tmp3, tmp3, #7
+.Lrecal_offset:
+	neg	pos, tmp3
+.Lloopcmp_proc:
+	/*
+	* Divide the eight bytes into two parts. First, step src2 back to
+	* an alignment boundary, load eight bytes from the SRC2 alignment
+	* boundary, then compare with the corresponding bytes from SRC1.
+	* If all 8 bytes are equal, then start the second part's comparison.
+	* Otherwise finish the comparison.
+	* This special handling guarantees that all accesses stay within
+	* the thread/task space, avoiding out-of-range accesses.
+	*/
+	ldr	data1, [src1,pos]
+	ldr	data2, [src2,pos]
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	orr	syndrome, diff, has_nul
+	cbnz	syndrome, .Lcal_cmpresult
+
+	/*The second part process*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, .Lloopcmp_proc
+
+.Lcal_cmpresult:
+	/*
+	* Reverse the byte order (as for big-endian) so that CLZ can find
+	* the most significant zero bits.
+	*/
+CPU_LE( rev	syndrome, syndrome )
+CPU_LE( rev	data1, data1 )
+CPU_LE( rev	data2, data2 )
+
+	/*
+	* For big-endian we cannot use the trick with the syndrome value
+	* as carry-propagation can corrupt the upper bits if the trailing
+	* bytes in the string contain 0x01.
+	* However, if there is no NUL byte in the dword, we can generate
+	* the result directly.  We ca not just subtract the bytes as the
+	* the result directly.  We cannot just subtract the bytes as the
+	*/
+CPU_BE( cbnz	has_nul, 1f )
+CPU_BE( cmp	data1, data2 )
+CPU_BE( cset	result, ne )
+CPU_BE( cneg	result, result, lo )
+CPU_BE( ret )
+CPU_BE( 1: )
+	/*Re-compute the NUL-byte detection, using a byte-reversed value. */
+CPU_BE(	rev	tmp3, data1 )
+CPU_BE(	sub	tmp1, tmp3, zeroones )
+CPU_BE(	orr	tmp2, tmp3, #REP8_7f )
+CPU_BE(	bic	has_nul, tmp1, tmp2 )
+CPU_BE(	rev	has_nul, has_nul )
+CPU_BE(	orr	syndrome, diff, has_nul )
+
+	clz	pos, syndrome
+	/*
+	* The MS-non-zero bit of the syndrome marks either the first bit
+	* that is different, or the top bit of the first zero byte.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/*
+	* But we need to zero-extend (char is unsigned) the value and then
+	* perform a signed 32-bit subtraction.
+	*/
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+ENDPROC(strcmp)
diff --git a/xen/arch/arm/arm64/lib/strlen.S b/xen/arch/arm/arm64/lib/strlen.S
new file mode 100644
index 0000000..ee055a2
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strlen.S
@@ -0,0 +1,128 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+
+/*
+ * calculate the length of a string
+ *
+ * Parameters:
+ *	x0 - const string pointer
+ * Returns:
+ *	x0 - the return length of specific string
+ */
+
+/* Arguments and results.  */
+srcin		.req	x0
+len		.req	x0
+
+/* Locals and temporaries.  */
+src		.req	x1
+data1		.req	x2
+data2		.req	x3
+data2a		.req	x4
+has_nul1	.req	x5
+has_nul2	.req	x6
+tmp1		.req	x7
+tmp2		.req	x8
+tmp3		.req	x9
+tmp4		.req	x10
+zeroones	.req	x11
+pos		.req	x12
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strlen)
+	mov	zeroones, #REP8_01
+	bic	src, srcin, #15
+	ands	tmp1, srcin, #15
+	b.ne	.Lmisaligned
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+	/*
+	* The inner loop deals with two Dwords at a time. This has a
+	* slightly higher start-up cost, but we should win quite quickly,
+	* especially on cores with a high number of issue slots per
+	* cycle, as we get much better parallelism out of the operations.
+	*/
+.Lloop:
+	ldp	data1, data2, [src], #16
+.Lrealigned:
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+	bics	has_nul2, tmp3, tmp4
+	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
+	b.eq	.Lloop
+
+	sub	len, src, srcin
+	cbz	has_nul1, .Lnul_in_data2
+CPU_BE(	mov	data2, data1 )	/*prepare data to re-calculate the syndrome*/
+	sub	len, len, #8
+	mov	has_nul2, has_nul1
+.Lnul_in_data2:
+	/*
+	* For big-endian, carry propagation (if the final byte in the
+	* string is 0x01) means we cannot use has_nul directly.  The
+	* easiest way to get the correct byte is to byte-swap the data
+	* and calculate the syndrome a second time.
+	*/
+CPU_BE( rev	data2, data2 )
+CPU_BE( sub	tmp1, data2, zeroones )
+CPU_BE( orr	tmp2, data2, #REP8_7f )
+CPU_BE( bic	has_nul2, tmp1, tmp2 )
+
+	sub	len, len, #8
+	rev	has_nul2, has_nul2
+	clz	pos, has_nul2
+	add	len, len, pos, lsr #3		/* Bits to bytes.  */
+	ret
+
+.Lmisaligned:
+	cmp	tmp1, #8
+	neg	tmp1, tmp1
+	ldp	data1, data2, [src], #16
+	lsl	tmp1, tmp1, #3		/* Bytes beyond alignment -> bits.  */
+	mov	tmp2, #~0
+	/* Big-endian.  Early bytes are at MSB.  */
+CPU_BE( lsl	tmp2, tmp2, tmp1 )	/* Shift (tmp1 & 63).  */
+	/* Little-endian.  Early bytes are at LSB.  */
+CPU_LE( lsr	tmp2, tmp2, tmp1 )	/* Shift (tmp1 & 63).  */
+
+	orr	data1, data1, tmp2
+	orr	data2a, data2, tmp2
+	csinv	data1, data1, xzr, le
+	csel	data2, data2, data2a, le
+	b	.Lrealigned
+ENDPROC(strlen)
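
The "(X - 1) & (~X) & 0x80" comment above is the usual SWAR zero-byte test.
As a hedged illustration (editor's sketch, not imported code), the
sub/orr/bic sequence in the loop computes:

#include <stdint.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL

/* Non-zero iff at least one byte of x is zero; note that
 * ~(x | 0x7f..7f) == ~x & 0x80..80. */
static uint64_t has_zero_byte(uint64_t x)
{
    return (x - REP8_01) & ~(x | REP8_7f);
}

On little-endian the lowest flagged byte is exactly the first NUL, which is
why the tail byte-reverses the mask and converts clz to a byte index with
"lsr #3"; on big-endian, borrow propagation can flag the wrong byte, hence
the CPU_BE() recomputation on byte-reversed data.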
diff --git a/xen/arch/arm/arm64/lib/strncmp.S b/xen/arch/arm/arm64/lib/strncmp.S
new file mode 100644
index 0000000..ca2e4a6
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strncmp.S
@@ -0,0 +1,311 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ *  x0 - const string 1 pointer
+ *  x1 - const string 2 pointer
+ *  x2 - the maximal length to be compared
+ * Returns:
+ *  x0 - an integer less than, equal to, or greater than zero if s1 is found,
+ *     respectively, to be less than, to match, or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result.  */
+src1		.req	x0
+src2		.req	x1
+limit		.req	x2
+result		.req	x0
+
+/* Internal variables.  */
+data1		.req	x3
+data1w		.req	w3
+data2		.req	x4
+data2w		.req	w4
+has_nul		.req	x5
+diff		.req	x6
+syndrome	.req	x7
+tmp1		.req	x8
+tmp2		.req	x9
+tmp3		.req	x10
+zeroones	.req	x11
+pos		.req	x12
+limit_wd	.req	x13
+mask		.req	x14
+endloop		.req	x15
+
+ENTRY(strncmp)
+	cbz	limit, .Lret0
+	eor	tmp1, src1, src2
+	mov	zeroones, #REP8_01
+	tst	tmp1, #7
+	b.ne	.Lmisaligned8
+	ands	tmp1, src1, #7
+	b.ne	.Lmutual_align
+	/* Calculate the number of full and partial words -1.  */
+	/*
+	* When limit is a multiple of 8, if we don't subtract 1
+	* the handling of the last dword will be wrong.
+	*/
+	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
+	lsr	limit_wd, limit_wd, #3  /* Convert to Dwords.  */
+
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+.Lloop_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+.Lstart_realigned:
+	subs	limit_wd, limit_wd, #1
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, pl  /* Last Dword or differences.*/
+	bics	has_nul, tmp1, tmp2 /* Non-zero if NUL terminator.  */
+	ccmp	endloop, #0, #0, eq
+	b.eq	.Lloop_aligned
+
+	/*Not reached the limit, must have found the end or a diff.  */
+	tbz	limit_wd, #63, .Lnot_limit
+
+	/* Limit % 8 == 0 => all bytes significant.  */
+	ands	limit, limit, #7
+	b.eq	.Lnot_limit
+
+	lsl	limit, limit, #3    /* Bits -> bytes.  */
+	mov	mask, #~0
+CPU_BE( lsr	mask, mask, limit )
+CPU_LE( lsl	mask, mask, limit )
+	bic	data1, data1, mask
+	bic	data2, data2, mask
+
+	/* Make sure that the NUL byte is marked in the syndrome.  */
+	orr	has_nul, has_nul, mask
+
+.Lnot_limit:
+	orr	syndrome, diff, has_nul
+	b	.Lcal_cmpresult
+
+.Lmutual_align:
+	/*
+	* Sources are mutually aligned, but are not currently at an
+	* alignment boundary.  Round down the addresses and then mask off
+	* the bytes that precede the start point.
+	* We also need to adjust the limit calculations, but without
+	* overflowing if the limit is near ULONG_MAX.
+	*/
+	bic	src1, src1, #7
+	bic	src2, src2, #7
+	ldr	data1, [src1], #8
+	neg	tmp3, tmp1, lsl #3  /* 64 - bits(bytes beyond align). */
+	ldr	data2, [src2], #8
+	mov	tmp2, #~0
+	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
+	/* Big-endian.  Early bytes are at MSB.  */
+CPU_BE( lsl	tmp2, tmp2, tmp3 )	/* Shift (tmp1 & 63).  */
+	/* Little-endian.  Early bytes are at LSB.  */
+CPU_LE( lsr	tmp2, tmp2, tmp3 )	/* Shift (tmp1 & 63).  */
+
+	and	tmp3, limit_wd, #7
+	lsr	limit_wd, limit_wd, #3
+	/* Adjust the limit. Only low 3 bits used, so overflow irrelevant.*/
+	add	limit, limit, tmp1
+	add	tmp3, tmp3, tmp1
+	orr	data1, data1, tmp2
+	orr	data2, data2, tmp2
+	add	limit_wd, limit_wd, tmp3, lsr #3
+	b	.Lstart_realigned
+
+/*when src1 offset is not equal to src2 offset...*/
+.Lmisaligned8:
+	cmp	limit, #8
+	b.lo	.Ltiny8proc /*limit < 8... */
+	/*
+	* Work out how many leading bytes to compare one at a time; after
+	* that step one string's address will be aligned.*/
+	and	tmp1, src1, #7
+	neg	tmp1, tmp1
+	add	tmp1, tmp1, #8
+	and	tmp2, src2, #7
+	neg	tmp2, tmp2
+	add	tmp2, tmp2, #8
+	subs	tmp3, tmp1, tmp2
+	csel	pos, tmp1, tmp2, hi /*Choose the maximum. */
+	/*
+	* Here, limit is not less than 8, so directly run .Ltinycmp
+	* without checking the limit.*/
+	sub	limit, limit, pos
+.Ltinycmp:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	pos, pos, #1
+	ccmp	data1w, #1, #0, ne  /* NZCV = 0b0000.  */
+	ccmp	data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
+	b.eq	.Ltinycmp
+	cbnz	pos, 1f /*find the null or unequal...*/
+	cmp	data1w, #1
+	ccmp	data1w, data2w, #0, cs
+	b.eq	.Lstart_align /*the last bytes are equal....*/
+1:
+	sub	result, data1, data2
+	ret
+
+.Lstart_align:
+	lsr	limit_wd, limit, #3
+	cbz	limit_wd, .Lremain8
+	/*process more leading bytes to make str1 aligned...*/
+	ands	xzr, src1, #7
+	b.eq	.Lrecal_offset
+	add	src1, src1, tmp3	/*tmp3 is positive in this branch.*/
+	add	src2, src2, tmp3
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	sub	limit, limit, tmp3
+	lsr	limit_wd, limit, #3
+	subs	limit_wd, limit_wd, #1
+
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+	bics	has_nul, tmp1, tmp2
+	ccmp	endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+	b.ne	.Lunequal_proc
+	/*How far is the current str2 from the alignment boundary...*/
+	and	tmp3, tmp3, #7
+.Lrecal_offset:
+	neg	pos, tmp3
+.Lloopcmp_proc:
+	/*
+	* Divide the eight bytes into two parts. First, step src2 back to an
+	* alignment boundary, load eight bytes from that boundary and compare
+	* them with the corresponding bytes from SRC1.
+	* If all 8 bytes are equal, start the second part's comparison.
+	* Otherwise finish the comparison here.
+	* This special handling guarantees that all accesses stay within the
+	* thread/task address space, with no out-of-range access.
+	*/
+	ldr	data1, [src1,pos]
+	ldr	data2, [src2,pos]
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bics	has_nul, tmp1, tmp2 /* Non-zero if NUL terminator.  */
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, eq
+	cbnz	endloop, .Lunequal_proc
+
+	/*The second part process*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	subs	limit_wd, limit_wd, #1
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+	bics	has_nul, tmp1, tmp2
+	ccmp	endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+	b.eq	.Lloopcmp_proc
+
+.Lunequal_proc:
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, .Lremain8
+.Lcal_cmpresult:
+	/*
+	* Byte-swap the data so that the first byte of the string becomes the
+	* most significant one; CLZ on the syndrome then locates it.
+	*/
+CPU_LE( rev	syndrome, syndrome )
+CPU_LE( rev	data1, data1 )
+CPU_LE( rev	data2, data2 )
+	/*
+	* For big-endian we cannot use the trick with the syndrome value
+	* as carry-propagation can corrupt the upper bits if the trailing
+	* bytes in the string contain 0x01.
+	* However, if there is no NUL byte in the dword, we can generate
+	* the result directly.  We can't just subtract the bytes as the
+	* MSB might be significant.
+	*/
+CPU_BE( cbnz	has_nul, 1f )
+CPU_BE( cmp	data1, data2 )
+CPU_BE( cset	result, ne )
+CPU_BE( cneg	result, result, lo )
+CPU_BE( ret )
+CPU_BE( 1: )
+	/* Re-compute the NUL-byte detection, using a byte-reversed value.*/
+CPU_BE( rev	tmp3, data1 )
+CPU_BE( sub	tmp1, tmp3, zeroones )
+CPU_BE( orr	tmp2, tmp3, #REP8_7f )
+CPU_BE( bic	has_nul, tmp1, tmp2 )
+CPU_BE( rev	has_nul, has_nul )
+CPU_BE( orr	syndrome, diff, has_nul )
+	/*
+	* The MS-non-zero bit of the syndrome marks either the first bit
+	* that is different, or the top bit of the first zero byte.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	clz	pos, syndrome
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/*
+	* But we need to zero-extend (char is unsigned) the value and then
+	* perform a signed 32-bit subtraction.
+	*/
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+
+.Lremain8:
+	/* Limit % 8 == 0 => all bytes significant.  */
+	ands	limit, limit, #7
+	b.eq	.Lret0
+.Ltiny8proc:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	limit, limit, #1
+
+	ccmp	data1w, #1, #0, ne  /* NZCV = 0b0000.  */
+	ccmp	data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
+	b.eq	.Ltiny8proc
+	sub	result, data1, data2
+	ret
+
+.Lret0:
+	mov	result, #0
+	ret
+ENDPROC(strncmp)
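
For reference, the byte-at-a-time tails (.Ltinycmp / .Ltiny8proc) boil down,
per byte, to the loop sketched below. This is an editor's illustration only
(byte_strncmp is a made-up name); the real code interleaves it with the
realignment and limit bookkeeping:

/* Stop at the count, at the first NUL, or at the first differing byte. */
static int byte_strncmp(const unsigned char *s1, const unsigned char *s2,
                        unsigned long n)
{
    while (n--) {
        unsigned int c1 = *s1++, c2 = *s2++;

        if (c1 != c2 || c1 == '\0')
            return (int)c1 - (int)c2;
    }
    return 0;
}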
diff --git a/xen/arch/arm/arm64/lib/strnlen.S b/xen/arch/arm/arm64/lib/strnlen.S
new file mode 100644
index 0000000..8aa5bbf
--- /dev/null
+++ b/xen/arch/arm/arm64/lib/strnlen.S
@@ -0,0 +1,172 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/config.h>
+
+#include "assembler.h"
+
+/*
+ * determine the length of a fixed-size string
+ *
+ * Parameters:
+ *	x0 - const string pointer
+ *	x1 - maximal string length
+ * Returns:
+ *	x0 - the return length of specific string
+ */
+
+/* Arguments and results.  */
+srcin		.req	x0
+len		.req	x0
+limit		.req	x1
+
+/* Locals and temporaries.  */
+src		.req	x2
+data1		.req	x3
+data2		.req	x4
+data2a		.req	x5
+has_nul1	.req	x6
+has_nul2	.req	x7
+tmp1		.req	x8
+tmp2		.req	x9
+tmp3		.req	x10
+tmp4		.req	x11
+zeroones	.req	x12
+pos		.req	x13
+limit_wd	.req	x14
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strnlen)
+	cbz	limit, .Lhit_limit
+	mov	zeroones, #REP8_01
+	bic	src, srcin, #15
+	ands	tmp1, srcin, #15
+	b.ne	.Lmisaligned
+	/* Calculate the number of full and partial words -1.  */
+	sub	limit_wd, limit, #1 /* Limit != 0, so no underflow.  */
+	lsr	limit_wd, limit_wd, #4  /* Convert to Qwords.  */
+
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+	/*
+	* The inner loop deals with two Dwords at a time.  This has a
+	* slightly higher start-up cost, but we should win quite quickly,
+	* especially on cores with a high number of issue slots per
+	* cycle, as we get much better parallelism out of the operations.
+	*/
+.Lloop:
+	ldp	data1, data2, [src], #16
+.Lrealigned:
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+	bic	has_nul2, tmp3, tmp4
+	subs	limit_wd, limit_wd, #1
+	orr	tmp1, has_nul1, has_nul2
+	ccmp	tmp1, #0, #0, pl    /* NZCV = 0000  */
+	b.eq	.Lloop
+
+	cbz	tmp1, .Lhit_limit   /* No null in final Qword.  */
+
+	/*
+	* We know there's a null in the final Qword. The easiest thing
+	* to do now is work out the length of the string and return
+	* MIN (len, limit).
+	*/
+	sub	len, src, srcin
+	cbz	has_nul1, .Lnul_in_data2
+CPU_BE( mov	data2, data1 )	/*prepare data to re-calculate the syndrome*/
+
+	sub	len, len, #8
+	mov	has_nul2, has_nul1
+.Lnul_in_data2:
+	/*
+	* For big-endian, carry propagation (if the final byte in the
+	* string is 0x01) means we cannot use has_nul directly.  The
+	* easiest way to get the correct byte is to byte-swap the data
+	* and calculate the syndrome a second time.
+	*/
+CPU_BE( rev	data2, data2 )
+CPU_BE( sub	tmp1, data2, zeroones )
+CPU_BE( orr	tmp2, data2, #REP8_7f )
+CPU_BE( bic	has_nul2, tmp1, tmp2 )
+
+	sub	len, len, #8
+	rev	has_nul2, has_nul2
+	clz	pos, has_nul2
+	add	len, len, pos, lsr #3       /* Bits to bytes.  */
+	cmp	len, limit
+	csel	len, len, limit, ls     /* Return the lower value.  */
+	ret
+
+.Lmisaligned:
+	/*
+	* Deal with a partial first word.
+	* We're doing two things in parallel here;
+	* 1) Calculate the number of words (but avoiding overflow if
+	* limit is near ULONG_MAX) - to do this we need to work out
+	* limit + tmp1 - 1 as a 65-bit value before shifting it;
+	* 2) Load and mask the initial data words - we force the bytes
+	* before the ones we are interested in to 0xff - this ensures
+	* early bytes will not hit any zero detection.
+	*/
+	ldp	data1, data2, [src], #16
+
+	sub	limit_wd, limit, #1
+	and	tmp3, limit_wd, #15
+	lsr	limit_wd, limit_wd, #4
+
+	add	tmp3, tmp3, tmp1
+	add	limit_wd, limit_wd, tmp3, lsr #4
+
+	neg	tmp4, tmp1
+	lsl	tmp4, tmp4, #3  /* Bytes beyond alignment -> bits.  */
+
+	mov	tmp2, #~0
+	/* Big-endian.  Early bytes are at MSB.  */
+CPU_BE( lsl	tmp2, tmp2, tmp4 )	/* Shift (tmp1 & 63).  */
+	/* Little-endian.  Early bytes are at LSB.  */
+CPU_LE( lsr	tmp2, tmp2, tmp4 )	/* Shift (tmp1 & 63).  */
+
+	cmp	tmp1, #8
+
+	orr	data1, data1, tmp2
+	orr	data2a, data2, tmp2
+
+	csinv	data1, data1, xzr, le
+	csel	data2, data2, data2a, le
+	b	.Lrealigned
+
+.Lhit_limit:
+	mov	len, limit
+	ret
+ENDPROC(strnlen)
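
The reference semantics the routine above must reproduce are simply the
following (editor's sketch, not part of the imported code):

static unsigned long strnlen_ref(const char *s, unsigned long maxlen)
{
    unsigned long i;

    for (i = 0; i < maxlen && s[i] != '\0'; i++)
        ;
    return i;
}

The optimised version reaches the same result by scanning 16 bytes per
iteration and clamping with "csel len, len, limit, ls" at the end.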
diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
index dfad1fe..e4b4469 100644
--- a/xen/include/asm-arm/string.h
+++ b/xen/include/asm-arm/string.h
@@ -14,6 +14,20 @@ extern char * strrchr(const char * s, int c);
 #define __HAVE_ARCH_STRCHR
 extern char * strchr(const char * s, int c);
 
+#if defined(CONFIG_ARM_64)
+#define __HAVE_ARCH_STRCMP
+extern int strcmp(const char *, const char *);
+
+#define __HAVE_ARCH_STRNCMP
+extern int strncmp(const char *, const char *, __kernel_size_t);
+
+#define __HAVE_ARCH_STRLEN
+extern __kernel_size_t strlen(const char *);
+
+#define __HAVE_ARCH_STRNLEN
+extern __kernel_size_t strnlen(const char *, __kernel_size_t);
+#endif
+
 #define __HAVE_ARCH_MEMCPY
 extern void * memcpy(void *, const void *, __kernel_size_t);
 
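
For context, these __HAVE_ARCH_* guards follow the usual Linux/Xen pattern:
the generic C fallback is only compiled when the architecture does not
provide its own routine. Roughly (a sketch of the pattern, not the exact
common code):

#include <stddef.h>

#ifndef __HAVE_ARCH_STRLEN
size_t strlen(const char *s)
{
    const char *sc;

    for (sc = s; *sc != '\0'; ++sc)
        /* nothing */;
    return sc - s;
}
#endif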
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:42   ` Julien Grall
@ 2014-07-25 15:48     ` Ian Campbell
  2014-07-25 15:48       ` Julien Grall
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 15:48 UTC (permalink / raw)
  To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel

On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> Hi Ian,
> 
> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > bitops, cmpxchg, atomics: Import:
> >   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> 
> Compare to Linux we don't have specific prefetch* helpers. We directly
> use the compiler builtin ones. Shouldn't we import the ARM specific
> helpers to gain in performance?

My binaries are full of pld instructions where I think I would expect
them, so it seems like the compiler builtin ones are sufficient.

I suspect the Linux define is there to cope with older compilers or
something.
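
For reference, the builtin-based fallback being compared against boils down
to something like the sketch below (not the actual Xen header, just the
general shape):

static inline void prefetch(const void *p)
{
    __builtin_prefetch(p);          /* read prefetch */
}

static inline void prefetchw(const void *p)
{
    __builtin_prefetch(p, 1);       /* write prefetch */
}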

Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:48     ` Ian Campbell
@ 2014-07-25 15:48       ` Julien Grall
  2014-07-25 16:03         ` Ian Campbell
  0 siblings, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 15:48 UTC (permalink / raw)
  To: Ian Campbell; +Cc: stefano.stabellini, tim, xen-devel

On 07/25/2014 04:48 PM, Ian Campbell wrote:
> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
>> Hi Ian,
>>
>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
>>> bitops, cmpxchg, atomics: Import:
>>>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>>
>> Compare to Linux we don't have specific prefetch* helpers. We directly
>> use the compiler builtin ones. Shouldn't we import the ARM specific
>> helpers to gain in performance?
> 
> My binaries are full of pld instructions where I think I would expect
> them, so it seems like the compiler builtin ones are sufficient.
> 
> I suspect the Linux define is there to cope with older compilers or
> something.

If so:

Acked-by: Julien Grall <julien.grall@linaro.org>

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:48       ` Julien Grall
@ 2014-07-25 16:03         ` Ian Campbell
  2014-07-25 16:13           ` Ian Campbell
  2014-07-25 16:17           ` Julien Grall
  0 siblings, 2 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 16:03 UTC (permalink / raw)
  To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel

On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
> On 07/25/2014 04:48 PM, Ian Campbell wrote:
> > On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> >> Hi Ian,
> >>
> >> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> >>> bitops, cmpxchg, atomics: Import:
> >>>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> >>
> >> Compare to Linux we don't have specific prefetch* helpers. We directly
> >> use the compiler builtin ones. Shouldn't we import the ARM specific
> >> helpers to gain in performance?
> > 
> > My binaries are full of pld instructions where I think I would expect
> > them, so it seems like the compiler builtin ones are sufficient.
> > 
> > I suspect the Linux define is there to cope with older compilers or
> > something.
> 
> If so:

The compiled output is very different if I use the arch specific
explicit variants. The explicit variant generates (lots) more pldw and
(somewhat) fewer pld. I've no idea what this means...

Note that the builtins presumably let gcc reason about whether preloads
are needed, whereas the explicit variants do not. I'm not sure how that
results in fewer pld with the explicit variant though! (unless it's
doing some sort of peephole optimisation and throwing them away?)

I've no idea what the right answer is.

How about we take the updates for now and revisit the question of
builtin vs explicit prefetches some other time?


> Acked-by: Julien Grall <julien.grall@linaro.org>
> 
> Regards,
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 16:03         ` Ian Campbell
@ 2014-07-25 16:13           ` Ian Campbell
  2014-07-25 16:20             ` Julien Grall
  2014-07-25 16:17           ` Julien Grall
  1 sibling, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 16:13 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel, tim, stefano.stabellini

On Fri, 2014-07-25 at 17:03 +0100, Ian Campbell wrote:
> On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
> > On 07/25/2014 04:48 PM, Ian Campbell wrote:
> > > On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> > >> Hi Ian,
> > >>
> > >> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > >>> bitops, cmpxchg, atomics: Import:
> > >>>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> > >>
> > >> Compare to Linux we don't have specific prefetch* helpers. We directly
> > >> use the compiler builtin ones. Shouldn't we import the ARM specific
> > >> helpers to gain in performance?
> > > 
> > > My binaries are full of pld instructions where I think I would expect
> > > them, so it seems like the compiler builtin ones are sufficient.
> > > 
> > > I suspect the Linux define is there to cope with older compilers or
> > > something.
> > 
> > If so:
> 
> The compiled output is very different if I use the arch specific
> explicit variants. The explicit variant generates (lots) more pldw and
> (somewhat) fewer pld. I've no idea what this means...

It's a bit more obvious for aarch64 where gcc 4.8 doesn't generate any
prefetches at all via the builtins...

Here's what I've got in my tree. I've no idea if we should take some or
all of it...

Ian.

8<-----------------

>From feb516fee01a0af60f54337b323975154eb466d8 Mon Sep 17 00:00:00 2001
Message-Id: <feb516fee01a0af60f54337b323975154eb466d8.1406304807.git.ian.campbell@citrix.com>
From: Ian Campbell <ian.campbell@citrix.com>
Date: Fri, 25 Jul 2014 17:08:42 +0100
Subject: [PATCH] xen: arm: Use explicit prefetch instructions.

On ARM32 these certainly generate *different* sets of prefetches.
I've no clue if that is a good thing...

On ARM64 the builtin variants seem to be non-functional (at least
with gcc 4.8).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 xen/include/asm-arm/arm32/processor.h |   17 +++++++++++++++++
 xen/include/asm-arm/arm64/processor.h |   22 ++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/xen/include/asm-arm/arm32/processor.h b/xen/include/asm-arm/arm32/processor.h
index f41644d..6feacc9 100644
--- a/xen/include/asm-arm/arm32/processor.h
+++ b/xen/include/asm-arm/arm32/processor.h
@@ -119,6 +119,23 @@ struct cpu_user_regs
 #define cpu_has_erratum_766422()                             \
     (unlikely(current_cpu_data.midr.bits == 0x410fc0f4))
 
+#define ARCH_HAS_PREFETCH
+static inline void prefetch(const void *ptr)
+{
+        __asm__ __volatile__(
+                "pld\t%a0"
+                :: "p" (ptr));
+}
+
+#define ARCH_HAS_PREFETCHW
+static inline void prefetchw(const void *ptr)
+{
+        __asm__ __volatile__(
+                ".arch_extension        mp\n"
+                "pldw\t%a0"
+                :: "p" (ptr));
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __ASM_ARM_ARM32_PROCESSOR_H */
diff --git a/xen/include/asm-arm/arm64/processor.h b/xen/include/asm-arm/arm64/processor.h
index 5bf0867..56b1002 100644
--- a/xen/include/asm-arm/arm64/processor.h
+++ b/xen/include/asm-arm/arm64/processor.h
@@ -106,6 +106,28 @@ struct cpu_user_regs
 
 #define cpu_has_erratum_766422() 0
 
+/*
+ * Prefetching support
+ */
+#define ARCH_HAS_PREFETCH
+static inline void prefetch(const void *ptr)
+{
+        asm volatile("prfm pldl1keep, %a0\n" : : "p" (ptr));
+}
+
+#define ARCH_HAS_PREFETCHW
+static inline void prefetchw(const void *ptr)
+{
+        asm volatile("prfm pstl1keep, %a0\n" : : "p" (ptr));
+}
+
+#define ARCH_HAS_SPINLOCK_PREFETCH
+static inline void spin_lock_prefetch(const void *x)
+{
+        prefetchw(x);
+}
+
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __ASM_ARM_ARM64_PROCESSOR_H */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 16:03         ` Ian Campbell
  2014-07-25 16:13           ` Ian Campbell
@ 2014-07-25 16:17           ` Julien Grall
  2014-07-25 16:23             ` Ian Campbell
  1 sibling, 1 reply; 13+ messages in thread
From: Julien Grall @ 2014-07-25 16:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: stefano.stabellini, tim, xen-devel

On 07/25/2014 05:03 PM, Ian Campbell wrote:
> On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
>> On 07/25/2014 04:48 PM, Ian Campbell wrote:
>>> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
>>>> Hi Ian,
>>>>
>>>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
>>>>> bitops, cmpxchg, atomics: Import:
>>>>>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>>>>
>>>> Compare to Linux we don't have specific prefetch* helpers. We directly
>>>> use the compiler builtin ones. Shouldn't we import the ARM specific
>>>> helpers to gain in performance?
>>>
>>> My binaries are full of pld instructions where I think I would expect
>>> them, so it seems like the compiler builtin ones are sufficient.
>>>
>>> I suspect the Linux define is there to cope with older compilers or
>>> something.
>>
>> If so:
> 
> The compiled output is very different if I use the arch specific
> explicit variants. The explicit variant generates (lots) more pldw and
> (somewhat) fewer pld. I've no idea what this means...

It looks like pldw is defined for ARMv7 with the MP extensions.

AFAIU, pldw is used to signal that we will likely write to this address.

I guess we use the prefetch* helpers more often for writes to memory.

> 
> Note that the builtins presumably let gcc reason about whether preloads
> are needed, whereas the explicit variants do not. I'm not sure how that
> results in fewer pld with the explicit variant though! (unless it's
> doing some sort of peephole optimisation and throwing them away?)
> 
> I've no idea what the right answer is.
> 
> How about we take the updates for now and revisit the question of
> builtin vs explicit prefetches some other time?

I'm fine with it. You can keep the ack for this patch.

Regards,


-- 
Julien Grall

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 16:13           ` Ian Campbell
@ 2014-07-25 16:20             ` Julien Grall
  0 siblings, 0 replies; 13+ messages in thread
From: Julien Grall @ 2014-07-25 16:20 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, tim, stefano.stabellini

On 07/25/2014 05:13 PM, Ian Campbell wrote:
> On Fri, 2014-07-25 at 17:03 +0100, Ian Campbell wrote:
>> On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
>>> On 07/25/2014 04:48 PM, Ian Campbell wrote:
>>>> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
>>>>> Hi Ian,
>>>>>
>>>>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
>>>>>> bitops, cmpxchg, atomics: Import:
>>>>>>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
>>>>>
>>>>> Compare to Linux we don't have specific prefetch* helpers. We directly
>>>>> use the compiler builtin ones. Shouldn't we import the ARM specific
>>>>> helpers to gain in performance?
>>>>
>>>> My binaries are full of pld instructions where I think I would expect
>>>> them, so it seems like the compiler builtin ones are sufficient.
>>>>
>>>> I suspect the Linux define is there to cope with older compilers or
>>>> something.
>>>
>>> If so:
>>
>> The compiled output is very different if I use the arch specific
>> explicit variants. The explicit variant generates (lots) more pldw and
>> (somewhat) fewer pld. I've no idea what this means...
> 
> It's a bit more obvious for aarch64 where gcc 4.8 doesn't generate any
> prefetches at all via the builtins...
> 
> Here's what I've got in my tree. I've no idea if we should take some or
> all of it...

I don't think it will be harmful for ARMv7 to use specific prefetch*
helpers.

[..]

> +/*
> + * Prefetching support
> + */
> +#define ARCH_HAS_PREFETCH
> +static inline void prefetch(const void *ptr)
> +{
> +        asm volatile("prfm pldl1keep, %a0\n" : : "p" (ptr));
> +}
> +
> +#define ARCH_HAS_PREFETCHW
> +static inline void prefetchw(const void *ptr)
> +{
> +        asm volatile("prfm pstl1keep, %a0\n" : : "p" (ptr));
> +}
> +
> +#define ARCH_HAS_SPINLOCK_PREFETCH
> +static inline void spin_lock_prefetch(const void *x)
> +{
> +        prefetchw(x);
> +}

Looking at the code, spin_lock_prefetch is never called in the tree. I'm not
sure we should keep this helper.

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] xen: arm: update arm32 assembly primitives to Linux v3.16-rc6
  2014-07-25 16:17           ` Julien Grall
@ 2014-07-25 16:23             ` Ian Campbell
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2014-07-25 16:23 UTC (permalink / raw)
  To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel

On Fri, 2014-07-25 at 17:17 +0100, Julien Grall wrote:
> On 07/25/2014 05:03 PM, Ian Campbell wrote:
> > On Fri, 2014-07-25 at 16:48 +0100, Julien Grall wrote:
> >> On 07/25/2014 04:48 PM, Ian Campbell wrote:
> >>> On Fri, 2014-07-25 at 16:42 +0100, Julien Grall wrote:
> >>>> Hi Ian,
> >>>>
> >>>> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> >>>>> bitops, cmpxchg, atomics: Import:
> >>>>>   c32ffce ARM: 7984/1: prefetch: add prefetchw invocations for barriered atomics
> >>>>
> >>>> Compare to Linux we don't have specific prefetch* helpers. We directly
> >>>> use the compiler builtin ones. Shouldn't we import the ARM specific
> >>>> helpers to gain in performance?
> >>>
> >>> My binaries are full of pld instructions where I think I would expect
> >>> them, so it seems like the compiler builtin ones are sufficient.
> >>>
> >>> I suspect the Linux define is there to cope with older compilers or
> >>> something.
> >>
> >> If so:
> > 
> > The compiled output is very different if I use the arch specific
> > explicit variants. The explicit variant generates (lots) more pldw and
> > (somewhat) fewer pld. I've no idea what this means...
> 
> It looks like that pldw has been defined for ARMv7 with MP extensions.
> 
> AFAIU, pldw is used to signal we will likely write on this address.

Oh, I know *that*.

What I couldn't explain is why the builtins should generate 181 pld's
and 6 pldw's (total 187) while the explicit ones generate 127 pld's and
93 pldw's (total 220) for the exact same code base.

Perhaps we simply use prefetchw too often in our code in gcc's opinion
so it elides some of them. Or perhaps the volatile in the explicit
version stops gcc from making other optimisations so there's simply more
occasions where the prefetching is needed.

The difference in the write prefetches is pretty stark though, 6 vs 93.

Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
  2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
@ 2014-08-04 16:16   ` Ian Campbell
  0 siblings, 0 replies; 13+ messages in thread
From: Ian Campbell @ 2014-08-04 16:16 UTC (permalink / raw)
  To: Julien Grall; +Cc: stefano.stabellini, tim, xen-devel

On Fri, 2014-07-25 at 16:36 +0100, Julien Grall wrote:
> Hi Ian,
> 
> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > The only really interesting changes here are the updates to mem* which update
> > to actually optimised versions and introduce an optimised memcmp.
> 
> I didn't read the whole code as I assume it's just a copy with few
> changes from Linux.
> 
> Acked-by: Julien Grall <julien.grall@linaro.org>

Thanks.

Julien also acked the other two patches via IRC, so I have applied.

Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-08-04 16:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
2014-07-25 15:42   ` Julien Grall
2014-07-25 15:48     ` Ian Campbell
2014-07-25 15:48       ` Julien Grall
2014-07-25 16:03         ` Ian Campbell
2014-07-25 16:13           ` Ian Campbell
2014-07-25 16:20             ` Julien Grall
2014-07-25 16:17           ` Julien Grall
2014-07-25 16:23             ` Ian Campbell
2014-07-25 15:36 ` [PATCH 1/2] xen: arm: update arm64 " Julien Grall
2014-08-04 16:16   ` Ian Campbell
2014-07-25 15:43 ` Ian Campbell
