All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
@ 2016-08-24  4:17 Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 1/7] cutils: Remove SPLAT macro Richard Henderson
                   ` (9 more replies)
  0 siblings, 10 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

Patches 1-3 remove the use of ifunc from the implementation.

Patch 5 adjusts the x86 implementation a bit more to take
advantage of ptest (in sse4.1) and unaligned accesses (in avx1).

Patches 2 and 6 are the result of my conversation with Vijaya
Kumar with respect to ThunderX.

Patch 7 is the result of seeing some really really horrible code
produced for ppc64le (gcc 4.9 and mainline).

This has had limited testing.  What I don't know is the best way
to benchmark this -- the only way I know to trigger this is via
the console, by hand, which doesn't make for reasonable timing.


r~


Richard Henderson (7):
  cutils: Remove SPLAT macro
  cutils: Export only buffer_is_zero
  cutils: Rearrange buffer_is_zero acceleration
  cutils: Add generic prefetch
  cutils: Rewrite x86 buffer zero checking
  cutils: Rewrite aarch64 buffer zero checking
  cutils: Rewrite ppc buffer zero checking

 configure             |  21 +-
 include/qemu/cutils.h |   2 -
 migration/ram.c       |   2 +-
 migration/rdma.c      |   5 +-
 util/cutils.c         | 526 +++++++++++++++++++++++++++++++++-----------------
 5 files changed, 352 insertions(+), 204 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 1/7] cutils: Remove SPLAT macro
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 2/7] cutils: Export only buffer_is_zero Richard Henderson
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

This is unused and complicates the vector interface.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/cutils.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/util/cutils.c b/util/cutils.c
index 7505fda..1c8635c 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -172,7 +172,6 @@ int qemu_fdatasync(int fd)
 #undef pixel
 #undef bool
 #define VECTYPE        __vector unsigned char
-#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
 #define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
 #define VEC_OR(v1, v2) ((v1) | (v2))
 /* altivec.h may redefine the bool macro as vector type.
@@ -181,7 +180,6 @@ int qemu_fdatasync(int fd)
 #elif defined __SSE2__
 #include <emmintrin.h>
 #define VECTYPE        __m128i
-#define SPLAT(p)       _mm_set1_epi8(*(p))
 #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
 #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
 #elif defined(__aarch64__)
@@ -193,7 +191,6 @@ int qemu_fdatasync(int fd)
 #define VEC_OR(v1, v2) ((v1) | (v2))
 #else
 #define VECTYPE        unsigned long
-#define SPLAT(p)       (*(p) * (~0UL / 255))
 #define ALL_EQ(v1, v2) ((v1) == (v2))
 #define VEC_OR(v1, v2) ((v1) | (v2))
 #endif
@@ -270,7 +267,6 @@ static size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
 #include <immintrin.h>
 
 #define AVX2_VECTYPE        __m256i
-#define AVX2_SPLAT(p)       _mm256_set1_epi8(*(p))
 #define AVX2_ALL_EQ(v1, v2) \
     (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
 #define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 2/7] cutils: Export only buffer_is_zero
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 1/7] cutils: Remove SPLAT macro Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  8:37   ` Dr. David Alan Gilbert
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 3/7] cutils: Rearrange buffer_is_zero acceleration Richard Henderson
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

Since the two users don't make use of the returned offset,
beyond ensuring that the entire buffer is zero, consider the
can_use_buffer_find_nonzero_offset and buffer_find_nonzero_offset
functions internal.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/qemu/cutils.h | 2 --
 migration/ram.c       | 2 +-
 migration/rdma.c      | 5 +----
 util/cutils.c         | 8 ++++----
 4 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 3e4ea23..ca58577 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -168,8 +168,6 @@ int64_t qemu_strtosz_suffix_unit(const char *nptr, char **end,
 /* used to print char* safely */
 #define STR_OR_NULL(str) ((str) ? (str) : "null")
 
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len);
-size_t buffer_find_nonzero_offset(const void *buf, size_t len);
 bool buffer_is_zero(const void *buf, size_t len);
 
 /*
diff --git a/migration/ram.c b/migration/ram.c
index a3d70c4..a6e1c63 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -73,7 +73,7 @@ static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
 
 static inline bool is_zero_range(uint8_t *p, uint64_t size)
 {
-    return buffer_find_nonzero_offset(p, size) == size;
+    return buffer_is_zero(p, size);
 }
 
 /* struct contains XBZRLE cache and a static page
diff --git a/migration/rdma.c b/migration/rdma.c
index 5110ec8..88bdb64 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1934,10 +1934,7 @@ retry:
              * memset() + madvise() the entire chunk without RDMA.
              */
 
-            if (can_use_buffer_find_nonzero_offset((void *)(uintptr_t)sge.addr,
-                                                   length)
-                   && buffer_find_nonzero_offset((void *)(uintptr_t)sge.addr,
-                                                    length) == length) {
+            if (buffer_is_zero((void *)(uintptr_t)sge.addr, length)) {
                 RDMACompress comp = {
                                         .offset = current_addr,
                                         .value = 0,
diff --git a/util/cutils.c b/util/cutils.c
index 1c8635c..621ca67 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -327,9 +327,9 @@ static bool avx2_support(void)
     return b & bit_AVX2;
 }
 
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
+static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
          __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
-size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
+static size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
          __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
 
 static void *buffer_find_nonzero_offset_ifunc(void)
@@ -350,12 +350,12 @@ static void *can_use_buffer_find_nonzero_offset_ifunc(void)
 }
 #pragma GCC pop_options
 #else
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
 {
     return can_use_buffer_find_nonzero_offset_inner(buf, len);
 }
 
-size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+static size_t buffer_find_nonzero_offset(const void *buf, size_t len)
 {
     return buffer_find_nonzero_offset_inner(buf, len);
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 3/7] cutils: Rearrange buffer_is_zero acceleration
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 1/7] cutils: Remove SPLAT macro Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 2/7] cutils: Export only buffer_is_zero Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 4/7] cutils: Add generic prefetch Richard Henderson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

Allow selection of several acceleration functions
based on the size and alignment of the buffer.
Do not require ifunc support for AVX2 acceleration.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 configure     |  21 +---
 util/cutils.c | 357 +++++++++++++++++++++++++++-------------------------------
 2 files changed, 175 insertions(+), 203 deletions(-)

diff --git a/configure b/configure
index 4b808f9..9f3d1fa 100755
--- a/configure
+++ b/configure
@@ -1788,28 +1788,19 @@ fi
 ##########################################
 # avx2 optimization requirement check
 
-
-if test "$static" = "no" ; then
-  cat > $TMPC << EOF
+cat > $TMPC << EOF
 #pragma GCC push_options
 #pragma GCC target("avx2")
 #include <cpuid.h>
 #include <immintrin.h>
-
 static int bar(void *a) {
-    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(*(__m256i *)a, (__m256i){0}));
+    __m256i x = *(__m256i *)a;
+    return _mm256_testz_si256(x, x);
 }
-static void *bar_ifunc(void) {return (void*) bar;}
-int foo(void *a) __attribute__((ifunc("bar_ifunc")));
-int main(int argc, char *argv[]) { return foo(argv[0]);}
+int main(int argc, char *argv[]) { return bar(argv[0]); }
 EOF
-  if compile_object "" ; then
-      if has readelf; then
-          if readelf --syms $TMPO 2>/dev/null |grep -q "IFUNC.*foo"; then
-              avx2_opt="yes"
-          fi
-      fi
-  fi
+if compile_object "" ; then
+  avx2_opt="yes"
 fi
 
 #########################################
diff --git a/util/cutils.c b/util/cutils.c
index 621ca67..4d2edd6 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -162,243 +162,224 @@ int qemu_fdatasync(int fd)
 }
 
 /* vector definitions */
-#ifdef __ALTIVEC__
-#include <altivec.h>
-/* The altivec.h header says we're allowed to undef these for
- * C++ compatibility.  Here we don't care about C++, but we
- * undef them anyway to avoid namespace pollution.
- */
-#undef vector
-#undef pixel
-#undef bool
-#define VECTYPE        __vector unsigned char
-#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
-#define VEC_OR(v1, v2) ((v1) | (v2))
-/* altivec.h may redefine the bool macro as vector type.
- * Reset it to POSIX semantics. */
-#define bool _Bool
-#elif defined __SSE2__
-#include <emmintrin.h>
-#define VECTYPE        __m128i
-#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
-#define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
-#elif defined(__aarch64__)
-#include "arm_neon.h"
-#define VECTYPE        uint64x2_t
-#define ALL_EQ(v1, v2) \
-        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
-         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
-#define VEC_OR(v1, v2) ((v1) | (v2))
-#else
-#define VECTYPE        unsigned long
-#define ALL_EQ(v1, v2) ((v1) == (v2))
-#define VEC_OR(v1, v2) ((v1) | (v2))
-#endif
-
-#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
-
-static bool
-can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
-{
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
-}
-
-/*
- * Searches for an area with non-zero content in a buffer
- *
- * Attention! The len must be a multiple of
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
- * and addr must be a multiple of sizeof(VECTYPE) due to
- * restriction of optimizations in this function.
- *
- * can_use_buffer_find_nonzero_offset_inner() can be used to
- * check these requirements.
- *
- * The return value is the offset of the non-zero area rounded
- * down to a multiple of sizeof(VECTYPE) for the first
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
- * afterwards.
- *
- * If the buffer is all zero the return value is equal to len.
- */
 
-static size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
+extern void link_error(void);
+
+#define ACCEL_BUFFER_ZERO(NAME, SIZE, VECTYPE, ZERO)            \
+static bool __attribute__((noinline))                           \
+NAME(const void *buf, size_t len)                               \
+{                                                               \
+    const void *end = buf + len;                                \
+    do {                                                        \
+        const VECTYPE *p = buf;                                 \
+        VECTYPE t;                                              \
+        if (SIZE == sizeof(VECTYPE) * 4) {                      \
+            t = (p[0] | p[1]) | (p[2] | p[3]);                  \
+        } else if (SIZE == sizeof(VECTYPE) * 8) {               \
+            t  = p[0] | p[1];                                   \
+            t |= p[2] | p[3];                                   \
+            t |= p[4] | p[5];                                   \
+            t |= p[6] | p[7];                                   \
+        } else {                                                \
+            link_error();                                       \
+        }                                                       \
+        if (unlikely(!ZERO(t))) {                               \
+            return false;                                       \
+        }                                                       \
+        buf += SIZE;                                            \
+    } while (buf < end);                                        \
+    return true;                                                \
+}
+
+typedef bool (*accel_zero_fn)(const void *, size_t);
+
+static bool __attribute__((noinline))
+buffer_zero_base(const void *buf, size_t len)
 {
-    const VECTYPE *p = buf;
-    const VECTYPE zero = (VECTYPE){0};
     size_t i;
 
-    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
-
-    if (!len) {
-        return 0;
+    /* Check bytes until the buffer is aligned.  */
+    for (i = 0; i < len && ((uintptr_t)buf + i) % sizeof(long); ++i) {
+        const char *p = buf + i;
+        if (*p) {
+            return false;
+        }
     }
 
-    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
-        if (!ALL_EQ(p[i], zero)) {
-            return i * sizeof(VECTYPE);
+    /* Check longs until we run out.  */
+    for (; i + sizeof(long) <= len; i += sizeof(long)) {
+        const long *p = buf + i;
+        if (*p) {
+            return false;
         }
     }
 
-    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
-         i < len / sizeof(VECTYPE);
-         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
-        VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
-        VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
-        VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
-        VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
-        VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
-        VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
-        if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
-            break;
+    /* Check the last few bytes of the tail.  */
+    for (; i < len; ++i) {
+        const char *p = buf + i;
+        if (*p) {
+            return false;
         }
     }
 
-    return i * sizeof(VECTYPE);
+    return true;
 }
 
-#if defined CONFIG_AVX2_OPT
-#pragma GCC push_options
-#pragma GCC target("avx2")
-#include <cpuid.h>
-#include <immintrin.h>
-
-#define AVX2_VECTYPE        __m256i
-#define AVX2_ALL_EQ(v1, v2) \
-    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
-#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
+#define IDENT_ZERO(X)  (X)
+ACCEL_BUFFER_ZERO(buffer_zero_int, 4*sizeof(long), long, IDENT_ZERO)
 
-static bool
-can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+static bool select_accel_int(const void *buf, size_t len)
 {
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(AVX2_VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0);
+    uintptr_t ibuf = (uintptr_t)buf;
+    /* Note that this condition used to be the input constraint for
+       buffer_is_zero, therefore it is highly likely to be true.  */
+    if (likely(len % (4 * sizeof(long)) == 0)
+        && likely(ibuf % sizeof(long) == 0)) {
+        return buffer_zero_int(buf, len);
+    }
+    return buffer_zero_base(buf, len);
 }
 
-static size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+#ifdef __ALTIVEC__
+#include <altivec.h>
+/* The altivec.h header says we're allowed to undef these for
+ * C++ compatibility.  Here we don't care about C++, but we
+ * undef them anyway to avoid namespace pollution.
+ * altivec.h may redefine the bool macro as vector type.
+ * Reset it to POSIX semantics.
+ */
+#undef vector
+#undef pixel
+#undef bool
+#define bool _Bool
+#define DO_ZERO(X)  vec_all_eq(X, (__vector unsigned char){ 0 })
+ACCEL_BUFFER_ZERO(buffer_zero_ppc, 128, __vector unsigned char, DO_ZERO)
+
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    const AVX2_VECTYPE *p = buf;
-    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
-    size_t i;
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 128 == 0 && ibuf % sizeof(__vector unsigned char) == 0) {
+        return buffer_zero_ppc(buf, len);
+    }
+    return select_accel_int(buf, len);
+}
 
-    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
+#elif defined(CONFIG_AVX2_OPT)
+#include <cpuid.h>
+#include <x86intrin.h>
 
-    if (!len) {
-        return 0;
-    }
+#pragma GCC push_options
+#pragma GCC target("avx2")
+#define AVX2_ZERO(X)  _mm256_testz_si256((X), (X))
+ACCEL_BUFFER_ZERO(buffer_zero_avx2, 128, __m256i, AVX2_ZERO)
+#pragma GCC pop_options
 
-    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
-        if (!AVX2_ALL_EQ(p[i], zero)) {
-            return i * sizeof(AVX2_VECTYPE);
-        }
-    }
+#pragma GCC push_options
+#pragma GCC target("sse2")
+#define SSE2_ZERO(X) \
+    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) == 0xFFFF)
+ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_ZERO)
+#pragma GCC pop_options
 
-    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
-         i < len / sizeof(AVX2_VECTYPE);
-         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
-        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
-        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
-        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
-        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
-        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
-        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
-        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
-            break;
-        }
-    }
+#define CACHE_SSE2    1
+#define CACHE_SSE4    2
+#define CACHE_AVX1    4
+#define CACHE_AVX2    8
 
-    return i * sizeof(AVX2_VECTYPE);
-}
+static int cpuid_cache;
 
-static bool avx2_support(void)
+static void __attribute__((constructor)) init_cpuid_cache(void)
 {
+    int max = __get_cpuid_max(0, NULL);
     int a, b, c, d;
+    int cache = 0;
 
-    if (__get_cpuid_max(0, NULL) < 7) {
-        return false;
-    }
-
-    __cpuid_count(7, 0, a, b, c, d);
+    if (max >= 1) {
+        __cpuid(1, a, b, c, d);
+        if (d & bit_SSE2) {
+            cache |= CACHE_SSE2;
+        }
+        if (c & bit_SSE4_1) {
+            cache |= CACHE_SSE4;
+        }
 
-    return b & bit_AVX2;
+        /* We must check that AVX is not just available, but usable.  */
+        if ((c & bit_OSXSAVE) && (c & bit_AVX)) {
+            __asm("xgetbv" : "=a"(a), "=d"(d) : "c"(0));
+            if ((a & 6) == 6) {
+                cache |= CACHE_AVX1;
+                if (max >= 7) {
+                    __cpuid_count(7, 0, a, b, c, d);
+                    if (b & bit_AVX2) {
+                        cache |= CACHE_AVX2;
+                    }
+                }
+            }
+        }
+    }
+    cpuid_cache = cache;
 }
 
-static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
-         __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
-static size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
-         __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
-
-static void *buffer_find_nonzero_offset_ifunc(void)
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
-        buffer_find_nonzero_offset_avx2 : buffer_find_nonzero_offset_inner;
-
-    return func;
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 128 == 0 && ibuf % 32 == 0 && (cpuid_cache & CACHE_AVX2)) {
+        return buffer_zero_avx2(buf, len);
+    }
+    if (len % 64 == 0 && ibuf % 16 == 0 && (cpuid_cache & CACHE_SSE2)) {
+        return buffer_zero_sse2(buf, len);
+    }
+    return select_accel_int(buf, len);
 }
 
-static void *can_use_buffer_find_nonzero_offset_ifunc(void)
-{
-    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
-        can_use_buffer_find_nonzero_offset_avx2 :
-        can_use_buffer_find_nonzero_offset_inner;
+#elif defined __SSE2__
+#include <emmintrin.h>
 
-    return func;
-}
-#pragma GCC pop_options
-#else
-static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+#define SSE2_ZERO(X) \
+    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) == 0xFFFF)
+ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_ZERO)
+
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    return can_use_buffer_find_nonzero_offset_inner(buf, len);
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 64 == 0 && ibuf % sizeof(__m128i) == 0) {
+        return buffer_zero_sse2(buf, len);
+    }
+    return select_accel_int(buf, len);
 }
 
-static size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+#elif defined(__aarch64__)
+#include "arm_neon.h"
+
+#define DO_ZERO(X)  (vgetq_lane_u64((X), 0) | vgetq_lane_u64((X), 1))
+ACCEL_BUFFER_ZERO(buffer_zero_neon, 128, uint64x2_t, DO_ZERO)
+
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    return buffer_find_nonzero_offset_inner(buf, len);
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 128 == 0 && ibuf % sizeof(uint64x2_t) == 0) {
+        return buffer_zero_neon(buf, len);
+    }
+    return select_accel_int(buf, len);
 }
+
+#else
+#define select_accel_fn  select_accel_int
 #endif
 
 /*
  * Checks if a buffer is all zeroes
- *
- * Attention! The len must be a multiple of 4 * sizeof(long) due to
- * restriction of optimizations in this function.
  */
 bool buffer_is_zero(const void *buf, size_t len)
 {
-    /*
-     * Use long as the biggest available internal data type that fits into the
-     * CPU register and unroll the loop to smooth out the effect of memory
-     * latency.
-     */
-
-    size_t i;
-    long d0, d1, d2, d3;
-    const long * const data = buf;
-
-    /* use vector optimized zero check if possible */
-    if (can_use_buffer_find_nonzero_offset(buf, len)) {
-        return buffer_find_nonzero_offset(buf, len) == len;
+    if (unlikely(len == 0)) {
+        return true;
     }
 
-    assert(len % (4 * sizeof(long)) == 0);
-    len /= sizeof(long);
-
-    for (i = 0; i < len; i += 4) {
-        d0 = data[i + 0];
-        d1 = data[i + 1];
-        d2 = data[i + 2];
-        d3 = data[i + 3];
-
-        if (d0 || d1 || d2 || d3) {
-            return false;
-        }
-    }
-
-    return true;
+    /* Use an optimized zero check if possible.  Note that this also
+       includes a check for an unrolled loop over longs, as well as
+       the unsized, unaligned fallback to buffer_zero_base.  */
+    return select_accel_fn(buf, len);
 }
 
 #ifndef _WIN32
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 4/7] cutils: Add generic prefetch
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (2 preceding siblings ...)
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 3/7] cutils: Rearrange buffer_is_zero acceleration Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 5/7] cutils: Rewrite x86 buffer zero checking Richard Henderson
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

There's no real knowledge of the cacheline size,
just prefetching one loop ahead.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/cutils.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 4d2edd6..0f1ce1d 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -173,6 +173,8 @@ NAME(const void *buf, size_t len)                               \
     do {                                                        \
         const VECTYPE *p = buf;                                 \
         VECTYPE t;                                              \
+        __builtin_prefetch(buf + SIZE);                         \
+        barrier();                                              \
         if (SIZE == sizeof(VECTYPE) * 4) {                      \
             t = (p[0] | p[1]) | (p[2] | p[3]);                  \
         } else if (SIZE == sizeof(VECTYPE) * 8) {               \
@@ -376,6 +378,9 @@ bool buffer_is_zero(const void *buf, size_t len)
         return true;
     }
 
+    /* Fetch the beginning of the buffer while we select the accelerator.  */
+    __builtin_prefetch(buf);
+
     /* Use an optimized zero check if possible.  Note that this also
        includes a check for an unrolled loop over longs, as well as
        the unsized, unaligned fallback to buffer_zero_base.  */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 5/7] cutils: Rewrite x86 buffer zero checking
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (3 preceding siblings ...)
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 4/7] cutils: Add generic prefetch Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 6/7] cutils: Rewrite aarch64 " Richard Henderson
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

Use unaligned load operations.
Add prefetches for the next loop iteration.
Add versions for avx1 and sse4.1.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/cutils.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 145 insertions(+), 24 deletions(-)

diff --git a/util/cutils.c b/util/cutils.c
index 0f1ce1d..ec4bd78 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -266,21 +266,127 @@ static bool select_accel_fn(const void *buf, size_t len)
     return select_accel_int(buf, len);
 }
 
-#elif defined(CONFIG_AVX2_OPT)
+#elif defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <cpuid.h>
 #include <x86intrin.h>
 
+#ifdef CONFIG_AVX2_OPT
 #pragma GCC push_options
 #pragma GCC target("avx2")
-#define AVX2_ZERO(X)  _mm256_testz_si256((X), (X))
-ACCEL_BUFFER_ZERO(buffer_zero_avx2, 128, __m256i, AVX2_ZERO)
+
+static bool __attribute__((noinline))
+buffer_zero_avx2(const void *buf, size_t len)
+{
+    const __m256i *p = buf;
+    const __m256i *end = buf + len;
+    __m256i t;
+
+    do {
+        p += 4;
+        __builtin_prefetch(p);
+        /* Note that most AVX insns handle unaligned operands by
+           default; we only need take care for the initial load.  */
+        __asm volatile("vmovdqu -0x80(%1),%0\n\t"
+                       "vpor -0x60(%1),%0,%0\n\t"
+                       "vpor -0x40(%1),%0,%0\n\t"
+                       "vpor -0x20(%1),%0,%0"
+                       : "=x"(t) : "r"(p));
+        if (unlikely(!_mm256_testz_si256(t, t))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
+#pragma GCC pop_options
+#pragma GCC push_options
+#pragma GCC target("avx")
+
+static bool __attribute__((noinline))
+buffer_zero_avx(const void *buf, size_t len)
+{
+    const __m128i *p = buf;
+    const __m128i *end = buf + len;
+    __m128i t;
+
+    do {
+        p += 4;
+        __builtin_prefetch(p);
+        /* Note that most AVX insns handle unaligned operands by
+           default; we only need take care for the initial load.  */
+        __asm volatile("vmovdqu -0x40(%1),%0\n\t"
+                       "vpor -0x20(%1),%0,%0\n\t"
+                       "vpor -0x20(%1),%0,%0\n\t"
+                       "vpor -0x10(%1),%0,%0"
+                       : "=x"(t) : "r"(p));
+        if (unlikely(!_mm_testz_si128(t, t))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
 #pragma GCC pop_options
+#pragma GCC push_options
+#pragma GCC target("sse4")
 
+static bool __attribute__((noinline))
+buffer_zero_sse4(const void *buf, size_t len)
+{
+    const __m128i *p = buf;
+    const __m128i *end = buf + len;
+    __m128i t0, t1, t2, t3;
+
+    do {
+        p += 4;
+        __builtin_prefetch(p);
+        __asm volatile("movdqu -0x40(%4),%0\n\t"
+                       "movdqu -0x20(%4),%1\n\t"
+                       "movdqu -0x20(%4),%2\n\t"
+                       "movdqu -0x10(%4),%3\n\t"
+                       "por %1,%0\n\t"
+                       "por %3,%2\n\t"
+                       "por %2,%0"
+                       : "=x"(t0), "=x"(t1), "=x"(t2), "=x"(t3) : "r"(p));
+        if (unlikely(!_mm_testz_si128(t0, t0))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
+#pragma GCC pop_options
 #pragma GCC push_options
 #pragma GCC target("sse2")
-#define SSE2_ZERO(X) \
-    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) == 0xFFFF)
-ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_ZERO)
+#endif /* CONFIG_AVX2_OPT */
+
+static bool __attribute__((noinline))
+buffer_zero_sse2(const void *buf, size_t len)
+{
+    const __m128i *p = buf;
+    const __m128i *end = buf + len;
+    __m128i zero = _mm_setzero_si128();
+    __m128i t0, t1, t2, t3;
+
+    do {
+        p += 4;
+        __builtin_prefetch(p);
+        __asm volatile("movdqu -0x40(%4),%0\n\t"
+                       "movdqu -0x20(%4),%1\n\t"
+                       "movdqu -0x20(%4),%2\n\t"
+                       "movdqu -0x10(%4),%3\n\t"
+                       "por %1,%0\n\t"
+                       "por %3,%2\n\t"
+                       "por %2,%0"
+                       : "=x"(t0), "=x"(t1), "=x"(t2), "=x"(t3) : "r"(p));
+        if (unlikely(_mm_movemask_epi8(_mm_cmpeq_epi8(t0, zero)) == 0xFFFF)) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
+#ifdef CONFIG_AVX2_OPT
 #pragma GCC pop_options
 
 #define CACHE_SSE2    1
@@ -321,32 +427,47 @@ static void __attribute__((constructor)) init_cpuid_cache(void)
     }
     cpuid_cache = cache;
 }
+#endif /* CONFIG_AVX2_OPT */
 
 static bool select_accel_fn(const void *buf, size_t len)
 {
-    uintptr_t ibuf = (uintptr_t)buf;
-    if (len % 128 == 0 && ibuf % 32 == 0 && (cpuid_cache & CACHE_AVX2)) {
+#ifdef CONFIG_AVX2_OPT
+    int cache = cpuid_cache;
+
+    /* Force bits that the compiler tells us must be there.
+       This allows the compiler to optimize subsequent tests.  */
+#ifdef __AVX2__
+    cache |= CACHE_AVX2;
+#endif
+#ifdef __AVX__
+    cache |= CACHE_AVX1;
+#endif
+#ifdef __SSE4_1__
+    cache |= CACHE_SSE4;
+#endif
+#ifdef __SSE2__
+    cache |= CACHE_SSE2;
+#endif
+
+    if (len % 128 == 0 && (cache & CACHE_AVX2)) {
         return buffer_zero_avx2(buf, len);
     }
-    if (len % 64 == 0 && ibuf % 16 == 0 && (cpuid_cache & CACHE_SSE2)) {
-        return buffer_zero_sse2(buf, len);
+    if (len % 64 == 0) {
+        if (cache & CACHE_AVX1) {
+            return buffer_zero_avx(buf, len);
+        }
+        if (cache & CACHE_SSE4) {
+            return buffer_zero_sse4(buf, len);
+        }
+        if (cache & CACHE_SSE2) {
+            return buffer_zero_sse2(buf, len);
+        }
     }
-    return select_accel_int(buf, len);
-}
-
-#elif defined __SSE2__
-#include <emmintrin.h>
-
-#define SSE2_ZERO(X) \
-    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) == 0xFFFF)
-ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_ZERO)
-
-static bool select_accel_fn(const void *buf, size_t len)
-{
-    uintptr_t ibuf = (uintptr_t)buf;
-    if (len % 64 == 0 && ibuf % sizeof(__m128i) == 0) {
+#else
+    if (len % 64 == 0) {
         return buffer_zero_sse2(buf, len);
     }
+#endif
     return select_accel_int(buf, len);
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 6/7] cutils: Rewrite aarch64 buffer zero checking
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (4 preceding siblings ...)
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 5/7] cutils: Rewrite x86 buffer zero checking Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 7/7] cutils: Rewrite ppc " Richard Henderson
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell

Provide 64-byte and 128-byte versions.
Use dczid_el0 as a proxy for the cacheline size.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/cutils.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/util/cutils.c b/util/cutils.c
index ec4bd78..fe860e8 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -475,13 +475,35 @@ static bool select_accel_fn(const void *buf, size_t len)
 #include "arm_neon.h"
 
 #define DO_ZERO(X)  (vgetq_lane_u64((X), 0) | vgetq_lane_u64((X), 1))
-ACCEL_BUFFER_ZERO(buffer_zero_neon, 128, uint64x2_t, DO_ZERO)
+ACCEL_BUFFER_ZERO(buffer_zero_neon_64, 64, uint64x2_t, DO_ZERO)
+ACCEL_BUFFER_ZERO(buffer_zero_neon_128, 128, uint64x2_t, DO_ZERO)
+
+static uint32_t buffer_zero_line_mask;
+static accel_zero_fn buffer_zero_accel;
+
+static void __attribute__((constructor)) init_buffer_zero_accel(void)
+{
+    uint64_t t;
+
+    /* Use the DZP block size as a proxy for the cacheline size,
+       since the later is not available to userspace.  This seems
+       to work in practice for existing implementations.  */
+    asm("mrs %0, dczid_el0" : "=r"(t));
+    if ((t & 15) * 16 >= 128) {
+        buffer_zero_line_mask = 128 - 1;
+        buffer_zero_accel = buffer_zero_neon_128;
+    } else {
+        buffer_zero_line_mask = 64 - 1;
+        buffer_zero_accel = buffer_zero_neon_64;
+    }
+}
 
 static bool select_accel_fn(const void *buf, size_t len)
 {
     uintptr_t ibuf = (uintptr_t)buf;
-    if (len % 128 == 0 && ibuf % sizeof(uint64x2_t) == 0) {
-        return buffer_zero_neon(buf, len);
+    if (likely(ibuf % sizeof(uint64_t) == 0)
+        && (len & buffer_zero_line_mask) == 0) {
+        return buffer_zero_accel(buf, len);
     }
     return select_accel_int(buf, len);
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 7/7] cutils: Rewrite ppc buffer zero checking
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (5 preceding siblings ...)
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 6/7] cutils: Rewrite aarch64 " Richard Henderson
@ 2016-08-24  4:17 ` Richard Henderson
  2016-08-24  4:30 ` [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero no-reply
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2016-08-24  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: vijay.kilari, qemu-arm, pbonzini, peter.maydell, qemu-ppc, David Gibson

GCC versions through 6 do a poor job with the indexed addressing,
and (for ppc64le) issues unnecessary xxswapd insns.

Cc: qemu-ppc@nongnu.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/cutils.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/util/cutils.c b/util/cutils.c
index fe860e8..30fac02 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -254,8 +254,45 @@ static bool select_accel_int(const void *buf, size_t len)
 #undef pixel
 #undef bool
 #define bool _Bool
-#define DO_ZERO(X)  vec_all_eq(X, (__vector unsigned char){ 0 })
-ACCEL_BUFFER_ZERO(buffer_zero_ppc, 128, __vector unsigned char, DO_ZERO)
+
+static bool __attribute__((noinline))
+buffer_zero_ppc(const void *buf, size_t len)
+{
+    typedef unsigned char vec __attribute__((vector_size(16)));
+    const vec *p = buf;
+    const vec *end = buf + len;
+    vec t0, t1, t2, t3, zero = (vec){ 0 };
+
+    do {
+        p += 8;
+        __builtin_prefetch(p);
+        /* ??? GCC6 does poorly with power64le; extra xxswap.  */
+        __asm volatile("lvebx %0,%4,%5\n\t"
+                       "lvebx %1,%4,%6\n\t"
+                       "lvebx %2,%4,%7\n\t"
+                       "lvebx %3,%4,%8\n\t"
+                       "vor %0,%0,%1\n\t"
+                       "vor %1,%2,%3\n\t"
+                       "lvebx %2,%4,%9\n\t"
+                       "lvebx %3,%4,%10\n\t"
+                       "vor %0,%0,%1\n\t"
+                       "vor %1,%2,%3\n\t"
+                       "lvebx %2,%4,%11\n\t"
+                       "lvebx %3,%4,%12\n\t"
+                       "vor %0,%0,%1\n\t"
+                       "vor %1,%2,%3\n\t"
+                       "vor %0,%0,%1"
+                       : "=v"(t0), "=v"(t1), "=v"(t2), "=v"(t3)
+                       : "b"(p), "b"(-8 * 16), "b"(-7 * 16),
+                         "b"(-6 * 16), "b"(-5 * 16),
+                         "b"(-4 * 16), "b"(-3 * 16),
+                         "b"(-2 * 16), "b"(-1 * 16));
+        if (unlikely(vec_any_ne(t0, zero))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
 
 static bool select_accel_fn(const void *buf, size_t len)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (6 preceding siblings ...)
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 7/7] cutils: Rewrite ppc " Richard Henderson
@ 2016-08-24  4:30 ` no-reply
  2016-08-24  4:38   ` Paolo Bonzini
  2016-08-24  8:34 ` Dr. David Alan Gilbert
  2016-08-25  6:37 ` Vijay Kilari
  9 siblings, 1 reply; 20+ messages in thread
From: no-reply @ 2016-08-24  4:30 UTC (permalink / raw)
  To: rth; +Cc: famz, qemu-devel, pbonzini, qemu-arm, vijay.kilari, peter.maydell

Hi,

Your series seems to have some coding style problems. See output below for
more information:

Message-id: 1472012279-20581-1-git-send-email-rth@twiddle.net
Subject: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git show --no-patch --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1472012279-20581-1-git-send-email-rth@twiddle.net -> patchew/1472012279-20581-1-git-send-email-rth@twiddle.net
Switched to a new branch 'test'
12a04a4 cutils: Rewrite ppc buffer zero checking
2841895 cutils: Rewrite aarch64 buffer zero checking
00cb541 cutils: Rewrite x86 buffer zero checking
457e08e cutils: Add generic prefetch
4063093 cutils: Rearrange buffer_is_zero acceleration
514f601 cutils: Export only buffer_is_zero
aabd7b2 cutils: Remove SPLAT macro

=== OUTPUT BEGIN ===
Checking PATCH 1/7: cutils: Remove SPLAT macro...
Checking PATCH 2/7: cutils: Export only buffer_is_zero...
Checking PATCH 3/7: cutils: Rearrange buffer_is_zero acceleration...
ERROR: externs should be avoided in .c files
#124: FILE: util/cutils.c:166:
+extern void link_error(void);

ERROR: spaces required around that '*' (ctx:VxV)
#218: FILE: util/cutils.c:229:
+ACCEL_BUFFER_ZERO(buffer_zero_int, 4*sizeof(long), long, IDENT_ZERO)
                                     ^

ERROR: architecture specific defines should be avoided
#238: FILE: util/cutils.c:243:
+#ifdef __ALTIVEC__

ERROR: space required before the open brace '{'
#250: FILE: util/cutils.c:255:
+#define DO_ZERO(X)  vec_all_eq(X, (__vector unsigned char){ 0 })

total: 4 errors, 0 warnings, 446 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 4/7: cutils: Add generic prefetch...
Checking PATCH 5/7: cutils: Rewrite x86 buffer zero checking...
ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#44: FILE: util/cutils.c:289:
+        __asm volatile("vmovdqu -0x80(%1),%0\n\t"

ERROR: externs should be avoided in .c files
#44: FILE: util/cutils.c:289:
+        __asm volatile("vmovdqu -0x80(%1),%0\n\t"

ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#72: FILE: util/cutils.c:317:
+        __asm volatile("vmovdqu -0x40(%1),%0\n\t"

ERROR: externs should be avoided in .c files
#72: FILE: util/cutils.c:317:
+        __asm volatile("vmovdqu -0x40(%1),%0\n\t"

ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#98: FILE: util/cutils.c:343:
+        __asm volatile("movdqu -0x40(%4),%0\n\t"

ERROR: externs should be avoided in .c files
#98: FILE: util/cutils.c:343:
+        __asm volatile("movdqu -0x40(%4),%0\n\t"

ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#132: FILE: util/cutils.c:374:
+        __asm volatile("movdqu -0x40(%4),%0\n\t"

ERROR: externs should be avoided in .c files
#132: FILE: util/cutils.c:374:
+        __asm volatile("movdqu -0x40(%4),%0\n\t"

ERROR: architecture specific defines should be avoided
#166: FILE: util/cutils.c:439:
+#ifdef __AVX2__

ERROR: architecture specific defines should be avoided
#169: FILE: util/cutils.c:442:
+#ifdef __AVX__

ERROR: architecture specific defines should be avoided
#172: FILE: util/cutils.c:445:
+#ifdef __SSE4_1__

ERROR: architecture specific defines should be avoided
#175: FILE: util/cutils.c:448:
+#ifdef __SSE2__

total: 12 errors, 0 warnings, 198 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 6/7: cutils: Rewrite aarch64 buffer zero checking...
Checking PATCH 7/7: cutils: Rewrite ppc buffer zero checking...
ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
#37: FILE: util/cutils.c:270:
+        __asm volatile("lvebx %0,%4,%5\n\t"

ERROR: externs should be avoided in .c files
#37: FILE: util/cutils.c:270:
+        __asm volatile("lvebx %0,%4,%5\n\t"

total: 2 errors, 0 warnings, 47 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24  4:30 ` [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero no-reply
@ 2016-08-24  4:38   ` Paolo Bonzini
  2016-08-24 14:53     ` Richard Henderson
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2016-08-24  4:38 UTC (permalink / raw)
  To: qemu-devel, rth; +Cc: peter.maydell, famz, vijay.kilari, qemu-arm



On 24/08/2016 06:30, no-reply@patchew.org wrote:
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #44: FILE: util/cutils.c:289:
> +        __asm volatile("vmovdqu -0x80(%1),%0\n\t"

Other errors can be ignored, but please use __asm__ __volatile__ here or
just __asm__ (I don't think volatile is useful).

Also, perhaps move this function to its own file since you're rewriting
it anyway?

Thanks,

Paolo

> ERROR: externs should be avoided in .c files
> #44: FILE: util/cutils.c:289:
> +        __asm volatile("vmovdqu -0x80(%1),%0\n\t"
> 
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #72: FILE: util/cutils.c:317:
> +        __asm volatile("vmovdqu -0x40(%1),%0\n\t"
> 
> ERROR: externs should be avoided in .c files
> #72: FILE: util/cutils.c:317:
> +        __asm volatile("vmovdqu -0x40(%1),%0\n\t"
> 
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #98: FILE: util/cutils.c:343:
> +        __asm volatile("movdqu -0x40(%4),%0\n\t"
> 
> ERROR: externs should be avoided in .c files
> #98: FILE: util/cutils.c:343:
> +        __asm volatile("movdqu -0x40(%4),%0\n\t"
> 
> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
> #132: FILE: util/cutils.c:374:
> +        __asm volatile("movdqu -0x40(%4),%0\n\t"
> 
> ERROR: externs should be avoided in .c files
> #132: FILE: util/cutils.c:374:
> +        __asm volatile("movdqu -0x40(%4),%0\n\t"

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (7 preceding siblings ...)
  2016-08-24  4:30 ` [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero no-reply
@ 2016-08-24  8:34 ` Dr. David Alan Gilbert
  2016-08-24 10:26   ` Adam Richter
  2016-08-25  6:37 ` Vijay Kilari
  9 siblings, 1 reply; 20+ messages in thread
From: Dr. David Alan Gilbert @ 2016-08-24  8:34 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, pbonzini, qemu-arm, vijay.kilari, peter.maydell, liang.z.li


cc'ing in Liang Li who did the original avx2 code.

Dave


* Richard Henderson (rth@twiddle.net) wrote:
> Patches 1-3 remove the use of ifunc from the implementation.
> 
> Patch 5 adjusts the x86 implementation a bit more to take
> advantage of ptest (in sse4.1) and unaligned accesses (in avx1).
> 
> Patches 2 and 6 are the result of my conversation with Vijaya
> Kumar with respect to ThunderX.
> 
> Patch 7 is the result of seeing some really really horrible code
> produced for ppc64le (gcc 4.9 and mainline).
> 
> This has had limited testing.  What I don't know is the best way
> to benchmark this -- the only way I know to trigger this is via
> the console, by hand, which doesn't make for reasonable timing.
> 
> 
> r~
> 
> 
> Richard Henderson (7):
>   cutils: Remove SPLAT macro
>   cutils: Export only buffer_is_zero
>   cutils: Rearrange buffer_is_zero acceleration
>   cutils: Add generic prefetch
>   cutils: Rewrite x86 buffer zero checking
>   cutils: Rewrite aarch64 buffer zero checking
>   cutils: Rewrite ppc buffer zero checking
> 
>  configure             |  21 +-
>  include/qemu/cutils.h |   2 -
>  migration/ram.c       |   2 +-
>  migration/rdma.c      |   5 +-
>  util/cutils.c         | 526 +++++++++++++++++++++++++++++++++-----------------
>  5 files changed, 352 insertions(+), 204 deletions(-)
> 
> -- 
> 2.7.4
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 2/7] cutils: Export only buffer_is_zero
  2016-08-24  4:17 ` [Qemu-devel] [PATCH 2/7] cutils: Export only buffer_is_zero Richard Henderson
@ 2016-08-24  8:37   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 20+ messages in thread
From: Dr. David Alan Gilbert @ 2016-08-24  8:37 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, pbonzini, qemu-arm, vijay.kilari, peter.maydell

* Richard Henderson (rth@twiddle.net) wrote:
> Since the two users don't make use of the returned offset,
> beyond ensuring that the entire buffer is zero, consider the
> can_use_buffer_find_nonzero_offset and buffer_find_nonzero_offset
> functions internal.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>

Thanks, I've wanted to kill that off for a while.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/qemu/cutils.h | 2 --
>  migration/ram.c       | 2 +-
>  migration/rdma.c      | 5 +----
>  util/cutils.c         | 8 ++++----
>  4 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> index 3e4ea23..ca58577 100644
> --- a/include/qemu/cutils.h
> +++ b/include/qemu/cutils.h
> @@ -168,8 +168,6 @@ int64_t qemu_strtosz_suffix_unit(const char *nptr, char **end,
>  /* used to print char* safely */
>  #define STR_OR_NULL(str) ((str) ? (str) : "null")
>  
> -bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len);
> -size_t buffer_find_nonzero_offset(const void *buf, size_t len);
>  bool buffer_is_zero(const void *buf, size_t len);
>  
>  /*
> diff --git a/migration/ram.c b/migration/ram.c
> index a3d70c4..a6e1c63 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -73,7 +73,7 @@ static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
>  
>  static inline bool is_zero_range(uint8_t *p, uint64_t size)
>  {
> -    return buffer_find_nonzero_offset(p, size) == size;
> +    return buffer_is_zero(p, size);
>  }
>  
>  /* struct contains XBZRLE cache and a static page
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 5110ec8..88bdb64 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1934,10 +1934,7 @@ retry:
>               * memset() + madvise() the entire chunk without RDMA.
>               */
>  
> -            if (can_use_buffer_find_nonzero_offset((void *)(uintptr_t)sge.addr,
> -                                                   length)
> -                   && buffer_find_nonzero_offset((void *)(uintptr_t)sge.addr,
> -                                                    length) == length) {
> +            if (buffer_is_zero((void *)(uintptr_t)sge.addr, length)) {
>                  RDMACompress comp = {
>                                          .offset = current_addr,
>                                          .value = 0,
> diff --git a/util/cutils.c b/util/cutils.c
> index 1c8635c..621ca67 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -327,9 +327,9 @@ static bool avx2_support(void)
>      return b & bit_AVX2;
>  }
>  
> -bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
> +static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
>           __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
> -size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
> +static size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
>           __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
>  
>  static void *buffer_find_nonzero_offset_ifunc(void)
> @@ -350,12 +350,12 @@ static void *can_use_buffer_find_nonzero_offset_ifunc(void)
>  }
>  #pragma GCC pop_options
>  #else
> -bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> +static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>  {
>      return can_use_buffer_find_nonzero_offset_inner(buf, len);
>  }
>  
> -size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> +static size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>  {
>      return buffer_find_nonzero_offset_inner(buf, len);
>  }
> -- 
> 2.7.4
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24  8:34 ` Dr. David Alan Gilbert
@ 2016-08-24 10:26   ` Adam Richter
  2016-08-24 10:52     ` Peter Maydell
  0 siblings, 1 reply; 20+ messages in thread
From: Adam Richter @ 2016-08-24 10:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Richard Henderson, peter.maydell, vijay.kilari, liang.z.li,
	qemu-devel, qemu-arm, pbonzini

> * Richard Henderson (rth@twiddle.net) wrote:
>> Patches 1-3 remove the use of ifunc from the implementation.
[...]

I am not a qemu developer, but I wanted to write in support of
removing the use of ifunc.

I filed a glibc bug at
https://sourceware.org/bugzilla/show_bug.cgi?id=20480 that I actually
found from these ifuncs in qemu that results in an attempt to execute
and unexecutable page, under unusual conditions that were arguably my
fault but that could happen on other systems.  I have only attempted
to implement a partial fix for this, and I think a complete fix would
be difficult, and the scenario that remains unfixed involves a
security policy that would probably be popular for systems hosting
virtual machine (prohibiting mapping pages simultaneiously writable
and executable).

I hope that that consideration, combined with the micro-costs to
readability and portability of using and ELF specific and perhaps
currently GCC specific feature might tip the balance against the
savings of a level of function call indirection that I assume the use
of ifunc was intended to provide.

Adam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24 10:26   ` Adam Richter
@ 2016-08-24 10:52     ` Peter Maydell
  2016-08-24 11:45       ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2016-08-24 10:52 UTC (permalink / raw)
  To: Adam Richter
  Cc: Dr. David Alan Gilbert, Richard Henderson, Vijay Kilari,
	Liang Li, QEMU Developers, qemu-arm, Paolo Bonzini

On 24 August 2016 at 11:26, Adam Richter <adamrichter4@gmail.com> wrote:
> I hope that that consideration, combined with the micro-costs to
> readability and portability of using and ELF specific and perhaps
> currently GCC specific feature might tip the balance against the
> savings of a level of function call indirection that I assume the use
> of ifunc was intended to provide.

It doesn't actually save a level of indirection -- if you single step
through an ifunc call it goes via some ELF section. The thing it
does save is that you don't pay the cost of figuring out the right
ifunc to use on this system at startup, but only when the ifunc call
path is first used. That's useful for a big thing like glibc which
might have lots of ifuncs and not want to pay a big startup cost,
but for QEMU there's really no need given we only have one...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24 10:52     ` Peter Maydell
@ 2016-08-24 11:45       ` Paolo Bonzini
  2016-08-24 12:22         ` Peter Maydell
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2016-08-24 11:45 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Adam Richter, Dr. David Alan Gilbert, Richard Henderson,
	Vijay Kilari, Liang Li, QEMU Developers, qemu-arm


> On 24 August 2016 at 11:26, Adam Richter <adamrichter4@gmail.com> wrote:
> > I hope that that consideration, combined with the micro-costs to
> > readability and portability of using and ELF specific and perhaps
> > currently GCC specific feature might tip the balance against the
> > savings of a level of function call indirection that I assume the use
> > of ifunc was intended to provide.
> 
> It doesn't actually save a level of indirection -- if you single step
> through an ifunc call it goes via some ELF section. The thing it
> does save is that you don't pay the cost of figuring out the right
> ifunc to use on this system at startup, but only when the ifunc call
> path is first used. That's useful for a big thing like glibc which
> might have lots of ifuncs and not want to pay a big startup cost,
> but for QEMU there's really no need given we only have one...

It does save a level of indirection after the first call AFAIK, but
it shouldn't be measurable.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24 11:45       ` Paolo Bonzini
@ 2016-08-24 12:22         ` Peter Maydell
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Maydell @ 2016-08-24 12:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Adam Richter, Dr. David Alan Gilbert, Richard Henderson,
	Vijay Kilari, Liang Li, QEMU Developers, qemu-arm

On 24 August 2016 at 12:45, Paolo Bonzini <pbonzini@redhat.com> wrote:
> It does save a level of indirection after the first call AFAIK, but
> it shouldn't be measurable.

It's worse on first call, but I don't think the subsequent calls
are better than straight pointer-indirection. It's been a while
since I looked though so I could be misremembering.

-- PMM

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24  4:38   ` Paolo Bonzini
@ 2016-08-24 14:53     ` Richard Henderson
  2016-08-24 14:59       ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2016-08-24 14:53 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: peter.maydell, qemu-arm, famz, vijay.kilari

On 08/23/2016 09:38 PM, Paolo Bonzini wrote:
>
>
> On 24/08/2016 06:30, no-reply@patchew.org wrote:
>> ERROR: Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt
>> #44: FILE: util/cutils.c:289:
>> +        __asm volatile("vmovdqu -0x80(%1),%0\n\t"
>
> Other errors can be ignored, but please use __asm__ __volatile__ here or
> just __asm__ (I don't think volatile is useful).

I had to add volatile to keep the prefetch in advance of the loop.
I suppose I could just add the prefetch to the asm block...


> Also, perhaps move this function to its own file since you're rewriting
> it anyway?

Sure.


r~

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24 14:53     ` Richard Henderson
@ 2016-08-24 14:59       ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2016-08-24 14:59 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, peter maydell, qemu-arm, famz, vijay kilari



----- Original Message -----
> From: "Richard Henderson" <rth@twiddle.net>
> To: "Paolo Bonzini" <pbonzini@redhat.com>, qemu-devel@nongnu.org
> Cc: "peter maydell" <peter.maydell@linaro.org>, qemu-arm@nongnu.org, famz@redhat.com, "vijay kilari"
> <vijay.kilari@gmail.com>
> Sent: Wednesday, August 24, 2016 4:53:37 PM
> Subject: Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
> 
> On 08/23/2016 09:38 PM, Paolo Bonzini wrote:
> >
> >
> > On 24/08/2016 06:30, no-reply@patchew.org wrote:
> >> ERROR: Use of volatile is usually wrong: see
> >> Documentation/volatile-considered-harmful.txt
> >> #44: FILE: util/cutils.c:289:
> >> +        __asm volatile("vmovdqu -0x80(%1),%0\n\t"
> >
> > Other errors can be ignored, but please use __asm__ __volatile__ here or
> > just __asm__ (I don't think volatile is useful).
> 
> I had to add volatile to keep the prefetch in advance of the loop.
> I suppose I could just add the prefetch to the asm block...

Probably easiest, or maybe add barrier() too.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
                   ` (8 preceding siblings ...)
  2016-08-24  8:34 ` Dr. David Alan Gilbert
@ 2016-08-25  6:37 ` Vijay Kilari
  2016-08-25  8:04   ` Vijay Kilari
  9 siblings, 1 reply; 20+ messages in thread
From: Vijay Kilari @ 2016-08-25  6:37 UTC (permalink / raw)
  To: Richard Henderson; +Cc: QEMU Developers, qemu-arm, Paolo Bonzini, Peter Maydell

Hi Richard,

  Migration fails on arm64 with these patches.
On the destination VM, follow errors are appearing.

qemu-system-aarch64: VQ 0 size 0x400 Guest index 0x0 inconsistent with
Host index 0x1937: delta 0xe6c9
qemu-system-aarch64: error while loading state for instance 0x0 of
device 'virtio-mmio@000000000a003e00/virtio-net'
qemu-system-aarch64: load of migration failed: Operation not permitted
qemu-system-aarch64: network script /etc/qemu-ifdown failed with status 256

Regards
Vijay


On Wed, Aug 24, 2016 at 9:47 AM, Richard Henderson <rth@twiddle.net> wrote:
> Patches 1-3 remove the use of ifunc from the implementation.
>
> Patch 5 adjusts the x86 implementation a bit more to take
> advantage of ptest (in sse4.1) and unaligned accesses (in avx1).
>
> Patches 2 and 6 are the result of my conversation with Vijaya
> Kumar with respect to ThunderX.
>
> Patch 7 is the result of seeing some really really horrible code
> produced for ppc64le (gcc 4.9 and mainline).
>
> This has had limited testing.  What I don't know is the best way
> to benchmark this -- the only way I know to trigger this is via
> the console, by hand, which doesn't make for reasonable timing.
>
>
> r~
>
>
> Richard Henderson (7):
>   cutils: Remove SPLAT macro
>   cutils: Export only buffer_is_zero
>   cutils: Rearrange buffer_is_zero acceleration
>   cutils: Add generic prefetch
>   cutils: Rewrite x86 buffer zero checking
>   cutils: Rewrite aarch64 buffer zero checking
>   cutils: Rewrite ppc buffer zero checking
>
>  configure             |  21 +-
>  include/qemu/cutils.h |   2 -
>  migration/ram.c       |   2 +-
>  migration/rdma.c      |   5 +-
>  util/cutils.c         | 526 +++++++++++++++++++++++++++++++++-----------------
>  5 files changed, 352 insertions(+), 204 deletions(-)
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero
  2016-08-25  6:37 ` Vijay Kilari
@ 2016-08-25  8:04   ` Vijay Kilari
  0 siblings, 0 replies; 20+ messages in thread
From: Vijay Kilari @ 2016-08-25  8:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: QEMU Developers, qemu-arm, Paolo Bonzini, Peter Maydell

On Thu, Aug 25, 2016 at 12:07 PM, Vijay Kilari <vijay.kilari@gmail.com> wrote:
> Hi Richard,
>
>   Migration fails on arm64 with these patches.
> On the destination VM, follow errors are appearing.
>
> qemu-system-aarch64: VQ 0 size 0x400 Guest index 0x0 inconsistent with
> Host index 0x1937: delta 0xe6c9
> qemu-system-aarch64: error while loading state for instance 0x0 of
> device 'virtio-mmio@000000000a003e00/virtio-net'
> qemu-system-aarch64: load of migration failed: Operation not permitted
> qemu-system-aarch64: network script /etc/qemu-ifdown failed with status 256

With below changes, migration is working fine on arm64.

diff --git a/util/cutils.c b/util/cutils.c
index 30fac02..9bbf31f 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -170,6 +170,7 @@ static bool __attribute__((noinline))
             \
 NAME(const void *buf, size_t len)                               \
 {                                                               \
     const void *end = buf + len;                                \
+    const VECTYPE zero = (VECTYPE){0};                          \
     do {                                                        \
         const VECTYPE *p = buf;                                 \
         VECTYPE t;                                              \
@@ -185,7 +186,7 @@ NAME(const void *buf, size_t len)
             \
         } else {                                                \
             link_error();                                       \
         }                                                       \
-        if (unlikely(!ZERO(t))) {                               \
+        if (unlikely(!ZERO(t, zero))) {                         \
             return false;                                       \
         }                                                       \
         buf += SIZE;                                            \
@@ -227,7 +228,7 @@ buffer_zero_base(const void *buf, size_t len)
     return true;
 }
-#define IDENT_ZERO(X)  (X)
+#define IDENT_ZERO(X1, X2)  (X1 == X2)
 ACCEL_BUFFER_ZERO(buffer_zero_int, 4*sizeof(long), long, IDENT_ZERO)

 static bool select_accel_int(const void *buf, size_t len)
@@ -511,7 +512,9 @@ static bool select_accel_fn(const void *buf, size_t len)
 #elif defined(__aarch64__)
 #include "arm_neon.h"

-#define DO_ZERO(X)  (vgetq_lane_u64((X), 0) | vgetq_lane_u64((X), 1))
+#define DO_ZERO(X1, X2) \
+        ((vgetq_lane_u64(X1, 0) == vgetq_lane_u64(X2, 0)) && \
+         (vgetq_lane_u64(X1, 1) == vgetq_lane_u64(X2, 1)))
 ACCEL_BUFFER_ZERO(buffer_zero_neon_64, 64, uint64x2_t, DO_ZERO)
 ACCEL_BUFFER_ZERO(buffer_zero_neon_128, 128, uint64x2_t, DO_ZERO)

@@ -526,7 +529,7 @@ static void __attribute__((constructor))
init_buffer_zero_accel(void)
        since the later is not available to userspace.  This seems
        to work in practice for existing implementations.  */
     asm("mrs %0, dczid_el0" : "=r"(t));
-    if ((t & 15) * 16 >= 128) {
+    if (pow(2, (t & 0xf)) * 4 >= 128) {
         buffer_zero_line_mask = 128 - 1;
         buffer_zero_accel = buffer_zero_neon_128;
     } else {


>
> Regards
> Vijay
>
>
> On Wed, Aug 24, 2016 at 9:47 AM, Richard Henderson <rth@twiddle.net> wrote:
>> Patches 1-3 remove the use of ifunc from the implementation.
>>
>> Patch 5 adjusts the x86 implementation a bit more to take
>> advantage of ptest (in sse4.1) and unaligned accesses (in avx1).
>>
>> Patches 2 and 6 are the result of my conversation with Vijaya
>> Kumar with respect to ThunderX.
>>
>> Patch 7 is the result of seeing some really really horrible code
>> produced for ppc64le (gcc 4.9 and mainline).
>>
>> This has had limited testing.  What I don't know is the best way
>> to benchmark this -- the only way I know to trigger this is via
>> the console, by hand, which doesn't make for reasonable timing.
>>
>>
>> r~
>>
>>
>> Richard Henderson (7):
>>   cutils: Remove SPLAT macro
>>   cutils: Export only buffer_is_zero
>>   cutils: Rearrange buffer_is_zero acceleration
>>   cutils: Add generic prefetch
>>   cutils: Rewrite x86 buffer zero checking
>>   cutils: Rewrite aarch64 buffer zero checking
>>   cutils: Rewrite ppc buffer zero checking
>>
>>  configure             |  21 +-
>>  include/qemu/cutils.h |   2 -
>>  migration/ram.c       |   2 +-
>>  migration/rdma.c      |   5 +-
>>  util/cutils.c         | 526 +++++++++++++++++++++++++++++++++-----------------
>>  5 files changed, 352 insertions(+), 204 deletions(-)
>>
>> --
>> 2.7.4
>>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-08-25  8:04 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-24  4:17 [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero Richard Henderson
2016-08-24  4:17 ` [Qemu-devel] [PATCH 1/7] cutils: Remove SPLAT macro Richard Henderson
2016-08-24  4:17 ` [Qemu-devel] [PATCH 2/7] cutils: Export only buffer_is_zero Richard Henderson
2016-08-24  8:37   ` Dr. David Alan Gilbert
2016-08-24  4:17 ` [Qemu-devel] [PATCH 3/7] cutils: Rearrange buffer_is_zero acceleration Richard Henderson
2016-08-24  4:17 ` [Qemu-devel] [PATCH 4/7] cutils: Add generic prefetch Richard Henderson
2016-08-24  4:17 ` [Qemu-devel] [PATCH 5/7] cutils: Rewrite x86 buffer zero checking Richard Henderson
2016-08-24  4:17 ` [Qemu-devel] [PATCH 6/7] cutils: Rewrite aarch64 " Richard Henderson
2016-08-24  4:17 ` [Qemu-devel] [PATCH 7/7] cutils: Rewrite ppc " Richard Henderson
2016-08-24  4:30 ` [Qemu-devel] [PATCH 0/7] Improve buffer_is_zero no-reply
2016-08-24  4:38   ` Paolo Bonzini
2016-08-24 14:53     ` Richard Henderson
2016-08-24 14:59       ` Paolo Bonzini
2016-08-24  8:34 ` Dr. David Alan Gilbert
2016-08-24 10:26   ` Adam Richter
2016-08-24 10:52     ` Peter Maydell
2016-08-24 11:45       ` Paolo Bonzini
2016-08-24 12:22         ` Peter Maydell
2016-08-25  6:37 ` Vijay Kilari
2016-08-25  8:04   ` Vijay Kilari

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.