All of lore.kernel.org
* [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero
@ 2016-08-24 17:48 Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 1/8] cutils: Move buffer_is_zero and subroutines to a new file Richard Henderson
                   ` (8 more replies)
  0 siblings, 9 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell

Patches 1-4 remove the use of ifunc from the implementation.

Patch 6 adjusts the x86 implementation a bit more to take
advantage of ptest (in sse4.1) and unaligned accesses (in avx1).
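For reference, the ptest idiom mentioned here replaces the usual
pcmpeqb/pmovmskb pair with a single instruction that sets ZF directly
from the vector.  A rough standalone sketch, not the patch's actual
code (check16 is an invented name):

```c
#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdbool.h>

/* Test 16 bytes for zero with a single PTEST instruction.
 * _mm_testz_si128(x, x) returns nonzero iff (x & x) == 0,
 * i.e. all 128 bits are zero -- no compare+movemask needed. */
__attribute__((target("sse4.1")))
static bool check16(const void *p)
{
    __m128i x = _mm_loadu_si128((const __m128i *)p);  /* unaligned load ok */
    return _mm_testz_si128(x, x);
}
```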

Patches 5 and 7 are the result of my conversation with Vijaya
Kumar with respect to ThunderX.
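A generic software prefetch presumably amounts to issuing a hint a
fixed distance ahead of the scan loop; a minimal hypothetical sketch
(not the patch itself -- the distance and the function name are made
up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Scan a buffer of longs, prefetching 256 bytes ahead so later
 * loads hit cache.  __builtin_prefetch compiles to a no-op on
 * targets without a prefetch instruction, so this stays generic. */
static bool is_zero_prefetch(const void *buf, size_t len)
{
    const long *p = buf;
    size_t i, n = len / sizeof(long);
    size_t stride = 256 / sizeof(long);

    for (i = 0; i < n; i++) {
        if (i % stride == 0) {
            /* read prefetch, low temporal locality */
            __builtin_prefetch(p + i + stride, 0, 0);
        }
        if (p[i]) {
            return false;
        }
    }
    return true;
}
```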

Patch 8 is the result of seeing some really really horrible code
produced for ppc64le (gcc 4.9 and mainline).

This has had limited testing.  What I don't know is the best way
to benchmark this -- the only way I know to exercise this code
path is via the console, by hand, which doesn't make for
reasonable timing.

Changes v1-v2:
  * Add patch 1, moving everything to a new file.
  * Fix a typo or two, which had the wrong sense of zero test.
    These had mostly been fixed in the intermediate patches,
    but it wouldn't have helped bisection.


r~


Richard Henderson (8):
  cutils: Move buffer_is_zero and subroutines to a new file
  cutils: Remove SPLAT macro
  cutils: Export only buffer_is_zero
  cutils: Rearrange buffer_is_zero acceleration
  cutils: Add generic prefetch
  cutils: Rewrite x86 buffer zero checking
  cutils: Rewrite aarch64 buffer zero checking
  cutils: Rewrite ppc buffer zero checking

 configure             |  21 +--
 include/qemu/cutils.h |   2 -
 migration/ram.c       |   2 +-
 migration/rdma.c      |   5 +-
 util/Makefile.objs    |   1 +
 util/bufferiszero.c   | 432 ++++++++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c         | 244 ----------------------------
 7 files changed, 441 insertions(+), 266 deletions(-)
 create mode 100644 util/bufferiszero.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 1/8] cutils: Move buffer_is_zero and subroutines to a new file
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 3/8] cutils: Export only buffer_is_zero Richard Henderson
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/Makefile.objs  |   1 +
 util/bufferiszero.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c       | 244 ----------------------------------------------
 3 files changed, 273 insertions(+), 244 deletions(-)
 create mode 100644 util/bufferiszero.c

diff --git a/util/Makefile.objs b/util/Makefile.objs
index 96cb1e0..ffca8f3 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,5 @@
 util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y += bufferiszero.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
new file mode 100644
index 0000000..9bb1ae5
--- /dev/null
+++ b/util/bufferiszero.c
@@ -0,0 +1,272 @@
+/*
+ * Simple C functions to supplement the C library
+ *
+ * Copyright (c) 2006 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/cutils.h"
+
+
+/* vector definitions */
+#ifdef __ALTIVEC__
+#include <altivec.h>
+/* The altivec.h header says we're allowed to undef these for
+ * C++ compatibility.  Here we don't care about C++, but we
+ * undef them anyway to avoid namespace pollution.
+ */
+#undef vector
+#undef pixel
+#undef bool
+#define VECTYPE        __vector unsigned char
+#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
+#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
+#define VEC_OR(v1, v2) ((v1) | (v2))
+/* altivec.h may redefine the bool macro as vector type.
+ * Reset it to POSIX semantics. */
+#define bool _Bool
+#elif defined __SSE2__
+#include <emmintrin.h>
+#define VECTYPE        __m128i
+#define SPLAT(p)       _mm_set1_epi8(*(p))
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
+#elif defined(__aarch64__)
+#include "arm_neon.h"
+#define VECTYPE        uint64x2_t
+#define ALL_EQ(v1, v2) \
+        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
+         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
+#define VEC_OR(v1, v2) ((v1) | (v2))
+#else
+#define VECTYPE        unsigned long
+#define SPLAT(p)       (*(p) * (~0UL / 255))
+#define ALL_EQ(v1, v2) ((v1) == (v2))
+#define VEC_OR(v1, v2) ((v1) | (v2))
+#endif
+
+#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
+
+static bool
+can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
+{
+    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+                   * sizeof(VECTYPE)) == 0
+            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
+}
+
+/*
+ * Searches for an area with non-zero content in a buffer
+ *
+ * Attention! The len must be a multiple of
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
+ * and addr must be a multiple of sizeof(VECTYPE) due to
+ * restriction of optimizations in this function.
+ *
+ * can_use_buffer_find_nonzero_offset_inner() can be used to
+ * check these requirements.
+ *
+ * The return value is the offset of the non-zero area rounded
+ * down to a multiple of sizeof(VECTYPE) for the first
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
+ * afterwards.
+ *
+ * If the buffer is all zero the return value is equal to len.
+ */
+
+static size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
+{
+    const VECTYPE *p = buf;
+    const VECTYPE zero = (VECTYPE){0};
+    size_t i;
+
+    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
+
+    if (!len) {
+        return 0;
+    }
+
+    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
+        if (!ALL_EQ(p[i], zero)) {
+            return i * sizeof(VECTYPE);
+        }
+    }
+
+    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
+         i < len / sizeof(VECTYPE);
+         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+        VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
+        VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
+        VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
+        VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
+        VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
+        VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
+        if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
+            break;
+        }
+    }
+
+    return i * sizeof(VECTYPE);
+}
+
+#if defined CONFIG_AVX2_OPT
+#pragma GCC push_options
+#pragma GCC target("avx2")
+#include <cpuid.h>
+#include <immintrin.h>
+
+#define AVX2_VECTYPE        __m256i
+#define AVX2_SPLAT(p)       _mm256_set1_epi8(*(p))
+#define AVX2_ALL_EQ(v1, v2) \
+    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
+#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
+
+static bool
+can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+{
+    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+                   * sizeof(AVX2_VECTYPE)) == 0
+            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0);
+}
+
+static size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+{
+    const AVX2_VECTYPE *p = buf;
+    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
+    size_t i;
+
+    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
+
+    if (!len) {
+        return 0;
+    }
+
+    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
+        if (!AVX2_ALL_EQ(p[i], zero)) {
+            return i * sizeof(AVX2_VECTYPE);
+        }
+    }
+
+    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
+         i < len / sizeof(AVX2_VECTYPE);
+         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
+        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
+        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
+        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
+        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
+        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
+        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
+            break;
+        }
+    }
+
+    return i * sizeof(AVX2_VECTYPE);
+}
+
+static bool avx2_support(void)
+{
+    int a, b, c, d;
+
+    if (__get_cpuid_max(0, NULL) < 7) {
+        return false;
+    }
+
+    __cpuid_count(7, 0, a, b, c, d);
+
+    return b & bit_AVX2;
+}
+
+bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
+         __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
+size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
+         __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
+
+static void *buffer_find_nonzero_offset_ifunc(void)
+{
+    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
+        buffer_find_nonzero_offset_avx2 : buffer_find_nonzero_offset_inner;
+
+    return func;
+}
+
+static void *can_use_buffer_find_nonzero_offset_ifunc(void)
+{
+    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
+        can_use_buffer_find_nonzero_offset_avx2 :
+        can_use_buffer_find_nonzero_offset_inner;
+
+    return func;
+}
+#pragma GCC pop_options
+#else
+bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    return can_use_buffer_find_nonzero_offset_inner(buf, len);
+}
+
+size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    return buffer_find_nonzero_offset_inner(buf, len);
+}
+#endif
+
+/*
+ * Checks if a buffer is all zeroes
+ *
+ * Attention! The len must be a multiple of 4 * sizeof(long) due to
+ * restriction of optimizations in this function.
+ */
+bool buffer_is_zero(const void *buf, size_t len)
+{
+    /*
+     * Use long as the biggest available internal data type that fits into the
+     * CPU register and unroll the loop to smooth out the effect of memory
+     * latency.
+     */
+
+    size_t i;
+    long d0, d1, d2, d3;
+    const long * const data = buf;
+
+    /* use vector optimized zero check if possible */
+    if (can_use_buffer_find_nonzero_offset(buf, len)) {
+        return buffer_find_nonzero_offset(buf, len) == len;
+    }
+
+    assert(len % (4 * sizeof(long)) == 0);
+    len /= sizeof(long);
+
+    for (i = 0; i < len; i += 4) {
+        d0 = data[i + 0];
+        d1 = data[i + 1];
+        d2 = data[i + 2];
+        d3 = data[i + 3];
+
+        if (d0 || d1 || d2 || d3) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
diff --git a/util/cutils.c b/util/cutils.c
index 7505fda..4fefcf3 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -161,250 +161,6 @@ int qemu_fdatasync(int fd)
 #endif
 }
 
-/* vector definitions */
-#ifdef __ALTIVEC__
-#include <altivec.h>
-/* The altivec.h header says we're allowed to undef these for
- * C++ compatibility.  Here we don't care about C++, but we
- * undef them anyway to avoid namespace pollution.
- */
-#undef vector
-#undef pixel
-#undef bool
-#define VECTYPE        __vector unsigned char
-#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
-#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
-#define VEC_OR(v1, v2) ((v1) | (v2))
-/* altivec.h may redefine the bool macro as vector type.
- * Reset it to POSIX semantics. */
-#define bool _Bool
-#elif defined __SSE2__
-#include <emmintrin.h>
-#define VECTYPE        __m128i
-#define SPLAT(p)       _mm_set1_epi8(*(p))
-#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
-#define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
-#elif defined(__aarch64__)
-#include "arm_neon.h"
-#define VECTYPE        uint64x2_t
-#define ALL_EQ(v1, v2) \
-        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
-         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
-#define VEC_OR(v1, v2) ((v1) | (v2))
-#else
-#define VECTYPE        unsigned long
-#define SPLAT(p)       (*(p) * (~0UL / 255))
-#define ALL_EQ(v1, v2) ((v1) == (v2))
-#define VEC_OR(v1, v2) ((v1) | (v2))
-#endif
-
-#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
-
-static bool
-can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
-{
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
-}
-
-/*
- * Searches for an area with non-zero content in a buffer
- *
- * Attention! The len must be a multiple of
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
- * and addr must be a multiple of sizeof(VECTYPE) due to
- * restriction of optimizations in this function.
- *
- * can_use_buffer_find_nonzero_offset_inner() can be used to
- * check these requirements.
- *
- * The return value is the offset of the non-zero area rounded
- * down to a multiple of sizeof(VECTYPE) for the first
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
- * afterwards.
- *
- * If the buffer is all zero the return value is equal to len.
- */
-
-static size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
-{
-    const VECTYPE *p = buf;
-    const VECTYPE zero = (VECTYPE){0};
-    size_t i;
-
-    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
-
-    if (!len) {
-        return 0;
-    }
-
-    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
-        if (!ALL_EQ(p[i], zero)) {
-            return i * sizeof(VECTYPE);
-        }
-    }
-
-    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
-         i < len / sizeof(VECTYPE);
-         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
-        VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
-        VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
-        VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
-        VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
-        VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
-        VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
-        if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
-            break;
-        }
-    }
-
-    return i * sizeof(VECTYPE);
-}
-
-#if defined CONFIG_AVX2_OPT
-#pragma GCC push_options
-#pragma GCC target("avx2")
-#include <cpuid.h>
-#include <immintrin.h>
-
-#define AVX2_VECTYPE        __m256i
-#define AVX2_SPLAT(p)       _mm256_set1_epi8(*(p))
-#define AVX2_ALL_EQ(v1, v2) \
-    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
-#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
-
-static bool
-can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
-{
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(AVX2_VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0);
-}
-
-static size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
-{
-    const AVX2_VECTYPE *p = buf;
-    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
-    size_t i;
-
-    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
-
-    if (!len) {
-        return 0;
-    }
-
-    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
-        if (!AVX2_ALL_EQ(p[i], zero)) {
-            return i * sizeof(AVX2_VECTYPE);
-        }
-    }
-
-    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
-         i < len / sizeof(AVX2_VECTYPE);
-         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
-        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
-        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
-        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
-        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
-        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
-        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
-        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
-            break;
-        }
-    }
-
-    return i * sizeof(AVX2_VECTYPE);
-}
-
-static bool avx2_support(void)
-{
-    int a, b, c, d;
-
-    if (__get_cpuid_max(0, NULL) < 7) {
-        return false;
-    }
-
-    __cpuid_count(7, 0, a, b, c, d);
-
-    return b & bit_AVX2;
-}
-
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
-         __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
-size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
-         __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
-
-static void *buffer_find_nonzero_offset_ifunc(void)
-{
-    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
-        buffer_find_nonzero_offset_avx2 : buffer_find_nonzero_offset_inner;
-
-    return func;
-}
-
-static void *can_use_buffer_find_nonzero_offset_ifunc(void)
-{
-    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
-        can_use_buffer_find_nonzero_offset_avx2 :
-        can_use_buffer_find_nonzero_offset_inner;
-
-    return func;
-}
-#pragma GCC pop_options
-#else
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
-{
-    return can_use_buffer_find_nonzero_offset_inner(buf, len);
-}
-
-size_t buffer_find_nonzero_offset(const void *buf, size_t len)
-{
-    return buffer_find_nonzero_offset_inner(buf, len);
-}
-#endif
-
-/*
- * Checks if a buffer is all zeroes
- *
- * Attention! The len must be a multiple of 4 * sizeof(long) due to
- * restriction of optimizations in this function.
- */
-bool buffer_is_zero(const void *buf, size_t len)
-{
-    /*
-     * Use long as the biggest available internal data type that fits into the
-     * CPU register and unroll the loop to smooth out the effect of memory
-     * latency.
-     */
-
-    size_t i;
-    long d0, d1, d2, d3;
-    const long * const data = buf;
-
-    /* use vector optimized zero check if possible */
-    if (can_use_buffer_find_nonzero_offset(buf, len)) {
-        return buffer_find_nonzero_offset(buf, len) == len;
-    }
-
-    assert(len % (4 * sizeof(long)) == 0);
-    len /= sizeof(long);
-
-    for (i = 0; i < len; i += 4) {
-        d0 = data[i + 0];
-        d1 = data[i + 1];
-        d2 = data[i + 2];
-        d3 = data[i + 3];
-
-        if (d0 || d1 || d2 || d3) {
-            return false;
-        }
-    }
-
-    return true;
-}
-
 #ifndef _WIN32
 /* Sets a specific flag */
 int fcntl_setfl(int fd, int flag)
-- 
2.7.4


* [Qemu-devel] [PATCH v2 3/8] cutils: Export only buffer_is_zero
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 1/8] cutils: Move buffer_is_zero and subroutines to a new file Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 4/8] cutils: Rearrange buffer_is_zero acceleration Richard Henderson
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell

Since the two users don't make use of the returned offset,
beyond ensuring that the entire buffer is zero, consider the
can_use_buffer_find_nonzero_offset and buffer_find_nonzero_offset
functions internal.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/qemu/cutils.h | 2 --
 migration/ram.c       | 2 +-
 migration/rdma.c      | 5 +----
 util/bufferiszero.c   | 8 ++++----
 4 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 3e4ea23..ca58577 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -168,8 +168,6 @@ int64_t qemu_strtosz_suffix_unit(const char *nptr, char **end,
 /* used to print char* safely */
 #define STR_OR_NULL(str) ((str) ? (str) : "null")
 
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len);
-size_t buffer_find_nonzero_offset(const void *buf, size_t len);
 bool buffer_is_zero(const void *buf, size_t len);
 
 /*
diff --git a/migration/ram.c b/migration/ram.c
index a3d70c4..a6e1c63 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -73,7 +73,7 @@ static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
 
 static inline bool is_zero_range(uint8_t *p, uint64_t size)
 {
-    return buffer_find_nonzero_offset(p, size) == size;
+    return buffer_is_zero(p, size);
 }
 
 /* struct contains XBZRLE cache and a static page
diff --git a/migration/rdma.c b/migration/rdma.c
index 5110ec8..88bdb64 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1934,10 +1934,7 @@ retry:
              * memset() + madvise() the entire chunk without RDMA.
              */
 
-            if (can_use_buffer_find_nonzero_offset((void *)(uintptr_t)sge.addr,
-                                                   length)
-                   && buffer_find_nonzero_offset((void *)(uintptr_t)sge.addr,
-                                                    length) == length) {
+            if (buffer_is_zero((void *)(uintptr_t)sge.addr, length)) {
                 RDMACompress comp = {
                                         .offset = current_addr,
                                         .value = 0,
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 067d08f..0cf8b6e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -192,9 +192,9 @@ static bool avx2_support(void)
     return b & bit_AVX2;
 }
 
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
+static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
          __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
-size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
+static size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
          __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
 
 static void *buffer_find_nonzero_offset_ifunc(void)
@@ -215,12 +215,12 @@ static void *can_use_buffer_find_nonzero_offset_ifunc(void)
 }
 #pragma GCC pop_options
 #else
-bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
 {
     return can_use_buffer_find_nonzero_offset_inner(buf, len);
 }
 
-size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+static size_t buffer_find_nonzero_offset(const void *buf, size_t len)
 {
     return buffer_find_nonzero_offset_inner(buf, len);
 }
-- 
2.7.4


* [Qemu-devel] [PATCH v2 4/8] cutils: Rearrange buffer_is_zero acceleration
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 1/8] cutils: Move buffer_is_zero and subroutines to a new file Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 3/8] cutils: Export only buffer_is_zero Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 5/8] cutils: Add generic prefetch Richard Henderson
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell

Allow selection of several acceleration functions
based on the size and alignment of the buffer.
Do not require ifunc support for AVX2 acceleration.
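The ifunc-free dispatch this enables can be sketched as a plain
function pointer resolved lazily on first call rather than by the
dynamic loader.  An illustrative standalone sketch -- names are
invented, and the real patch also selects on buffer size and
alignment:

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*zero_fn)(const void *, size_t);

static bool zero_portable(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;
    for (i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* In the real code this would probe CPUID for AVX2 and return an
 * accelerated variant; here we always fall back to the portable one. */
static zero_fn select_zero_fn(void)
{
    return zero_portable;
}

static zero_fn zero_impl;   /* resolved on first use, not by the loader */

bool my_buffer_is_zero(const void *buf, size_t len)
{
    if (!zero_impl) {
        zero_impl = select_zero_fn();
    }
    return zero_impl(buf, len);
}
```

Unlike ifunc, this needs no toolchain or libc support, at the cost of
one predictable branch per call.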

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 configure           |  21 +---
 util/bufferiszero.c | 352 +++++++++++++++++++++++++---------------------------
 2 files changed, 172 insertions(+), 201 deletions(-)

diff --git a/configure b/configure
index 4b808f9..9f3d1fa 100755
--- a/configure
+++ b/configure
@@ -1788,28 +1788,19 @@ fi
 ##########################################
 # avx2 optimization requirement check
 
-
-if test "$static" = "no" ; then
-  cat > $TMPC << EOF
+cat > $TMPC << EOF
 #pragma GCC push_options
 #pragma GCC target("avx2")
 #include <cpuid.h>
 #include <immintrin.h>
-
 static int bar(void *a) {
-    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(*(__m256i *)a, (__m256i){0}));
+    __m256i x = *(__m256i *)a;
+    return _mm256_testz_si256(x, x);
 }
-static void *bar_ifunc(void) {return (void*) bar;}
-int foo(void *a) __attribute__((ifunc("bar_ifunc")));
-int main(int argc, char *argv[]) { return foo(argv[0]);}
+int main(int argc, char *argv[]) { return bar(argv[0]); }
 EOF
-  if compile_object "" ; then
-      if has readelf; then
-          if readelf --syms $TMPO 2>/dev/null |grep -q "IFUNC.*foo"; then
-              avx2_opt="yes"
-          fi
-      fi
-  fi
+if compile_object "" ; then
+  avx2_opt="yes"
 fi
 
 #########################################
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 0cf8b6e..5246c5b 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -27,242 +27,222 @@
 
 
 /* vector definitions */
-#ifdef __ALTIVEC__
-#include <altivec.h>
-/* The altivec.h header says we're allowed to undef these for
- * C++ compatibility.  Here we don't care about C++, but we
- * undef them anyway to avoid namespace pollution.
- */
-#undef vector
-#undef pixel
-#undef bool
-#define VECTYPE        __vector unsigned char
-#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
-#define VEC_OR(v1, v2) ((v1) | (v2))
-/* altivec.h may redefine the bool macro as vector type.
- * Reset it to POSIX semantics. */
-#define bool _Bool
-#elif defined __SSE2__
-#include <emmintrin.h>
-#define VECTYPE        __m128i
-#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
-#define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
-#elif defined(__aarch64__)
-#include "arm_neon.h"
-#define VECTYPE        uint64x2_t
-#define ALL_EQ(v1, v2) \
-        ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
-         (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
-#define VEC_OR(v1, v2) ((v1) | (v2))
-#else
-#define VECTYPE        unsigned long
-#define ALL_EQ(v1, v2) ((v1) == (v2))
-#define VEC_OR(v1, v2) ((v1) | (v2))
-#endif
 
-#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
-
-static bool
-can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
-{
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
+extern void link_error(void);
+
+#define ACCEL_BUFFER_ZERO(NAME, SIZE, VECTYPE, NONZERO)         \
+static bool __attribute__((noinline))                           \
+NAME(const void *buf, size_t len)                               \
+{                                                               \
+    const void *end = buf + len;                                \
+    do {                                                        \
+        const VECTYPE *p = buf;                                 \
+        VECTYPE t;                                              \
+        if (SIZE == sizeof(VECTYPE) * 4) {                      \
+            t = (p[0] | p[1]) | (p[2] | p[3]);                  \
+        } else if (SIZE == sizeof(VECTYPE) * 8) {               \
+            t  = p[0] | p[1];                                   \
+            t |= p[2] | p[3];                                   \
+            t |= p[4] | p[5];                                   \
+            t |= p[6] | p[7];                                   \
+        } else {                                                \
+            link_error();                                       \
+        }                                                       \
+        if (unlikely(NONZERO(t))) {                             \
+            return false;                                       \
+        }                                                       \
+        buf += SIZE;                                            \
+    } while (buf < end);                                        \
+    return true;                                                \
 }
 
-/*
- * Searches for an area with non-zero content in a buffer
- *
- * Attention! The len must be a multiple of
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
- * and addr must be a multiple of sizeof(VECTYPE) due to
- * restriction of optimizations in this function.
- *
- * can_use_buffer_find_nonzero_offset_inner() can be used to
- * check these requirements.
- *
- * The return value is the offset of the non-zero area rounded
- * down to a multiple of sizeof(VECTYPE) for the first
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks and down to
- * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
- * afterwards.
- *
- * If the buffer is all zero the return value is equal to len.
- */
+typedef bool (*accel_zero_fn)(const void *, size_t);
 
-static size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
+static bool __attribute__((noinline))
+buffer_zero_base(const void *buf, size_t len)
 {
-    const VECTYPE *p = buf;
-    const VECTYPE zero = (VECTYPE){0};
     size_t i;
 
-    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
-
-    if (!len) {
-        return 0;
+    /* Check bytes until the buffer is aligned.  */
+    for (i = 0; i < len && ((uintptr_t)buf + i) % sizeof(long); ++i) {
+        const char *p = buf + i;
+        if (*p) {
+            return false;
+        }
     }
 
-    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
-        if (!ALL_EQ(p[i], zero)) {
-            return i * sizeof(VECTYPE);
+    /* Check longs until we run out.  */
+    for (; i + sizeof(long) <= len; i += sizeof(long)) {
+        const long *p = buf + i;
+        if (*p) {
+            return false;
         }
     }
 
-    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
-         i < len / sizeof(VECTYPE);
-         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
-        VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
-        VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
-        VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
-        VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
-        VECTYPE tmp01 = VEC_OR(tmp0, tmp1);
-        VECTYPE tmp23 = VEC_OR(tmp2, tmp3);
-        if (!ALL_EQ(VEC_OR(tmp01, tmp23), zero)) {
-            break;
+    /* Check the last few bytes of the tail.  */
+    for (; i < len; ++i) {
+        const char *p = buf + i;
+        if (*p) {
+            return false;
         }
     }
 
-    return i * sizeof(VECTYPE);
+    return true;
 }
 
-#if defined CONFIG_AVX2_OPT
-#pragma GCC push_options
-#pragma GCC target("avx2")
-#include <cpuid.h>
-#include <immintrin.h>
-
-#define AVX2_VECTYPE        __m256i
-#define AVX2_ALL_EQ(v1, v2) \
-    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
-#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
+#define IDENT_NONZERO(X)  (X)
+ACCEL_BUFFER_ZERO(buffer_zero_int, 4*sizeof(long), long, IDENT_NONZERO)
 
-static bool
-can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+static bool select_accel_int(const void *buf, size_t len)
 {
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(AVX2_VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0);
+    uintptr_t ibuf = (uintptr_t)buf;
+    /* Note that this condition used to be the input constraint for
+       buffer_is_zero, therefore it is highly likely to be true.  */
+    if (likely(len % (4 * sizeof(long)) == 0)
+        && likely(ibuf % sizeof(long) == 0)) {
+        return buffer_zero_int(buf, len);
+    }
+    return buffer_zero_base(buf, len);
 }
 
-static size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+#ifdef __ALTIVEC__
+#include <altivec.h>
+/* The altivec.h header says we're allowed to undef these for
+ * C++ compatibility.  Here we don't care about C++, but we
+ * undef them anyway to avoid namespace pollution.
+ * altivec.h may redefine the bool macro as a vector type;
+ * reset it to POSIX semantics.
+ */
+#undef vector
+#undef pixel
+#undef bool
+#define bool _Bool
+#define DO_NONZERO(X)  vec_any_ne(X, (__vector unsigned char){ 0 })
+ACCEL_BUFFER_ZERO(buffer_zero_ppc, 128, __vector unsigned char, DO_NONZERO)
+
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    const AVX2_VECTYPE *p = buf;
-    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
-    size_t i;
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 128 == 0 && ibuf % sizeof(__vector unsigned char) == 0) {
+        return buffer_zero_ppc(buf, len);
+    }
+    return select_accel_int(buf, len);
+}
 
-    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
+#elif defined(CONFIG_AVX2_OPT)
+#include <cpuid.h>
+#include <x86intrin.h>
 
-    if (!len) {
-        return 0;
-    }
+#pragma GCC push_options
+#pragma GCC target("avx2")
+#define AVX2_NONZERO(X)  !_mm256_testz_si256((X), (X))
+ACCEL_BUFFER_ZERO(buffer_zero_avx2, 128, __m256i, AVX2_NONZERO)
+#pragma GCC pop_options
 
-    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
-        if (!AVX2_ALL_EQ(p[i], zero)) {
-            return i * sizeof(AVX2_VECTYPE);
-        }
-    }
+#pragma GCC push_options
+#pragma GCC target("sse2")
+#define SSE2_NONZERO(X) \
+    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) != 0xFFFF)
+ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_NONZERO)
+#pragma GCC pop_options
 
-    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
-         i < len / sizeof(AVX2_VECTYPE);
-         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
-        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
-        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
-        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
-        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
-        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
-        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
-        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
-            break;
-        }
-    }
+#define CACHE_SSE2    1
+#define CACHE_SSE4    2
+#define CACHE_AVX1    4
+#define CACHE_AVX2    8
 
-    return i * sizeof(AVX2_VECTYPE);
-}
+static int cpuid_cache;
 
-static bool avx2_support(void)
+static void __attribute__((constructor)) init_cpuid_cache(void)
 {
+    int max = __get_cpuid_max(0, NULL);
     int a, b, c, d;
+    int cache = 0;
 
-    if (__get_cpuid_max(0, NULL) < 7) {
-        return false;
-    }
-
-    __cpuid_count(7, 0, a, b, c, d);
+    if (max >= 1) {
+        __cpuid(1, a, b, c, d);
+        if (d & bit_SSE2) {
+            cache |= CACHE_SSE2;
+        }
+        if (c & bit_SSE4_1) {
+            cache |= CACHE_SSE4;
+        }
 
-    return b & bit_AVX2;
+        /* We must check that AVX is not just available, but usable.  */
+        if ((c & bit_OSXSAVE) && (c & bit_AVX)) {
+            __asm("xgetbv" : "=a"(a), "=d"(d) : "c"(0));
+            if ((a & 6) == 6) {
+                cache |= CACHE_AVX1;
+                if (max >= 7) {
+                    __cpuid_count(7, 0, a, b, c, d);
+                    if (b & bit_AVX2) {
+                        cache |= CACHE_AVX2;
+                    }
+                }
+            }
+        }
+    }
+    cpuid_cache = cache;
 }
 
-static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len) \
-         __attribute__ ((ifunc("can_use_buffer_find_nonzero_offset_ifunc")));
-static size_t buffer_find_nonzero_offset(const void *buf, size_t len) \
-         __attribute__ ((ifunc("buffer_find_nonzero_offset_ifunc")));
-
-static void *buffer_find_nonzero_offset_ifunc(void)
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
-        buffer_find_nonzero_offset_avx2 : buffer_find_nonzero_offset_inner;
-
-    return func;
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 128 == 0 && ibuf % 32 == 0 && (cpuid_cache & CACHE_AVX2)) {
+        return buffer_zero_avx2(buf, len);
+    }
+    if (len % 64 == 0 && ibuf % 16 == 0 && (cpuid_cache & CACHE_SSE2)) {
+        return buffer_zero_sse2(buf, len);
+    }
+    return select_accel_int(buf, len);
 }
 
-static void *can_use_buffer_find_nonzero_offset_ifunc(void)
-{
-    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
-        can_use_buffer_find_nonzero_offset_avx2 :
-        can_use_buffer_find_nonzero_offset_inner;
+#elif defined __SSE2__
+#include <emmintrin.h>
 
-    return func;
-}
-#pragma GCC pop_options
-#else
-static bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+#define SSE2_NONZERO(X) \
+    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) != 0xFFFF)
+ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_NONZERO)
+
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    return can_use_buffer_find_nonzero_offset_inner(buf, len);
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 64 == 0 && ibuf % sizeof(__m128i) == 0) {
+        return buffer_zero_sse2(buf, len);
+    }
+    return select_accel_int(buf, len);
 }
 
-static size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+#elif defined(__aarch64__)
+#include "arm_neon.h"
+
+#define DO_NONZERO(X)  (vgetq_lane_u64((X), 0) | vgetq_lane_u64((X), 1))
+ACCEL_BUFFER_ZERO(buffer_zero_neon, 128, uint64x2_t, DO_NONZERO)
+
+static bool select_accel_fn(const void *buf, size_t len)
 {
-    return buffer_find_nonzero_offset_inner(buf, len);
+    uintptr_t ibuf = (uintptr_t)buf;
+    if (len % 128 == 0 && ibuf % sizeof(uint64x2_t) == 0) {
+        return buffer_zero_neon(buf, len);
+    }
+    return select_accel_int(buf, len);
 }
+
+#else
+#define select_accel_fn  select_accel_int
 #endif
 
 /*
  * Checks if a buffer is all zeroes
- *
- * Attention! The len must be a multiple of 4 * sizeof(long) due to
- * restriction of optimizations in this function.
  */
 bool buffer_is_zero(const void *buf, size_t len)
 {
-    /*
-     * Use long as the biggest available internal data type that fits into the
-     * CPU register and unroll the loop to smooth out the effect of memory
-     * latency.
-     */
-
-    size_t i;
-    long d0, d1, d2, d3;
-    const long * const data = buf;
-
-    /* use vector optimized zero check if possible */
-    if (can_use_buffer_find_nonzero_offset(buf, len)) {
-        return buffer_find_nonzero_offset(buf, len) == len;
-    }
-
-    assert(len % (4 * sizeof(long)) == 0);
-    len /= sizeof(long);
-
-    for (i = 0; i < len; i += 4) {
-        d0 = data[i + 0];
-        d1 = data[i + 1];
-        d2 = data[i + 2];
-        d3 = data[i + 3];
-
-        if (d0 || d1 || d2 || d3) {
-            return false;
-        }
+    if (unlikely(len == 0)) {
+        return true;
     }
 
-    return true;
+    /* Use an optimized zero check if possible.  Note that this also
+       includes a check for an unrolled loop over longs, as well as
+       the unsized, unaligned fallback to buffer_zero_base.  */
+    return select_accel_fn(buf, len);
 }
-
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread
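The three-phase scan in buffer_zero_base above (byte head until aligned, whole longs, byte tail) can also be sketched as a standalone function; the names below are illustrative, not part of the patch, and memcpy is used in place of the pointer cast to sidestep strict-aliasing questions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same structure as buffer_zero_base: check bytes until the pointer
   is long-aligned, then whole longs, then the remaining byte tail.  */
static bool buffer_zero_sketch(const void *buf, size_t len)
{
    const unsigned char *b = buf;
    size_t i = 0;

    /* Head: bytes until aligned.  */
    for (; i < len && ((uintptr_t)(b + i) % sizeof(long)); ++i) {
        if (b[i]) {
            return false;
        }
    }
    /* Body: one long at a time.  */
    for (; i + sizeof(long) <= len; i += sizeof(long)) {
        long v;
        __builtin_memcpy(&v, b + i, sizeof(v));  /* aliasing-safe load */
        if (v) {
            return false;
        }
    }
    /* Tail: remaining bytes.  */
    for (; i < len; ++i) {
        if (b[i]) {
            return false;
        }
    }
    return true;
}
```

Note the zero-length buffer falls straight through all three loops and reports "all zero", matching the len == 0 special case in buffer_is_zero.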

* [Qemu-devel] [PATCH v2 5/8] cutils: Add generic prefetch
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
                   ` (2 preceding siblings ...)
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 4/8] cutils: Rearrange buffer_is_zero acceleration Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 6/8] cutils: Rewrite x86 buffer zero checking Richard Henderson
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell

We have no real knowledge of the cache line size here; we
simply prefetch one loop iteration ahead.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/bufferiszero.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 5246c5b..264598b 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -38,6 +38,8 @@ NAME(const void *buf, size_t len)                               \
     do {                                                        \
         const VECTYPE *p = buf;                                 \
         VECTYPE t;                                              \
+        __builtin_prefetch(buf + SIZE);                         \
+        barrier();                                              \
         if (SIZE == sizeof(VECTYPE) * 4) {                      \
             t = (p[0] | p[1]) | (p[2] | p[3]);                  \
         } else if (SIZE == sizeof(VECTYPE) * 8) {               \
@@ -241,6 +243,9 @@ bool buffer_is_zero(const void *buf, size_t len)
         return true;
     }
 
+    /* Fetch the beginning of the buffer while we select the accelerator.  */
+    __builtin_prefetch(buf);
+
     /* Use an optimized zero check if possible.  Note that this also
        includes a check for an unrolled loop over longs, as well as
        the unsized, unaligned fallback to buffer_zero_base.  */
-- 
2.7.4
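The prefetch-one-iteration-ahead idea is independent of the macro; a minimal standalone sketch (hypothetical names, plain GCC/Clang __builtin_prefetch) looks like:

```c
#include <assert.h>
#include <stddef.h>

/* Sum a buffer in 64-byte chunks, prefetching the chunk that the
   *next* iteration will read.  __builtin_prefetch is advisory: it
   never faults, so prefetching past the end of the buffer on the
   last iteration is harmless.  */
static unsigned sum_chunks(const unsigned char *buf, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i += 64) {
        __builtin_prefetch(buf + i + 64);   /* one loop ahead */
        size_t end = i + 64 < len ? i + 64 : len;
        for (size_t j = i; j < end; ++j) {
            sum += buf[j];
        }
    }
    return sum;
}
```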

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH v2 6/8] cutils: Rewrite x86 buffer zero checking
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
                   ` (3 preceding siblings ...)
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 5/8] cutils: Add generic prefetch Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 7/8] cutils: Rewrite aarch64 " Richard Henderson
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell, liang.z.li

Use unaligned load operations.
Add versions for avx1 and sse4.1.

Cc: liang.z.li@intel.com
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/bufferiszero.c | 169 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 145 insertions(+), 24 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 264598b..e5e4459 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -131,21 +131,127 @@ static bool select_accel_fn(const void *buf, size_t len)
     return select_accel_int(buf, len);
 }
 
-#elif defined(CONFIG_AVX2_OPT)
+#elif defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <cpuid.h>
 #include <x86intrin.h>
 
+#ifdef CONFIG_AVX2_OPT
 #pragma GCC push_options
 #pragma GCC target("avx2")
-#define AVX2_NONZERO(X)  !_mm256_testz_si256((X), (X))
-ACCEL_BUFFER_ZERO(buffer_zero_avx2, 128, __m256i, AVX2_NONZERO)
+
+static bool __attribute__((noinline))
+buffer_zero_avx2(const void *buf, size_t len)
+{
+    const __m256i *p = buf;
+    const __m256i *end = buf + len;
+    __m256i t;
+
+    do {
+        p += 4;
+        /* Note that most AVX insns handle unaligned operands by
+           default; we only need to take care for the initial load.  */
+        __asm("prefetcht0 (%1)\n\t"
+              "vmovdqu -0x80(%1),%0\n\t"
+              "vpor -0x60(%1),%0,%0\n\t"
+              "vpor -0x40(%1),%0,%0\n\t"
+              "vpor -0x20(%1),%0,%0"
+              : "=x"(t) : "r"(p));
+        if (unlikely(!_mm256_testz_si256(t, t))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
+#pragma GCC pop_options
+#pragma GCC push_options
+#pragma GCC target("avx")
+
+static bool __attribute__((noinline))
+buffer_zero_avx(const void *buf, size_t len)
+{
+    const __m128i *p = buf;
+    const __m128i *end = buf + len;
+    __m128i t;
+
+    do {
+        p += 4;
+        /* Note that most AVX insns handle unaligned operands by
+           default; we only need to take care for the initial load.  */
+        __asm("prefetcht0 (%1)\n\t"
+              "vmovdqu -0x40(%1),%0\n\t"
+              "vpor -0x30(%1),%0,%0\n\t"
+              "vpor -0x20(%1),%0,%0\n\t"
+              "vpor -0x10(%1),%0,%0"
+              : "=x"(t) : "r"(p));
+        if (unlikely(!_mm_testz_si128(t, t))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
 #pragma GCC pop_options
+#pragma GCC push_options
+#pragma GCC target("sse4")
+
+static bool __attribute__((noinline))
+buffer_zero_sse4(const void *buf, size_t len)
+{
+    const __m128i *p = buf;
+    const __m128i *end = buf + len;
+    __m128i t0, t1, t2, t3;
+
+    do {
+        p += 4;
+        __asm("prefetcht0 (%4)\n\t"
+              "movdqu -0x40(%4),%0\n\t"
+              "movdqu -0x30(%4),%1\n\t"
+              "movdqu -0x20(%4),%2\n\t"
+              "movdqu -0x10(%4),%3\n\t"
+              "por %1,%0\n\t"
+              "por %3,%2\n\t"
+              "por %2,%0"
+              : "=x"(t0), "=x"(t1), "=x"(t2), "=x"(t3) : "r"(p));
+        if (unlikely(!_mm_testz_si128(t0, t0))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
 
+#pragma GCC pop_options
 #pragma GCC push_options
 #pragma GCC target("sse2")
-#define SSE2_NONZERO(X) \
-    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) != 0xFFFF)
-ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_NONZERO)
+#endif /* CONFIG_AVX2_OPT */
+
+static bool __attribute__((noinline))
+buffer_zero_sse2(const void *buf, size_t len)
+{
+    const __m128i *p = buf;
+    const __m128i *end = buf + len;
+    __m128i zero = _mm_setzero_si128();
+    __m128i t0, t1, t2, t3;
+
+    do {
+        p += 4;
+        __asm("prefetcht0 (%4)\n\t"
+              "movdqu -0x40(%4),%0\n\t"
+              "movdqu -0x30(%4),%1\n\t"
+              "movdqu -0x20(%4),%2\n\t"
+              "movdqu -0x10(%4),%3\n\t"
+              "por %1,%0\n\t"
+              "por %3,%2\n\t"
+              "por %2,%0"
+              : "=x"(t0), "=x"(t1), "=x"(t2), "=x"(t3) : "r"(p));
+        if (unlikely(_mm_movemask_epi8(_mm_cmpeq_epi8(t0, zero)) != 0xFFFF)) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
+
+#ifdef CONFIG_AVX2_OPT
 #pragma GCC pop_options
 
 #define CACHE_SSE2    1
@@ -186,32 +292,47 @@ static void __attribute__((constructor)) init_cpuid_cache(void)
     }
     cpuid_cache = cache;
 }
+#endif /* CONFIG_AVX2_OPT */
 
 static bool select_accel_fn(const void *buf, size_t len)
 {
-    uintptr_t ibuf = (uintptr_t)buf;
-    if (len % 128 == 0 && ibuf % 32 == 0 && (cpuid_cache & CACHE_AVX2)) {
+#ifdef CONFIG_AVX2_OPT
+    int cache = cpuid_cache;
+
+    /* Force bits that the compiler tells us must be there.
+       This allows the compiler to optimize subsequent tests.  */
+#ifdef __AVX2__
+    cache |= CACHE_AVX2;
+#endif
+#ifdef __AVX__
+    cache |= CACHE_AVX1;
+#endif
+#ifdef __SSE4_1__
+    cache |= CACHE_SSE4;
+#endif
+#ifdef __SSE2__
+    cache |= CACHE_SSE2;
+#endif
+
+    if (len % 128 == 0 && (cache & CACHE_AVX2)) {
         return buffer_zero_avx2(buf, len);
     }
-    if (len % 64 == 0 && ibuf % 16 == 0 && (cpuid_cache & CACHE_SSE2)) {
-        return buffer_zero_sse2(buf, len);
+    if (len % 64 == 0) {
+        if (cache & CACHE_AVX1) {
+            return buffer_zero_avx(buf, len);
+        }
+        if (cache & CACHE_SSE4) {
+            return buffer_zero_sse4(buf, len);
+        }
+        if (cache & CACHE_SSE2) {
+            return buffer_zero_sse2(buf, len);
+        }
     }
-    return select_accel_int(buf, len);
-}
-
-#elif defined __SSE2__
-#include <emmintrin.h>
-
-#define SSE2_NONZERO(X) \
-    (_mm_movemask_epi8(_mm_cmpeq_epi8((X), _mm_setzero_si128())) != 0xFFFF)
-ACCEL_BUFFER_ZERO(buffer_zero_sse2, 64, __m128i, SSE2_NONZERO)
-
-static bool select_accel_fn(const void *buf, size_t len)
-{
-    uintptr_t ibuf = (uintptr_t)buf;
-    if (len % 64 == 0 && ibuf % sizeof(__m128i) == 0) {
+#else
+    if (len % 64 == 0) {
         return buffer_zero_sse2(buf, len);
     }
+#endif
     return select_accel_int(buf, len);
 }
 
-- 
2.7.4
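The probe-once dispatch structure used here (a constructor fills a capability cache before main(), and the selector branches on the cached bits) can be sketched portably with scalar stand-ins for the SIMD variants; everything below is illustrative, not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool zero_bytewise(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; ++i) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Stand-in for an accelerated variant; requires len % 8 == 0.  */
static bool zero_unrolled(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i += 8) {
        unsigned v = p[i] | p[i + 1] | p[i + 2] | p[i + 3]
                   | p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7];
        if (v) {
            return false;
        }
    }
    return true;
}

static int cap_cache;
#define CAP_UNROLL 1

/* Probe capabilities once, before main() runs, as init_cpuid_cache
   does in the patch (there: cpuid/xgetbv probing).  */
static void __attribute__((constructor)) init_caps(void)
{
    cap_cache |= CAP_UNROLL;
}

static bool is_zero(const void *buf, size_t len)
{
    if (len % 8 == 0 && (cap_cache & CAP_UNROLL)) {
        return zero_unrolled(buf, len);
    }
    return zero_bytewise(buf, len);
}
```

The design point is that the feature test happens exactly once; the hot path only reads an integer, which is what the ifunc mechanism bought before patches 1-4 removed it.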


* [Qemu-devel] [PATCH v2 7/8] cutils: Rewrite aarch64 buffer zero checking
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
                   ` (4 preceding siblings ...)
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 6/8] cutils: Rewrite x86 buffer zero checking Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 8/8] cutils: Rewrite ppc " Richard Henderson
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell, qemu-arm, vijay.kilari

Provide 64-byte and 128-byte versions.
Use dczid_el0 as a proxy for the cacheline size.

Cc: qemu-arm@nongnu.org
Cc: vijay.kilari@gmail.com
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/bufferiszero.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index e5e4459..28a1419 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -340,13 +340,35 @@ static bool select_accel_fn(const void *buf, size_t len)
 #include "arm_neon.h"
 
 #define DO_NONZERO(X)  (vgetq_lane_u64((X), 0) | vgetq_lane_u64((X), 1))
-ACCEL_BUFFER_ZERO(buffer_zero_neon, 128, uint64x2_t, DO_NONZERO)
+ACCEL_BUFFER_ZERO(buffer_zero_neon_64, 64, uint64x2_t, DO_NONZERO)
+ACCEL_BUFFER_ZERO(buffer_zero_neon_128, 128, uint64x2_t, DO_NONZERO)
+
+static uint32_t buffer_zero_line_mask;
+static accel_zero_fn buffer_zero_accel;
+
+static void __attribute__((constructor)) init_buffer_zero_accel(void)
+{
+    uint64_t t;
+
+    /* Use the DC ZVA block size as a proxy for the cacheline size,
+       since the latter is not available to userspace.  This seems
+       to work in practice for existing implementations.  */
+    asm("mrs %0, dczid_el0" : "=r"(t));
+    if ((4 << (t & 15)) >= 128) {
+        buffer_zero_line_mask = 128 - 1;
+        buffer_zero_accel = buffer_zero_neon_128;
+    } else {
+        buffer_zero_line_mask = 64 - 1;
+        buffer_zero_accel = buffer_zero_neon_64;
+    }
+}
 
 static bool select_accel_fn(const void *buf, size_t len)
 {
     uintptr_t ibuf = (uintptr_t)buf;
-    if (len % 128 == 0 && ibuf % sizeof(uint64x2_t) == 0) {
-        return buffer_zero_neon(buf, len);
+    if (likely(ibuf % sizeof(uint64_t) == 0)
+        && (len & buffer_zero_line_mask) == 0) {
+        return buffer_zero_accel(buf, len);
     }
     return select_accel_int(buf, len);
 }
-- 
2.7.4


* [Qemu-devel] [PATCH v2 8/8] cutils: Rewrite ppc buffer zero checking
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
                   ` (5 preceding siblings ...)
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 7/8] cutils: Rewrite aarch64 " Richard Henderson
@ 2016-08-24 17:48 ` Richard Henderson
  2016-08-24 19:18 ` [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Eric Blake
  2016-08-25 12:49 ` Daniel P. Berrange
  8 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 17:48 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, peter.maydell, qemu-ppc, David Gibson

GCC versions through 6 do a poor job with the indexed addressing,
and (for ppc64le) issue unnecessary xxswapd insns.

Cc: qemu-ppc@nongnu.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 util/bufferiszero.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 28a1419..d580b57 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -119,8 +119,44 @@ static bool select_accel_int(const void *buf, size_t len)
 #undef pixel
 #undef bool
 #define bool _Bool
-#define DO_NONZERO(X)  vec_any_ne(X, (__vector unsigned char){ 0 })
-ACCEL_BUFFER_ZERO(buffer_zero_ppc, 128, __vector unsigned char, DO_NONZERO)
+
+static bool __attribute__((noinline))
+buffer_zero_ppc(const void *buf, size_t len)
+{
+    typedef unsigned char vec __attribute__((vector_size(16)));
+    const vec *p = buf;
+    const vec *end = buf + len;
+    vec t0, t1, t2, t3, zero = (vec){ 0 };
+
+    do {
+        p += 8;
+        __builtin_prefetch(p);
+        barrier();
+        /* ??? GCC6 does poorly with ppc64le; extra xxswapd insns.  */
+        __asm("lvebx %0,%4,%5\n\t"
+              "lvebx %1,%4,%6\n\t"
+              "lvebx %2,%4,%7\n\t"
+              "lvebx %3,%4,%8\n\t"
+              "vor %0,%0,%1\n\t"
+              "vor %1,%2,%3\n\t"
+              "lvebx %2,%4,%9\n\t"
+              "lvebx %3,%4,%10\n\t"
+              "vor %0,%0,%1\n\t"
+              "vor %1,%2,%3\n\t"
+              "lvebx %2,%4,%11\n\t"
+              "lvebx %3,%4,%12\n\t"
+              "vor %0,%0,%1\n\t"
+              "vor %1,%2,%3\n\t"
+              "vor %0,%0,%1"
+              : "=v"(t0), "=v"(t1), "=v"(t2), "=v"(t3)
+              : "b"(p), "b"(-8 * 16), "b"(-7 * 16), "b"(-6 * 16), "b"(-5 * 16),
+                "b"(-4 * 16), "b"(-3 * 16), "b"(-2 * 16), "b"(-1 * 16));
+        if (unlikely(vec_any_ne(t0, zero))) {
+            return false;
+        }
+    } while (p < end);
+    return true;
+}
 
 static bool select_accel_fn(const void *buf, size_t len)
 {
-- 
2.7.4


* Re: [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
                   ` (6 preceding siblings ...)
  2016-08-24 17:48 ` [Qemu-devel] [PATCH v2 8/8] cutils: Rewrite ppc " Richard Henderson
@ 2016-08-24 19:18 ` Eric Blake
  2016-08-24 20:31   ` Richard Henderson
  2016-08-25 12:49 ` Daniel P. Berrange
  8 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2016-08-24 19:18 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: pbonzini, peter.maydell

On 08/24/2016 12:48 PM, Richard Henderson wrote:
> Patches 1-4 remove the use of ifunc from the implementation.
> 
> Patch 6 adjusts the x86 implementation a bit more to take
> advantage of ptest (in sse4.1) and unaligned accesses (in avx1).

Do we really care about unaligned access?  Or can we guarantee that all
our calls to buffer_is_zero are already aligned, and make optimizations
along those lines?



-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero
  2016-08-24 19:18 ` [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Eric Blake
@ 2016-08-24 20:31   ` Richard Henderson
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2016-08-24 20:31 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: pbonzini, peter.maydell

On 08/24/2016 12:18 PM, Eric Blake wrote:
> On 08/24/2016 12:48 PM, Richard Henderson wrote:
>> Patches 1-4 remove the use of ifunc from the implementation.
>>
>> Patch 6 adjusts the x86 implementation a bit more to take
>> advantage of ptest (in sse4.1) and unaligned accesses (in avx1).
>
> Do we really care about unaligned access?  Or can we guarantee that all
> our calls to buffer_is_zero are already aligned, and make optimizations
> along those lines?

The old code asserted alignment of at least sizeof(long), although a survey of 
call sites doesn't make this obvious.  I could imagine that we get alignment 
consistent with that of malloc, but can't prove it.

However, we're certainly not going to be able to assert arbitrary alignment, 
such as the 32-byte alignment for AVX2, or 64-byte for AVX512 (when that comes along).

Thankfully, at least AVX-capable CPUs are very efficient with unaligned accesses.


r~
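The malloc-alignment point above is easy to probe; a short sketch (hypothetical helper, relying only on the C guarantee that malloc storage is suitably aligned for any fundamental type):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment of a pointer: the lowest set bit of its address.
   malloc must align at least to _Alignof(max_align_t), but nothing
   guarantees the 32- or 64-byte alignment that AVX2/AVX512 aligned
   loads would want -- hence the unaligned-load versions.  */
static uintptr_t alignment_of(const void *p)
{
    uintptr_t u = (uintptr_t)p;
    return u & -u;
}
```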


* Re: [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero
  2016-08-24 17:48 [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Richard Henderson
                   ` (7 preceding siblings ...)
  2016-08-24 19:18 ` [Qemu-devel] [PATCH v2 0/8] Improve buffer_is_zero Eric Blake
@ 2016-08-25 12:49 ` Daniel P. Berrange
  8 siblings, 0 replies; 11+ messages in thread
From: Daniel P. Berrange @ 2016-08-25 12:49 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, pbonzini, peter.maydell

On Wed, Aug 24, 2016 at 10:48:27AM -0700, Richard Henderson wrote:
> Patches 1-4 remove the use of ifunc from the implementation.
> 
> Patch 6 adjusts the x86 implementation a bit more to take
> advantage of ptest (in sse4.1) and unaligned accesses (in avx1).
> 
> Patches 3 and 7 are the result of my conversation with Vijaya
> Kumar with respect to ThunderX.
> 
> Patch 8 is the result of seeing some really really horrible code
> produced for ppc64le (gcc 4.9 and mainline).
> 
> This has had limited testing.  What I don't know is the best way
> to benchmark this -- the only way I know to trigger this is via
> the console, by hand, which doesn't make for reasonable timing.
> 
> Changes v1-v2:
>   * Add patch 1, moving everything to a new file.
>   * Fix a typo or two, which had the wrong sense of zero test.
>     These had mostly been fixed in the intermediate patches,
>     but it wouldn't have helped bisection.
> 
> 
> r~
> 
> 
> Richard Henderson (8):
>   cutils: Move buffer_is_zero and subroutines to a new file
>   cutils: Remove SPLAT macro
>   cutils: Export only buffer_is_zero
>   cutils: Rearrange buffer_is_zero acceleration
>   cutils: Add generic prefetch
>   cutils: Rewrite x86 buffer zero checking
>   cutils: Rewrite aarch64 buffer zero checking
>   cutils: Rewrite ppc buffer zero checking
> 
>  configure             |  21 +--
>  include/qemu/cutils.h |   2 -
>  migration/ram.c       |   2 +-
>  migration/rdma.c      |   5 +-
>  util/Makefile.objs    |   1 +
>  util/bufferiszero.c   | 432 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  util/cutils.c         | 244 ----------------------------
>  7 files changed, 441 insertions(+), 266 deletions(-)
>  create mode 100644 util/bufferiszero.c

Since your v1 series has a report of breaking arm64, I think this is a good
candidate for adding unit tests, e.g. a tests/test-bufferiszero.c file which
exercises & validates the various code paths.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
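A self-contained sketch of such a test file might look like the following; a bytewise stand-in replaces the real buffer_is_zero so the sketch compiles on its own, and the particular sizes and offsets are only a suggestion for hitting the accelerated and fallback paths:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in so this sketch is self-contained; the real test would
   #include "qemu/cutils.h" and link against util/bufferiszero.c.  */
static bool buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; ++i) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Exercise lengths and offsets meant to select each code path:
   multiples of 128 (avx2/neon), 64 (sse2), 4*sizeof(long) (the
   integer path), and odd sizes/misalignments (bytewise fallback).  */
static void test_buffer_is_zero(void)
{
    static unsigned char buf[1024 + 1];
    static const size_t lens[] = { 0, 1, 7, 64, 128, 256, 1024 };

    for (size_t i = 0; i < sizeof(lens) / sizeof(lens[0]); ++i) {
        size_t len = lens[i];
        for (size_t off = 0; off < 2 && off + len <= sizeof(buf); ++off) {
            memset(buf, 0, sizeof(buf));
            assert(buffer_is_zero(buf + off, len));
            if (len != 0) {
                buf[off + len - 1] = 0x5a;   /* dirty the last byte */
                assert(!buffer_is_zero(buf + off, len));
            }
        }
    }
}
```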

