* [Qemu-devel] [v2 0/2] add avx2 instruction optimization
@ 2015-11-10  2:51 Liang Li
  2015-11-10  2:51 ` [Qemu-devel] [v2 1/2] cutils: " Liang Li
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Liang Li @ 2015-11-10  2:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, Liang Li, mst, amit.shah, pbonzini

buffer_find_nonzero_offset() is a hot function during live migration.
It currently uses SSE2 instructions for optimization. On platforms that
support AVX2 instructions, using AVX2 improves its performance by about
30% compared to SSE2.
Zero page checking becomes faster with this optimization; the test results
show that for an 8GB RAM idle guest, this patch shortens the total live
migration time by about 6%.

This patch uses the ifunc mechanism to select the proper function at
runtime: on platforms that support AVX2, the AVX2 implementation is
executed; otherwise, the original code is executed.

With this patch, a QEMU binary built with AVX2 enabled can run on
platforms both with and without AVX2 support.

If the QEMU binary is built with AVX2 disabled, or if the compiler does
not support AVX2, the binary will not contain AVX2 instructions, and it
can likewise run on platforms with or without AVX2 support.

 
Liang Li (2):
  cutils: add avx2 instruction optimization
  configure: add options to config avx2

 configure             | 29 ++++++++++++++++++++++
 include/qemu-common.h | 28 +++++++++++++++------
 util/Makefile.objs    |  2 ++
 util/avx2.c           | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c         | 53 +++++++++++++++++++++++++++++++++++++--
 5 files changed, 172 insertions(+), 9 deletions(-)
 create mode 100644 util/avx2.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-10  2:51 [Qemu-devel] [v2 0/2] add avx2 instruction optimization Liang Li
@ 2015-11-10  2:51 ` Liang Li
  2015-11-12 10:08   ` Paolo Bonzini
  2015-11-12 14:43   ` Richard Henderson
  2015-11-10  2:51 ` [Qemu-devel] [v2 2/2] configure: add options to config avx2 Liang Li
  2015-11-10  3:43 ` [Qemu-devel] [v2 0/2] add avx2 instruction optimization Eric Blake
  2 siblings, 2 replies; 35+ messages in thread
From: Liang Li @ 2015-11-10  2:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, Liang Li, mst, amit.shah, pbonzini

buffer_find_nonzero_offset() is a hot function during live migration.
It currently uses SSE2 instructions for optimization. On platforms that
support AVX2 instructions, using AVX2 improves its performance by about
30% compared to SSE2.
Zero page checking becomes faster with this optimization; the test results
show that for an 8GB RAM idle guest, this patch shortens the total live
migration time by about 6%.

This patch uses the ifunc mechanism to select the proper function at
runtime: on platforms that support AVX2, the AVX2 implementation is
executed; otherwise, the original code is executed.

Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 include/qemu-common.h | 28 +++++++++++++++------
 util/Makefile.objs    |  2 ++
 util/avx2.c           | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c         | 53 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 143 insertions(+), 9 deletions(-)
 create mode 100644 util/avx2.c

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 2f74540..9fa7501 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -484,15 +484,29 @@ void qemu_hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 #endif
 
 #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
-static inline bool
-can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
-{
-    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
-                   * sizeof(VECTYPE)) == 0
-            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
-}
+bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len);
+
 size_t buffer_find_nonzero_offset(const void *buf, size_t len);
 
+extern bool
+can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len);
+
+extern size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len);
+
+extern bool
+can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len);
+
+extern size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len);
+
+__asm__(".type can_use_buffer_find_nonzero_offset, \%gnu_indirect_function");
+__asm__(".type buffer_find_nonzero_offset, \%gnu_indirect_function");
+
+
+void *can_use_buffer_find_nonzero_offset_ifunc(void) \
+                     __asm__("can_use_buffer_find_nonzero_offset");
+
+void *buffer_find_nonzero_offset_ifunc(void) \
+                     __asm__("buffer_find_nonzero_offset");
 /*
  * helper to parse debug environment variables
  */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index d7cc399..6aacad7 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,5 @@
 util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y += avx2.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
@@ -29,3 +30,4 @@ util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
 util-obj-y += qemu-coroutine-sleep.o
 util-obj-y += coroutine-$(CONFIG_COROUTINE_BACKEND).o
 util-obj-y += buffer.o
+avx2.o-cflags      := $(AVX2_CFLAGS)
diff --git a/util/avx2.c b/util/avx2.c
new file mode 100644
index 0000000..0e6915a
--- /dev/null
+++ b/util/avx2.c
@@ -0,0 +1,69 @@
+#include "qemu-common.h"
+
+#ifdef __AVX2__
+#include <immintrin.h>
+#define AVX2_VECTYPE        __m256i
+#define AVX2_SPLAT(p)       _mm256_set1_epi8(*(p))
+#define AVX2_ALL_EQ(v1, v2) \
+    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
+#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
+
+inline bool
+can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+{
+    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+                   * sizeof(AVX2_VECTYPE)) == 0
+            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0);
+}
+
+size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+{
+    const AVX2_VECTYPE *p = buf;
+    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
+    size_t i;
+
+    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
+
+    if (!len) {
+        return 0;
+    }
+
+    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
+        if (!AVX2_ALL_EQ(p[i], zero)) {
+            return i * sizeof(AVX2_VECTYPE);
+        }
+    }
+
+    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
+         i < len / sizeof(AVX2_VECTYPE);
+         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
+        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
+        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
+        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
+        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
+        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
+        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
+            break;
+        }
+    }
+
+    return i * sizeof(AVX2_VECTYPE);
+}
+
+#else
+/* use the original functions if avx2 is not enabled when building */
+
+inline bool
+can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+{
+    return can_use_buffer_find_nonzero_offset_inner(buf, len);
+}
+
+inline size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
+{
+    return buffer_find_nonzero_offset_inner(buf, len);
+}
+
+#endif
+
diff --git a/util/cutils.c b/util/cutils.c
index cfeb848..cd478ce 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -26,6 +26,7 @@
 #include <math.h>
 #include <limits.h>
 #include <errno.h>
+#include <cpuid.h>
 
 #include "qemu/sockets.h"
 #include "qemu/iov.h"
@@ -161,6 +162,54 @@ int qemu_fdatasync(int fd)
 #endif
 }
 
+/* old compilers may not define bit_AVX2 */
+#ifndef bit_AVX2
+#define bit_AVX2 (1 << 5)
+#endif
+
+static inline bool avx2_support(void)
+{
+    int a, b, c, d;
+
+    if (__get_cpuid_max(0, NULL) < 7) {
+        printf("max cpuid < 7\n");
+        return false;
+    }
+
+    __cpuid_count(7, 0, a, b, c, d);
+    printf("b = %x\n", b);
+    return b & bit_AVX2;
+}
+
+void *buffer_find_nonzero_offset_ifunc(void)
+{
+    printf("deciding %s\n", __func__);
+
+    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
+        buffer_find_nonzero_offset_avx2 : buffer_find_nonzero_offset_inner;
+
+    return func;
+}
+
+void *can_use_buffer_find_nonzero_offset_ifunc(void)
+{
+    printf("deciding %s\n", __func__);
+
+    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
+        can_use_buffer_find_nonzero_offset_avx2 :
+        can_use_buffer_find_nonzero_offset_inner;
+
+    return func;
+}
+
+inline bool
+can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
+{
+    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+                   * sizeof(VECTYPE)) == 0
+            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
+}
+
 /*
  * Searches for an area with non-zero content in a buffer
  *
@@ -181,13 +230,13 @@ int qemu_fdatasync(int fd)
  * If the buffer is all zero the return value is equal to len.
  */
 
-size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
 {
     const VECTYPE *p = buf;
     const VECTYPE zero = (VECTYPE){0};
     size_t i;
 
-    assert(can_use_buffer_find_nonzero_offset(buf, len));
+    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
 
     if (!len) {
         return 0;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [Qemu-devel] [v2 2/2] configure: add options to config avx2
  2015-11-10  2:51 [Qemu-devel] [v2 0/2] add avx2 instruction optimization Liang Li
  2015-11-10  2:51 ` [Qemu-devel] [v2 1/2] cutils: " Liang Li
@ 2015-11-10  2:51 ` Liang Li
  2015-11-10  3:43 ` [Qemu-devel] [v2 0/2] add avx2 instruction optimization Eric Blake
  2 siblings, 0 replies; 35+ messages in thread
From: Liang Li @ 2015-11-10  2:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: quintela, Liang Li, mst, amit.shah, pbonzini

Add the '--enable-avx2' and '--disable-avx2' options to configure the
AVX2 instruction optimization.

By default, the AVX2 optimization is enabled: if '--disable-avx2' is not
set, configure detects whether the compiler supports the AVX2 option and
enables the optimization if it does; otherwise it is disabled.

Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 configure | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/configure b/configure
index 42e57c0..4d81be2 100755
--- a/configure
+++ b/configure
@@ -310,6 +310,7 @@ smartcard=""
 libusb=""
 usb_redir=""
 opengl=""
+avx2="yes"
 zlib="yes"
 lzo=""
 snappy=""
@@ -1057,6 +1058,10 @@ for opt do
   ;;
   --enable-usb-redir) usb_redir="yes"
   ;;
+  --disable-avx2) avx2="no"
+  ;;
+  --enable-avx2) avx2="yes"
+  ;;
   --disable-zlib-test) zlib="no"
   ;;
   --disable-lzo) lzo="no"
@@ -1373,6 +1378,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   smartcard       smartcard support (libcacard)
   libusb          libusb (for usb passthrough)
   usb-redir       usb network redirection support
+  avx2            support of avx2 instruction
   lzo             support of lzo compression library
   snappy          support of snappy compression library
   bzip2           support of bzip2 compression library
@@ -1809,6 +1815,24 @@ EOF
   fi
 fi
 
+########################################
+# avx2 check
+
+if test "$avx2" != "no" ; then
+    cat > $TMPC << EOF
+int main(void) { return 0; }
+EOF
+    if compile_prog "" "-mavx2" ; then
+        avx2="yes"
+    else
+        avx2="no"
+    fi
+fi
+
+if test "$avx2" = "yes" ; then
+    avx2_cflags=" -mavx2"
+fi
+
 ##########################################
 # zlib check
 
@@ -4782,6 +4806,7 @@ echo "libssh2 support   $libssh2"
 echo "TPM passthrough   $tpm_passthrough"
 echo "QOM debugging     $qom_cast_debug"
 echo "vhdx              $vhdx"
+echo "avx2 support      $avx2"
 echo "lzo support       $lzo"
 echo "snappy support    $snappy"
 echo "bzip2 support     $bzip2"
@@ -5166,6 +5191,10 @@ if test "$opengl" = "yes" ; then
   echo "OPENGL_LIBS=$opengl_libs" >> $config_host_mak
 fi
 
+if test "$avx2" = "yes" ; then
+  echo "AVX2_CFLAGS=$avx2_cflags" >> $config_host_mak
+fi
+
 if test "$lzo" = "yes" ; then
   echo "CONFIG_LZO=y" >> $config_host_mak
 fi
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  2:51 [Qemu-devel] [v2 0/2] add avx2 instruction optimization Liang Li
  2015-11-10  2:51 ` [Qemu-devel] [v2 1/2] cutils: " Liang Li
  2015-11-10  2:51 ` [Qemu-devel] [v2 2/2] configure: add options to config avx2 Liang Li
@ 2015-11-10  3:43 ` Eric Blake
  2015-11-10  5:48   ` Li, Liang Z
  2 siblings, 1 reply; 35+ messages in thread
From: Eric Blake @ 2015-11-10  3:43 UTC (permalink / raw)
  To: Liang Li, qemu-devel; +Cc: amit.shah, pbonzini, mst, quintela


On 11/09/2015 07:51 PM, Liang Li wrote:
> buffer_find_nonzero_offset() is a hot function during live migration.
> Now it use SSE2 intructions for optimization. For platform supports
> AVX2 instructions, use the AVX2 instructions for optimization can help
> to improve the performance about 30% comparing to SSE2.

Rather than trying to cater to multiple assembly instruction
implementations ourselves, have you tried taking the ideas in this
earlier thread?
https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html

Ideally, libc's memcmp() will already be using the most efficient
assembly instructions without us having to reproduce the work of picking
the instructions that work best.
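
The simplest shape of that idea (purely an illustration, not the code from
that thread) is to let libc's tuned memcmp() do the heavy lifting against a
known-zero reference buffer:

#include <stdbool.h>
#include <string.h>

/* Illustration only: compare the page against a static all-zero buffer
 * and let libc's memcmp() pick the best instructions for the host.
 * Assumes len is at most one guest page (4096 bytes here). */
static bool page_is_zero(const void *buf, size_t len)
{
    static const unsigned char zero_ref[4096];

    return memcmp(buf, zero_ref, len) == 0;
}

The memeqzero4_paolo variant discussed later in this thread refines this by
checking a few leading words by hand and then comparing the buffer against
itself, since most real pages are not zero and should fail fast.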

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  3:43 ` [Qemu-devel] [v2 0/2] add avx2 instruction optimization Eric Blake
@ 2015-11-10  5:48   ` Li, Liang Z
  2015-11-10  9:13     ` Juan Quintela
  0 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-10  5:48 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: amit.shah, pbonzini, mst, quintela

> Rather than trying to cater to multiple assembly instruction implementations
> ourselves, have you tried taking the ideas in this earlier thread?
> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html
> 
> Ideally, libc's memcmp() will already be using the most efficient assembly
> instructions without us having to reproduce the work of picking the instructions
> that work best.
> 

Eric, thanks for the information. I didn't notice that discussion before.


I rewrote buffer_find_nonzero_offset() using
'bool memeqzero4_paolo(const void *data, size_t length)', then wrote a test
program that checks a large number of zero pages and used 'time' to record
the time taken by each optimization. The test results are as follows:

SSE2:
-------------------------------------
          |  test 1   |  test 2
-------------------------------------
Time (s)  |  13.696   |  13.533
-------------------------------------


AVX2:
-------------------------------------
          |  test 1   |  test 2
-------------------------------------
Time (s)  |  10.583   |  10.306
-------------------------------------

memeqzero4_paolo:
-------------------------------------
          |  test 1   |  test 2
-------------------------------------
Time (s)  |   9.718   |   9.817
-------------------------------------


Paolo's implementation has the best performance. It seems that we can remove the SSE2-related intrinsics.

Liang
> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  5:48   ` Li, Liang Z
@ 2015-11-10  9:13     ` Juan Quintela
  2015-11-10  9:26       ` Li, Liang Z
  2015-11-10  9:30       ` Paolo Bonzini
  0 siblings, 2 replies; 35+ messages in thread
From: Juan Quintela @ 2015-11-10  9:13 UTC (permalink / raw)
  To: Li, Liang Z; +Cc: amit.shah, pbonzini, qemu-devel, mst

"Li, Liang Z" <liang.z.li@intel.com> wrote:
>> Rather than trying to cater to multiple assembly instruction implementations
>> ourselves, have you tried taking the ideas in this earlier thread?
>> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html
>> 
>> Ideally, libc's memcmp() will already be using the most efficient assembly
>> instructions without us having to reproduce the work of picking the instructions
>> that work best.
>> 
>
> Eric, thanks for you information. I didn't notice that discussion before.
>
>
> I rewrite the buffer_find_nonzero_offset() with the 'bool memeqzero4_paolo length'
> then write a test program to check a large amount of zero pages, and
> use the 'time' to
> recode the time takes by different optimization. Test result is like this:
>
> SSE2:
> ------------------------------------------------------
>               |            test 1         |     test 2
> ----------------------------------------------------
> Time(S):|       13.696            | 13.533  
> ------------------------------------------------
>
>
> AVX2:
> -------------------------------------------
>               |        test 1     | test 2
> -------------------------------------------
> Time (S):|      10.583      |  10.306
> -------------------------------------------
>
> memeqzero4_paolo:
> ---------------------------------------
>               |        test 1     | test 2
> ---------------------------------------
> Time (S):|      9.718     |  9.817
> ----------------------------------------
>
>
> Paolo's implementation has the best performance. It seems that we can
> remove the SSE2 related Intrinsics.

How should I understand that comment?  That you are about to send an
email to remove the sse2 support and that I can forget about this patch?

Thanks, Juan.


>
> Liang
>> --
>> Eric Blake   eblake redhat com    +1-919-301-3266
>> Libvirt virtualization library http://libvirt.org

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:13     ` Juan Quintela
@ 2015-11-10  9:26       ` Li, Liang Z
  2015-11-10  9:35         ` Paolo Bonzini
  2015-11-10  9:30       ` Paolo Bonzini
  1 sibling, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-10  9:26 UTC (permalink / raw)
  To: quintela; +Cc: amit.shah, pbonzini, qemu-devel, mst

> > Eric, thanks for you information. I didn't notice that discussion before.
> >
> >
> > I rewrite the buffer_find_nonzero_offset() with the 'bool memeqzero4_paolo
> length'
> > then write a test program to check a large amount of zero pages, and
> > use the 'time' to recode the time takes by different optimization.
> > Test result is like this:
> >
> > SSE2:
> > ------------------------------------------------------
> >               |            test 1         |     test 2
> > ----------------------------------------------------
> > Time(S):|       13.696            | 13.533
> > ------------------------------------------------
> >
> >
> > AVX2:
> > -------------------------------------------
> >               |        test 1     | test 2
> > -------------------------------------------
> > Time (S):|      10.583      |  10.306
> > -------------------------------------------
> >
> > memeqzero4_paolo:
> > ---------------------------------------
> >               |        test 1     | test 2
> > ---------------------------------------
> > Time (S):|      9.718     |  9.817
> > ----------------------------------------
> >
> >
> > Paolo's implementation has the best performance. It seems that we can
> > remove the SSE2 related Intrinsics.
> 
> How should I understand that comment?  That you are about to send an email
> to remove the sse2 support and that I can forget about this patch?
> 
> Thanks, Juan.
> 

I don't know Paolo's opinion about how to deal with the SSE2 intrinsics; he is the author. From my personal view,
now that we have found a better way, why use such low-level SSE2/AVX2 intrinsics? I don't know if someone else
is working on this. If not, and the related maintainer agrees to remove them, I am happy to send out a new patch.

Let's forget my patch for the moment.

Liang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:13     ` Juan Quintela
  2015-11-10  9:26       ` Li, Liang Z
@ 2015-11-10  9:30       ` Paolo Bonzini
  1 sibling, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-10  9:30 UTC (permalink / raw)
  To: quintela, Li, Liang Z; +Cc: amit.shah, qemu-devel, mst



On 10/11/2015 10:13, Juan Quintela wrote:
>> > I rewrite the buffer_find_nonzero_offset() with the 'bool memeqzero4_paolo length'
>> > then write a test program to check a large amount of zero pages, and
>> > use the 'time' to
>> > recode the time takes by different optimization. Test result is like this:
>> >
>> > SSE2:
>> > ------------------------------------------------------
>> >               |            test 1         |     test 2
>> > ----------------------------------------------------
>> > Time(S):|       13.696            | 13.533  
>> > ------------------------------------------------
>> >
>> >
>> > AVX2:
>> > -------------------------------------------
>> >               |        test 1     | test 2
>> > -------------------------------------------
>> > Time (S):|      10.583      |  10.306
>> > -------------------------------------------
>> >
>> > memeqzero4_paolo:
>> > ---------------------------------------
>> >               |        test 1     | test 2
>> > ---------------------------------------
>> > Time (S):|      9.718     |  9.817
>> > ----------------------------------------
>> >
>> >
>> > Paolo's implementation has the best performance. It seems that we can
>> > remove the SSE2 related Intrinsics.

Note that you can simplify my implementation a lot, because
buffer_find_nonzero_offset already assumes that the buffer is aligned to
sizeof(VECTYPE), i.e. 16 bytes.  For example you can just check the
first 4 unsigned longs against zero and then call memcmp.
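
Something along these lines (a sketch only; the helper name is illustrative,
not code from the patch):

#include <stdbool.h>
#include <string.h>

/* Sketch: assumes buf is suitably aligned and len is a nonzero multiple of
 * 4 * sizeof(unsigned long); the existing
 * can_use_buffer_find_nonzero_offset() check is stricter than that. */
static bool buffer_is_zero_simple(const void *buf, size_t len)
{
    const unsigned long *p = buf;

    /* Most non-zero pages fail here, before memcmp() is even set up. */
    if (p[0] || p[1] || p[2] || p[3]) {
        return false;
    }

    /* The first four words are zero, so comparing the buffer against itself
     * shifted by four words proves the remainder is zero as well. */
    return memcmp(buf, p + 4, len - 4 * sizeof(unsigned long)) == 0;
}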

Paolo

> How should I understand that comment?  That you are about to send an
> email to remove the sse2 support and that I can forget about this patch?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:26       ` Li, Liang Z
@ 2015-11-10  9:35         ` Paolo Bonzini
  2015-11-10  9:41           ` Li, Liang Z
  2015-11-12  2:49           ` Li, Liang Z
  0 siblings, 2 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-10  9:35 UTC (permalink / raw)
  To: Li, Liang Z, quintela; +Cc: amit.shah, qemu-devel, mst



On 10/11/2015 10:26, Li, Liang Z wrote:
> I don't know Paolo's opinion about how to deal with the SSE2
> Intrinsics, he is the author. From my personal view, now that we have
> found a better way, why to use such low level SSE2/AVX2 Intrinsics.

I totally agree. :)

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:35         ` Paolo Bonzini
@ 2015-11-10  9:41           ` Li, Liang Z
  2015-11-10  9:50             ` Paolo Bonzini
  2015-11-12  2:49           ` Li, Liang Z
  1 sibling, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-10  9:41 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> On 10/11/2015 10:26, Li, Liang Z wrote:
> > I don't know Paolo's opinion about how to deal with the SSE2
> > Intrinsics, he is the author. From my personal view, now that we have
> > found a better way, why to use such low level SSE2/AVX2 Intrinsics.
> 
> I totally agree. :)
> 
> Paolo

Hi Paolo,

It seems you are the right person to remove them, since you are the author of both the SSE2 intrinsics and 'memeqzero4_paolo'.
Please forget my patch entirely.

Liang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:41           ` Li, Liang Z
@ 2015-11-10  9:50             ` Paolo Bonzini
  2015-11-10  9:56               ` Li, Liang Z
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-10  9:50 UTC (permalink / raw)
  To: Li, Liang Z, quintela; +Cc: amit.shah, qemu-devel, mst



On 10/11/2015 10:41, Li, Liang Z wrote:
>> On 10/11/2015 10:26, Li, Liang Z wrote:
>>> I don't know Paolo's opinion about how to deal with the SSE2 
>>> Intrinsics, he is the author. From my personal view, now that we
>>> have found a better way, why to use such low level SSE2/AVX2
>>> Intrinsics.
>> 
>> I totally agree. :)
> 
> It seems you are the right person to remove them, you are the author
> for both the 'SSE2 Intrinsics' and 'memeqzero4_paolo'. Please forget
> my patch totally.

I agree that your patch can be dropped, but go ahead and submit your
improvements!

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:50             ` Paolo Bonzini
@ 2015-11-10  9:56               ` Li, Liang Z
  2015-11-10 10:00                 ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-10  9:56 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> On 10/11/2015 10:41, Li, Liang Z wrote:
> >> On 10/11/2015 10:26, Li, Liang Z wrote:
> >>> I don't know Paolo's opinion about how to deal with the SSE2
> >>> Intrinsics, he is the author. From my personal view, now that we
> >>> have found a better way, why to use such low level SSE2/AVX2
> >>> Intrinsics.
> >>
> >> I totally agree. :)
> >
> > It seems you are the right person to remove them, you are the author
> > for both the 'SSE2 Intrinsics' and 'memeqzero4_paolo'. Please forget
> > my patch totally.
> 
> I agree that your patch can be dropped, but go ahead and submit your
> improvements!
> 
> Paolo

You mean I should do this work?
If you are busy, I can do it. I really hope the related improvement can be merged into QEMU 2.5.0.

Liang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:56               ` Li, Liang Z
@ 2015-11-10 10:00                 ` Paolo Bonzini
  2015-11-10 10:04                   ` Li, Liang Z
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-10 10:00 UTC (permalink / raw)
  To: Li, Liang Z, quintela; +Cc: amit.shah, qemu-devel, mst



On 10/11/2015 10:56, Li, Liang Z wrote:
> > I agree that your patch can be dropped, but go ahead and submit your
> > improvements!
> 
> You mean I do this work? 
> If you are busy, I can do this.

It's not that I'm busy, it's that it's your idea.  It doesn't matter if
I (and Peter Lieven too, actually) originally did the optimizations.

You also have the infrastructure to benchmark the improvements.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10 10:00                 ` Paolo Bonzini
@ 2015-11-10 10:04                   ` Li, Liang Z
  0 siblings, 0 replies; 35+ messages in thread
From: Li, Liang Z @ 2015-11-10 10:04 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> On 10/11/2015 10:56, Li, Liang Z wrote:
> > > I agree that your patch can be dropped, but go ahead and submit your
> > > improvements!
> >
> > You mean I do this work?
> > If you are busy, I can do this.
> 
> It's not that I'm busy, it's that it's your idea.  It doesn't matter if I (and Peter
> Lieven too, actually) originally did the optimizations.
> 
> You also have the infrastructure to benchmark the improvements.
> 
> Paolo

OK. I will rework and send a new patch.

Liang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-10  9:35         ` Paolo Bonzini
  2015-11-10  9:41           ` Li, Liang Z
@ 2015-11-12  2:49           ` Li, Liang Z
  2015-11-12  8:43             ` Paolo Bonzini
  1 sibling, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-12  2:49 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> 
> On 10/11/2015 10:26, Li, Liang Z wrote:
> > I don't know Paolo's opinion about how to deal with the SSE2
> > Intrinsics, he is the author. From my personal view, now that we have
> > found a better way, why to use such low level SSE2/AVX2 Intrinsics.
> 
> I totally agree. :)
> 
> Paolo

Hi Paolo,

I am very surprised by the live migration performance result when I use your 'memeqzero4_paolo' instead of the SSE2 intrinsics to check for zero pages.
The total live migration time increased by about 8%, not decreased, although in the unit test your 'memeqzero4_paolo' has better performance. Any idea why?

Liang


  

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  2:49           ` Li, Liang Z
@ 2015-11-12  8:43             ` Paolo Bonzini
  2015-11-12  8:53               ` Li, Liang Z
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-12  8:43 UTC (permalink / raw)
  To: Li, Liang Z, quintela; +Cc: amit.shah, qemu-devel, mst



On 12/11/2015 03:49, Li, Liang Z wrote:
> I am very surprised about the live migration performance  result when
> I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics to
> check the zero pages.

What code were you using?  Remember I suggested using only unsigned long
checks, like

	unsigned long *p = ...
	if (p[0] || p[1] || p[2] || p[3]
	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
		return BUFFER_NOT_ZERO;
	else
		return BUFFER_ZERO;

> The total live migration time increased about
> 8%!   Not decreased.  Although in the unit test your '
> memeqzero4_paolo'  has better performance, any idea?

You only tested the case of zero pages.  But real pages usually are not
zero, even if they have a few zero bytes at the beginning.  It's very
important to optimize the initial check before the memcmp call.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  8:43             ` Paolo Bonzini
@ 2015-11-12  8:53               ` Li, Liang Z
  2015-11-12  9:04                 ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-12  8:53 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> On 12/11/2015 03:49, Li, Liang Z wrote:
> > I am very surprised about the live migration performance  result when
> > I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics to
> > check the zero pages.
> 
> What code were you using?  Remember I suggested using only unsigned long
> checks, like
> 
> 	unsigned long *p = ...
> 	if (p[0] || p[1] || p[2] || p[3]
> 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> 		return BUFFER_NOT_ZERO;
> 	else
> 		return BUFFER_ZERO;
> 



I use the following code:


bool memeqzero4_paolo(const void *data, size_t length)
{
    const unsigned char *p = data;
    unsigned long word;

    if (!length)
        return true;

    /* Check len bytes not aligned on a word.  */
    while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
        if (*p)
            return false;
        p++;
        length--;
        if (!length)
            return true;
    }

    /* Check up to 16 bytes a word at a time.  */
    for (;;) {
        memcpy(&word, p, sizeof(word));
        if (word)
            return false;
        p += sizeof(word);
        length -= sizeof(word);
        if (!length)
            return true;
        if (__builtin_expect(length & 15, 0) == 0)
            break;
    }

     /* Now we know that's zero, memcmp with self. */
     return memcmp(data, p, length) == 0;
}

> > The total live migration time increased about
> > 8%!   Not decreased.  Although in the unit test your '
> > memeqzero4_paolo'  has better performance, any idea?
> 
> You only tested the case of zero pages.  But real pages usually are not zero,
> even if they have a few zero bytes at the beginning.  It's very important to
> optimize the initial check before the memcmp call.
> 

In the unit test, I also only test zero pages, and the performance of 'memeqzero4_paolo' is better.
But when merged into QEMU, it caused a performance drop. Why?

> Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  8:53               ` Li, Liang Z
@ 2015-11-12  9:04                 ` Paolo Bonzini
  2015-11-12  9:40                   ` Li, Liang Z
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-12  9:04 UTC (permalink / raw)
  To: Li, Liang Z, quintela; +Cc: amit.shah, qemu-devel, mst



On 12/11/2015 09:53, Li, Liang Z wrote:
>> On 12/11/2015 03:49, Li, Liang Z wrote:
>>> I am very surprised about the live migration performance  result when
>>> I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics to
>>> check the zero pages.
>>
>> What code were you using?  Remember I suggested using only unsigned long
>> checks, like
>>
>> 	unsigned long *p = ...
>> 	if (p[0] || p[1] || p[2] || p[3]
>> 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
>> 		return BUFFER_NOT_ZERO;
>> 	else
>> 		return BUFFER_ZERO;
>>
> 
> I use the following code:
> 
> 
> bool memeqzero4_paolo(const void *data, size_t length)
> {
>      ...
> }

The code you used is very generic and not optimized for the kind of data
you see during migration, hence the existing code in QEMU fares better.

>>> The total live migration time increased about
>>> 8%!   Not decreased.  Although in the unit test your '
>>> memeqzero4_paolo'  has better performance, any idea?
>>
>> You only tested the case of zero pages.  But real pages usually are not zero,
>> even if they have a few zero bytes at the beginning.  It's very important to
>> optimize the initial check before the memcmp call.
>>
> 
> In the unit test, I only test zero pages too, and the performance of  'memeqzero4_paolo' is better.
> But when merged into QEMU, it caused performance drop. Why?

Because QEMU is not migrating zero pages only.

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  9:04                 ` Paolo Bonzini
@ 2015-11-12  9:40                   ` Li, Liang Z
  2015-11-12  9:45                     ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-12  9:40 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> >>> I am very surprised about the live migration performance  result
> >>> when I use your ' memeqzero4_paolo' instead of these SSE2 Intrinsics
> >>> to check the zero pages.
> >>
> >> What code were you using?  Remember I suggested using only unsigned
> >> long checks, like
> >>
> >> 	unsigned long *p = ...
> >> 	if (p[0] || p[1] || p[2] || p[3]
> >> 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> >> 		return BUFFER_NOT_ZERO;
> >> 	else
> >> 		return BUFFER_ZERO;
> >>
> >
> > I use the following code:
> >
> >
> > bool memeqzero4_paolo(const void *data, size_t length) {
> >      ...
> > }
> 
> The code you used is very generic and not optimized for the kind of data you
> see during migration, hence the existing code in QEMU fares better.
> 

I migrated an 8GB RAM idle guest; I think most of its pages are zero pages.

I use your new code:
-------------------------------------------------
	unsigned long *p = ...
	if (p[0] || p[1] || p[2] || p[3]
	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
		return BUFFER_NOT_ZERO;
	else
		return BUFFER_ZERO;
---------------------------------------------------
and the result is almost the same.  I also tried checking 8 or 16 longs at the
beginning, with the same result.

> >>> The total live migration time increased about
> >>> 8%!   Not decreased.  Although in the unit test your '
> >>> memeqzero4_paolo'  has better performance, any idea?
> >>
> >> You only tested the case of zero pages.  But real pages usually are
> >> not zero, even if they have a few zero bytes at the beginning.  It's
> >> very important to optimize the initial check before the memcmp call.
> >>
> >
> > In the unit test, I only test zero pages too, and the performance of
> 'memeqzero4_paolo' is better.
> > But when merged into QEMU, it caused performance drop. Why?
> 
> Because QEMU is not migrating zero pages only.
> 
> Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  9:40                   ` Li, Liang Z
@ 2015-11-12  9:45                     ` Paolo Bonzini
  2015-11-12  9:53                       ` Li, Liang Z
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-12  9:45 UTC (permalink / raw)
  To: Li, Liang Z, quintela; +Cc: amit.shah, qemu-devel, mst



On 12/11/2015 10:40, Li, Liang Z wrote:
> I migrate a 8GB RAM Idle guest,  I think most of it's pages are zero pages.
> 
> I use your new code:
> -------------------------------------------------
> 	unsigned long *p = ...
> 	if (p[0] || p[1] || p[2] || p[3]
> 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> 		return BUFFER_NOT_ZERO;
> 	else
> 		return BUFFER_ZERO;
> ---------------------------------------------------
> and the result is almost the same.  I also tried the check 8, 16 long data at the beginning, 
> same result.

Interesting...  Well, all I can say is that I applaud you for testing your
hypothesis with the benchmark.

Probably the setup cost of memcmp is too high, because the testing loop
is already very optimized.

Please submit the AVX2 version if it helps!

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  9:45                     ` Paolo Bonzini
@ 2015-11-12  9:53                       ` Li, Liang Z
  2015-11-12 11:34                         ` Juan Quintela
  0 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-12  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, quintela; +Cc: amit.shah, qemu-devel, mst

> On 12/11/2015 10:40, Li, Liang Z wrote:
> > I migrate a 8GB RAM Idle guest,  I think most of it's pages are zero pages.
> >
> > I use your new code:
> > -------------------------------------------------
> > 	unsigned long *p = ...
> > 	if (p[0] || p[1] || p[2] || p[3]
> > 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> > 		return BUFFER_NOT_ZERO;
> > 	else
> > 		return BUFFER_ZERO;
> > ---------------------------------------------------
> > and the result is almost the same.  I also tried the check 8, 16 long
> > data at the beginning, same result.
> 
> Interesting...  Well, all I can say is that applaud you for testing your hypothesis
> with the benchmark.
> 
> Probably the setup cost of memcmp is too high, because the testing loop is
> already very optimized.
> 
> Please submit the AVX2 version if it helps!

Yes, the AVX2 version really helps. I have already submitted it; could you help review it?

I am curious about the original intention behind adding the SSE2 intrinsics; was it for the same reason?

I even suspect the VM may impact memcmp() performance; is that possible?

Liang

> Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-10  2:51 ` [Qemu-devel] [v2 1/2] cutils: " Liang Li
@ 2015-11-12 10:08   ` Paolo Bonzini
  2015-11-12 10:12     ` Li, Liang Z
                       ` (2 more replies)
  2015-11-12 14:43   ` Richard Henderson
  1 sibling, 3 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-12 10:08 UTC (permalink / raw)
  To: Liang Li, qemu-devel; +Cc: amit.shah, mst, quintela



On 10/11/2015 03:51, Liang Li wrote:
> buffer_find_nonzero_offset() is a hot function during live migration.
> Now it use SSE2 intructions for optimization. For platform supports
> AVX2 instructions, use the AVX2 instructions for optimization can help
> to improve the performance about 30% comparing to SSE2.
> Zero page check can be faster with this optimization, the test result
> shows that for an 8GB RAM idle guest, this patch can help to shorten
> the total live migration time about 6%.
> 
> This patch use the ifunc mechanism to select the proper function when
> running, for platform supports AVX2, excute the AVX2 instructions,
> else, excute the original code.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  include/qemu-common.h | 28 +++++++++++++++------
>  util/Makefile.objs    |  2 ++
>  util/avx2.c           | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  util/cutils.c         | 53 +++++++++++++++++++++++++++++++++++++--
>  4 files changed, 143 insertions(+), 9 deletions(-)
>  create mode 100644 util/avx2.c
> 
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index 2f74540..9fa7501 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -484,15 +484,29 @@ void qemu_hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  #endif
>  
>  #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> -static inline bool
> -can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> -{
> -    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> -                   * sizeof(VECTYPE)) == 0
> -            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> -}
> +bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len);
> +
>  size_t buffer_find_nonzero_offset(const void *buf, size_t len);
>  
> +extern bool
> +can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len);
> +
> +extern size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len);
> +
> +extern bool
> +can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len);
> +
> +extern size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len);
> +
> +__asm__(".type can_use_buffer_find_nonzero_offset, \%gnu_indirect_function");
> +__asm__(".type buffer_find_nonzero_offset, \%gnu_indirect_function");
> +
> +
> +void *can_use_buffer_find_nonzero_offset_ifunc(void) \
> +                     __asm__("can_use_buffer_find_nonzero_offset");
> +
> +void *buffer_find_nonzero_offset_ifunc(void) \
> +                     __asm__("buffer_find_nonzero_offset");
>  /*
>   * helper to parse debug environment variables
>   */
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index d7cc399..6aacad7 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -1,4 +1,5 @@
>  util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
> +util-obj-y += avx2.o
>  util-obj-$(CONFIG_POSIX) += compatfd.o
>  util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
>  util-obj-$(CONFIG_POSIX) += mmap-alloc.o
> @@ -29,3 +30,4 @@ util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
>  util-obj-y += qemu-coroutine-sleep.o
>  util-obj-y += coroutine-$(CONFIG_COROUTINE_BACKEND).o
>  util-obj-y += buffer.o
> +avx2.o-cflags      := $(AVX2_CFLAGS)
> diff --git a/util/avx2.c b/util/avx2.c
> new file mode 100644
> index 0000000..0e6915a
> --- /dev/null
> +++ b/util/avx2.c
> @@ -0,0 +1,69 @@
> +#include "qemu-common.h"
> +
> +#ifdef __AVX2__
> +#include <immintrin.h>
> +#define AVX2_VECTYPE        __m256i
> +#define AVX2_SPLAT(p)       _mm256_set1_epi8(*(p))
> +#define AVX2_ALL_EQ(v1, v2) \
> +    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
> +#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
> +
> +inline bool
> +can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
> +{
> +    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +                   * sizeof(AVX2_VECTYPE)) == 0
> +            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0);
> +}
> +
> +size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
> +{
> +    const AVX2_VECTYPE *p = buf;
> +    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
> +    size_t i;
> +
> +    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
> +
> +    if (!len) {
> +        return 0;
> +    }
> +
> +    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
> +        if (!AVX2_ALL_EQ(p[i], zero)) {
> +            return i * sizeof(AVX2_VECTYPE);
> +        }
> +    }
> +
> +    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
> +         i < len / sizeof(AVX2_VECTYPE);
> +         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
> +        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
> +        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
> +        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
> +        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
> +        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
> +        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
> +        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
> +            break;
> +        }
> +    }
> +
> +    return i * sizeof(AVX2_VECTYPE);
> +}
> +
> +#else
> +/* use the original functions if avx2 is not enabled when buiding*/
> +
> +inline bool
> +can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
> +{
> +    return can_use_buffer_find_nonzero_offset_inner(buf, len);
> +}
> +
> +inline size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
> +{
> +    return buffer_find_nonzero_offset_inner(buf, len);
> +}
> +
> +#endif
> +
> diff --git a/util/cutils.c b/util/cutils.c
> index cfeb848..cd478ce 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -26,6 +26,7 @@
>  #include <math.h>
>  #include <limits.h>
>  #include <errno.h>
> +#include <cpuid.h>
>  
>  #include "qemu/sockets.h"
>  #include "qemu/iov.h"
> @@ -161,6 +162,54 @@ int qemu_fdatasync(int fd)
>  #endif
>  }
>  
> +/* old compiler maynot define bit_AVX2 */
> +#ifndef bit_AVX2
> +#define bit_AVX2 (1 << 5)
> +#endif
> +
> +static inline bool avx2_support(void)
> +{
> +    int a, b, c, d;
> +
> +    if (__get_cpuid_max(0, NULL) < 7) {
> +        printf("max cpuid < 7\n");
> +        return false;
> +    }
> +
> +    __cpuid_count(7, 0, a, b, c, d);
> +    printf("b = %x\n", b);
> +    return b & bit_AVX2;
> +}
> +
> +void *buffer_find_nonzero_offset_ifunc(void)
> +{
> +    printf("deciding %s\n", __func__);
> +
> +    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
> +        buffer_find_nonzero_offset_avx2 : buffer_find_nonzero_offset_inner;
> +
> +    return func;
> +}
> +
> +void *can_use_buffer_find_nonzero_offset_ifunc(void)
> +{
> +    printf("deciding %s\n", __func__);
> +
> +    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
> +        can_use_buffer_find_nonzero_offset_avx2 :
> +        can_use_buffer_find_nonzero_offset_inner;
> +
> +    return func;
> +}
> +
> +inline bool
> +can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
> +{
> +    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +                   * sizeof(VECTYPE)) == 0
> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> +}
> +
>  /*
>   * Searches for an area with non-zero content in a buffer
>   *
> @@ -181,13 +230,13 @@ int qemu_fdatasync(int fd)
>   * If the buffer is all zero the return value is equal to len.
>   */
>  
> -size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> +size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
>  {
>      const VECTYPE *p = buf;
>      const VECTYPE zero = (VECTYPE){0};
>      size_t i;
>  
> -    assert(can_use_buffer_find_nonzero_offset(buf, len));
> +    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
>  
>      if (!len) {
>          return 0;
> 

The main issue here is that you are not testing whether the compiler 
supports gnu_indirect_function.

I suggest that you start by moving the functions to util/buffer-zero.c

Then the structure should be something like

#ifdef CONFIG_HAVE_AVX2
#include <immintrin.h>
#endif

... define buffer_find_nonzero_offset_inner ...
... define can_use_buffer_find_nonzero_offset_inner ...

#if defined CONFIG_HAVE_GNU_IFUNC && defined CONFIG_HAVE_AVX2
... define buffer_find_nonzero_offset_avx2 ...
... define can_use_buffer_find_nonzero_offset_avx2 ...
... define the indirect functions ...
#else
... define buffer_find_nonzero_offset that just calls buffer_find_nonzero_offset_inner ...
... define can_use_buffer_find_nonzero_offset that just calls can_use_buffer_find_nonzero_offset_inner ...
#endif

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-12 10:08   ` Paolo Bonzini
@ 2015-11-12 10:12     ` Li, Liang Z
  2015-11-12 11:30     ` Juan Quintela
  2015-11-13  2:49     ` Li, Liang Z
  2 siblings, 0 replies; 35+ messages in thread
From: Li, Liang Z @ 2015-11-12 10:12 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: amit.shah, mst, quintela

> 
> The main issue here is that you are not testing whether the compiler supports
> gnu_indirect_function.
> 
> I suggest that you start by moving the functions to util/buffer-zero.c
> 
> Then the structure should be something like
> 
> #ifdef CONFIG_HAVE_AVX2
> #include <immintrin.h>
> #endif
> 
> ... define buffer_find_nonzero_offset_inner ...
> ... define can_use_buffer_find_nonzero_offset_inner ...
> 
> #if defined CONFIG_HAVE_GNU_IFUNC && defined CONFIG_HAVE_AVX2 ...
> define buffer_find_nonzero_offset_avx2 ...
> ... define can_use_buffer_find_nonzero_offset_avx2 ...
> ... define the indirect functions ...
> #else
> ... define buffer_find_nonzero_offset that just calls
> buffer_find_nonzero_offset_inner ...
> ... define can_use_buffer_find_nonzero_offset that just calls
> can_use_buffer_find_nonzero_offset_inner ...
> #endif
> 
> Thanks,
> 
> Paolo

Got it, thanks.

Liang

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-12 10:08   ` Paolo Bonzini
  2015-11-12 10:12     ` Li, Liang Z
@ 2015-11-12 11:30     ` Juan Quintela
  2015-11-13  2:49     ` Li, Liang Z
  2 siblings, 0 replies; 35+ messages in thread
From: Juan Quintela @ 2015-11-12 11:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: amit.shah, Liang Li, qemu-devel, mst

Paolo Bonzini <pbonzini@redhat.com> wrote:

>
> The main issue here is that you are not testing whether the compiler 
> supports gnu_indirect_function.
>
> I suggest that you start by moving the functions to util/buffer-zero.c
>
> Then the structure should be something like
>
> #ifdef CONFIG_HAVE_AVX2
> #include <immintrin.h>
> #endif
>
> ... define buffer_find_nonzero_offset_inner ...
> ... define can_use_buffer_find_nonzero_offset_inner ...
>
> #if defined CONFIG_HAVE_GNU_IFUNC && defined CONFIG_HAVE_AVX2
> ... define buffer_find_nonzero_offset_avx2 ...
> ... define can_use_buffer_find_nonzero_offset_avx2 ...
> ... define the indirect functions ...
> #else
> ... define buffer_find_nonzero_offset that just calls
> buffer_find_nonzero_offset_inner ...
> ... define can_use_buffer_find_nonzero_offset that just calls
> can_use_buffer_find_nonzero_offset_inner ...
> #endif

My understanding of this was that glibc is better than hand-made asm,
that paolo4_memzero (or whatever it was called) was the best approach,
and that we should just remove SSE.  Have I missed something?


Later, Juan.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12  9:53                       ` Li, Liang Z
@ 2015-11-12 11:34                         ` Juan Quintela
  2015-11-12 11:42                           ` Li, Liang Z
  0 siblings, 1 reply; 35+ messages in thread
From: Juan Quintela @ 2015-11-12 11:34 UTC (permalink / raw)
  To: Li, Liang Z; +Cc: amit.shah, Paolo Bonzini, qemu-devel, mst

"Li, Liang Z" <liang.z.li@intel.com> wrote:
>> On 12/11/2015 10:40, Li, Liang Z wrote:
>> > I migrate a 8GB RAM Idle guest,  I think most of it's pages are zero pages.
>> >
>> > I use your new code:
>> > -------------------------------------------------
>> > 	unsigned long *p = ...
>> > 	if (p[0] || p[1] || p[2] || p[3]
>> > 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
>> > 		return BUFFER_NOT_ZERO;
>> > 	else
>> > 		return BUFFER_ZERO;
>> > ---------------------------------------------------
>> > and the result is almost the same.  I also tried the check 8, 16 long
>> > data at the beginning, same result.
>> 
>> Interesting...  Well, all I can say is that applaud you for testing
>> your hypothesis
>> with the benchmark.
>> 
>> Probably the setup cost of memcmp is too high, because the testing loop is
>> already very optimized.
>> 
>> Please submit the AVX2 version if it helps!

I read the email in the wrong order.  Forget about my other email.

Sorry, Juan.


>
> Yes, the AVX2 version really helps. I have already submitted it, could
> you help to review it?
>
> I am curious about the original intention to add the SSE2 Intrinsics,
> is the same reason?
>
> I even suspect the VM may impact the 'memcmp()' performance, is it possible?
>
> Liang
>
>> Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12 11:34                         ` Juan Quintela
@ 2015-11-12 11:42                           ` Li, Liang Z
  2015-11-12 19:56                             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-12 11:42 UTC (permalink / raw)
  To: quintela; +Cc: amit.shah, Paolo Bonzini, qemu-devel, mst

> >> >
> >> > I use your new code:
> >> > -------------------------------------------------
> >> > 	unsigned long *p = ...
> >> > 	if (p[0] || p[1] || p[2] || p[3]
> >> > 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> >> > 		return BUFFER_NOT_ZERO;
> >> > 	else
> >> > 		return BUFFER_ZERO;
> >> > ---------------------------------------------------
> >> > and the result is almost the same.  I also tried the check 8, 16
> >> > long data at the beginning, same result.
> >>
> >> Interesting...  Well, all I can say is that applaud you for testing
> >> your hypothesis with the benchmark.
> >>
> >> Probably the setup cost of memcmp is too high, because the testing
> >> loop is already very optimized.
> >>
> >> Please submit the AVX2 version if it helps!
> 
> I read the email in the wrong order.  Forget about my other email.
> 
> Sorry, Juan.
> 

One thing I still can't understand: why does the unit test in the host
environment show that memcmp() has better performance?

Liang
> 
> >
> > Yes, the AVX2 version really helps. I have already submitted it, could
> > you help to review it?
> >
> > I am curious about the original intention to add the SSE2 Intrinsics,
> > is the same reason?
> >
> > I even suspect the VM may impact the 'memcmp()' performance, is it
> possible?
> >
> > Liang
> >
> >> Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-10  2:51 ` [Qemu-devel] [v2 1/2] cutils: " Liang Li
  2015-11-12 10:08   ` Paolo Bonzini
@ 2015-11-12 14:43   ` Richard Henderson
  1 sibling, 0 replies; 35+ messages in thread
From: Richard Henderson @ 2015-11-12 14:43 UTC (permalink / raw)
  To: Liang Li, qemu-devel; +Cc: amit.shah, pbonzini, mst, quintela

On 11/10/2015 03:51 AM, Liang Li wrote:
> +__asm__(".type can_use_buffer_find_nonzero_offset, \%gnu_indirect_function");
> +__asm__(".type buffer_find_nonzero_offset, \%gnu_indirect_function");
> +
> +
> +void *can_use_buffer_find_nonzero_offset_ifunc(void) \
> +                     __asm__("can_use_buffer_find_nonzero_offset");
> +
> +void *buffer_find_nonzero_offset_ifunc(void) \
> +                     __asm__("buffer_find_nonzero_offset");


Not keen on this.  You can use the ifunc attribute instead of inline asm, and
the target attribute to enable per-function use of AVX2.  And if neither is
supported, due to compiler limitations, I don't think you should attempt to
work around that.
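
Roughly, a minimal sketch of that shape (illustrative names only, assuming a
GCC or clang toolchain on an ELF target where both attributes are available):

#include <stddef.h>
#include <cpuid.h>

static size_t find_nonzero_generic(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len && p[i] == 0; i++) {
        /* byte-at-a-time scan stands in for the real SSE2 loop */
    }
    return i;
}

__attribute__((target("avx2")))
static size_t find_nonzero_avx2(const void *buf, size_t len)
{
    /* The target attribute means only this function is built as AVX2, so
     * the rest of the file needs no -mavx2; a real version would use
     * 256-bit loads here. */
    return find_nonzero_generic(buf, len);
}

/* The resolver runs once, at load time, and picks the implementation. */
static size_t (*resolve_find_nonzero(void))(const void *, size_t)
{
    unsigned a, b, c, d;

    if (__get_cpuid_max(0, NULL) >= 7) {
        __cpuid_count(7, 0, a, b, c, d);
        if (b & (1u << 5)) {    /* CPUID.(EAX=7,ECX=0):EBX bit 5 is AVX2 */
            return find_nonzero_avx2;
        }
    }
    return find_nonzero_generic;
}

size_t find_nonzero(const void *buf, size_t len)
    __attribute__((ifunc("resolve_find_nonzero")));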


r~

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12 11:42                           ` Li, Liang Z
@ 2015-11-12 19:56                             ` Dr. David Alan Gilbert
  2015-11-12 20:20                               ` Eric Blake
  0 siblings, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2015-11-12 19:56 UTC (permalink / raw)
  To: Li, Liang Z; +Cc: amit.shah, Paolo Bonzini, mst, qemu-devel, quintela

* Li, Liang Z (liang.z.li@intel.com) wrote:
> > >> >
> > >> > I use your new code:
> > >> > -------------------------------------------------
> > >> > 	unsigned long *p = ...
> > >> > 	if (p[0] || p[1] || p[2] || p[3]
> > >> > 	    || memcmp(p+4, p, size - 4 * sizeof(unsigned long)) != 0)
> > >> > 		return BUFFER_NOT_ZERO;
> > >> > 	else
> > >> > 		return BUFFER_ZERO;
> > >> > ---------------------------------------------------
> > >> > and the result is almost the same.  I also tried the check 8, 16
> > >> > long data at the beginning, same result.
> > >>
> > >> Interesting...  Well, all I can say is that applaud you for testing
> > >> your hypothesis with the benchmark.
> > >>
> > >> Probably the setup cost of memcmp is too high, because the testing
> > >> loop is already very optimized.
> > >>
> > >> Please submit the AVX2 version if it helps!
> > 
> > I read the email in the wrong order.  Forget about my other email.
> > 
> > Sorry, Juan.
> > 
> 
> One thing I still can't understand, why the unit test in host environment shows
> 'memcmp()' have better performance?

Are you aware of any program other than QEMU that also wants to do something
similar?  Finding out whether a block of memory is zero sounds like something
that would be useful in lots of places; I just can't think of which ones.

Dave

> 
> Liang
> > 
> > >
> > > Yes, the AVX2 version really helps. I have already submitted it, could
> > > you help to review it?
> > >
> > > I am curious about the original intention to add the SSE2 Intrinsics,
> > > is the same reason?
> > >
> > > I even suspect the VM may impact the 'memcmp()' performance, is it
> > possible?
> > >
> > > Liang
> > >
> > >> Paolo
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12 19:56                             ` Dr. David Alan Gilbert
@ 2015-11-12 20:20                               ` Eric Blake
  2016-04-07 11:09                                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Blake @ 2015-11-12 20:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Li, Liang Z
  Cc: amit.shah, Paolo Bonzini, quintela, qemu-devel, mst


On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:

>> One thing I still can't understand, why the unit test in host environment shows
>> 'memcmp()' have better performance?

Have you tried running under a profiler, to see if there are hotspots or
at least get an idea of where the time is being spent?

> 
> Are you aware of any program other than QEMU that also wants to do something
> similar?  Finding whether a block of memory is zero, sounds like something
> that would be useful in lots of places, I just can't think which ones.

At least dd, cp, and probably several other utilities.  It would be nice
to post an RFE to glibc to see if they can come up with a dedicated
interface that is faster than memcmp(), although that still only helps
us when targeting a system new enough to have that interface.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-12 10:08   ` Paolo Bonzini
  2015-11-12 10:12     ` Li, Liang Z
  2015-11-12 11:30     ` Juan Quintela
@ 2015-11-13  2:49     ` Li, Liang Z
  2015-11-13  9:30       ` Paolo Bonzini
  2 siblings, 1 reply; 35+ messages in thread
From: Li, Liang Z @ 2015-11-13  2:49 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: amit.shah, mst, quintela

> > This patch use the ifunc mechanism to select the proper function when
> > running, for platform supports AVX2, excute the AVX2 instructions,
> > else, excute the original code.
> >
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  include/qemu-common.h | 28 +++++++++++++++------
> >  util/Makefile.objs    |  2 ++
> >  util/avx2.c           | 69
> +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  util/cutils.c         | 53 +++++++++++++++++++++++++++++++++++++--
> >  4 files changed, 143 insertions(+), 9 deletions(-)  create mode
> > 100644 util/avx2.c
> >
> > diff --git a/include/qemu-common.h b/include/qemu-common.h index
> > 2f74540..9fa7501 100644
> > --- a/include/qemu-common.h
> > +++ b/include/qemu-common.h
> > @@ -484,15 +484,29 @@ void qemu_hexdump(const char *buf, FILE *fp,
> > const char *prefix, size_t size);  #endif
> >
> >  #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8 -static inline
> > bool -can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> > -{
> > -    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> > -                   * sizeof(VECTYPE)) == 0
> > -            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> > -}
> > +bool can_use_buffer_find_nonzero_offset(const void *buf, size_t len);
> > +
> >  size_t buffer_find_nonzero_offset(const void *buf, size_t len);
> >
> > +extern bool
> > +can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len);
> > +
> > +extern size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t
> > +len);
> > +
> > +extern bool
> > +can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t
> > +len);
> > +
> > +extern size_t buffer_find_nonzero_offset_inner(const void *buf,
> > +size_t len);
> > +
> > +__asm__(".type can_use_buffer_find_nonzero_offset,
> > +\%gnu_indirect_function"); __asm__(".type buffer_find_nonzero_offset,
> > +\%gnu_indirect_function");
> > +
> > +
> > +void *can_use_buffer_find_nonzero_offset_ifunc(void) \
> > +                     __asm__("can_use_buffer_find_nonzero_offset");
> > +
> > +void *buffer_find_nonzero_offset_ifunc(void) \
> > +                     __asm__("buffer_find_nonzero_offset");
> >  /*
> >   * helper to parse debug environment variables
> >   */
> > diff --git a/util/Makefile.objs b/util/Makefile.objs index
> > d7cc399..6aacad7 100644
> > --- a/util/Makefile.objs
> > +++ b/util/Makefile.objs
> > @@ -1,4 +1,5 @@
> >  util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
> > +util-obj-y += avx2.o
> >  util-obj-$(CONFIG_POSIX) += compatfd.o
> >  util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
> >  util-obj-$(CONFIG_POSIX) += mmap-alloc.o @@ -29,3 +30,4 @@ util-obj-y
> > += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
> > util-obj-y += qemu-coroutine-sleep.o  util-obj-y +=
> > coroutine-$(CONFIG_COROUTINE_BACKEND).o
> >  util-obj-y += buffer.o
> > +avx2.o-cflags      := $(AVX2_CFLAGS)
> > diff --git a/util/avx2.c b/util/avx2.c new file mode 100644 index
> > 0000000..0e6915a
> > --- /dev/null
> > +++ b/util/avx2.c
> > @@ -0,0 +1,69 @@
> > +#include "qemu-common.h"
> > +
> > +#ifdef __AVX2__
> > +#include <immintrin.h>
> > +#define AVX2_VECTYPE        __m256i
> > +#define AVX2_SPLAT(p)       _mm256_set1_epi8(*(p))
> > +#define AVX2_ALL_EQ(v1, v2) \
> > +    (_mm256_movemask_epi8(_mm256_cmpeq_epi8(v1, v2)) == 0xFFFFFFFF)
> > +#define AVX2_VEC_OR(v1, v2) (_mm256_or_si256(v1, v2))
> > +
> > +inline bool
> > +can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
> > +{
> > +    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> > +                   * sizeof(AVX2_VECTYPE)) == 0
> > +            && ((uintptr_t) buf) % sizeof(AVX2_VECTYPE) == 0); }
> > +
> > +size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t len) {
> > +    const AVX2_VECTYPE *p = buf;
> > +    const AVX2_VECTYPE zero = (AVX2_VECTYPE){0};
> > +    size_t i;
> > +
> > +    assert(can_use_buffer_find_nonzero_offset_avx2(buf, len));
> > +
> > +    if (!len) {
> > +        return 0;
> > +    }
> > +
> > +    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
> > +        if (!AVX2_ALL_EQ(p[i], zero)) {
> > +            return i * sizeof(AVX2_VECTYPE);
> > +        }
> > +    }
> > +
> > +    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
> > +         i < len / sizeof(AVX2_VECTYPE);
> > +         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
> > +        AVX2_VECTYPE tmp0 = AVX2_VEC_OR(p[i + 0], p[i + 1]);
> > +        AVX2_VECTYPE tmp1 = AVX2_VEC_OR(p[i + 2], p[i + 3]);
> > +        AVX2_VECTYPE tmp2 = AVX2_VEC_OR(p[i + 4], p[i + 5]);
> > +        AVX2_VECTYPE tmp3 = AVX2_VEC_OR(p[i + 6], p[i + 7]);
> > +        AVX2_VECTYPE tmp01 = AVX2_VEC_OR(tmp0, tmp1);
> > +        AVX2_VECTYPE tmp23 = AVX2_VEC_OR(tmp2, tmp3);
> > +        if (!AVX2_ALL_EQ(AVX2_VEC_OR(tmp01, tmp23), zero)) {
> > +            break;
> > +        }
> > +    }
> > +
> > +    return i * sizeof(AVX2_VECTYPE);
> > +}
> > +
> > +#else
> > +/* use the original functions if avx2 is not enabled when buiding*/
> > +
> > +inline bool
> > +can_use_buffer_find_nonzero_offset_avx2(const void *buf, size_t len)
> > +{
> > +    return can_use_buffer_find_nonzero_offset_inner(buf, len); }
> > +
> > +inline size_t buffer_find_nonzero_offset_avx2(const void *buf, size_t
> > +len) {
> > +    return buffer_find_nonzero_offset_inner(buf, len); }
> > +
> > +#endif
> > +
> > diff --git a/util/cutils.c b/util/cutils.c index cfeb848..cd478ce
> > 100644
> > --- a/util/cutils.c
> > +++ b/util/cutils.c
> > @@ -26,6 +26,7 @@
> >  #include <math.h>
> >  #include <limits.h>
> >  #include <errno.h>
> > +#include <cpuid.h>
> >
> >  #include "qemu/sockets.h"
> >  #include "qemu/iov.h"
> > @@ -161,6 +162,54 @@ int qemu_fdatasync(int fd)  #endif  }
> >
> > +/* old compiler maynot define bit_AVX2 */ #ifndef bit_AVX2 #define
> > +bit_AVX2 (1 << 5) #endif
> > +
> > +static inline bool avx2_support(void) {
> > +    int a, b, c, d;
> > +
> > +    if (__get_cpuid_max(0, NULL) < 7) {
> > +        printf("max cpuid < 7\n");
> > +        return false;
> > +    }
> > +
> > +    __cpuid_count(7, 0, a, b, c, d);
> > +    printf("b = %x\n", b);
> > +    return b & bit_AVX2;
> > +}
> > +
> > +void *buffer_find_nonzero_offset_ifunc(void)
> > +{
> > +    printf("deciding %s\n", __func__);
> > +
> > +    typeof(buffer_find_nonzero_offset) *func = (avx2_support()) ?
> > +        buffer_find_nonzero_offset_avx2 :
> > + buffer_find_nonzero_offset_inner;
> > +
> > +    return func;
> > +}
> > +
> > +void *can_use_buffer_find_nonzero_offset_ifunc(void)
> > +{
> > +    printf("deciding %s\n", __func__);
> > +
> > +    typeof(can_use_buffer_find_nonzero_offset) *func = (avx2_support()) ?
> > +        can_use_buffer_find_nonzero_offset_avx2 :
> > +        can_use_buffer_find_nonzero_offset_inner;
> > +
> > +    return func;
> > +}
> > +
> > +inline bool
> > +can_use_buffer_find_nonzero_offset_inner(const void *buf, size_t len)
> > +{
> > +    return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> > +                   * sizeof(VECTYPE)) == 0
> > +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0); }
> > +
> >  /*
> >   * Searches for an area with non-zero content in a buffer
> >   *
> > @@ -181,13 +230,13 @@ int qemu_fdatasync(int fd)
> >   * If the buffer is all zero the return value is equal to len.
> >   */
> >
> > -size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> > +size_t buffer_find_nonzero_offset_inner(const void *buf, size_t len)
> >  {
> >      const VECTYPE *p = buf;
> >      const VECTYPE zero = (VECTYPE){0};
> >      size_t i;
> >
> > -    assert(can_use_buffer_find_nonzero_offset(buf, len));
> > +    assert(can_use_buffer_find_nonzero_offset_inner(buf, len));
> >
> >      if (!len) {
> >          return 0;
> >
> 
> The main issue here is that you are not testing whether the compiler supports
> gnu_indirect_function.
> 
> I suggest that you start by moving the functions to util/buffer-zero.c
> 
> Then the structure should be something like
> 
> #ifdef CONFIG_HAVE_AVX2
> #include <immintrin.h>
> #endif
> 
> ... define buffer_find_nonzero_offset_inner ...
> ... define can_use_buffer_find_nonzero_offset_inner ...

> #if defined CONFIG_HAVE_GNU_IFUNC && defined CONFIG_HAVE_AVX2 ...
> define buffer_find_nonzero_offset_avx2 ...
> ... define can_use_buffer_find_nonzero_offset_avx2 ...
> ... define the indirect functions ...
> #else
> ... define buffer_find_nonzero_offset that just calls
> buffer_find_nonzero_offset_inner ...
> ... define can_use_buffer_find_nonzero_offset that just calls
> can_use_buffer_find_nonzero_offset_inner ...
> #endif
> 
> Thanks,
> 
> Paolo

buffer_find_nonzero_offset_inner() and buffer_find_nonzero_offset_avx2() can't be defined in the same .c file.
Otherwise, if '-mavx2' is enabled for that file, buffer_find_nonzero_offset_inner() will also be compiled to AVX2 instructions.

Liang


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 1/2] cutils: add avx2 instruction optimization
  2015-11-13  2:49     ` Li, Liang Z
@ 2015-11-13  9:30       ` Paolo Bonzini
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-11-13  9:30 UTC (permalink / raw)
  To: Liang Z Li; +Cc: amit shah, mst, qemu-devel, quintela


> > ... define buffer_find_nonzero_offset_inner ...
> > ... define can_use_buffer_find_nonzero_offset_inner ...
> 
> > #if defined CONFIG_HAVE_GNU_IFUNC && defined CONFIG_HAVE_AVX2 ...
> > define buffer_find_nonzero_offset_avx2 ...
> > ... define can_use_buffer_find_nonzero_offset_avx2 ...
> > ... define the indirect functions ...
> > #else
> > ... define buffer_find_nonzero_offset that just calls
> > buffer_find_nonzero_offset_inner ...
> > ... define can_use_buffer_find_nonzero_offset that just calls
> > can_use_buffer_find_nonzero_offset_inner ...
> > #endif
> > 
> > Thanks,
> > 
> > Paolo
> 
> buffer_find_nonzero_offset_inner() and buffer_find_nonzero_offset_avx2()
> can't be defined in the same .c file.
> Otherwise, if '-mavx2' is enabled for that file,
> buffer_find_nonzero_offset_inner() will also be compiled to AVX2 instructions.

You can use __attribute__((__target__("avx2"))) on the avx2 version,
instead of compiling the whole file with -mavx2.
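
A minimal sketch of that (illustrative names, assuming a new enough GCC or
clang where per-function target attributes work with the intrinsics headers):

#include <immintrin.h>
#include <stdbool.h>

/* Only this function is compiled as AVX2; the surrounding file keeps the
 * default instruction set, so the file itself is built without -mavx2. */
__attribute__((__target__("avx2")))
static bool chunk_is_zero_avx2(const void *p)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    __m256i cmp = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());

    return _mm256_movemask_epi8(cmp) == -1;     /* all 32 bytes are zero */
}

static bool chunk_is_zero_plain(const void *p)
{
    const unsigned long *q = p;

    return (q[0] | q[1] | q[2] | q[3]) == 0;    /* 32 bytes on a 64-bit host */
}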

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2015-11-12 20:20                               ` Eric Blake
@ 2016-04-07 11:09                                 ` Dr. David Alan Gilbert
  2016-04-07 12:54                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2016-04-07 11:09 UTC (permalink / raw)
  To: Eric Blake
  Cc: quintela, Li, Liang Z, qemu-devel, mst, amit.shah, Paolo Bonzini

* Eric Blake (eblake@redhat.com) wrote:
> On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:
> 
> >> One thing I still can't understand, why the unit test in host environment shows
> >> 'memcmp()' have better performance?
> 
> Have you tried running under a profiler, to see if there are hotspots or
> at least get an idea of where the time is being spent?
> 
> > 
> > Are you aware of any program other than QEMU that also wants to do something
> > similar?  Finding whether a block of memory is zero, sounds like something
> > that would be useful in lots of places, I just can't think which ones.
> 
> At least dd, cp, and probably several other utilities.  It would be nice
> to post an RFE to glibc to see if they can come up with a dedicated
> interface that is faster than memcmp(), although that still only helps
> us when targetting a system new enough to have that interface.

I've just posted that RFE:
https://sourceware.org/bugzilla/show_bug.cgi?id=19920

Dave

> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2016-04-07 11:09                                 ` Dr. David Alan Gilbert
@ 2016-04-07 12:54                                   ` Michael S. Tsirkin
  2016-04-07 13:42                                     ` Dr. David Alan Gilbert
  2016-04-07 13:54                                     ` Paolo Bonzini
  0 siblings, 2 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2016-04-07 12:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Victor Kaplansky, quintela, Li, Liang Z, qemu-devel, amit.shah,
	Paolo Bonzini

On Thu, Apr 07, 2016 at 12:09:52PM +0100, Dr. David Alan Gilbert wrote:
> * Eric Blake (eblake@redhat.com) wrote:
> > On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:
> > 
> > >> One thing I still can't understand, why the unit test in host environment shows
> > >> 'memcmp()' have better performance?
> > 
> > Have you tried running under a profiler, to see if there are hotspots or
> > at least get an idea of where the time is being spent?
> > 
> > > 
> > > Are you aware of any program other than QEMU that also wants to do something
> > > similar?  Finding whether a block of memory is zero, sounds like something
> > > that would be useful in lots of places, I just can't think which ones.
> > 
> > At least dd, cp, and probably several other utilities.  It would be nice
> > to post an RFE to glibc to see if they can come up with a dedicated
> > interface that is faster than memcmp(), although that still only helps
> > us when targetting a system new enough to have that interface.
> 
> I've just posted that RFE:
> https://sourceware.org/bugzilla/show_bug.cgi?id=19920
> 
> Dave

Have you guys seen the discussion in
http://rusty.ozlabs.org/?p=560#respond

In particular it claims this is close to optimal:


char check_zero(char *p, int len)
{
    char res = 0;
    int i;

    for (i = 0; i < len; i++) {
        res = res | p[i];
    }

    return res;
}


That is, if you compile this function with -ftree-vectorize and -funroll-loops.
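
(For instance, assuming GCC and the function saved as check_zero.c:

gcc -O2 -ftree-vectorize -funroll-loops -S check_zero.c

then check_zero.s can be inspected for the vectorized loop.)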

Now, this version always scans all of the buffer, so
it will be slower when the buffer is *not* all zeroes.

Which might indicate that you need to know what your
workload is to implement a compare-to-zero efficiently,
and if that is the case, it's not clear this is appropriate for libc.


> > -- 
> > Eric Blake   eblake redhat com    +1-919-301-3266
> > Libvirt virtualization library http://libvirt.org
> > 
> 
> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2016-04-07 12:54                                   ` Michael S. Tsirkin
@ 2016-04-07 13:42                                     ` Dr. David Alan Gilbert
  2016-04-07 13:54                                     ` Paolo Bonzini
  1 sibling, 0 replies; 35+ messages in thread
From: Dr. David Alan Gilbert @ 2016-04-07 13:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Victor Kaplansky, quintela, Li, Liang Z, qemu-devel, amit.shah,
	Paolo Bonzini

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Thu, Apr 07, 2016 at 12:09:52PM +0100, Dr. David Alan Gilbert wrote:
> > * Eric Blake (eblake@redhat.com) wrote:
> > > On 11/12/2015 12:56 PM, Dr. David Alan Gilbert wrote:
> > > 
> > > >> One thing I still can't understand, why the unit test in host environment shows
> > > >> 'memcmp()' have better performance?
> > > 
> > > Have you tried running under a profiler, to see if there are hotspots or
> > > at least get an idea of where the time is being spent?
> > > 
> > > > 
> > > > Are you aware of any program other than QEMU that also wants to do something
> > > > similar?  Finding whether a block of memory is zero, sounds like something
> > > > that would be useful in lots of places, I just can't think which ones.
> > > 
> > > At least dd, cp, and probably several other utilities.  It would be nice
> > > to post an RFE to glibc to see if they can come up with a dedicated
> > > interface that is faster than memcmp(), although that still only helps
> > > us when targetting a system new enough to have that interface.
> > 
> > I've just posted that RFE:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=19920
> > 
> > Dave
> 
> Have you guys seen the discussion in
> http://rusty.ozlabs.org/?p=560#respond
> 
> In particular it claims this is close to optimal:
> 
> 
> char check_zero(char *p, int len)
> {
>     char res = 0;
>     int i;
> 
>     for (i = 0; i < len; i++) {
>         res = res | p[i];
>     }
> 
>     return res;
> }
> 
> 
> That is, if you compile this function with -ftree-vectorize and -funroll-loops.
> 
> Now, this version always scans all of the buffer, so
> it will be slower when buffer is *not* all-zeroes.
> 
> Which might indicate that you need to know what your
> workload is to implement compare to zero efficiently,
> and if that is the case, it's not clear this is appropriate for libc.

On the contrary; anything that needs a couple of carefully chosen
compiler switches and assumes a particular workload is much
better optimised in a library for the general workload.

Dave

> 
> > > -- 
> > > Eric Blake   eblake redhat com    +1-919-301-3266
> > > Libvirt virtualization library http://libvirt.org
> > > 
> > 
> > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [v2 0/2] add avx2 instruction optimization
  2016-04-07 12:54                                   ` Michael S. Tsirkin
  2016-04-07 13:42                                     ` Dr. David Alan Gilbert
@ 2016-04-07 13:54                                     ` Paolo Bonzini
  1 sibling, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2016-04-07 13:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Dr. David Alan Gilbert
  Cc: Victor Kaplansky, quintela, Li, Liang Z, qemu-devel, amit.shah



On 07/04/2016 14:54, Michael S. Tsirkin wrote:
> 
> char check_zero(char *p, int len)
> {
>     char res = 0;
>     int i;
> 
>     for (i = 0; i < len; i++) {
>         res = res | p[i];
>     }
> 
>     return res;
> }
> 
> 
> That is, if you compile this function with -ftree-vectorize and -funroll-loops.

What you get then is exactly the same as what we already have in QEMU,
except for:

- the QEMU one has 128 extra instructions (32 times pcmpeq, movmsk, cmp,
je) in the loop.  Those extra instructions probably are free because, in
the case where the function goes through the whole buffer, the cache
misses dominate despite the efforts of the hardware prefetcher

- the QEMU one has an extra small loop at the beginning that proceeds a
word at a time to catch the case where almost everything in the page is
nonzero.

> Now, this version always scans all of the buffer, so
> it will be slower when buffer is *not* all-zeroes.

This is by far the common case.

> Which might indicate that you need to know what your
> workload is to implement compare to zero efficiently,

Not necessarily.  The two cases (unrolled/higher setup cost, and
non-unrolled/lower setup cost) are the same as the "parallel" and
"sequential" parts in Amdahl's law, and they optimize for completely
opposite workloads.  Amdahl's law then tells you that by making the
non-unrolled part small enough you can get very close to the absolute
maximum speedup.
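
(For reference: if a fraction p of the work is sped up by a factor s, Amdahl's
law gives an overall speedup of 1 / ((1 - p) + p / s); once s is large, the
remaining (1 - p) term is what limits the total.)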

Now of course if you know that your workload is "almost everything is
zero except a few bytes at the end of the page" then you have the
problem that your workload sucks and you should hate the guy who wrote
the software running in the guest. :)

Paolo

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-04-07 13:55 UTC | newest]

Thread overview: 35+ messages
2015-11-10  2:51 [Qemu-devel] [v2 0/2] add avx2 instruction optimization Liang Li
2015-11-10  2:51 ` [Qemu-devel] [v2 1/2] cutils: " Liang Li
2015-11-12 10:08   ` Paolo Bonzini
2015-11-12 10:12     ` Li, Liang Z
2015-11-12 11:30     ` Juan Quintela
2015-11-13  2:49     ` Li, Liang Z
2015-11-13  9:30       ` Paolo Bonzini
2015-11-12 14:43   ` Richard Henderson
2015-11-10  2:51 ` [Qemu-devel] [v2 2/2] configure: add options to config avx2 Liang Li
2015-11-10  3:43 ` [Qemu-devel] [v2 0/2] add avx2 instruction optimization Eric Blake
2015-11-10  5:48   ` Li, Liang Z
2015-11-10  9:13     ` Juan Quintela
2015-11-10  9:26       ` Li, Liang Z
2015-11-10  9:35         ` Paolo Bonzini
2015-11-10  9:41           ` Li, Liang Z
2015-11-10  9:50             ` Paolo Bonzini
2015-11-10  9:56               ` Li, Liang Z
2015-11-10 10:00                 ` Paolo Bonzini
2015-11-10 10:04                   ` Li, Liang Z
2015-11-12  2:49           ` Li, Liang Z
2015-11-12  8:43             ` Paolo Bonzini
2015-11-12  8:53               ` Li, Liang Z
2015-11-12  9:04                 ` Paolo Bonzini
2015-11-12  9:40                   ` Li, Liang Z
2015-11-12  9:45                     ` Paolo Bonzini
2015-11-12  9:53                       ` Li, Liang Z
2015-11-12 11:34                         ` Juan Quintela
2015-11-12 11:42                           ` Li, Liang Z
2015-11-12 19:56                             ` Dr. David Alan Gilbert
2015-11-12 20:20                               ` Eric Blake
2016-04-07 11:09                                 ` Dr. David Alan Gilbert
2016-04-07 12:54                                   ` Michael S. Tsirkin
2016-04-07 13:42                                     ` Dr. David Alan Gilbert
2016-04-07 13:54                                     ` Paolo Bonzini
2015-11-10  9:30       ` Paolo Bonzini
