* [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
@ 2013-03-22 12:46 Peter Lieven
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h Peter Lieven
                   ` (9 more replies)
  0 siblings, 10 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

this is v4 of my patch series with various optimizations in
zero buffer checking and migration tweaks.

thanks especially to Eric Blake for reviewing.

v4:
- do not inline buffer_find_nonzero_offset()
- inline can_use_buffer_find_nonzero_offset() correctly
- re-add asserts in buffer_find_nonzero_offset() as profiling
  shows they do not hurt.
- replace the last occurrences of the scalar 8 with
  BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
- avoid dereferencing p already in patch 5 where we
  know that the page (p) is zero
- explicitly set bytes_sent = 0 if we skip a zero page.
  bytes_sent was 0 before, but it was not obvious.
- add accounting information for skipped zero pages
- fix errors reported by checkpatch.pl

v3:
- remove asserts, inline functions and add a check
  function to determine whether buffer_find_nonzero_offset()
  can be used.
- use the above check function in buffer_is_zero() and
  find_next_bit().
- use buffer_find_nonzero_offset() directly to find
  zero pages. we know that all requirements are met
  for memory pages.
- fix C89 violation in buffer_is_zero().
- avoid dereferencing p in ram_save_block() if we already
  know the page is zero.
- fix initialization of last_offset in reset_ram_globals().
- avoid skipping pages with offset == 0 in bulk stage in
  migration_bitmap_find_and_reset_dirty().
- compared to v1, check for zero pages also after bulk
  ram migration as there are guests (e.g. Windows) which
  zero out large amounts of memory while running.

v2:
- fix description, add trivial zero check and add asserts
  to buffer_find_nonzero_offset().
- add a constant for the unroll factor of buffer_find_nonzero_offset()
- replace is_dup_page() with buffer_is_zero()
- add test results to the xbzrle patch
- optimize descriptions

Peter Lieven (9):
  move vector definitions to qemu-common.h
  cutils: add a function to find non-zero content in a buffer
  buffer_is_zero: use vector optimizations if possible
  bitops: use vector algorithm to optimize find_next_bit()
  migration: search for zero instead of dup pages
  migration: add an indicator for bulk state of ram migration
  migration: do not send zero pages in bulk stage
  migration: do not search dirty pages in bulk stage
  migration: use XBZRLE only after bulk stage

 arch_init.c                   |   74 +++++++++++++++++++----------------------
 hmp.c                         |    2 ++
 include/migration/migration.h |    2 ++
 include/qemu-common.h         |   37 +++++++++++++++++++++
 migration.c                   |    3 +-
 qapi-schema.json              |    6 ++--
 qmp-commands.hx               |    3 +-
 util/bitops.c                 |   24 +++++++++++--
 util/cutils.c                 |   50 ++++++++++++++++++++++++++++
 9 files changed, 155 insertions(+), 46 deletions(-)

-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-25  8:35   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

vector optimizations will now be used at various places,
not just in is_dup_page() in arch_init.c.

this patch also adds a zero splat vector.
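
as an illustration (not part of this patch), this is roughly how the
shared macros compose; the sketch mirrors the is_dup_page() loop that
the hunk below removes and assumes buf is VECTYPE-aligned and len is
a multiple of sizeof(VECTYPE):

    /* sketch only: check whether every byte equals the first one */
    static bool all_bytes_equal_to_first(uint8_t *buf, size_t len)
    {
        VECTYPE *p = (VECTYPE *)buf;
        VECTYPE val = SPLAT(buf);   /* broadcast first byte to all lanes */
        size_t i;

        for (i = 0; i < len / sizeof(VECTYPE); i++) {
            if (!ALL_EQ(val, p[i])) {
                return false;
            }
        }
        return true;
    }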

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c           |   20 --------------------
 include/qemu-common.h |   24 ++++++++++++++++++++++++
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 98e2bc6..1b71912 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -114,26 +114,6 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
 
-#ifdef __ALTIVEC__
-#include <altivec.h>
-#define VECTYPE        vector unsigned char
-#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
-#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
-/* altivec.h may redefine the bool macro as vector type.
- * Reset it to POSIX semantics. */
-#undef bool
-#define bool _Bool
-#elif defined __SSE2__
-#include <emmintrin.h>
-#define VECTYPE        __m128i
-#define SPLAT(p)       _mm_set1_epi8(*(p))
-#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
-#else
-#define VECTYPE        unsigned long
-#define SPLAT(p)       (*(p) * (~0UL / 255))
-#define ALL_EQ(v1, v2) ((v1) == (v2))
-#endif
-
 
 static struct defconfig_file {
     const char *filename;
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 7754ee2..e76ade3 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -448,4 +448,28 @@ int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
 void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 
+/* vector definitions */
+#ifdef __ALTIVEC__
+#include <altivec.h>
+#define VECTYPE        vector unsigned char
+#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
+#define ZERO_SPLAT     vec_splat(vec_ld(0, 0), 0)
+#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
+/* altivec.h may redefine the bool macro as vector type.
+ * Reset it to POSIX semantics. */
+#undef bool
+#define bool _Bool
+#elif defined __SSE2__
+#include <emmintrin.h>
+#define VECTYPE        __m128i
+#define SPLAT(p)       _mm_set1_epi8(*(p))
+#define ZERO_SPLAT     _mm_setzero_si128()
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#else
+#define VECTYPE        unsigned long
+#define SPLAT(p)       (*(p) * (~0UL / 255))
+#define ZERO_SPLAT     0x0UL
+#define ALL_EQ(v1, v2) ((v1) == (v2))
+#endif
+
 #endif
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-22 19:37   ` Eric Blake
  2013-03-25  8:53   ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

this adds buffer_find_nonzero_offset(), an SSE2/Altivec
optimized function that searches for non-zero content in a
buffer.

due to the optimizations used in the function there are restrictions
on the buffer address and search length. the function
can_use_buffer_find_nonzero_offset() can be used to check whether
the function can be used safely.
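
a minimal usage sketch (the page pointer and length are illustrative
assumptions, not taken from this patch); a VECTYPE-aligned 4 KiB page
satisfies both restrictions, since 4096 is a multiple of
8 * sizeof(VECTYPE) for all three implementations:

    if (can_use_buffer_find_nonzero_offset(page, 4096)) {
        size_t off = buffer_find_nonzero_offset(page, 4096);
        if (off == 4096) {
            /* page is all zeroes */
        } else {
            /* first non-zero byte is at or after off, which is
             * rounded down to a 128-byte boundary on SSE2 */
        }
    }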

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qemu-common.h |   13 +++++++++++++
 util/cutils.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/qemu-common.h b/include/qemu-common.h
index e76ade3..078e535 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
 #define ALL_EQ(v1, v2) ((v1) == (v2))
 #endif
 
+#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
+static inline bool
+can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+                * sizeof(VECTYPE)) == 0
+            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
+        return true;
+    }
+    return false;
+}
+size_t buffer_find_nonzero_offset(const void *buf, size_t len);
+
 #endif
diff --git a/util/cutils.c b/util/cutils.c
index 1439da4..41c627e 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -143,6 +143,51 @@ int qemu_fdatasync(int fd)
 }
 
 /*
+ * Searches for an area with non-zero content in a buffer
+ *
+ * Attention! The len must be a multiple of
+ * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
+ * and addr must be a multiple of sizeof(VECTYPE) due to
+ * restriction of optimizations in this function.
+ *
+ * can_use_buffer_find_nonzero_offset() can be used to check
+ * these requirements.
+ *
+ * The return value is the offset of the non-zero area, rounded down
+ * to a multiple of BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE).
+ * If the buffer is all zero the return value is equal to len.
+ */
+
+size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    VECTYPE *p = (VECTYPE *)buf;
+    VECTYPE zero = ZERO_SPLAT;
+    size_t i;
+
+    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+        * sizeof(VECTYPE)) == 0);
+    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
+
+    if (*((const long *) buf)) {
+        return 0;
+    }
+
+    for (i = 0; i < len / sizeof(VECTYPE);
+            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+        VECTYPE tmp0 = p[i + 0] | p[i + 1];
+        VECTYPE tmp1 = p[i + 2] | p[i + 3];
+        VECTYPE tmp2 = p[i + 4] | p[i + 5];
+        VECTYPE tmp3 = p[i + 6] | p[i + 7];
+        VECTYPE tmp01 = tmp0 | tmp1;
+        VECTYPE tmp23 = tmp2 | tmp3;
+        if (!ALL_EQ(tmp01 | tmp23, zero)) {
+            break;
+        }
+    }
+    return i * sizeof(VECTYPE);
+}
+
+/*
  * Checks if a buffer is all zeroes
  *
  * Attention! The len must be a multiple of 4 * sizeof(long) due to
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 3/9] buffer_is_zero: use vector optimizations if possible
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h Peter Lieven
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-25  8:53   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

the performance gain on SSE2 is approx. 20-25%. Altivec
is not tested. performance for unsigned long arithmetic
is unchanged.
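
for intuition, a sketch of which calls take the new fast path,
assuming a 16-byte VECTYPE (SSE2) so that the required multiple is
8 * 16 = 128 bytes:

    buffer_is_zero(buf, 4096);  /* aligned buf: vectorized fast path */
    buffer_is_zero(buf, 96);    /* 96 % 128 != 0: old scalar loop    */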

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 util/cutils.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 41c627e..0f43c22 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -205,6 +205,11 @@ bool buffer_is_zero(const void *buf, size_t len)
     long d0, d1, d2, d3;
     const long * const data = buf;
 
+    /* use vector optimized zero check if possible */
+    if (can_use_buffer_find_nonzero_offset(buf, len)) {
+        return buffer_find_nonzero_offset(buf, len) == len;
+    }
+
     assert(len % (4 * sizeof(long)) == 0);
     len /= sizeof(long);
 
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 4/9] bitops: use vector algorithm to optimize find_next_bit()
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (2 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-25  9:04   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages Peter Lieven
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

this patch adds the usage of buffer_find_nonzero_offset()
to skip large areas of zeroes.

compared to the loop unrolling presented in an earlier
patch this adds another 50% performance benefit for
skipping large areas of zeroes. loop unrolling alone
added close to 100% speedup.
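
for intuition, the arithmetic behind the hand-off (16-byte VECTYPE and
64-bit longs are assumptions of this sketch, not of the patch):

    /* one unrolled step of buffer_find_nonzero_offset() covers
     *   8 * sizeof(VECTYPE) = 8 * 16 = 128 bytes = 1024 bits,
     * so find_next_bit() can only hand off while at least 1024
     * suitably aligned bits remain; each all-zero step advances
     * result by 1024 bits and p by 128 / sizeof(unsigned long),
     * i.e. 16 words. */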

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 util/bitops.c |   24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..9bb61ff 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -42,10 +42,28 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
-        if ((tmp = *(p++))) {
-            goto found_middle;
+    while (size >= BITS_PER_LONG) {
+        tmp = *p;
+        if (tmp) {
+             goto found_middle;
+        }
+        if (can_use_buffer_find_nonzero_offset(p, size / BITS_PER_BYTE)) {
+            size_t tmp2 =
+                buffer_find_nonzero_offset(p, size / BITS_PER_BYTE);
+            result += tmp2 * BITS_PER_BYTE;
+            size -= tmp2 * BITS_PER_BYTE;
+            p += tmp2 / sizeof(unsigned long);
+            if (!size) {
+                return result;
+            }
+            if (tmp2) {
+                tmp = *p;
+                if (tmp) {
+                    goto found_middle;
+                }
+            }
         }
+        p++;
         result += BITS_PER_LONG;
         size -= BITS_PER_LONG;
     }
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (3 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-22 19:49   ` Eric Blake
  2013-03-25  9:30   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

virtually all dup pages are zero pages. remove
the special is_dup_page() function and use the
optimized buffer_find_nonzero_offset() function
instead.

here buffer_find_nonzero_offset() is used directly
to avoid the unnecessary additional checks in
buffer_is_zero().

the raw performance gain checking zeroed memory
over is_dup_page() is approx. 15-20% with SSE2.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 1b71912..9ebca83 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -144,19 +144,10 @@ int qemu_read_default_config_files(bool userconfig)
     return 0;
 }
 
-static int is_dup_page(uint8_t *page)
+static inline bool is_zero_page(uint8_t *p)
 {
-    VECTYPE *p = (VECTYPE *)page;
-    VECTYPE val = SPLAT(page);
-    int i;
-
-    for (i = 0; i < TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
-        if (!ALL_EQ(val, p[i])) {
-            return 0;
-        }
-    }
-
-    return 1;
+    return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) ==
+        TARGET_PAGE_SIZE;
 }
 
 /* struct contains XBZRLE cache and a static page
@@ -443,12 +434,12 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_dup_page(p)) {
+            if (is_zero_page(p)) {
                 acct_info.dup_pages++;
                 bytes_sent = save_block_hdr(f, block, offset, cont,
                                             RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
-                bytes_sent += 1;
+                qemu_put_byte(f, 0);
+                bytes_sent++;
             } else if (migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 6/9] migration: add an indicator for bulk state of ram migration
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (4 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-25  9:32   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage Peter Lieven
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

the first round of ram transfer is special since all pages
are dirty and thus all memory pages are transferred to
the target. this patch adds a boolean variable to track
this stage.
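
a sketch of the flag's lifecycle as wired up by the hunks below:

    /* reset_ram_globals():  ram_bulk_stage = true;   migration setup
     * ram_save_block():     ram_bulk_stage = false;  after the first
     *                       complete pass over all ram blocks        */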

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch_init.c b/arch_init.c
index 9ebca83..4c4caf4 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -317,6 +317,7 @@ static ram_addr_t last_offset;
 static unsigned long *migration_bitmap;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
+static bool ram_bulk_stage;
 
 static inline
 ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
@@ -424,6 +425,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             if (!block) {
                 block = QTAILQ_FIRST(&ram_list.blocks);
                 complete_round = true;
+                ram_bulk_stage = false;
             }
         } else {
             uint8_t *p;
@@ -527,6 +529,7 @@ static void reset_ram_globals(void)
     last_sent_block = NULL;
     last_offset = 0;
     last_version = ram_list.version;
+    ram_bulk_stage = true;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (5 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-22 20:13   ` Eric Blake
  2013-03-25  9:44   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 8/9] migration: do not search dirty " Peter Lieven
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

during the bulk stage of ram migration, if a page is a
zero page, do not send it at all.
the memory at the destination reads as zero anyway.

even if there is an madvise with QEMU_MADV_DONTNEED
at the target upon receipt of a zero page, I have observed
that the target starts swapping if the memory is overcommitted.
it seems that the pages are dropped asynchronously.
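
for illustration, the new counter as it would surface via QMP (the
field values below are made up; only the "skipped" key is new in this
patch):

    -> { "execute": "query-migrate" }
    <- { "return": { "status": "active",
                     "ram": { "transferred": 123456, "remaining": 123456,
                              "total": 262144000, "duplicate": 123,
                              "skipped": 78, "normal": 139333,
                              "normal-bytes": 123456 } } }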

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c                   |   24 ++++++++++++++++++++----
 hmp.c                         |    2 ++
 include/migration/migration.h |    2 ++
 migration.c                   |    3 ++-
 qapi-schema.json              |    6 ++++--
 qmp-commands.hx               |    3 ++-
 6 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 4c4caf4..c34a4af 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -181,6 +181,7 @@ int64_t xbzrle_cache_resize(int64_t new_size)
 /* accounting for migration statistics */
 typedef struct AccountingInfo {
     uint64_t dup_pages;
+    uint64_t skipped_pages;
     uint64_t norm_pages;
     uint64_t iterations;
     uint64_t xbzrle_bytes;
@@ -206,6 +207,16 @@ uint64_t dup_mig_pages_transferred(void)
     return acct_info.dup_pages;
 }
 
+uint64_t skipped_mig_bytes_transferred(void)
+{
+    return acct_info.skipped_pages * TARGET_PAGE_SIZE;
+}
+
+uint64_t skipped_mig_pages_transferred(void)
+{
+    return acct_info.skipped_pages;
+}
+
 uint64_t norm_mig_bytes_transferred(void)
 {
     return acct_info.norm_pages * TARGET_PAGE_SIZE;
@@ -438,10 +449,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             bytes_sent = -1;
             if (is_zero_page(p)) {
                 acct_info.dup_pages++;
-                bytes_sent = save_block_hdr(f, block, offset, cont,
-                                            RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, 0);
-                bytes_sent++;
+                if (!ram_bulk_stage) {
+                    bytes_sent = save_block_hdr(f, block, offset, cont,
+                                                RAM_SAVE_FLAG_COMPRESS);
+                    qemu_put_byte(f, 0);
+                    bytes_sent++;
+                } else {
+                    acct_info.skipped_pages++;
+                    bytes_sent = 0;
+                }
             } else if (migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
diff --git a/hmp.c b/hmp.c
index b0a861c..e3e833e 100644
--- a/hmp.c
+++ b/hmp.c
@@ -173,6 +173,8 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->ram->total >> 10);
         monitor_printf(mon, "duplicate: %" PRIu64 " pages\n",
                        info->ram->duplicate);
+        monitor_printf(mon, "skipped: %" PRIu64 " pages\n",
+                       info->ram->skipped);
         monitor_printf(mon, "normal: %" PRIu64 " pages\n",
                        info->ram->normal);
         monitor_printf(mon, "normal bytes: %" PRIu64 " kbytes\n",
diff --git a/include/migration/migration.h b/include/migration/migration.h
index bb617fd..e2acec6 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -96,6 +96,8 @@ extern SaveVMHandlers savevm_ram_handlers;
 
 uint64_t dup_mig_bytes_transferred(void);
 uint64_t dup_mig_pages_transferred(void);
+uint64_t skipped_mig_bytes_transferred(void);
+uint64_t skipped_mig_pages_transferred(void);
 uint64_t norm_mig_bytes_transferred(void);
 uint64_t norm_mig_pages_transferred(void);
 uint64_t xbzrle_mig_bytes_transferred(void);
diff --git a/migration.c b/migration.c
index 185d112..7fb2147 100644
--- a/migration.c
+++ b/migration.c
@@ -197,11 +197,11 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         info->ram->remaining = ram_bytes_remaining();
         info->ram->total = ram_bytes_total();
         info->ram->duplicate = dup_mig_pages_transferred();
+        info->ram->skipped = skipped_mig_pages_transferred();
         info->ram->normal = norm_mig_pages_transferred();
         info->ram->normal_bytes = norm_mig_bytes_transferred();
         info->ram->dirty_pages_rate = s->dirty_pages_rate;
 
-
         if (blk_mig_active()) {
             info->has_disk = true;
             info->disk = g_malloc0(sizeof(*info->disk));
@@ -227,6 +227,7 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         info->ram->remaining = 0;
         info->ram->total = ram_bytes_total();
         info->ram->duplicate = dup_mig_pages_transferred();
+        info->ram->skipped = skipped_mig_pages_transferred();
         info->ram->normal = norm_mig_pages_transferred();
         info->ram->normal_bytes = norm_mig_bytes_transferred();
         break;
diff --git a/qapi-schema.json b/qapi-schema.json
index fdaa9da..b737460 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -496,7 +496,9 @@
 #
 # @total: total amount of bytes involved in the migration process
 #
-# @duplicate: number of duplicate pages (since 1.2)
+# @duplicate: number of duplicate (zero) pages (since 1.2)
+#
+# @skipped: number of skipped zero pages (since 1.5)
 #
 # @normal : number of normal pages (since 1.2)
 #
@@ -510,7 +512,7 @@
 { 'type': 'MigrationStats',
   'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
            'duplicate': 'int', 'normal': 'int', 'normal-bytes': 'int',
-           'dirty-pages-rate' : 'int' } }
+           'dirty-pages-rate' : 'int', 'skipped': 'int' } }
 
 ##
 # @XBZRLECacheStats
diff --git a/qmp-commands.hx b/qmp-commands.hx
index b370060..fed74c6 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2442,7 +2442,8 @@ The main json-object contains the following:
          - "transferred": amount transferred (json-int)
          - "remaining": amount remaining (json-int)
          - "total": total (json-int)
-         - "duplicate": number of duplicated pages (json-int)
+         - "duplicate": number of duplicated (zero) pages (json-int)
+         - "skipped": number of skipped zero pages (json-int)
          - "normal" : number of normal pages transferred (json-int)
          - "normal-bytes" : number of normal bytes transferred (json-int)
 - "disk": only present if "status" is "active" and it is a block migration,
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 8/9] migration: do not search dirty pages in bulk stage
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (6 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-25 10:05   ` Orit Wasserman
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 9/9] migration: use XBZRLE only after " Peter Lieven
  2013-03-22 17:25 ` [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Paolo Bonzini
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

avoid searching for dirty pages; just increment the
page offset. all pages are dirty anyway.
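
a sketch of the shortcut's effect (illustrative): during the bulk stage
every page is dirty, so the next dirty page after nr is simply nr + 1;
only the first page of each block (nr == base) still goes through
find_next_bit(), which is why the fast path requires nr > base:

    /* bulk stage, nr > base:  next = nr + 1  (no bitmap scan)
     * otherwise:              next = find_next_bit(bitmap, size, nr) */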

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index c34a4af..b2b932a 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -338,7 +338,13 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
     unsigned long nr = base + (start >> TARGET_PAGE_BITS);
     unsigned long size = base + (int128_get64(mr->size) >> TARGET_PAGE_BITS);
 
-    unsigned long next = find_next_bit(migration_bitmap, size, nr);
+    unsigned long next;
+
+    if (ram_bulk_stage && nr > base) {
+        next = nr + 1;
+    } else {
+        next = find_next_bit(migration_bitmap, size, nr);
+    }
 
     if (next < size) {
         clear_bit(next, migration_bitmap);
-- 
1.7.9.5


* [Qemu-devel] [PATCHv4 9/9] migration: use XBZRLE only after bulk stage
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (7 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 8/9] migration: do not search dirty " Peter Lieven
@ 2013-03-22 12:46 ` Peter Lieven
  2013-03-25 10:16   ` Orit Wasserman
  2013-03-22 17:25 ` [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Paolo Bonzini
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 12:46 UTC (permalink / raw)
  To: qemu-devel
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, Orit Wasserman, Paolo Bonzini

at the beginning of migration all pages are marked dirty and
in the first round a bulk migration of all pages is performed.

currently all these pages are copied to the page cache regardless
of whether they are frequently updated or not. this doesn't make sense
since most of these pages are never transferred again.

this patch changes the XBZRLE transfer to only be used after
the bulk stage has been completed. that means a page is added
to the page cache the second time it is transferred, and XBZRLE
can benefit from the third transfer onwards.

since the page cache is likely smaller than the number of pages,
it's also likely that in the second round a page is missing from the
cache due to collisions in the bulk phase.

on the other hand, a lot of unnecessary mallocs, memdups and frees
are saved.
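
in other words, a sketch of the per-page timeline with this patch
applied (an illustration, not output from the patch):

    transfer #1 (bulk stage):   page sent in full, not added to the cache
    transfer #2 (first dirty):  page sent in full, now added to the cache
    transfer #3 and later:      XBZRLE-encoded against the cached copy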

the following results have been taken earlier while executing
the test program from docs/xbzrle.txt. (+) with the patch and (-)
without. (thanks to Eric Blake for reformatting and comments)

+ total time: 22185 milliseconds
- total time: 22410 milliseconds

Shaved 0.2 seconds, about 1%!

+ downtime: 29 milliseconds
- downtime: 21 milliseconds

Not sure why downtime seemed worse, but probably not the end of the world.

+ transferred ram: 706034 kbytes
- transferred ram: 721318 kbytes

Fewer bytes sent - good.

+ remaining ram: 0 kbytes
- remaining ram: 0 kbytes
+ total ram: 1057216 kbytes
- total ram: 1057216 kbytes
+ duplicate: 108556 pages
- duplicate: 105553 pages
+ normal: 175146 pages
- normal: 179589 pages
+ normal bytes: 700584 kbytes
- normal bytes: 718356 kbytes

Fewer normal bytes...

+ cache size: 67108864 bytes
- cache size: 67108864 bytes
+ xbzrle transferred: 3127 kbytes
- xbzrle transferred: 630 kbytes

...and more compressed pages sent - good.

+ xbzrle pages: 117811 pages
- xbzrle pages: 21527 pages
+ xbzrle cache miss: 18750
- xbzrle cache miss: 179589

And very good improvement on the cache miss rate.

+ xbzrle overflow : 0
- xbzrle overflow : 0

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index b2b932a..86f7e28 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -464,7 +464,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
                     acct_info.skipped_pages++;
                     bytes_sent = 0;
                 }
-            } else if (migrate_use_xbzrle()) {
+            } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
                                               offset, cont, last_stage);
-- 
1.7.9.5


* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (8 preceding siblings ...)
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 9/9] migration: use XBZRLE only after " Peter Lieven
@ 2013-03-22 17:25 ` Paolo Bonzini
  2013-03-22 19:20   ` Peter Lieven
  9 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-22 17:25 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, Stefan Hajnoczi, qemu-devel, quintela

On 22/03/2013 13:46, Peter Lieven wrote:
> this is v4 of my patch series with various optimizations in
> zero buffer checking and migration tweaks.
> 
> thanks especially to Eric Blake for reviewing.
> 
> v4:
> - do not inline buffer_find_nonzero_offset()
> - inline can_use_buffer_find_nonzero_offset() correctly
> - re-add asserts in buffer_find_nonzero_offset() as profiling
>   shows they do not hurt.
> - replace the last occurrences of the scalar 8 with
>   BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> - avoid dereferencing p already in patch 5 where we
>   know that the page (p) is zero
> - explicitly set bytes_sent = 0 if we skip a zero page.
>   bytes_sent was 0 before, but it was not obvious.
> - add accounting information for skipped zero pages
> - fix errors reported by checkpatch.pl
> 
> v3:
> - remove asserts, inline functions and add a check
>   function to determine whether buffer_find_nonzero_offset()
>   can be used.
> - use the above check function in buffer_is_zero() and
>   find_next_bit().
> - use buffer_find_nonzero_offset() directly to find
>   zero pages. we know that all requirements are met
>   for memory pages.
> - fix C89 violation in buffer_is_zero().
> - avoid dereferencing p in ram_save_block() if we already
>   know the page is zero.
> - fix initialization of last_offset in reset_ram_globals().
> - avoid skipping pages with offset == 0 in bulk stage in
>   migration_bitmap_find_and_reset_dirty().
> - compared to v1, check for zero pages also after bulk
>   ram migration as there are guests (e.g. Windows) which
>   zero out large amounts of memory while running.
> 
> v2:
> - fix description, add trivial zero check and add asserts
>   to buffer_find_nonzero_offset().
> - add a constant for the unroll factor of buffer_find_nonzero_offset()
> - replace is_dup_page() with buffer_is_zero()
> - add test results to the xbzrle patch
> - optimize descriptions
> 
> Peter Lieven (9):
>   move vector definitions to qemu-common.h
>   cutils: add a function to find non-zero content in a buffer
>   buffer_is_zero: use vector optimizations if possible
>   bitops: use vector algorithm to optimize find_next_bit()
>   migration: search for zero instead of dup pages
>   migration: add an indicator for bulk state of ram migration
>   migration: do not send zero pages in bulk stage
>   migration: do not search dirty pages in bulk stage
>   migration: use XBZRLE only after bulk stage
> 
>  arch_init.c                   |   74 +++++++++++++++++++----------------------
>  hmp.c                         |    2 ++
>  include/migration/migration.h |    2 ++
>  include/qemu-common.h         |   37 +++++++++++++++++++++
>  migration.c                   |    3 +-
>  qapi-schema.json              |    6 ++--
>  qmp-commands.hx               |    3 +-
>  util/bitops.c                 |   24 +++++++++++--
>  util/cutils.c                 |   50 ++++++++++++++++++++++++++++
>  9 files changed, 155 insertions(+), 46 deletions(-)
> 

I think patch 4 is a bit overengineered.  I would prefer the simple
patch you had using three/four non-vectorized accesses.  The setup cost
of the vectorized buffer_is_zero is quite high, and 64 bits are just
256k RAM; if the host doesn't touch 256k RAM, it will incur the overhead.

I would prefer some more benchmarking for patch 5, but it looks ok.

The rest are fine, thanks!

Paolo


* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-22 17:25 ` [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Paolo Bonzini
@ 2013-03-22 19:20   ` Peter Lieven
  2013-03-22 21:24     ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 19:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, Stefan Hajnoczi, qemu-devel, quintela

On 22.03.2013 18:25, Paolo Bonzini wrote:
> Il 22/03/2013 13:46, Peter Lieven ha scritto:
>> this is v4 of my patch series with various optimizations in
>> zero buffer checking and migration tweaks.
>>
>> thanks especially to Eric Blake for reviewing.
>>
>> v4:
>> - do not inline buffer_find_nonzero_offset()
>> - inline can_use_buffer_find_nonzero_offset() correctly
>> - re-add asserts in buffer_find_nonzero_offset() as profiling
>>   shows they do not hurt.
>> - replace the last occurrences of the scalar 8 with
>>   BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> - avoid dereferencing p already in patch 5 where we
>>   know that the page (p) is zero
>> - explicitly set bytes_sent = 0 if we skip a zero page.
>>   bytes_sent was 0 before, but it was not obvious.
>> - add accounting information for skipped zero pages
>> - fix errors reported by checkpatch.pl
>>
>> v3:
>> - remove asserts, inline functions and add a check
>>   function to determine whether buffer_find_nonzero_offset()
>>   can be used.
>> - use the above check function in buffer_is_zero() and
>>   find_next_bit().
>> - use buffer_find_nonzero_offset() directly to find
>>   zero pages. we know that all requirements are met
>>   for memory pages.
>> - fix C89 violation in buffer_is_zero().
>> - avoid dereferencing p in ram_save_block() if we already
>>   know the page is zero.
>> - fix initialization of last_offset in reset_ram_globals().
>> - avoid skipping pages with offset == 0 in bulk stage in
>>   migration_bitmap_find_and_reset_dirty().
>> - compared to v1, check for zero pages also after bulk
>>   ram migration as there are guests (e.g. Windows) which
>>   zero out large amounts of memory while running.
>>
>> v2:
>> - fix description, add trivial zero check and add asserts
>>   to buffer_find_nonzero_offset().
>> - add a constant for the unroll factor of buffer_find_nonzero_offset()
>> - replace is_dup_page() with buffer_is_zero()
>> - add test results to the xbzrle patch
>> - optimize descriptions
>>
>> Peter Lieven (9):
>>   move vector definitions to qemu-common.h
>>   cutils: add a function to find non-zero content in a buffer
>>   buffer_is_zero: use vector optimizations if possible
>>   bitops: use vector algorithm to optimize find_next_bit()
>>   migration: search for zero instead of dup pages
>>   migration: add an indicator for bulk state of ram migration
>>   migration: do not send zero pages in bulk stage
>>   migration: do not search dirty pages in bulk stage
>>   migration: use XBZRLE only after bulk stage
>>
>>  arch_init.c                   |   74 +++++++++++++++++++----------------------
>>  hmp.c                         |    2 ++
>>  include/migration/migration.h |    2 ++
>>  include/qemu-common.h         |   37 +++++++++++++++++++++
>>  migration.c                   |    3 +-
>>  qapi-schema.json              |    6 ++--
>>  qmp-commands.hx               |    3 +-
>>  util/bitops.c                 |   24 +++++++++++--
>>  util/cutils.c                 |   50 ++++++++++++++++++++++++++++
>>  9 files changed, 155 insertions(+), 46 deletions(-)
>>
> I think patch 4 is a bit overengineered.  I would prefer the simple
> patch you had using three/four non-vectorized accesses.  The setup cost
> of the vectorized buffer_is_zero is quite high, and 64 bits are just
> 256k RAM; if the host doesn't touch 256k RAM, it will incur the overhead.
I think you are right. I was a little too eager to utilize buffer_find_nonzero_offset()
as much as possible. The performance gain by unrolling was impressive enough.
The gain by the vector functions is not big enough to justify a possible
slowdown by the high setup costs. My testing revealed that in most cases buffer_find_nonzero_offset()
returns 0 or a big offset. All the 0 return values would have increased setup costs with
the vectorized version of patch 4.

>
> I would prefer some more benchmarking for patch 5, but it looks ok.
What would you like to see? Statistics on how many pages of a real system
are non-zero, but zero in the first sizeof(long) bytes?

>
> The rest are fine, thanks!
Thank you for reviewing. If we are done with these patches I will continue with
the block migration optimizations next week.

Peter


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
@ 2013-03-22 19:37   ` Eric Blake
  2013-03-22 20:03     ` Peter Lieven
  2013-03-25  8:53   ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Orit Wasserman
  1 sibling, 1 reply; 44+ messages in thread
From: Eric Blake @ 2013-03-22 19:37 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, quintela, Orit Wasserman, qemu-devel, Stefan Hajnoczi

On 03/22/2013 06:46 AM, Peter Lieven wrote:
> this adds buffer_find_nonzero_offset(), an SSE2/Altivec
> optimized function that searches for non-zero content in a
> buffer.
> 
> due to the optimizations used in the function there are restrictions
> on the buffer address and search length. the function
> can_use_buffer_find_nonzero_offset() can be used to check whether
> the function can be used safely.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu-common.h |   13 +++++++++++++
>  util/cutils.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+)

> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> +static inline bool
> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +                * sizeof(VECTYPE)) == 0
> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {

I know that emacs tends to indent the second line to the column after
the ( that it is associated with, as in:

+    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+               * sizeof(VECTYPE)) == 0
+        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {

But since checkpatch.pl didn't complain, and since I'm not sure if there
is a codified qemu indentation style, and since I _am_ sure that not
everyone uses emacs [hi, vi guys], it's not worth respinning.  A
maintainer can touch it up if desired.



> +
> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    VECTYPE *p = (VECTYPE *)buf;
> +    VECTYPE zero = ZERO_SPLAT;
> +    size_t i;
> +
> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +        * sizeof(VECTYPE)) == 0);
> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);

I would have written this:

assert(can_use_buffer_find_nonzero_offset(buf, len));

But that's cosmetic, and compiles to the same code, so it's not worth a
respin.

You've addressed my concerns on v3.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages Peter Lieven
@ 2013-03-22 19:49   ` Eric Blake
  2013-03-22 20:02     ` Peter Lieven
  2013-03-25  9:30   ` Orit Wasserman
  1 sibling, 1 reply; 44+ messages in thread
From: Eric Blake @ 2013-03-22 19:49 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, quintela, Orit Wasserman, qemu-devel, Stefan Hajnoczi

On 03/22/2013 06:46 AM, Peter Lieven wrote:
> virtually all dup pages are zero pages. remove
> the special is_dup_page() function and use the
> optimized buffer_find_nonzero_offset() function
> instead.
> 
> here buffer_find_nonzero_offset() is used directly
> to avoid the unnecessary additional checks in
> buffer_is_zero().
> 
> the raw performance gain checking zeroed memory
> over is_dup_page() is approx. 15-20% with SSE2.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |   21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

The code is sound, but I agree with Paolo's assessment that seeing a bit
more benchmarking, such as on non-SSE2 setups, wouldn't hurt.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages
  2013-03-22 19:49   ` Eric Blake
@ 2013-03-22 20:02     ` Peter Lieven
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 20:02 UTC (permalink / raw)
  To: Eric Blake
  Cc: Paolo Bonzini, quintela, Orit Wasserman, qemu-devel, Stefan Hajnoczi

On 22.03.2013 20:49, Eric Blake wrote:
> On 03/22/2013 06:46 AM, Peter Lieven wrote:
>> virtually all dup pages are zero pages. remove
>> the special is_dup_page() function and use the
>> optimized buffer_find_nonzero_offset() function
>> instead.
>>
>> here buffer_find_nonzero_offset() is used directly
>> to avoid the unnecessary additional checks in
>> buffer_is_zero().
>>
>> the raw performance gain checking zeroed memory
>> over is_dup_page() is approx. 15-20% with SSE2.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>  arch_init.c |   21 ++++++---------------
>>  1 file changed, 6 insertions(+), 15 deletions(-)
> Reviewed-by: Eric Blake <eblake@redhat.com>
>
> The code is sound, but I agree with Paolo's assessment that seeing a bit
> more benchmarking, such as on non-SSE2 seupts, wouldn't hurt.
>
The performance for checking zeroed memory is equal to the standard
unrolled version of buffer_is_zero(). So this is a big gain over the normal is_dup_page(),
which checks only one long per iteration. I can provide some numbers on Monday.

However, if you have a good idea for a test case, please let me know.
My first idea was to count how many pages out there are non-zero, but
zero in the first sizeof(long) bytes, so that reading 128 bytes (on SSE2)
would be a real disadvantage.

But regarding your and especially Paolo's concerns, please keep in mind: even
if the setup costs are high, if we abort within the first 128 bytes we will need all
of that data anyway, as we copy it either raw or through XBZRLE.
So does it hurt if it is in the cache? Or am I wrong here?

Peter


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-22 19:37   ` Eric Blake
@ 2013-03-22 20:03     ` Peter Lieven
  2013-03-22 20:22       ` [Qemu-devel] indentation hints [was: [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer] Eric Blake
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-22 20:03 UTC (permalink / raw)
  To: Eric Blake
  Cc: Paolo Bonzini, quintela, Orit Wasserman, qemu-devel, Stefan Hajnoczi

On 22.03.2013 20:37, Eric Blake wrote:

> On 03/22/2013 06:46 AM, Peter Lieven wrote:
>> this adds buffer_find_nonzero_offset(), an SSE2/Altivec
>> optimized function that searches for non-zero content in a
>> buffer.
>>
>> due to the optimizations used in the function there are restrictions
>> on the buffer address and search length. the function
>> can_use_buffer_find_nonzero_offset() can be used to check whether
>> the function can be used safely.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>  include/qemu-common.h |   13 +++++++++++++
>>  util/cutils.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 58 insertions(+)
>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>> +static inline bool
>> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>> +{
>> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +                * sizeof(VECTYPE)) == 0
>> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
> I know that emacs tends to indent the second line to the column after
> the ( that it is associated with, as in:
>
> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +               * sizeof(VECTYPE)) == 0
> +        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
>
> But since checkpatch.pl didn't complain, and since I'm not sure if there
> is a codified qemu indentation style, and since I _am_ sure that not
> everyone uses emacs [hi, vi guys], it's not worth respinning.  A
> maintainer can touch it up if desired.

Actually, I was totally unsure how to indent this. Maybe just give me
a hint about what you would like to see. As I will replace patch 4 with
an earlier version that is not vector optimized but uses loop unrolling,
I will have to do a v5 anyway, so I can fix this then.

>
>> +
>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>> +{
>> +    VECTYPE *p = (VECTYPE *)buf;
>> +    VECTYPE zero = ZERO_SPLAT;
>> +    size_t i;
>> +
>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +        * sizeof(VECTYPE)) == 0);
>> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> I would have written this:
>
> assert(can_use_buffer_find_nonzero_offset(buf, len));

Good point. Will be changed in v5.

> But that's cosmetic, and compiles to the same code, so it's not worth a
> respin.
>
> You've addressed my concerns on v3.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
>
Peter




* Re: [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage Peter Lieven
@ 2013-03-22 20:13   ` Eric Blake
  2013-03-25  9:44   ` Orit Wasserman
  1 sibling, 0 replies; 44+ messages in thread
From: Eric Blake @ 2013-03-22 20:13 UTC (permalink / raw)
  To: Peter Lieven
  Cc: quintela, Stefan Hajnoczi, qemu-devel, Luiz Capitulino,
	Orit Wasserman, Paolo Bonzini

On 03/22/2013 06:46 AM, Peter Lieven wrote:
> during the bulk stage of ram migration, if a page is a
> zero page, do not send it at all.
> the memory at the destination reads as zero anyway.
> 
> even if there is an madvise with QEMU_MADV_DONTNEED
> at the target upon receipt of a zero page, I have observed
> that the target starts swapping if the memory is overcommitted.
> it seems that the pages are dropped asynchronously.

Your commit message fails to mention that you are updating QMP to record
a new stat, although I agree with what you've done.  If you do respin,
make mention of this fact in the commit message.

> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c                   |   24 ++++++++++++++++++++----
>  hmp.c                         |    2 ++
>  include/migration/migration.h |    2 ++
>  migration.c                   |    3 ++-
>  qapi-schema.json              |    6 ++++--
>  qmp-commands.hx               |    3 ++-
>  6 files changed, 32 insertions(+), 8 deletions(-)
> 

> +++ b/qapi-schema.json
> @@ -496,7 +496,9 @@
>  #
>  # @total: total amount of bytes involved in the migration process
>  #
> -# @duplicate: number of duplicate pages (since 1.2)
> +# @duplicate: number of duplicate (zero) pages (since 1.2)
> +#
> +# @skipped: number of skipped zero pages (since 1.5)
>  #
>  # @normal : number of normal pages (since 1.2)
>  #
> @@ -510,7 +512,7 @@
>  { 'type': 'MigrationStats',
>    'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
>             'duplicate': 'int', 'normal': 'int', 'normal-bytes': 'int',
> -           'dirty-pages-rate' : 'int' } }
> +           'dirty-pages-rate' : 'int', 'skipped': 'int' } }

Your layout here doesn't match the order in which you documented things.
But it is a dictionary of name-value pairs, so order is not significant
to the interface.  About the only thing the order might affect is
whether the rest of your code, which assigns fields in documentation
order, is slightly less efficient because it is jumping around the C
struct rather than hitting it in linear order, but that would be in the
noise on a benchmark.  So I won't insist on a respin.  However, since
you are touching QMP, it wouldn't hurt to have Luiz chime in.

I'm okay if this goes in as-is.  Or, if you do spin a v5 for other
reasons, then lay out MigrationStats in documentation order, and improve
the commit message.  If those are the only changes you make, then you
can keep:

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* [Qemu-devel] indentation hints [was: [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer]
  2013-03-22 20:03     ` Peter Lieven
@ 2013-03-22 20:22       ` Eric Blake
  2013-03-23 11:18         ` Peter Maydell
  0 siblings, 1 reply; 44+ messages in thread
From: Eric Blake @ 2013-03-22 20:22 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Paolo Bonzini, quintela, Orit Wasserman, qemu-devel, Stefan Hajnoczi

On 03/22/2013 02:03 PM, Peter Lieven wrote:

>>> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>> +                * sizeof(VECTYPE)) == 0
>>> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
>> I know that emacs tends to indent the second line to the column after
>> the ( that it is associated with, as in:
>>
>> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +               * sizeof(VECTYPE)) == 0
>> +        && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
>>

> 
> Actually, I was totally unsure how to indent this. Maybe just give me
> a hint about what you would like to see.

I thought I did just that, with my rewrite of your line (best if you
view the mail in fixed-width font) :)

> As I will replace patch 4 with
> an earlier version that is not vector optimized but uses loop unrolling,
> I will have to do a v5 anyway, so I can fix this then.

Note that qemu.git already has a .exrc file that enables default vi
settings for vim users; I'm not sure if there is a counterpart
.dir-locals.el file to set up emacs settings, but someone probably has
one.  Ultimately, having instructions on how to set up your editor so
that 'TAB' just magically indents to the preferred style seems like a
tip worth having on the wiki page on contributing a patch, but I'm not
sure I'm the one to provide such a patch (since I focus most of my qemu
work on reviewing others' patches, and not writing my own, I don't
really have my own preferred editor set up to indent in a qemu style).
-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-22 19:20   ` Peter Lieven
@ 2013-03-22 21:24     ` Paolo Bonzini
  2013-03-23  7:34       ` Peter Lieven
  2013-03-25 10:17       ` Peter Lieven
  0 siblings, 2 replies; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-22 21:24 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi

On 22/03/2013 20:20, Peter Lieven wrote:
>> I think patch 4 is a bit overengineered.  I would prefer the simple
>> patch you had using three/four non-vectorized accesses.  The setup cost
>> of the vectorized buffer_is_zero is quite high, and 64 bits are just
>> 256k RAM; if the host doesn't touch 256k RAM, it will incur the overhead.
> I think you are right. I was a little too eager to utilize buffer_find_nonzero_offset()
> as much as possible. The performance gain by unrolling was impressive enough.
> The gain by the vector functions is not big enough to justify a possible
> slowdown by the high setup costs. My testing revealed that in most cases buffer_find_nonzero_offset()
> returns 0 or a big offset. All the 0 return values would have increased setup costs with
> the vectorized version of patch 4.
> 
>>
>> I would prefer some more benchmarking for patch 5, but it looks ok.
> What would you like to see? Statistics on how many pages of a real system
> are not zero, but zero in the first sizeof(long) bytes?

Yeah, more or less.  Running the system for a while, migrating, and
plotting a histogram of the return values of buffer_find_nonzero_offset
(hmm, perhaps using a nonvectorized version is better for this experiment).
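
Something like the following would do; a throwaway sketch, not part
of the series -- the bucket array, the account_page() hook and the
8-byte bucket granularity are all hypothetical:

#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

size_t buffer_find_nonzero_offset(const void *buf, size_t len); /* util/cutils.c */

#define PAGE_SIZE 4096 /* stands in for TARGET_PAGE_SIZE */

/* one bucket per possible 8-byte-granular return value, plus one
 * for "page is all zero" (off == PAGE_SIZE) */
static uint64_t offset_hist[PAGE_SIZE / 8 + 1];

/* call this for every page the migration code scans */
static void account_page(const void *page)
{
    size_t off = buffer_find_nonzero_offset(page, PAGE_SIZE);
    offset_hist[off / 8]++;
}

static void dump_hist(void)
{
    size_t i;

    for (i = 0; i <= PAGE_SIZE / 8; i++) {
        printf("%zu %" PRIu64 "\n", i * 8, offset_hist[i]);
    }
}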

Paolo


* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-22 21:24     ` Paolo Bonzini
@ 2013-03-23  7:34       ` Peter Lieven
  2013-03-25 10:17       ` Peter Lieven
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-23  7:34 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi


On 22.03.2013 at 22:24, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 22/03/2013 20:20, Peter Lieven wrote:
>>> I think patch 4 is a bit overengineered.  I would prefer the simple
>>> patch you had using three/four non-vectorized accesses.  The setup cost
>>> of the vectorized buffer_is_zero is quite high, and 64 bits are just
>>> 256k RAM; if the host doesn't touch 256k RAM, it will incur the overhead.
>> I think you are right. I was a little too eager to utilize buffer_find_nonzero_offset()
>> as much as possible. The performance gain by unrolling was impressive enough.
>> The gain by the vector functions is not big enough to justify a possible
>> slowdown from the high setup costs. My testing revealed that in most cases buffer_find_nonzero_offset()
>> returns 0 or a big offset. All the 0 return values would have increased setup costs with
>> the vectorized version of patch 4.
>> 
>>> 
>>> I would prefer some more benchmarking for patch 5, but it looks ok.
>> What would you like to see? Statistics on how many pages of a real system
>> are not zero, but zero in the first sizeof(long) bytes?
> 
> Yeah, more or less.  Running the system for a while, migrating, and
> plotting a histogram of the return values of buffer_find_nonzero_offset
> (hmm, perhaps using a nonvectorized version is better for this experiment).

I will follow up with this on Monday. Have you seen my concern that the whole
page is read anyway if it is non-zero?

Peter


> 
> Paolo


* Re: [Qemu-devel] indentation hints [was: [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer]
  2013-03-22 20:22       ` [Qemu-devel] indentation hints [was: [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer] Eric Blake
@ 2013-03-23 11:18         ` Peter Maydell
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Maydell @ 2013-03-23 11:18 UTC (permalink / raw)
  To: Eric Blake
  Cc: quintela, Stefan Hajnoczi, Peter Lieven, qemu-devel,
	Orit Wasserman, Paolo Bonzini

On 22 March 2013 20:22, Eric Blake <eblake@redhat.com> wrote:
> Note that qemu.git already has a .exrc file that enables default vi
> settings for vim users; I'm not sure if there is a counterpart
> .dir-locals.el file to set up emacs settings

My .emacs has the following config:
https://wiki.linaro.org/PeterMaydell/QemuEmacsStyle
(it doesn't get things perfect but it's pretty close.)


-- PMM


* Re: [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h Peter Lieven
@ 2013-03-25  8:35   ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  8:35 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> vector optimizations will now be used at various places,
> not just in is_dup_page() in arch_init.c.
> 
> this patch also adds a zero splat vector.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  arch_init.c           |   20 --------------------
>  include/qemu-common.h |   24 ++++++++++++++++++++++++
>  2 files changed, 24 insertions(+), 20 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 98e2bc6..1b71912 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -114,26 +114,6 @@ const uint32_t arch_type = QEMU_ARCH;
>  #define RAM_SAVE_FLAG_CONTINUE 0x20
>  #define RAM_SAVE_FLAG_XBZRLE   0x40
>  
> -#ifdef __ALTIVEC__
> -#include <altivec.h>
> -#define VECTYPE        vector unsigned char
> -#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
> -#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
> -/* altivec.h may redefine the bool macro as vector type.
> - * Reset it to POSIX semantics. */
> -#undef bool
> -#define bool _Bool
> -#elif defined __SSE2__
> -#include <emmintrin.h>
> -#define VECTYPE        __m128i
> -#define SPLAT(p)       _mm_set1_epi8(*(p))
> -#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
> -#else
> -#define VECTYPE        unsigned long
> -#define SPLAT(p)       (*(p) * (~0UL / 255))
> -#define ALL_EQ(v1, v2) ((v1) == (v2))
> -#endif
> -
>  
>  static struct defconfig_file {
>      const char *filename;
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index 7754ee2..e76ade3 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -448,4 +448,28 @@ int uleb128_decode_small(const uint8_t *in, uint32_t *n);
>  
>  void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  
> +/* vector definitions */
> +#ifdef __ALTIVEC__
> +#include <altivec.h>
> +#define VECTYPE        vector unsigned char
> +#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
> +#define ZERO_SPLAT     vec_splat(vec_ld(0, 0), 0)
This is a new macro; please move it to a separate patch.
Orit
> +#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
> +/* altivec.h may redefine the bool macro as vector type.
> + * Reset it to POSIX semantics. */
> +#undef bool
> +#define bool _Bool
> +#elif defined __SSE2__
> +#include <emmintrin.h>
> +#define VECTYPE        __m128i
> +#define SPLAT(p)       _mm_set1_epi8(*(p))
> +#define ZERO_SPLAT     _mm_setzero_si128()

> +#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
> +#else
> +#define VECTYPE        unsigned long
> +#define SPLAT(p)       (*(p) * (~0UL / 255))
> +#define ZERO_SPLAT     0x0UL
> +#define ALL_EQ(v1, v2) ((v1) == (v2))
> +#endif
> +
>  #endif
> 
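
As an aside, the scalar fallback's SPLAT() is cryptic at first
sight; here is a standalone demo of what it computes (not part of
the patch; run it on a 64-bit host):

#include <stdio.h>

int main(void)
{
    unsigned char c = 0xab;
    /* ~0UL / 255 is 0x0101...01, so multiplying by a byte value
     * replicates that byte into every byte of the long -- the
     * scalar analogue of vec_splat() / _mm_set1_epi8() */
    unsigned long splat = c * (~0UL / 255);

    printf("%#lx\n", splat); /* prints 0xabababababababab on LP64 */
    return 0;
}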


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
  2013-03-22 19:37   ` Eric Blake
@ 2013-03-25  8:53   ` Orit Wasserman
  2013-03-25  8:56     ` Peter Lieven
  1 sibling, 1 reply; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  8:53 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> this adds buffer_find_nonzero_offset() which is an SSE2/Altivec
> optimized function that searches for non-zero content in a
> buffer.
> 
> due to the optimizations used in the function there are restrictions
> on buffer address and search length. the function
> can_use_buffer_find_nonzero_offset() can be used to check if
> the function can be used safely.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu-common.h |   13 +++++++++++++
>  util/cutils.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 58 insertions(+)
> 
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index e76ade3..078e535 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>  #define ALL_EQ(v1, v2) ((v1) == (v2))
>  #endif
>  
> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> +static inline bool
> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +                * sizeof(VECTYPE)) == 0
> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
> +        return true;
> +    }
> +    return false;
> +}
> +size_t buffer_find_nonzero_offset(const void *buf, size_t len);
> +
>  #endif
> diff --git a/util/cutils.c b/util/cutils.c
> index 1439da4..41c627e 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -143,6 +143,51 @@ int qemu_fdatasync(int fd)
>  }
>  
>  /*
> + * Searches for an area with non-zero content in a buffer
> + *
> + * Attention! The len must be a multiple of
> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
> + * and addr must be a multiple of sizeof(VECTYPE) due to
> + * restriction of optimizations in this function.
> + *
> + * can_use_buffer_find_nonzero_offset() can be used to check
> + * these requirements.
> + *
> + * The return value is the offset of the non-zero area rounded
> + * down to BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE).
> + * If the buffer is all zero the return value is equal to len.
> + */
> +
> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> +{
> +    VECTYPE *p = (VECTYPE *)buf;
> +    VECTYPE zero = ZERO_SPLAT;
> +    size_t i;
> +
> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +        * sizeof(VECTYPE)) == 0);
> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> +
> +    if (*((const long *) buf)) {
> +        return 0;
> +    }
> +
> +    for (i = 0; i < len / sizeof(VECTYPE);
Why not put len/sizeof(VECTYPE) in a variable?
Orit
> +            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
> +        VECTYPE tmp0 = p[i + 0] | p[i + 1];
> +        VECTYPE tmp1 = p[i + 2] | p[i + 3];
> +        VECTYPE tmp2 = p[i + 4] | p[i + 5];
> +        VECTYPE tmp3 = p[i + 6] | p[i + 7];
> +        VECTYPE tmp01 = tmp0 | tmp1;
> +        VECTYPE tmp23 = tmp2 | tmp3;
> +        if (!ALL_EQ(tmp01 | tmp23, zero)) {
> +            break;
> +        }
> +    }
> +    return i * sizeof(VECTYPE);
> +}
> +
> +/*
>   * Checks if a buffer is all zeroes
>   *
>   * Attention! The len must be a multiple of 4 * sizeof(long) due to
> 
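
A minimal caller respecting the alignment/length contract could look
like this -- just a sketch; qemu_memalign(), qemu_vfree() and
TARGET_PAGE_SIZE are assumed from elsewhere in the tree:

uint8_t *page = qemu_memalign(sizeof(VECTYPE), TARGET_PAGE_SIZE);

memset(page, 0, TARGET_PAGE_SIZE);
/* alignment and length check out, so the fast path is usable */
if (can_use_buffer_find_nonzero_offset(page, TARGET_PAGE_SIZE)) {
    size_t off = buffer_find_nonzero_offset(page, TARGET_PAGE_SIZE);
    /* off == TARGET_PAGE_SIZE here, since the page is all zero */
}
qemu_vfree(page);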


* Re: [Qemu-devel] [PATCHv4 3/9] buffer_is_zero: use vector optimizations if possible
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
@ 2013-03-25  8:53   ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  8:53 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> performance gain on SSE2 is approx. 20-25%. altivec
> is not tested. performance for unsigned long arithmetic
> is unchanged.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  util/cutils.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/util/cutils.c b/util/cutils.c
> index 41c627e..0f43c22 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -205,6 +205,11 @@ bool buffer_is_zero(const void *buf, size_t len)
>      long d0, d1, d2, d3;
>      const long * const data = buf;
>  
> +    /* use vector optimized zero check if possible */
> +    if (can_use_buffer_find_nonzero_offset(buf, len)) {
> +        return buffer_find_nonzero_offset(buf, len) == len;
> +    }
> +
>      assert(len % (4 * sizeof(long)) == 0);
>      len /= sizeof(long);
>  
> 
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-25  8:53   ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Orit Wasserman
@ 2013-03-25  8:56     ` Peter Lieven
  2013-03-25  9:26       ` Orit Wasserman
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-25  8:56 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela


On 25.03.2013 at 09:53, Orit Wasserman <owasserm@redhat.com> wrote:

> On 03/22/2013 02:46 PM, Peter Lieven wrote:
>> this adds buffer_find_nonzero_offset() which is an SSE2/Altivec
>> optimized function that searches for non-zero content in a
>> buffer.
>> 
>> due to the optimizations used in the function there are restrictions
>> on buffer address and search length. the function
>> can_use_buffer_find_nonzero_offset() can be used to check if
>> the function can be used safely.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> include/qemu-common.h |   13 +++++++++++++
>> util/cutils.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 58 insertions(+)
>> 
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index e76ade3..078e535 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>> #endif
>> 
>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>> +static inline bool
>> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>> +{
>> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +                * sizeof(VECTYPE)) == 0
>> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len);
>> +
>> #endif
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 1439da4..41c627e 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -143,6 +143,51 @@ int qemu_fdatasync(int fd)
>> }
>> 
>> /*
>> + * Searches for an area with non-zero content in a buffer
>> + *
>> + * Attention! The len must be a multiple of
>> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
>> + * and addr must be a multiple of sizeof(VECTYPE) due to
>> + * restriction of optimizations in this function.
>> + *
>> + * can_use_buffer_find_nonzero_offset() can be used to check
>> + * these requirements.
>> + *
>> + * The return value is the offset of the non-zero area rounded
>> + * down to BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE).
>> + * If the buffer is all zero the return value is equal to len.
>> + */
>> +
>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>> +{
>> +    VECTYPE *p = (VECTYPE *)buf;
>> +    VECTYPE zero = ZERO_SPLAT;
>> +    size_t i;
>> +
>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +        * sizeof(VECTYPE)) == 0);
>> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>> +
>> +    if (*((const long *) buf)) {
>> +        return 0;
>> +    }
>> +
>> +    for (i = 0; i < len / sizeof(VECTYPE);
> Why not put len/sizeof(VECTYPE) in a variable?

are you afraid that there is a division at each iteration?

sizeof(VECTYPE) is a power of 2 so I think the compiler will optimize it
to a >> at compile time.

I would also be ok with writing len /= sizeof(VECTYPE) before the loop.

Peter

> Orit
>> +            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>> +        VECTYPE tmp0 = p[i + 0] | p[i + 1];
>> +        VECTYPE tmp1 = p[i + 2] | p[i + 3];
>> +        VECTYPE tmp2 = p[i + 4] | p[i + 5];
>> +        VECTYPE tmp3 = p[i + 6] | p[i + 7];
>> +        VECTYPE tmp01 = tmp0 | tmp1;
>> +        VECTYPE tmp23 = tmp2 | tmp3;
>> +        if (!ALL_EQ(tmp01 | tmp23, zero)) {
>> +            break;
>> +        }
>> +    }
>> +    return i * sizeof(VECTYPE);
>> +}
>> +
>> +/*
>>  * Checks if a buffer is all zeroes
>>  *
>>  * Attention! The len must be a multiple of 4 * sizeof(long) due to
>> 
> 


* Re: [Qemu-devel] [PATCHv4 4/9] bitops: use vector algorithm to optimize find_next_bit()
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
@ 2013-03-25  9:04   ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  9:04 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> this patch adds the usage of buffer_find_nonzero_offset()
> to skip large areas of zeroes.
> 
> compared to loop unrolling presented in an earlier
> patch this adds another 50% performance benefit for
> skipping large areas of zeroes. loop unrolling alone
> added close to 100% speedup.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  util/bitops.c |   24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..9bb61ff 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -42,10 +42,28 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>          size -= BITS_PER_LONG;
>          result += BITS_PER_LONG;
>      }
> -    while (size & ~(BITS_PER_LONG-1)) {
> -        if ((tmp = *(p++))) {
> -            goto found_middle;
> +    while (size >= BITS_PER_LONG) {
> +        tmp = *p;
> +        if (tmp) {
> +             goto found_middle;
> +        }
> +        if (can_use_buffer_find_nonzero_offset(p, size / BITS_PER_BYTE)) {
> +            size_t tmp2 =
> +                buffer_find_nonzero_offset(p, size / BITS_PER_BYTE);
> +            result += tmp2 * BITS_PER_BYTE;
> +            size -= tmp2 * BITS_PER_BYTE;
> +            p += tmp2 / sizeof(unsigned long);
> +            if (!size) {
> +                return result;
> +            }
> +            if (tmp2) {
> +                tmp = *p;
> +                if (tmp) {
> +                    goto found_middle;
> +                }
> +            }
>          }
> +        p++;
>          result += BITS_PER_LONG;
>          size -= BITS_PER_LONG;
>      }
> 
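
For context, the caller pattern this speeds up is a scan over a
mostly-zero bitmap; roughly like the following (a hypothetical
walker, not from the tree -- find_next_bit() matches the declaration
in include/qemu/bitops.h):

void for_each_set_bit_slow(const unsigned long *bitmap,
                           unsigned long nbits,
                           void (*visit)(unsigned long bit))
{
    unsigned long next = find_next_bit(bitmap, nbits, 0);

    /* with long zero runs, nearly all time is spent inside
     * find_next_bit() skipping zero words -- exactly what the
     * buffer_find_nonzero_offset() path accelerates */
    while (next < nbits) {
        visit(next);
        next = find_next_bit(bitmap, nbits, next + 1);
    }
}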
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-25  8:56     ` Peter Lieven
@ 2013-03-25  9:26       ` Orit Wasserman
  2013-03-25  9:42         ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  9:26 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/25/2013 10:56 AM, Peter Lieven wrote:
> 
> On 25.03.2013 at 09:53, Orit Wasserman <owasserm@redhat.com> wrote:
> 
>> On 03/22/2013 02:46 PM, Peter Lieven wrote:
>>> this adds buffer_find_nonzero_offset() which is an SSE2/Altivec
>>> optimized function that searches for non-zero content in a
>>> buffer.
>>>
>>> due to the optimizations used in the function there are restrictions
>>> on buffer address and search length. the function
>>> can_use_buffer_find_nonzero_offset() can be used to check if
>>> the function can be used safely.
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>> include/qemu-common.h |   13 +++++++++++++
>>> util/cutils.c         |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 58 insertions(+)
>>>
>>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>>> index e76ade3..078e535 100644
>>> --- a/include/qemu-common.h
>>> +++ b/include/qemu-common.h
>>> @@ -472,4 +472,17 @@ void hexdump(const char *buf, FILE *fp, const char *prefix, size_t size);
>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>> #endif
>>>
>>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>> +static inline bool
>>> +can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>> +{
>>> +    if (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>> +                * sizeof(VECTYPE)) == 0
>>> +            && ((uintptr_t) buf) % sizeof(VECTYPE) == 0) {
>>> +        return true;
>>> +    }
>>> +    return false;
>>> +}
>>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len);
>>> +
>>> #endif
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 1439da4..41c627e 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -143,6 +143,51 @@ int qemu_fdatasync(int fd)
>>> }
>>>
>>> /*
>>> + * Searches for an area with non-zero content in a buffer
>>> + *
>>> + * Attention! The len must be a multiple of
>>> + * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE)
>>> + * and addr must be a multiple of sizeof(VECTYPE) due to
>>> + * restriction of optimizations in this function.
>>> + *
>>> + * can_use_buffer_find_nonzero_offset() can be used to check
>>> + * these requirements.
>>> + *
>>> + * The return value is the offset of the non-zero area rounded
>>> + * down to BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE).
>>> + * If the buffer is all zero the return value is equal to len.
>>> + */
>>> +
>>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>> +{
>>> +    VECTYPE *p = (VECTYPE *)buf;
>>> +    VECTYPE zero = ZERO_SPLAT;
>>> +    size_t i;
>>> +
>>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>> +        * sizeof(VECTYPE)) == 0);
>>> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>> +
>>> +    if (*((const long *) buf)) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    for (i = 0; i < len / sizeof(VECTYPE);
>> Why not put len/sizeof(VECTYPE) in a variable?
> 
> are you afraid that there is a division at each iteration?
> 
> sizeof(VECTYPE) is a power of 2 so I think the compiler will optimize it
> to a >> at compile time.
true, but it still is done every iteration.
> 
> I would also be ok with writing len /= sizeof(VECTYPE) before the loop.
I would prefer it :)

Orit
> 
> Peter
> 
>> Orit
>>> +            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>> +        VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>> +        VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>> +        VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>> +        VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>> +        VECTYPE tmp01 = tmp0 | tmp1;
>>> +        VECTYPE tmp23 = tmp2 | tmp3;
>>> +        if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>> +            break;
>>> +        }
>>> +    }
>>> +    return i * sizeof(VECTYPE);
>>> +}
>>> +
>>> +/*
>>>  * Checks if a buffer is all zeroes
>>>  *
>>>  * Attention! The len must be a multiple of 4 * sizeof(long) due to
>>>
>>
> 


* Re: [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages Peter Lieven
  2013-03-22 19:49   ` Eric Blake
@ 2013-03-25  9:30   ` Orit Wasserman
  1 sibling, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  9:30 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> virtually all dup pages are zero pages. remove
> the special is_dup_page() function and use the
> optimized buffer_find_nonzero_offset() function
> instead.
> 
> here buffer_find_nonzero_offset() is used directly
> to avoid the unnecessary additional checks in
> buffer_is_zero().
> 
> raw performance gain checking zeroed memory
> over is_dup_page() is approx. 15-20% with SSE2.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |   21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 1b71912..9ebca83 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -144,19 +144,10 @@ int qemu_read_default_config_files(bool userconfig)
>      return 0;
>  }
>  
> -static int is_dup_page(uint8_t *page)
> +static inline bool is_zero_page(uint8_t *p)
>  {
> -    VECTYPE *p = (VECTYPE *)page;
> -    VECTYPE val = SPLAT(page);
> -    int i;
> -
> -    for (i = 0; i < TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
> -        if (!ALL_EQ(val, p[i])) {
> -            return 0;
> -        }
> -    }
> -
> -    return 1;
> +    return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) ==
> +        TARGET_PAGE_SIZE;
>  }
>  
>  /* struct contains XBZRLE cache and a static page
> @@ -443,12 +434,12 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>  
>              /* In doubt sent page as normal */
>              bytes_sent = -1;
> -            if (is_dup_page(p)) {
> +            if (is_zero_page(p)) {
>                  acct_info.dup_pages++;
>                  bytes_sent = save_block_hdr(f, block, offset, cont,
>                                              RAM_SAVE_FLAG_COMPRESS);
> -                qemu_put_byte(f, *p);
> -                bytes_sent += 1;
> +                qemu_put_byte(f, 0);
> +                bytes_sent++;
>              } else if (migrate_use_xbzrle()) {
>                  current_addr = block->offset + offset;
>                  bytes_sent = save_xbzrle_page(f, p, current_addr, block,
> 
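
The predicate is easy to sanity-check outside the tree; a
self-contained version using only the scalar fallback (4K pages
assumed, no vectorization):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096 /* stands in for TARGET_PAGE_SIZE */

/* scalar stand-in for buffer_find_nonzero_offset(): offset of the
 * first nonzero long, or PAGE_SIZE if the page is all zero */
static size_t find_nonzero(const void *buf, size_t len)
{
    const unsigned long *p = buf;
    size_t i;

    for (i = 0; i < len / sizeof(*p); i++) {
        if (p[i]) {
            break;
        }
    }
    return i * sizeof(*p);
}

static bool is_zero_page(const void *p)
{
    return find_nonzero(p, PAGE_SIZE) == PAGE_SIZE;
}

int main(void)
{
    uint8_t *page = calloc(1, PAGE_SIZE);

    assert(is_zero_page(page));
    page[PAGE_SIZE - 1] = 1;
    assert(!is_zero_page(page));
    free(page);
    return 0;
}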
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 6/9] migration: add an indicator for bulk state of ram migration
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
@ 2013-03-25  9:32   ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  9:32 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> the first round of ram transfer is special since all pages
> are marked dirty and thus all memory is transferred to
> the target. this patch adds a boolean variable to track
> this stage.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  arch_init.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 9ebca83..4c4caf4 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -317,6 +317,7 @@ static ram_addr_t last_offset;
>  static unsigned long *migration_bitmap;
>  static uint64_t migration_dirty_pages;
>  static uint32_t last_version;
> +static bool ram_bulk_stage;
>  
>  static inline
>  ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
> @@ -424,6 +425,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>              if (!block) {
>                  block = QTAILQ_FIRST(&ram_list.blocks);
>                  complete_round = true;
> +                ram_bulk_stage = false;
>              }
>          } else {
>              uint8_t *p;
> @@ -527,6 +529,7 @@ static void reset_ram_globals(void)
>      last_sent_block = NULL;
>      last_offset = 0;
>      last_version = ram_list.version;
> +    ram_bulk_stage = true;
>  }
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> 
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-25  9:26       ` Orit Wasserman
@ 2013-03-25  9:42         ` Paolo Bonzini
  2013-03-25 10:03           ` Orit Wasserman
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-25  9:42 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: Stefan Hajnoczi, Peter Lieven, qemu-devel, quintela


> >>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> >>> +{
> >>> +    VECTYPE *p = (VECTYPE *)buf;
> >>> +    VECTYPE zero = ZERO_SPLAT;
> >>> +    size_t i;
> >>> +
> >>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> >>> +        * sizeof(VECTYPE)) == 0);
> >>> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> >>> +
> >>> +    if (*((const long *) buf)) {
> >>> +        return 0;
> >>> +    }
> >>> +
> >>> +    for (i = 0; i < len / sizeof(VECTYPE);
> >> Why not put len/sizeof(VECTYPE) in a variable?
> > 
> > are you afraid that there is a division at each iteration?
> > 
> > sizeof(VECTYPE) is a power of 2 so I think the compiler will
> > optimize it
> > to a >> at compile time.
> true, but it still is done every iteration.

len is an invariant, the compiler will move it out of the loop
automatically.  Write readable code unless you have good clues
that it is also slow.
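
To make the point concrete -- an untested sketch, but any optimizing
compiler hoists the bound:

#include <stddef.h>

/* 'len / sizeof(*p)' is loop-invariant: at -O1 and above it is
 * computed once, as a shift, before the loop is entered, so
 * hoisting it by hand does not change the generated code */
size_t count_nonzero_words(const unsigned long *p, size_t len)
{
    size_t hits = 0;
    size_t i;

    for (i = 0; i < len / sizeof(*p); i++) {
        hits += (p[i] != 0);
    }
    return hits;
}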

Paolo


* Re: [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 7/9] migration: do not send zero pages in bulk stage Peter Lieven
  2013-03-22 20:13   ` Eric Blake
@ 2013-03-25  9:44   ` Orit Wasserman
  1 sibling, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25  9:44 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> during the bulk stage of ram migration, if a page is a
> zero page, do not send it at all.
> the memory at the destination reads as zero anyway.
> 
> even if there is an madvise with QEMU_MADV_DONTNEED
> at the target upon receipt of a zero page, I have observed
> that the target starts swapping if the memory is overcommitted.
> it seems that the pages are dropped asynchronously.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c                   |   24 ++++++++++++++++++++----
>  hmp.c                         |    2 ++
>  include/migration/migration.h |    2 ++
>  migration.c                   |    3 ++-
>  qapi-schema.json              |    6 ++++--
>  qmp-commands.hx               |    3 ++-
>  6 files changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 4c4caf4..c34a4af 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -181,6 +181,7 @@ int64_t xbzrle_cache_resize(int64_t new_size)
>  /* accounting for migration statistics */
>  typedef struct AccountingInfo {
>      uint64_t dup_pages;
> +    uint64_t skipped_pages;
>      uint64_t norm_pages;
>      uint64_t iterations;
>      uint64_t xbzrle_bytes;
> @@ -206,6 +207,16 @@ uint64_t dup_mig_pages_transferred(void)
>      return acct_info.dup_pages;
>  }
>  
> +uint64_t skipped_mig_bytes_transferred(void)
> +{
> +    return acct_info.skipped_pages * TARGET_PAGE_SIZE;
> +}
> +
> +uint64_t skipped_mig_pages_transferred(void)
> +{
> +    return acct_info.skipped_pages;
> +}
> +
>  uint64_t norm_mig_bytes_transferred(void)
>  {
>      return acct_info.norm_pages * TARGET_PAGE_SIZE;
> @@ -438,10 +449,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>              bytes_sent = -1;
>              if (is_zero_page(p)) {
>                  acct_info.dup_pages++;
> -                bytes_sent = save_block_hdr(f, block, offset, cont,
> -                                            RAM_SAVE_FLAG_COMPRESS);
> -                qemu_put_byte(f, 0);
> -                bytes_sent++;
> +                if (!ram_bulk_stage) {
> +                    bytes_sent = save_block_hdr(f, block, offset, cont,
> +                                                RAM_SAVE_FLAG_COMPRESS);
> +                    qemu_put_byte(f, 0);
> +                    bytes_sent++;
> +                } else {
> +                    acct_info.skipped_pages++;
> +                    bytes_sent = 0;
> +                }
>              } else if (migrate_use_xbzrle()) {
>                  current_addr = block->offset + offset;
>                  bytes_sent = save_xbzrle_page(f, p, current_addr, block,
> diff --git a/hmp.c b/hmp.c
> index b0a861c..e3e833e 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -173,6 +173,8 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>                         info->ram->total >> 10);
>          monitor_printf(mon, "duplicate: %" PRIu64 " pages\n",
>                         info->ram->duplicate);
> +        monitor_printf(mon, "skipped: %" PRIu64 " pages\n",
> +                       info->ram->skipped);
>          monitor_printf(mon, "normal: %" PRIu64 " pages\n",
>                         info->ram->normal);
>          monitor_printf(mon, "normal bytes: %" PRIu64 " kbytes\n",
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index bb617fd..e2acec6 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -96,6 +96,8 @@ extern SaveVMHandlers savevm_ram_handlers;
>  
>  uint64_t dup_mig_bytes_transferred(void);
>  uint64_t dup_mig_pages_transferred(void);
> +uint64_t skipped_mig_bytes_transferred(void);
> +uint64_t skipped_mig_pages_transferred(void);
>  uint64_t norm_mig_bytes_transferred(void);
>  uint64_t norm_mig_pages_transferred(void);
>  uint64_t xbzrle_mig_bytes_transferred(void);
> diff --git a/migration.c b/migration.c
> index 185d112..7fb2147 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -197,11 +197,11 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          info->ram->remaining = ram_bytes_remaining();
>          info->ram->total = ram_bytes_total();
>          info->ram->duplicate = dup_mig_pages_transferred();
> +        info->ram->skipped = skipped_mig_pages_transferred();
>          info->ram->normal = norm_mig_pages_transferred();
>          info->ram->normal_bytes = norm_mig_bytes_transferred();
>          info->ram->dirty_pages_rate = s->dirty_pages_rate;
>  
> -
>          if (blk_mig_active()) {
>              info->has_disk = true;
>              info->disk = g_malloc0(sizeof(*info->disk));
> @@ -227,6 +227,7 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          info->ram->remaining = 0;
>          info->ram->total = ram_bytes_total();
>          info->ram->duplicate = dup_mig_pages_transferred();
> +        info->ram->skipped = skipped_mig_pages_transferred();
>          info->ram->normal = norm_mig_pages_transferred();
>          info->ram->normal_bytes = norm_mig_bytes_transferred();
>          break;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index fdaa9da..b737460 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -496,7 +496,9 @@
>  #
>  # @total: total amount of bytes involved in the migration process
>  #
> -# @duplicate: number of duplicate pages (since 1.2)
> +# @duplicate: number of duplicate (zero) pages (since 1.2)
> +#
> +# @skipped: number of skipped zero pages (since 1.5)
>  #
>  # @normal : number of normal pages (since 1.2)
>  #
> @@ -510,7 +512,7 @@
>  { 'type': 'MigrationStats',
>    'data': {'transferred': 'int', 'remaining': 'int', 'total': 'int' ,
>             'duplicate': 'int', 'normal': 'int', 'normal-bytes': 'int',
> -           'dirty-pages-rate' : 'int' } }
> +           'dirty-pages-rate' : 'int', 'skipped': 'int' } }
>  
>  ##
>  # @XBZRLECacheStats
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index b370060..fed74c6 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -2442,7 +2442,8 @@ The main json-object contains the following:
>           - "transferred": amount transferred (json-int)
>           - "remaining": amount remaining (json-int)
>           - "total": total (json-int)
> -         - "duplicate": number of duplicated pages (json-int)
> +         - "duplicate": number of duplicated (zero) pages (json-int)
> +         - "skipped": number of skipped zero pages (json-int)
>           - "normal" : number of normal pages transferred (json-int)
>           - "normal-bytes" : number of normal bytes transferred (json-int)
>  - "disk": only present if "status" is "active" and it is a block migration,
> 
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-25  9:42         ` Paolo Bonzini
@ 2013-03-25 10:03           ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25 10:03 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Stefan Hajnoczi, Peter Lieven, qemu-devel, quintela

On 03/25/2013 11:42 AM, Paolo Bonzini wrote:
> 
>>>>> +size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> +{
>>>>> +    VECTYPE *p = (VECTYPE *)buf;
>>>>> +    VECTYPE zero = ZERO_SPLAT;
>>>>> +    size_t i;
>>>>> +
>>>>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>> +        * sizeof(VECTYPE)) == 0);
>>>>> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>>> +
>>>>> +    if (*((const long *) buf)) {
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    for (i = 0; i < len / sizeof(VECTYPE);
>>>> Why not put len/sizeof(VECTYPE) in a variable?
>>>
>>> are you afraid that there is a division at each iteration?
>>>
>>> sizeof(VECTYPE) is a power of 2 so I think the compiler will
>>> optimize it
>>> to a >> at compile time.
>> true, but it still is done every iteration.
> 
> len is an invariant, the compiler will move it out of the loop
> automatically.  Write readable code unless you have good clues
> that it is also slow.
> 
I know it does for x86 but I wasn't sure about other platforms.
I'm fine with it as is.

Orit
> Paolo
> 


* Re: [Qemu-devel] [PATCHv4 8/9] migration: do not search dirty pages in bulk stage
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 8/9] migration: do not search dirty " Peter Lieven
@ 2013-03-25 10:05   ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25 10:05 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> avoid searching for dirty pages; just increment the
> page offset. all pages are dirty anyway in the bulk stage.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  arch_init.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index c34a4af..b2b932a 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -338,7 +338,13 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
>      unsigned long nr = base + (start >> TARGET_PAGE_BITS);
>      unsigned long size = base + (int128_get64(mr->size) >> TARGET_PAGE_BITS);
>  
> -    unsigned long next = find_next_bit(migration_bitmap, size, nr);
> +    unsigned long next;
> +
> +    if (ram_bulk_stage && nr > base) {
> +        next = nr + 1;
> +    } else {
> +        next = find_next_bit(migration_bitmap, size, nr);
> +    }
>  
>      if (next < size) {
>          clear_bit(next, migration_bitmap);
> 
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 9/9] migration: use XBZRLE only after bulk stage
  2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 9/9] migration: use XBZRLE only after " Peter Lieven
@ 2013-03-25 10:16   ` Orit Wasserman
  0 siblings, 0 replies; 44+ messages in thread
From: Orit Wasserman @ 2013-03-25 10:16 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, quintela

On 03/22/2013 02:46 PM, Peter Lieven wrote:
> at the beginning of migration all pages are marked dirty and
> in the first round a bulk migration of all pages is performed.
> 
> currently all these pages are copied to the page cache regardless
> of whether they are frequently updated or not. this doesn't make sense
> since most of these pages are never transferred again.
> 
> this patch changes the XBZRLE transfer to only be used after
> the bulk stage has been completed. that means a page is added
> to the page cache the second time it is transferred and XBZRLE
> can benefit from the third transfer onwards.
> 
> since the page cache is likely smaller than the number of pages,
> it's also likely that in the second round the page is missing from
> the cache due to collisions in the bulk phase.
> 
> on the other hand a lot of unnecessary mallocs, memdups and frees
> are saved.
> 
> the following results have been taken earlier while executing
> the test program from docs/xbzrle.txt. (+) with the patch and (-)
> without. (thanks to Eric Blake for reformatting and comments)
> 
> + total time: 22185 milliseconds
> - total time: 22410 milliseconds
> 
> Shaved 0.2 seconds, better than 1%!
> 
> + downtime: 29 milliseconds
> - downtime: 21 milliseconds
> 
> Not sure why downtime seemed worse, but probably not the end of the world.
> 
> + transferred ram: 706034 kbytes
> - transferred ram: 721318 kbytes
> 
> Fewer bytes sent - good.
> 
> + remaining ram: 0 kbytes
> - remaining ram: 0 kbytes
> + total ram: 1057216 kbytes
> - total ram: 1057216 kbytes
> + duplicate: 108556 pages
> - duplicate: 105553 pages
> + normal: 175146 pages
> - normal: 179589 pages
> + normal bytes: 700584 kbytes
> - normal bytes: 718356 kbytes
> 
> Fewer normal bytes...
> 
> + cache size: 67108864 bytes
> - cache size: 67108864 bytes
> + xbzrle transferred: 3127 kbytes
> - xbzrle transferred: 630 kbytes
> 
> ...and more compressed pages sent - good.
> 
> + xbzrle pages: 117811 pages
> - xbzrle pages: 21527 pages
> + xbzrle cache miss: 18750
> - xbzrle cache miss: 179589
> 
> And very good improvement on the cache miss rate.
> 
> + xbzrle overflow : 0
> - xbzrle overflow : 0
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  arch_init.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index b2b932a..86f7e28 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -464,7 +464,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>                      acct_info.skipped_pages++;
>                      bytes_sent = 0;
>                  }
> -            } else if (migrate_use_xbzrle()) {
> +            } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
>                  current_addr = block->offset + offset;
>                  bytes_sent = save_xbzrle_page(f, p, current_addr, block,
>                                                offset, cont, last_stage);
> 
Reviewed-by: Orit Wasserman <owasserm@redhat.com>


* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-22 21:24     ` Paolo Bonzini
  2013-03-23  7:34       ` Peter Lieven
@ 2013-03-25 10:17       ` Peter Lieven
  2013-03-25 10:53         ` Paolo Bonzini
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-25 10:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi

On 22.03.2013 22:24, Paolo Bonzini wrote:
> On 22/03/2013 20:20, Peter Lieven wrote:
>>> I think patch 4 is a bit overengineered.  I would prefer the simple
>>> patch you had using three/four non-vectorized accesses.  The setup cost
>>> of the vectorized buffer_is_zero is quite high, and 64 bits are just
>>> 256k RAM; if the host doesn't touch 256k RAM, it will incur the overhead.
>> I think you are right. I was a little too eager to utilize buffer_find_nonzero_offset()
>> as much as possible. The performance gain by unrolling was impressive enough.
>> The gain by the vector functions is not big enough to justify a possible
>> slowdown from the high setup costs. My testing revealed that in most cases buffer_find_nonzero_offset()
>> returns 0 or a big offset. All the 0 return values would have increased setup costs with
>> the vectorized version of patch 4.
>>
>>> I would prefer some more benchmarking for patch 5, but it looks ok.
>> What would you like to see? Statistics on how many pages of a real system
>> are not zero, but zero in the first sizeof(long) bytes?
> Yeah, more or less.  Running the system for a while, migrating, and
> plotting a histogram of the return values of buffer_find_nonzero_offset
> (hmm, perhaps using a nonvectorized version is better for this experiment).

It seems that Paolo's concern regarding only checking the first 64 bits was right. What I would propose is
to check the first BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE) bytes in
sizeof(VECTYPE) chunks and use the unrolled version afterwards.

basically this would result in something like this:

size_t buffer_find_nonzero_offset(const void *buf, size_t len)
{
     VECTYPE *p = (VECTYPE *)buf;
     VECTYPE zero = ZERO_SPLAT;
     size_t i;

     assert(can_use_buffer_find_nonzero_offset(buf, len));

     if (!len) {
         return 0;
     }

     for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
         if (!ALL_EQ(p[i], zero)) {
             return 0;
         }
     }

     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
             i < len / sizeof(VECTYPE);
             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
         VECTYPE tmp0 = p[i + 0] | p[i + 1];
         VECTYPE tmp1 = p[i + 2] | p[i + 3];
         VECTYPE tmp2 = p[i + 4] | p[i + 5];
         VECTYPE tmp3 = p[i + 6] | p[i + 7];
         VECTYPE tmp01 = tmp0 | tmp1;
         VECTYPE tmp23 = tmp2 | tmp3;
         if (!ALL_EQ(tmp01 | tmp23, zero)) {
             break;
         }
     }

     return i * sizeof(VECTYPE);
}

this version is approx. 1-2% slower than the first one, but still 15% faster than the old is_dup_page() for zero pages.
BUT, if the first 8 bytes are zero and the rest is non-zero, the first version is approx. 200% slower due to
the high setup costs.

Paolo, with this one maybe you would also be fine with the vectorized version of patch 4?

Peter

---

here are the results of the tests with the return values of buffer_find_nonzero_offset (64-bit chunks):

ubuntu 12.04 LTS 64-bit desktop with 1G memory shortly after boot:

return values: 83905 3281 1169 448 412 212 284 180 146 93 54 77 64 44 48 50 68 46 28 62 40 81 34 69 52 47 31 21 35 29 39 24 83 43 22 17 10 37 30 10 17 23 12 12 12 17 9 23 12 20 2 9 22 16 16 64 15 39 8 9 7 12 8 10 10 13 8 12 58 10 7 8 18 18 10 12 11 6 9 16 
9 60 5 6 7 7 5 12 98 32 7 9 4 11 7 6 11 4 11 45 7 19 4 6 6 13 5 8 5 14 7 5 11 6 3 8 12 8 3 12 10 23 11 5 9 3 10 13 46 6 2 14 7 7 4 11 9 4 1 9 5 10 4 6 14 62 5 10 106 6 7 7 6 26 3 34 80 8 12 12 8 5 2 6 7 14 5 8 8 8 7 3 6 16 4 13 16 9 4 14 6 22 14 15 6 25 4 
12 6 6 3 7 13 11 5 11 3 11 8 5 16 12 2 5 3 8 4 3 11 62 147 9 54 20 14 3 5 28 12 3 6 5 7 2 9 9 10 8 11 4 4 6 10 7 20 5 4 1 6 9 6 7 9 2 2 6 5 10 3 23 5 13 6 7 20 11 12 15 17 2 4 2 3 25 2 6 3 15 4 6 5 30 15 9 4 28 3 4 6 5 6 18 7 2 9 2 2 9 8 11 8 1 4 5 4 4 2 
4 6 75 9 8 6 5 3 6 3 6 15 5 5 5 6 20 6 10 7 9 6 4 9 5 6 6 9 7 5 5 5 4 4 1 46 6 10 10 92 4 7 3 3 32 6 7 34 30 2 8 2 7 8 5 9 8 4 21 9 9 12 5 12 5 3 5 5 5 65 4 4 67 7 7 5 8 7 1 3 3 5 7 4 7 7 7 15 15 8 11 5 6 2 7 12 6 5 9 13 2 19 6 2 8 3 11 8 9 38 1 7 1 20 11 
5 7 1 4 6 5 4 2 2 7 5 5 2 51 5 4 9 7 4 16 3 67 7 45 9 8 9 12 38 11 6 14 2 2 10 10 6 10 4 9 9 5 4 7 2 8 7 4 5 1 2 2 6 3 5 4 7 0 2 6 5 13 5 5 4 11 4 4 9 4 2 8 10 5 6 10 6 4 2 2 6 4 6 3 4 7 5 7 0 6 4 5 4 1 8 11 15 15 14 20 168432

histogram: 31.7% 32.9% 33.3% 33.5% 33.7% 33.7% 33.9% 33.9% 34.0% 34.0% 34.0% 34.1% 34.1% 34.1% 34.1% 34.1% 34.2% 34.2% 34.2% 34.2% 34.2% 34.3% 34.3% 34.3% 34.3% 34.3% 34.3% 34.4% 34.4% 34.4% 34.4% 34.4% 34.4% 34.5% 34.5% 34.5% 34.5% 34.5% 34.5% 34.5% 
34.5% 34.5% 34.5% 34.5% 34.5% 34.5% 34.5% 34.5% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.6% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.7% 34.8% 
34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.8% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 34.9% 
35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.0% 35.1% 35.1% 35.1% 35.1% 35.1% 35.1% 35.1% 35.1% 35.1% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 
35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.2% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 35.3% 
35.3% 35.3% 35.4% 35.4% 35.4% 35.4% 35.4% 35.4% 35.4% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 35.5% 
35.5% 35.5% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.6% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 
35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.7% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 35.8% 
35.8% 35.8% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 35.9% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 
36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.0% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.1% 36.2% 36.2% 
36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.2% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 
36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.3% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 
36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 36.4% 100.0%

---

opensuse 11.1 64-bit with 24GB ram (busy server)

return values: 7279752 30276 15074 8523 6218 4882 3998 2940 19212 1887 1408 1123 892 670 507 673 1284 362 306 214 201 760 232 148 1062 129 266 102 661 101 116 112 911 88 86 74 174 149 70 98 627 77 632 190 115 466 85 72 545 88 92 99 87 114 127 79 379 91 
190 739 93 67 178 614 294 60 149 470 84 70 35 76 179 124 58 264 72 81 59 56 295 102 343 65 73 201 174 53 206 153 278 123 88 98 42 86 127 61 62 52 90 51 110 67 148 43 62 62 112 65 339 44 235 324 60 67 86 47 138 50 687 165 64 40 65 60 665 55 326 111 64 52 
43 60 149 48 444 42 33 65 117 157 70 43 219 53 181 46 177 38 125 45 95 56 189 553 204 39 76 49 88 85 730 38 109 44 895 120 241 45 44 41 51 33 35 44 357 49 39 71 28 694 263 43 104 43 45 34 35 47 153 44 233 44 31 55 40 27 547 47 264 33 36 30 33 39 38 33 105 
200 42 47 41 40 25 34 109 44 28 33 45 31 47 30 117 166 36 30 199 30 160 41 168 101 40 28 92 45 42 33 53 53 31 46 30 42 32 32 57 44 38 39 25 50 49 49 1392 30 157 294 35 58 36 35 45 38 29 32 32 31 36 27 61 42 24 31 35 85 61 26 43 19 35 36 26 26 87 33 34 34 
37 31 28 39 35 27 131 20 30 35 27 31 24 33 39 41 21 32 26 32 17 18 38 21 163 39 30 29 29 20 40 43 26 39 24 39 23 19 38 48 38 33 99 21 142 57 46 39 50 31 26 37 44 28 42 23 29 63 413 27 22 22 125 30 24 348 34 46 43 40 39 34 25 145 15 26 14 32 39 29 22 26 23 
31 11 29 159 215 19 77 41 24 85 42 46 23 19 59 24 32 18 83 146 24 22 22 32 32 20 25 32 30 16 27 36 26 24 22 30 31 15 27 27 33 15 23 27 16 27 29 173 38 34 39 451 17 30 24 26 77 18 21 29 22 24 157 26 29 19 24 38 30 32 35 18 37 20 16 53 97 23 25 24 29 31 27 
36 21 20 26 22 29 13 26 27 35 21 10 11 26 17 28 30 16 29 33 34 45 25 7 29 31 21 14 21 28 17 17 53 16 9 20 26 17 13 11 15 32 25 32 32 26 16 20 10 24 31 25 46 33 42 47 38201

histogram:
97.5% 97.9% 98.1% 98.3% 98.3% 98.4% 98.5% 98.5% 98.8% 98.8% 98.8% 98.8% 98.8% 98.8% 98.8% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 98.9% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 
99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.0% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 
99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 99.1% 
99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 
99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.2% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 
99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 
99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.3% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 
99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 
99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 
99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 99.4% 
99.4% 99.4% 99.4% 99.4% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 
99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 
99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 99.5% 100.0%

---

windows server 2008 R2 with 8G ram running for 3 days:

return values: 440570 8602 1952 1424 678 728 518 517 562 439 316 723 247 280 235 320 351 263 300 207 186 177 113 170 147 186 90 151 56 146 62 145 181 97 64 114 66 114 49 91 109 77 70 82 52 88 76 146 98 70 47 83 62 91 41 57 86 81 40 68 47 73 30 58 136 83 
18 45 38 45 17 86 36 80 247 301 27 75 20 55 93 51 32 38 42 41 16 41 54 50 29 40 41 303 15 61 80 42 19 25 13 29 15 33 24 52 18 42 19 28 16 28 109 16 25 43 30 31 13 40 22 24 18 34 10 20 43 27 62 19 18 22 12 33 17 23 24 14 12 23 14 22 24 26 44 33 20 21 23 22 
25 26 54 18 16 25 16 23 9 22 76 20 11 30 6 24 17 16 32 23 8 18 7 20 8 28 65 11 18 15 20 15 11 9 25 19 16 18 12 14 12 12 53 25 13 24 18 17 7 20 7 29 18 34 19 21 13 13 69 32 23 22 14 26 8 20 22 27 15 31 8 23 16 18 47 34 14 39 43 15 13 14 32 22 8 40 13 22 8 
22 25 19 13 15 9 13 10 17 11 19 18 22 20 13 6 15 98 15 12 12 16 17 13 13 46 15 6 29 10 19 12 17 38 9 8 15 14 19 5 17 9 16 13 14 9 14 13 18 51 22 13 12 11 105 12 20 15 21 8 12 13 12 5 13 54 23 24 34 9 19 48 11 26 18 7 19 8 22 10 4 56 21 11 18 12 11 10 14 
13 12 9 12 17 11 6 14 47 26 21 19 16 13 6 12 23 20 24 22 6 8 1 7 48 10 16 6 17 10 9 12 12 4 12 5 20 31 6 7 39 27 6 57 4 5 9 9 5 9 4 11 5 13 5 7 32 3 7 13 7 12 4 18 15 15 9 12 10 18 7 11 25 8 6 20 17 17 4 9 19 18 11 6 7 12 7 8 36 12 7 12 7 15 10 11 15 11 
12 24 20 13 10 10 42 14 5 10 9 15 5 14 12 14 10 12 4 37 8 11 27 11 10 30 7 8 8 21 22 14 16 13 15 18 19 12 53 17 15 16 5 12 5 13 92 15 8 18 6 13 26 27 109 10 8 10 6 15 11 12 15 13 12 12 15 9 14 12 89 7 11 15 10 16 7 18 24 179 13 58 48 47 28 68 1632222

histogram:  20.9% 21.3% 21.4% 21.5% 21.5% 21.6% [...] 22.5% 22.5% 100.0%

---

Windows XP guest with 1G RAM running for approx. 1 hour:

return values: 71567 1377 21943 339 422 132 134 110 151 56 72 53 [...] 4 7 7 0 179427

histogram: 25.6% 26.1% 34.0% 34.1% 34.2% 34.3% [...] 35.8% 35.8% 100.0%


Peter

>
> Paolo


-- 

Kind regards

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Managing Directors: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   VAT ID No.: DE 120607556

...........................................................

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 10:17       ` Peter Lieven
@ 2013-03-25 10:53         ` Paolo Bonzini
  2013-03-25 11:26           ` Peter Lieven
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-25 10:53 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi

> ubuntu 12.04 LTS 64-bit desktop with 1G memory shortly after boot:
> histogram: 31.7% 32.9% [...] 36.4% 100.0%
> 
> ---
> 
> opensuse 11.1 64-bit with 24GB ram (busy server)
> histogram: 97.5% 97.9% [...] 99.5% 100.0%
> 
> ---
> 
> windows server 2008 R2 with 8G ram running for 3 days:
> histogram:  20.9% 21.3% [...] 22.5% 100.0%
> 
> ---
> 
> windows XP guest with 1G Ram running for approx. 1 hours
> histogram: 25.6% [...] 35.8% 100.0%

Doesn't this suggest checking the first _and the last_ word,
and using the vectorized loop only if both are zero?

Paolo

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 10:53         ` Paolo Bonzini
@ 2013-03-25 11:26           ` Peter Lieven
  2013-03-25 13:02             ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-25 11:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi


On 25.03.2013 at 11:53, Paolo Bonzini <pbonzini@redhat.com> wrote:

>> ubuntu 12.04 LTS 64-bit desktop with 1G memory shortly after boot:
>> histogram: 31.7% 32.9% [...] 36.4% 100.0%
>> 
>> ---
>> 
>> opensuse 11.1 64-bit with 24GB ram (busy server)
>> histogram: 97.5% 97.9% [...] 99.5% 100.0%
>> 
>> ---
>> 
>> windows server 2008 R2 with 8G ram running for 3 days:
>> histogram:  20.9% 21.3% [...] 22.5% 100.0%
>> 
>> ---
>> 
>> windows XP guest with 1G Ram running for approx. 1 hours
>> histogram: 25.6% [...] 35.8% 100.0%
> 
> Doesn't this suggest checking the first _and the last_ word,
> and using the vectorized loop only if both are zero?

Maybe I should have explained the output in more detail. The percentages are cumulative: 35.8% in the second-to-last column means that 35.8% of
pages have a return value that is less than TARGET_PAGE_SIZE. This was meant to illustrate how many 64-bit chunks you have to look at to catch a
certain percentage of non-zero pages.

25.6% 26.1% 34.0% 34.1% 34.2% 34.3% 34.3% 34.4% 34.4% 34.4% 34.5% 34.5% 34.5% [...] 35.8% 100%

Looking e.g. at the third value: checking the first three 64-bit chunks already catches 34.0% of all pages.
It turns out that the non-zeroness of a page can usually be detected by looking at the first 256 or so bits; only a low
percentage of pages turns out to be non-zero at a later position. So after having checked the first chunks one by one,
there is no big penalty in scanning the remaining chunks with the vectorized loop.

Here is the distribution of return values for the Windows XP example:

25.62% 0.49% 7.86% 0.12% 0.15% 0.05% 0.05% 0.04% 0.05% 0.02% 0.03% 0.02% 0.03% 0.02% 0.02% 0.01% 0.03% 0.02% 0.01% 0.02% 0.02% 0.01% 0.02% 0.01% 0.01% 0.01% 0.01% 0.01% 0.02% 0.00% 0.01% 0.02% 0.03% 0.01% 0.01% 0.01% 0.01% 0.01% 0.01% 0.00% 0.00% 0.01% 0.07% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.01% 0.02% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.02% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.03% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.02% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.02% 0.00% 0.00% 0.02% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 64.23%

The last value is the percentage of pages with a return value of TARGET_PAGE_SIZE, meaning the page is all zero.
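
For reference, the cumulative percentages are derived from the raw
return-value counts roughly as follows (an illustrative sketch, not the
code I actually used; the helper name and output formatting are my own):

#include <stdio.h>

/* count[i] is the number of pages for which buffer_find_nonzero_offset()
 * returned i * sizeof(unsigned long); n is the number of bins.
 * Prints the running (cumulative) percentage per bin. */
static void print_cumulative_histogram(const unsigned long *count, size_t n)
{
    unsigned long total = 0, running = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += count[i];
    }
    for (i = 0; i < n; i++) {
        running += count[i];
        printf("%.1f%% ", 100.0 * running / total);
    }
    printf("\n");
}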

Peter

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 11:26           ` Peter Lieven
@ 2013-03-25 13:02             ` Paolo Bonzini
  2013-03-25 13:23               ` Peter Lieven
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-25 13:02 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi

> Maybe I should have explained the output more detailed. The percentages
> are added. 35.8% in the second last column means that
> 35.8% have a return value that is less than TARGET_PAGE_SIZE.
> This was meant to illustrate at how many 64-bit chunks you have
> to look to grab a certain percentage of non-zero pages.

Ok, I wrongly understood that many pages had 4088 zero bytes but
the last 8 were not zero.  Now it's clearer, and more logical too. :)

> Looking e.g. at the third value it means that looking at the first
> three 64-bit chunks it will catch 34.0% of all pages.
> It turns out that the non-zeroness of a page can be detected looking
> at the first 256 or so bits and only a low
> percentage turns out to be non-zero at a later position. So after
> having checked the first chunks one by one
> there is no big penalty looking at the remaining chunks with the
> vectorized loop.

I think it makes most sense to unroll the first four non-vectorized
iterations, i.e. not use SSE and use three or four ifs.  Either:

   if (foo[0]) return 0;
   if (foo[1]) return 8;
   if (foo[2]) return 16;
   if (foo[3]) return 24;

or

   if (foo[0]) return 0;
   if (foo[1] | foo[2] | foo[3]) return 8;

and then proceed on the remaining 4096-4*sizeof(long) bytes with
the vectorized loop.  foo+4 is aligned for SIMD operations on both
32- and 64-bit machines (4*sizeof(long) is 16 or 32 bytes, both
multiples of the 16-byte SSE alignment), which makes this a nice choice.

Paolo

> Here is the distribution of return values for the Windows XP example:
> 
> 25.62% 0.49% 7.86% 0.12% 0.15% 0.05% [...] 0.00% 0.00% 64.23%
> 
> The last value is the percentage of return value of TARGET_PAGE_SIZE
> meaning the page is all zero.
> 
> Peter
> 
> 

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 13:02             ` Paolo Bonzini
@ 2013-03-25 13:23               ` Peter Lieven
  2013-03-25 13:32                 ` Peter Lieven
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-25 13:23 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi


On 25.03.2013 at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:

>> Maybe I should have explained the output more detailed. The percentages
>> are added. 35.8% in the second last column means that
>> 35.8% have a return value that is less than TARGET_PAGE_SIZE.
>> This was meant to illustrate at how many 64-bit chunks you have
>> to look to grab a certain percentage of non-zero pages.
> 
> Ok, I wrongly understood that many pages had 4088 zero bytes but
> the last 8 were not zero.  Now it's clearer, and more logical too. :)
> 
>> Looking e.g. at the third value it means that looking at the first
>> three 64-bit chunks it will catch 34.0% of all pages.
>> It turns out that the non-zeroness of a page can be detected looking
>> at the first 256 or so bits and only a low
>> percentage turns out to be non-zero at a later position. So after
>> having checked the first chunks one by one
>> there is no big penalty looking at the remaining chunks with the
>> vectorized loop.
> 
> I think it makes most sense to unroll the first four non-vectorized
> iterations, i.e. not use SSE and use three or four ifs.  Either:
> 
>   if (foo[0]) return 0;
>   if (foo[1]) return 8;
>   if (foo[2]) return 16;
>   if (foo[3]) return 24;
> 
> or
> 
>   if (foo[0]) return 0;
>   if (foo[1] | foo[2] | foo[3]) return 8;
> 
> and then proceed on the remaining 4096-4*sizeof(long) bytes with
> the vectorized loop.  foo+4 is aligned for SIMD operations on both
> 32- and 64-bit machines, which makes this a nice choice.

I can't start at foo+4, since the remaining X-4*sizeof(long) bytes
are not divisible by 8*sizeof(VECTYPE).
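
(For example, with a 4096-byte page, 8-byte longs and a 16-byte VECTYPE,
illustrative numbers for a 64-bit SSE2 host, the remainder is
4096 - 32 = 4064 bytes, while the unrolled loop consumes
8 * 16 = 128 bytes per iteration; 4064 / 128 = 31.75, so a scalar tail
would be left over.)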

I could just do something like the following:

    const unsigned long *tmp = buf;
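
    /* scalar pass over the first unroll-factor's worth of data,
     * four longs per iteration */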
    
    for (i = 0; 
         i < sizeof(VECTYPE) * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
             / sizeof(unsigned long);
         i += 4) {
        if (tmp[i + 0]) return i * sizeof(unsigned long);
        if (tmp[i + 1]) return (i + 1) * sizeof(unsigned long);
        if (tmp[i + 2]) return (i + 2) * sizeof(unsigned long);
        if (tmp[i + 3]) return (i + 3) * sizeof(unsigned long);
    }
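
    /* then the usual vectorized loop over the remaining chunks */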

    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; 
         i < len / sizeof(VECTYPE); 
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        …
    }

Peter

> 
> Paolo
> 
>> Here is the distribution of return values for the Windows XP example:
>> 
>> 25.62% 0.49% 7.86% 0.12% 0.15% 0.05% [...] 0.00% 0.00% 64.23%
>> 
>> The last value is the percentage of return value of TARGET_PAGE_SIZE
>> meaning the page is all zero.
>> 
>> Peter
>> 
>> 

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 13:23               ` Peter Lieven
@ 2013-03-25 13:32                 ` Peter Lieven
  2013-03-25 14:34                   ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-25 13:32 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Orit Wasserman, quintela, qemu-devel, Stefan Hajnoczi


On 25.03.2013 at 14:23, Peter Lieven <pl@kamp.de> wrote:

> 
> On 25.03.2013 at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>>> Maybe I should have explained the output more detailed. The percentages
>>> are added. 35.8% in the second last column means that
>>> 35.8% have a return value that is less than TARGET_PAGE_SIZE.
>>> This was meant to illustrate at how many 64-bit chunks you have
>>> to look to grab a certain percentage of non-zero pages.
>> 
>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>> the last 8 were not zero.  Now it's clearer, and more logical too. :)
>> 
>>> Looking e.g. at the third value it means that looking at the first
>>> three 64-bit chunks it will catch 34.0% of all pages.
>>> It turns out that the non-zeroness of a page can be detected looking
>>> at the first 256 or so bits and only a low
>>> percentage turns out to be non-zero at a later position. So after
>>> having checked the first chunks one by one
>>> there is no big penalty looking at the remaining chunks with the
>>> vectorized loop.
>> 
>> I think it makes most sense to unroll the first four non-vectorized
>> iterations, i.e. not use SSE and use three or four ifs.  Either:
>> 
>>  if (foo[0]) return 0;
>>  if (foo[1]) return 8;
>>  if (foo[2]) return 16;
>>  if (foo[3]) return 24;
>> 
>> or
>> 
>>  if (foo[0]) return 0;
>>  if (foo[1] | foo[2] | foo[3]) return 8;
>> 
>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>> the vectorized loop.  foo+4 is aligned for SIMD operations on both
>> 32- and 64-bit machines, which makes this a nice choice.
> 
> i can't start at foo+4 since the remaining X-4*sizeof(long) bytes
> are not dividable by 8*sizeof(VECTYPE).
> 
> I could just do sty like the following:
> 
>    const unsigned long *tmp = buf;
> 
>    for (i = 0; 
>         i < sizeof(VECTYPE) * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>             / sizeof(unsigned long);
>         i += 4) {
>        if (tmp[i + 0]) return i * sizeof(unsigned long);
>        if (tmp[i + 1]) return (i+1) * sizeof(unsigned long);
>        if (tmp[i + 2]) return (i+2) * sizeof(unsigned long);
>        if (tmp[i + 3]) return (i+3) * sizeof(unsigned long);
>    }
> 
>    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; 
>         i < len / sizeof(VECTYPE); 
>         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>        …
>    }

Performance of the above is bad compared to:

    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

…

The above is basically what the old is_dup_page is doing, but after the first
8 iterations the optimized version kicks in.

Peter

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 13:32                 ` Peter Lieven
@ 2013-03-25 14:34                   ` Paolo Bonzini
  2013-03-25 21:37                     ` Peter Lieven
  2013-03-26  8:14                     ` Peter Lieven
  0 siblings, 2 replies; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-25 14:34 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Orit Wasserman, qemu-devel, quintela

On 25/03/2013 14:32, Peter Lieven wrote:
> 
> On 25.03.2013 at 14:23, Peter Lieven <pl@kamp.de> wrote:
> 
>>
>> On 25.03.2013 at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>>> Maybe I should have explained the output more detailed. The percentages
>>>> are added. 35.8% in the second last column means that
>>>> 35.8% have a return value that is less than TARGET_PAGE_SIZE.
>>>> This was meant to illustrate at how many 64-bit chunks you have
>>>> to look to grab a certain percentage of non-zero pages.
>>>
>>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>>> the last 8 were not zero.  Now it's clearer, and more logical too. :)
>>>
>>>> Looking e.g. at the third value it means that looking at the first
>>>> three 64-bit chunks it will catch 34.0% of all pages.
>>>> It turns out that the non-zeroness of a page can be detected looking
>>>> at the first 256 or so bits and only a low
>>>> percentage turns out to be non-zero at a later position. So after
>>>> having checked the first chunks one by one
>>>> there is no big penalty looking at the remaining chunks with the
>>>> vectorized loop.
>>>
>>> I think it makes most sense to unroll the first four non-vectorized
>>> iterations, i.e. not use SSE and use three or four ifs.  Either:
>>>
>>>  if (foo[0]) return 0;
>>>  if (foo[1]) return 8;
>>>  if (foo[2]) return 16;
>>>  if (foo[3]) return 24;
>>>
>>> or
>>>
>>>  if (foo[0]) return 0;
>>>  if (foo[1] | foo[2] | foo[3]) return 8;
>>>
>>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>>> the vectorized loop.  foo+4 is aligned for SIMD operations on both
>>> 32- and 64-bit machines, which makes this a nice choice.
>>
>> i can't start at foo+4 since the remaining X-4*sizeof(long) bytes
>> are not dividable by 8*sizeof(VECTYPE).


Hmm, right.  What about just processing the first few longs twice, i.e.
the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
+= BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?

Paolo

>>
>>    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; 
>>         i < len / sizeof(VECTYPE); 
>>         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>        …
>>    }
> 
> performance of the above is bad compared to:
> 
>     for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>         if (!ALL_EQ(p[i], zero)) {
>             return i * sizeof(VECTYPE);
>         }
>     }
> 
> …
> 
> The above is basically what old is_dup_page is doing, but after the first
> 8 iterations the optimized version kicks in.
> 
> Peter
> 
> 
> 

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 14:34                   ` Paolo Bonzini
@ 2013-03-25 21:37                     ` Peter Lieven
  2013-03-26  8:14                     ` Peter Lieven
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Lieven @ 2013-03-25 21:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Stefan Hajnoczi, Orit Wasserman, qemu-devel, quintela


On 25.03.2013 at 15:34, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 25/03/2013 14:32, Peter Lieven wrote:
>> 
>> On 25.03.2013 at 14:23, Peter Lieven <pl@kamp.de> wrote:
>> 
>>> 
>>> On 25.03.2013 at 14:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> 
>>>>> Maybe I should have explained the output more detailed. The percentages
>>>>> are added. 35.8% in the second last column means that
>>>>> 35.8% have a return value that is less than TARGET_PAGE_SIZE.
>>>>> This was meant to illustrate at how many 64-bit chunks you have
>>>>> to look to grab a certain percentage of non-zero pages.
>>>> 
>>>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>>>> the last 8 were not zero.  Now it's clearer, and more logical too. :)
>>>> 
>>>>> Looking e.g. at the third value it means that looking at the first
>>>>> three 64-bit chunks it will catch 34.0% of all pages.
>>>>> It turns out that the non-zeroness of a page can be detected looking
>>>>> at the first 256 or so bits and only a low
>>>>> percentage turns out to be non-zero at a later position. So after
>>>>> having checked the first chunks one by one
>>>>> there is no big penalty looking at the remaining chunks with the
>>>>> vectorized loop.
>>>> 
>>>> I think it makes most sense to unroll the first four non-vectorized
>>>> iterations, i.e. not use SSE and use three or four ifs.  Either:
>>>> 
>>>> if (foo[0]) return 0;
>>>> if (foo[1]) return 8;
>>>> if (foo[2]) return 16;
>>>> if (foo[3]) return 24;
>>>> 
>>>> or
>>>> 
>>>> if (foo[0]) return 0;
>>>> if (foo[1] | foo[2] | foo[3]) return 8;
>>>> 
>>>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>>>> the vectorized loop.  foo+4 is aligned for SIMD operations on both
>>>> 32- and 64-bit machines, which makes this a nice choice.
>>> 
>>> i can't start at foo+4 since the remaining X-4*sizeof(long) bytes
>>> are not dividable by 8*sizeof(VECTYPE).
> 
> 
> Hmm, right.  What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?

I will profile it tomorrow.

What is bad about processing the first 8 vectors as described below?

>>  for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>        if (!ALL_EQ(p[i], zero)) {
>>            return i * sizeof(VECTYPE);
>>        }
>>    }


This way it would not be necessary to process them twice.

Peter

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-25 14:34                   ` Paolo Bonzini
  2013-03-25 21:37                     ` Peter Lieven
@ 2013-03-26  8:14                     ` Peter Lieven
  2013-03-26  9:20                       ` Paolo Bonzini
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Lieven @ 2013-03-26  8:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Stefan Hajnoczi, Orit Wasserman, qemu-devel, quintela


On 25.03.2013 at 15:34, Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> Hmm, right.  What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?

I tested this version as v3:

size_t buffer_find_nonzero_offset_v3(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    unsigned long *tmp = (unsigned long *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;
    
    assert(can_use_buffer_find_nonzero_offset(buf, len));
    
    if (!len) {
        return 0;
    }
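
    /* check the first four longs one by one; the histograms earlier in
     * this thread show that most non-zero pages are caught this early */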
    
    if (tmp[0]) {
        return 0;
    }

    if (tmp[1]) {
        return 1 * sizeof(unsigned long);
    }

    if (tmp[2]) {
        return 2 * sizeof(unsigned long);
    }

    if (tmp[3]) {
        return 3 * sizeof(unsigned long);
    }
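
    /* scan the whole buffer eight vectors at a time; starting at i = 0
     * re-reads the longs checked above, trading that overlap for a
     * simpler loop */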

    for (i = 0; i < len / sizeof(VECTYPE); 
            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }
    
    return i * sizeof(VECTYPE);
}

For reference this is v2:

size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;
    
    assert(can_use_buffer_find_nonzero_offset(buf, len));
    
    if (!len) {
        return 0;
    }
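
    /* check the first unroll-factor's worth of vectors individually */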
    
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }
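
    /* then continue eight vectors at a time, skipping the chunks
     * already checked above */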

    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; 
            i < len / sizeof(VECTYPE); 
            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }
    
    return i * sizeof(VECTYPE);
}

I ran 3*2 tests, each with 1GB of memory and 256 iterations of checking each 4k page for zero.
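
For reference, the harness is essentially of the following shape (a
simplified sketch, not the exact code I ran; the timing method and
allocation details are approximations, and only the all-zero case is
shown):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* from this patch series */
size_t buffer_find_nonzero_offset(const void *buf, size_t len);

#define PAGE_SIZE  4096
#define MEM_SIZE   (1UL << 30)  /* 1GB */
#define ITERATIONS 256

int main(void)
{
    char *mem;
    size_t res = 0, i, off;
    clock_t start;

    /* page-aligned allocation, as required by
     * can_use_buffer_find_nonzero_offset() */
    if (posix_memalign((void **)&mem, PAGE_SIZE, MEM_SIZE)) {
        return 1;
    }
    memset(mem, 0, MEM_SIZE);   /* test 1: all pages zero */

    start = clock();
    for (i = 0; i < ITERATIONS; i++) {
        for (off = 0; off < MEM_SIZE; off += PAGE_SIZE) {
            /* count pages detected as completely zero */
            res += buffer_find_nonzero_offset(mem + off, PAGE_SIZE)
                   == PAGE_SIZE;
        }
    }
    printf("res=%zu (ticks %ld)\n", res, (long)(clock() - start));
    free(mem);
    return 0;
}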

1) all pages zero

a) SSE2
is_zero_page: res=67108864 (ticks 3289 user 1 system)
is_zero_page_v2: res=67108864 (ticks 3326 user 0 system)
is_zero_page_v3: res=67108864 (ticks 3305 user 3 system)
is_dup_page: res=67108864 (ticks 3648 user 1 system)

b) unsigned long arithmetic

is_zero_page: res=67108864 (ticks 3474 user 3 system)
is_zero_page_2: res=67108864 (ticks 3516 user 1 system)
is_zero_page_3: res=67108864 (ticks 3525 user 3 system)
is_dup_page: res=67108864 (ticks 3826 user 4 system)

2) all pages non-zero, but the first 64 bits of each page zero

a) SSE2
is_zero_page: res=0 (ticks 251 user 0 system)
is_zero_page_v2: res=0 (ticks 87 user 0 system)
is_zero_page_v3: res=0 (ticks 91 user 0 system)
is_dup_page: res=0 (ticks 82 user 0 system)

b) unsigned long arithmetic
is_zero_page: res=0 (ticks 209 user 0 system)
is_zero_page_v2: res=0 (ticks 89 user 0 system)
is_zero_page_v3: res=0 (ticks 88 user 0 system)
is_dup_page: res=0 (ticks 88 user 0 system)

3) all pages non-zero, but the first 256 bits of each page zero

a) SSE2
is_zero_pages: res=0 (ticks 260 user 0 system)
is_zero_pages_2: res=0 (ticks 199 user 0 system)
is_zero_pages_3: res=0 (ticks 342 user 0 system)
is_dup_pages: res=0 (ticks 223 user 0 system)

b) unsigned long arithmetic
is_zero_pages: res=0 (ticks 230 user 0 system)
is_zero_pages_2: res=0 (ticks 194 user 0 system)
is_zero_pages_3: res=0 (ticks 280 user 0 system)
is_dup_pages: res=0 (ticks 191 user 0 system)


---

is_zero_page is the version from patch set v4.
is_zero_page_2 checks the first 8 * sizeof(VECTYPE) bytes one chunk at a time and then continues 8 chunks at once without double-checking.
is_zero_page_3 is the v3 version above.
is_dup_page is the old implementation.

All compiled with gcc -O3

If no one objects I would use is_zero_page_2 and continue with v5 of the patch set, as I am
ooo for the next 8 days from tomorrow. I prefer v2 as it has better performance if the non-zeroness
is within the first 8*sizeof(VECTYPE) bytes but not in the first 256 bits.

Paolo, with the version that has lower setup costs in mind, shall I use the vectorized or the unrolled version of patch 4 (find_next_bit optimization)?

Peter

* Re: [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations
  2013-03-26  8:14                     ` Peter Lieven
@ 2013-03-26  9:20                       ` Paolo Bonzini
  0 siblings, 0 replies; 44+ messages in thread
From: Paolo Bonzini @ 2013-03-26  9:20 UTC (permalink / raw)
  To: Peter Lieven; +Cc: Stefan Hajnoczi, Orit Wasserman, qemu-devel, quintela

On 26/03/2013 09:14, Peter Lieven wrote:
> If no one objects I would use is_zero_page_2 and continue with v5 of
> the patch set, as I am ooo for the next 8 days from tomorrow. I
> prefer v2 as it has better performance if the non-zeroness is within
> the first 8*sizeof(VECTYPE) bytes but not in the first 256 bits.

Either v2 or v3 is fine.  v3 has slightly simpler code and v2 optimizes
for a rare case, but v2 is indeed a bit faster and your benchmarking
effort should be rewarded. :)

> Paolo, with the version that has lower setup costs in mind shall I
> use the vectorized or the unrolled version of patch 4 (find_next_bit
> optimization)?

I think for that we should, at least for now, use the version we
discussed a few weeks ago (with no SIMD and just unrolling).

Paolo

end of thread, other threads:[~2013-03-26  9:20 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-22 12:46 [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Peter Lieven
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 1/9] move vector definitions to qemu-common.h Peter Lieven
2013-03-25  8:35   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
2013-03-22 19:37   ` Eric Blake
2013-03-22 20:03     ` Peter Lieven
2013-03-22 20:22       ` [Qemu-devel] indentation hints [was: [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer] Eric Blake
2013-03-23 11:18         ` Peter Maydell
2013-03-25  8:53   ` [Qemu-devel] [PATCHv4 2/9] cutils: add a function to find non-zero content in a buffer Orit Wasserman
2013-03-25  8:56     ` Peter Lieven
2013-03-25  9:26       ` Orit Wasserman
2013-03-25  9:42         ` Paolo Bonzini
2013-03-25 10:03           ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
2013-03-25  8:53   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
2013-03-25  9:04   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 5/9] migration: search for zero instead of dup pages Peter Lieven
2013-03-22 19:49   ` Eric Blake
2013-03-22 20:02     ` Peter Lieven
2013-03-25  9:30   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
2013-03-25  9:32   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 7/9] migration: do not sent zero pages in bulk stage Peter Lieven
2013-03-22 20:13   ` Eric Blake
2013-03-25  9:44   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 8/9] migration: do not search dirty " Peter Lieven
2013-03-25 10:05   ` Orit Wasserman
2013-03-22 12:46 ` [Qemu-devel] [PATCHv4 9/9] migration: use XBZRLE only after " Peter Lieven
2013-03-25 10:16   ` Orit Wasserman
2013-03-22 17:25 ` [Qemu-devel] [PATCHv4 0/9] buffer_is_zero / migration optimizations Paolo Bonzini
2013-03-22 19:20   ` Peter Lieven
2013-03-22 21:24     ` Paolo Bonzini
2013-03-23  7:34       ` Peter Lieven
2013-03-25 10:17       ` Peter Lieven
2013-03-25 10:53         ` Paolo Bonzini
2013-03-25 11:26           ` Peter Lieven
2013-03-25 13:02             ` Paolo Bonzini
2013-03-25 13:23               ` Peter Lieven
2013-03-25 13:32                 ` Peter Lieven
2013-03-25 14:34                   ` Paolo Bonzini
2013-03-25 21:37                     ` Peter Lieven
2013-03-26  8:14                     ` Peter Lieven
2013-03-26  9:20                       ` Paolo Bonzini
