All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations
@ 2013-03-15 15:50 Peter Lieven
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h Peter Lieven
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

this is v2 of my patch series with various optimizations in
zero buffer checking and migration tweaks.

hopefully properly threaded this time.

thanks especially to Eric Blake and Paolo Bonzini for their comments.

the patches can also be fetched from:
git://github.com/plieven/qemu.git [branch migration_optimizations_v2]

v2:
 - fix description, add trivial zero check and add asserts 
   to buffer_find_nonzero_offset.
 - add a constant for the unroll factor of buffer_find_nonzero_offset
 - replace is_dup_page() by buffer_is_zero()
 - added test results to xbzrle patch
 - optimize descriptions

Have a nice weekend,
Peter

Peter Lieven (9):
  move vector definitions to qemu-common.h
  cutils: add a function to find non-zero content in a buffer
  buffer_is_zero: use vector optimizations if possible
  bitops: use vector algorithm to optimize find_next_bit()
  migration: search for zero instead of dup pages
  migration: add an indicator for bulk state of ram migration
  migration: do not sent zero pages in bulk stage
  migration: do not search dirty pages in bulk stage
  migration: use XBZRLE only after bulk stage

 arch_init.c           |   62 +++++++++++++++----------------------------------
 include/qemu-common.h |   26 +++++++++++++++++++++
 util/bitops.c         |   26 ++++++++++++++++++---
 util/cutils.c         |   47 +++++++++++++++++++++++++++++++++++++
 4 files changed, 115 insertions(+), 46 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 15:35   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

vector optimizations will now be used at various places
not just in is_dup_page() in arch_init.c

this patch also adds a zero splat vector.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c           |   20 --------------------
 include/qemu-common.h |   24 ++++++++++++++++++++++++
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 98e2bc6..1b71912 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -114,26 +114,6 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
 
-#ifdef __ALTIVEC__
-#include <altivec.h>
-#define VECTYPE        vector unsigned char
-#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
-#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
-/* altivec.h may redefine the bool macro as vector type.
- * Reset it to POSIX semantics. */
-#undef bool
-#define bool _Bool
-#elif defined __SSE2__
-#include <emmintrin.h>
-#define VECTYPE        __m128i
-#define SPLAT(p)       _mm_set1_epi8(*(p))
-#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
-#else
-#define VECTYPE        unsigned long
-#define SPLAT(p)       (*(p) * (~0UL / 255))
-#define ALL_EQ(v1, v2) ((v1) == (v2))
-#endif
-
 
 static struct defconfig_file {
     const char *filename;
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 5e13708..b59328f 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -442,4 +442,28 @@ int64_t pow2floor(int64_t value);
 int uleb128_encode_small(uint8_t *out, uint32_t n);
 int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
+/* vector definitions */
+#ifdef __ALTIVEC__
+#include <altivec.h>
+#define VECTYPE        vector unsigned char
+#define SPLAT(p)       vec_splat(vec_ld(0, p), 0)
+#define ZERO_SPLAT     vec_splat(vec_ld(0, 0), 0)
+#define ALL_EQ(v1, v2) vec_all_eq(v1, v2)
+/* altivec.h may redefine the bool macro as vector type.
+ * Reset it to POSIX semantics. */
+#undef bool
+#define bool _Bool
+#elif defined __SSE2__
+#include <emmintrin.h>
+#define VECTYPE        __m128i
+#define SPLAT(p)       _mm_set1_epi8(*(p))
+#define ZERO_SPLAT     _mm_setzero_si128()
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#else
+#define VECTYPE        unsigned long
+#define SPLAT(p)       (*(p) * (~0UL / 255))
+#define ZERO_SPLAT     0x0UL
+#define ALL_EQ(v1, v2) ((v1) == (v2))
+#endif
+
 #endif
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 15:54   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/qemu-common.h |    2 ++
 util/cutils.c         |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/qemu-common.h b/include/qemu-common.h
index b59328f..51a7677 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -362,6 +362,8 @@ size_t qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
 size_t qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
                          int fillc, size_t bytes);
 
+#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
+size_t buffer_find_nonzero_offset(const void *buf, size_t len);
 bool buffer_is_zero(const void *buf, size_t len);
 
 void qemu_progress_init(int enabled, float min_skip);
diff --git a/util/cutils.c b/util/cutils.c
index 1439da4..857dd7d 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -143,6 +143,46 @@ int qemu_fdatasync(int fd)
 }
 
 /*
+ * Searches for an area with non-zero content in a buffer
+ *
+ * Attention! The len must be a multiple of 8 * sizeof(VECTYPE) 
+ * and addr must be a multiple of sizeof(VECTYPE) due to 
+ * restriction of optimizations in this function.
+ * 
+ * The return value is the offset of the non-zero area rounded
+ * down to 8 * sizeof(VECTYPE). If the buffer is all zero 
+ * the return value is equal to len.
+ */
+
+size_t buffer_find_nonzero_offset(const void *buf, size_t len)
+{
+    VECTYPE *p = (VECTYPE *)buf;
+    VECTYPE zero = ZERO_SPLAT;
+    size_t i;
+    
+    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 
+        * sizeof(VECTYPE)) == 0);
+    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
+
+    if (*((const long *) buf)) {
+        return 0;
+    }
+
+    for (i = 0; i < len / sizeof(VECTYPE); i += 8) {
+		VECTYPE tmp0 = p[i+0] | p[i+1];
+		VECTYPE tmp1 = p[i+2] | p[i+3];
+		VECTYPE tmp2 = p[i+4] | p[i+5];
+		VECTYPE tmp3 = p[i+6] | p[i+7];
+		VECTYPE tmp01 = tmp0 | tmp1;
+		VECTYPE tmp23 = tmp2 | tmp3;
+		if (!ALL_EQ(tmp01 | tmp23, zero)) {
+		    break;
+		}
+    }
+    return i * sizeof(VECTYPE);
+}
+
+/*
  * Checks if a buffer is all zeroes
  *
  * Attention! The len must be a multiple of 4 * sizeof(long) due to
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h Peter Lieven
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 16:08   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

performance gain on SSE2 is approx. 20-25%. altivec
is not tested. performance for unsigned long arithmetic
is unchanged.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 util/cutils.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 857dd7d..00d98fb 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
  */
 bool buffer_is_zero(const void *buf, size_t len)
 {
+    /* use vector optimized zero check if possible */
+    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
+          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+             * sizeof(VECTYPE)) == 0) {
+        return buffer_find_nonzero_offset(buf, len)==len;
+    }
+
     /*
      * Use long as the biggest available internal data type that fits into the
      * CPU register and unroll the loop to smooth out the effect of memory
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit()
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (2 preceding siblings ...)
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 16:49   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 5/9] migration: search for zero instead of dup pages Peter Lieven
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

this patch adds the usage of buffer_find_nonzero_offset()
to skip large areas of zeroes.

compared to loop unrolling presented in an earlier
patch this adds another 50% performance benefit for
skipping large areas of zeroes. loop unrolling alone
added close to 100% speedup.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 util/bitops.c |   26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/util/bitops.c b/util/bitops.c
index e72237a..3c301fa 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -42,10 +42,30 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
-        if ((tmp = *(p++))) {
-            goto found_middle;
+    while (size >= BITS_PER_LONG) {
+        if ((tmp = *p)) {
+             goto found_middle;
+        }
+        if (((uintptr_t) p) % sizeof(VECTYPE) == 0 
+                && size >= BITS_PER_BYTE * sizeof(VECTYPE)
+                   * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
+            unsigned long tmp2 =
+                buffer_find_nonzero_offset(p, ((size / BITS_PER_BYTE) & 
+                           ~(BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 
+                             sizeof(VECTYPE) - 1)));
+            result += tmp2 * BITS_PER_BYTE;
+            size -= tmp2 * BITS_PER_BYTE;
+            p += tmp2 / sizeof(unsigned long);
+            if (!size) {
+                return result;
+            }
+            if (tmp2) {
+                if ((tmp = *p)) {
+                    goto found_middle;
+                }
+            }
         }
+        p++;
         result += BITS_PER_LONG;
         size -= BITS_PER_LONG;
     }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 5/9] migration: search for zero instead of dup pages
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (3 preceding siblings ...)
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 16:55   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

virtually all dup pages are zero pages. remove
the speical is_dup_page() function and use the
optimized buffer_is_zero() function instead.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c |   17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 1b71912..87c16fc 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -144,21 +144,6 @@ int qemu_read_default_config_files(bool userconfig)
     return 0;
 }
 
-static int is_dup_page(uint8_t *page)
-{
-    VECTYPE *p = (VECTYPE *)page;
-    VECTYPE val = SPLAT(page);
-    int i;
-
-    for (i = 0; i < TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
-        if (!ALL_EQ(val, p[i])) {
-            return 0;
-        }
-    }
-
-    return 1;
-}
-
 /* struct contains XBZRLE cache and a static page
    used by the compression */
 static struct {
@@ -443,7 +428,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_dup_page(p)) {
+            if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
                 acct_info.dup_pages++;
                 bytes_sent = save_block_hdr(f, block, offset, cont,
                                             RAM_SAVE_FLAG_COMPRESS);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 6/9] migration: add an indicator for bulk state of ram migration
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (4 preceding siblings ...)
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 5/9] migration: search for zero instead of dup pages Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 17:32   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage Peter Lieven
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

the first round of ram transfer is special since all pages
are dirty and thus all memory pages are transferred to
the target. this patch adds a boolean variable to track
this stage.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch_init.c b/arch_init.c
index 87c16fc..e5531e8 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -311,6 +311,7 @@ static ram_addr_t last_offset;
 static unsigned long *migration_bitmap;
 static uint64_t migration_dirty_pages;
 static uint32_t last_version;
+static bool ram_bulk_stage;
 
 static inline
 ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
@@ -418,6 +419,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             if (!block) {
                 block = QTAILQ_FIRST(&ram_list.blocks);
                 complete_round = true;
+                ram_bulk_stage = false;
             }
         } else {
             uint8_t *p;
@@ -521,6 +523,7 @@ static void reset_ram_globals(void)
     last_sent_block = NULL;
     last_offset = 0;
     last_version = ram_list.version;
+    ram_bulk_stage = true;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (5 preceding siblings ...)
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 17:36   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty " Peter Lieven
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 9/9] migration: use XBZRLE only after " Peter Lieven
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

during bulk stage of ram migration if a page is a
zero page do not send it at all.
the memory at the destination reads as zero anyway.

even if there is an madvise with QEMU_MADV_DONTNEED
at the target upon receival of a zero page I have observed
that the target starts swapping if the memory is overcommitted.
it seems that the pages are dropped asynchronously.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index e5531e8..a3dc20d 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -432,9 +432,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             bytes_sent = -1;
             if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
                 acct_info.dup_pages++;
-                bytes_sent = save_block_hdr(f, block, offset, cont,
-                                            RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
+                if (!ram_bulk_stage) {
+                    bytes_sent = save_block_hdr(f, block, offset, cont,
+                                                RAM_SAVE_FLAG_COMPRESS);
+                    qemu_put_byte(f, *p);
+                }
                 bytes_sent += 1;
             } else if (migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty pages in bulk stage
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (6 preceding siblings ...)
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 17:40   ` Eric Blake
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 9/9] migration: use XBZRLE only after " Peter Lieven
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

avoid searching for dirty pages just increment the
page offset. all pages are dirty anyway.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a3dc20d..ca281ad 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -321,8 +321,14 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
     unsigned long nr = base + (start >> TARGET_PAGE_BITS);
     unsigned long size = base + (int128_get64(mr->size) >> TARGET_PAGE_BITS);
 
-    unsigned long next = find_next_bit(migration_bitmap, size, nr);
-
+    unsigned long next;
+    
+    if (ram_bulk_stage) {
+        next = nr + 1;
+    } else {
+        next = find_next_bit(migration_bitmap, size, nr);
+    }
+    
     if (next < size) {
         clear_bit(next, migration_bitmap);
         migration_dirty_pages--;
@@ -523,7 +529,7 @@ static void reset_ram_globals(void)
 {
     last_seen_block = NULL;
     last_sent_block = NULL;
-    last_offset = 0;
+    last_offset = -1;
     last_version = ram_list.version;
     ram_bulk_stage = true;
 }
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Qemu-devel] [PATCHv2 9/9] migration: use XBZRLE only after bulk stage
  2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
                   ` (7 preceding siblings ...)
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty " Peter Lieven
@ 2013-03-15 15:50 ` Peter Lieven
  2013-03-19 17:43   ` Eric Blake
  8 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-15 15:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Lieven

at the beginning of migration all pages are marked dirty and
in the first round a bulk migration of all pages is performed.

currently all these pages are copied to the page cache regardless
if there are frequently updated or not. this doesn't make sense
since most of these pages are never transferred again.

this patch changes the XBZRLE transfer to only be used after
the bulk stage has been completed. that means a page is added
to the page cache the second time it is transferred and XBZRLE
can benefit from the third time of transfer.

since the page cache is likely smaller than the number of pages
its also likely that in the second round the page is missing in the
cache due to collisions in the bulk phase.

on the other hand a lot of unneccssary mallocs, memdups and frees
are saved.

the following results have been taken earlier while executing
the test program from docs/xbzrle.txt. (+) with the patch and (-)
without. (thanks to Eric Blake for reformatting and comments)

+ total time: 22185 milliseconds
- total time: 22410 milliseconds

Shaved 0.3 seconds, better than 1%!

+ downtime: 29 milliseconds
- downtime: 21 milliseconds

Not sure why downtime seemed worse, but probably not the end of the world.

+ transferred ram: 706034 kbytes
- transferred ram: 721318 kbytes

Fewer bytes sent - good.

+ remaining ram: 0 kbytes
- remaining ram: 0 kbytes
+ total ram: 1057216 kbytes
- total ram: 1057216 kbytes
+ duplicate: 108556 pages
- duplicate: 105553 pages
+ normal: 175146 pages
- normal: 179589 pages
+ normal bytes: 700584 kbytes
- normal bytes: 718356 kbytes

Fewer normal bytes...

+ cache size: 67108864 bytes
- cache size: 67108864 bytes
+ xbzrle transferred: 3127 kbytes
- xbzrle transferred: 630 kbytes

...and more compressed pages sent - good.

+ xbzrle pages: 117811 pages
- xbzrle pages: 21527 pages
+ xbzrle cache miss: 18750
- xbzrle cache miss: 179589

And very good improvement on the cache miss rate.

+ xbzrle overflow : 0
- xbzrle overflow : 0

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 arch_init.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index ca281ad..08e2744 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -444,7 +444,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
                     qemu_put_byte(f, *p);
                 }
                 bytes_sent += 1;
-            } else if (migrate_use_xbzrle()) {
+            } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
                                               offset, cont, last_stage);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h Peter Lieven
@ 2013-03-19 15:35   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2013-03-19 15:35 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 577 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> vector optimizations will now be used at various places
> not just in is_dup_page() in arch_init.c
> 
> this patch also adds a zero splat vector.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c           |   20 --------------------
>  include/qemu-common.h |   24 ++++++++++++++++++++++++
>  2 files changed, 24 insertions(+), 20 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
@ 2013-03-19 15:54   ` Eric Blake
  2013-03-19 16:18     ` Peter Lieven
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Blake @ 2013-03-19 15:54 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  include/qemu-common.h |    2 ++
>  util/cutils.c         |   40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 42 insertions(+)
> 

>  
> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

Good.

> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 
> +        * sizeof(VECTYPE)) == 0);

Good use of it.

> +    assert(((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> +
> +    if (*((const long *) buf)) {
> +        return 0;
> +    }
> +
> +    for (i = 0; i < len / sizeof(VECTYPE); i += 8) {

Magic number 8, instead of reusing BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR.

> +		VECTYPE tmp0 = p[i+0] | p[i+1];

Indentation looks off, because you used TABs.

Spaces around binary operators (two instances of '+' per line).

> +		VECTYPE tmp1 = p[i+2] | p[i+3];
> +		VECTYPE tmp2 = p[i+4] | p[i+5];
> +		VECTYPE tmp3 = p[i+6] | p[i+7];
> +		VECTYPE tmp01 = tmp0 | tmp1;
> +		VECTYPE tmp23 = tmp2 | tmp3;
> +		if (!ALL_EQ(tmp01 | tmp23, zero)) {
> +		    break;
> +		}
> +    }
> +    return i * sizeof(VECTYPE);
> +}

Algorithm looks correct, but worth a respin to fix the appearance.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
@ 2013-03-19 16:08   ` Eric Blake
  2013-03-19 16:14     ` Peter Lieven
  2013-03-19 19:44     ` Peter Lieven
  0 siblings, 2 replies; 27+ messages in thread
From: Eric Blake @ 2013-03-19 16:08 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2041 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> performance gain on SSE2 is approx. 20-25%. altivec
> is not tested. performance for unsigned long arithmetic
> is unchanged.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  util/cutils.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/util/cutils.c b/util/cutils.c
> index 857dd7d..00d98fb 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>   */
>  bool buffer_is_zero(const void *buf, size_t len)
>  {
> +    /* use vector optimized zero check if possible */
> +    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
> +          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +             * sizeof(VECTYPE)) == 0) {

Is it worth factoring this check into something more reusable, by adding
something like 'bool buffer_can_use_vectors(buf, len)' in patch 2/9?

> +        return buffer_find_nonzero_offset(buf, len)==len;

Spaces around binary operators.

Is it worth rewriting this function into a simpler:

check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs until
we are aligned
check buffer_find_nonzero_offset on the aligned middle
check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs at tail

instead of having two instances of code that can loop over the entire
buffer?  Otherwise, searching for zeros on an unaligned buffer will
remain slower, even though the bulk of the search could still benefit
from the vector operations.

> +    }
> +
>      /*
>       * Use long as the biggest available internal data type that fits into the
>       * CPU register and unroll the loop to smooth out the effect of memory

Your patch results in C99 declarations after statements; while we
require C99, I'm not sure if qemu prefers to stick to the C89 style of
declarations before statements.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible
  2013-03-19 16:08   ` Eric Blake
@ 2013-03-19 16:14     ` Peter Lieven
  2013-03-19 19:44     ` Peter Lieven
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 16:14 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 17:08 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> performance gain on SSE2 is approx. 20-25%. altivec
>> is not tested. performance for unsigned long arithmetic
>> is unchanged.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> util/cutils.c |    7 +++++++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 857dd7d..00d98fb 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>  */
>> bool buffer_is_zero(const void *buf, size_t len)
>> {
>> +    /* use vector optimized zero check if possible */
>> +    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
>> +          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +             * sizeof(VECTYPE)) == 0) {
> 
> Is it worth factoring this check into something more reusable, by adding
> something like 'bool buffer_can_use_vectors(buf, len)' in patch 2/9?

good idea!

> 
>> +        return buffer_find_nonzero_offset(buf, len)==len;
> 
> Spaces around binary operators.
> 
> Is it worth rewriting this function into a simpler:
> 
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs until
> we are aligned
> check buffer_find_nonzero_offset on the aligned middle
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs at tail
> 
> instead of having two instances of code that can loop over the entire
> buffer?  Otherwise, searching for zeros on an unaligned buffer will
> remain slower, even though the bulk of the search could still benefit
> from the vector operations.

i do not think that this is necessary as in all cases I know the buffers are aligned properly.

a) ram pages (4k)
b) block migration pages (1M)
c) bdrv sectors (512)

At least on x86_64 they have all 16-byte aligned addresses. 

It would just add more branches to the case where the optimized version can be used maybe
making it slower.

Peter

> 
>> +    }
>> +
>>     /*
>>      * Use long as the biggest available internal data type that fits into the
>>      * CPU register and unroll the loop to smooth out the effect of memory
> 
> Your patch results in C99 declarations after statements; while we
> require C99, I'm not sure if qemu prefers to stick to the C89 style of
> declarations before statements.

the code was slower if I made the checks afterwards.

Peter

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-19 15:54   ` Eric Blake
@ 2013-03-19 16:18     ` Peter Lieven
  2013-03-19 16:43       ` Eric Blake
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 16:18 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 16:54 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> include/qemu-common.h |    2 ++
>> util/cutils.c         |   40 ++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 42 insertions(+)
>> 
> 
>> 
>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> 
> Good.
> 
>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 
>> +        * sizeof(VECTYPE)) == 0);
> 
> Good use of it.

A question I had is if these asserts make the code slower? In case of all
the 4k pages in RAM migration this could be significant.

Peter

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-19 16:18     ` Peter Lieven
@ 2013-03-19 16:43       ` Eric Blake
  2013-03-19 19:42         ` Peter Lieven
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Blake @ 2013-03-19 16:43 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

On 03/19/2013 10:18 AM, Peter Lieven wrote:
> 
> Am 19.03.2013 um 16:54 schrieb Eric Blake <eblake@redhat.com>:
> 
>> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>> include/qemu-common.h |    2 ++
>>> util/cutils.c         |   40 ++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 42 insertions(+)
>>>
>>
>>>
>>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>
>> Good.
>>
>>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 
>>> +        * sizeof(VECTYPE)) == 0);
>>
>> Good use of it.
> 
> A question I had is if these asserts make the code slower? In case of all
> the 4k pages in RAM migration this could be significant.

Yes, it probably does slow things down when asserts are enabled.  I'm
not sure whether it is better to keep the code forcefully robust, or to
rely on auditing all callers for properly obeying assumptions.  Maybe
you should wait for an opinion from one of the maintainers.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit()
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
@ 2013-03-19 16:49   ` Eric Blake
  2013-03-19 19:40     ` Peter Lieven
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Blake @ 2013-03-19 16:49 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2330 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> this patch adds the usage of buffer_find_nonzero_offset()
> to skip large areas of zeroes.
> 
> compared to loop unrolling presented in an earlier
> patch this adds another 50% performance benefit for
> skipping large areas of zeroes. loop unrolling alone
> added close to 100% speedup.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  util/bitops.c |   26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)

> +    while (size >= BITS_PER_LONG) {
> +        if ((tmp = *p)) {
> +             goto found_middle;
> +        }
> +        if (((uintptr_t) p) % sizeof(VECTYPE) == 0 
> +                && size >= BITS_PER_BYTE * sizeof(VECTYPE)
> +                   * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {

Another instance where a helper function to check for alignment would be
nice.  Except this time you have a BITS_PER_BYTE factor, so you would be
calling something like buffer_can_use_vectors(buf, size / BITS_PER_BYTE)

> +            unsigned long tmp2 =
> +                buffer_find_nonzero_offset(p, ((size / BITS_PER_BYTE) & 
> +                           ~(BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 
> +                             sizeof(VECTYPE) - 1)));

Type mismatch - buffer_find_nonzero_offset returns size_t, which isn't
necessarily the same size as unsigned long.  I'm not sure if it can bite
you.

> +            result += tmp2 * BITS_PER_BYTE;
> +            size -= tmp2 * BITS_PER_BYTE;
> +            p += tmp2 / sizeof(unsigned long);
> +            if (!size) {
> +                return result;
> +            }
> +            if (tmp2) {

Do you really need this condition, or would it suffice to just
'continue;' the loop?  Once buffer_find_nonzero_offset returns anything
that leaves size as non-zero, we are guaranteed that the loop will goto
found_middle without any further calls to buffer_find_nonzero_offset.

> +                if ((tmp = *p)) {
> +                    goto found_middle;
> +                }
> +            }
>          }
> +        p++;
>          result += BITS_PER_LONG;
>          size -= BITS_PER_LONG;
>      }
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 5/9] migration: search for zero instead of dup pages
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 5/9] migration: search for zero instead of dup pages Peter Lieven
@ 2013-03-19 16:55   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2013-03-19 16:55 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> virtually all dup pages are zero pages. remove
> the speical is_dup_page() function and use the

s/speical/special/

> optimized buffer_is_zero() function instead.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |   17 +----------------
>  1 file changed, 1 insertion(+), 16 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 6/9] migration: add an indicator for bulk state of ram migration
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
@ 2013-03-19 17:32   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2013-03-19 17:32 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 515 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> the first round of ram transfer is special since all pages
> are dirty and thus all memory pages are transferred to
> the target. this patch adds a boolean variable to track
> this stage.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |    3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage Peter Lieven
@ 2013-03-19 17:36   ` Eric Blake
  2013-03-19 19:35     ` Peter Lieven
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Blake @ 2013-03-19 17:36 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1853 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> during bulk stage of ram migration if a page is a
> zero page do not send it at all.
> the memory at the destination reads as zero anyway.
> 
> even if there is an madvise with QEMU_MADV_DONTNEED
> at the target upon receival of a zero page I have observed

s/receival/receipt/

> that the target starts swapping if the memory is overcommitted.
> it seems that the pages are dropped asynchronously.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |    8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)

> +++ b/arch_init.c
> @@ -432,9 +432,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>              bytes_sent = -1;
>              if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>                  acct_info.dup_pages++;
> -                bytes_sent = save_block_hdr(f, block, offset, cont,
> -                                            RAM_SAVE_FLAG_COMPRESS);
> -                qemu_put_byte(f, *p);
> +                if (!ram_bulk_stage) {
> +                    bytes_sent = save_block_hdr(f, block, offset, cont,
> +                                                RAM_SAVE_FLAG_COMPRESS);
> +                    qemu_put_byte(f, *p);

Hmm, in patch 5/9, why not 'qemu_put_byte(f, 0)' instead of spending
time dereferencing *p, since you already know the byte is 0?

> +                }
>                  bytes_sent += 1;

Your accounting is now off.  This increments bytes_sent even when
nothing was sent in the bulk phase.  When touching this line, consider
using ++ instead of += 1.

>              } else if (migrate_use_xbzrle()) {
>                  current_addr = block->offset + offset;
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty pages in bulk stage
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty " Peter Lieven
@ 2013-03-19 17:40   ` Eric Blake
  2013-03-19 19:29     ` Peter Lieven
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Blake @ 2013-03-19 17:40 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> avoid searching for dirty pages just increment the
> page offset. all pages are dirty anyway.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |   12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)

> 
> diff --git a/arch_init.c b/arch_init.c
> index a3dc20d..ca281ad 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -321,8 +321,14 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
>      unsigned long nr = base + (start >> TARGET_PAGE_BITS);
>      unsigned long size = base + (int128_get64(mr->size) >> TARGET_PAGE_BITS);
>  
> -    unsigned long next = find_next_bit(migration_bitmap, size, nr);
> -
> +    unsigned long next;
> +    
> +    if (ram_bulk_stage) {
> +        next = nr + 1;
> +    } else {
> +        next = find_next_bit(migration_bitmap, size, nr);
> +    }

This part makes sense.

> +    
>      if (next < size) {
>          clear_bit(next, migration_bitmap);
>          migration_dirty_pages--;
> @@ -523,7 +529,7 @@ static void reset_ram_globals(void)
>  {
>      last_seen_block = NULL;
>      last_sent_block = NULL;
> -    last_offset = 0;
> +    last_offset = -1;

But what is this change doing?  I don't see it mentioned in the commit
message.  Does it belong to a different commit?

>      last_version = ram_list.version;
>      ram_bulk_stage = true;
>  }
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 9/9] migration: use XBZRLE only after bulk stage
  2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 9/9] migration: use XBZRLE only after " Peter Lieven
@ 2013-03-19 17:43   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2013-03-19 17:43 UTC (permalink / raw)
  To: Peter Lieven; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2582 bytes --]

On 03/15/2013 09:50 AM, Peter Lieven wrote:
> at the beginning of migration all pages are marked dirty and
> in the first round a bulk migration of all pages is performed.
> 
> currently all these pages are copied to the page cache regardless
> if there are frequently updated or not. this doesn't make sense
> since most of these pages are never transferred again.
> 
> this patch changes the XBZRLE transfer to only be used after
> the bulk stage has been completed. that means a page is added
> to the page cache the second time it is transferred and XBZRLE
> can benefit from the third time of transfer.
> 
> since the page cache is likely smaller than the number of pages
> its also likely that in the second round the page is missing in the
> cache due to collisions in the bulk phase.
> 
> on the other hand a lot of unneccssary mallocs, memdups and frees

s/unneccssary/unnecessary/

> are saved.
> 
> the following results have been taken earlier while executing
> the test program from docs/xbzrle.txt. (+) with the patch and (-)
> without. (thanks to Eric Blake for reformatting and comments)

Hey - I know that guy :)

> 
> + total time: 22185 milliseconds
> - total time: 22410 milliseconds
> 
> Shaved 0.3 seconds, better than 1%!
> 
> + downtime: 29 milliseconds
> - downtime: 21 milliseconds
> 
> Not sure why downtime seemed worse, but probably not the end of the world.
> 
> + transferred ram: 706034 kbytes
> - transferred ram: 721318 kbytes
> 
> Fewer bytes sent - good.
> 
> + remaining ram: 0 kbytes
> - remaining ram: 0 kbytes
> + total ram: 1057216 kbytes
> - total ram: 1057216 kbytes
> + duplicate: 108556 pages
> - duplicate: 105553 pages
> + normal: 175146 pages
> - normal: 179589 pages
> + normal bytes: 700584 kbytes
> - normal bytes: 718356 kbytes
> 
> Fewer normal bytes...
> 
> + cache size: 67108864 bytes
> - cache size: 67108864 bytes
> + xbzrle transferred: 3127 kbytes
> - xbzrle transferred: 630 kbytes
> 
> ...and more compressed pages sent - good.
> 
> + xbzrle pages: 117811 pages
> - xbzrle pages: 21527 pages
> + xbzrle cache miss: 18750
> - xbzrle cache miss: 179589
> 
> And very good improvement on the cache miss rate.
> 
> + xbzrle overflow : 0
> - xbzrle overflow : 0
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  arch_init.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty pages in bulk stage
  2013-03-19 17:40   ` Eric Blake
@ 2013-03-19 19:29     ` Peter Lieven
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 19:29 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 18:40 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> avoid searching for dirty pages just increment the
>> page offset. all pages are dirty anyway.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> arch_init.c |   12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
> 
>> 
>> diff --git a/arch_init.c b/arch_init.c
>> index a3dc20d..ca281ad 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -321,8 +321,14 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(MemoryRegion *mr,
>>     unsigned long nr = base + (start >> TARGET_PAGE_BITS);
>>     unsigned long size = base + (int128_get64(mr->size) >> TARGET_PAGE_BITS);
>> 
>> -    unsigned long next = find_next_bit(migration_bitmap, size, nr);
>> -
>> +    unsigned long next;
>> +    
>> +    if (ram_bulk_stage) {
>> +        next = nr + 1;
>> +    } else {
>> +        next = find_next_bit(migration_bitmap, size, nr);
>> +    }
> 
> This part makes sense.
> 
>> +    
>>     if (next < size) {
>>         clear_bit(next, migration_bitmap);
>>         migration_dirty_pages--;
>> @@ -523,7 +529,7 @@ static void reset_ram_globals(void)
>> {
>>     last_seen_block = NULL;
>>     last_sent_block = NULL;
>> -    last_offset = 0;
>> +    last_offset = -1;
> 
> But what is this change doing?  I don't see it mentioned in the commit
> message.  Does it belong to a different commit?

Sorry, i forgot to comment this. If in bulk stage next is nr + 1. I was aiming
to have next = 0 with last_offset = -1, but actually I think I'm wrong here,
I missed the (start >> TARGET_PAGE_BITS) at the beginning of migration_bitmap_find_and_reset_dirty.
If last_offset is initialized with 0. next is set to 1.

Peter

> 
>>     last_version = ram_list.version;
>>     ram_bulk_stage = true;
>> }
>> 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage
  2013-03-19 17:36   ` Eric Blake
@ 2013-03-19 19:35     ` Peter Lieven
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 19:35 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 18:36 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> during bulk stage of ram migration if a page is a
>> zero page do not send it at all.
>> the memory at the destination reads as zero anyway.
>> 
>> even if there is an madvise with QEMU_MADV_DONTNEED
>> at the target upon receival of a zero page I have observed
> 
> s/receival/receipt/
> 
>> that the target starts swapping if the memory is overcommitted.
>> it seems that the pages are dropped asynchronously.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> arch_init.c |    8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
> 
>> +++ b/arch_init.c
>> @@ -432,9 +432,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>             bytes_sent = -1;
>>             if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {
>>                 acct_info.dup_pages++;
>> -                bytes_sent = save_block_hdr(f, block, offset, cont,
>> -                                            RAM_SAVE_FLAG_COMPRESS);
>> -                qemu_put_byte(f, *p);
>> +                if (!ram_bulk_stage) {
>> +                    bytes_sent = save_block_hdr(f, block, offset, cont,
>> +                                                RAM_SAVE_FLAG_COMPRESS);
>> +                    qemu_put_byte(f, *p);
> 
> Hmm, in patch 5/9, why not 'qemu_put_byte(f, 0)' instead of spending
> time dereferencing *p, since you already know the byte is 0?

Good idea. Actually to avoid unnecessary checks in buffer_is_zero I think
I would like to change "if (buffer_is_zero(p, TARGET_PAGE_SIZE)) {..} "  to

bool inline is_zero_page(uint8_t *p) {return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) == TARGET_PAGE_SIZE);}
...
if (is_zero_page(p)) {
…
}

also for better readability. For pages we know that they satisfy
can_use_buffer_find_nonzero_offset().

Peter


> 
>> +                }
>>                 bytes_sent += 1;
> 
> Your accounting is now off.  This increments bytes_sent even when
> nothing was sent in the bulk phase.  When touching this line, consider
> using ++ instead of += 1.
> 
>>             } else if (migrate_use_xbzrle()) {
>>                 current_addr = block->offset + offset;
>> 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit()
  2013-03-19 16:49   ` Eric Blake
@ 2013-03-19 19:40     ` Peter Lieven
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 19:40 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 17:49 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> this patch adds the usage of buffer_find_nonzero_offset()
>> to skip large areas of zeroes.
>> 
>> compared to loop unrolling presented in an earlier
>> patch this adds another 50% performance benefit for
>> skipping large areas of zeroes. loop unrolling alone
>> added close to 100% speedup.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> util/bitops.c |   26 +++++++++++++++++++++++---
>> 1 file changed, 23 insertions(+), 3 deletions(-)
> 
>> +    while (size >= BITS_PER_LONG) {
>> +        if ((tmp = *p)) {
>> +             goto found_middle;
>> +        }
>> +        if (((uintptr_t) p) % sizeof(VECTYPE) == 0 
>> +                && size >= BITS_PER_BYTE * sizeof(VECTYPE)
>> +                   * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
> 
> Another instance where a helper function to check for alignment would be
> nice.  Except this time you have a BITS_PER_BYTE factor, so you would be
> calling something like buffer_can_use_vectors(buf, size / BITS_PER_BYTE)
> 
>> +            unsigned long tmp2 =
>> +                buffer_find_nonzero_offset(p, ((size / BITS_PER_BYTE) & 
>> +                           ~(BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * 
>> +                             sizeof(VECTYPE) - 1)));
> 
> Type mismatch - buffer_find_nonzero_offset returns size_t, which isn't
> necessarily the same size as unsigned long.  I'm not sure if it can bite
> you.

I will look into it.

> 
>> +            result += tmp2 * BITS_PER_BYTE;
>> +            size -= tmp2 * BITS_PER_BYTE;
>> +            p += tmp2 / sizeof(unsigned long);
>> +            if (!size) {
>> +                return result;
>> +            }
>> +            if (tmp2) {
> 
> Do you really need this condition, or would it suffice to just
> 'continue;' the loop?  Once buffer_find_nonzero_offset returns anything
> that leaves size as non-zero, we are guaranteed that the loop will goto
> found_middle without any further calls to buffer_find_nonzero_offset.

Note in all cases. It will do if the nonzero content is in the first sizeof(unsigned long)
bytes. If not, buffer_find_nonzero_offset() is called again. It will return 0 because
in the first sizeof(VECTYPE)*BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
bytes is a non-zero byte. To avoid this I placed this check.

Peter


> 
>> +                if ((tmp = *p)) {
>> +                    goto found_middle;
>> +                }
>> +            }
>>         }
>> +        p++;
>>         result += BITS_PER_LONG;
>>         size -= BITS_PER_LONG;
>>     }
>> 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer
  2013-03-19 16:43       ` Eric Blake
@ 2013-03-19 19:42         ` Peter Lieven
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 19:42 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 17:43 schrieb Eric Blake <eblake@redhat.com>:

> On 03/19/2013 10:18 AM, Peter Lieven wrote:
>> 
>> Am 19.03.2013 um 16:54 schrieb Eric Blake <eblake@redhat.com>:
>> 
>>> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>> include/qemu-common.h |    2 ++
>>>> util/cutils.c         |   40 ++++++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 42 insertions(+)
>>>> 
>>> 
>>>> 
>>>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>> 
>>> Good.
>>> 
>>>> +    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 
>>>> +        * sizeof(VECTYPE)) == 0);
>>> 
>>> Good use of it.
>> 
>> A question I had is if these asserts make the code slower? In case of all
>> the 4k pages in RAM migration this could be significant.
> 
> Yes, it probably does slow things down when asserts are enabled.  I'm
> not sure whether it is better to keep the code forcefully robust, or to
> rely on auditing all callers for properly obeying assumptions.  Maybe
> you should wait for an opinion from one of the maintainers.

If I follow your idea to add a can_use_buffer_find_nonzero_offset() function
and add it to the description of the function my feeling is that we can drop the asserts.
Actually we are always checking the same things twice.

Peter

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible
  2013-03-19 16:08   ` Eric Blake
  2013-03-19 16:14     ` Peter Lieven
@ 2013-03-19 19:44     ` Peter Lieven
  1 sibling, 0 replies; 27+ messages in thread
From: Peter Lieven @ 2013-03-19 19:44 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel


Am 19.03.2013 um 17:08 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> performance gain on SSE2 is approx. 20-25%. altivec
>> is not tested. performance for unsigned long arithmetic
>> is unchanged.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> util/cutils.c |    7 +++++++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 857dd7d..00d98fb 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>  */
>> bool buffer_is_zero(const void *buf, size_t len)
>> {
>> +    /* use vector optimized zero check if possible */
>> +    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
>> +          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +             * sizeof(VECTYPE)) == 0) {
> 
> Is it worth factoring this check into something more reusable, by adding
> something like 'bool buffer_can_use_vectors(buf, len)' in patch 2/9?
> 
>> +        return buffer_find_nonzero_offset(buf, len)==len;
> 
> Spaces around binary operators.
> 
> Is it worth rewriting this function into a simpler:
> 
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs until
> we are aligned
> check buffer_find_nonzero_offset on the aligned middle
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs at tail
> 
> instead of having two instances of code that can loop over the entire
> buffer?  Otherwise, searching for zeros on an unaligned buffer will
> remain slower, even though the bulk of the search could still benefit
> from the vector operations.
> 
>> +    }
>> +
>>     /*
>>      * Use long as the biggest available internal data type that fits into the
>>      * CPU register and unroll the loop to smooth out the effect of memory
> 
> Your patch results in C99 declarations after statements; while we
> require C99, I'm not sure if qemu prefers to stick to the C89 style of
> declarations before statements.

See my comments to patch 7/9 regarding is_zero_page(). I will move the
call to the vector optimized version after the variable definitions. The
speed concern at that point was mainline to use it for zero page checking.

Peter

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-03-19 19:43 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-15 15:50 [Qemu-devel] [PATCHv2 0/9] buffer_is_zero / migration optimizations Peter Lieven
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 1/9] move vector definitions to qemu-common.h Peter Lieven
2013-03-19 15:35   ` Eric Blake
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 2/9] cutils: add a function to find non-zero content in a buffer Peter Lieven
2013-03-19 15:54   ` Eric Blake
2013-03-19 16:18     ` Peter Lieven
2013-03-19 16:43       ` Eric Blake
2013-03-19 19:42         ` Peter Lieven
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 3/9] buffer_is_zero: use vector optimizations if possible Peter Lieven
2013-03-19 16:08   ` Eric Blake
2013-03-19 16:14     ` Peter Lieven
2013-03-19 19:44     ` Peter Lieven
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 4/9] bitops: use vector algorithm to optimize find_next_bit() Peter Lieven
2013-03-19 16:49   ` Eric Blake
2013-03-19 19:40     ` Peter Lieven
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 5/9] migration: search for zero instead of dup pages Peter Lieven
2013-03-19 16:55   ` Eric Blake
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 6/9] migration: add an indicator for bulk state of ram migration Peter Lieven
2013-03-19 17:32   ` Eric Blake
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 7/9] migration: do not sent zero pages in bulk stage Peter Lieven
2013-03-19 17:36   ` Eric Blake
2013-03-19 19:35     ` Peter Lieven
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 8/9] migration: do not search dirty " Peter Lieven
2013-03-19 17:40   ` Eric Blake
2013-03-19 19:29     ` Peter Lieven
2013-03-15 15:50 ` [Qemu-devel] [PATCHv2 9/9] migration: use XBZRLE only after " Peter Lieven
2013-03-19 17:43   ` Eric Blake

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.