* [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Summary
=====================

Based-on: <20190426062705.4651-1-peterx@redhat.com>

(It's "[PATCH] checkpatch: allow SPDX-License-Identifier", not a big
 deal, just to make sure no spoil message since one patch used SPDX
 license identifier and checkpatch doesn't like it...)

This series allows QEMU to start using the new KVM_CLEAR_DIRTY_LOG
interface. For more on KVM_CLEAR_DIRTY_LOG itself, please refer to:

  https://github.com/torvalds/linux/blob/master/Documentation/virtual/kvm/api.txt#L3810

The QEMU work (which is this series) is pushed too; please find the
tree here:

  https://github.com/xzpeter/qemu/tree/kvm-clear-dirty-log

Meanwhile, for anyone who really wants to try this out, please also
upgrade the host kernel using the tree here:

  https://github.com/xzpeter/linux/tree/kvm-clear-dirty-fixes

The new kernel is required because there are still some fixes needed
to make the whole thing work; they have not been finalized yet and
haven't reached Linux master.

Design
===================

I started with a naive design that simply passed all 1's to KVM for a
memory range in order to clear all the dirty bits within that range,
but then I encountered guest oopses - simply because we can't clear
any dirty bit from QEMU unless we are _sure_ that the bit is dirty in
the kernel.  Otherwise we might accidentally clear a bit that we don't
even know about (e.g., the bit was clear in migration's dirty bitmap
in QEMU) while that page was actually just being written, so QEMU
would never remember to migrate that new page again.

The new design is based on a dirty bitmap cache within the QEMU kvm
layer (one per kvm memory slot).  With that we know what was dirty in
the kernel previously (note! the kernel bitmap keeps growing all the
time, so the cache will only be a subset of the real-time kernel
bitmap, but that's good enough for us), and with that we'll be sure
not to accidentally clear unknown dirty pages.

With this method, we also avoid races when multiple users (e.g.,
DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the same
bit multiple times.  Without the kvm memslot's cached dirty bitmap we
wouldn't be able to know which bits have already been cleared, and if
we sent the CLEAR operation upon the same bit twice (or more) we could
still accidentally clear something we shouldn't.
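
To illustrate the invariant, here is a minimal sketch (a hypothetical
helper, not the actual patch code; see patch 10 for the real thing):

    /*
     * Only bits set in the per-slot cache are known-dirty in the
     * kernel, so only those may be sent to KVM_CLEAR_DIRTY_LOG; once
     * sent, clear them from the cache too so that a second user
     * cannot clear them remotely again.
     */
    static void slot_clear_range(unsigned long *cache,
                                 unsigned long *to_clear,
                                 long start, long npages)
    {
        long i;

        for (i = start; i < start + npages; i++) {
            if (test_bit(i, cache)) {
                set_bit(i, to_clear);   /* safe to clear remotely */
                clear_bit(i, cache);    /* never clear it twice */
            }
        }
    }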

Summary: we really need to be careful about which bits to clear,
otherwise anything can go wrong after the migration completes.  I
hope this series has taken all of this into account.

Besides the new KVM cache layer and the new ioctl support, this
series introduces memory_region_clear_dirty_bitmap() in the memory
API layer to allow clearing the dirty bits of a specific memory range
within a memory region.
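
For reference, the new API (introduced in patch 6) is expected to be
used like this (a usage sketch; mr/offset/size are placeholders):

    /* Clear the remote (e.g. KVM) dirty log for a range within mr */
    memory_region_clear_dirty_bitmap(mr, offset_within_region, size);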

Implementations
============================

Patch 1-3:  these are not directly related to the series, but they
            are things I found while working on it.  They can be
            picked up earlier if reviewers are happy with them.

Patch 4:    pre-work on bitmap operations, and within the patch I added
            the first unit test for utils/bitmap.c.

Patch 5-6:  the new memory API interface.  Since no one provides
            log_clear() yet, it does nothing yet.  Note that this
            only splits the dirty clear operation from sync; the
            clear is not yet split into smaller chunks, so it's not
            really helpful for us on its own.

Patch 7-10: KVM support for KVM_CLEAR_DIRTY_LOG.

Patch 11:   do the log_clear() splitting for the case of migration.
            Also a new parameter is introduced to define the block
            size of the small chunks (the unit in which dirty bits
            are cleared).

Tests
===========================

- make check
- migrate idle/memory-heavy guests

(Not yet tested with huge guests, but it would be more than welcome
 if someone has the resources and wants to give it a shot.)

Please have a look, thanks.

Peter Xu (11):
  migration: No need to take rcu during sync_dirty_bitmap
  memory: Remove memory_region_get_dirty()
  memory: Don't set migration bitmap when without migration
  bitmap: Add bitmap_copy_with_{src|dst}_offset()
  memory: Pass mr into snapshot_and_clear_dirty
  memory: Introduce memory listener hook log_clear()
  kvm: Update comments for sync_dirty_bitmap
  kvm: Persistent per kvmslot dirty bitmap
  kvm: Introduce slots lock for memory listener
  kvm: Support KVM_CLEAR_DIRTY_LOG
  migration: Split log_clear() into smaller chunks

 accel/kvm/kvm-all.c      | 287 ++++++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events   |   1 +
 exec.c                   |  15 +-
 include/exec/memory.h    |  36 ++---
 include/exec/ram_addr.h  |  91 ++++++++++++-
 include/qemu/bitmap.h    |   9 ++
 include/sysemu/kvm_int.h |   4 +
 memory.c                 |  64 +++++++--
 migration/migration.c    |   4 +
 migration/migration.h    |  27 ++++
 migration/ram.c          |  45 ++++++
 migration/trace-events   |   1 +
 tests/Makefile.include   |   2 +
 tests/test-bitmap.c      |  81 +++++++++++
 util/bitmap.c            |  73 ++++++++++
 15 files changed, 671 insertions(+), 69 deletions(-)
 create mode 100644 tests/test-bitmap.c

-- 
2.17.1



* [Qemu-devel] [PATCH 01/11] migration: No need to take rcu during sync_dirty_bitmap
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

cpu_physical_memory_sync_dirty_bitmap() takes a RAMBlock* as a
parameter, which means that the RCU read lock must already be held by
the caller.  Taking it again inside is redundant.  Remove it, and
instead document the RCU read lock requirement on the functions.
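
For illustration, a caller then looks like this (a minimal sketch;
the real call site lives in migration/ram.c):

    rcu_read_lock();
    migration_bitmap_sync_range(rs, block, 0, block->used_length);
    rcu_read_unlock();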

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ram_addr.h | 5 +----
 migration/ram.c         | 1 +
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 139ad79390..993fb760f3 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -408,6 +408,7 @@ static inline void cpu_physical_memory_clear_dirty_range(ram_addr_t start,
 }
 
 
+/* Must be called with the RCU read lock held */
 static inline
 uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                                                ram_addr_t start,
@@ -431,8 +432,6 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                                         DIRTY_MEMORY_BLOCK_SIZE);
         unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
 
-        rcu_read_lock();
-
         src = atomic_rcu_read(
                 &ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION])->blocks;
 
@@ -452,8 +451,6 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                 idx++;
             }
         }
-
-        rcu_read_unlock();
     } else {
         ram_addr_t offset = rb->offset;
 
diff --git a/migration/ram.c b/migration/ram.c
index 1ca9ba77b6..3e11806ccd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1680,6 +1680,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
+/* Must be called with the RCU read lock held */
 static void migration_bitmap_sync_range(RAMState *rs, RAMBlock *rb,
                                         ram_addr_t start, ram_addr_t length)
 {
-- 
2.17.1



* [Qemu-devel] [PATCH 02/11] memory: Remove memory_region_get_dirty()
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

It's never used anywhere.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 17 -----------------
 memory.c              |  8 --------
 2 files changed, 25 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9144a47f57..e6140e8a04 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1254,23 +1254,6 @@ void memory_region_ram_resize(MemoryRegion *mr, ram_addr_t newsize,
  */
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client);
 
-/**
- * memory_region_get_dirty: Check whether a range of bytes is dirty
- *                          for a specified client.
- *
- * Checks whether a range of bytes has been written to since the last
- * call to memory_region_reset_dirty() with the same @client.  Dirty logging
- * must be enabled.
- *
- * @mr: the memory region being queried.
- * @addr: the address (relative to the start of the region) being queried.
- * @size: the size of the range being queried.
- * @client: the user of the logging information; %DIRTY_MEMORY_MIGRATION or
- *          %DIRTY_MEMORY_VGA.
- */
-bool memory_region_get_dirty(MemoryRegion *mr, hwaddr addr,
-                             hwaddr size, unsigned client);
-
 /**
  * memory_region_set_dirty: Mark a range of bytes as dirty in a memory region.
  *
diff --git a/memory.c b/memory.c
index bb2b71ee38..0cce3a6fc5 100644
--- a/memory.c
+++ b/memory.c
@@ -2027,14 +2027,6 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
     memory_region_transaction_commit();
 }
 
-bool memory_region_get_dirty(MemoryRegion *mr, hwaddr addr,
-                             hwaddr size, unsigned client)
-{
-    assert(mr->ram_block);
-    return cpu_physical_memory_get_dirty(memory_region_get_ram_addr(mr) + addr,
-                                         size, client);
-}
-
 void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                              hwaddr size)
 {
-- 
2.17.1



* [Qemu-devel] [PATCH 03/11] memory: Don't set migration bitmap when without migration
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Similar to 9460dee4b2 ("memory: do not touch code dirty bitmap unless
TCG is enabled", 2015-06-05), but for the migration bitmap - we can
skip the MIGRATION bitmap update if migration is not enabled.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h   |  2 ++
 include/exec/ram_addr.h | 12 +++++++++++-
 memory.c                |  2 +-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index e6140e8a04..f29300c54d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -46,6 +46,8 @@
         OBJECT_GET_CLASS(IOMMUMemoryRegionClass, (obj), \
                          TYPE_IOMMU_MEMORY_REGION)
 
+extern bool global_dirty_log;
+
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionMmio MemoryRegionMmio;
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 993fb760f3..86bc8e1a4a 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -348,8 +348,13 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
             if (bitmap[k]) {
                 unsigned long temp = leul_to_cpu(bitmap[k]);
 
-                atomic_or(&blocks[DIRTY_MEMORY_MIGRATION][idx][offset], temp);
                 atomic_or(&blocks[DIRTY_MEMORY_VGA][idx][offset], temp);
+
+                if (global_dirty_log) {
+                    atomic_or(&blocks[DIRTY_MEMORY_MIGRATION][idx][offset],
+                              temp);
+                }
+
                 if (tcg_enabled()) {
                     atomic_or(&blocks[DIRTY_MEMORY_CODE][idx][offset], temp);
                 }
@@ -366,6 +371,11 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
         xen_hvm_modified_memory(start, pages << TARGET_PAGE_BITS);
     } else {
         uint8_t clients = tcg_enabled() ? DIRTY_CLIENTS_ALL : DIRTY_CLIENTS_NOCODE;
+
+        if (!global_dirty_log) {
+            clients &= ~(1 << DIRTY_MEMORY_MIGRATION);
+        }
+
         /*
          * bitmap-traveling is faster than memory-traveling (for addr...)
          * especially when most of the memory is not dirty.
diff --git a/memory.c b/memory.c
index 0cce3a6fc5..1cc3531df5 100644
--- a/memory.c
+++ b/memory.c
@@ -38,7 +38,7 @@
 static unsigned memory_region_transaction_depth;
 static bool memory_region_update_pending;
 static bool ioeventfd_update_pending;
-static bool global_dirty_log = false;
+bool global_dirty_log;
 
 static QTAILQ_HEAD(, MemoryListener) memory_listeners
     = QTAILQ_HEAD_INITIALIZER(memory_listeners);
-- 
2.17.1



* [Qemu-devel] [PATCH 04/11] bitmap: Add bitmap_copy_with_{src|dst}_offset()
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

These helpers copy the source bitmap into the destination bitmap with
a shift applied on either the src or the dst bitmap.

Meanwhile, we have never had unit tests for bitmaps, but we should.

This patch also introduces the initial test cases for utils/bitmap.c,
though it only covers the newly introduced functions.
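
For example (a small usage sketch of the new helpers; bit ranges are
half-open):

    unsigned long *src = bitmap_new(256);
    unsigned long *dst1 = bitmap_new(256), *dst2 = bitmap_new(256);

    bitmap_set(src, 70, 10);                         /* src bits [70, 80)   */
    bitmap_copy_with_src_offset(dst1, src, 65, 128); /* dst1 bits [5, 15)   */
    bitmap_copy_with_dst_offset(dst2, src, 65, 128); /* dst2 bits [135, 145) */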

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/bitmap.h  |  9 +++++
 tests/Makefile.include |  2 ++
 tests/test-bitmap.c    | 81 ++++++++++++++++++++++++++++++++++++++++++
 util/bitmap.c          | 73 +++++++++++++++++++++++++++++++++++++
 4 files changed, 165 insertions(+)
 create mode 100644 tests/test-bitmap.c

diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 5c313346b9..cdaa953371 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -41,6 +41,10 @@
  * bitmap_find_next_zero_area(buf, len, pos, n, mask)	Find bit free area
  * bitmap_to_le(dst, src, nbits)      Convert bitmap to little endian
  * bitmap_from_le(dst, src, nbits)    Convert bitmap from little endian
+ * bitmap_copy_with_src_offset(dst, src, offset, nbits)
+ *                                    *dst = *src (with an offset upon src)
+ * bitmap_copy_with_dst_offset(dst, src, offset, nbits)
+ *                                    *dst = *src (with an offset upon dst)
  */
 
 /*
@@ -271,4 +275,9 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
 void bitmap_from_le(unsigned long *dst, const unsigned long *src,
                     long nbits);
 
+void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits);
+void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits);
+
 #endif /* BITMAP_H */
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7c8b9c84b2..eb14b79764 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -69,6 +69,7 @@ check-unit-y += tests/test-aio-multithread$(EXESUF)
 check-unit-y += tests/test-throttle$(EXESUF)
 check-unit-y += tests/test-thread-pool$(EXESUF)
 check-unit-y += tests/test-hbitmap$(EXESUF)
+check-unit-y += tests/test-bitmap$(EXESUF)
 check-unit-y += tests/test-bdrv-drain$(EXESUF)
 check-unit-y += tests/test-bdrv-graph-mod$(EXESUF)
 check-unit-y += tests/test-blockjob$(EXESUF)
@@ -548,6 +549,7 @@ tests/test-image-locking$(EXESUF): tests/test-image-locking.o $(test-block-obj-y
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) $(test-crypto-obj-y)
+tests/test-bitmap$(EXESUF): tests/test-bitmap.o $(test-util-obj-y)
 tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
 tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o migration/page_cache.o $(test-util-obj-y)
 tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o $(test-util-obj-y)
diff --git a/tests/test-bitmap.c b/tests/test-bitmap.c
new file mode 100644
index 0000000000..36b4c07bf2
--- /dev/null
+++ b/tests/test-bitmap.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Bitmap.c unit-tests.
+ *
+ * Copyright (C) 2019, Red Hat, Inc.
+ *
+ * Author: Peter Xu <peterx@redhat.com>
+ */
+
+#include "qemu/osdep.h"
+#include <stdlib.h>
+#include "qemu/bitmap.h"
+
+#define BMAP_SIZE  1024
+
+static void check_bitmap_copy_with_offset(void)
+{
+    int i;
+    unsigned long *bmap1, *bmap2, *bmap3, total;
+
+    bmap1 = bitmap_new(BMAP_SIZE);
+    bmap2 = bitmap_new(BMAP_SIZE);
+    bmap3 = bitmap_new(BMAP_SIZE);
+
+    *bmap1 = random();
+    *(bmap1 + 1) = random();
+    *(bmap1 + 2) = random();
+    *(bmap1 + 3) = random();
+    total = BITS_PER_LONG * 4;
+
+    /* Shift 115 bits into bmap2 */
+    bitmap_copy_with_dst_offset(bmap2, bmap1, 115, total);
+    /* Shift another 85 bits into bmap3 */
+    bitmap_copy_with_dst_offset(bmap3, bmap2, 85, total + 115);
+    /* Shift back 200 bits */
+    bitmap_copy_with_src_offset(bmap2, bmap3, 200, total);
+
+    for (i = 0; i < 3; i++) {
+        g_assert(*(bmap1 + i) == *(bmap2 + i));
+    }
+
+    bitmap_clear(bmap1, 0, BMAP_SIZE);
+    /* Set bits in bmap1: [100, 245) */
+    bitmap_set(bmap1, 100, 145);
+
+    /* Expect set bits in bmap2: [60, 205) */
+    bitmap_copy_with_src_offset(bmap2, bmap1, 40, 250);
+    for (i = 0; i < 60; i++) {
+        g_assert(test_bit(i, bmap2) == 0);
+    }
+    for (i = 60; i < 205; i++) {
+        g_assert(test_bit(i, bmap2));
+    }
+    g_assert(test_bit(205, bmap2) == 0);
+
+    /* Expect set bits in bmap3: [135, 280) */
+    bitmap_copy_with_dst_offset(bmap3, bmap1, 35, 250);
+    for (i = 0; i < 135; i++) {
+        g_assert(test_bit(i, bmap3) == 0);
+    }
+    for (i = 135; i < 280; i++) {
+        g_assert(test_bit(i, bmap3));
+    }
+    g_assert(test_bit(280, bmap3) == 0);
+
+    g_free(bmap1);
+    g_free(bmap2);
+    g_free(bmap3);
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+
+    g_test_add_func("/bitmap/bitmap_copy_with_offset",
+                    check_bitmap_copy_with_offset);
+
+    g_test_run();
+
+    return 0;
+}
diff --git a/util/bitmap.c b/util/bitmap.c
index cb618c65a5..391a7bb744 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -402,3 +402,76 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
 {
     bitmap_to_from_le(dst, src, nbits);
 }
+
+/*
+ * Copy the "src" bitmap with a positive offset and put it into the
+ * "dst" bitmap.  The caller needs to make sure the bitmap size of
+ * "src" is at least (shift + nbits).
+ */
+void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits)
+{
+    unsigned long left_mask, right_mask, last_mask;
+
+    assert(shift >= 0);
+    /* Proper shift src pointer to the first word to copy from */
+    src += BIT_WORD(shift);
+    shift %= BITS_PER_LONG;
+    right_mask = (1ul << shift) - 1;
+    left_mask = ~right_mask;
+
+    while (nbits >= BITS_PER_LONG) {
+        *dst = (*src & left_mask) >> shift;
+        *dst |= (*(src + 1) & right_mask) << (BITS_PER_LONG - shift);
+        dst++;
+        src++;
+        nbits -= BITS_PER_LONG;
+    }
+
+    if (nbits > BITS_PER_LONG - shift) {
+        *dst = (*src & left_mask) >> shift;
+        nbits -= BITS_PER_LONG - shift;
+        last_mask = (1ul << nbits) - 1;
+        *dst |= (*(src + 1) & last_mask) << (BITS_PER_LONG - shift);
+    } else if (nbits) {
+        last_mask = (1ul << nbits) - 1;
+        *dst = (*src >> shift) & last_mask;
+    }
+}
+
+/*
+ * Copy "src" bitmap into the "dst" bitmap with an offset in the
+ * "dst".  The caller needs to make sure the bitmap size of "dst" is
+ * at least (shift + nbits).
+ */
+void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits)
+{
+    unsigned long left_mask, right_mask, last_mask;
+
+    assert(shift >= 0);
+    /* Advance the dst pointer to the first word to copy to */
+    dst += BIT_WORD(shift);
+    shift %= BITS_PER_LONG;
+    right_mask = (1ul << (BITS_PER_LONG - shift)) - 1;
+    left_mask = ~right_mask;
+
+    *dst &= (1ul << shift) - 1;
+    while (nbits >= BITS_PER_LONG) {
+        *dst |= (*src & right_mask) << shift;
+        *(dst + 1) = (*src & left_mask) >> (BITS_PER_LONG - shift);
+        dst++;
+        src++;
+        nbits -= BITS_PER_LONG;
+    }
+
+    if (nbits > BITS_PER_LONG - shift) {
+        *dst |= (*src & right_mask) << shift;
+        nbits -= BITS_PER_LONG - shift;
+        last_mask = ((1ul << nbits) - 1) << (BITS_PER_LONG - shift);
+        *(dst + 1) = (*src & last_mask) >> (BITS_PER_LONG - shift);
+    } else if (nbits) {
+        last_mask = (1ul << nbits) - 1;
+        *dst |= (*src & last_mask) << shift;
+    }
+}
-- 
2.17.1



* [Qemu-devel] [PATCH 05/11] memory: Pass mr into snapshot_and_clear_dirty
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Also change its second parameter to be the offset relative to the
start of the memory region.  This will be used in follow-up patches.
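
For example, the call in memory.c changes like this (taken from the
diff below):

    /* before */
    cpu_physical_memory_snapshot_and_clear_dirty(
                memory_region_get_ram_addr(mr) + addr, size, client);
    /* after */
    cpu_physical_memory_snapshot_and_clear_dirty(mr, addr, size, client);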

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c                  | 3 ++-
 include/exec/ram_addr.h | 2 +-
 memory.c                | 3 +--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index 4e734770c2..2615b4cfed 100644
--- a/exec.c
+++ b/exec.c
@@ -1387,9 +1387,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
 }
 
 DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
-     (ram_addr_t start, ram_addr_t length, unsigned client)
+(MemoryRegion *mr, hwaddr addr, hwaddr length, unsigned client)
 {
     DirtyMemoryBlocks *blocks;
+    ram_addr_t start = memory_region_get_ram_addr(mr) + addr;
     unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL);
     ram_addr_t first = QEMU_ALIGN_DOWN(start, align);
     ram_addr_t last  = QEMU_ALIGN_UP(start + length, align);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 86bc8e1a4a..69cc528c98 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -403,7 +403,7 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
                                               unsigned client);
 
 DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
-    (ram_addr_t start, ram_addr_t length, unsigned client);
+(MemoryRegion *mr, hwaddr addr, hwaddr length, unsigned client);
 
 bool cpu_physical_memory_snapshot_get_dirty(DirtyBitmapSnapshot *snap,
                                             ram_addr_t start,
diff --git a/memory.c b/memory.c
index 1cc3531df5..be16beea76 100644
--- a/memory.c
+++ b/memory.c
@@ -2071,8 +2071,7 @@ DirtyBitmapSnapshot *memory_region_snapshot_and_clear_dirty(MemoryRegion *mr,
 {
     assert(mr->ram_block);
     memory_region_sync_dirty_bitmap(mr);
-    return cpu_physical_memory_snapshot_and_clear_dirty(
-                memory_region_get_ram_addr(mr) + addr, size, client);
+    return cpu_physical_memory_snapshot_and_clear_dirty(mr, addr, size, client);
 }
 
 bool memory_region_snapshot_get_dirty(MemoryRegion *mr, DirtyBitmapSnapshot *snap,
-- 
2.17.1



* [Qemu-devel] [PATCH 06/11] memory: Introduce memory listener hook log_clear()
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Introduce a new memory region listener hook log_clear() to allow the
listeners to hook into the points where the dirty bitmap is cleared
by the bitmap users.

Previously log_sync() contained two operations:

  - dirty bitmap collection, and,
  - dirty bitmap clear on the remote side.

Take KVM as an example - log_sync() for KVM will first copy the
kernel dirty bitmap to userspace, and at the same time clear the
dirty bitmap there along with re-protecting all the guest pages
again.

We add this new log_clear() interface to split the old log_sync()
into two separate procedures:

  - use log_sync() to do the dirty bitmap collection only, and,
  - use log_clear() to clear the remote dirty bitmap.

With the new interface, memory listener users can still decide how to
implement the log synchronization procedure, e.g., they can still
provide only the log_sync() method and keep both procedures within
log_sync() (that's how KVM worked before
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT was introduced).  However, with the
new interface memory listener users get a chance to postpone the
clear operation explicitly if the module supports it.  That can
really benefit users like KVM, at least for host kernels that support
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.

There are three places that can clear dirty bits in any one of the
dirty bitmaps in the ram_list.dirty_memory[] array:

        cpu_physical_memory_snapshot_and_clear_dirty
        cpu_physical_memory_test_and_clear_dirty
        cpu_physical_memory_sync_dirty_bitmap

We hook directly into each of these functions to deliver the
log_clear() notification.
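
As a sketch of how a listener wires this up (my_log_sync/my_log_clear
are placeholders; patch 10 does exactly this for KVM):

    static void my_log_clear(MemoryListener *listener,
                             MemoryRegionSection *section)
    {
        /* clear the remote dirty bitmap covered by @section */
    }

    listener->log_sync  = my_log_sync;   /* collect dirty bits only  */
    listener->log_clear = my_log_clear;  /* clear remote bits, later */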

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c                  | 12 ++++++++++
 include/exec/memory.h   | 17 ++++++++++++++
 include/exec/ram_addr.h |  3 +++
 memory.c                | 51 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 83 insertions(+)

diff --git a/exec.c b/exec.c
index 2615b4cfed..ab595e1e4b 100644
--- a/exec.c
+++ b/exec.c
@@ -1355,6 +1355,8 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
     DirtyMemoryBlocks *blocks;
     unsigned long end, page;
     bool dirty = false;
+    RAMBlock *ramblock;
+    uint64_t mr_offset, mr_size;
 
     if (length == 0) {
         return false;
@@ -1366,6 +1368,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
     rcu_read_lock();
 
     blocks = atomic_rcu_read(&ram_list.dirty_memory[client]);
+    ramblock = qemu_get_ram_block(start);
+    /* Range sanity check on the ramblock */
+    assert(start >= ramblock->offset &&
+           start + length <= ramblock->offset + ramblock->used_length);
 
     while (page < end) {
         unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
@@ -1377,6 +1383,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
         page += num;
     }
 
+    mr_offset = (ram_addr_t)(page << TARGET_PAGE_BITS) - ramblock->offset;
+    mr_size = (end - page) << TARGET_PAGE_BITS;
+    memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
+
     rcu_read_unlock();
 
     if (dirty && tcg_enabled()) {
@@ -1432,6 +1442,8 @@ DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
         tlb_reset_dirty_range_all(start, length);
     }
 
+    memory_region_clear_dirty_bitmap(mr, addr, length);
+
     return snap;
 }
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index f29300c54d..d752b2a758 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -416,6 +416,7 @@ struct MemoryListener {
     void (*log_stop)(MemoryListener *listener, MemoryRegionSection *section,
                      int old, int new);
     void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
+    void (*log_clear)(MemoryListener *listener, MemoryRegionSection *section);
     void (*log_global_start)(MemoryListener *listener);
     void (*log_global_stop)(MemoryListener *listener);
     void (*eventfd_add)(MemoryListener *listener, MemoryRegionSection *section,
@@ -1269,6 +1270,22 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client);
 void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                              hwaddr size);
 
+/**
+ * memory_region_clear_dirty_bitmap - clear dirty bitmap for memory range
+ *
+ * This function is called when the caller wants to clear the remote
+ * dirty bitmap of a memory range within the memory region.  This can
+ * be used by e.g. KVM to manually clear dirty log when
+ * KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is declared support by the host
+ * kernel.
+ *
+ * @mr:     the memory region to clear the dirty log upon
+ * @start:  start address offset within the memory region
+ * @len:    length of the memory region to clear dirty bitmap
+ */
+void memory_region_clear_dirty_bitmap(MemoryRegion *mr, hwaddr start,
+                                      hwaddr len);
+
 /**
  * memory_region_snapshot_and_clear_dirty: Get a snapshot of the dirty
  *                                         bitmap and clear it.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 69cc528c98..896324281a 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -461,6 +461,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                 idx++;
             }
         }
+
+        /* TODO: split the huge bitmap into smaller chunks */
+        memory_region_clear_dirty_bitmap(rb->mr, start, length);
     } else {
         ram_addr_t offset = rb->offset;
 
diff --git a/memory.c b/memory.c
index be16beea76..b4e2c41ba0 100644
--- a/memory.c
+++ b/memory.c
@@ -2064,6 +2064,57 @@ static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
     }
 }
 
+void memory_region_clear_dirty_bitmap(MemoryRegion *mr, hwaddr start,
+                                      hwaddr len)
+{
+    MemoryRegionSection mrs;
+    MemoryListener *listener;
+    AddressSpace *as;
+    FlatView *view;
+    FlatRange *fr;
+    hwaddr sec_start, sec_end, sec_size;
+
+    QTAILQ_FOREACH(listener, &memory_listeners, link) {
+        if (!listener->log_clear) {
+            continue;
+        }
+        as = listener->address_space;
+        view = address_space_get_flatview(as);
+        FOR_EACH_FLAT_RANGE(fr, view) {
+            if (!fr->dirty_log_mask || fr->mr != mr) {
+                /*
+                 * The clear operation only applies to regions that
+                 * have dirty logging enabled in the first place
+                 */
+                continue;
+            }
+
+            mrs = section_from_flat_range(fr, view);
+
+            sec_start = MAX(mrs.offset_within_region, start);
+            sec_end = mrs.offset_within_region + int128_get64(mrs.size);
+            sec_end = MIN(sec_end, start + len);
+
+            if (sec_start >= sec_end) {
+                /*
+                 * If this memory region section has no intersection
+                 * with the requested range, skip.
+                 */
+                continue;
+            }
+
+            /* Valid case; shrink the section if needed */
+            mrs.offset_within_address_space +=
+                sec_start - mrs.offset_within_region;
+            mrs.offset_within_region = sec_start;
+            sec_size = sec_end - sec_start;
+            mrs.size = int128_make64(sec_size);
+            listener->log_clear(listener, &mrs);
+        }
+        flatview_unref(view);
+    }
+}
+
 DirtyBitmapSnapshot *memory_region_snapshot_and_clear_dirty(MemoryRegion *mr,
                                                             hwaddr addr,
                                                             hwaddr size,
-- 
2.17.1



* [Qemu-devel] [PATCH 07/11] kvm: Update comments for sync_dirty_bitmap
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

The comment is obviously obsolete.  Update it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 524c4ddfbd..b686531586 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -473,13 +473,13 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
 /**
- * kvm_physical_sync_dirty_bitmap - Grab dirty bitmap from kernel space
- * This function updates qemu's dirty bitmap using
- * memory_region_set_dirty().  This means all bits are set
- * to dirty.
+ * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
  *
- * @start_add: start of logged region.
- * @end_addr: end of logged region.
+ * This function will first try to fetch dirty bitmap from the kernel,
+ * and then updates qemu's dirty bitmap.
+ *
+ * @kml: the KVM memory listener object
+ * @section: the memory section to sync the dirty bitmap with
  */
 static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
                                           MemoryRegionSection *section)
-- 
2.17.1



* [Qemu-devel] [PATCH 08/11] kvm: Persistent per kvmslot dirty bitmap
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

When synchronizing the dirty bitmap from the kernel we do it in a
per-kvmslot fashion, allocating the userspace bitmap anew for each
ioctl.  This patch instead makes the bitmap cache persistent so that
we don't need to g_malloc0() every time.

More importantly, the cached per-kvmslot dirty bitmap will be used
further when we add support for KVM_CLEAR_DIRTY_LOG: there the cached
bitmap guarantees that we won't clear any unknown dirty bits, which
could otherwise be a severe data loss issue for migration code.
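
The lifecycle of the cache then follows the slot (a sketch matching
the diff below):

    /* on slot registration */
    mem->dirty_bmap = g_malloc0(bmap_size);
    ...
    /* on slot removal */
    g_free(mem->dirty_bmap);
    mem->dirty_bmap = NULL;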

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 39 +++++++++++++++++++++------------------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b686531586..334c610918 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -497,31 +497,14 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
             return 0;
         }
 
-        /* XXX bad kernel interface alert
-         * For dirty bitmap, kernel allocates array of size aligned to
-         * bits-per-long.  But for case when the kernel is 64bits and
-         * the userspace is 32bits, userspace can't align to the same
-         * bits-per-long, since sizeof(long) is different between kernel
-         * and user space.  This way, userspace will provide buffer which
-         * may be 4 bytes less than the kernel will use, resulting in
-         * userspace memory corruption (which is not detectable by valgrind
-         * too, in most cases).
-         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
-         * a hope that sizeof(long) won't become >8 any time soon.
-         */
-        size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
-                     /*HOST_LONG_BITS*/ 64) / 8;
-        d.dirty_bitmap = g_malloc0(size);
-
+        d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
             DPRINTF("ioctl failed %d\n", errno);
-            g_free(d.dirty_bitmap);
             return -1;
         }
 
         kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
-        g_free(d.dirty_bitmap);
     }
 
     return 0;
@@ -765,6 +748,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     MemoryRegion *mr = section->mr;
     bool writeable = !mr->readonly && !mr->rom_device;
     hwaddr start_addr, size;
+    unsigned long bmap_size;
     void *ram;
 
     if (!memory_region_is_ram(mr)) {
@@ -796,6 +780,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         }
 
         /* unregister the slot */
+        g_free(mem->dirty_bmap);
+        mem->dirty_bmap = NULL;
         mem->memory_size = 0;
         mem->flags = 0;
         err = kvm_set_user_memory_region(kml, mem, false);
@@ -807,12 +793,29 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         return;
     }
 
+    /*
+     * XXX bad kernel interface alert For dirty bitmap, kernel
+     * allocates array of size aligned to bits-per-long.  But for case
+     * when the kernel is 64bits and the userspace is 32bits,
+     * userspace can't align to the same bits-per-long, since
+     * sizeof(long) is different between kernel and user space.  This
+     * way, userspace will provide buffer which may be 4 bytes less
+     * than the kernel will use, resulting in userspace memory
+     * corruption (which is not detectable by valgrind too, in most
+     * cases).  So for now, let's align to 64 instead of
+     * HOST_LONG_BITS here, in a hope that sizeof(long) won't become
+     * >8 any time soon.
+     */
+    bmap_size = ALIGN((size >> TARGET_PAGE_BITS),
+                      /*HOST_LONG_BITS*/ 64) / 8;
+
     /* register the new slot */
     mem = kvm_alloc_slot(kml);
     mem->memory_size = size;
     mem->start_addr = start_addr;
     mem->ram = ram;
     mem->flags = kvm_mem_flags(mr);
+    mem->dirty_bmap = g_malloc0(bmap_size);
 
     err = kvm_set_user_memory_region(kml, mem, true);
     if (err) {
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index f838412491..687a2ee423 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -21,6 +21,8 @@ typedef struct KVMSlot
     int slot;
     int flags;
     int old_flags;
+    /* Dirty bitmap cache for the slot */
+    unsigned long *dirty_bmap;
 } KVMSlot;
 
 typedef struct KVMMemoryListener {
-- 
2.17.1



* [Qemu-devel] [PATCH 09/11] kvm: Introduce slots lock for memory listener
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Introduce KVMMemoryListener.slots_lock to protect the slots inside
the kvm memory listener.  Currently it is close to useless because
all the KVM code paths are still protected by the BQL.  But it'll
start to make sense in follow-up patches where we might do remote
dirty bitmap clears and also update the per-slot cached dirty bitmap
even without the BQL.  So let's prepare for it.

We could also use a per-slot lock for the above reasons, but it seems
to be overkill.  Let's just use this bigger one (which covers all the
slots of a single address space); it is anyway still much smaller
than the BQL.
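
The expected usage pattern (as in the diff below) is:

    kvm_slots_lock(kml);
    mem = kvm_lookup_matching_slot(kml, start_addr, size);
    /* ... access or update the slot, e.g. mem->dirty_bmap ... */
    kvm_slots_unlock(kml);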

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 58 +++++++++++++++++++++++++++++++---------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 334c610918..a19535bf6a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -138,6 +138,9 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
     KVM_CAP_LAST_INFO
 };
 
+#define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
+#define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_machine->accelerator);
@@ -165,6 +168,7 @@ int kvm_memcrypt_encrypt_data(uint8_t *ptr, uint64_t len)
     return 1;
 }
 
+/* Must be called with slots_lock held */
 static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 {
     KVMState *s = kvm_state;
@@ -182,10 +186,17 @@ static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 bool kvm_has_free_slot(MachineState *ms)
 {
     KVMState *s = KVM_STATE(ms->accelerator);
+    bool result;
+    KVMMemoryListener *kml = &s->memory_listener;
+
+    kvm_slots_lock(kml);
+    result = !!kvm_get_free_slot(kml);
+    kvm_slots_unlock(kml);
 
-    return kvm_get_free_slot(&s->memory_listener);
+    return result;
 }
 
+/* Must be called with slots_lock held */
 static KVMSlot *kvm_alloc_slot(KVMMemoryListener *kml)
 {
     KVMSlot *slot = kvm_get_free_slot(kml);
@@ -244,18 +255,21 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
                                        hwaddr *phys_addr)
 {
     KVMMemoryListener *kml = &s->memory_listener;
-    int i;
+    int i, ret = 0;
 
+    kvm_slots_lock(kml);
     for (i = 0; i < s->nr_slots; i++) {
         KVMSlot *mem = &kml->slots[i];
 
         if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
             *phys_addr = mem->start_addr + (ram - mem->ram);
-            return 1;
+            ret = 1;
+            break;
         }
     }
+    kvm_slots_unlock(kml);
 
-    return 0;
+    return ret;
 }
 
 static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
@@ -391,6 +405,7 @@ static int kvm_mem_flags(MemoryRegion *mr)
     return flags;
 }
 
+/* Must be called with slots_lock held */
 static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
                                  MemoryRegion *mr)
 {
@@ -409,19 +424,26 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,
 {
     hwaddr start_addr, size;
     KVMSlot *mem;
+    int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
     if (!size) {
         return 0;
     }
 
+    kvm_slots_lock(kml);
+
     mem = kvm_lookup_matching_slot(kml, start_addr, size);
     if (!mem) {
         /* We don't have a slot if we want to trap every access. */
-        return 0;
+        goto out;
     }
 
-    return kvm_slot_update_flags(kml, mem, section->mr);
+    ret = kvm_slot_update_flags(kml, mem, section->mr);
+
+out:
+    kvm_slots_unlock(kml);
+    return ret;
 }
 
 static void kvm_log_start(MemoryListener *listener,
@@ -478,6 +500,8 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
  * This function will first try to fetch dirty bitmap from the kernel,
  * and then updates qemu's dirty bitmap.
  *
+ * NOTE: the caller must hold kml->slots_lock.
+ *
  * @kml: the KVM memory listener object
  * @section: the memory section to sync the dirty bitmap with
  */
@@ -488,26 +512,28 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
     struct kvm_dirty_log d = {};
     KVMSlot *mem;
     hwaddr start_addr, size;
+    int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
     if (size) {
         mem = kvm_lookup_matching_slot(kml, start_addr, size);
         if (!mem) {
             /* We don't have a slot if we want to trap every access. */
-            return 0;
+            goto out;
         }
 
         d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
             DPRINTF("ioctl failed %d\n", errno);
-            return -1;
+            ret = -1;
+            goto out;
         }
 
         kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
     }
-
-    return 0;
+out:
+    return ret;
 }
 
 static void kvm_coalesce_mmio_region(MemoryListener *listener,
@@ -770,10 +796,12 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
           (start_addr - section->offset_within_address_space);
 
+    kvm_slots_lock(kml);
+
     if (!add) {
         mem = kvm_lookup_matching_slot(kml, start_addr, size);
         if (!mem) {
-            return;
+            goto out;
         }
         if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
             kvm_physical_sync_dirty_bitmap(kml, section);
@@ -790,7 +818,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                     __func__, strerror(-err));
             abort();
         }
-        return;
+        goto out;
     }
 
     /*
@@ -823,6 +851,9 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                 strerror(-err));
         abort();
     }
+
+out:
+    kvm_slots_unlock(kml);
 }
 
 static void kvm_region_add(MemoryListener *listener,
@@ -849,7 +880,9 @@ static void kvm_log_sync(MemoryListener *listener,
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
     int r;
 
+    kvm_slots_lock(kml);
     r = kvm_physical_sync_dirty_bitmap(kml, section);
+    kvm_slots_unlock(kml);
     if (r < 0) {
         abort();
     }
@@ -929,6 +962,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
 {
     int i;
 
+    qemu_mutex_init(&kml->slots_lock);
     kml->slots = g_malloc0(s->nr_slots * sizeof(KVMSlot));
     kml->as_id = as_id;
 
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 687a2ee423..31df465fdc 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -27,6 +27,8 @@ typedef struct KVMSlot
 
 typedef struct KVMMemoryListener {
     MemoryListener listener;
+    /* Protects the slots and all inside them */
+    QemuMutex slots_lock;
     KVMSlot *slots;
     int as_id;
 } KVMMemoryListener;
-- 
2.17.1



* [Qemu-devel] [PATCH 10/11] kvm: Support KVM_CLEAR_DIRTY_LOG
From: Peter Xu @ 2019-05-08  6:15 UTC
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

First detect the interface using KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
record it.  If we fail to enable the new feature we'll fall back to
the old sync.

Provide the log_clear() hook for the memory listeners of both KVM
address spaces (normal system memory, and SMM) and deliver the clear
message to the kernel.
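
The kernel interface boils down to one ioctl per cleared range (a
sketch; the field values mirror what the patch below fills in):

    struct kvm_clear_dirty_log d = {
        .slot         = mem->slot | (kml->as_id << 16),
        .first_page   = bmap_start,   /* aligned to 64 host pages     */
        .num_pages    = bmap_npages,  /* or runs to the slot's end    */
        .dirty_bitmap = bmap,         /* set bits = clear & reprotect */
    };

    kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);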

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c    | 180 +++++++++++++++++++++++++++++++++++++++++
 accel/kvm/trace-events |   1 +
 2 files changed, 181 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a19535bf6a..2d8067aadc 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -91,6 +91,7 @@ struct KVMState
     int many_ioeventfds;
     int intx_set_mask;
     bool sync_mmu;
+    bool manual_dirty_log_protect;
     /* The man page (and posix) say ioctl numbers are signed int, but
      * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
      * unsigned, and treating them as signed here can break things */
@@ -536,6 +537,157 @@ out:
     return ret;
 }
 
+/* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
+#define KVM_CLEAR_LOG_SHIFT  6
+#define KVM_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << KVM_CLEAR_LOG_SHIFT)
+#define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)
+
+/**
+ * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
+ *
+ * NOTE: this will be a no-op if we haven't enabled manual dirty log
+ * protection in the host kernel because in that case this operation
+ * will be done within log_sync().
+ *
+ * @kml:     the kvm memory listener
+ * @section: the memory range to clear dirty bitmap
+ */
+static int kvm_physical_log_clear(KVMMemoryListener *kml,
+                                  MemoryRegionSection *section)
+{
+    KVMState *s = kvm_state;
+    struct kvm_clear_dirty_log d;
+    uint64_t start, end, bmap_start, start_delta, bmap_npages, size;
+    unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size;
+    KVMSlot *mem;
+    int ret, i;
+
+    if (!s->manual_dirty_log_protect) {
+        /* No need to do explicit clear */
+        return 0;
+    }
+
+    start = section->offset_within_address_space;
+    size = int128_get64(section->size);
+
+    if (!size) {
+        /* Nothing more we can do... */
+        return 0;
+    }
+
+    kvm_slots_lock(kml);
+
+    /* Find any possible slot that covers the section */
+    for (i = 0; i < s->nr_slots; i++) {
+        mem = &kml->slots[i];
+        if (mem->start_addr <= start &&
+            start + size <= mem->start_addr + mem->memory_size) {
+            break;
+        }
+    }
+
+    /*
+     * We should always find one memslot until this point, otherwise
+     * there could be something wrong from the upper layer
+     */
+    assert(i != s->nr_slots);
+
+    /*
+     * We need to extend either the start or the size or both to
+     * satisfy the KVM interface requirement.  Firstly, do the start
+     * page alignment on 64 host pages
+     */
+    bmap_start = (start - mem->start_addr) & KVM_CLEAR_LOG_MASK;
+    start_delta = start - mem->start_addr - bmap_start;
+    bmap_start /= psize;
+
+    /*
+     * The kernel interface has restriction on the size too, that either:
+     *
+     * (1) the size is 64 host pages aligned (just like the start), or
+     * (2) the size fills up until the end of the KVM memslot.
+     */
+    bmap_npages = DIV_ROUND_UP(size + start_delta, KVM_CLEAR_LOG_ALIGN)
+        << KVM_CLEAR_LOG_SHIFT;
+    end = mem->memory_size / psize;
+    if (bmap_npages > end - bmap_start) {
+        bmap_npages = end - bmap_start;
+    }
+    start_delta /= psize;
+
+    /*
+     * Prepare the bitmap to clear dirty bits.  Here we must guarantee
+     * that we won't clear any unknown dirty bits otherwise we might
+     * accidentally clear some set bits which are not yet synced from
+     * the kernel into QEMU's bitmap, then we'll lose track of the
+     * guest modifications upon those pages (which can directly lead
+     * to guest data loss or panic after migration).
+     *
+     * Layout of the KVMSlot.dirty_bmap:
+     *
+     *                   |<-------- bmap_npages -----------..>|
+     *                                                     [1]
+     *                     start_delta         size
+     *  |----------------|-------------|------------------|------------|
+     *  ^                ^             ^                               ^
+     *  |                |             |                               |
+     * start          bmap_start     (start)                         end
+     * of memslot                                             of memslot
+     *
+     * [1] bmap_npages can be aligned to either 64 pages or the end of slot
+     */
+
+    assert(bmap_start % BITS_PER_LONG == 0);
+    if (start_delta) {
+        /* Slow path - we need to manipulate a temp bitmap */
+        bmap_clear = bitmap_new(bmap_npages);
+        bitmap_copy_with_src_offset(bmap_clear, mem->dirty_bmap,
+                                    bmap_start, start_delta + size / psize);
+        /*
+         * We need to fill the holes at start because that was not
+         * specified by the caller and we extended the bitmap only for
+         * 64 pages alignment
+         */
+        bitmap_clear(bmap_clear, 0, start_delta);
+        d.dirty_bitmap = bmap_clear;
+    } else {
+        /* Fast path - start address aligns well with BITS_PER_LONG */
+        d.dirty_bitmap = mem->dirty_bmap + BIT_WORD(bmap_start);
+    }
+
+    d.first_page = bmap_start;
+    /* It should never overflow.  If it happens, say something */
+    assert(bmap_npages <= UINT32_MAX);
+    d.num_pages = bmap_npages;
+    d.slot = mem->slot | (kml->as_id << 16);
+
+    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
+        ret = -errno;
+        error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
+                     "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
+                     __func__, d.slot, (uint64_t)d.first_page,
+                     (uint32_t)d.num_pages, ret);
+    } else {
+        ret = 0;
+        trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
+    }
+
+    /*
+     * After updating the remote dirty bitmap, update the memslot's
+     * cached bitmap as well, so that if another user clears the same
+     * region we know not to clear it again on the remote side (which
+     * would again risk data loss).
+     */
+    bitmap_clear(mem->dirty_bmap, bmap_start + start_delta,
+                 size / psize);
+    /* This handles the NULL case well */
+    g_free(bmap_clear);
+
+    kvm_slots_unlock(kml);
+
+    return ret;
+}
+
 static void kvm_coalesce_mmio_region(MemoryListener *listener,
                                     MemoryRegionSection *section,
                                      hwaddr start, hwaddr size)
@@ -888,6 +1040,22 @@ static void kvm_log_sync(MemoryListener *listener,
     }
 }
 
+static void kvm_log_clear(MemoryListener *listener,
+                          MemoryRegionSection *section)
+{
+    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    int r;
+
+    r = kvm_physical_log_clear(kml, section);
+    if (r < 0) {
+        error_report_once("%s: kvm log clear failed: mr=%s "
+                          "offset=%"HWADDR_PRIx" size=%"PRIx64, __func__,
+                          section->mr->name, section->offset_within_region,
+                          int128_get64(section->size));
+        abort();
+    }
+}
+
 static void kvm_mem_ioeventfd_add(MemoryListener *listener,
                                   MemoryRegionSection *section,
                                   bool match_data, uint64_t data,
@@ -975,6 +1143,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
     kml->listener.log_start = kvm_log_start;
     kml->listener.log_stop = kvm_log_stop;
     kml->listener.log_sync = kvm_log_sync;
+    kml->listener.log_clear = kvm_log_clear;
     kml->listener.priority = 10;
 
     memory_listener_register(&kml->listener, as);
@@ -1699,6 +1868,17 @@ static int kvm_init(MachineState *ms)
     s->coalesced_pio = s->coalesced_mmio &&
                        kvm_check_extension(s, KVM_CAP_COALESCED_PIO);
 
+    s->manual_dirty_log_protect =
+        kvm_check_extension(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT);
+    if (s->manual_dirty_log_protect) {
+        ret = kvm_vm_enable_cap(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, 0, 1);
+        if (ret) {
+            warn_report("Trying to enable KVM_CAP_MANUAL_DIRTY_LOG_PROTECT "
+                        "but failed.  Falling back to the legacy mode. ");
+            s->manual_dirty_log_protect = false;
+        }
+    }
+
 #ifdef KVM_CAP_VCPU_EVENTS
     s->vcpu_events = kvm_check_extension(s, KVM_CAP_VCPU_EVENTS);
 #endif
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 33c5b1b3af..4fb6e59d19 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -15,4 +15,5 @@ kvm_irqchip_release_virq(int virq) "virq %d"
 kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
+kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
 
-- 
2.17.1
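
To make the start/size alignment in kvm_physical_log_clear() above
concrete, here is a minimal standalone sketch.  It assumes 4K host
pages and the 64-page granule that KVM_CLEAR_DIRTY_LOG requires; the
constants and sample values are simplified for illustration, not
taken verbatim from the patch:

    #include <stdint.h>
    #include <stdio.h>

    #define PSIZE                4096ULL  /* assumed host page size */
    #define KVM_CLEAR_LOG_SHIFT  6        /* 64 pages */
    #define KVM_CLEAR_LOG_ALIGN  ((1ULL << KVM_CLEAR_LOG_SHIFT) * PSIZE)
    #define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)
    #define DIV_ROUND_UP(n, d)   (((n) + (d) - 1) / (d))

    int main(void)
    {
        uint64_t slot_start = 0;                 /* memslot GPA base */
        uint64_t slot_size = 1ULL << 30;         /* 1G memslot */
        uint64_t start = 0x21000, size = 0x7000; /* caller's range */

        /* Align the start down to 64 host pages, remember the delta */
        uint64_t bmap_start = (start - slot_start) & KVM_CLEAR_LOG_MASK;
        uint64_t start_delta = start - slot_start - bmap_start;
        bmap_start /= PSIZE;

        /* Round the size up to 64 pages, but never past the slot end */
        uint64_t bmap_npages =
            DIV_ROUND_UP(size + start_delta, KVM_CLEAR_LOG_ALIGN)
            << KVM_CLEAR_LOG_SHIFT;
        uint64_t end = slot_size / PSIZE;
        if (bmap_npages > end - bmap_start) {
            bmap_npages = end - bmap_start;
        }
        start_delta /= PSIZE;

        /* Prints: bmap_start=0 start_delta=33 bmap_npages=64 */
        printf("bmap_start=%llu start_delta=%llu bmap_npages=%llu\n",
               (unsigned long long)bmap_start,
               (unsigned long long)start_delta,
               (unsigned long long)bmap_npages);
        return 0;
    }

Since start_delta is 33 pages here, this request would take the slow
path in the patch: copy from the cached bitmap with an offset, zero
the first 33 bits, and hand the 64-page window to the kernel.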



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 11/11] migration: Split log_clear() into smaller chunks
  2019-05-08  6:15 [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (9 preceding siblings ...)
  2019-05-08  6:15 ` [Qemu-devel] [PATCH 10/11] kvm: Support KVM_CLEAR_DIRTY_LOG Peter Xu
@ 2019-05-08  6:15 ` Peter Xu
  2019-05-08 10:09 ` [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Paolo Bonzini
  11 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2019-05-08  6:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Currently we do log_clear() right after log_sync(), which mostly
keeps the old behavior from when log_clear() was still part of
log_sync().

This patch further optimizes the migration log_clear() code path by
splitting huge log_clear()s into smaller chunks.

We do this by splitting the whole guest memory region into memory
chunks whose size is decided by MigrationState.clear_bitmap_shift (an
example is given below).  With that, we no longer clear the dirty
bitmap on the remote node (e.g., KVM) when we fetch it; instead, we
explicitly clear the dirty bitmap of a memory chunk the first time we
send a page from that chunk.

Here comes an example.

Assuming the guest has 64G memory, before this patch the KVM ioctl
KVM_CLEAR_DIRTY_LOG would be a single one covering all 64G of memory.
After the patch, assuming a clear bitmap shift of 18, the memory
chunk size on x86_64 will be (1UL << 18) * 4K = 1G.  So instead of
sending one big 64G ioctl, we'll send 64 small ioctls, each covering
1G of guest memory.  Each of the 64 small ioctls is only issued if
some page in that small chunk is about to be sent right away.
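
To double-check the arithmetic, here is a tiny sketch (assuming 4K
target pages; the variable names are illustrative only):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t guest_mem = 64ULL << 30;     /* 64G guest */
        uint64_t page_size = 4096;            /* 4K target pages */
        unsigned int shift = 18;              /* clear bitmap shift */
        uint64_t chunk = page_size << shift;  /* (1UL << 18) * 4K = 1G */

        /* Prints: chunk=1024M ioctls=64 */
        printf("chunk=%lluM ioctls=%llu\n",
               (unsigned long long)(chunk >> 20),
               (unsigned long long)(guest_mem / chunk));
        return 0;
    }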

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ram_addr.h | 75 +++++++++++++++++++++++++++++++++++++++--
 migration/migration.c   |  4 +++
 migration/migration.h   | 27 +++++++++++++++
 migration/ram.c         | 44 ++++++++++++++++++++++++
 migration/trace-events  |  1 +
 5 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 896324281a..c2db157d98 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -50,8 +50,69 @@ struct RAMBlock {
     unsigned long *unsentmap;
     /* bitmap of already received pages in postcopy */
     unsigned long *receivedmap;
+
+    /*
+     * bitmap of already cleared dirty bitmap.  Set this up to
+     * non-NULL to enable the capability to postpone and split
+     * clearing of dirty bitmap on the remote node (e.g., KVM).  The
+     * bitmap will be set only when doing global sync.
+     *
+     * NOTE: this bitmap is different comparing to the other bitmaps
+     * in that one bit can represent multiple guest pages (which is
+     * decided by the `clear_bmap_shift' variable below).  On
+     * destination side, this should always be NULL, and the variable
+     * `clear_bmap_shift' is meaningless.
+     */
+    unsigned long *clear_bmap;
+    uint8_t clear_bmap_shift;
 };
 
+/**
+ * clear_bmap_size: calculate clear bitmap size
+ *
+ * @pages: number of guest pages
+ * @shift: guest page number shift
+ *
+ * Returns: number of bits for the clear bitmap
+ */
+static inline long clear_bmap_size(uint64_t pages, uint8_t shift)
+{
+    return DIV_ROUND_UP(pages, 1UL << shift);
+}
+
+/**
+ * clear_bmap_set: set clear bitmap for the page range
+ *
+ * @rb: the ramblock to operate on
+ * @start: the start page number
+ * @npages: number of pages to set in the bitmap
+ *
+ * Returns: None
+ */
+static inline void clear_bmap_set(RAMBlock *rb, uint64_t start,
+                                  uint64_t npages)
+{
+    uint8_t shift = rb->clear_bmap_shift;
+
+    bitmap_set_atomic(rb->clear_bmap, start >> shift,
+                      clear_bmap_size(npages, shift));
+}
+
+/**
+ * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
+ *
+ * @rb: the ramblock to operate on
+ * @page: the page number to check
+ *
+ * Returns: true if the bit was set, false otherwise
+ */
+static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)
+{
+    uint8_t shift = rb->clear_bmap_shift;
+
+    return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
+}
+
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
 {
     return (b && b->host && offset < b->used_length) ? true : false;
@@ -462,8 +523,18 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
             }
         }
 
-        /* TODO: split the huge bitmap into smaller chunks */
-        memory_region_clear_dirty_bitmap(rb->mr, start, length);
+        if (rb->clear_bmap) {
+            /*
+             * Postpone the dirty bitmap clear to the point right
+             * before we really send the pages; we will also split
+             * the clear procedure into smaller chunks.
+             */
+            clear_bmap_set(rb, start >> TARGET_PAGE_BITS,
+                           length >> TARGET_PAGE_BITS);
+        } else {
+            /* Slow path - still do that in a huge chunk */
+            memory_region_clear_dirty_bitmap(rb->mr, start, length);
+        }
     } else {
         ram_addr_t offset = rb->offset;
 
diff --git a/migration/migration.c b/migration/migration.c
index 609e0df5d0..cc85326955 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3348,6 +3348,8 @@ void migration_global_dump(Monitor *mon)
                    ms->send_section_footer ? "on" : "off");
     monitor_printf(mon, "decompress-error-check: %s\n",
                    ms->decompress_error_check ? "on" : "off");
+    monitor_printf(mon, "clear-bitmap-shift: %u\n",
+                   ms->clear_bitmap_shift);
 }
 
 #define DEFINE_PROP_MIG_CAP(name, x)             \
@@ -3362,6 +3364,8 @@ static Property migration_properties[] = {
                      send_section_footer, true),
     DEFINE_PROP_BOOL("decompress-error-check", MigrationState,
                       decompress_error_check, true),
+    DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
+                      clear_bitmap_shift, CLEAR_BITMAP_SHIFT_DEFAULT),
 
     /* Migration parameters */
     DEFINE_PROP_UINT8("x-compress-level", MigrationState,
diff --git a/migration/migration.h b/migration/migration.h
index 438f17edad..ed836ccf03 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -27,6 +27,23 @@ struct PostcopyBlocktimeContext;
 
 #define  MIGRATION_RESUME_ACK_VALUE  (1)
 
+/*
+ * 1<<6=64 pages -> 256K chunk when page size is 4K.  This gives us
+ * the benefit that all the chunks are aligned to 64 pages, so the
+ * bitmaps are always aligned to unsigned long.
+ */
+#define CLEAR_BITMAP_SHIFT_MIN             6
+/*
+ * 1<<18=256K pages -> 1G chunk when page size is 4K.  This is the
+ * default value to use if no one specified.
+ */
+#define CLEAR_BITMAP_SHIFT_DEFAULT        18
+/*
+ * 1<<31=2G pages -> 8T chunk when page size is 4K.  This should be
+ * big enough and make sure we won't overflow easily.
+ */
+#define CLEAR_BITMAP_SHIFT_MAX            31
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
@@ -234,6 +251,16 @@ struct MigrationState
      * do not trigger spurious decompression errors.
      */
     bool decompress_error_check;
+
+    /*
+     * This decides the size of the guest memory chunks that will be
+     * used to track dirty bitmap clearing.  The size of each chunk
+     * will be GUEST_PAGE_SIZE << N.  E.g., N=0 means we clear the
+     * dirty bitmap for each page we send (1<<0=1); N=10 means we
+     * clear the dirty bitmap only once per 1<<10=1K contiguous guest
+     * pages (which is a 4M chunk with 4K pages).
+     */
+    uint8_t clear_bitmap_shift;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/ram.c b/migration/ram.c
index 3e11806ccd..6cecce39a4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1670,6 +1670,33 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     bool ret;
 
     qemu_mutex_lock(&rs->bitmap_mutex);
+
+    /*
+     * Clear dirty bitmap if needed.  This _must_ be called before we
+     * send any of the pages in the chunk, because we need to make
+     * sure we can capture further page content changes when we sync
+     * the dirty log the next time.  So as long as we are going to
+     * send any page in the chunk, we clear the remote dirty bitmap
+     * for the whole chunk.  Clearing it early is fine; too late is not.
+     */
+    if (rb->clear_bmap && clear_bmap_test_and_clear(rb, page)) {
+        uint8_t shift = rb->clear_bmap_shift;
+        hwaddr size = 1ULL << (TARGET_PAGE_BITS + shift);
+        hwaddr start = (page << TARGET_PAGE_BITS) & (-size);
+
+        /*
+         * CLEAR_BITMAP_SHIFT_MIN should always guarantee this.  It
+         * can make things easier, since the start address of each
+         * small chunk will always be 64-page aligned, so the bitmap
+         * will always be aligned to unsigned long.  We could probably
+         * even remove this restriction, but it is kept here for
+         * safety.
+         */
+        assert(shift >= 6);
+        trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
+        memory_region_clear_dirty_bitmap(rb->mr, start, size);
+    }
+
     ret = test_and_clear_bit(page, rb->bmap);
 
     if (ret) {
@@ -2687,6 +2714,8 @@ static void ram_save_cleanup(void *opaque)
     memory_global_dirty_log_stop();
 
     RAMBLOCK_FOREACH_NOT_IGNORED(block) {
+        g_free(block->clear_bmap);
+        block->clear_bmap = NULL;
         g_free(block->bmap);
         block->bmap = NULL;
         g_free(block->unsentmap);
@@ -3197,15 +3226,30 @@ static int ram_state_init(RAMState **rsp)
 
 static void ram_list_init_bitmaps(void)
 {
+    MigrationState *ms = migrate_get_current();
     RAMBlock *block;
     unsigned long pages;
+    uint8_t shift;
 
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
+        shift = ms->clear_bitmap_shift;
+        if (shift > CLEAR_BITMAP_SHIFT_MAX) {
+            error_report("clear_bitmap_shift (%u) too big, using "
+                         "max value (%u)", shift, CLEAR_BITMAP_SHIFT_MAX);
+            shift = CLEAR_BITMAP_SHIFT_MAX;
+        } else if (shift < CLEAR_BITMAP_SHIFT_MIN) {
+            error_report("clear_bitmap_shift (%u) too small, using "
+                         "min value (%u)", shift, CLEAR_BITMAP_SHIFT_MIN);
+            shift = CLEAR_BITMAP_SHIFT_MIN;
+        }
+
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            block->clear_bmap_shift = shift;
+            block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
diff --git a/migration/trace-events b/migration/trace-events
index de2e136e57..2c45974330 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -79,6 +79,7 @@ get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_
 get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs, int sent) "%s/0x%" PRIx64 " page_abs=0x%lx (sent=%d)"
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
+migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
 migration_throttle(void) ""
 multifd_recv(uint8_t id, uint64_t packet_num, uint32_t used, uint32_t flags, uint32_t next_packet_size) "channel %d packet number %" PRIu64 " pages %d flags 0x%x next packet size %d"
 multifd_recv_sync_main(long packet_num) "packet num %ld"
-- 
2.17.1
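
To see how clear_bmap_set() and clear_bmap_test_and_clear() cooperate
on the send path, here is a standalone toy model.  It uses a single
64-bit word as the clear bitmap, assumes 4K pages with shift=18 (1G
chunks), and the function names are made up for illustration; the
real helpers live in ram_addr.h above:

    #include <stdint.h>
    #include <stdio.h>

    #define SHIFT 18  /* assumed clear_bitmap_shift: 1G chunks @ 4K pages */

    static uint64_t clear_bmap;  /* bit i set == chunk i needs a clear */
    static unsigned int clears_sent;

    /* Stand-in for the log_sync path: mark chunks instead of clearing */
    static void sync_dirty_log(uint64_t first_page, uint64_t npages)
    {
        for (uint64_t p = first_page; p < first_page + npages;
             p += 1ULL << SHIFT) {
            clear_bmap |= 1ULL << (p >> SHIFT);
        }
    }

    /* Stand-in for migration_bitmap_clear_dirty() + the page send */
    static void send_page(uint64_t page)
    {
        uint64_t chunk = page >> SHIFT;

        if (clear_bmap & (1ULL << chunk)) {  /* first page sent from chunk? */
            clear_bmap &= ~(1ULL << chunk);
            clears_sent++;                   /* one small CLEAR ioctl */
        }
        /* ... actually transmit the page here ... */
    }

    int main(void)
    {
        sync_dirty_log(0, 1ULL << 24);       /* 64G of 4K pages, all dirty */
        for (uint64_t p = 0; p < (1ULL << 24); p++) {
            send_page(p);
        }
        /* Prints: clears sent: 64 (instead of one huge 64G clear) */
        printf("clears sent: %u\n", clears_sent);
        return 0;
    }

With shift=18 the toy issues 64 chunk-sized clears for 64G of memory,
matching the commit message's example; a real run interleaves those
clears with the actual page transmission.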



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
  2019-05-08  6:15 [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (10 preceding siblings ...)
  2019-05-08  6:15 ` [Qemu-devel] [PATCH 11/11] migration: Split log_clear() into smaller chunks Peter Xu
@ 2019-05-08 10:09 ` Paolo Bonzini
  2019-05-08 11:39   ` Peter Xu
  11 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2019-05-08 10:09 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Laurent Vivier, Dr . David Alan Gilbert, Juan Quintela

On 08/05/19 01:15, Peter Xu wrote:
> Design
> ===================
> 
> I started with a naive/stupid design that I always pass all 1's to the
> KVM for a memory range to clear all the dirty bits within that memory
> range, but then I encountered guest oops - it's simply because we
> can't clear any dirty bit from QEMU if we are not _sure_ that the bit
> is dirty in the kernel.  Otherwise we might accidentally clear a bit
> that we don't even know of (e.g., the bit was clear in migration's
> dirty bitmap in QEMU) but actually that page was just being written so
> QEMU will never remember to migrate that new page again.
> 
> The new design is focused on a dirty bitmap cache within the QEMU kvm
> layer (which is per kvm memory slot).  With that we know what's dirty
> in the kernel previously (note! the kernel bitmap is still growing all
> the time so the cache will only be a subset of the realtime kernel
> bitmap but that's far enough for us) and with that we'll be sure to
> not accidentally clear unknown dirty pages.
> 
> With this method, we can also avoid race when multiple users (e.g.,
> DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
> multiple time.  If without the kvm memory slot cached dirty bitmap we
> won't be able to know which bit has been cleared and then if we send
> the CLEAR operation upon the same bit twice (or more) we can still
> face the same issue to clear something accidentally while we
> shouldn't.
> 
> Summary: we really need to be careful on what bit to clear otherwise
> we can face anything after the migration completes.  And I hope this
> series has considered all about this.

The disadvantage of this is that you won't clear in the kernel those
dirty bits that come from other sources (e.g. vhost or
address_space_map).  This can lead to double-copying of pages.

Migration already makes a local copy in rb->bmap, and
memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
be possible to invoke the clear using rb->bmap instead of the KVMSlot's
new bitmap?

Thanks,

Paolo

> Besides the new KVM cache layer and the new ioctl support, this series
> introduced the memory_region_clear_dirty_bitmap() in the memory API
> layer to allow clearing dirty bits of a specific memory range within
> the memory region.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 01/11] migration: No need to take rcu during sync_dirty_bitmap
  2019-05-08  6:15 ` [Qemu-devel] [PATCH 01/11] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
@ 2019-05-08 10:09   ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2019-05-08 10:09 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Laurent Vivier, Dr . David Alan Gilbert, Juan Quintela

On 08/05/19 01:15, Peter Xu wrote:
> cpu_physical_memory_sync_dirty_bitmap() has one RAMBlock* as
> parameter, which means that it must be with RCU read lock held
> already.  Taking it again inside seems redundant.  Removing it.
> Instead comment on the functions about the RCU read lock.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/ram_addr.h | 5 +----
>  migration/ram.c         | 1 +
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 139ad79390..993fb760f3 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -408,6 +408,7 @@ static inline void cpu_physical_memory_clear_dirty_range(ram_addr_t start,
>  }
>  
>  
> +/* Must be with rcu read lock held */
>  static inline
>  uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>                                                 ram_addr_t start,
> @@ -431,8 +432,6 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>                                          DIRTY_MEMORY_BLOCK_SIZE);
>          unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
>  
> -        rcu_read_lock();
> -
>          src = atomic_rcu_read(
>                  &ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION])->blocks;
>  
> @@ -452,8 +451,6 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>                  idx++;
>              }
>          }
> -
> -        rcu_read_unlock();
>      } else {
>          ram_addr_t offset = rb->offset;
>  
> diff --git a/migration/ram.c b/migration/ram.c
> index 1ca9ba77b6..3e11806ccd 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1680,6 +1680,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
>      return ret;
>  }
>  
> +/* Must be with rcu read lock held */
>  static void migration_bitmap_sync_range(RAMState *rs, RAMBlock *rb,
>                                          ram_addr_t start, ram_addr_t length)
>  {
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 02/11] memory: Remove memory_region_get_dirty()
  2019-05-08  6:15 ` [Qemu-devel] [PATCH 02/11] memory: Remove memory_region_get_dirty() Peter Xu
@ 2019-05-08 10:09   ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2019-05-08 10:09 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Laurent Vivier, Dr . David Alan Gilbert, Juan Quintela

On 08/05/19 01:15, Peter Xu wrote:
> It's never used anywhere.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/memory.h | 17 -----------------
>  memory.c              |  8 --------
>  2 files changed, 25 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 9144a47f57..e6140e8a04 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1254,23 +1254,6 @@ void memory_region_ram_resize(MemoryRegion *mr, ram_addr_t newsize,
>   */
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client);
>  
> -/**
> - * memory_region_get_dirty: Check whether a range of bytes is dirty
> - *                          for a specified client.
> - *
> - * Checks whether a range of bytes has been written to since the last
> - * call to memory_region_reset_dirty() with the same @client.  Dirty logging
> - * must be enabled.
> - *
> - * @mr: the memory region being queried.
> - * @addr: the address (relative to the start of the region) being queried.
> - * @size: the size of the range being queried.
> - * @client: the user of the logging information; %DIRTY_MEMORY_MIGRATION or
> - *          %DIRTY_MEMORY_VGA.
> - */
> -bool memory_region_get_dirty(MemoryRegion *mr, hwaddr addr,
> -                             hwaddr size, unsigned client);
> -
>  /**
>   * memory_region_set_dirty: Mark a range of bytes as dirty in a memory region.
>   *
> diff --git a/memory.c b/memory.c
> index bb2b71ee38..0cce3a6fc5 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2027,14 +2027,6 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>      memory_region_transaction_commit();
>  }
>  
> -bool memory_region_get_dirty(MemoryRegion *mr, hwaddr addr,
> -                             hwaddr size, unsigned client)
> -{
> -    assert(mr->ram_block);
> -    return cpu_physical_memory_get_dirty(memory_region_get_ram_addr(mr) + addr,
> -                                         size, client);
> -}
> -
>  void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
>                               hwaddr size)
>  {
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 03/11] memory: Don't set migration bitmap when without migration
  2019-05-08  6:15 ` [Qemu-devel] [PATCH 03/11] memory: Don't set migration bitmap when without migration Peter Xu
@ 2019-05-08 10:09   ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2019-05-08 10:09 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Laurent Vivier, Dr . David Alan Gilbert, Juan Quintela

On 08/05/19 01:15, Peter Xu wrote:
> Similar to 9460dee4b2 ("memory: do not touch code dirty bitmap unless
> TCG is enabled", 2015-06-05) but for the migration bitmap - we can
> skip the MIGRATION bitmap update if migration not enabled.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/memory.h   |  2 ++
>  include/exec/ram_addr.h | 12 +++++++++++-
>  memory.c                |  2 +-
>  3 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index e6140e8a04..f29300c54d 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -46,6 +46,8 @@
>          OBJECT_GET_CLASS(IOMMUMemoryRegionClass, (obj), \
>                           TYPE_IOMMU_MEMORY_REGION)
>  
> +extern bool global_dirty_log;
> +
>  typedef struct MemoryRegionOps MemoryRegionOps;
>  typedef struct MemoryRegionMmio MemoryRegionMmio;
>  
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 993fb760f3..86bc8e1a4a 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -348,8 +348,13 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
>              if (bitmap[k]) {
>                  unsigned long temp = leul_to_cpu(bitmap[k]);
>  
> -                atomic_or(&blocks[DIRTY_MEMORY_MIGRATION][idx][offset], temp);
>                  atomic_or(&blocks[DIRTY_MEMORY_VGA][idx][offset], temp);
> +
> +                if (global_dirty_log) {
> +                    atomic_or(&blocks[DIRTY_MEMORY_MIGRATION][idx][offset],
> +                              temp);
> +                }
> +
>                  if (tcg_enabled()) {
>                      atomic_or(&blocks[DIRTY_MEMORY_CODE][idx][offset], temp);
>                  }
> @@ -366,6 +371,11 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
>          xen_hvm_modified_memory(start, pages << TARGET_PAGE_BITS);
>      } else {
>          uint8_t clients = tcg_enabled() ? DIRTY_CLIENTS_ALL : DIRTY_CLIENTS_NOCODE;
> +
> +        if (!global_dirty_log) {
> +            clients &= ~(1 << DIRTY_MEMORY_MIGRATION);
> +        }
> +
>          /*
>           * bitmap-traveling is faster than memory-traveling (for addr...)
>           * especially when most of the memory is not dirty.
> diff --git a/memory.c b/memory.c
> index 0cce3a6fc5..1cc3531df5 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -38,7 +38,7 @@
>  static unsigned memory_region_transaction_depth;
>  static bool memory_region_update_pending;
>  static bool ioeventfd_update_pending;
> -static bool global_dirty_log = false;
> +bool global_dirty_log;
>  
>  static QTAILQ_HEAD(, MemoryListener) memory_listeners
>      = QTAILQ_HEAD_INITIALIZER(memory_listeners);
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
  2019-05-08 10:09 ` [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Paolo Bonzini
@ 2019-05-08 11:39   ` Peter Xu
  2019-05-08 11:55     ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2019-05-08 11:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Laurent Vivier, qemu-devel, Dr . David Alan Gilbert, Juan Quintela

On Wed, May 08, 2019 at 12:09:00PM +0200, Paolo Bonzini wrote:
> On 08/05/19 01:15, Peter Xu wrote:
> > Design
> > ===================
> > 
> > I started with a naive/stupid design that I always pass all 1's to the
> > KVM for a memory range to clear all the dirty bits within that memory
> > range, but then I encountered guest oops - it's simply because we
> > can't clear any dirty bit from QEMU if we are not _sure_ that the bit
> > is dirty in the kernel.  Otherwise we might accidentally clear a bit
> > that we don't even know of (e.g., the bit was clear in migration's
> > dirty bitmap in QEMU) but actually that page was just being written so
> > QEMU will never remember to migrate that new page again.
> > 
> > The new design is focused on a dirty bitmap cache within the QEMU kvm
> > layer (which is per kvm memory slot).  With that we know what's dirty
> > in the kernel previously (note! the kernel bitmap is still growing all
> > the time so the cache will only be a subset of the realtime kernel
> > bitmap but that's far enough for us) and with that we'll be sure to
> > not accidentally clear unknown dirty pages.
> > 
> > With this method, we can also avoid race when multiple users (e.g.,
> > DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
> > multiple time.  If without the kvm memory slot cached dirty bitmap we
> > won't be able to know which bit has been cleared and then if we send
> > the CLEAR operation upon the same bit twice (or more) we can still
> > face the same issue to clear something accidentally while we
> > shouldn't.
> > 
> > Summary: we really need to be careful on what bit to clear otherwise
> > we can face anything after the migration completes.  And I hope this
> > series has considered all about this.
> 
> The disadvantage of this is that you won't clear in the kernel those
> dirty bits that come from other sources (e.g. vhost or
> address_space_map).  This can lead to double-copying of pages.
> 
> Migration already makes a local copy in rb->bmap, and
> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> new bitmap?

Do you mean to invoke the clear notifications and deliver the bitmap
from the very top of the stack (for VGA it would probably be the
DirtyBitmapSnapshot, for MIGRATION it's the rb->bmap)?

Actually that's what I did in the first version before I posted this
work, but I noticed that there seems to be a race condition in the
design.  The problem is we have multiple copies of the same dirty
bitmap from KVM, and the race can happen with those multiple users
(the users' bitmaps can be merged versions containing KVM and other
sources like vhost, address_space_map, etc., but let's keep it simple
and leave those out for now).  Here's one possible race condition
(assuming migration has started):

  (1) page P is written by the guest (let's version its data as P1);
      its dirty bit D is set in KVM

  (2) QEMU syncs the dirty log, fetches D for page P (which is set).
      D is not cleared in KVM due to the MANUAL_PROTECT cap.

  (3) QEMU copies the D bit to all users of the dirty bmap (MIGRATION
      and VGA, for example).

  (4) MIGRATION code collects bit D in the migration bitmap, clears
      it, and sends CLEAR to KVM to clear the bit on the remote (then
      D in KVM is cleared too), then sends page P with content P1 to
      the destination.  CAUTION: when we reach here the VGA bitmap
      still has the D bit set.

  (5) page P is written by the guest again (let's version its data as
      P2); bit D is set again in KVM.

  (6) VGA code collects bit D in the VGA bitmap (note! we haven't
      synced the log again, so this is the _old_ dirty bit that came
      from step 3 above), clears it, and sends CLEAR to KVM to clear
      the bit on the remote.  Then the D bit in KVM is cleared again.
      At this point the D bit in all bitmaps is cleared (MIGRATION,
      VGA, KVM).

  (7) page P is never written again until migration completes (no
      matter whether we sync again, the D bit stays clear).

  (8) On the destination, page P will contain content P1 rather than
      P2 - this is data loss...

After I noticed it, I added the kvm bitmap layer with the lock to make
sure we won't send the 2nd clear at step (6) above: with the per
memslot bitmap we are able to know that a bit has already been cleared
by one user, so we shouldn't clear it again (as long as we don't SYNC
again).  However, later on I noticed that maybe I don't even need a
top layer bitmap at all, because it doesn't seem to bring much benefit
but instead brings a lot more complexity - I'd need to do bitmap
conversions (because upper layer bitmaps are per ramblock, and kvm
bitmaps are per memslot), not to mention I'd need to AND the two
bitmaps (the upper layer bitmap and the QEMU kvm cached bitmap) to
avoid the above race.  So in the end I simply removed the top layer
bitmap and only kept the per memslot bitmap, which is what this series
does.
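
The invariant ("only clear bits that the last sync reported dirty")
can be shown with a standalone toy model.  A single 64-bit word stands
in for the bitmaps and the names are made up; the real logic is
kvm_physical_log_clear() in patch 10:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t cached;  /* per-memslot cache of last synced bits */
    static uint64_t remote;  /* what the kernel currently has */

    static void clear_dirty(uint64_t request)
    {
        /* Only clear bits we know the kernel reported dirty */
        uint64_t to_clear = request & cached;

        remote &= ~to_clear;  /* stands in for KVM_CLEAR_DIRTY_LOG */
        cached &= ~to_clear;  /* a second caller becomes a no-op */
    }

    int main(void)
    {
        remote = cached = 0x8;  /* page 3 dirty, synced into QEMU */

        clear_dirty(0x8);       /* step 4: MIGRATION clears page 3 */
        remote |= 0x8;          /* step 5: guest dirties page 3 again */
        clear_dirty(0x8);       /* step 6: VGA retries; now a no-op */

        /* Prints: remote=0x8 -- the new dirty bit survives */
        printf("remote=0x%llx\n", (unsigned long long)remote);
        return 0;
    }

Without the "request & cached" guard, the second call would wipe the
freshly set bit and the P2 content would never be migrated, exactly
as in step (8) above.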

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
  2019-05-08 11:39   ` Peter Xu
@ 2019-05-08 11:55     ` Paolo Bonzini
  2019-05-09  2:33       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2019-05-08 11:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Laurent Vivier, qemu-devel, Dr . David Alan Gilbert, Juan Quintela

On 08/05/19 06:39, Peter Xu wrote:
>> The disadvantage of this is that you won't clear in the kernel those
>> dirty bits that come from other sources (e.g. vhost or
>> address_space_map).  This can lead to double-copying of pages.
>>
>> Migration already makes a local copy in rb->bmap, and
>> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
>> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
>> new bitmap?
>
> Actually that's what I did in the first version before I posted this
> work, but I noticed that there seems to be a race condition in the
> design.  The problem is we have multiple copies of the same dirty
> bitmap from KVM, and the race can happen with those multiple users
> (the users' bitmaps can be merged versions containing KVM and other
> sources like vhost, address_space_map, etc., but let's keep it simple
> and leave those out for now).

I see now.  And in fact the same double-copying inefficiency happens
already without this series, so you are improving the situation anyway.

Have you done any kind of benchmarking already?

Thanks,

Paolo

>   (1) page P is written by the guest (let's version its data as P1);
>       its dirty bit D is set in KVM
> 
>   (2) QEMU syncs the dirty log, fetches D for page P (which is set).
>       D is not cleared in KVM due to the MANUAL_PROTECT cap.
> 
>   (3) QEMU copies the D bit to all users of the dirty bmap (MIGRATION
>       and VGA, for example).
> 
>   (4) MIGRATION code collects bit D in the migration bitmap, clears
>       it, and sends CLEAR to KVM to clear the bit on the remote (then
>       D in KVM is cleared too), then sends page P with content P1 to
>       the destination.  CAUTION: when we reach here the VGA bitmap
>       still has the D bit set.
> 
>   (5) page P is written by the guest again (let's version its data as
>       P2); bit D is set again in KVM.
> 
>   (6) VGA code collects bit D in the VGA bitmap (note! we haven't
>       synced the log again, so this is the _old_ dirty bit that came
>       from step 3 above), clears it, and sends CLEAR to KVM to clear
>       the bit on the remote.  Then the D bit in KVM is cleared again.
>       At this point the D bit in all bitmaps is cleared (MIGRATION,
>       VGA, KVM).
> 
>   (7) page P is never written again until migration completes (no
>       matter whether we sync again, the D bit stays clear).
> 
>   (8) On the destination, page P will contain content P1 rather than
>       P2 - this is data loss...


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
  2019-05-08 11:55     ` Paolo Bonzini
@ 2019-05-09  2:33       ` Peter Xu
  2019-05-16  9:26         ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2019-05-09  2:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Laurent Vivier, qemu-devel, Dr . David Alan Gilbert, Juan Quintela

On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> On 08/05/19 06:39, Peter Xu wrote:
> >> The disadvantage of this is that you won't clear in the kernel those
> >> dirty bits that come from other sources (e.g. vhost or
> >> address_space_map).  This can lead to double-copying of pages.
> >>
> >> Migration already makes a local copy in rb->bmap, and
> >> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> >> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> >> new bitmap?
> >
> > Actually that's what I did in the first version before I posted this
> > work, but I noticed that there seems to be a race condition in the
> > design.  The problem is we have multiple copies of the same dirty
> > bitmap from KVM, and the race can happen with those multiple users
> > (the users' bitmaps can be merged versions containing KVM and other
> > sources like vhost, address_space_map, etc., but let's keep it simple
> > and leave those out for now).
> 
> I see now.  And in fact the same double-copying inefficiency happens
> already without this series, so you are improving the situation anyway.
> 
> Have you done any kind of benchmarking already?

Not yet.  I posted the series for some initial review first, before
moving on to performance tests.

My plan for the test scenario would be:

- find a guest with relatively large memory (I would guess memory of
  64G or even more is better, to show a meaningful difference)

- run a random dirty memory workload over most of the memory, with
  dirty rate X Bps.

- set the migration bandwidth to Y Bps (Y should be bigger than X,
  but not by too much; e.g., X=800M and Y=1G emulates a 10G NIC with
  a workload that can still converge with precopy only) and start
  precopy migration.

- measure total migration time with CLEAR_LOG on & off.  With
  CLEAR_LOG we expect: (1) the guest does not hang during log_sync,
  and (2) migration completes faster.

Does the above test plan make sense?

If the QEMU/KVM changes both look OK in general, I can at least try
this on some smaller guests (I can manage ~10G mem guests on my own
hosts, but I can also try to find some bigger ones).
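
As a rough sanity check on convergence for such a setup: with dirty
rate X and bandwidth Y (Y > X), each precopy pass shrinks the
remaining dirty set by roughly a factor of X/Y, so total migration
time should be on the order of mem_size / (Y - X).  A quick sketch of
the estimate (purely illustrative, not from the series):

    #include <stdio.h>

    int main(void)
    {
        double mem = 64.0;  /* guest memory, GB */
        double x = 0.8;     /* dirty rate, GB/s */
        double y = 1.0;     /* migration bandwidth, GB/s */

        /* Geometric series: mem * (1 + x/y + (x/y)^2 + ...) / y */
        printf("approx total time: %.0f s\n", mem / (y - x)); /* 320 s */
        return 0;
    }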

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG
  2019-05-09  2:33       ` Peter Xu
@ 2019-05-16  9:26         ` Peter Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2019-05-16  9:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Laurent Vivier, qemu-devel, Dr . David Alan Gilbert, Juan Quintela

On Thu, May 09, 2019 at 10:33:19AM +0800, Peter Xu wrote:
> On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> > On 08/05/19 06:39, Peter Xu wrote:
> > >> The disadvantage of this is that you won't clear in the kernel those
> > >> dirty bits that come from other sources (e.g. vhost or
> > >> address_space_map).  This can lead to double-copying of pages.
> > >>
> > >> Migration already makes a local copy in rb->bmap, and
> > >> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> > >> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> > >> new bitmap?
> > >
> > > Actually that's what I did in the first version before I posted this
> > > work, but I noticed that there seems to be a race condition in the
> > > design.  The problem is we have multiple copies of the same dirty
> > > bitmap from KVM, and the race can happen with those multiple users
> > > (the users' bitmaps can be merged versions containing KVM and other
> > > sources like vhost, address_space_map, etc., but let's keep it simple
> > > and leave those out for now).
> > 
> > I see now.  And in fact the same double-copying inefficiency happens
> > already without this series, so you are improving the situation anyway.
> > 
> > Have you done any kind of benchmarking already?
> 
> Not yet.  I posted the series for some initial review first, before
> moving on to performance tests.
> 
> My plan for the test scenario would be:
> 
> - find a guest with relatively large memory (I would guess memory of
>   64G or even more is better, to show a meaningful difference)
> 
> - run a random dirty memory workload over most of the memory, with
>   dirty rate X Bps.
> 
> - set the migration bandwidth to Y Bps (Y should be bigger than X,
>   but not by too much; e.g., X=800M and Y=1G emulates a 10G NIC with
>   a workload that can still converge with precopy only) and start
>   precopy migration.
> 
> - measure total migration time with CLEAR_LOG on & off.  With
>   CLEAR_LOG we expect: (1) the guest does not hang during log_sync,
>   and (2) migration completes faster.

Some updates on performance numbers.

Summary: in the ideal case below, total migration time for the same
VM with the same workload is reduced by ~40% or even more.  In other
words, it can be seen as ~40% faster than before.

Test environment: 13G guest, 10G test mem (so I leave 3G untouched),
dirty rate 900MB/s, BW 10Gbps to emulate ixgbe, downtime 100ms.

IO pattern: I pre-fault all 10G of test memory, then do random writes
on it (with the command "mig_mon mm_dirty 10240 900 random" [1]) at a
constant dirty rate (900MB/s, as mentioned).  Then I migrate while
the IOs are running.

Here's the total migration time of the VM (for each scenario I run
the migration 5 times and take the average):

   |--------------+---------------------+-------------|
   | scenario     | migration times (s) | average (s) |
   |--------------+---------------------+-------------|
   | no CLEAR_LOG | 55, 54, 56, 74, 54  |          58 |
   | 1G chunk     | 40, 39, 41, 39, 40  |          40 |
   | 128M chunk   | 38, 40, 37, 40, 38  |          38 |
   | 16M chunk    | 42, 40, 38, 41, 38  |          39 |
   | 1M chunk     | 37, 40, 36, 40, 39  |          38 |
   |--------------+---------------------+-------------|

The first "no CLEAR_LOG" means the master branch which still uses the
GET_DIRTY only.  The latter four scenarios are all with the new
CLEAR_LOG interface, aka, this series.  The test result shows that
128M chunk size seems to be a good default value instead of 1G (which
this series used).  I'll adjust that accordingly when I post the next
version.

[1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c

Regards,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-05-16  9:27 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-08  6:15 [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 01/11] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
2019-05-08 10:09   ` Paolo Bonzini
2019-05-08  6:15 ` [Qemu-devel] [PATCH 02/11] memory: Remove memory_region_get_dirty() Peter Xu
2019-05-08 10:09   ` Paolo Bonzini
2019-05-08  6:15 ` [Qemu-devel] [PATCH 03/11] memory: Don't set migration bitmap when without migration Peter Xu
2019-05-08 10:09   ` Paolo Bonzini
2019-05-08  6:15 ` [Qemu-devel] [PATCH 04/11] bitmap: Add bitmap_copy_with_{src|dst}_offset() Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 05/11] memory: Pass mr into snapshot_and_clear_dirty Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 06/11] memory: Introduce memory listener hook log_clear() Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 07/11] kvm: Update comments for sync_dirty_bitmap Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 08/11] kvm: Persistent per kvmslot dirty bitmap Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 09/11] kvm: Introduce slots lock for memory listener Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 10/11] kvm: Support KVM_CLEAR_DIRTY_LOG Peter Xu
2019-05-08  6:15 ` [Qemu-devel] [PATCH 11/11] migration: Split log_clear() into smaller chunks Peter Xu
2019-05-08 10:09 ` [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Paolo Bonzini
2019-05-08 11:39   ` Peter Xu
2019-05-08 11:55     ` Paolo Bonzini
2019-05-09  2:33       ` Peter Xu
2019-05-16  9:26         ` Peter Xu
