* [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG
@ 2019-05-30  9:29 Peter Xu
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier Peter Xu
                   ` (11 more replies)
  0 siblings, 12 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

This is v3 of the series.  Note that Paolo should have already queued
patch 1, so review can start from patch 2; I just kept it here for
completeness.

v3:
- drop the header update because an identical patch has already been
  merged into master by cohuck
- drop qmp/hmp patches [Paolo]
- comment fixes [Paolo]
- fix misuse of kvm cap names in either strings or commit messages [Paolo]

v2:
- rebase, add r-bs from Paolo
- added a few patches
  - linux-headers: Update to Linux 5.2-rc1
    this is needed because we've got a new cap in kvm
  - checkpatch: Allow SPDX-License-Identifier
    picked up the standalone patch into the series in case it got lost
  - hmp: Expose manual_dirty_log_protect via "info kvm"
    qmp: Expose manual_dirty_log_protect via "query-kvm"
    add interface to detect the new kvm capability
- switched default chunk size to 128M

Performance update is here:

  https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03621.html

Summary
=====================

This series allows QEMU to start using the new KVM_CLEAR_DIRTY_LOG
interface. For more on KVM_CLEAR_DIRTY_LOG itself, please refer to:

  https://github.com/torvalds/linux/blob/master/Documentation/virtual/kvm/api.txt#L3810
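
For quick reference, the ioctl argument is roughly the following
structure (as I read it from the kernel UAPI headers; check
include/uapi/linux/kvm.h of your kernel for the authoritative
definition):

  /* for KVM_CLEAR_DIRTY_LOG */
  struct kvm_clear_dirty_log {
          __u32 slot;        /* memslot id (address space id in the high bits) */
          __u32 num_pages;   /* number of pages to consider */
          __u64 first_page;  /* first page offset within the memslot */
          union {
                  void *dirty_bitmap;  /* bits set here get cleared/re-protected */
                  __u64 padding2;
          };
  };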

The QEMU work (which is this series) is pushed too; please find the
tree here:

  https://github.com/xzpeter/qemu/tree/kvm-clear-dirty-log

Meanwhile, for anyone who really wants to try this out, please also
upgrade the host kernel to Linux 5.2-rc1.
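
To check whether a host kernel is new enough, a small standalone probe
like the one below can be used (my own sketch, not part of the series;
it needs kernel headers from 5.2-rc1 or newer to compile):

  /* Probe for KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 support on this host */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  int main(void)
  {
      int kvm = open("/dev/kvm", O_RDWR);
      if (kvm < 0) {
          perror("open /dev/kvm");
          return 1;
      }
      int cap = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
      printf("manual dirty log protect (v2): %ssupported\n", cap > 0 ? "" : "not ");
      return 0;
  }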

Design
===================

I started with a naive/stupid design where I simply passed all 1's to
KVM for a memory range in order to clear every dirty bit within that
range, but then I hit guest oopses - the reason is simply that we
cannot clear any dirty bit from QEMU unless we are _sure_ that the bit
is dirty in the kernel.  Otherwise we might accidentally clear a bit
that we do not even know about (e.g., the bit was clear in migration's
dirty bitmap in QEMU) while that page was actually just written to, so
QEMU would never remember to migrate that newly dirtied page again.

The new design is focused on a dirty bitmap cache within the QEMU kvm
layer (one per kvm memory slot).  With it we know what was previously
dirty in the kernel (note! the kernel bitmap keeps growing all the
time, so the cache will only ever be a subset of the real-time kernel
bitmap, but that is good enough for us), and with that we can be sure
never to accidentally clear unknown dirty pages.

With this method we can also avoid races when multiple users (e.g.,
DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the same
bit multiple times.  Without the per-kvmslot cached dirty bitmap we
would not know which bits had already been cleared, so if we sent the
CLEAR operation for the same bit twice (or more) we could again clear
something accidentally that we shouldn't.

Summary: we really need to be careful about which bits we clear,
otherwise anything can happen after the migration completes.  I hope
this series has covered all of these cases.
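
To make the invariant concrete, here is a tiny standalone sketch
(plain C with made-up names, not the code of this series): the bits
submitted to CLEAR are always the intersection of the caller's request
with the cached kernel-side bitmap, and the cache forgets them
afterwards so that a second user cannot clear them again:

  /* Sketch: never clear a bit we have not previously seen dirty in the kernel */
  static void compute_bits_to_clear(unsigned long *cached,        /* per-kvmslot cache */
                                    const unsigned long *request, /* caller's request  */
                                    unsigned long *to_clear, long nwords)
  {
      long i;

      for (i = 0; i < nwords; i++) {
          to_clear[i] = request[i] & cached[i]; /* only known-dirty bits    */
          cached[i] &= ~to_clear[i];            /* remember what we cleared */
      }
      /* to_clear[] is what would then be handed to KVM_CLEAR_DIRTY_LOG */
  }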

Besides the new KVM cache layer and the new ioctl support, this series
introduces memory_region_clear_dirty_bitmap() in the memory API layer
to allow clearing the dirty bits of a specific memory range within a
memory region.

Implementations
============================

Patch 2-4:  not directly related to the series, but things I found
            while working on it.  They can be picked up even earlier if
            reviewers are happy with them.

Patch 5:    pre-work on bitmap operations; within the patch I also
            added the first unit test for util/bitmap.c.

Patch 6-7:  the new memory API interface.  It is not functional yet
            because nothing provides log_clear() at this point.  Note
            that this only splits the dirty clear operation out of the
            sync; the clear has not yet been split into smaller chunks,
            so it is not really helpful for us yet.

Patch 8-11: kvm support of KVM_CLEAR_DIRTY_LOG.

Patch 12:   do the log_clear() splitting for the case of migration.
            A new parameter is also introduced to define the block
            size of the small chunks (the unit in which dirty bits are
            cleared); see the sketch below.
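
The chunking itself is conceptually simple; a rough sketch
(hypothetical helper, not the patch's exact code) could look like this:

  /* Sketch: clear the remote dirty bitmap in fixed-size chunks */
  static void clear_bmap_in_chunks(MemoryRegion *mr, hwaddr start, hwaddr len,
                                   hwaddr chunk) /* e.g. 128M by default */
  {
      while (len) {
          hwaddr size = MIN(len, chunk);

          memory_region_clear_dirty_bitmap(mr, start, size);
          start += size;
          len -= size;
      }
  }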

Tests
===========================

- make check
- migrate idle/memory-heavy guests

(Not yet tested with huge guests, but it would be more than welcome if
 someone has the resources and wants to give it a shot.)

Please have a look, thanks.

Peter Xu (12):
  checkpatch: Allow SPDX-License-Identifier
  migration: No need to take rcu during sync_dirty_bitmap
  memory: Remove memory_region_get_dirty()
  memory: Don't set migration bitmap when without migration
  bitmap: Add bitmap_copy_with_{src|dst}_offset()
  memory: Pass mr into snapshot_and_clear_dirty
  memory: Introduce memory listener hook log_clear()
  kvm: Update comments for sync_dirty_bitmap
  kvm: Persistent per kvmslot dirty bitmap
  kvm: Introduce slots lock for memory listener
  kvm: Support KVM_CLEAR_DIRTY_LOG
  migration: Split log_clear() into smaller chunks

 accel/kvm/kvm-all.c      | 287 ++++++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events   |   1 +
 exec.c                   |  15 +-
 include/exec/memory.h    |  36 ++---
 include/exec/ram_addr.h  |  91 ++++++++++++-
 include/qemu/bitmap.h    |   9 ++
 include/sysemu/kvm_int.h |   4 +
 memory.c                 |  64 +++++++--
 migration/migration.c    |   4 +
 migration/migration.h    |  27 ++++
 migration/ram.c          |  45 ++++++
 migration/trace-events   |   1 +
 scripts/checkpatch.pl    |   3 +-
 tests/Makefile.include   |   2 +
 tests/test-bitmap.c      |  81 +++++++++++
 util/bitmap.c            |  73 ++++++++++
 16 files changed, 673 insertions(+), 70 deletions(-)
 create mode 100644 tests/test-bitmap.c

-- 
2.17.1




* [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-31 12:56   ` Juan Quintela
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

According to https://spdx.org/ids-how, let's allow QEMU to keep using
the SPDX license identifier even though it is a C99-style comment:

// SPDX-License-Identifier: ***

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 scripts/checkpatch.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 88682cb0a9..c2aaf421da 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -1949,7 +1949,8 @@ sub process {
 		}
 
 # no C99 // comments
-		if ($line =~ m{//}) {
+		if ($line =~ m{//} &&
+		    $rawline !~ m{// SPDX-License-Identifier: }) {
 			ERROR("do not use C99 // comments\n" . $herecurr);
 		}
 		# Remove C99 comments.
-- 
2.17.1




* [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-31 12:57   ` Juan Quintela
  2019-05-31 12:58   ` Juan Quintela
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 03/12] memory: Remove memory_region_get_dirty() Peter Xu
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

cpu_physical_memory_sync_dirty_bitmap() takes a RAMBlock* parameter,
which means its caller must already hold the RCU read lock.  Taking
the lock again inside it is redundant, so remove that and instead
document the RCU requirement on the functions.
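
For illustration, a caller is now expected to look roughly like the
following (a sketch of the migration call site; see
migration_bitmap_sync() in migration/ram.c for the real code):

    rcu_read_lock();
    /* the RAMBlock "rb" is only valid while the RCU read lock is held */
    migration_bitmap_sync_range(rs, rb, rb->used_length);
    rcu_read_unlock();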

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ram_addr.h | 5 +----
 migration/ram.c         | 1 +
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 139ad79390..6fc49e5db5 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -408,6 +408,7 @@ static inline void cpu_physical_memory_clear_dirty_range(ram_addr_t start,
 }
 
 
+/* Called with RCU critical section */
 static inline
 uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                                                ram_addr_t start,
@@ -431,8 +432,6 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                                         DIRTY_MEMORY_BLOCK_SIZE);
         unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
 
-        rcu_read_lock();
-
         src = atomic_rcu_read(
                 &ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION])->blocks;
 
@@ -452,8 +451,6 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                 idx++;
             }
         }
-
-        rcu_read_unlock();
     } else {
         ram_addr_t offset = rb->offset;
 
diff --git a/migration/ram.c b/migration/ram.c
index 4c60869226..dc916042fb 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1678,6 +1678,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
+/* Called with RCU critical section */
 static void migration_bitmap_sync_range(RAMState *rs, RAMBlock *rb,
                                         ram_addr_t length)
 {
-- 
2.17.1




* [Qemu-devel] [PATCH v3 03/12] memory: Remove memory_region_get_dirty()
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier Peter Xu
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-31 12:59   ` Juan Quintela
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration Peter Xu
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

It's never used anywhere.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 17 -----------------
 memory.c              |  8 --------
 2 files changed, 25 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9144a47f57..e6140e8a04 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1254,23 +1254,6 @@ void memory_region_ram_resize(MemoryRegion *mr, ram_addr_t newsize,
  */
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client);
 
-/**
- * memory_region_get_dirty: Check whether a range of bytes is dirty
- *                          for a specified client.
- *
- * Checks whether a range of bytes has been written to since the last
- * call to memory_region_reset_dirty() with the same @client.  Dirty logging
- * must be enabled.
- *
- * @mr: the memory region being queried.
- * @addr: the address (relative to the start of the region) being queried.
- * @size: the size of the range being queried.
- * @client: the user of the logging information; %DIRTY_MEMORY_MIGRATION or
- *          %DIRTY_MEMORY_VGA.
- */
-bool memory_region_get_dirty(MemoryRegion *mr, hwaddr addr,
-                             hwaddr size, unsigned client);
-
 /**
  * memory_region_set_dirty: Mark a range of bytes as dirty in a memory region.
  *
diff --git a/memory.c b/memory.c
index 3071c4bdad..0920c105aa 100644
--- a/memory.c
+++ b/memory.c
@@ -2027,14 +2027,6 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
     memory_region_transaction_commit();
 }
 
-bool memory_region_get_dirty(MemoryRegion *mr, hwaddr addr,
-                             hwaddr size, unsigned client)
-{
-    assert(mr->ram_block);
-    return cpu_physical_memory_get_dirty(memory_region_get_ram_addr(mr) + addr,
-                                         size, client);
-}
-
 void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                              hwaddr size)
 {
-- 
2.17.1




* [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (2 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 03/12] memory: Remove memory_region_get_dirty() Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-31 13:01   ` Juan Quintela
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset() Peter Xu
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Similar to 9460dee4b2 ("memory: do not touch code dirty bitmap unless
TCG is enabled", 2015-06-05), but for the migration bitmap - we can
skip the MIGRATION bitmap update if migration is not enabled.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h   |  2 ++
 include/exec/ram_addr.h | 12 +++++++++++-
 memory.c                |  2 +-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index e6140e8a04..f29300c54d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -46,6 +46,8 @@
         OBJECT_GET_CLASS(IOMMUMemoryRegionClass, (obj), \
                          TYPE_IOMMU_MEMORY_REGION)
 
+extern bool global_dirty_log;
+
 typedef struct MemoryRegionOps MemoryRegionOps;
 typedef struct MemoryRegionMmio MemoryRegionMmio;
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 6fc49e5db5..79e70a96ee 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -348,8 +348,13 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
             if (bitmap[k]) {
                 unsigned long temp = leul_to_cpu(bitmap[k]);
 
-                atomic_or(&blocks[DIRTY_MEMORY_MIGRATION][idx][offset], temp);
                 atomic_or(&blocks[DIRTY_MEMORY_VGA][idx][offset], temp);
+
+                if (global_dirty_log) {
+                    atomic_or(&blocks[DIRTY_MEMORY_MIGRATION][idx][offset],
+                              temp);
+                }
+
                 if (tcg_enabled()) {
                     atomic_or(&blocks[DIRTY_MEMORY_CODE][idx][offset], temp);
                 }
@@ -366,6 +371,11 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap,
         xen_hvm_modified_memory(start, pages << TARGET_PAGE_BITS);
     } else {
         uint8_t clients = tcg_enabled() ? DIRTY_CLIENTS_ALL : DIRTY_CLIENTS_NOCODE;
+
+        if (!global_dirty_log) {
+            clients &= ~(1 << DIRTY_MEMORY_MIGRATION);
+        }
+
         /*
          * bitmap-traveling is faster than memory-traveling (for addr...)
          * especially when most of the memory is not dirty.
diff --git a/memory.c b/memory.c
index 0920c105aa..cff0ea8f40 100644
--- a/memory.c
+++ b/memory.c
@@ -38,7 +38,7 @@
 static unsigned memory_region_transaction_depth;
 static bool memory_region_update_pending;
 static bool ioeventfd_update_pending;
-static bool global_dirty_log = false;
+bool global_dirty_log;
 
 static QTAILQ_HEAD(, MemoryListener) memory_listeners
     = QTAILQ_HEAD_INITIALIZER(memory_listeners);
-- 
2.17.1




* [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset()
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (3 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 11:05   ` Dr. David Alan Gilbert
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty Peter Xu
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

These helpers copy the source bitmap into the destination bitmap with
an offset applied to either the src or the dst bitmap.

Meanwhile, we have never had bitmap tests, but we should.

This patch therefore also introduces the initial test cases for
util/bitmap.c, though they only cover the newly introduced functions.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/bitmap.h  |  9 +++++
 tests/Makefile.include |  2 ++
 tests/test-bitmap.c    | 81 ++++++++++++++++++++++++++++++++++++++++++
 util/bitmap.c          | 73 +++++++++++++++++++++++++++++++++++++
 4 files changed, 165 insertions(+)
 create mode 100644 tests/test-bitmap.c

diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
index 5c313346b9..cdaa953371 100644
--- a/include/qemu/bitmap.h
+++ b/include/qemu/bitmap.h
@@ -41,6 +41,10 @@
  * bitmap_find_next_zero_area(buf, len, pos, n, mask)	Find bit free area
  * bitmap_to_le(dst, src, nbits)      Convert bitmap to little endian
  * bitmap_from_le(dst, src, nbits)    Convert bitmap from little endian
+ * bitmap_copy_with_src_offset(dst, src, offset, nbits)
+ *                                    *dst = *src (with an offset upon src)
+ * bitmap_copy_with_dst_offset(dst, src, offset, nbits)
+ *                                    *dst = *src (with an offset upon dst)
  */
 
 /*
@@ -271,4 +275,9 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
 void bitmap_from_le(unsigned long *dst, const unsigned long *src,
                     long nbits);
 
+void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
+                                 long offset, long nbits);
+void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits);
+
 #endif /* BITMAP_H */
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 1865f6b322..5e2d7dddff 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -64,6 +64,7 @@ check-unit-y += tests/test-opts-visitor$(EXESUF)
 check-unit-$(CONFIG_BLOCK) += tests/test-coroutine$(EXESUF)
 check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
+check-unit-y += tests/test-bitmap$(EXESUF)
 check-unit-$(CONFIG_BLOCK) += tests/test-aio$(EXESUF)
 check-unit-$(CONFIG_BLOCK) += tests/test-aio-multithread$(EXESUF)
 check-unit-$(CONFIG_BLOCK) += tests/test-throttle$(EXESUF)
@@ -529,6 +530,7 @@ tests/test-image-locking$(EXESUF): tests/test-image-locking.o $(test-block-obj-y
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) $(test-crypto-obj-y)
+tests/test-bitmap$(EXESUF): tests/test-bitmap.o $(test-util-obj-y)
 tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
 tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o migration/page_cache.o $(test-util-obj-y)
 tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o $(test-util-obj-y)
diff --git a/tests/test-bitmap.c b/tests/test-bitmap.c
new file mode 100644
index 0000000000..36b4c07bf2
--- /dev/null
+++ b/tests/test-bitmap.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Bitmap.c unit-tests.
+ *
+ * Copyright (C) 2019, Red Hat, Inc.
+ *
+ * Author: Peter Xu <peterx@redhat.com>
+ */
+
+#include <stdlib.h>
+#include "qemu/osdep.h"
+#include "qemu/bitmap.h"
+
+#define BMAP_SIZE  1024
+
+static void check_bitmap_copy_with_offset(void)
+{
+    int i;
+    unsigned long *bmap1, *bmap2, *bmap3, total;
+
+    bmap1 = bitmap_new(BMAP_SIZE);
+    bmap2 = bitmap_new(BMAP_SIZE);
+    bmap3 = bitmap_new(BMAP_SIZE);
+
+    *bmap1 = random();
+    *(bmap1 + 1) = random();
+    *(bmap1 + 2) = random();
+    *(bmap1 + 3) = random();
+    total = BITS_PER_LONG * 4;
+
+    /* Shift 115 bits into bmap2 */
+    bitmap_copy_with_dst_offset(bmap2, bmap1, 115, total);
+    /* Shift another 85 bits into bmap3 */
+    bitmap_copy_with_dst_offset(bmap3, bmap2, 85, total + 115);
+    /* Shift back 200 bits back */
+    bitmap_copy_with_src_offset(bmap2, bmap3, 200, total);
+
+    for (i = 0; i < 3; i++) {
+        g_assert(*(bmap1 + i) == *(bmap2 + i));
+    }
+
+    bitmap_clear(bmap1, 0, BMAP_SIZE);
+    /* Set bits in bmap1 are 100-245 */
+    bitmap_set(bmap1, 100, 145);
+
+    /* Set bits in bmap2 are 60-205 */
+    bitmap_copy_with_src_offset(bmap2, bmap1, 40, 250);
+    for (i = 0; i < 60; i++) {
+        g_assert(test_bit(i, bmap2) == 0);
+    }
+    for (i = 60; i < 205; i++) {
+        g_assert(test_bit(i, bmap2));
+    }
+    g_assert(test_bit(205, bmap2) == 0);
+
+    /* Set bits in bmap3 are 135-280 */
+    bitmap_copy_with_dst_offset(bmap3, bmap1, 35, 250);
+    for (i = 0; i < 135; i++) {
+        g_assert(test_bit(i, bmap3) == 0);
+    }
+    for (i = 135; i < 280; i++) {
+        g_assert(test_bit(i, bmap3));
+    }
+    g_assert(test_bit(280, bmap3) == 0);
+
+    g_free(bmap1);
+    g_free(bmap2);
+    g_free(bmap3);
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+
+    g_test_add_func("/bitmap/bitmap_copy_with_offset",
+                    check_bitmap_copy_with_offset);
+
+    g_test_run();
+
+    return 0;
+}
diff --git a/util/bitmap.c b/util/bitmap.c
index cb618c65a5..391a7bb744 100644
--- a/util/bitmap.c
+++ b/util/bitmap.c
@@ -402,3 +402,76 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
 {
     bitmap_to_from_le(dst, src, nbits);
 }
+
+/*
+ * Copy "src" bitmap with a positive offset and put it into the "dst"
+ * bitmap.  The caller needs to make sure the bitmap size of "src"
+ * is bigger than (shift + nbits).
+ */
+void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits)
+{
+    unsigned long left_mask, right_mask, last_mask;
+
+    assert(shift >= 0);
+    /* Proper shift src pointer to the first word to copy from */
+    src += BIT_WORD(shift);
+    shift %= BITS_PER_LONG;
+    right_mask = (1ul << shift) - 1;
+    left_mask = ~right_mask;
+
+    while (nbits >= BITS_PER_LONG) {
+        *dst = (*src & left_mask) >> shift;
+        *dst |= (*(src + 1) & right_mask) << (BITS_PER_LONG - shift);
+        dst++;
+        src++;
+        nbits -= BITS_PER_LONG;
+    }
+
+    if (nbits > BITS_PER_LONG - shift) {
+        *dst = (*src & left_mask) >> shift;
+        nbits -= BITS_PER_LONG - shift;
+        last_mask = (1 << nbits) - 1;
+        *dst |= (*(src + 1) & last_mask) << (BITS_PER_LONG - shift);
+    } else if (nbits) {
+        last_mask = (1 << nbits) - 1;
+        *dst = (*src >> shift) & last_mask;
+    }
+}
+
+/*
+ * Copy "src" bitmap into the "dst" bitmap with an offset in the
+ * "dst".  The caller needs to make sure the bitmap size of "dst" is
+ * bigger than (shift + nbits).
+ */
+void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
+                                 long shift, long nbits)
+{
+    unsigned long left_mask, right_mask, last_mask;
+
+    assert(shift >= 0);
+    /* Proper shift src pointer to the first word to copy from */
+    dst += BIT_WORD(shift);
+    shift %= BITS_PER_LONG;
+    right_mask = (1ul << (BITS_PER_LONG - shift)) - 1;
+    left_mask = ~right_mask;
+
+    *dst &= (1ul << shift) - 1;
+    while (nbits >= BITS_PER_LONG) {
+        *dst |= (*src & right_mask) << shift;
+        *(dst + 1) = (*src & left_mask) >> (BITS_PER_LONG - shift);
+        dst++;
+        src++;
+        nbits -= BITS_PER_LONG;
+    }
+
+    if (nbits > BITS_PER_LONG - shift) {
+        *dst |= (*src & right_mask) << shift;
+        nbits -= BITS_PER_LONG - shift;
+        last_mask = ((1 << nbits) - 1) << (BITS_PER_LONG - shift);
+        *(dst + 1) = (*src & last_mask) >> (BITS_PER_LONG - shift);
+    } else if (nbits) {
+        last_mask = (1 << nbits) - 1;
+        *dst |= (*src & last_mask) << shift;
+    }
+}
-- 
2.17.1




* [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (4 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset() Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 11:22   ` Dr. David Alan Gilbert
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 07/12] memory: Introduce memory listener hook log_clear() Peter Xu
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Also change its second parameter to be the offset relative to the
start of the memory region.  This will be used in follow-up patches.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c                  | 3 ++-
 include/exec/ram_addr.h | 2 +-
 memory.c                | 3 +--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index 4e734770c2..2615b4cfed 100644
--- a/exec.c
+++ b/exec.c
@@ -1387,9 +1387,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
 }
 
 DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
-     (ram_addr_t start, ram_addr_t length, unsigned client)
+(MemoryRegion *mr, hwaddr addr, hwaddr length, unsigned client)
 {
     DirtyMemoryBlocks *blocks;
+    ram_addr_t start = memory_region_get_ram_addr(mr) + addr;
     unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL);
     ram_addr_t first = QEMU_ALIGN_DOWN(start, align);
     ram_addr_t last  = QEMU_ALIGN_UP(start + length, align);
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 79e70a96ee..f8ee011d3c 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -403,7 +403,7 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
                                               unsigned client);
 
 DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
-    (ram_addr_t start, ram_addr_t length, unsigned client);
+(MemoryRegion *mr, ram_addr_t start, hwaddr length, unsigned client);
 
 bool cpu_physical_memory_snapshot_get_dirty(DirtyBitmapSnapshot *snap,
                                             ram_addr_t start,
diff --git a/memory.c b/memory.c
index cff0ea8f40..84bba7b65c 100644
--- a/memory.c
+++ b/memory.c
@@ -2071,8 +2071,7 @@ DirtyBitmapSnapshot *memory_region_snapshot_and_clear_dirty(MemoryRegion *mr,
 {
     assert(mr->ram_block);
     memory_region_sync_dirty_bitmap(mr);
-    return cpu_physical_memory_snapshot_and_clear_dirty(
-                memory_region_get_ram_addr(mr) + addr, size, client);
+    return cpu_physical_memory_snapshot_and_clear_dirty(mr, addr, size, client);
 }
 
 bool memory_region_snapshot_get_dirty(MemoryRegion *mr, DirtyBitmapSnapshot *snap,
-- 
2.17.1




* [Qemu-devel] [PATCH v3 07/12] memory: Introduce memory listener hook log_clear()
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (5 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 13:20   ` Dr. David Alan Gilbert
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 08/12] kvm: Update comments for sync_dirty_bitmap Peter Xu
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Introduce a new memory region listener hook log_clear() to allow the
listeners to hook onto the points where the dirty bitmap is cleared by
the bitmap users.

Previously log_sync() contained two operations:

  - dirty bitmap collection, and,
  - dirty bitmap clear on the remote side.

Take KVM as an example - log_sync() for KVM first copies the kernel
dirty bitmap to userspace, and at the same time clears the dirty
bitmap there and re-protects all the guest pages again.

The new log_clear() interface simply splits the old log_sync() into
two separate procedures:

  - use log_sync() to do the dirty bitmap collection only, and,
  - use log_clear() to clear the remote dirty bitmap.

With the new interface, memory listener users can still decide how to
implement the log synchronization procedure, e.g., they can provide
only the log_sync() method and keep both procedures within it (that is
how KVM worked before KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was
introduced).  However, with the new interface, memory listener users
now have the chance to explicitly postpone the log clear operation if
the module supports it.  That can really benefit users like KVM, at
least on host kernels that support KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2.

There are three places that can clear dirty bits in any of the dirty
bitmaps in the ram_list.dirty_memory[3] array:

        cpu_physical_memory_snapshot_and_clear_dirty
        cpu_physical_memory_test_and_clear_dirty
        cpu_physical_memory_sync_dirty_bitmap

Currently we hook directly into each of these functions to deliver the
log_clear() notification.
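
As a rough usage sketch (hypothetical listener, not part of this
patch), a user that wants the split behaviour fills in both hooks,
while one that does not simply keeps the old all-in-one log_sync():

  static void my_log_sync(MemoryListener *l, MemoryRegionSection *s)
  {
      /* collect the dirty bitmap from the remote side only */
  }

  static void my_log_clear(MemoryListener *l, MemoryRegionSection *s)
  {
      /* clear the remote dirty bitmap and re-protect the pages */
  }

  static MemoryListener my_listener = {
      .log_sync  = my_log_sync,
      .log_clear = my_log_clear,  /* optional; omit to keep the old behaviour */
  };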

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c                  | 12 ++++++++++
 include/exec/memory.h   | 17 ++++++++++++++
 include/exec/ram_addr.h |  3 +++
 memory.c                | 51 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 83 insertions(+)

diff --git a/exec.c b/exec.c
index 2615b4cfed..ab595e1e4b 100644
--- a/exec.c
+++ b/exec.c
@@ -1355,6 +1355,8 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
     DirtyMemoryBlocks *blocks;
     unsigned long end, page;
     bool dirty = false;
+    RAMBlock *ramblock;
+    uint64_t mr_offset, mr_size;
 
     if (length == 0) {
         return false;
@@ -1366,6 +1368,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
     rcu_read_lock();
 
     blocks = atomic_rcu_read(&ram_list.dirty_memory[client]);
+    ramblock = qemu_get_ram_block(start);
+    /* Range sanity check on the ramblock */
+    assert(start >= ramblock->offset &&
+           start + length <= ramblock->offset + ramblock->used_length);
 
     while (page < end) {
         unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
@@ -1377,6 +1383,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
         page += num;
     }
 
+    mr_offset = (ram_addr_t)(page << TARGET_PAGE_BITS) - ramblock->offset;
+    mr_size = (end - page) << TARGET_PAGE_BITS;
+    memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
+
     rcu_read_unlock();
 
     if (dirty && tcg_enabled()) {
@@ -1432,6 +1442,8 @@ DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
         tlb_reset_dirty_range_all(start, length);
     }
 
+    memory_region_clear_dirty_bitmap(mr, addr, length);
+
     return snap;
 }
 
diff --git a/include/exec/memory.h b/include/exec/memory.h
index f29300c54d..d752b2a758 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -416,6 +416,7 @@ struct MemoryListener {
     void (*log_stop)(MemoryListener *listener, MemoryRegionSection *section,
                      int old, int new);
     void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
+    void (*log_clear)(MemoryListener *listener, MemoryRegionSection *section);
     void (*log_global_start)(MemoryListener *listener);
     void (*log_global_stop)(MemoryListener *listener);
     void (*eventfd_add)(MemoryListener *listener, MemoryRegionSection *section,
@@ -1269,6 +1270,22 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client);
 void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                              hwaddr size);
 
+/**
+ * memory_region_clear_dirty_bitmap - clear dirty bitmap for memory range
+ *
+ * This function is called when the caller wants to clear the remote
+ * dirty bitmap of a memory range within the memory region.  This can
+ * be used by e.g. KVM to manually clear dirty log when
+ * KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is declared support by the host
+ * kernel.
+ *
+ * @mr:     the memory region to clear the dirty log upon
+ * @start:  start address offset within the memory region
+ * @len:    length of the memory region to clear dirty bitmap
+ */
+void memory_region_clear_dirty_bitmap(MemoryRegion *mr, hwaddr start,
+                                      hwaddr len);
+
 /**
  * memory_region_snapshot_and_clear_dirty: Get a snapshot of the dirty
  *                                         bitmap and clear it.
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index f8ee011d3c..f8930914cd 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -461,6 +461,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
                 idx++;
             }
         }
+
+        /* TODO: split the huge bitmap into smaller chunks */
+        memory_region_clear_dirty_bitmap(rb->mr, start, length);
     } else {
         ram_addr_t offset = rb->offset;
 
diff --git a/memory.c b/memory.c
index 84bba7b65c..a051025dd1 100644
--- a/memory.c
+++ b/memory.c
@@ -2064,6 +2064,57 @@ static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
     }
 }
 
+void memory_region_clear_dirty_bitmap(MemoryRegion *mr, hwaddr start,
+                                      hwaddr len)
+{
+    MemoryRegionSection mrs;
+    MemoryListener *listener;
+    AddressSpace *as;
+    FlatView *view;
+    FlatRange *fr;
+    hwaddr sec_start, sec_end, sec_size;
+
+    QTAILQ_FOREACH(listener, &memory_listeners, link) {
+        if (!listener->log_clear) {
+            continue;
+        }
+        as = listener->address_space;
+        view = address_space_get_flatview(as);
+        FOR_EACH_FLAT_RANGE(fr, view) {
+            if (!fr->dirty_log_mask || fr->mr != mr) {
+                /*
+                 * Clear dirty bitmap operation only applies to those
+                 * regions whose dirty logging is at least enabled
+                 */
+                continue;
+            }
+
+            mrs = section_from_flat_range(fr, view);
+
+            sec_start = MAX(mrs.offset_within_region, start);
+            sec_end = mrs.offset_within_region + int128_get64(mrs.size);
+            sec_end = MIN(sec_end, start + len);
+
+            if (sec_start >= sec_end) {
+                /*
+                 * If this memory region section has no intersection
+                 * with the requested range, skip.
+                 */
+                continue;
+            }
+
+            /* Valid case; shrink the section if needed */
+            mrs.offset_within_address_space +=
+                sec_start - mrs.offset_within_region;
+            mrs.offset_within_region = sec_start;
+            sec_size = sec_end - sec_start;
+            mrs.size = int128_make64(sec_size);
+            listener->log_clear(listener, &mrs);
+        }
+        flatview_unref(view);
+    }
+}
+
 DirtyBitmapSnapshot *memory_region_snapshot_and_clear_dirty(MemoryRegion *mr,
                                                             hwaddr addr,
                                                             hwaddr size,
-- 
2.17.1




* [Qemu-devel] [PATCH v3 08/12] kvm: Update comments for sync_dirty_bitmap
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (6 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 07/12] memory: Introduce memory listener hook log_clear() Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap Peter Xu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

The existing comment is obviously obsolete.  Update it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 524c4ddfbd..b686531586 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -473,13 +473,13 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
 /**
- * kvm_physical_sync_dirty_bitmap - Grab dirty bitmap from kernel space
- * This function updates qemu's dirty bitmap using
- * memory_region_set_dirty().  This means all bits are set
- * to dirty.
+ * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
  *
- * @start_add: start of logged region.
- * @end_addr: end of logged region.
+ * This function will first try to fetch dirty bitmap from the kernel,
+ * and then updates qemu's dirty bitmap.
+ *
+ * @kml: the KVM memory listener object
+ * @section: the memory section to sync the dirty bitmap with
  */
 static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
                                           MemoryRegionSection *section)
-- 
2.17.1




* [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (7 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 08/12] kvm: Update comments for sync_dirty_bitmap Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 13:53   ` Dr. David Alan Gilbert
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener Peter Xu
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

When synchronizing the dirty bitmap from the kernel we do it in a
per-kvmslot fashion, and we allocate the userspace bitmap anew for
each ioctl.  This patch instead makes the bitmap cache persistent so
that we do not need to g_malloc0() it every time.

More importantly, the cached per-kvmslot dirty bitmap will be used
further when we add support for KVM_CLEAR_DIRTY_LOG: it guarantees
that we will not clear any unknown dirty bits, which could otherwise
be a severe data loss issue for the migration code.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 39 +++++++++++++++++++++------------------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b686531586..334c610918 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -497,31 +497,14 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
             return 0;
         }
 
-        /* XXX bad kernel interface alert
-         * For dirty bitmap, kernel allocates array of size aligned to
-         * bits-per-long.  But for case when the kernel is 64bits and
-         * the userspace is 32bits, userspace can't align to the same
-         * bits-per-long, since sizeof(long) is different between kernel
-         * and user space.  This way, userspace will provide buffer which
-         * may be 4 bytes less than the kernel will use, resulting in
-         * userspace memory corruption (which is not detectable by valgrind
-         * too, in most cases).
-         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
-         * a hope that sizeof(long) won't become >8 any time soon.
-         */
-        size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
-                     /*HOST_LONG_BITS*/ 64) / 8;
-        d.dirty_bitmap = g_malloc0(size);
-
+        d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
             DPRINTF("ioctl failed %d\n", errno);
-            g_free(d.dirty_bitmap);
             return -1;
         }
 
         kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
-        g_free(d.dirty_bitmap);
     }
 
     return 0;
@@ -765,6 +748,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     MemoryRegion *mr = section->mr;
     bool writeable = !mr->readonly && !mr->rom_device;
     hwaddr start_addr, size;
+    unsigned long bmap_size;
     void *ram;
 
     if (!memory_region_is_ram(mr)) {
@@ -796,6 +780,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         }
 
         /* unregister the slot */
+        g_free(mem->dirty_bmap);
+        mem->dirty_bmap = NULL;
         mem->memory_size = 0;
         mem->flags = 0;
         err = kvm_set_user_memory_region(kml, mem, false);
@@ -807,12 +793,29 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         return;
     }
 
+    /*
+     * XXX bad kernel interface alert For dirty bitmap, kernel
+     * allocates array of size aligned to bits-per-long.  But for case
+     * when the kernel is 64bits and the userspace is 32bits,
+     * userspace can't align to the same bits-per-long, since
+     * sizeof(long) is different between kernel and user space.  This
+     * way, userspace will provide buffer which may be 4 bytes less
+     * than the kernel will use, resulting in userspace memory
+     * corruption (which is not detectable by valgrind too, in most
+     * cases).  So for now, let's align to 64 instead of
+     * HOST_LONG_BITS here, in a hope that sizeof(long) won't become
+     * >8 any time soon.
+     */
+    bmap_size = ALIGN((size >> TARGET_PAGE_BITS),
+                      /*HOST_LONG_BITS*/ 64) / 8;
+
     /* register the new slot */
     mem = kvm_alloc_slot(kml);
     mem->memory_size = size;
     mem->start_addr = start_addr;
     mem->ram = ram;
     mem->flags = kvm_mem_flags(mr);
+    mem->dirty_bmap = g_malloc0(bmap_size);
 
     err = kvm_set_user_memory_region(kml, mem, true);
     if (err) {
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index f838412491..687a2ee423 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -21,6 +21,8 @@ typedef struct KVMSlot
     int slot;
     int flags;
     int old_flags;
+    /* Dirty bitmap cache for the slot */
+    unsigned long *dirty_bmap;
 } KVMSlot;
 
 typedef struct KVMMemoryListener {
-- 
2.17.1




* [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (8 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 16:40   ` Dr. David Alan Gilbert
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 11/12] kvm: Support KVM_CLEAR_DIRTY_LOG Peter Xu
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks Peter Xu
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Introduce KVMMemoryListener.slots_lock to protect the slots inside the
kvm memory listener.  Currently it is close to useless because all the
KVM code paths are still protected by the BQL, but it will start to
make sense in follow-up patches where we may do the remote dirty
bitmap clear and update the per-slot cached dirty bitmap even without
the BQL.  So let's prepare for that.

We could also use a per-slot lock for the above reason, but that seems
to be overkill.  Let's just use this bigger one (which covers all the
slots of a single address space); it is still much finer grained than
the BQL.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 58 +++++++++++++++++++++++++++++++---------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 334c610918..e687060296 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -138,6 +138,9 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
     KVM_CAP_LAST_INFO
 };
 
+#define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
+#define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_machine->accelerator);
@@ -165,6 +168,7 @@ int kvm_memcrypt_encrypt_data(uint8_t *ptr, uint64_t len)
     return 1;
 }
 
+/* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 {
     KVMState *s = kvm_state;
@@ -182,10 +186,17 @@ static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 bool kvm_has_free_slot(MachineState *ms)
 {
     KVMState *s = KVM_STATE(ms->accelerator);
+    bool result;
+    KVMMemoryListener *kml = &s->memory_listener;
+
+    kvm_slots_lock(kml);
+    result = !!kvm_get_free_slot(kml);
+    kvm_slots_unlock(kml);
 
-    return kvm_get_free_slot(&s->memory_listener);
+    return result;
 }
 
+/* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_alloc_slot(KVMMemoryListener *kml)
 {
     KVMSlot *slot = kvm_get_free_slot(kml);
@@ -244,18 +255,21 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
                                        hwaddr *phys_addr)
 {
     KVMMemoryListener *kml = &s->memory_listener;
-    int i;
+    int i, ret = 0;
 
+    kvm_slots_lock(kml);
     for (i = 0; i < s->nr_slots; i++) {
         KVMSlot *mem = &kml->slots[i];
 
         if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
             *phys_addr = mem->start_addr + (ram - mem->ram);
-            return 1;
+            ret = 1;
+            break;
         }
     }
+    kvm_slots_unlock(kml);
 
-    return 0;
+    return ret;
 }
 
 static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
@@ -391,6 +405,7 @@ static int kvm_mem_flags(MemoryRegion *mr)
     return flags;
 }
 
+/* Called with KVMMemoryListener.slots_lock held */
 static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
                                  MemoryRegion *mr)
 {
@@ -409,19 +424,26 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,
 {
     hwaddr start_addr, size;
     KVMSlot *mem;
+    int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
     if (!size) {
         return 0;
     }
 
+    kvm_slots_lock(kml);
+
     mem = kvm_lookup_matching_slot(kml, start_addr, size);
     if (!mem) {
         /* We don't have a slot if we want to trap every access. */
-        return 0;
+        goto out;
     }
 
-    return kvm_slot_update_flags(kml, mem, section->mr);
+    ret = kvm_slot_update_flags(kml, mem, section->mr);
+
+out:
+    kvm_slots_unlock(kml);
+    return ret;
 }
 
 static void kvm_log_start(MemoryListener *listener,
@@ -478,6 +500,8 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
  * This function will first try to fetch dirty bitmap from the kernel,
  * and then updates qemu's dirty bitmap.
  *
+ * NOTE: caller must be with kml->slots_lock held.
+ *
  * @kml: the KVM memory listener object
  * @section: the memory section to sync the dirty bitmap with
  */
@@ -488,26 +512,28 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
     struct kvm_dirty_log d = {};
     KVMSlot *mem;
     hwaddr start_addr, size;
+    int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
     if (size) {
         mem = kvm_lookup_matching_slot(kml, start_addr, size);
         if (!mem) {
             /* We don't have a slot if we want to trap every access. */
-            return 0;
+            goto out;
         }
 
         d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
             DPRINTF("ioctl failed %d\n", errno);
-            return -1;
+            ret = -1;
+            goto out;
         }
 
         kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
     }
-
-    return 0;
+out:
+    return ret;
 }
 
 static void kvm_coalesce_mmio_region(MemoryListener *listener,
@@ -770,10 +796,12 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
           (start_addr - section->offset_within_address_space);
 
+    kvm_slots_lock(kml);
+
     if (!add) {
         mem = kvm_lookup_matching_slot(kml, start_addr, size);
         if (!mem) {
-            return;
+            goto out;
         }
         if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
             kvm_physical_sync_dirty_bitmap(kml, section);
@@ -790,7 +818,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                     __func__, strerror(-err));
             abort();
         }
-        return;
+        goto out;
     }
 
     /*
@@ -823,6 +851,9 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                 strerror(-err));
         abort();
     }
+
+out:
+    kvm_slots_unlock(kml);
 }
 
 static void kvm_region_add(MemoryListener *listener,
@@ -849,7 +880,9 @@ static void kvm_log_sync(MemoryListener *listener,
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
     int r;
 
+    kvm_slots_lock(kml);
     r = kvm_physical_sync_dirty_bitmap(kml, section);
+    kvm_slots_unlock(kml);
     if (r < 0) {
         abort();
     }
@@ -929,6 +962,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
 {
     int i;
 
+    qemu_mutex_init(&kml->slots_lock);
     kml->slots = g_malloc0(s->nr_slots * sizeof(KVMSlot));
     kml->as_id = as_id;
 
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 687a2ee423..31df465fdc 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -27,6 +27,8 @@ typedef struct KVMSlot
 
 typedef struct KVMMemoryListener {
     MemoryListener listener;
+    /* Protects the slots and all inside them */
+    QemuMutex slots_lock;
     KVMSlot *slots;
     int as_id;
 } KVMMemoryListener;
-- 
2.17.1




* [Qemu-devel] [PATCH v3 11/12] kvm: Support KVM_CLEAR_DIRTY_LOG
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (9 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 17:56   ` Dr. David Alan Gilbert
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks Peter Xu
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

First detect the interface using KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
and mark it.  If enabling the new feature fails, we fall back to the
old sync behaviour.

Provide the log_clear() hook for the memory listeners of both KVM
address spaces (normal system memory, and SMM) and deliver the clear
request to the kernel.
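
The detection step is roughly shaped like this (a sketch with a
hypothetical enable_manual_dirty_log_protect() helper; the real change
sits in kvm_init() in this patch):

  /* Sketch: probe and enable manual dirty log protection at init time */
  if (kvm_check_extension(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2) > 0) {
      /* hypothetical helper wrapping KVM_ENABLE_CAP on the VM fd */
      if (enable_manual_dirty_log_protect(s) == 0) {
          s->manual_dirty_log_protect = true;
      }
      /* on failure manual_dirty_log_protect stays false: old sync path */
  }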

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c    | 180 +++++++++++++++++++++++++++++++++++++++++
 accel/kvm/trace-events |   1 +
 2 files changed, 181 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e687060296..23895a95a2 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -91,6 +91,7 @@ struct KVMState
     int many_ioeventfds;
     int intx_set_mask;
     bool sync_mmu;
+    bool manual_dirty_log_protect;
     /* The man page (and posix) say ioctl numbers are signed int, but
      * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
      * unsigned, and treating them as signed here can break things */
@@ -536,6 +537,157 @@ out:
     return ret;
 }
 
+/* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
+#define KVM_CLEAR_LOG_SHIFT  6
+#define KVM_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << KVM_CLEAR_LOG_SHIFT)
+#define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)
+
+/**
+ * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
+ *
+ * NOTE: this will be a no-op if we haven't enabled manual dirty log
+ * protection in the host kernel because in that case this operation
+ * will be done within log_sync().
+ *
+ * @kml:     the kvm memory listener
+ * @section: the memory range to clear dirty bitmap
+ */
+static int kvm_physical_log_clear(KVMMemoryListener *kml,
+                                  MemoryRegionSection *section)
+{
+    KVMState *s = kvm_state;
+    struct kvm_clear_dirty_log d;
+    uint64_t start, end, bmap_start, start_delta, bmap_npages, size;
+    unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size;
+    KVMSlot *mem = NULL;
+    int ret, i;
+
+    if (!s->manual_dirty_log_protect) {
+        /* No need to do explicit clear */
+        return 0;
+    }
+
+    start = section->offset_within_address_space;
+    size = int128_get64(section->size);
+
+    if (!size) {
+        /* Nothing more we can do... */
+        return 0;
+    }
+
+    kvm_slots_lock(kml);
+
+    /* Find any possible slot that covers the section */
+    for (i = 0; i < s->nr_slots; i++) {
+        mem = &kml->slots[i];
+        if (mem->start_addr <= start &&
+            start + size <= mem->start_addr + mem->memory_size) {
+            break;
+        }
+    }
+
+    /*
+     * We should always find one memslot until this point, otherwise
+     * there could be something wrong from the upper layer
+     */
+    assert(mem && i != s->nr_slots);
+
+    /*
+     * We need to extend either the start or the size or both to
+     * satisfy the KVM interface requirement.  Firstly, do the start
+     * page alignment on 64 host pages
+     */
+    bmap_start = (start - mem->start_addr) & KVM_CLEAR_LOG_MASK;
+    start_delta = start - mem->start_addr - bmap_start;
+    bmap_start /= psize;
+
+    /*
+     * The kernel interface has restriction on the size too, that either:
+     *
+     * (1) the size is 64 host pages aligned (just like the start), or
+     * (2) the size fills up until the end of the KVM memslot.
+     */
+    bmap_npages = DIV_ROUND_UP(size + start_delta, KVM_CLEAR_LOG_ALIGN)
+        << KVM_CLEAR_LOG_SHIFT;
+    end = mem->memory_size / psize;
+    if (bmap_npages > end - bmap_start) {
+        bmap_npages = end - bmap_start;
+    }
+    start_delta /= psize;
+
+    /*
+     * Prepare the bitmap to clear dirty bits.  Here we must guarantee
+     * that we won't clear any unknown dirty bits otherwise we might
+     * accidentally clear some set bits which are not yet synced from
+     * the kernel into QEMU's bitmap, then we'll lose track of the
+     * guest modifications upon those pages (which can directly lead
+     * to guest data loss or panic after migration).
+     *
+     * Layout of the KVMSlot.dirty_bmap:
+     *
+     *                   |<-------- bmap_npages -----------..>|
+     *                                                     [1]
+     *                     start_delta         size
+     *  |----------------|-------------|------------------|------------|
+     *  ^                ^             ^                               ^
+     *  |                |             |                               |
+     * start          bmap_start     (start)                         end
+     * of memslot                                             of memslot
+     *
+     * [1] bmap_npages can be aligned to either 64 pages or the end of slot
+     */
+
+    assert(bmap_start % BITS_PER_LONG == 0);
+    if (start_delta) {
+        /* Slow path - we need to manipulate a temp bitmap */
+        bmap_clear = bitmap_new(bmap_npages);
+        bitmap_copy_with_src_offset(bmap_clear, mem->dirty_bmap,
+                                    bmap_start, start_delta + size / psize);
+        /*
+         * We need to fill the holes at start because that was not
+         * specified by the caller and we extended the bitmap only for
+         * 64 pages alignment
+         */
+        bitmap_clear(bmap_clear, 0, start_delta);
+        d.dirty_bitmap = bmap_clear;
+    } else {
+        /* Fast path - start address aligns well with BITS_PER_LONG */
+        d.dirty_bitmap = mem->dirty_bmap + BIT_WORD(bmap_start);
+    }
+
+    d.first_page = bmap_start;
+    /* It should never overflow.  If it happens, say something */
+    assert(bmap_npages <= UINT32_MAX);
+    d.num_pages = bmap_npages;
+    d.slot = mem->slot | (kml->as_id << 16);
+
+    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
+        ret = -errno;
+        error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
+                     "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
+                     __func__, d.slot, (uint64_t)d.first_page,
+                     (uint32_t)d.num_pages, ret);
+    } else {
+        ret = 0;
+        trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
+    }
+
+    /*
+     * After we have updated the remote dirty bitmap, we update the
+     * cached bitmap as well for the memslot, then if another user
+     * clears the same region we know we shouldn't clear it again on
+     * the remote otherwise it's data loss as well.
+     */
+    bitmap_clear(mem->dirty_bmap, bmap_start + start_delta,
+                 size / psize);
+    /* This handles the NULL case well */
+    g_free(bmap_clear);
+
+    kvm_slots_unlock(kml);
+
+    return ret;
+}
+
 static void kvm_coalesce_mmio_region(MemoryListener *listener,
                                      MemoryRegionSection *secion,
                                      hwaddr start, hwaddr size)
@@ -888,6 +1040,22 @@ static void kvm_log_sync(MemoryListener *listener,
     }
 }
 
+static void kvm_log_clear(MemoryListener *listener,
+                          MemoryRegionSection *section)
+{
+    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    int r;
+
+    r = kvm_physical_log_clear(kml, section);
+    if (r < 0) {
+        error_report_once("%s: kvm log clear failed: mr=%s "
+                          "offset=%"HWADDR_PRIx" size=%"PRIx64, __func__,
+                          section->mr->name, section->offset_within_region,
+                          int128_get64(section->size));
+        abort();
+    }
+}
+
 static void kvm_mem_ioeventfd_add(MemoryListener *listener,
                                   MemoryRegionSection *section,
                                   bool match_data, uint64_t data,
@@ -975,6 +1143,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
     kml->listener.log_start = kvm_log_start;
     kml->listener.log_stop = kvm_log_stop;
     kml->listener.log_sync = kvm_log_sync;
+    kml->listener.log_clear = kvm_log_clear;
     kml->listener.priority = 10;
 
     memory_listener_register(&kml->listener, as);
@@ -1699,6 +1868,17 @@ static int kvm_init(MachineState *ms)
     s->coalesced_pio = s->coalesced_mmio &&
                        kvm_check_extension(s, KVM_CAP_COALESCED_PIO);
 
+    s->manual_dirty_log_protect =
+        kvm_check_extension(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
+    if (s->manual_dirty_log_protect) {
+        ret = kvm_vm_enable_cap(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, 0, 1);
+        if (ret) {
+            warn_report("Trying to enable KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 "
+                        "but failed.  Falling back to the legacy mode. ");
+            s->manual_dirty_log_protect = false;
+        }
+    }
+
 #ifdef KVM_CAP_VCPU_EVENTS
     s->vcpu_events = kvm_check_extension(s, KVM_CAP_VCPU_EVENTS);
 #endif
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 33c5b1b3af..4fb6e59d19 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -15,4 +15,5 @@ kvm_irqchip_release_virq(int virq) "virq %d"
 kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
+kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
 
-- 
2.17.1




* [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks
  2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
                   ` (10 preceding siblings ...)
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 11/12] kvm: Support KVM_CLEAR_DIRTY_LOG Peter Xu
@ 2019-05-30  9:29 ` Peter Xu
  2019-05-30 18:58   ` Dr. David Alan Gilbert
  11 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-05-30  9:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Laurent Vivier, Paolo Bonzini, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Currently we do log_clear() right after log_sync(), which mostly keeps
the old behavior from when log_clear() was still part of log_sync().

This patch further optimizes the migration log_clear() code path by
splitting huge log_clear()s into smaller chunks.

We do this by splitting the whole guest memory region into memory
chunks whose size is decided by MigrationState.clear_bitmap_shift (an
example is given below).  With that, we no longer clear the dirty
bitmap on the remote node (e.g., KVM) when we fetch the dirty bitmap;
instead, we explicitly clear the dirty bitmap for a memory chunk the
first time we send a page in that chunk.

Here comes an example.

Assume the guest has 64G of memory.  Before this patch, the KVM ioctl
KVM_CLEAR_DIRTY_LOG is a single call covering the whole 64G.  After
the patch, assuming a clear bitmap shift of 18, the memory chunk size
on x86_64 is 1UL<<18 * 4K = 1GB.  So instead of sending one big 64G
ioctl, we'll send 64 small ioctls, each covering 1G of guest memory.
Each of the 64 small ioctls is only sent if any page in that chunk is
about to be sent right away.
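
For reference, the arithmetic above can be sketched in a few lines of
C (illustrative only; the numbers mirror the example, the program is
not part of this series):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    const uint64_t page_size = 4096;         /* x86_64 guest page size */
    const uint64_t guest_mem = 64ULL << 30;  /* 64G of guest memory */
    const unsigned shift     = 18;           /* clear_bitmap_shift */

    /* One chunk covers (1 << shift) guest pages */
    uint64_t chunk_size = page_size << shift;        /* 1G per chunk */
    uint64_t max_ioctls = guest_mem / chunk_size;    /* at most 64 ioctls */

    printf("chunk size = %" PRIu64 " bytes, max ioctls = %" PRIu64 "\n",
           chunk_size, max_ioctls);
    return 0;
}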

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/ram_addr.h | 75 +++++++++++++++++++++++++++++++++++++++--
 migration/migration.c   |  4 +++
 migration/migration.h   | 27 +++++++++++++++
 migration/ram.c         | 44 ++++++++++++++++++++++++
 migration/trace-events  |  1 +
 5 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index f8930914cd..c85896f4d5 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -50,8 +50,69 @@ struct RAMBlock {
     unsigned long *unsentmap;
     /* bitmap of already received pages in postcopy */
     unsigned long *receivedmap;
+
+    /*
+     * bitmap of already cleared dirty bitmap.  Set this up to
+     * non-NULL to enable the capability to postpone and split
+     * clearing of dirty bitmap on the remote node (e.g., KVM).  The
+     * bitmap will be set only when doing global sync.
+     *
+     * NOTE: this bitmap is different comparing to the other bitmaps
+     * in that one bit can represent multiple guest pages (which is
+     * decided by the `clear_bmap_shift' variable below).  On
+     * destination side, this should always be NULL, and the variable
+     * `clear_bmap_shift' is meaningless.
+     */
+    unsigned long *clear_bmap;
+    uint8_t clear_bmap_shift;
 };
 
+/**
+ * clear_bmap_size: calculate clear bitmap size
+ *
+ * @pages: number of guest pages
+ * @shift: guest page number shift
+ *
+ * Returns: number of bits for the clear bitmap
+ */
+static inline long clear_bmap_size(uint64_t pages, uint8_t shift)
+{
+    return DIV_ROUND_UP(pages, 1UL << shift);
+}
+
+/**
+ * clear_bmap_set: set clear bitmap for the page range
+ *
+ * @rb: the ramblock to operate on
+ * @start: the start page number
+ * @size: number of pages to set in the bitmap
+ *
+ * Returns: None
+ */
+static inline void clear_bmap_set(RAMBlock *rb, uint64_t start,
+                                  uint64_t npages)
+{
+    uint8_t shift = rb->clear_bmap_shift;
+
+    bitmap_set_atomic(rb->clear_bmap, start >> shift,
+                      clear_bmap_size(npages, shift));
+}
+
+/**
+ * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
+ *
+ * @rb: the ramblock to operate on
+ * @page: the page number to check
+ *
+ * Returns: true if the bit was set, false otherwise
+ */
+static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)
+{
+    uint8_t shift = rb->clear_bmap_shift;
+
+    return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
+}
+
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
 {
     return (b && b->host && offset < b->used_length) ? true : false;
@@ -462,8 +523,18 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
             }
         }
 
-        /* TODO: split the huge bitmap into smaller chunks */
-        memory_region_clear_dirty_bitmap(rb->mr, start, length);
+        if (rb->clear_bmap) {
+            /*
+             * Postpone the dirty bitmap clear to the point before we
+             * really send the pages, also we will split the clear
+             * dirty procedure into smaller chunks.
+             */
+            clear_bmap_set(rb, start >> TARGET_PAGE_BITS,
+                           length >> TARGET_PAGE_BITS);
+        } else {
+            /* Slow path - still do that in a huge chunk */
+            memory_region_clear_dirty_bitmap(rb->mr, start, length);
+        }
     } else {
         ram_addr_t offset = rb->offset;
 
diff --git a/migration/migration.c b/migration/migration.c
index 2865ae3fa9..8a607fe1e2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3362,6 +3362,8 @@ void migration_global_dump(Monitor *mon)
                    ms->send_section_footer ? "on" : "off");
     monitor_printf(mon, "decompress-error-check: %s\n",
                    ms->decompress_error_check ? "on" : "off");
+    monitor_printf(mon, "clear-bitmap-shift: %u\n",
+                   ms->clear_bitmap_shift);
 }
 
 #define DEFINE_PROP_MIG_CAP(name, x)             \
@@ -3376,6 +3378,8 @@ static Property migration_properties[] = {
                      send_section_footer, true),
     DEFINE_PROP_BOOL("decompress-error-check", MigrationState,
                       decompress_error_check, true),
+    DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
+                      clear_bitmap_shift, CLEAR_BITMAP_SHIFT_DEFAULT),
 
     /* Migration parameters */
     DEFINE_PROP_UINT8("x-compress-level", MigrationState,
diff --git a/migration/migration.h b/migration/migration.h
index 780a096857..6e3178d8b2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -27,6 +27,23 @@ struct PostcopyBlocktimeContext;
 
 #define  MIGRATION_RESUME_ACK_VALUE  (1)
 
+/*
+ * 1<<6=64 pages -> 256K chunk when page size is 4K.  This gives us
+ * the benefit that all the chunks are 64 pages aligned then the
+ * bitmaps are always aligned to LONG.
+ */
+#define CLEAR_BITMAP_SHIFT_MIN             6
+/*
+ * 1<<18=256K pages -> 1G chunk when page size is 4K.  This is the
+ * default value to use if no one specified.
+ */
+#define CLEAR_BITMAP_SHIFT_DEFAULT        18
+/*
+ * 1<<31=2G pages -> 8T chunk when page size is 4K.  This should be
+ * big enough and make sure we won't overflow easily.
+ */
+#define CLEAR_BITMAP_SHIFT_MAX            31
+
 /* State for the incoming migration */
 struct MigrationIncomingState {
     QEMUFile *from_src_file;
@@ -233,6 +250,16 @@ struct MigrationState
      * do not trigger spurious decompression errors.
      */
     bool decompress_error_check;
+
+    /*
+     * This decides the size of guest memory chunk that will be used
+     * to track dirty bitmap clearing.  The size of memory chunk will
+     * be GUEST_PAGE_SIZE << N.  Say, N=0 means we will clear dirty
+     * bitmap for each page to send (1<<0=1); N=10 means we will clear
+     * dirty bitmap only once for 1<<10=1K continuous guest pages
+     * (which is in 4M chunk).
+     */
+    uint8_t clear_bitmap_shift;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/ram.c b/migration/ram.c
index dc916042fb..632bf02575 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1668,6 +1668,33 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     bool ret;
 
     qemu_mutex_lock(&rs->bitmap_mutex);
+
+    /*
+     * Clear dirty bitmap if needed.  This _must_ be called before we
+     * send any of the page in the chunk because we need to make sure
+     * we can capture further page content changes when we sync dirty
+     * log the next time.  So as long as we are going to send any of
+     * the page in the chunk we clear the remote dirty bitmap for all.
+     * Clearing it earlier won't be a problem, but too late will.
+     */
+    if (rb->clear_bmap && clear_bmap_test_and_clear(rb, page)) {
+        uint8_t shift = rb->clear_bmap_shift;
+        hwaddr size = 1ULL << (TARGET_PAGE_BITS + shift);
+        hwaddr start = (page << TARGET_PAGE_BITS) & (-size);
+
+        /*
+         * CLEAR_BITMAP_SHIFT_MIN should always guarantee this... this
+         * can make things easier sometimes since then start address
+         * of the small chunk will always be 64 pages aligned so the
+         * bitmap will always be aligned to unsigned long.  We should
+         * even be able to remove this restriction but I'm simply
+         * keeping it.
+         */
+        assert(shift >= 6);
+        trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
+        memory_region_clear_dirty_bitmap(rb->mr, start, size);
+    }
+
     ret = test_and_clear_bit(page, rb->bmap);
 
     if (ret) {
@@ -2685,6 +2712,8 @@ static void ram_save_cleanup(void *opaque)
     memory_global_dirty_log_stop();
 
     RAMBLOCK_FOREACH_NOT_IGNORED(block) {
+        g_free(block->clear_bmap);
+        block->clear_bmap = NULL;
         g_free(block->bmap);
         block->bmap = NULL;
         g_free(block->unsentmap);
@@ -3195,15 +3224,30 @@ static int ram_state_init(RAMState **rsp)
 
 static void ram_list_init_bitmaps(void)
 {
+    MigrationState *ms = migrate_get_current();
     RAMBlock *block;
     unsigned long pages;
+    uint8_t shift;
 
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
+        shift = ms->clear_bitmap_shift;
+        if (shift > CLEAR_BITMAP_SHIFT_MAX) {
+            error_report("clear_bitmap_shift (%u) too big, using "
+                         "max value (%u)", shift, CLEAR_BITMAP_SHIFT_MAX);
+            shift = CLEAR_BITMAP_SHIFT_MAX;
+        } else if (shift < CLEAR_BITMAP_SHIFT_MIN) {
+            error_report("clear_bitmap_shift (%u) too small, using "
+                         "min value (%u)", shift, CLEAR_BITMAP_SHIFT_MIN);
+            shift = CLEAR_BITMAP_SHIFT_MIN;
+        }
+
         RAMBLOCK_FOREACH_NOT_IGNORED(block) {
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            block->clear_bmap_shift = shift;
+            block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
diff --git a/migration/trace-events b/migration/trace-events
index de2e136e57..2c45974330 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -79,6 +79,7 @@ get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_
 get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs, int sent) "%s/0x%" PRIx64 " page_abs=0x%lx (sent=%d)"
 migration_bitmap_sync_start(void) ""
 migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
+migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
 migration_throttle(void) ""
 multifd_recv(uint8_t id, uint64_t packet_num, uint32_t used, uint32_t flags, uint32_t next_packet_size) "channel %d packet number %" PRIu64 " pages %d flags 0x%x next packet size %d"
 multifd_recv_sync_main(long packet_num) "packet num %ld"
-- 
2.17.1




* Re: [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset()
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset() Peter Xu
@ 2019-05-30 11:05   ` Dr. David Alan Gilbert
  2019-05-31  1:45     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 11:05 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> These helpers copy the source bitmap to destination bitmap with a
> shift either on the src or dst bitmap.
> 
> Meanwhile, we never have bitmap tests but we should.
> 
> This patch also introduces the initial test cases for utils/bitmap.c
> but it only tests the newly introduced functions.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/qemu/bitmap.h  |  9 +++++
>  tests/Makefile.include |  2 ++
>  tests/test-bitmap.c    | 81 ++++++++++++++++++++++++++++++++++++++++++
>  util/bitmap.c          | 73 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 165 insertions(+)
>  create mode 100644 tests/test-bitmap.c
> 
> diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
> index 5c313346b9..cdaa953371 100644
> --- a/include/qemu/bitmap.h
> +++ b/include/qemu/bitmap.h
> @@ -41,6 +41,10 @@
>   * bitmap_find_next_zero_area(buf, len, pos, n, mask)	Find bit free area
>   * bitmap_to_le(dst, src, nbits)      Convert bitmap to little endian
>   * bitmap_from_le(dst, src, nbits)    Convert bitmap from little endian
> + * bitmap_copy_with_src_offset(dst, src, offset, nbits)
> + *                                    *dst = *src (with an offset upon src)

'into' would be better than 'upon'

> + * bitmap_copy_with_dst_offset(dst, src, offset, nbits)
> + *                                    *dst = *src (with an offset upon dst)
>   */
>  
>  /*
> @@ -271,4 +275,9 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
>  void bitmap_from_le(unsigned long *dst, const unsigned long *src,
>                      long nbits);
>  
> +void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
> +                                 long offset, long nbits);
> +void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
> +                                 long shift, long nbits);
> +
>  #endif /* BITMAP_H */
> diff --git a/tests/Makefile.include b/tests/Makefile.include
> index 1865f6b322..5e2d7dddff 100644
> --- a/tests/Makefile.include
> +++ b/tests/Makefile.include
> @@ -64,6 +64,7 @@ check-unit-y += tests/test-opts-visitor$(EXESUF)
>  check-unit-$(CONFIG_BLOCK) += tests/test-coroutine$(EXESUF)
>  check-unit-y += tests/test-visitor-serialization$(EXESUF)
>  check-unit-y += tests/test-iov$(EXESUF)
> +check-unit-y += tests/test-bitmap$(EXESUF)
>  check-unit-$(CONFIG_BLOCK) += tests/test-aio$(EXESUF)
>  check-unit-$(CONFIG_BLOCK) += tests/test-aio-multithread$(EXESUF)
>  check-unit-$(CONFIG_BLOCK) += tests/test-throttle$(EXESUF)
> @@ -529,6 +530,7 @@ tests/test-image-locking$(EXESUF): tests/test-image-locking.o $(test-block-obj-y
>  tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
>  tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
>  tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) $(test-crypto-obj-y)
> +tests/test-bitmap$(EXESUF): tests/test-bitmap.o $(test-util-obj-y)
>  tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
>  tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o migration/page_cache.o $(test-util-obj-y)
>  tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o $(test-util-obj-y)
> diff --git a/tests/test-bitmap.c b/tests/test-bitmap.c
> new file mode 100644
> index 0000000000..36b4c07bf2
> --- /dev/null
> +++ b/tests/test-bitmap.c
> @@ -0,0 +1,81 @@
> +// SPDX-License-Identifier: GPL-2.0

We probably shouldn't use GPL-2.0 for new files; most stuff is 2.0-or-later

> +/*
> + * Bitmap.c unit-tests.
> + *
> + * Copyright (C) 2019, Red Hat, Inc.
> + *
> + * Author: Peter Xu <peterx@redhat.com>
> + */
> +
> +#include <stdlib.h>
> +#include "qemu/osdep.h"
> +#include "qemu/bitmap.h"
> +
> +#define BMAP_SIZE  1024
> +
> +static void check_bitmap_copy_with_offset(void)
> +{
> +    int i;
> +    unsigned long *bmap1, *bmap2, *bmap3, total;
> +
> +    bmap1 = bitmap_new(BMAP_SIZE);
> +    bmap2 = bitmap_new(BMAP_SIZE);
> +    bmap3 = bitmap_new(BMAP_SIZE);
> +
> +    *bmap1 = random();
> +    *(bmap1 + 1) = random();
> +    *(bmap1 + 2) = random();
> +    *(bmap1 + 3) = random();
> +    total = BITS_PER_LONG * 4;
> +
> +    /* Shift 115 bits into bmap2 */
> +    bitmap_copy_with_dst_offset(bmap2, bmap1, 115, total);
> +    /* Shift another 85 bits into bmap3 */
> +    bitmap_copy_with_dst_offset(bmap3, bmap2, 85, total + 115);
> +    /* Shift back 200 bits back */
> +    bitmap_copy_with_src_offset(bmap2, bmap3, 200, total);
> +
> +    for (i = 0; i < 3; i++) {
> +        g_assert(*(bmap1 + i) == *(bmap2 + i));
> +    }

can we get rid of these loops with:
       g_assert_cmpmem(bmap1, total / 8, bmap2, total / 8) 

> +    bitmap_clear(bmap1, 0, BMAP_SIZE);
> +    /* Set bits in bmap1 are 100-245 */
> +    bitmap_set(bmap1, 100, 145);
> +
> +    /* Set bits in bmap2 are 60-205 */
> +    bitmap_copy_with_src_offset(bmap2, bmap1, 40, 250);
> +    for (i = 0; i < 60; i++) {
> +        g_assert(test_bit(i, bmap2) == 0);
> +    }

       g_assert_cmpint(find_first_bit(bmap2, 60), ==, 60)

> +    for (i = 60; i < 205; i++) {
> +        g_assert(test_bit(i, bmap2));
> +    }

       g_assert_cmpint(find_next_zero_bit(bmap2, 205, 60), ==, 205) 

> +    g_assert(test_bit(205, bmap2) == 0);
> +
> +    /* Set bits in bmap3 are 135-280 */
> +    bitmap_copy_with_dst_offset(bmap3, bmap1, 35, 250);
> +    for (i = 0; i < 135; i++) {
> +        g_assert(test_bit(i, bmap3) == 0);
> +    }
> +    for (i = 135; i < 280; i++) {
> +        g_assert(test_bit(i, bmap3));
> +    }
> +    g_assert(test_bit(280, bmap3) == 0);
> +
> +    g_free(bmap1);
> +    g_free(bmap2);
> +    g_free(bmap3);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    g_test_init(&argc, &argv, NULL);
> +
> +    g_test_add_func("/bitmap/bitmap_copy_with_offset",
> +                    check_bitmap_copy_with_offset);
> +
> +    g_test_run();
> +
> +    return 0;
> +}
> diff --git a/util/bitmap.c b/util/bitmap.c
> index cb618c65a5..391a7bb744 100644
> --- a/util/bitmap.c
> +++ b/util/bitmap.c
> @@ -402,3 +402,76 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
>  {
>      bitmap_to_from_le(dst, src, nbits);
>  }
> +
> +/*
> + * Copy "src" bitmap with a positive offset and put it into the "dst"
> + * bitmap.  The caller needs to make sure the bitmap size of "src"
> + * is bigger than (shift + nbits).
> + */
> +void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
> +                                 long shift, long nbits)
> +{
> +    unsigned long left_mask, right_mask, last_mask;
> +
> +    assert(shift >= 0);

Just make shift and nbits unsigned long?
(Interestingly bitops.h uses unsigned long everywhere, whereas bitmap
uses long)

> +    /* Proper shift src pointer to the first word to copy from */
> +    src += BIT_WORD(shift);
> +    shift %= BITS_PER_LONG;
> +    right_mask = (1ul << shift) - 1;
> +    left_mask = ~right_mask;
> +
> +    while (nbits >= BITS_PER_LONG) {
> +        *dst = (*src & left_mask) >> shift;
> +        *dst |= (*(src + 1) & right_mask) << (BITS_PER_LONG - shift);

  src[1] ?

Also, perhaps you should only do this line if (right_mask), that way
if the offset is a multiple of BITS_PER_LONG, you don't access the
word after the end of the bitmap.
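
i.e. something along these lines (just a sketch of the guard, not
tested):

    if (right_mask) {
        *dst |= (src[1] & right_mask) << (BITS_PER_LONG - shift);
    }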

> +        dst++;
> +        src++;
> +        nbits -= BITS_PER_LONG;
> +    }
> +
> +    if (nbits > BITS_PER_LONG - shift) {
> +        *dst = (*src & left_mask) >> shift;
> +        nbits -= BITS_PER_LONG - shift;
> +        last_mask = (1 << nbits) - 1;

1ul

> +        *dst |= (*(src + 1) & last_mask) << (BITS_PER_LONG - shift);
> +    } else if (nbits) {
> +        last_mask = (1 << nbits) - 1;

and again

> +        *dst = (*src >> shift) & last_mask;
> +    }
> +}
> +
> +/*
> + * Copy "src" bitmap into the "dst" bitmap with an offset in the
> + * "dst".  The caller needs to make sure the bitmap size of "dst" is
> + * bigger than (shift + nbits).
> + */
> +void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
> +                                 long shift, long nbits)
> +{
> +    unsigned long left_mask, right_mask, last_mask;
> +
> +    assert(shift >= 0);
> +    /* Proper shift src pointer to the first word to copy from */
                       ^^^

dst

> +    dst += BIT_WORD(shift);
> +    shift %= BITS_PER_LONG;
> +    right_mask = (1ul << (BITS_PER_LONG - shift)) - 1;
> +    left_mask = ~right_mask;
> +
> +    *dst &= (1ul << shift) - 1;
> +    while (nbits >= BITS_PER_LONG) {
> +        *dst |= (*src & right_mask) << shift;
> +        *(dst + 1) = (*src & left_mask) >> (BITS_PER_LONG - shift);

dst[1] ?

> +        dst++;
> +        src++;
> +        nbits -= BITS_PER_LONG;
> +    }
> +
> +    if (nbits > BITS_PER_LONG - shift) {
> +        *dst |= (*src & right_mask) << shift;
> +        nbits -= BITS_PER_LONG - shift;
> +        last_mask = ((1 << nbits) - 1) << (BITS_PER_LONG - shift);

1ul

> +        *(dst + 1) = (*src & last_mask) >> (BITS_PER_LONG - shift);
> +    } else if (nbits) {
> +        last_mask = (1 << nbits) - 1;
> +        *dst |= (*src & last_mask) << shift;
> +    }
> +}
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty Peter Xu
@ 2019-05-30 11:22   ` Dr. David Alan Gilbert
  2019-05-31  2:36     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 11:22 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Also we change the 2nd parameter of it to be the relative offset
> within the memory region. This is to be used in follow up patches.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  exec.c                  | 3 ++-
>  include/exec/ram_addr.h | 2 +-
>  memory.c                | 3 +--
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 4e734770c2..2615b4cfed 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1387,9 +1387,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
>  }
>  
>  DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
> -     (ram_addr_t start, ram_addr_t length, unsigned client)
> +(MemoryRegion *mr, hwaddr addr, hwaddr length, unsigned client)

Better to keep some indent?

>  {
>      DirtyMemoryBlocks *blocks;
> +    ram_addr_t start = memory_region_get_ram_addr(mr) + addr;
>      unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL);
>      ram_addr_t first = QEMU_ALIGN_DOWN(start, align);
>      ram_addr_t last  = QEMU_ALIGN_UP(start + length, align);
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 79e70a96ee..f8ee011d3c 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -403,7 +403,7 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
>                                                unsigned client);
>  
>  DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
> -    (ram_addr_t start, ram_addr_t length, unsigned client);
> +(MemoryRegion *mr, ram_addr_t start, hwaddr length, unsigned client);

You've called it 'start' there but 'addr' in the definition.
Either way a comment saying that it's the offset within mr would be good.

Dave

>  
>  bool cpu_physical_memory_snapshot_get_dirty(DirtyBitmapSnapshot *snap,
>                                              ram_addr_t start,
> diff --git a/memory.c b/memory.c
> index cff0ea8f40..84bba7b65c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2071,8 +2071,7 @@ DirtyBitmapSnapshot *memory_region_snapshot_and_clear_dirty(MemoryRegion *mr,
>  {
>      assert(mr->ram_block);
>      memory_region_sync_dirty_bitmap(mr);
> -    return cpu_physical_memory_snapshot_and_clear_dirty(
> -                memory_region_get_ram_addr(mr) + addr, size, client);
> +    return cpu_physical_memory_snapshot_and_clear_dirty(mr, addr, size, client);
>  }
>  
>  bool memory_region_snapshot_get_dirty(MemoryRegion *mr, DirtyBitmapSnapshot *snap,
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v3 07/12] memory: Introduce memory listener hook log_clear()
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 07/12] memory: Introduce memory listener hook log_clear() Peter Xu
@ 2019-05-30 13:20   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 13:20 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Introduce a new memory region listener hook log_clear() to allow the
> listeners to hook onto the points where the dirty bitmap is cleared by
> the bitmap users.
> 
> Previously log_sync() contains two operations:
> 
>   - dirty bitmap collection, and,
>   - dirty bitmap clear on remote site.
> 
> Let's take KVM as example - log_sync() for KVM will first copy the
> kernel dirty bitmap to userspace, and at the same time we'll clear the
> dirty bitmap there along with re-protecting all the guest pages again.
> 
> We add this new log_clear() interface only to split the old log_sync()
> into two separated procedures:
> 
>   - use log_sync() to collect the collection only, and,
>   - use log_clear() to clear the remote dirty bitmap.
> 
> With the new interface, the memory listener users will still be able
> to decide how to implement the log synchronization procedure, e.g.,
> they can still only provide log_sync() method only and put all the two
> procedures within log_sync() (that's how the old KVM works before
> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is introduced).  However with this
> new interface the memory listener users will start to have a chance to
> postpone the log clear operation explicitly if the module supports.
> That can really benefit users like KVM at least for host kernels that
> support KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2.
> 
> There are three places that can clear dirty bits in any one of the
> dirty bitmap in the ram_list.dirty_memory[3] array:
> 
>         cpu_physical_memory_snapshot_and_clear_dirty
>         cpu_physical_memory_test_and_clear_dirty
>         cpu_physical_memory_sync_dirty_bitmap
> 
> Currently we hook directly into each of the functions to notify about
> the log_clear().
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  exec.c                  | 12 ++++++++++
>  include/exec/memory.h   | 17 ++++++++++++++
>  include/exec/ram_addr.h |  3 +++
>  memory.c                | 51 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 83 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index 2615b4cfed..ab595e1e4b 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1355,6 +1355,8 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
>      DirtyMemoryBlocks *blocks;
>      unsigned long end, page;
>      bool dirty = false;
> +    RAMBlock *ramblock;
> +    uint64_t mr_offset, mr_size;
>  
>      if (length == 0) {
>          return false;
> @@ -1366,6 +1368,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
>      rcu_read_lock();
>  
>      blocks = atomic_rcu_read(&ram_list.dirty_memory[client]);
> +    ramblock = qemu_get_ram_block(start);
> +    /* Range sanity check on the ramblock */
> +    assert(start >= ramblock->offset &&
> +           start + length <= ramblock->offset + ramblock->used_length);
>  
>      while (page < end) {
>          unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
> @@ -1377,6 +1383,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
>          page += num;
>      }
>  
> +    mr_offset = (ram_addr_t)(page << TARGET_PAGE_BITS) - ramblock->offset;
> +    mr_size = (end - page) << TARGET_PAGE_BITS;
> +    memory_region_clear_dirty_bitmap(ramblock->mr, mr_offset, mr_size);
> +
>      rcu_read_unlock();
>  
>      if (dirty && tcg_enabled()) {
> @@ -1432,6 +1442,8 @@ DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
>          tlb_reset_dirty_range_all(start, length);
>      }
>  
> +    memory_region_clear_dirty_bitmap(mr, addr, length);
> +
>      return snap;
>  }
>  
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f29300c54d..d752b2a758 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -416,6 +416,7 @@ struct MemoryListener {
>      void (*log_stop)(MemoryListener *listener, MemoryRegionSection *section,
>                       int old, int new);
>      void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
> +    void (*log_clear)(MemoryListener *listener, MemoryRegionSection *section);
>      void (*log_global_start)(MemoryListener *listener);
>      void (*log_global_stop)(MemoryListener *listener);
>      void (*eventfd_add)(MemoryListener *listener, MemoryRegionSection *section,
> @@ -1269,6 +1270,22 @@ void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client);
>  void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
>                               hwaddr size);
>  
> +/**
> + * memory_region_clear_dirty_bitmap - clear dirty bitmap for memory range
> + *
> + * This function is called when the caller wants to clear the remote
> + * dirty bitmap of a memory range within the memory region.  This can
> + * be used by e.g. KVM to manually clear dirty log when
> + * KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is declared support by the host
> + * kernel.
> + *
> + * @mr:     the memory region to clear the dirty log upon
> + * @start:  start address offset within the memory region
> + * @len:    length of the memory region to clear dirty bitmap
> + */
> +void memory_region_clear_dirty_bitmap(MemoryRegion *mr, hwaddr start,
> +                                      hwaddr len);
> +
>  /**
>   * memory_region_snapshot_and_clear_dirty: Get a snapshot of the dirty
>   *                                         bitmap and clear it.
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index f8ee011d3c..f8930914cd 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -461,6 +461,9 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>                  idx++;
>              }
>          }
> +
> +        /* TODO: split the huge bitmap into smaller chunks */
> +        memory_region_clear_dirty_bitmap(rb->mr, start, length);
>      } else {
>          ram_addr_t offset = rb->offset;
>  
> diff --git a/memory.c b/memory.c
> index 84bba7b65c..a051025dd1 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2064,6 +2064,57 @@ static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
>      }
>  }
>  
> +void memory_region_clear_dirty_bitmap(MemoryRegion *mr, hwaddr start,
> +                                      hwaddr len)
> +{
> +    MemoryRegionSection mrs;
> +    MemoryListener *listener;
> +    AddressSpace *as;
> +    FlatView *view;
> +    FlatRange *fr;
> +    hwaddr sec_start, sec_end, sec_size;
> +
> +    QTAILQ_FOREACH(listener, &memory_listeners, link) {
> +        if (!listener->log_clear) {
> +            continue;
> +        }
> +        as = listener->address_space;
> +        view = address_space_get_flatview(as);
> +        FOR_EACH_FLAT_RANGE(fr, view) {
> +            if (!fr->dirty_log_mask || fr->mr != mr) {
> +                /*
> +                 * Clear dirty bitmap operation only applies to those
> +                 * regions whose dirty logging is at least enabled
> +                 */
> +                continue;
> +            }
> +
> +            mrs = section_from_flat_range(fr, view);
> +
> +            sec_start = MAX(mrs.offset_within_region, start);
> +            sec_end = mrs.offset_within_region + int128_get64(mrs.size);
> +            sec_end = MIN(sec_end, start + len);
> +
> +            if (sec_start >= sec_end) {
> +                /*
> +                 * If this memory region section has no intersection
> +                 * with the requested range, skip.
> +                 */
> +                continue;
> +            }
> +
> +            /* Valid case; shrink the section if needed */
> +            mrs.offset_within_address_space +=
> +                sec_start - mrs.offset_within_region;
> +            mrs.offset_within_region = sec_start;
> +            sec_size = sec_end - sec_start;
> +            mrs.size = int128_make64(sec_size);
> +            listener->log_clear(listener, &mrs);
> +        }
> +        flatview_unref(view);
> +    }
> +}
> +
>  DirtyBitmapSnapshot *memory_region_snapshot_and_clear_dirty(MemoryRegion *mr,
>                                                              hwaddr addr,
>                                                              hwaddr size,

I think that's ok (although I'm not entirely sure I've followed all of
the size juggling).

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap Peter Xu
@ 2019-05-30 13:53   ` Dr. David Alan Gilbert
  2019-05-31  2:43     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 13:53 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> When synchronizing dirty bitmap from kernel KVM we do it in a
> per-kvmslot fashion and we allocate the userspace bitmap for each of
> the ioctl.  This patch instead make the bitmap cache be persistent
> then we don't need to g_malloc0() every time.
> 
> More importantly, the cached per-kvmslot dirty bitmap will be further
> used when we want to add support for the KVM_CLEAR_DIRTY_LOG and this
> cached bitmap will be used to guarantee we won't clear any unknown
> dirty bits otherwise that can be a severe data loss issue for
> migration code.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Is there no way to make this get allocated the first time it's needed?
I'm thinking here of the common case where the VM is not being
migrated, so we're allocating this structure for no benefit.
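
Just to illustrate the idea, a rough sketch of what deferring the
allocation to the first sync could look like (the helper name and its
placement are hypothetical, not part of this series):

    /*
     * Hypothetical: called from kvm_physical_sync_dirty_bitmap() before
     * the KVM_GET_DIRTY_LOG ioctl, so the cache only exists once dirty
     * logging is actually in use for the slot.
     */
    static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
    {
        unsigned long bmap_size;

        if (mem->dirty_bmap) {
            return;
        }
        /* Same sizing rule as the patch: align to 64 bits per long */
        bmap_size = ALIGN((mem->memory_size >> TARGET_PAGE_BITS), 64) / 8;
        mem->dirty_bmap = g_malloc0(bmap_size);
    }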

Dave

> ---
>  accel/kvm/kvm-all.c      | 39 +++++++++++++++++++++------------------
>  include/sysemu/kvm_int.h |  2 ++
>  2 files changed, 23 insertions(+), 18 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index b686531586..334c610918 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -497,31 +497,14 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>              return 0;
>          }
>  
> -        /* XXX bad kernel interface alert
> -         * For dirty bitmap, kernel allocates array of size aligned to
> -         * bits-per-long.  But for case when the kernel is 64bits and
> -         * the userspace is 32bits, userspace can't align to the same
> -         * bits-per-long, since sizeof(long) is different between kernel
> -         * and user space.  This way, userspace will provide buffer which
> -         * may be 4 bytes less than the kernel will use, resulting in
> -         * userspace memory corruption (which is not detectable by valgrind
> -         * too, in most cases).
> -         * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
> -         * a hope that sizeof(long) won't become >8 any time soon.
> -         */
> -        size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
> -                     /*HOST_LONG_BITS*/ 64) / 8;
> -        d.dirty_bitmap = g_malloc0(size);
> -
> +        d.dirty_bitmap = mem->dirty_bmap;
>          d.slot = mem->slot | (kml->as_id << 16);
>          if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
>              DPRINTF("ioctl failed %d\n", errno);
> -            g_free(d.dirty_bitmap);
>              return -1;
>          }
>  
>          kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
> -        g_free(d.dirty_bitmap);
>      }
>  
>      return 0;
> @@ -765,6 +748,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>      MemoryRegion *mr = section->mr;
>      bool writeable = !mr->readonly && !mr->rom_device;
>      hwaddr start_addr, size;
> +    unsigned long bmap_size;
>      void *ram;
>  
>      if (!memory_region_is_ram(mr)) {
> @@ -796,6 +780,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>          }
>  
>          /* unregister the slot */
> +        g_free(mem->dirty_bmap);
> +        mem->dirty_bmap = NULL;
>          mem->memory_size = 0;
>          mem->flags = 0;
>          err = kvm_set_user_memory_region(kml, mem, false);
> @@ -807,12 +793,29 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>          return;
>      }
>  
> +    /*
> +     * XXX bad kernel interface alert For dirty bitmap, kernel
> +     * allocates array of size aligned to bits-per-long.  But for case
> +     * when the kernel is 64bits and the userspace is 32bits,
> +     * userspace can't align to the same bits-per-long, since
> +     * sizeof(long) is different between kernel and user space.  This
> +     * way, userspace will provide buffer which may be 4 bytes less
> +     * than the kernel will use, resulting in userspace memory
> +     * corruption (which is not detectable by valgrind too, in most
> +     * cases).  So for now, let's align to 64 instead of
> +     * HOST_LONG_BITS here, in a hope that sizeof(long) won't become
> +     * >8 any time soon.
> +     */
> +    bmap_size = ALIGN((size >> TARGET_PAGE_BITS),
> +                      /*HOST_LONG_BITS*/ 64) / 8;
> +
>      /* register the new slot */
>      mem = kvm_alloc_slot(kml);
>      mem->memory_size = size;
>      mem->start_addr = start_addr;
>      mem->ram = ram;
>      mem->flags = kvm_mem_flags(mr);
> +    mem->dirty_bmap = g_malloc0(bmap_size);
>  
>      err = kvm_set_user_memory_region(kml, mem, true);
>      if (err) {
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index f838412491..687a2ee423 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -21,6 +21,8 @@ typedef struct KVMSlot
>      int slot;
>      int flags;
>      int old_flags;
> +    /* Dirty bitmap cache for the slot */
> +    unsigned long *dirty_bmap;
>  } KVMSlot;
>  
>  typedef struct KVMMemoryListener {
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener Peter Xu
@ 2019-05-30 16:40   ` Dr. David Alan Gilbert
  2019-05-31  2:48     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 16:40 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Introduce KVMMemoryListener.slots_lock to protect the slots inside the
> kvm memory listener.  Currently it is close to useless because all the
> KVM code path now is always protected by the BQL.  But it'll start to
> make sense in follow up patches where we might do remote dirty bitmap
> clear and also we'll update the per-slot cached dirty bitmap even
> without the BQL.  So let's prepare for it.
> 
> We can also use per-slot lock for above reason but it seems to be an
> overkill.  Let's just use this bigger one (which covers all the slots
> of a single address space) but anyway this lock is still much smaller
> than the BQL.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

No one ever frees it? Huh, OK; in that case, if no one ever frees the
listener then I guess you don't need to clean up the lock.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  accel/kvm/kvm-all.c      | 58 +++++++++++++++++++++++++++++++---------
>  include/sysemu/kvm_int.h |  2 ++
>  2 files changed, 48 insertions(+), 12 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 334c610918..e687060296 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -138,6 +138,9 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>      KVM_CAP_LAST_INFO
>  };
>  
> +#define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
> +#define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
> +
>  int kvm_get_max_memslots(void)
>  {
>      KVMState *s = KVM_STATE(current_machine->accelerator);
> @@ -165,6 +168,7 @@ int kvm_memcrypt_encrypt_data(uint8_t *ptr, uint64_t len)
>      return 1;
>  }
>  
> +/* Called with KVMMemoryListener.slots_lock held */
>  static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
>  {
>      KVMState *s = kvm_state;
> @@ -182,10 +186,17 @@ static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
>  bool kvm_has_free_slot(MachineState *ms)
>  {
>      KVMState *s = KVM_STATE(ms->accelerator);
> +    bool result;
> +    KVMMemoryListener *kml = &s->memory_listener;
> +
> +    kvm_slots_lock(kml);
> +    result = !!kvm_get_free_slot(kml);
> +    kvm_slots_unlock(kml);
>  
> -    return kvm_get_free_slot(&s->memory_listener);
> +    return result;
>  }
>  
> +/* Called with KVMMemoryListener.slots_lock held */
>  static KVMSlot *kvm_alloc_slot(KVMMemoryListener *kml)
>  {
>      KVMSlot *slot = kvm_get_free_slot(kml);
> @@ -244,18 +255,21 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
>                                         hwaddr *phys_addr)
>  {
>      KVMMemoryListener *kml = &s->memory_listener;
> -    int i;
> +    int i, ret = 0;
>  
> +    kvm_slots_lock(kml);
>      for (i = 0; i < s->nr_slots; i++) {
>          KVMSlot *mem = &kml->slots[i];
>  
>          if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
>              *phys_addr = mem->start_addr + (ram - mem->ram);
> -            return 1;
> +            ret = 1;
> +            break;
>          }
>      }
> +    kvm_slots_unlock(kml);
>  
> -    return 0;
> +    return ret;
>  }
>  
>  static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
> @@ -391,6 +405,7 @@ static int kvm_mem_flags(MemoryRegion *mr)
>      return flags;
>  }
>  
> +/* Called with KVMMemoryListener.slots_lock held */
>  static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
>                                   MemoryRegion *mr)
>  {
> @@ -409,19 +424,26 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,
>  {
>      hwaddr start_addr, size;
>      KVMSlot *mem;
> +    int ret = 0;
>  
>      size = kvm_align_section(section, &start_addr);
>      if (!size) {
>          return 0;
>      }
>  
> +    kvm_slots_lock(kml);
> +
>      mem = kvm_lookup_matching_slot(kml, start_addr, size);
>      if (!mem) {
>          /* We don't have a slot if we want to trap every access. */
> -        return 0;
> +        goto out;
>      }
>  
> -    return kvm_slot_update_flags(kml, mem, section->mr);
> +    ret = kvm_slot_update_flags(kml, mem, section->mr);
> +
> +out:
> +    kvm_slots_unlock(kml);
> +    return ret;
>  }
>  
>  static void kvm_log_start(MemoryListener *listener,
> @@ -478,6 +500,8 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
>   * This function will first try to fetch dirty bitmap from the kernel,
>   * and then updates qemu's dirty bitmap.
>   *
> + * NOTE: caller must be with kml->slots_lock held.
> + *
>   * @kml: the KVM memory listener object
>   * @section: the memory section to sync the dirty bitmap with
>   */
> @@ -488,26 +512,28 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>      struct kvm_dirty_log d = {};
>      KVMSlot *mem;
>      hwaddr start_addr, size;
> +    int ret = 0;
>  
>      size = kvm_align_section(section, &start_addr);
>      if (size) {
>          mem = kvm_lookup_matching_slot(kml, start_addr, size);
>          if (!mem) {
>              /* We don't have a slot if we want to trap every access. */
> -            return 0;
> +            goto out;
>          }
>  
>          d.dirty_bitmap = mem->dirty_bmap;
>          d.slot = mem->slot | (kml->as_id << 16);
>          if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
>              DPRINTF("ioctl failed %d\n", errno);
> -            return -1;
> +            ret = -1;
> +            goto out;
>          }
>  
>          kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
>      }
> -
> -    return 0;
> +out:
> +    return ret;
>  }
>  
>  static void kvm_coalesce_mmio_region(MemoryListener *listener,
> @@ -770,10 +796,12 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>      ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
>            (start_addr - section->offset_within_address_space);
>  
> +    kvm_slots_lock(kml);
> +
>      if (!add) {
>          mem = kvm_lookup_matching_slot(kml, start_addr, size);
>          if (!mem) {
> -            return;
> +            goto out;
>          }
>          if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>              kvm_physical_sync_dirty_bitmap(kml, section);
> @@ -790,7 +818,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>                      __func__, strerror(-err));
>              abort();
>          }
> -        return;
> +        goto out;
>      }
>  
>      /*
> @@ -823,6 +851,9 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>                  strerror(-err));
>          abort();
>      }
> +
> +out:
> +    kvm_slots_unlock(kml);
>  }
>  
>  static void kvm_region_add(MemoryListener *listener,
> @@ -849,7 +880,9 @@ static void kvm_log_sync(MemoryListener *listener,
>      KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
>      int r;
>  
> +    kvm_slots_lock(kml);
>      r = kvm_physical_sync_dirty_bitmap(kml, section);
> +    kvm_slots_unlock(kml);
>      if (r < 0) {
>          abort();
>      }
> @@ -929,6 +962,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
>  {
>      int i;
>  
> +    qemu_mutex_init(&kml->slots_lock);
>      kml->slots = g_malloc0(s->nr_slots * sizeof(KVMSlot));
>      kml->as_id = as_id;
>  
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 687a2ee423..31df465fdc 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -27,6 +27,8 @@ typedef struct KVMSlot
>  
>  typedef struct KVMMemoryListener {
>      MemoryListener listener;
> +    /* Protects the slots and all inside them */
> +    QemuMutex slots_lock;
>      KVMSlot *slots;
>      int as_id;
>  } KVMMemoryListener;
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH v3 11/12] kvm: Support KVM_CLEAR_DIRTY_LOG
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 11/12] kvm: Support KVM_CLEAR_DIRTY_LOG Peter Xu
@ 2019-05-30 17:56   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 17:56 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Firstly detect the interface using KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
> and mark it.  When failed to enable the new feature we'll fall back to
> the old sync.
> 
> Provide the log_clear() hook for the memory listeners for both address
> spaces of KVM (normal system memory, and SMM) and deliever the clear
> message to kernel.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c    | 180 +++++++++++++++++++++++++++++++++++++++++
>  accel/kvm/trace-events |   1 +
>  2 files changed, 181 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index e687060296..23895a95a2 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -91,6 +91,7 @@ struct KVMState
>      int many_ioeventfds;
>      int intx_set_mask;
>      bool sync_mmu;
> +    bool manual_dirty_log_protect;
>      /* The man page (and posix) say ioctl numbers are signed int, but
>       * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
>       * unsigned, and treating them as signed here can break things */
> @@ -536,6 +537,157 @@ out:
>      return ret;
>  }
>  
> +/* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
> +#define KVM_CLEAR_LOG_SHIFT  6
> +#define KVM_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << KVM_CLEAR_LOG_SHIFT)
> +#define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)
> +
> +/**
> + * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
> + *
> + * NOTE: this will be a no-op if we haven't enabled manual dirty log
> + * protection in the host kernel because in that case this operation
> + * will be done within log_sync().
> + *
> + * @kml:     the kvm memory listener
> + * @section: the memory range to clear dirty bitmap
> + */
> +static int kvm_physical_log_clear(KVMMemoryListener *kml,
> +                                  MemoryRegionSection *section)
> +{
> +    KVMState *s = kvm_state;
> +    struct kvm_clear_dirty_log d;
> +    uint64_t start, end, bmap_start, start_delta, bmap_npages, size;
> +    unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size;
> +    KVMSlot *mem = NULL;
> +    int ret, i;
> +
> +    if (!s->manual_dirty_log_protect) {
> +        /* No need to do explicit clear */
> +        return 0;
> +    }
> +
> +    start = section->offset_within_address_space;
> +    size = int128_get64(section->size);
> +
> +    if (!size) {
> +        /* Nothing more we can do... */
> +        return 0;
> +    }
> +
> +    kvm_slots_lock(kml);
> +
> +    /* Find any possible slot that covers the section */
> +    for (i = 0; i < s->nr_slots; i++) {
> +        mem = &kml->slots[i];
> +        if (mem->start_addr <= start &&
> +            start + size <= mem->start_addr + mem->memory_size) {
> +            break;
> +        }
> +    }
> +
> +    /*
> +     * We should always find one memslot until this point, otherwise
> +     * there could be something wrong from the upper layer
> +     */
> +    assert(mem && i != s->nr_slots);
> +
> +    /*
> +     * We need to extend either the start or the size or both to
> +     * satisfy the KVM interface requirement.  Firstly, do the start
> +     * page alignment on 64 host pages
> +     */
> +    bmap_start = (start - mem->start_addr) & KVM_CLEAR_LOG_MASK;
> +    start_delta = start - mem->start_addr - bmap_start;
> +    bmap_start /= psize;
> +
> +    /*
> +     * The kernel interface has restriction on the size too, that either:
> +     *
> +     * (1) the size is 64 host pages aligned (just like the start), or
> +     * (2) the size fills up until the end of the KVM memslot.
> +     */
> +    bmap_npages = DIV_ROUND_UP(size + start_delta, KVM_CLEAR_LOG_ALIGN)
> +        << KVM_CLEAR_LOG_SHIFT;
> +    end = mem->memory_size / psize;
> +    if (bmap_npages > end - bmap_start) {
> +        bmap_npages = end - bmap_start;
> +    }
> +    start_delta /= psize;
> +
> +    /*
> +     * Prepare the bitmap to clear dirty bits.  Here we must guarantee
> +     * that we won't clear any unknown dirty bits otherwise we might
> +     * accidentally clear some set bits which are not yet synced from
> +     * the kernel into QEMU's bitmap, then we'll lose track of the
> +     * guest modifications upon those pages (which can directly lead
> +     * to guest data loss or panic after migration).
> +     *
> +     * Layout of the KVMSlot.dirty_bmap:
> +     *
> +     *                   |<-------- bmap_npages -----------..>|
> +     *                                                     [1]
> +     *                     start_delta         size
> +     *  |----------------|-------------|------------------|------------|
> +     *  ^                ^             ^                               ^
> +     *  |                |             |                               |
> +     * start          bmap_start     (start)                         end
> +     * of memslot                                             of memslot
> +     *
> +     * [1] bmap_npages can be aligned to either 64 pages or the end of slot
> +     */
> +
> +    assert(bmap_start % BITS_PER_LONG == 0);
> +    if (start_delta) {
> +        /* Slow path - we need to manipulate a temp bitmap */
> +        bmap_clear = bitmap_new(bmap_npages);
> +        bitmap_copy_with_src_offset(bmap_clear, mem->dirty_bmap,
> +                                    bmap_start, start_delta + size / psize);
> +        /*
> +         * We need to fill the holes at start because that was not
> +         * specified by the caller and we extended the bitmap only for
> +         * 64 pages alignment
> +         */
> +        bitmap_clear(bmap_clear, 0, start_delta);
> +        d.dirty_bitmap = bmap_clear;

This is painful, but I guess it's the only way.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> +    } else {
> +        /* Fast path - start address aligns well with BITS_PER_LONG */
> +        d.dirty_bitmap = mem->dirty_bmap + BIT_WORD(bmap_start);
> +    }
> +
> +    d.first_page = bmap_start;
> +    /* It should never overflow.  If it happens, say something */
> +    assert(bmap_npages <= UINT32_MAX);
> +    d.num_pages = bmap_npages;
> +    d.slot = mem->slot | (kml->as_id << 16);
> +
> +    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
> +        ret = -errno;
> +        error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
> +                     "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
> +                     __func__, d.slot, (uint64_t)d.first_page,
> +                     (uint32_t)d.num_pages, ret);
> +    } else {
> +        ret = 0;
> +        trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
> +    }
> +
> +    /*
> +     * After we have updated the remote dirty bitmap, we update the
> +     * cached bitmap as well for the memslot, then if another user
> +     * clears the same region we know we shouldn't clear it again on
> +     * the remote otherwise it's data loss as well.
> +     */
> +    bitmap_clear(mem->dirty_bmap, bmap_start + start_delta,
> +                 size / psize);
> +    /* This handles the NULL case well */
> +    g_free(bmap_clear);
> +
> +    kvm_slots_unlock(kml);
> +
> +    return ret;
> +}
> +
>  static void kvm_coalesce_mmio_region(MemoryListener *listener,
>                                       MemoryRegionSection *secion,
>                                       hwaddr start, hwaddr size)
> @@ -888,6 +1040,22 @@ static void kvm_log_sync(MemoryListener *listener,
>      }
>  }
>  
> +static void kvm_log_clear(MemoryListener *listener,
> +                          MemoryRegionSection *section)
> +{
> +    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
> +    int r;
> +
> +    r = kvm_physical_log_clear(kml, section);
> +    if (r < 0) {
> +        error_report_once("%s: kvm log clear failed: mr=%s "
> +                          "offset=%"HWADDR_PRIx" size=%"PRIx64, __func__,
> +                          section->mr->name, section->offset_within_region,
> +                          int128_get64(section->size));
> +        abort();
> +    }
> +}
> +
>  static void kvm_mem_ioeventfd_add(MemoryListener *listener,
>                                    MemoryRegionSection *section,
>                                    bool match_data, uint64_t data,
> @@ -975,6 +1143,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
>      kml->listener.log_start = kvm_log_start;
>      kml->listener.log_stop = kvm_log_stop;
>      kml->listener.log_sync = kvm_log_sync;
> +    kml->listener.log_clear = kvm_log_clear;
>      kml->listener.priority = 10;
>  
>      memory_listener_register(&kml->listener, as);
> @@ -1699,6 +1868,17 @@ static int kvm_init(MachineState *ms)
>      s->coalesced_pio = s->coalesced_mmio &&
>                         kvm_check_extension(s, KVM_CAP_COALESCED_PIO);
>  
> +    s->manual_dirty_log_protect =
> +        kvm_check_extension(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
> +    if (s->manual_dirty_log_protect) {
> +        ret = kvm_vm_enable_cap(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, 0, 1);
> +        if (ret) {
> +            warn_report("Trying to enable KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 "
> +                        "but failed.  Falling back to the legacy mode. ");
> +            s->manual_dirty_log_protect = false;
> +        }
> +    }
> +
>  #ifdef KVM_CAP_VCPU_EVENTS
>      s->vcpu_events = kvm_check_extension(s, KVM_CAP_VCPU_EVENTS);
>  #endif
> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> index 33c5b1b3af..4fb6e59d19 100644
> --- a/accel/kvm/trace-events
> +++ b/accel/kvm/trace-events
> @@ -15,4 +15,5 @@ kvm_irqchip_release_virq(int virq) "virq %d"
>  kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
> +kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
>  
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks Peter Xu
@ 2019-05-30 18:58   ` Dr. David Alan Gilbert
  2019-05-31  3:05     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Dr. David Alan Gilbert @ 2019-05-30 18:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Currently we are doing log_clear() right after log_sync() which mostly
> keeps the old behavior when log_clear() was still part of log_sync().
> 
> This patch tries to further optimize the migration log_clear() code
> path to split huge log_clear()s into smaller chunks.
> 
> We do this by splitting the whole guest memory region into memory
> chunks, whose size is decided by MigrationState.clear_bitmap_shift (an
> example will be given below).  With that, we don't do the dirty bitmap
> clear operation on the remote node (e.g., KVM) when we fetch the dirty
> bitmap; instead we explicitly clear the dirty bitmap for a memory
> chunk the first time we send a page in that chunk.
> 
> Here comes an example.
> 
> Assuming the guest has 64G memory, then before this patch the KVM
> ioctl KVM_CLEAR_DIRTY_LOG will be a single one covering 64G memory.
> After the patch, assuming the clear bitmap shift is 18, the memory
> chunk size on x86_64 will be 1UL<<18 * 4K = 1GB.  Then instead of
> sending a big 64G ioctl, we'll send 64 small ioctls, each covering 1G
> of guest memory.  Each of the 64 small ioctls is only sent if any page
> in that small chunk is about to be sent right away.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

> ---
>  include/exec/ram_addr.h | 75 +++++++++++++++++++++++++++++++++++++++--
>  migration/migration.c   |  4 +++
>  migration/migration.h   | 27 +++++++++++++++
>  migration/ram.c         | 44 ++++++++++++++++++++++++
>  migration/trace-events  |  1 +
>  5 files changed, 149 insertions(+), 2 deletions(-)
> 
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index f8930914cd..c85896f4d5 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -50,8 +50,69 @@ struct RAMBlock {
>      unsigned long *unsentmap;
>      /* bitmap of already received pages in postcopy */
>      unsigned long *receivedmap;
> +
> +    /*
> +     * bitmap of already cleared dirty bitmap.  Set this up to

Is that description the right way around?  Isn't it set when
we receive dirty info from the kernel and hence it needs clearing
in the future?

> +     * non-NULL to enable the capability to postpone and split
> +     * clearing of dirty bitmap on the remote node (e.g., KVM).  The
> +     * bitmap will be set only when doing global sync.
> +     *
> +     * NOTE: this bitmap is different comparing to the other bitmaps
> +     * in that one bit can represent multiple guest pages (which is
> +     * decided by the `clear_bmap_shift' variable below).  On
> +     * destination side, this should always be NULL, and the variable
> +     * `clear_bmap_shift' is meaningless.
> +     */
> +    unsigned long *clear_bmap;
> +    uint8_t clear_bmap_shift;

Is this just here as a convenience rather than using the copy in
the property?

Dave

>  };
>  
> +/**
> + * clear_bmap_size: calculate clear bitmap size
> + *
> + * @pages: number of guest pages
> + * @shift: guest page number shift
> + *
> + * Returns: number of bits for the clear bitmap
> + */
> +static inline long clear_bmap_size(uint64_t pages, uint8_t shift)
> +{
> +    return DIV_ROUND_UP(pages, 1UL << shift);
> +}
> +
> +/**
> + * clear_bmap_set: set clear bitmap for the page range
> + *
> + * @rb: the ramblock to operate on
> + * @start: the start page number
> + * @size: number of pages to set in the bitmap
> + *
> + * Returns: None
> + */
> +static inline void clear_bmap_set(RAMBlock *rb, uint64_t start,
> +                                  uint64_t npages)
> +{
> +    uint8_t shift = rb->clear_bmap_shift;
> +
> +    bitmap_set_atomic(rb->clear_bmap, start >> shift,
> +                      clear_bmap_size(npages, shift));
> +}
> +
> +/**
> + * clear_bmap_test_and_clear: test clear bitmap for the page, clear if set
> + *
> + * @rb: the ramblock to operate on
> + * @page: the page number to check
> + *
> + * Returns: true if the bit was set, false otherwise
> + */
> +static inline bool clear_bmap_test_and_clear(RAMBlock *rb, uint64_t page)
> +{
> +    uint8_t shift = rb->clear_bmap_shift;
> +
> +    return bitmap_test_and_clear_atomic(rb->clear_bmap, page >> shift, 1);
> +}
> +
>  static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
>  {
>      return (b && b->host && offset < b->used_length) ? true : false;
> @@ -462,8 +523,18 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>              }
>          }
>  
> -        /* TODO: split the huge bitmap into smaller chunks */
> -        memory_region_clear_dirty_bitmap(rb->mr, start, length);
> +        if (rb->clear_bmap) {
> +            /*
> +             * Postpone the dirty bitmap clear to the point before we
> +             * really send the pages, also we will split the clear
> +             * dirty procedure into smaller chunks.
> +             */
> +            clear_bmap_set(rb, start >> TARGET_PAGE_BITS,
> +                           length >> TARGET_PAGE_BITS);
> +        } else {
> +            /* Slow path - still do that in a huge chunk */
> +            memory_region_clear_dirty_bitmap(rb->mr, start, length);
> +        }
>      } else {
>          ram_addr_t offset = rb->offset;
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 2865ae3fa9..8a607fe1e2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3362,6 +3362,8 @@ void migration_global_dump(Monitor *mon)
>                     ms->send_section_footer ? "on" : "off");
>      monitor_printf(mon, "decompress-error-check: %s\n",
>                     ms->decompress_error_check ? "on" : "off");
> +    monitor_printf(mon, "clear-bitmap-shift: %u\n",
> +                   ms->clear_bitmap_shift);
>  }
>  
>  #define DEFINE_PROP_MIG_CAP(name, x)             \
> @@ -3376,6 +3378,8 @@ static Property migration_properties[] = {
>                       send_section_footer, true),
>      DEFINE_PROP_BOOL("decompress-error-check", MigrationState,
>                        decompress_error_check, true),
> +    DEFINE_PROP_UINT8("x-clear-bitmap-shift", MigrationState,
> +                      clear_bitmap_shift, CLEAR_BITMAP_SHIFT_DEFAULT),
>  
>      /* Migration parameters */
>      DEFINE_PROP_UINT8("x-compress-level", MigrationState,
> diff --git a/migration/migration.h b/migration/migration.h
> index 780a096857..6e3178d8b2 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -27,6 +27,23 @@ struct PostcopyBlocktimeContext;
>  
>  #define  MIGRATION_RESUME_ACK_VALUE  (1)
>  
> +/*
> + * 1<<6=64 pages -> 256K chunk when page size is 4K.  This gives us
> + * the benefit that all the chunks are 64 pages aligned then the
> + * bitmaps are always aligned to LONG.
> + */
> +#define CLEAR_BITMAP_SHIFT_MIN             6
> +/*
> + * 1<<18=256K pages -> 1G chunk when page size is 4K.  This is the
> + * default value to use if no one specified.
> + */
> +#define CLEAR_BITMAP_SHIFT_DEFAULT        18
> +/*
> + * 1<<31=2G pages -> 8T chunk when page size is 4K.  This should be
> + * big enough and make sure we won't overflow easily.
> + */
> +#define CLEAR_BITMAP_SHIFT_MAX            31
> +
>  /* State for the incoming migration */
>  struct MigrationIncomingState {
>      QEMUFile *from_src_file;
> @@ -233,6 +250,16 @@ struct MigrationState
>       * do not trigger spurious decompression errors.
>       */
>      bool decompress_error_check;
> +
> +    /*
> +     * This decides the size of guest memory chunk that will be used
> +     * to track dirty bitmap clearing.  The size of memory chunk will
> +     * be GUEST_PAGE_SIZE << N.  Say, N=0 means we will clear dirty
> +     * bitmap for each page to send (1<<0=1); N=10 means we will clear
> +     * dirty bitmap only once for 1<<10=1K continuous guest pages
> +     * (which is in 4M chunk).
> +     */
> +    uint8_t clear_bitmap_shift;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
> diff --git a/migration/ram.c b/migration/ram.c
> index dc916042fb..632bf02575 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1668,6 +1668,33 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
>      bool ret;
>  
>      qemu_mutex_lock(&rs->bitmap_mutex);
> +
> +    /*
> +     * Clear dirty bitmap if needed.  This _must_ be called before we
> +     * send any of the page in the chunk because we need to make sure
> +     * we can capture further page content changes when we sync dirty
> +     * log the next time.  So as long as we are going to send any of
> +     * the page in the chunk we clear the remote dirty bitmap for all.
> +     * Clearing it earlier won't be a problem, but too late will.
> +     */
> +    if (rb->clear_bmap && clear_bmap_test_and_clear(rb, page)) {
> +        uint8_t shift = rb->clear_bmap_shift;
> +        hwaddr size = 1ULL << (TARGET_PAGE_BITS + shift);
> +        hwaddr start = (page << TARGET_PAGE_BITS) & (-size);
> +
> +        /*
> +         * CLEAR_BITMAP_SHIFT_MIN should always guarantee this... this
> +         * can make things easier sometimes since then start address
> +         * of the small chunk will always be 64 pages aligned so the
> +         * bitmap will always be aligned to unsigned long.  We should
> +         * even be able to remove this restriction but I'm simply
> +         * keeping it.
> +         */
> +        assert(shift >= 6);
> +        trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
> +        memory_region_clear_dirty_bitmap(rb->mr, start, size);
> +    }
> +
>      ret = test_and_clear_bit(page, rb->bmap);
>  
>      if (ret) {
> @@ -2685,6 +2712,8 @@ static void ram_save_cleanup(void *opaque)
>      memory_global_dirty_log_stop();
>  
>      RAMBLOCK_FOREACH_NOT_IGNORED(block) {
> +        g_free(block->clear_bmap);
> +        block->clear_bmap = NULL;
>          g_free(block->bmap);
>          block->bmap = NULL;
>          g_free(block->unsentmap);
> @@ -3195,15 +3224,30 @@ static int ram_state_init(RAMState **rsp)
>  
>  static void ram_list_init_bitmaps(void)
>  {
> +    MigrationState *ms = migrate_get_current();
>      RAMBlock *block;
>      unsigned long pages;
> +    uint8_t shift;
>  
>      /* Skip setting bitmap if there is no RAM */
>      if (ram_bytes_total()) {
> +        shift = ms->clear_bitmap_shift;
> +        if (shift > CLEAR_BITMAP_SHIFT_MAX) {
> +            error_report("clear_bitmap_shift (%u) too big, using "
> +                         "max value (%u)", shift, CLEAR_BITMAP_SHIFT_MAX);
> +            shift = CLEAR_BITMAP_SHIFT_MAX;
> +        } else if (shift < CLEAR_BITMAP_SHIFT_MIN) {
> +            error_report("clear_bitmap_shift (%u) too small, using "
> +                         "min value (%u)", shift, CLEAR_BITMAP_SHIFT_MIN);
> +            shift = CLEAR_BITMAP_SHIFT_MIN;
> +        }
> +
>          RAMBLOCK_FOREACH_NOT_IGNORED(block) {
>              pages = block->max_length >> TARGET_PAGE_BITS;
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            block->clear_bmap_shift = shift;
> +            block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>              if (migrate_postcopy_ram()) {
>                  block->unsentmap = bitmap_new(pages);
>                  bitmap_set(block->unsentmap, 0, pages);
> diff --git a/migration/trace-events b/migration/trace-events
> index de2e136e57..2c45974330 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -79,6 +79,7 @@ get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_
>  get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, unsigned long page_abs, int sent) "%s/0x%" PRIx64 " page_abs=0x%lx (sent=%d)"
>  migration_bitmap_sync_start(void) ""
>  migration_bitmap_sync_end(uint64_t dirty_pages) "dirty_pages %" PRIu64
> +migration_bitmap_clear_dirty(char *str, uint64_t start, uint64_t size, unsigned long page) "rb %s start 0x%"PRIx64" size 0x%"PRIx64" page 0x%lx"
>  migration_throttle(void) ""
>  multifd_recv(uint8_t id, uint64_t packet_num, uint32_t used, uint32_t flags, uint32_t next_packet_size) "channel %d packet number %" PRIu64 " pages %d flags 0x%x next packet size %d"
>  multifd_recv_sync_main(long packet_num) "packet num %ld"
> -- 
> 2.17.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset()
  2019-05-30 11:05   ` Dr. David Alan Gilbert
@ 2019-05-31  1:45     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-31  1:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

On Thu, May 30, 2019 at 12:05:27PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > These helpers copy the source bitmap to the destination bitmap with a
> > shift applied to either the src or the dst bitmap.
> > 
> > Meanwhile, we've never had bitmap tests, but we should.
> > 
> > This patch also introduces the initial test cases for utils/bitmap.c
> > but it only tests the newly introduced functions.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/qemu/bitmap.h  |  9 +++++
> >  tests/Makefile.include |  2 ++
> >  tests/test-bitmap.c    | 81 ++++++++++++++++++++++++++++++++++++++++++
> >  util/bitmap.c          | 73 +++++++++++++++++++++++++++++++++++++
> >  4 files changed, 165 insertions(+)
> >  create mode 100644 tests/test-bitmap.c
> > 
> > diff --git a/include/qemu/bitmap.h b/include/qemu/bitmap.h
> > index 5c313346b9..cdaa953371 100644
> > --- a/include/qemu/bitmap.h
> > +++ b/include/qemu/bitmap.h
> > @@ -41,6 +41,10 @@
> >   * bitmap_find_next_zero_area(buf, len, pos, n, mask)	Find bit free area
> >   * bitmap_to_le(dst, src, nbits)      Convert bitmap to little endian
> >   * bitmap_from_le(dst, src, nbits)    Convert bitmap from little endian
> > + * bitmap_copy_with_src_offset(dst, src, offset, nbits)
> > + *                                    *dst = *src (with an offset upon src)
> 
> 'into' would be better than 'upon'

Ok.

> 
> > + * bitmap_copy_with_dst_offset(dst, src, offset, nbits)
> > + *                                    *dst = *src (with an offset upon dst)
> >   */
> >  
> >  /*
> > @@ -271,4 +275,9 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
> >  void bitmap_from_le(unsigned long *dst, const unsigned long *src,
> >                      long nbits);
> >  
> > +void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
> > +                                 long offset, long nbits);
> > +void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
> > +                                 long shift, long nbits);
> > +
> >  #endif /* BITMAP_H */
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index 1865f6b322..5e2d7dddff 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -64,6 +64,7 @@ check-unit-y += tests/test-opts-visitor$(EXESUF)
> >  check-unit-$(CONFIG_BLOCK) += tests/test-coroutine$(EXESUF)
> >  check-unit-y += tests/test-visitor-serialization$(EXESUF)
> >  check-unit-y += tests/test-iov$(EXESUF)
> > +check-unit-y += tests/test-bitmap$(EXESUF)
> >  check-unit-$(CONFIG_BLOCK) += tests/test-aio$(EXESUF)
> >  check-unit-$(CONFIG_BLOCK) += tests/test-aio-multithread$(EXESUF)
> >  check-unit-$(CONFIG_BLOCK) += tests/test-throttle$(EXESUF)
> > @@ -529,6 +530,7 @@ tests/test-image-locking$(EXESUF): tests/test-image-locking.o $(test-block-obj-y
> >  tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
> >  tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
> >  tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) $(test-crypto-obj-y)
> > +tests/test-bitmap$(EXESUF): tests/test-bitmap.o $(test-util-obj-y)
> >  tests/test-x86-cpuid$(EXESUF): tests/test-x86-cpuid.o
> >  tests/test-xbzrle$(EXESUF): tests/test-xbzrle.o migration/xbzrle.o migration/page_cache.o $(test-util-obj-y)
> >  tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o $(test-util-obj-y)
> > diff --git a/tests/test-bitmap.c b/tests/test-bitmap.c
> > new file mode 100644
> > index 0000000000..36b4c07bf2
> > --- /dev/null
> > +++ b/tests/test-bitmap.c
> > @@ -0,0 +1,81 @@
> > +// SPDX-License-Identifier: GPL-2.0
> 
> We probably shouldn't use GPL-2.0 for new code; most stuff is 2.0-or-later

Ok.

> 
> > +/*
> > + * Bitmap.c unit-tests.
> > + *
> > + * Copyright (C) 2019, Red Hat, Inc.
> > + *
> > + * Author: Peter Xu <peterx@redhat.com>
> > + */
> > +
> > +#include <stdlib.h>
> > +#include "qemu/osdep.h"
> > +#include "qemu/bitmap.h"
> > +
> > +#define BMAP_SIZE  1024
> > +
> > +static void check_bitmap_copy_with_offset(void)
> > +{
> > +    int i;
> > +    unsigned long *bmap1, *bmap2, *bmap3, total;
> > +
> > +    bmap1 = bitmap_new(BMAP_SIZE);
> > +    bmap2 = bitmap_new(BMAP_SIZE);
> > +    bmap3 = bitmap_new(BMAP_SIZE);
> > +
> > +    *bmap1 = random();
> > +    *(bmap1 + 1) = random();
> > +    *(bmap1 + 2) = random();
> > +    *(bmap1 + 3) = random();
> > +    total = BITS_PER_LONG * 4;
> > +
> > +    /* Shift 115 bits into bmap2 */
> > +    bitmap_copy_with_dst_offset(bmap2, bmap1, 115, total);
> > +    /* Shift another 85 bits into bmap3 */
> > +    bitmap_copy_with_dst_offset(bmap3, bmap2, 85, total + 115);
> > +    /* Shift back 200 bits back */
> > +    bitmap_copy_with_src_offset(bmap2, bmap3, 200, total);
> > +
> > +    for (i = 0; i < 3; i++) {
> > +        g_assert(*(bmap1 + i) == *(bmap2 + i));
> > +    }
> 
> can we get rid of these loops with:
>        g_assert_cmpmem(bmap1, total / 8, bmap2, total / 8)

Yes.  I'm using BITS_PER_LONG instead of 8 though, to suit 32-bit hosts.
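
Something like this (a rough sketch of what I mean, not the final code;
the length is in bytes, derived from the number of longs touched):

    g_assert_cmpmem(bmap1, total / BITS_PER_LONG * sizeof(unsigned long),
                    bmap2, total / BITS_PER_LONG * sizeof(unsigned long));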

> 
> > +    bitmap_clear(bmap1, 0, BMAP_SIZE);
> > +    /* Set bits in bmap1 are 100-245 */
> > +    bitmap_set(bmap1, 100, 145);
> > +
> > +    /* Set bits in bmap2 are 60-205 */
> > +    bitmap_copy_with_src_offset(bmap2, bmap1, 40, 250);
> > +    for (i = 0; i < 60; i++) {
> > +        g_assert(test_bit(i, bmap2) == 0);
> > +    }
> 
>        g_assert_cmpint(find_first_bit(bmap2, 60), ==, 60)

Fixed.

> 
> > +    for (i = 60; i < 205; i++) {
> > +        g_assert(test_bit(i, bmap2));
> > +    }
> 
>        g_assert_cmpint(find_next_zero_bit(bmap2, 205, 60), ==, 205)

Fixed.

> 
> > +    g_assert(test_bit(205, bmap2) == 0);
> > +
> > +    /* Set bits in bmap3 are 135-280 */
> > +    bitmap_copy_with_dst_offset(bmap3, bmap1, 35, 250);
> > +    for (i = 0; i < 135; i++) {
> > +        g_assert(test_bit(i, bmap3) == 0);
> > +    }
> > +    for (i = 135; i < 280; i++) {
> > +        g_assert(test_bit(i, bmap3));
> > +    }
> > +    g_assert(test_bit(280, bmap3) == 0);
> > +
> > +    g_free(bmap1);
> > +    g_free(bmap2);
> > +    g_free(bmap3);
> > +}
> > +
> > +int main(int argc, char **argv)
> > +{
> > +    g_test_init(&argc, &argv, NULL);
> > +
> > +    g_test_add_func("/bitmap/bitmap_copy_with_offset",
> > +                    check_bitmap_copy_with_offset);
> > +
> > +    g_test_run();
> > +
> > +    return 0;
> > +}
> > diff --git a/util/bitmap.c b/util/bitmap.c
> > index cb618c65a5..391a7bb744 100644
> > --- a/util/bitmap.c
> > +++ b/util/bitmap.c
> > @@ -402,3 +402,76 @@ void bitmap_to_le(unsigned long *dst, const unsigned long *src,
> >  {
> >      bitmap_to_from_le(dst, src, nbits);
> >  }
> > +
> > +/*
> > + * Copy "src" bitmap with a positive offset and put it into the "dst"
> > + * bitmap.  The caller needs to make sure the bitmap size of "src"
> > + * is bigger than (shift + nbits).
> > + */
> > +void bitmap_copy_with_src_offset(unsigned long *dst, const unsigned long *src,
> > +                                 long shift, long nbits)
> > +{
> > +    unsigned long left_mask, right_mask, last_mask;
> > +
> > +    assert(shift >= 0);
> 
> Just make shift and nbits unsigned long?
> (Interestingly, bitops.h uses unsigned long everywhere, whereas bitmap
> uses long)

Sure.

> 
> > +    /* Proper shift src pointer to the first word to copy from */
> > +    src += BIT_WORD(shift);
> > +    shift %= BITS_PER_LONG;

[1]

> > +    right_mask = (1ul << shift) - 1;
> > +    left_mask = ~right_mask;
> > +
> > +    while (nbits >= BITS_PER_LONG) {
> > +        *dst = (*src & left_mask) >> shift;
> > +        *dst |= (*(src + 1) & right_mask) << (BITS_PER_LONG - shift);
> 
>   src[1] ?

Sure.

> 
> Also, perhaps you should only do this line if (right_mask), that way
> if the offset is a multiple of BITS_PER_LONG, you don't access the
> word after the end of the bitmap.

Indeed.  Normally the real callers in this series won't hit that case;
otherwise we wouldn't need to call this function at all.

Instead of checking right_mask every time, I think I can just provide a
quick path at [1] above: if shift==0 there, take a separate
memcpy()-style quick path.

(though that path should never actually be hit by this series, because
 if the bitmap offset is long-aligned we don't need to copy at all in
 most cases - we can just use bitmap[N] directly)
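
For illustration, the quick path at [1] could look roughly like this
(just a sketch, assuming the existing bitmap_copy() helper; untested):

    src += BIT_WORD(shift);
    shift %= BITS_PER_LONG;

    if (!shift) {
        /* Offset is a multiple of BITS_PER_LONG: plain word copy */
        bitmap_copy(dst, src, nbits);
        return;
    }
    /* ... otherwise fall through to the shifted-copy loop as before ... */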

> 
> > +        dst++;
> > +        src++;
> > +        nbits -= BITS_PER_LONG;
> > +    }
> > +
> > +    if (nbits > BITS_PER_LONG - shift) {
> > +        *dst = (*src & left_mask) >> shift;
> > +        nbits -= BITS_PER_LONG - shift;
> > +        last_mask = (1 << nbits) - 1;
> 
> 1ul
> 
> > +        *dst |= (*(src + 1) & last_mask) << (BITS_PER_LONG - shift);
> > +    } else if (nbits) {
> > +        last_mask = (1 << nbits) - 1;
> 
> and again
> 
> > +        *dst = (*src >> shift) & last_mask;
> > +    }
> > +}
> > +
> > +/*
> > + * Copy "src" bitmap into the "dst" bitmap with an offset in the
> > + * "dst".  The caller needs to make sure the bitmap size of "dst" is
> > + * bigger than (shift + nbits).
> > + */
> > +void bitmap_copy_with_dst_offset(unsigned long *dst, const unsigned long *src,
> > +                                 long shift, long nbits)
> > +{
> > +    unsigned long left_mask, right_mask, last_mask;
> > +
> > +    assert(shift >= 0);
> > +    /* Proper shift src pointer to the first word to copy from */
>                        ^^^
> 
> dst
> 
> > +    dst += BIT_WORD(shift);
> > +    shift %= BITS_PER_LONG;
> > +    right_mask = (1ul << (BITS_PER_LONG - shift)) - 1;
> > +    left_mask = ~right_mask;
> > +
> > +    *dst &= (1ul << shift) - 1;
> > +    while (nbits >= BITS_PER_LONG) {
> > +        *dst |= (*src & right_mask) << shift;
> > +        *(dst + 1) = (*src & left_mask) >> (BITS_PER_LONG - shift);
> 
> dst[1] ?
> 
> > +        dst++;
> > +        src++;
> > +        nbits -= BITS_PER_LONG;
> > +    }
> > +
> > +    if (nbits > BITS_PER_LONG - shift) {
> > +        *dst |= (*src & right_mask) << shift;
> > +        nbits -= BITS_PER_LONG - shift;
> > +        last_mask = ((1 << nbits) - 1) << (BITS_PER_LONG - shift);
> 
> 1ul

All fixed up for the rest.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty
  2019-05-30 11:22   ` Dr. David Alan Gilbert
@ 2019-05-31  2:36     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-31  2:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

On Thu, May 30, 2019 at 12:22:00PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Also change its 2nd parameter to be the relative offset within the
> > memory region.  This will be used in follow-up patches.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  exec.c                  | 3 ++-
> >  include/exec/ram_addr.h | 2 +-
> >  memory.c                | 3 +--
> >  3 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 4e734770c2..2615b4cfed 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -1387,9 +1387,10 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
> >  }
> >  
> >  DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
> > -     (ram_addr_t start, ram_addr_t length, unsigned client)
> > +(MemoryRegion *mr, hwaddr addr, hwaddr length, unsigned client)
> 
> Better to keep some indent?

Ok.  Emacs did it automatically for me without my noticing.  The
previous 5-space indent is odd; I'll use 4 spaces.

> 
> >  {
> >      DirtyMemoryBlocks *blocks;
> > +    ram_addr_t start = memory_region_get_ram_addr(mr) + addr;
> >      unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL);
> >      ram_addr_t first = QEMU_ALIGN_DOWN(start, align);
> >      ram_addr_t last  = QEMU_ALIGN_UP(start + length, align);
> > diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> > index 79e70a96ee..f8ee011d3c 100644
> > --- a/include/exec/ram_addr.h
> > +++ b/include/exec/ram_addr.h
> > @@ -403,7 +403,7 @@ bool cpu_physical_memory_test_and_clear_dirty(ram_addr_t start,
> >                                                unsigned client);
> >  
> >  DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
> > -    (ram_addr_t start, ram_addr_t length, unsigned client);
> > +(MemoryRegion *mr, ram_addr_t start, hwaddr length, unsigned client);
> 
> You've called it 'start' there but 'addr' in the definition.
> Either way a comment saying that it's the offset with mr would be good.

Right, I missed that.  Let me rename it to "offset" in both places to
be clear.  Also, I should use hwaddr rather than ram_addr_t, and I'll
fix the indent here too.
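
That is, roughly (a sketch of the intended prototype, not yet final):

    DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty
        (MemoryRegion *mr, hwaddr offset, hwaddr length, unsigned client);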

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap
  2019-05-30 13:53   ` Dr. David Alan Gilbert
@ 2019-05-31  2:43     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-31  2:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

On Thu, May 30, 2019 at 02:53:30PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > When synchronizing the dirty bitmap from kernel KVM we do it in a
> > per-kvmslot fashion and we allocate the userspace bitmap for each of
> > the ioctls.  This patch instead makes the bitmap cache persistent so
> > we don't need to g_malloc0() every time.
> > 
> > More importantly, the cached per-kvmslot dirty bitmap will be further
> > used when we add support for KVM_CLEAR_DIRTY_LOG, where the cached
> > bitmap will be used to guarantee we won't clear any unknown dirty
> > bits, which would otherwise be a severe data loss issue for the
> > migration code.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Is there no way to make this get allocated the first time it's needed?
> I'm thinking of the common case where the VM is never migrated, so we're
> allocating this structure for no benefit.

Valid argument...  sure, we can do the allocation on first use.  How
about squashing in the change below?

(I'll squash the hunks properly into the patches where they belong; the
 assertion belongs to the other patch)

=======================

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 23895a95a2..80bc4be03a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -523,6 +523,11 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,                                                                       
             goto out;
         }

+        if (!mem->dirty_bmap) {
+            /* Allocate on the first log_sync, once and for all */
+            mem->dirty_bmap = g_malloc0(bmap_size);
+        }
+
         d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
@@ -638,6 +643,8 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
      */

     assert(bmap_start % BITS_PER_LONG == 0);
+    /* We should never do log_clear before log_sync */
+    assert(mem->dirty_bmap);
     if (start_delta) {
         /* Slow path - we need to manipulate a temp bitmap */
         bmap_clear = bitmap_new(bmap_npages);
@@ -995,7 +1002,6 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     mem->start_addr = start_addr;
     mem->ram = ram;
     mem->flags = kvm_mem_flags(mr);
-    mem->dirty_bmap = g_malloc0(bmap_size);

     err = kvm_set_user_memory_region(kml, mem, true);
     if (err) {

Regards,

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener
  2019-05-30 16:40   ` Dr. David Alan Gilbert
@ 2019-05-31  2:48     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-31  2:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

On Thu, May 30, 2019 at 05:40:38PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Introduce KVMMemoryListener.slots_lock to protect the slots inside the
> > kvm memory listener.  Currently it is close to useless because all the
> > KVM code paths are always protected by the BQL.  But it'll start to
> > make sense in follow-up patches where we might do remote dirty bitmap
> > clears and also update the per-slot cached dirty bitmap even
> > without the BQL.  So let's prepare for it.
> > 
> > We could also use a per-slot lock for the above reason, but it seems
> > to be overkill.  Let's just use this bigger one (which covers all the
> > slots of a single address space); anyway this lock is still much
> > smaller than the BQL.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> No one ever frees it?  Huh, OK.  In that case, if no one ever frees the
> listener then I guess you don't need to clean up the lock.

Yeah, it's not the first time I've seen something kept around forever in
QEMU (in most cases it won't hurt :).

I'll just leave it to the patch where all things are cleaned up.

> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks
  2019-05-30 18:58   ` Dr. David Alan Gilbert
@ 2019-05-31  3:05     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-05-31  3:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Juan Quintela

On Thu, May 30, 2019 at 07:58:55PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Currently we are doing log_clear() right after log_sync() which mostly
> > keeps the old behavior when log_clear() was still part of log_sync().
> > 
> > This patch tries to further optimize the migration log_clear() code
> > path to split huge log_clear()s into smaller chunks.
> > 
> > We do this by splitting the whole guest memory region into memory
> > chunks, whose size is decided by MigrationState.clear_bitmap_shift (an
> > example will be given below).  With that, we don't do the dirty bitmap
> > clear operation on the remote node (e.g., KVM) when we fetch the dirty
> > bitmap; instead we explicitly clear the dirty bitmap for a memory
> > chunk the first time we send a page in that chunk.
> > 
> > Here comes an example.
> > 
> > Assuming the guest has 64G memory, then before this patch the KVM
> > ioctl KVM_CLEAR_DIRTY_LOG will be a single one covering 64G memory.
> > After the patch, assuming the clear bitmap shift is 18, the memory
> > chunk size on x86_64 will be 1UL<<18 * 4K = 1GB.  Then instead of
> > sending a big 64G ioctl, we'll send 64 small ioctls, each covering 1G
> > of guest memory.  Each of the 64 small ioctls is only sent if any page
> > in that small chunk is about to be sent right away.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> > ---
> >  include/exec/ram_addr.h | 75 +++++++++++++++++++++++++++++++++++++++--
> >  migration/migration.c   |  4 +++
> >  migration/migration.h   | 27 +++++++++++++++
> >  migration/ram.c         | 44 ++++++++++++++++++++++++
> >  migration/trace-events  |  1 +
> >  5 files changed, 149 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> > index f8930914cd..c85896f4d5 100644
> > --- a/include/exec/ram_addr.h
> > +++ b/include/exec/ram_addr.h
> > @@ -50,8 +50,69 @@ struct RAMBlock {
> >      unsigned long *unsentmap;
> >      /* bitmap of already received pages in postcopy */
> >      unsigned long *receivedmap;
> > +
> > +    /*
> > +     * bitmap of already cleared dirty bitmap.  Set this up to
> 
> Is that description the right way around?  Isn't it set when
> we receive dirty info from the kernel and hence it needs clearing
> in the future?

Right.  I should say "bitmap to track ...".  I'll reword.

> 
> > +     * non-NULL to enable the capability to postpone and split
> > +     * clearing of dirty bitmap on the remote node (e.g., KVM).  The
> > +     * bitmap will be set only when doing global sync.
> > +     *
> > +     * NOTE: this bitmap is different comparing to the other bitmaps
> > +     * in that one bit can represent multiple guest pages (which is
> > +     * decided by the `clear_bmap_shift' variable below).  On
> > +     * destination side, this should always be NULL, and the variable
> > +     * `clear_bmap_shift' is meaningless.
> > +     */
> > +    unsigned long *clear_bmap;
> > +    uint8_t clear_bmap_shift;
> 
> Is this just here as a convenience rather than using the copy in
> the property?

Right.  It seems a bit awkward to use that directly.  Also note that
the other value can be different from this one when it's illegal
(e.g., bigger than CLEAR_BITMAP_SHIFT_MAX or smaller than
CLEAR_BITMAP_SHIFT_MIN).  This value will always be legal, so we don't
need to check/assert it every time.

(Hmm, could bmap_shift be different for different ramblocks?  I never
 thought about that but actually, it can, of course)

Regards,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier Peter Xu
@ 2019-05-31 12:56   ` Juan Quintela
  2019-06-03  6:21     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Juan Quintela @ 2019-05-31 12:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> According to: https://spdx.org/ids-how, let's still allow QEMU to use
> the SPDX license identifier:
>
> // SPDX-License-Identifier: ***
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Although this patch doesn't belong to the series O:-)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
@ 2019-05-31 12:57   ` Juan Quintela
  2019-05-31 12:58   ` Juan Quintela
  1 sibling, 0 replies; 33+ messages in thread
From: Juan Quintela @ 2019-05-31 12:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> cpu_physical_memory_sync_dirty_bitmap() has one RAMBlock* as a
> parameter, which means that it must be called with the RCU read lock
> held already.  Taking it again inside seems redundant, so remove it.
> Instead, comment on the functions about the RCU read lock requirement.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
  2019-05-31 12:57   ` Juan Quintela
@ 2019-05-31 12:58   ` Juan Quintela
  1 sibling, 0 replies; 33+ messages in thread
From: Juan Quintela @ 2019-05-31 12:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> cpu_physical_memory_sync_dirty_bitmap() has one RAMBlock* as a
> parameter, which means that it must be called with the RCU read lock
> held already.  Taking it again inside seems redundant, so remove it.
> Instead, comment on the functions about the RCU read lock requirement.


Anyway, hotplug/unplug is supposed to be disabled during migration.  If
we add/remove a RAMBlock during migration, bad things could happen.

Later, Juan.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 03/12] memory: Remove memory_region_get_dirty()
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 03/12] memory: Remove memory_region_get_dirty() Peter Xu
@ 2019-05-31 12:59   ` Juan Quintela
  0 siblings, 0 replies; 33+ messages in thread
From: Juan Quintela @ 2019-05-31 12:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> It's never used anywhere.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration
  2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration Peter Xu
@ 2019-05-31 13:01   ` Juan Quintela
  2019-06-01  2:41     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Juan Quintela @ 2019-05-31 13:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> wrote:
> Similar to 9460dee4b2 ("memory: do not touch code dirty bitmap unless
> TCG is enabled", 2015-06-05) but for the migration bitmap - we can
> skip the MIGRATION bitmap update if migration is not enabled.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

But if we ever decide to _not_ dirty the whole bitmap at start (only
used pages), we will need to fix this.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration
  2019-05-31 13:01   ` Juan Quintela
@ 2019-06-01  2:41     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2019-06-01  2:41 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

On Fri, May 31, 2019 at 03:01:29PM +0200, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > Similar to 9460dee4b2 ("memory: do not touch code dirty bitmap unless
> > TCG is enabled", 2015-06-05) but for the migration bitmap - we can
> > skip the MIGRATION bitmap update if migration is not enabled.
> >
> > Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> But if we ever decide to _not_ dirty the whole bitmap at start (only
> used pages), we will need to fix this.

Right, but IMHO we can never avoid doing it, because kvm (and also the
per-ramblock dirty bitmaps) will only capture "dirtied pages" after
log sync has started.  For example, what if a page P is never touched
after log_sync?  Then it'll never be set in the kvm dirty log, and the
only way to make sure we will still migrate that page P (which could
have been touched before log_sync, so it might still contain valid data
rather than being a zero page) is to dirty all the pages at the start
of migration (for now, that's ram_list_init_bitmaps).
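
Concretely that's the existing code in ram_list_init_bitmaps(), visible
as context in patch 12:

    block->bmap = bitmap_new(pages);
    bitmap_set(block->bmap, 0, pages);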

Thanks for the review!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier
  2019-05-31 12:56   ` Juan Quintela
@ 2019-06-03  6:21     ` Peter Xu
  2019-06-03  8:01       ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2019-06-03  6:21 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Laurent Vivier, Paolo Bonzini, qemu-devel, Dr . David Alan Gilbert

On Fri, May 31, 2019 at 02:56:21PM +0200, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > According to: https://spdx.org/ids-how, let's still allow QEMU to use
> > the SPDX license identifier:
> >
> > // SPDX-License-Identifier: ***
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> Although this patch doesn't belong to the series O:-)

Right. :)  And Paolo should have queued the patch.

To make life easier, I plan to simply drop this patch in the next spin
and change the only user of "// SPDX-License-Identifier" in the series
to use a "/* ... */" comment instead, since I just noticed vast amounts
of code in QEMU already do that... then we don't have to depend on this
patch.
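
For example, the header of the new test file would then start with
something along the lines of (assuming we also pick 2.0-or-later as Dave
suggested; the exact tag is still to be decided):

    /* SPDX-License-Identifier: GPL-2.0-or-later */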

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier
  2019-06-03  6:21     ` Peter Xu
@ 2019-06-03  8:01       ` Paolo Bonzini
  0 siblings, 0 replies; 33+ messages in thread
From: Paolo Bonzini @ 2019-06-03  8:01 UTC (permalink / raw)
  To: Peter Xu, Juan Quintela
  Cc: Laurent Vivier, qemu-devel, Dr . David Alan Gilbert

On 03/06/19 08:21, Peter Xu wrote:
> On Fri, May 31, 2019 at 02:56:21PM +0200, Juan Quintela wrote:
>> Peter Xu <peterx@redhat.com> wrote:
>>> According to: https://spdx.org/ids-how, let's still allow QEMU to use
>>> the SPDX license identifier:
>>>
>>> // SPDX-License-Identifier: ***
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>
>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>>
>> Although this patch doesn't belong to the series O:-)
> 
> Right. :)  And Paolo should have queued the patch.
> 
> To make life easier, I plan to simply drop this patch in the next spin
> and change the only user of "// SPDX-License-Identifier" in the series
> to use a "/* ... */" comment instead, since I just noticed vast amounts
> of code in QEMU already do that... then we don't have to depend on this
> patch.

No problem, I'll send a pull request today.

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2019-06-03  8:14 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-30  9:29 [Qemu-devel] [PATCH v3 00/12] kvm/migration: support KVM_CLEAR_DIRTY_LOG Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 01/12] checkpatch: Allow SPDX-License-Identifier Peter Xu
2019-05-31 12:56   ` Juan Quintela
2019-06-03  6:21     ` Peter Xu
2019-06-03  8:01       ` Paolo Bonzini
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 02/12] migration: No need to take rcu during sync_dirty_bitmap Peter Xu
2019-05-31 12:57   ` Juan Quintela
2019-05-31 12:58   ` Juan Quintela
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 03/12] memory: Remove memory_region_get_dirty() Peter Xu
2019-05-31 12:59   ` Juan Quintela
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 04/12] memory: Don't set migration bitmap when without migration Peter Xu
2019-05-31 13:01   ` Juan Quintela
2019-06-01  2:41     ` Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 05/12] bitmap: Add bitmap_copy_with_{src|dst}_offset() Peter Xu
2019-05-30 11:05   ` Dr. David Alan Gilbert
2019-05-31  1:45     ` Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 06/12] memory: Pass mr into snapshot_and_clear_dirty Peter Xu
2019-05-30 11:22   ` Dr. David Alan Gilbert
2019-05-31  2:36     ` Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 07/12] memory: Introduce memory listener hook log_clear() Peter Xu
2019-05-30 13:20   ` Dr. David Alan Gilbert
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 08/12] kvm: Update comments for sync_dirty_bitmap Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 09/12] kvm: Persistent per kvmslot dirty bitmap Peter Xu
2019-05-30 13:53   ` Dr. David Alan Gilbert
2019-05-31  2:43     ` Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 10/12] kvm: Introduce slots lock for memory listener Peter Xu
2019-05-30 16:40   ` Dr. David Alan Gilbert
2019-05-31  2:48     ` Peter Xu
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 11/12] kvm: Support KVM_CLEAR_DIRTY_LOG Peter Xu
2019-05-30 17:56   ` Dr. David Alan Gilbert
2019-05-30  9:29 ` [Qemu-devel] [PATCH v3 12/12] migration: Split log_clear() into smaller chunks Peter Xu
2019-05-30 18:58   ` Dr. David Alan Gilbert
2019-05-31  3:05     ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).