All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part)
@ 2020-05-23 23:20 Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 01/11] linux-headers: Update Peter Xu
                   ` (12 more replies)
  0 siblings, 13 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

I kept the dirty sync in kvm_set_phys_mem() for kvmslot removals, left a
comment on the known issue about strict dirty sync so we can fix it someday in
the future together with dirty log and dirty ring.

v3:
- added "KVM: Use a big lock to replace per-kml slots_lock"
  this is preparing for the last patch where we'll reap kvm dirty ring when
  removing kvmslots.
- added "KVM: Simplify dirty log sync in kvm_set_phys_mem"
  it's kind of a fix, but also a preparation of the last patch so it'll be very
  easy to add the dirty ring sync there
- the last patch is changed to handle correctly the dirty sync in kvmslot
  removal, also comment there about the known issues.
- reordered the patches a bit
- NOTE: since we kept the sync in memslot removal, this version does not depend
  on any other QEMU series - it is based on QEMU master

v2:
- add r-bs for Dave
- change dirty-ring-size parameter from int64 to uint64_t [Dave]
- remove an assertion for KVM_GET_DIRTY_LOG [Dave]
- document update: "per vcpu" dirty ring [Dave]
- rename KVMReaperState to KVMDirtyRingReaperState [Dave]
- dump errno when kvm_init_vcpu fails with dirty ring [Dave]
- switch to use dirty-ring-gfns as parameter [Dave]
- comment MAP_SHARED [Dave]
- dump more info when enable dirty ring failed [Dave]
- add kvm_dirty_ring_enabled flag to show whether dirty ring enabled
- rewrote many of the last patch to reduce LOC, now we do dirty ring reap only
  with BQL to simplify things, allowing the main or vcpu thread to directly
  call kvm_dirty_ring_reap() to collect dirty pages, so that we can drop a lot
  of synchronization variables like sems or eventfds.

For anyone who wants to try (we need to upgrade kernel too):

KVM branch:
  https://github.com/xzpeter/linux/tree/kvm-dirty-ring

QEMU branch for testing:
  https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

Overview
========

KVM dirty ring is a new interface to pass over dirty bits from kernel
to the userspace.  Instead of using a bitmap for each memory region,
the dirty ring contains an array of dirtied GPAs to fetch, one ring
per vcpu.

There're a few major changes comparing to how the old dirty logging
interface would work:

- Granularity of dirty bits

  KVM dirty ring interface does not offer memory region level
  granularity to collect dirty bits (i.e., per KVM memory
  slot). Instead the dirty bit is collected globally for all the vcpus
  at once.  The major effect is on VGA part because VGA dirty tracking
  is enabled as long as the device is created, also it was in memory
  region granularity.  Now that operation will be amplified to a VM
  sync.  Maybe there's smarter way to do the same thing in VGA with
  the new interface, but so far I don't see it affects much at least
  on regular VMs.

- Collection of dirty bits

  The old dirty logging interface collects KVM dirty bits when
  synchronizing dirty bits.  KVM dirty ring interface instead used a
  standalone thread to do that.  So when the other thread (e.g., the
  migration thread) wants to synchronize the dirty bits, it simply
  kick the thread and wait until it flushes all the dirty bits to the
  ramblock dirty bitmap.

A new parameter "dirty-ring-size" is added to "-accel kvm".  By
default, dirty ring is still disabled (size==0).  To enable it, we
need to be with:

  -accel kvm,dirty-ring-size=65536

This establishes a 64K dirty ring buffer per vcpu.  Then if we
migrate, it'll switch to dirty ring.

I gave it a shot with a 24G guest, 8 vcpus, using 10g NIC as migration
channel.  When idle or dirty workload small, I don't observe major
difference on total migration time.  When with higher random dirty
workload (800MB/s dirty rate upon 20G memory, worse for kvm dirty
ring). Total migration time is (ping pong migrate for 6 times, in
seconds):

|-------------------------+---------------|
| dirty ring (4k entries) | dirty logging |
|-------------------------+---------------|
|                      70 |            58 |
|                      78 |            70 |
|                      72 |            48 |
|                      74 |            52 |
|                      83 |            49 |
|                      65 |            54 |
|-------------------------+---------------|

Summary:

dirty ring average:    73s
dirty logging average: 55s

The KVM dirty ring will be slower in above case.  The number may show
that the dirty logging is still preferred as a default value because
small/medium VMs are still major cases, and high dirty workload
happens frequently too.  And that's what this series did.

Please refer to the code and comment itself for more information.

Thanks,

Peter Xu (11):
  linux-headers: Update
  memory: Introduce log_sync_global() to memory listener
  KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  KVM: Use a big lock to replace per-kml slots_lock
  KVM: Create the KVMSlot dirty bitmap on flag changes
  KVM: Provide helper to get kvm dirty log
  KVM: Provide helper to sync dirty bitmap from slot to ramblock
  KVM: Simplify dirty log sync in kvm_set_phys_mem
  KVM: Cache kvm slot dirty bitmap size
  KVM: Add dirty-gfn-count property
  KVM: Dirty ring support

 accel/kvm/kvm-all.c         | 540 +++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events      |   7 +
 include/exec/memory.h       |  12 +
 include/hw/core/cpu.h       |   8 +
 include/sysemu/kvm_int.h    |   7 +-
 linux-headers/asm-x86/kvm.h |   1 +
 linux-headers/linux/kvm.h   |  53 ++++
 memory.c                    |  33 ++-
 qemu-options.hx             |   5 +
 9 files changed, 581 insertions(+), 85 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 01/11] linux-headers: Update
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-24 13:27   ` Peter Maydell
  2020-05-23 23:20 ` [PATCH RFC v3 02/11] memory: Introduce log_sync_global() to memory listener Peter Xu
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 linux-headers/asm-x86/kvm.h |  1 +
 linux-headers/linux/kvm.h   | 53 +++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 3f3f780c8c..99b15ce39e 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -12,6 +12,7 @@
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 9804495a46..f0f3cecce1 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -1017,6 +1018,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_VCPU_RESETS 179
 #define KVM_CAP_S390_PROTECTED 180
 #define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_DIRTY_LOG_RING 182
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1518,6 +1520,9 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_S390_PROTECTED */
 #define KVM_S390_PV_COMMAND		_IOWR(KVMIO, 0xc5, struct kvm_pv_cmd)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc6)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1671,4 +1676,52 @@ struct kvm_hyperv_eventfd {
 #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE    (1 << 0)
 #define KVM_DIRTY_LOG_INITIALLY_SET            (1 << 1)
 
+/*
+ * Arch needs to define the macro after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         collected        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion (to collect dirty bits).  Also, it must not skip any
+ * dirty bits so that dirty bits are always collected in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+#define KVM_DIRTY_GFN_F_RESET           BIT(1)
+#define KVM_DIRTY_GFN_F_MASK            0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn.  The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+	__u32 flags;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 02/11] memory: Introduce log_sync_global() to memory listener
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 01/11] linux-headers: Update Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 03/11] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Some of the memory listener may want to do log synchronization without
being able to specify a range of memory to sync but always globally.
Such a memory listener should provide this new method instead of the
log_sync() method.

Obviously we can also achieve similar thing when we put the global
sync logic into a log_sync() handler. However that's not efficient
enough because otherwise memory_global_dirty_log_sync() may do the
global sync N times, where N is the number of flat ranges in the
address space.

Make this new method be exclusive to log_sync().

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 12 ++++++++++++
 memory.c              | 33 +++++++++++++++++++++++----------
 2 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index e000bd2f97..c0c6155ca0 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -533,6 +533,18 @@ struct MemoryListener {
      */
     void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
 
+    /**
+     * @log_sync_global:
+     *
+     * This is the global version of @log_sync when the listener does
+     * not have a way to synchronize the log with finer granularity.
+     * When the listener registers with @log_sync_global defined, then
+     * its @log_sync must be NULL.  Vice versa.
+     *
+     * @listener: The #MemoryListener.
+     */
+    void (*log_sync_global)(MemoryListener *listener);
+
     /**
      * @log_clear:
      *
diff --git a/memory.c b/memory.c
index 92fb8b80d7..a77c884e8e 100644
--- a/memory.c
+++ b/memory.c
@@ -2047,6 +2047,10 @@ void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                                         memory_region_get_dirty_log_mask(mr));
 }
 
+/*
+ * If memory region `mr' is NULL, do global sync.  Otherwise, sync
+ * dirty bitmap for the specified memory region.
+ */
 static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
 {
     MemoryListener *listener;
@@ -2060,18 +2064,24 @@ static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
      * address space once.
      */
     QTAILQ_FOREACH(listener, &memory_listeners, link) {
-        if (!listener->log_sync) {
-            continue;
-        }
-        as = listener->address_space;
-        view = address_space_get_flatview(as);
-        FOR_EACH_FLAT_RANGE(fr, view) {
-            if (fr->dirty_log_mask && (!mr || fr->mr == mr)) {
-                MemoryRegionSection mrs = section_from_flat_range(fr, view);
-                listener->log_sync(listener, &mrs);
+        if (listener->log_sync) {
+            as = listener->address_space;
+            view = address_space_get_flatview(as);
+            FOR_EACH_FLAT_RANGE(fr, view) {
+                if (fr->dirty_log_mask && (!mr || fr->mr == mr)) {
+                    MemoryRegionSection mrs = section_from_flat_range(fr, view);
+                    listener->log_sync(listener, &mrs);
+                }
             }
+            flatview_unref(view);
+        } else if (listener->log_sync_global) {
+            /*
+             * No matter whether MR is specified, what we can do here
+             * is to do a global sync, because we are not capable to
+             * sync in a finer granularity.
+             */
+            listener->log_sync_global(listener);
         }
-        flatview_unref(view);
     }
 }
 
@@ -2758,6 +2768,9 @@ void memory_listener_register(MemoryListener *listener, AddressSpace *as)
 {
     MemoryListener *other = NULL;
 
+    /* Only one of them can be defined for a listener */
+    assert(!(listener->log_sync && listener->log_sync_global));
+
     listener->address_space = as;
     if (QTAILQ_EMPTY(&memory_listeners)
         || listener->priority >= QTAILQ_LAST(&memory_listeners)->priority) {
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 03/11] KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 01/11] linux-headers: Update Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 02/11] memory: Introduce log_sync_global() to memory listener Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-24 16:39   ` Philippe Mathieu-Daudé
  2020-05-23 23:20 ` [PATCH RFC v3 04/11] KVM: Use a big lock to replace per-kml slots_lock Peter Xu
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

kvm_vm_ioctl() handles the errno trick already for ioctl() on
returning -1 for errors.  Fix this.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d06cc04079..6e015aa2d4 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -699,14 +699,13 @@ static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start,
     d.num_pages = bmap_npages;
     d.slot = mem->slot | (as_id << 16);
 
-    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
-        ret = -errno;
+    ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);
+    if (ret) {
         error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
                      "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
                      __func__, d.slot, (uint64_t)d.first_page,
                      (uint32_t)d.num_pages, ret);
     } else {
-        ret = 0;
         trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
     }
 
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 04/11] KVM: Use a big lock to replace per-kml slots_lock
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (2 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 03/11] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 05/11] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Per-kml slots_lock will bring some trouble if we want to take all slots_lock of
all the KMLs, especially when we're in a context that we could have taken some
of the KML slots_lock, then we even need to figure out what we've taken and
what we need to take.

Make this simple by merging all KML slots_lock into a single slots lock.

Per-kml slots_lock isn't anything that helpful anyway - so far only x86 has two
address spaces (so, two slots_locks).  All the rest archs will be having one
address space always, which means there's actually one slots_lock so it will be
the same as before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 32 +++++++++++++++++---------------
 include/sysemu/kvm_int.h |  2 --
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6e015aa2d4..6bdb7909cc 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -160,8 +160,10 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
 static NotifierList kvm_irqchip_change_notifiers =
     NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
 
-#define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
-#define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
+static QemuMutex kml_slots_lock;
+
+#define kvm_slots_lock()  qemu_mutex_lock(&kml_slots_lock)
+#define kvm_slots_unlock()  qemu_mutex_unlock(&kml_slots_lock)
 
 int kvm_get_max_memslots(void)
 {
@@ -211,9 +213,9 @@ bool kvm_has_free_slot(MachineState *ms)
     bool result;
     KVMMemoryListener *kml = &s->memory_listener;
 
-    kvm_slots_lock(kml);
+    kvm_slots_lock();
     result = !!kvm_get_free_slot(kml);
-    kvm_slots_unlock(kml);
+    kvm_slots_unlock();
 
     return result;
 }
@@ -279,7 +281,7 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
     KVMMemoryListener *kml = &s->memory_listener;
     int i, ret = 0;
 
-    kvm_slots_lock(kml);
+    kvm_slots_lock();
     for (i = 0; i < s->nr_slots; i++) {
         KVMSlot *mem = &kml->slots[i];
 
@@ -289,7 +291,7 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
             break;
         }
     }
-    kvm_slots_unlock(kml);
+    kvm_slots_unlock();
 
     return ret;
 }
@@ -468,7 +470,7 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,
         return 0;
     }
 
-    kvm_slots_lock(kml);
+    kvm_slots_lock();
 
     while (size && !ret) {
         slot_size = MIN(kvm_max_slot_size, size);
@@ -484,7 +486,7 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,
     }
 
 out:
-    kvm_slots_unlock(kml);
+    kvm_slots_unlock();
     return ret;
 }
 
@@ -754,7 +756,7 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
         return ret;
     }
 
-    kvm_slots_lock(kml);
+    kvm_slots_lock();
 
     for (i = 0; i < s->nr_slots; i++) {
         mem = &kml->slots[i];
@@ -780,7 +782,7 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
         }
     }
 
-    kvm_slots_unlock(kml);
+    kvm_slots_unlock();
 
     return ret;
 }
@@ -1085,7 +1087,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
           (start_addr - section->offset_within_address_space);
 
-    kvm_slots_lock(kml);
+    kvm_slots_lock();
 
     if (!add) {
         do {
@@ -1143,7 +1145,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     } while (size);
 
 out:
-    kvm_slots_unlock(kml);
+    kvm_slots_unlock();
 }
 
 static void kvm_region_add(MemoryListener *listener,
@@ -1170,9 +1172,9 @@ static void kvm_log_sync(MemoryListener *listener,
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
     int r;
 
-    kvm_slots_lock(kml);
+    kvm_slots_lock();
     r = kvm_physical_sync_dirty_bitmap(kml, section);
-    kvm_slots_unlock(kml);
+    kvm_slots_unlock();
     if (r < 0) {
         abort();
     }
@@ -1272,7 +1274,7 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
 {
     int i;
 
-    qemu_mutex_init(&kml->slots_lock);
+    qemu_mutex_init(&kml_slots_lock);
     kml->slots = g_malloc0(s->nr_slots * sizeof(KVMSlot));
     kml->as_id = as_id;
 
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index c660a70c51..f143b28671 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -27,8 +27,6 @@ typedef struct KVMSlot
 
 typedef struct KVMMemoryListener {
     MemoryListener listener;
-    /* Protects the slots and all inside them */
-    QemuMutex slots_lock;
     KVMSlot *slots;
     int as_id;
 } KVMMemoryListener;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 05/11] KVM: Create the KVMSlot dirty bitmap on flag changes
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (3 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 04/11] KVM: Use a big lock to replace per-kml slots_lock Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 06/11] KVM: Provide helper to get kvm dirty log Peter Xu
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Previously we have two places that will create the per KVMSlot dirty
bitmap:

  1. When a newly created KVMSlot has dirty logging enabled,
  2. When the first log_sync() happens for a memory slot.

The 2nd case is lazy-init, while the 1st case is not (which is a fix
of what the 2nd case missed).

To do explicit initialization of dirty bitmaps, what we're missing is
to create the dirty bitmap when the slot changed from not-dirty-track
to dirty-track.  Do that in kvm_slot_update_flags().

With that, we can safely remove the 2nd lazy-init.

This change will be needed for kvm dirty ring because kvm dirty ring
does not use the log_sync() interface at all.

Also move all the pre-checks into kvm_slot_init_dirty_bitmap().

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6bdb7909cc..5b626af2a7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -165,6 +165,8 @@ static QemuMutex kml_slots_lock;
 #define kvm_slots_lock()  qemu_mutex_lock(&kml_slots_lock)
 #define kvm_slots_unlock()  qemu_mutex_unlock(&kml_slots_lock)
 
+static void kvm_slot_init_dirty_bitmap(KVMSlot *mem);
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_accel());
@@ -455,6 +457,7 @@ static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
         return 0;
     }
 
+    kvm_slot_init_dirty_bitmap(mem);
     return kvm_set_user_memory_region(kml, mem, false);
 }
 
@@ -539,8 +542,12 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
 /* Allocate the dirty bitmap for a slot  */
-static void kvm_memslot_init_dirty_bitmap(KVMSlot *mem)
+static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
 {
+    if (!(mem->flags & KVM_MEM_LOG_DIRTY_PAGES) || mem->dirty_bmap) {
+        return;
+    }
+
     /*
      * XXX bad kernel interface alert
      * For dirty bitmap, kernel allocates array of size aligned to
@@ -591,11 +598,6 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
             goto out;
         }
 
-        if (!mem->dirty_bmap) {
-            /* Allocate on the first log_sync, once and for all */
-            kvm_memslot_init_dirty_bitmap(mem);
-        }
-
         d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
@@ -1125,14 +1127,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         mem->start_addr = start_addr;
         mem->ram = ram;
         mem->flags = kvm_mem_flags(mr);
-
-        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
-            /*
-             * Reallocate the bmap; it means it doesn't disappear in
-             * middle of a migrate.
-             */
-            kvm_memslot_init_dirty_bitmap(mem);
-        }
+        kvm_slot_init_dirty_bitmap(mem);
         err = kvm_set_user_memory_region(kml, mem, true);
         if (err) {
             fprintf(stderr, "%s: error registering slot: %s\n", __func__,
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 06/11] KVM: Provide helper to get kvm dirty log
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (4 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 05/11] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 07/11] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Provide a helper kvm_slot_get_dirty_log() to make the function
kvm_physical_sync_dirty_bitmap() clearer.  We can even cache the as_id
into KVMSlot when it is created, so that we don't even need to pass it
down every time.

Since at it, remove return value of kvm_physical_sync_dirty_bitmap()
because it should never fail.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 42 +++++++++++++++++++++-------------------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5b626af2a7..5892e5db24 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -566,6 +566,21 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
     mem->dirty_bmap = g_malloc0(bitmap_size);
 }
 
+/* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
+static void kvm_slot_get_dirty_log(KVMState *s, KVMSlot *slot)
+{
+    struct kvm_dirty_log d = {};
+    int ret;
+
+    d.dirty_bitmap = slot->dirty_bmap;
+    d.slot = slot->slot | (slot->as_id << 16);
+    ret = kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d);
+    if (ret) {
+        error_report_once("%s: KVM_GET_DIRTY_LOG failed with %d",
+                          __func__, ret);
+    }
+}
+
 /**
  * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
  *
@@ -577,15 +592,13 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
  * @kml: the KVM memory listener object
  * @section: the memory section to sync the dirty bitmap with
  */
-static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
-                                          MemoryRegionSection *section)
+static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
+                                           MemoryRegionSection *section)
 {
     KVMState *s = kvm_state;
-    struct kvm_dirty_log d = {};
     KVMSlot *mem;
     hwaddr start_addr, size;
     hwaddr slot_size, slot_offset = 0;
-    int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
     while (size) {
@@ -595,27 +608,19 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
         mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
         if (!mem) {
             /* We don't have a slot if we want to trap every access. */
-            goto out;
+            return;
         }
 
-        d.dirty_bitmap = mem->dirty_bmap;
-        d.slot = mem->slot | (kml->as_id << 16);
-        if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
-            DPRINTF("ioctl failed %d\n", errno);
-            ret = -1;
-            goto out;
-        }
+        kvm_slot_get_dirty_log(s, mem);
 
         subsection.offset_within_region += slot_offset;
         subsection.size = int128_make64(slot_size);
-        kvm_get_dirty_pages_log_range(&subsection, d.dirty_bitmap);
+        kvm_get_dirty_pages_log_range(&subsection, mem->dirty_bmap);
 
         slot_offset += slot_size;
         start_addr += slot_size;
         size -= slot_size;
     }
-out:
-    return ret;
 }
 
 /* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
@@ -1123,6 +1128,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     do {
         slot_size = MIN(kvm_max_slot_size, size);
         mem = kvm_alloc_slot(kml);
+        mem->as_id = kml->as_id;
         mem->memory_size = slot_size;
         mem->start_addr = start_addr;
         mem->ram = ram;
@@ -1165,14 +1171,10 @@ static void kvm_log_sync(MemoryListener *listener,
                          MemoryRegionSection *section)
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
-    int r;
 
     kvm_slots_lock();
-    r = kvm_physical_sync_dirty_bitmap(kml, section);
+    kvm_physical_sync_dirty_bitmap(kml, section);
     kvm_slots_unlock();
-    if (r < 0) {
-        abort();
-    }
 }
 
 static void kvm_log_clear(MemoryListener *listener,
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index f143b28671..1202593c6f 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -23,6 +23,8 @@ typedef struct KVMSlot
     int old_flags;
     /* Dirty bitmap cache for the slot */
     unsigned long *dirty_bmap;
+    /* Cache of the address space ID */
+    int as_id;
 } KVMSlot;
 
 typedef struct KVMMemoryListener {
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 07/11] KVM: Provide helper to sync dirty bitmap from slot to ramblock
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (5 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 06/11] KVM: Provide helper to get kvm dirty log Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 08/11] KVM: Simplify dirty log sync in kvm_set_phys_mem Peter Xu
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

kvm_physical_sync_dirty_bitmap() calculates the ramblock offset in an
awkward way from the MemoryRegionSection that passed in from the
caller.  The truth is for each KVMSlot the ramblock offset never
change for the lifecycle.  Cache the ramblock offset for each KVMSlot
into the structure when the KVMSlot is created.

With that, we can further simplify kvm_physical_sync_dirty_bitmap()
with a helper to sync KVMSlot dirty bitmap to the ramblock dirty
bitmap of a specific KVMSlot.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 37 +++++++++++++++++--------------------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5892e5db24..016bad1089 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -528,15 +528,12 @@ static void kvm_log_stop(MemoryListener *listener,
 }
 
 /* get kvm's dirty pages bitmap and update qemu's */
-static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
-                                         unsigned long *bitmap)
+static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
 {
-    ram_addr_t start = section->offset_within_region +
-                       memory_region_get_ram_addr(section->mr);
-    ram_addr_t pages = int128_get64(section->size) / qemu_real_host_page_size;
+    ram_addr_t start = slot->ram_start_offset;
+    ram_addr_t pages = slot->memory_size / qemu_real_host_page_size;
 
-    cpu_physical_memory_set_dirty_lebitmap(bitmap, start, pages);
-    return 0;
+    cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
 }
 
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
@@ -598,12 +595,10 @@ static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
     KVMState *s = kvm_state;
     KVMSlot *mem;
     hwaddr start_addr, size;
-    hwaddr slot_size, slot_offset = 0;
+    hwaddr slot_size;
 
     size = kvm_align_section(section, &start_addr);
     while (size) {
-        MemoryRegionSection subsection = *section;
-
         slot_size = MIN(kvm_max_slot_size, size);
         mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
         if (!mem) {
@@ -612,12 +607,7 @@ static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
         }
 
         kvm_slot_get_dirty_log(s, mem);
-
-        subsection.offset_within_region += slot_offset;
-        subsection.size = int128_make64(slot_size);
-        kvm_get_dirty_pages_log_range(&subsection, mem->dirty_bmap);
-
-        slot_offset += slot_size;
+        kvm_slot_sync_dirty_pages(mem);
         start_addr += slot_size;
         size -= slot_size;
     }
@@ -1072,7 +1062,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     int err;
     MemoryRegion *mr = section->mr;
     bool writeable = !mr->readonly && !mr->rom_device;
-    hwaddr start_addr, size, slot_size;
+    hwaddr start_addr, size, slot_size, mr_offset;
+    ram_addr_t ram_start_offset;
     void *ram;
 
     if (!memory_region_is_ram(mr)) {
@@ -1090,9 +1081,13 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         return;
     }
 
-    /* use aligned delta to align the ram address */
-    ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
-          (start_addr - section->offset_within_address_space);
+    /* The offset of the kvmslot within the memory region */
+    mr_offset = section->offset_within_region + start_addr -
+        section->offset_within_address_space;
+
+    /* use aligned delta to align the ram address and offset */
+    ram = memory_region_get_ram_ptr(mr) + mr_offset;
+    ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset;
 
     kvm_slots_lock();
 
@@ -1131,6 +1126,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         mem->as_id = kml->as_id;
         mem->memory_size = slot_size;
         mem->start_addr = start_addr;
+        mem->ram_start_offset = ram_start_offset;
         mem->ram = ram;
         mem->flags = kvm_mem_flags(mr);
         kvm_slot_init_dirty_bitmap(mem);
@@ -1141,6 +1137,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             abort();
         }
         start_addr += slot_size;
+        ram_start_offset += slot_size;
         ram += slot_size;
         size -= slot_size;
     } while (size);
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 1202593c6f..4a35d04478 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -25,6 +25,8 @@ typedef struct KVMSlot
     unsigned long *dirty_bmap;
     /* Cache of the address space ID */
     int as_id;
+    /* Cache of the offset in ram address space */
+    ram_addr_t ram_start_offset;
 } KVMSlot;
 
 typedef struct KVMMemoryListener {
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 08/11] KVM: Simplify dirty log sync in kvm_set_phys_mem
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (6 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 07/11] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 09/11] KVM: Cache kvm slot dirty bitmap size Peter Xu
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

kvm_physical_sync_dirty_bitmap() on the whole section is inaccurate, because
the section can be a superset of the memslot that we're working on.  The result
is that if the section covers multiple kvm memslots, we could be doing the
synchronization for multiple times for each kvmslot in the section.

With the two helpers that we just introduced, it's very easy to do it right now
by calling the helpers.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 016bad1089..f7c8e6bebe 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1099,7 +1099,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                 goto out;
             }
             if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
-                kvm_physical_sync_dirty_bitmap(kml, section);
+                kvm_slot_get_dirty_log(kvm_state, mem);
+                kvm_slot_sync_dirty_pages(mem);
             }
 
             /* unregister the slot */
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 09/11] KVM: Cache kvm slot dirty bitmap size
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (7 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 08/11] KVM: Simplify dirty log sync in kvm_set_phys_mem Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 10/11] KVM: Add dirty-gfn-count property Peter Xu
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Cache it too because we'll reference it more frequently in the future.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 1 +
 include/sysemu/kvm_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f7c8e6bebe..b9aaa7912c 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -561,6 +561,7 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
     hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
                                         /*HOST_LONG_BITS*/ 64) / 8;
     mem->dirty_bmap = g_malloc0(bitmap_size);
+    mem->dirty_bmap_size = bitmap_size;
 }
 
 /* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 4a35d04478..b4d2886e26 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -23,6 +23,7 @@ typedef struct KVMSlot
     int old_flags;
     /* Dirty bitmap cache for the slot */
     unsigned long *dirty_bmap;
+    unsigned long dirty_bmap_size;
     /* Cache of the address space ID */
     int as_id;
     /* Cache of the offset in ram address space */
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 10/11] KVM: Add dirty-gfn-count property
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (8 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 09/11] KVM: Cache kvm slot dirty bitmap size Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-23 23:20 ` [PATCH RFC v3 11/11] KVM: Dirty ring support Peter Xu
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

Add a parameter for dirty gfn count for dirty rings.  If zero, dirty ring is
disabled.  Otherwise dirty ring will be enabled with the per-vcpu gfn count as
specified.  If dirty ring cannot be enabled due to unsupported kernel or
illegal parameter, it'll fallback to dirty logging.

By default, dirty ring is not enabled (dirty-gfn-count default to 0).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 72 +++++++++++++++++++++++++++++++++++++++++++++
 qemu-options.hx     |  5 ++++
 2 files changed, 77 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b9aaa7912c..dd017d0720 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -128,6 +128,9 @@ struct KVMState
         KVMMemoryListener *ml;
         AddressSpace *as;
     } *as;
+    bool kvm_dirty_ring_enabled;    /* Whether KVM dirty ring is enabled */
+    uint64_t kvm_dirty_ring_size;   /* Size of the per-vcpu dirty ring */
+    uint32_t kvm_dirty_gfn_count;   /* Number of dirty GFNs per ring */
 };
 
 KVMState *kvm_state;
@@ -2129,6 +2132,40 @@ static int kvm_init(MachineState *ms)
     s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
     s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;
 
+    /*
+     * Enable KVM dirty ring if supported, otherwise fall back to
+     * dirty logging mode
+     */
+    if (s->kvm_dirty_gfn_count > 0) {
+        uint64_t ring_size;
+
+        ring_size = s->kvm_dirty_gfn_count * sizeof(struct kvm_dirty_gfn);
+
+        /* Read the max supported pages */
+        ret = kvm_vm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING);
+        if (ret > 0) {
+            if (ring_size > ret) {
+                error_report("KVM dirty GFN count %" PRIu32 " too big "
+                             "(maximum is %ld).  Please use a smaller value.",
+                             s->kvm_dirty_gfn_count,
+                             ret / sizeof(struct kvm_dirty_gfn));
+                ret = -EINVAL;
+                goto err;
+            }
+
+            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0, ring_size);
+            if (ret) {
+                error_report("Enabling of KVM dirty ring failed: %d. "
+                             "Suggested mininum value is 1024. "
+                             "Please also make sure it's a power of two.", ret);
+                goto err;
+            }
+
+            s->kvm_dirty_ring_size = ring_size;
+            s->kvm_dirty_ring_enabled = true;
+        }
+    }
+
     kvm_memory_listener_register(s, &s->memory_listener,
                                  &address_space_memory, 0);
     memory_listener_register(&kvm_io_listener,
@@ -3089,6 +3126,33 @@ bool kvm_kernel_irqchip_split(void)
     return kvm_state->kernel_irqchip_split == ON_OFF_AUTO_ON;
 }
 
+static void kvm_get_dirty_gfn_count(Object *obj, Visitor *v,
+                                    const char *name, void *opaque,
+                                    Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    uint32_t value = s->kvm_dirty_gfn_count;
+
+    visit_type_uint32(v, name, &value, errp);
+}
+
+static void kvm_set_dirty_gfn_count(Object *obj, Visitor *v,
+                                    const char *name, void *opaque,
+                                    Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    Error *error = NULL;
+    uint32_t value;
+
+    visit_type_uint32(v, name, &value, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    s->kvm_dirty_gfn_count = value;
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
     KVMState *s = KVM_STATE(obj);
@@ -3096,6 +3160,8 @@ static void kvm_accel_instance_init(Object *obj)
     s->kvm_shadow_mem = -1;
     s->kernel_irqchip_allowed = true;
     s->kernel_irqchip_split = ON_OFF_AUTO_AUTO;
+    /* KVM dirty ring is by default off */
+    s->kvm_dirty_gfn_count = 0;
 }
 
 static void kvm_accel_class_init(ObjectClass *oc, void *data)
@@ -3117,6 +3183,12 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
         NULL, NULL);
     object_class_property_set_description(oc, "kvm-shadow-mem",
         "KVM shadow MMU size");
+
+    object_class_property_add(oc, "dirty-gfn-count", "uint32",
+        kvm_get_dirty_gfn_count, kvm_set_dirty_gfn_count,
+        NULL, NULL);
+    object_class_property_set_description(oc, "dirty-gfn-count",
+        "KVM dirty GFN count (=0 to disable dirty ring)");
 }
 
 static const TypeInfo kvm_accel_type = {
diff --git a/qemu-options.hx b/qemu-options.hx
index 93bde2bbc8..b59d47473e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -124,6 +124,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
     "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
     "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
     "                tb-size=n (TCG translation block cache size)\n"
+    "                dirty-gfn-count=n (KVM dirty ring GFN count, default 0)\n"
     "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
@@ -158,6 +159,10 @@ SRST
         where both the back-end and front-ends support it and no
         incompatible TCG features have been enabled (e.g.
         icount/replay).
+
+    ``dirty-gfn-count=n``
+        Controls the per-vcpu KVM dirty ring GFN count (=0 to disable).
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v3 11/11] KVM: Dirty ring support
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (9 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 10/11] KVM: Add dirty-gfn-count property Peter Xu
@ 2020-05-23 23:20 ` Peter Xu
  2020-05-24 13:06 ` [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
  2020-05-26 14:17 ` Peter Xu
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-23 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx

KVM dirty ring is a new interface to pass over dirty bits from kernel to the
userspace.  Instead of using a bitmap for each memory region, the dirty ring
contains an array of dirtied GPAs to fetch (in the form of offset in slots).
For each vcpu there will be one dirty ring that binds to it.

kvm_dirty_ring_reap() is the major function to collect dirty rings.  It can be
called either by a standalone reaper thread that runs in the background,
collecting dirty pages for the whole VM.  It can also be called directly by any
thread that has BQL taken.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c    | 331 ++++++++++++++++++++++++++++++++++++++++-
 accel/kvm/trace-events |   7 +
 include/hw/core/cpu.h  |   8 +
 3 files changed, 343 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index dd017d0720..b064e68040 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -15,6 +15,7 @@
 
 #include "qemu/osdep.h"
 #include <sys/ioctl.h>
+#include <poll.h>
 
 #include <linux/kvm.h>
 
@@ -76,6 +77,25 @@ struct KVMParkedVcpu {
     QLIST_ENTRY(KVMParkedVcpu) node;
 };
 
+enum KVMDirtyRingReaperState {
+    KVM_DIRTY_RING_REAPER_NONE = 0,
+    /* The reaper is sleeping */
+    KVM_DIRTY_RING_REAPER_WAIT,
+    /* The reaper is reaping for dirty pages */
+    KVM_DIRTY_RING_REAPER_REAPING,
+};
+
+/*
+ * KVM reaper instance, responsible for collecting the KVM dirty bits
+ * via the dirty ring.
+ */
+struct KVMDirtyRingReaper {
+    /* The reaper thread */
+    QemuThread reaper_thr;
+    volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
+    volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
+};
+
 struct KVMState
 {
     AccelState parent_obj;
@@ -131,6 +151,7 @@ struct KVMState
     bool kvm_dirty_ring_enabled;    /* Whether KVM dirty ring is enabled */
     uint64_t kvm_dirty_ring_size;   /* Size of the per-vcpu dirty ring */
     uint32_t kvm_dirty_gfn_count;   /* Number of dirty GFNs per ring */
+    struct KVMDirtyRingReaper reaper;
 };
 
 KVMState *kvm_state;
@@ -362,6 +383,13 @@ int kvm_destroy_vcpu(CPUState *cpu)
         goto err;
     }
 
+    if (cpu->kvm_dirty_gfns) {
+        ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_size);
+        if (ret < 0) {
+            goto err;
+        }
+    }
+
     vcpu = g_malloc0(sizeof(*vcpu));
     vcpu->vcpu_id = kvm_arch_vcpu_id(cpu);
     vcpu->kvm_fd = cpu->kvm_fd;
@@ -426,6 +454,19 @@ int kvm_init_vcpu(CPUState *cpu)
             (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
     }
 
+    if (s->kvm_dirty_ring_enabled) {
+        /* Use MAP_SHARED to share pages with the kernel */
+        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
+                                   PROT_READ | PROT_WRITE, MAP_SHARED,
+                                   cpu->kvm_fd,
+                                   PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
+        if (cpu->kvm_dirty_gfns == MAP_FAILED) {
+            ret = -errno;
+            DPRINTF("mmap'ing vcpu dirty gfns failed: %d\n", ret);
+            goto err;
+        }
+    }
+
     ret = kvm_arch_init_vcpu(cpu);
 err:
     return ret;
@@ -539,6 +580,11 @@ static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
     cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
 }
 
+static void kvm_slot_reset_dirty_pages(KVMSlot *slot)
+{
+    memset(slot->dirty_bmap, 0, slot->dirty_bmap_size);
+}
+
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
 /* Allocate the dirty bitmap for a slot  */
@@ -582,6 +628,170 @@ static void kvm_slot_get_dirty_log(KVMState *s, KVMSlot *slot)
     }
 }
 
+/* Should be with all slots_lock held for the address spaces. */
+static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t as_id,
+                                     uint32_t slot_id, uint64_t offset)
+{
+    KVMMemoryListener *kml;
+    KVMSlot *mem;
+
+    if (as_id >= s->nr_as) {
+        return;
+    }
+
+    kml = s->as[as_id].ml;
+    mem = &kml->slots[slot_id];
+
+    if (!mem->memory_size || offset >= (mem->memory_size / TARGET_PAGE_SIZE)) {
+        return;
+    }
+
+    set_bit(offset, mem->dirty_bmap);
+}
+
+static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
+{
+    return gfn->flags == KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
+{
+    gfn->flags = KVM_DIRTY_GFN_F_RESET;
+}
+
+/*
+ * Should be with all slots_lock held for the address spaces.  It returns the
+ * dirty page we've collected on this dirty ring.
+ */
+static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
+{
+    struct kvm_dirty_gfn *dirty_gfns = cpu->kvm_dirty_gfns, *cur;
+    uint32_t gfn_count = s->kvm_dirty_gfn_count;
+    uint32_t count = 0, fetch = cpu->kvm_fetch_index;
+
+    assert(dirty_gfns && gfn_count);
+    trace_kvm_dirty_ring_reap_vcpu(cpu->cpu_index);
+
+    while (true) {
+        cur = &dirty_gfns[fetch % gfn_count];
+        if (!dirty_gfn_is_dirtied(cur)) {
+            break;
+        }
+        kvm_dirty_ring_mark_page(s, cur->slot >> 16, cur->slot & 0xffff,
+                                 cur->offset);
+        dirty_gfn_set_collected(cur);
+        trace_kvm_dirty_ring_page(cpu->cpu_index, fetch, cur->offset);
+        fetch++;
+        count++;
+    }
+    cpu->kvm_fetch_index = fetch;
+
+    return count;
+}
+
+/* Must be with slots_lock held */
+static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
+{
+    int ret;
+    CPUState *cpu;
+    uint64_t total = 0;
+    int64_t stamp;
+
+    stamp = get_clock();
+
+    CPU_FOREACH(cpu) {
+        total += kvm_dirty_ring_reap_one(s, cpu);
+    }
+
+    if (total) {
+        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
+        assert(ret == total);
+    }
+
+    stamp = get_clock() - stamp;
+
+    if (total) {
+        trace_kvm_dirty_ring_reap(total, stamp / 1000);
+    }
+
+    return total;
+}
+
+/*
+ * Currently for simplicity, we must hold BQL before calling this.  We can
+ * consider to drop the BQL if we're clear with all the race conditions.
+ */
+static uint64_t kvm_dirty_ring_reap(KVMState *s)
+{
+    uint64_t total;
+
+    /*
+     * We need to lock all kvm slots for all address spaces here,
+     * because:
+     *
+     * (1) We need to mark dirty for dirty bitmaps in multiple slots
+     *     and for tons of pages, so it's better to take the lock here
+     *     once rather than once per page.  And more importantly,
+     *
+     * (2) We must _NOT_ publish dirty bits to the other threads
+     *     (e.g., the migration thread) via the kvm memory slot dirty
+     *     bitmaps before correctly re-protect those dirtied pages.
+     *     Otherwise we can have potential risk of data corruption if
+     *     the page data is read in the other thread before we do
+     *     reset below.
+     */
+    kvm_slots_lock();
+    total = kvm_dirty_ring_reap_locked(s);
+    kvm_slots_unlock();
+
+    return total;
+}
+
+static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
+{
+    /* No need to do anything */
+}
+
+/*
+ * Kick all vcpus out in a synchronized way.  When returned, we
+ * guarantee that every vcpu has been kicked and at least returned to
+ * userspace once.
+ */
+static void kvm_cpu_synchronize_kick_all(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
+    }
+}
+
+/*
+ * Flush all the existing dirty pages to the KVM slot buffers.  When
+ * this call returns, we guarantee that all the touched dirty pages
+ * before calling this function have been put into the per-kvmslot
+ * dirty bitmap.
+ *
+ * This function must be called with BQL held.
+ */
+static void kvm_dirty_ring_flush(struct KVMDirtyRingReaper *r)
+{
+    trace_kvm_dirty_ring_flush(0);
+    /*
+     * The function needs to be serialized.  Since this function
+     * should always be with BQL held, serialization is guaranteed.
+     * However, let's be sure of it.
+     */
+    assert(qemu_mutex_iothread_locked());
+    /*
+     * First make sure to flush the hardware buffers by kicking all
+     * vcpus out in a synchronous way.
+     */
+    kvm_cpu_synchronize_kick_all();
+    kvm_dirty_ring_reap(kvm_state);
+    trace_kvm_dirty_ring_flush(1);
+}
+
 /**
  * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
  *
@@ -1103,7 +1313,24 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                 goto out;
             }
             if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
-                kvm_slot_get_dirty_log(kvm_state, mem);
+                /*
+                 * NOTE: We should be aware of the fact that here we're only
+                 * doing a best effort to sync dirty bits.  No matter whether
+                 * we're using dirty log or dirty ring, we ignored two facts:
+                 *
+                 * (1) dirty bits can reside in hardware buffers (PML)
+                 *
+                 * (2) after we collected dirty bits here, pages can be dirtied
+                 * again before we do the final KVM_SET_USER_MEMORY_REGION to
+                 * remove the slot.
+                 *
+                 * Not easy.  Let's cross the fingers until it's fixed.
+                 */
+                if (kvm_state->kvm_dirty_ring_enabled) {
+                    kvm_dirty_ring_reap_locked(kvm_state);
+                } else {
+                    kvm_slot_get_dirty_log(kvm_state, mem);
+                }
                 kvm_slot_sync_dirty_pages(mem);
             }
 
@@ -1151,6 +1378,51 @@ out:
     kvm_slots_unlock();
 }
 
+static void *kvm_dirty_ring_reaper_thread(void *data)
+{
+    KVMState *s = data;
+    struct KVMDirtyRingReaper *r = &s->reaper;
+
+    rcu_register_thread();
+
+    trace_kvm_dirty_ring_reaper("init");
+
+    while (true) {
+        r->reaper_state = KVM_DIRTY_RING_REAPER_WAIT;
+        trace_kvm_dirty_ring_reaper("wait");
+        /*
+         * TODO: provide a smarter timeout rather than a constant?
+         */
+        sleep(1);
+
+        trace_kvm_dirty_ring_reaper("wakeup");
+        r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
+
+        qemu_mutex_lock_iothread();
+        kvm_dirty_ring_reap(s);
+        qemu_mutex_unlock_iothread();
+
+        r->reaper_iteration++;
+    }
+
+    trace_kvm_dirty_ring_reaper("exit");
+
+    rcu_unregister_thread();
+
+    return NULL;
+}
+
+static int kvm_dirty_ring_reaper_init(KVMState *s)
+{
+    struct KVMDirtyRingReaper *r = &s->reaper;
+
+    qemu_thread_create(&r->reaper_thr, "kvm-reaper",
+                       kvm_dirty_ring_reaper_thread,
+                       s, QEMU_THREAD_JOINABLE);
+
+    return 0;
+}
+
 static void kvm_region_add(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
@@ -1179,6 +1451,36 @@ static void kvm_log_sync(MemoryListener *listener,
     kvm_slots_unlock();
 }
 
+static void kvm_log_sync_global(MemoryListener *l)
+{
+    KVMMemoryListener *kml = container_of(l, KVMMemoryListener, listener);
+    KVMState *s = kvm_state;
+    KVMSlot *mem;
+    int i;
+
+    /* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
+    kvm_dirty_ring_flush(&s->reaper);
+
+    /*
+     * TODO: make this faster when nr_slots is big while there are
+     * only a few used slots (small VMs).
+     */
+    kvm_slots_lock();
+    for (i = 0; i < s->nr_slots; i++) {
+        mem = &kml->slots[i];
+        if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+            kvm_slot_sync_dirty_pages(mem);
+            /*
+             * This is not needed by KVM_GET_DIRTY_LOG because the
+             * ioctl will unconditionally overwrite the whole region.
+             * However kvm dirty ring has no such side effect.
+             */
+            kvm_slot_reset_dirty_pages(mem);
+        }
+    }
+    kvm_slots_unlock();
+}
+
 static void kvm_log_clear(MemoryListener *listener,
                           MemoryRegionSection *section)
 {
@@ -1285,10 +1587,15 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
     kml->listener.region_del = kvm_region_del;
     kml->listener.log_start = kvm_log_start;
     kml->listener.log_stop = kvm_log_stop;
-    kml->listener.log_sync = kvm_log_sync;
-    kml->listener.log_clear = kvm_log_clear;
     kml->listener.priority = 10;
 
+    if (s->kvm_dirty_ring_enabled) {
+        kml->listener.log_sync_global = kvm_log_sync_global;
+    } else {
+        kml->listener.log_sync = kvm_log_sync;
+        kml->listener.log_clear = kvm_log_clear;
+    }
+
     memory_listener_register(&kml->listener, as);
 
     for (i = 0; i < s->nr_as; ++i) {
@@ -2180,6 +2487,13 @@ static int kvm_init(MachineState *ms)
         qemu_balloon_inhibit(true);
     }
 
+    if (s->kvm_dirty_ring_enabled) {
+        ret = kvm_dirty_ring_reaper_init(s);
+        if (ret) {
+            goto err;
+        }
+    }
+
     return 0;
 
 err:
@@ -2487,6 +2801,17 @@ int kvm_cpu_exec(CPUState *cpu)
         case KVM_EXIT_INTERNAL_ERROR:
             ret = kvm_handle_internal_error(cpu, run);
             break;
+        case KVM_EXIT_DIRTY_RING_FULL:
+            /*
+             * We shouldn't continue if the dirty ring of this vcpu is
+             * still full.  Got kicked by KVM_RESET_DIRTY_RINGS.
+             */
+            trace_kvm_dirty_ring_full(cpu->cpu_index);
+            qemu_mutex_lock_iothread();
+            kvm_dirty_ring_reap(kvm_state);
+            qemu_mutex_unlock_iothread();
+            ret = 0;
+            break;
         case KVM_EXIT_SYSTEM_EVENT:
             switch (run->system_event.type) {
             case KVM_SYSTEM_EVENT_SHUTDOWN:
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 4fb6e59d19..89ef99569f 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -16,4 +16,11 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
+kvm_dirty_ring_full(int id) "vcpu %d"
+kvm_dirty_ring_reap_vcpu(int id) "vcpu %d"
+kvm_dirty_ring_page(int vcpu, uint32_t slot, uint64_t offset) "vcpu %d fetch %"PRIu32" offset 0x%"PRIx64
+kvm_dirty_ring_reaper(const char *s) "%s"
+kvm_dirty_ring_reap(uint64_t count, int64_t t) "reaped %"PRIu64" pages (took %"PRIi64" us)"
+kvm_dirty_ring_reaper_kick(const char *reason) "%s"
+kvm_dirty_ring_flush(int finished) "%d"
 
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index a998dc2620..ed8ea59eb5 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -340,6 +340,11 @@ struct qemu_work_item;
  * @ignore_memory_transaction_failures: Cached copy of the MachineState
  *    flag of the same name: allows the board to suppress calling of the
  *    CPU do_transaction_failed hook function.
+ * @kvm_dirty_ring_full:
+ *   Whether the kvm dirty ring of this vcpu is soft-full.
+ * @kvm_dirty_ring_avail:
+ *   Semaphore to be posted when the kvm dirty ring of the vcpu is
+ *   available again.
  *
  * State of one CPU core or thread.
  */
@@ -407,9 +412,12 @@ struct CPUState {
      */
     uintptr_t mem_io_pc;
 
+    /* Only used in KVM */
     int kvm_fd;
     struct KVMState *kvm_state;
     struct kvm_run *kvm_run;
+    struct kvm_dirty_gfn *kvm_dirty_gfns;
+    uint32_t kvm_fetch_index;
 
     /* Used for events with 'vcpu' and *without* the 'disabled' properties */
     DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS);
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part)
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (10 preceding siblings ...)
  2020-05-23 23:20 ` [PATCH RFC v3 11/11] KVM: Dirty ring support Peter Xu
@ 2020-05-24 13:06 ` Peter Xu
  2020-05-26 14:17 ` Peter Xu
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-24 13:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert

On Sat, May 23, 2020 at 07:20:24PM -0400, Peter Xu wrote:
> I kept the dirty sync in kvm_set_phys_mem() for kvmslot removals, left a
> comment on the known issue about strict dirty sync so we can fix it someday in
> the future together with dirty log and dirty ring.

Side note: patch 3,5-8 should not be RFC material at all - they either fixes
existing issues or clean code up.  Please conside to review/merge them first
even before the rest of the patches.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 01/11] linux-headers: Update
  2020-05-23 23:20 ` [PATCH RFC v3 01/11] linux-headers: Update Peter Xu
@ 2020-05-24 13:27   ` Peter Maydell
  2020-05-24 14:06     ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2020-05-24 13:27 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, QEMU Developers, Dr . David Alan Gilbert

On Sun, 24 May 2020 at 00:21, Peter Xu <peterx@redhat.com> wrote:
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Header updates should always include the upstream
kernel commit against which you ran the scripts/update-linux-headers.sh
script, please.

 linux-headers/asm-x86/kvm.h |  1 +
 linux-headers/linux/kvm.h   | 53 +++++++++++++++++++++++++++++++++++++

Are these really the only files which had changes? It looks
a suspiciously short list...

thanks
-- PMM


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 01/11] linux-headers: Update
  2020-05-24 13:27   ` Peter Maydell
@ 2020-05-24 14:06     ` Peter Xu
  2020-05-24 17:50       ` Peter Maydell
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Xu @ 2020-05-24 14:06 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, QEMU Developers, Dr . David Alan Gilbert

Hi, Peter,

On Sun, May 24, 2020 at 02:27:14PM +0100, Peter Maydell wrote:
> On Sun, 24 May 2020 at 00:21, Peter Xu <peterx@redhat.com> wrote:
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Header updates should always include the upstream
> kernel commit against which you ran the scripts/update-linux-headers.sh
> script, please.

This is based on a kernel series that hasn't yet been merged, so I didn't tag
it (so this is still a RFC series).  Will do when it's merged.

> 
>  linux-headers/asm-x86/kvm.h |  1 +
>  linux-headers/linux/kvm.h   | 53 +++++++++++++++++++++++++++++++++++++
> 
> Are these really the only files which had changes? It looks
> a suspiciously short list...

I didn't check, but I did use the script all the time..

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 03/11] KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  2020-05-23 23:20 ` [PATCH RFC v3 03/11] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
@ 2020-05-24 16:39   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-05-24 16:39 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert

On 5/24/20 1:20 AM, Peter Xu wrote:
> kvm_vm_ioctl() handles the errno trick already for ioctl() on
> returning -1 for errors.  Fix this.
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index d06cc04079..6e015aa2d4 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -699,14 +699,13 @@ static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start,
>      d.num_pages = bmap_npages;
>      d.slot = mem->slot | (as_id << 16);
>  
> -    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
> -        ret = -errno;
> +    ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);
> +    if (ret) {
>          error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
>                       "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
>                       __func__, d.slot, (uint64_t)d.first_page,
>                       (uint32_t)d.num_pages, ret);
>      } else {
> -        ret = 0;
>          trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
>      }
>  
> 

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 01/11] linux-headers: Update
  2020-05-24 14:06     ` Peter Xu
@ 2020-05-24 17:50       ` Peter Maydell
  2020-05-25 14:29         ` Peter Xu
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Maydell @ 2020-05-24 17:50 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, QEMU Developers, Dr . David Alan Gilbert

On Sun, 24 May 2020 at 15:07, Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Peter,
>
> On Sun, May 24, 2020 at 02:27:14PM +0100, Peter Maydell wrote:
> > On Sun, 24 May 2020 at 00:21, Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> >
> > Header updates should always include the upstream
> > kernel commit against which you ran the scripts/update-linux-headers.sh
> > script, please.
>
> This is based on a kernel series that hasn't yet been merged, so I didn't tag
> it (so this is still a RFC series).  Will do when it's merged.

Ah, cool. (It's helpful to note in the commit message for the
header-update patch if it's a not-yet-upstream set of changes.)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 01/11] linux-headers: Update
  2020-05-24 17:50       ` Peter Maydell
@ 2020-05-25 14:29         ` Peter Xu
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-25 14:29 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Paolo Bonzini, QEMU Developers, Dr . David Alan Gilbert

On Sun, May 24, 2020 at 06:50:28PM +0100, Peter Maydell wrote:
> On Sun, 24 May 2020 at 15:07, Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Peter,
> >
> > On Sun, May 24, 2020 at 02:27:14PM +0100, Peter Maydell wrote:
> > > On Sun, 24 May 2020 at 00:21, Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > >
> > > Header updates should always include the upstream
> > > kernel commit against which you ran the scripts/update-linux-headers.sh
> > > script, please.
> >
> > This is based on a kernel series that hasn't yet been merged, so I didn't tag
> > it (so this is still a RFC series).  Will do when it's merged.
> 
> Ah, cool. (It's helpful to note in the commit message for the
> header-update patch if it's a not-yet-upstream set of changes.)

Sure.  I'll add that some into the commit message in my next post if the kernel
series is still unmerged (which is very likely).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part)
  2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (11 preceding siblings ...)
  2020-05-24 13:06 ` [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
@ 2020-05-26 14:17 ` Peter Xu
  12 siblings, 0 replies; 19+ messages in thread
From: Peter Xu @ 2020-05-26 14:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert

On Sat, May 23, 2020 at 07:20:24PM -0400, Peter Xu wrote:
> I gave it a shot with a 24G guest, 8 vcpus, using 10g NIC as migration
> channel.  When idle or dirty workload small, I don't observe major
> difference on total migration time.  When with higher random dirty
> workload (800MB/s dirty rate upon 20G memory, worse for kvm dirty
> ring). Total migration time is (ping pong migrate for 6 times, in
> seconds):
> 
> |-------------------------+---------------|
> | dirty ring (4k entries) | dirty logging |
> |-------------------------+---------------|
> |                      70 |            58 |
> |                      78 |            70 |
> |                      72 |            48 |
> |                      74 |            52 |
> |                      83 |            49 |
> |                      65 |            54 |
> |-------------------------+---------------|
> 
> Summary:
> 
> dirty ring average:    73s
> dirty logging average: 55s
> 
> The KVM dirty ring will be slower in above case.  The number may show
> that the dirty logging is still preferred as a default value because
> small/medium VMs are still major cases, and high dirty workload
> happens frequently too.  And that's what this series did.

Two more TODOs that can potentially be worked upon:

- Consider to drop the BQL dependency when reap dirty rings: then we can run
  the reaper thread in parallel of main thread.  Needs some thoughts around the
  race conditions of main thread to see whether it's doable.

- Consider to drop the kvmslot bitmap: logically this can be dropped with kvm
  dirty ring, not only for space saving, but also it's slower, and it's yet
  another layer linear to guest mem size which is against the whole idea of kvm
  dirty ring.  This should make above number (of kvm dirty ring) even smaller,
  but probably still not as good as dirty log if workload is high.  I'm not
  sure whether it's possible to even drop the whole ramblock dirty bitmap when
  kvm enabled then we remove all the bitmap caches (I guess VGA would still
  need a very small one if it wants, or just do the refresh unconditionally),
  but that's a much bigger surgery and a wild idea, but logically it should be
  even more efficient with the ring structure, then precopy will just work
  similar to postcopy that there'll be a queue of dirty pages (probably except
  the 1st round of precopy).

I'll append them into the cover letter of next version too.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-05-26 14:18 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-23 23:20 [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 01/11] linux-headers: Update Peter Xu
2020-05-24 13:27   ` Peter Maydell
2020-05-24 14:06     ` Peter Xu
2020-05-24 17:50       ` Peter Maydell
2020-05-25 14:29         ` Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 02/11] memory: Introduce log_sync_global() to memory listener Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 03/11] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
2020-05-24 16:39   ` Philippe Mathieu-Daudé
2020-05-23 23:20 ` [PATCH RFC v3 04/11] KVM: Use a big lock to replace per-kml slots_lock Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 05/11] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 06/11] KVM: Provide helper to get kvm dirty log Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 07/11] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 08/11] KVM: Simplify dirty log sync in kvm_set_phys_mem Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 09/11] KVM: Cache kvm slot dirty bitmap size Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 10/11] KVM: Add dirty-gfn-count property Peter Xu
2020-05-23 23:20 ` [PATCH RFC v3 11/11] KVM: Dirty ring support Peter Xu
2020-05-24 13:06 ` [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) Peter Xu
2020-05-26 14:17 ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.