QEMU-Devel Archive on lore.kernel.org
* [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part)
@ 2020-02-05 14:17 Peter Xu
  2020-02-05 14:17 ` [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

This is still an RFC because the kernel counterpart is still under
review.  However, please feel free to read through the code if you
want; it even has rich comments, so it is not really in WIP state
itself.  Any kind of comment is greatly welcome.

For anyone who wants to try (we need to upgrade kernel too):

KVM branch:
  https://github.com/xzpeter/linux/tree/kvm-dirty-ring

QEMU branch for testing:
  https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

Overview
========

KVM dirty ring is a new interface for passing dirty bits from the
kernel to userspace.  Instead of using a bitmap for each memory
region, the dirty ring contains an array of dirtied GPAs to fetch,
one ring per vcpu.

There are a few major changes compared to how the old dirty logging
interface works:

- Granularity of dirty bits

  The KVM dirty ring interface does not offer memory-region-level
  granularity for collecting dirty bits (i.e., per KVM memory slot).
  Instead, the dirty bits are collected globally for all the vcpus at
  once.  The major effect is on the VGA part, because VGA dirty
  tracking is enabled for as long as the device exists, and it used
  to work at memory region granularity; now that operation is
  amplified into a whole-VM sync.  Maybe there is a smarter way to do
  the same thing in VGA with the new interface, but so far I don't
  see it affecting much, at least on regular VMs.

- Collection of dirty bits

  The old dirty logging interface collects KVM dirty bits at
  synchronization time.  The KVM dirty ring interface instead uses a
  standalone thread to do that.  So when another thread (e.g., the
  migration thread) wants to synchronize the dirty bits, it simply
  kicks that thread and waits until it has flushed all the dirty bits
  to the ramblock dirty bitmap.

A new parameter "dirty-ring-size" is added to "-accel kvm".  By
default, the dirty ring is still disabled (size==0).  To enable it,
use:

  -accel kvm,dirty-ring-size=65536

This establishes a 64KB dirty ring buffer per vcpu.  Then, if we
migrate, it will switch to the dirty ring.

I gave it a shot with a 24G guest, 8 vcpus, using a 10G NIC as the
migration channel.  When idle, or with a small dirty workload, I
don't observe a major difference in total migration time.  With a
higher random dirty workload (800MB/s dirty rate over 20G of memory,
which is the worse case for the kvm dirty ring), the total migration
times are (ping-pong migrating 6 times, in seconds):

|-------------------------+---------------|
| dirty ring (4k entries) | dirty logging |
|-------------------------+---------------|
|                      70 |            58 |
|                      78 |            70 |
|                      72 |            48 |
|                      74 |            52 |
|                      83 |            49 |
|                      65 |            54 |
|-------------------------+---------------|

Summary:

dirty ring average:    73s
dirty logging average: 55s

The KVM dirty ring is slower in the above case.  The numbers suggest
that dirty logging is still preferred as the default, because
small/medium VMs are still the major use case and high dirty
workloads happen frequently too.  And that is what this series does.

Please refer to the code and comments themselves for more information.

Thanks,

Peter Xu (9):
  KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  linux-headers: Update
  memory: Introduce log_sync_global() to memory listener
  KVM: Create the KVMSlot dirty bitmap on flag changes
  KVM: Provide helper to get kvm dirty log
  KVM: Provide helper to sync dirty bitmap from slot to ramblock
  KVM: Cache kvm slot dirty bitmap size
  KVM: Add dirty-ring-size property
  KVM: Dirty ring support

 accel/kvm/kvm-all.c         | 591 ++++++++++++++++++++++++++++++++----
 accel/kvm/trace-events      |   7 +
 include/exec/memory.h       |  12 +
 include/hw/core/cpu.h       |  10 +
 include/sysemu/kvm_int.h    |   5 +
 linux-headers/asm-x86/kvm.h |   1 +
 linux-headers/linux/kvm.h   |  44 +++
 memory.c                    |  33 +-
 qemu-options.hx             |   3 +
 9 files changed, 638 insertions(+), 68 deletions(-)

-- 
2.24.1



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 16:34   ` Dr. David Alan Gilbert
  2020-03-25 17:43   ` Peter Xu
  2020-02-05 14:17 ` [PATCH RFC 2/9] linux-headers: Update Peter Xu
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

kvm_vm_ioctl() already converts the ioctl() convention (return -1
with errno set on error) into a -errno return value.  Fix the return
check accordingly.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c111312dfd..4be3cd2352 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -688,14 +688,13 @@ static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start,
     d.num_pages = bmap_npages;
     d.slot = mem->slot | (as_id << 16);
 
-    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
-        ret = -errno;
+    ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);
+    if (ret) {
         error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
                      "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
                      __func__, d.slot, (uint64_t)d.first_page,
                      (uint32_t)d.num_pages, ret);
     } else {
-        ret = 0;
         trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
     }
 
-- 
2.24.1




* [PATCH RFC 2/9] linux-headers: Update
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
  2020-02-05 14:17 ` [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-02-05 14:17 ` [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener Peter Xu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 linux-headers/asm-x86/kvm.h |  1 +
 linux-headers/linux/kvm.h   | 44 +++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 503d3f42da..b59bf356c4 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -12,6 +12,7 @@
 
 #define KVM_PIO_PAGE_OFFSET 1
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 9d647fad76..c5a6c6e0a6 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -236,6 +236,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
 #define KVM_EXIT_ARM_NISV         28
+#define KVM_EXIT_DIRTY_RING_FULL  29
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -1009,6 +1010,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_DIRTY_LOG_RING 179
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1475,9 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	  _IOW(KVMIO,  0xc2, int)
 
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS     _IO(KVMIO, 0xc3)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
@@ -1623,4 +1628,43 @@ struct kvm_hyperv_eventfd {
 #define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
 #define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
 
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         collected        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion (to collect dirty bits).  Also, it must not skip any
+ * dirty bits so that dirty bits are always collected in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+#define KVM_DIRTY_GFN_F_RESET           BIT(1)
+#define KVM_DIRTY_GFN_F_MASK            0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn.  The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+	__u32 flags;
+	__u32 slot;
+	__u64 offset;
+};
+
 #endif /* __LINUX_KVM_H */
-- 
2.24.1




* [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
  2020-02-05 14:17 ` [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
  2020-02-05 14:17 ` [PATCH RFC 2/9] linux-headers: Update Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 16:52   ` Dr. David Alan Gilbert
  2020-02-05 14:17 ` [PATCH RFC 4/9] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

Some memory listeners may want to do log synchronization without
being able to specify a range of memory to sync, only globally.  Such
a memory listener should provide this new method instead of the
log_sync() method.

Obviously we could also achieve a similar thing by putting the global
sync logic into a log_sync() handler.  However, that is not efficient
enough, because memory_global_dirty_log_sync() would then do the
global sync N times, where N is the number of flat views.

Make this new method mutually exclusive with log_sync().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 12 ++++++++++++
 memory.c              | 33 +++++++++++++++++++++++----------
 2 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index e85b7de99a..c4427094bb 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -533,6 +533,18 @@ struct MemoryListener {
      */
     void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
 
+    /**
+     * @log_sync_global:
+     *
+     * This is the global version of @log_sync when the listener does
+     * not have a way to synchronize the log with finer granularity.
+     * When the listener registers with @log_sync_global defined, then
+     * its @log_sync must be NULL.  Vice versa.
+     *
+     * @listener: The #MemoryListener.
+     */
+    void (*log_sync_global)(MemoryListener *listener);
+
     /**
      * @log_clear:
      *
diff --git a/memory.c b/memory.c
index aeaa8dcc9e..53828ba00c 100644
--- a/memory.c
+++ b/memory.c
@@ -2016,6 +2016,10 @@ void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
                                         memory_region_get_dirty_log_mask(mr));
 }
 
+/*
+ * If memory region `mr' is NULL, do global sync.  Otherwise, sync
+ * dirty bitmap for the specified memory region.
+ */
 static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
 {
     MemoryListener *listener;
@@ -2029,18 +2033,24 @@ static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
      * address space once.
      */
     QTAILQ_FOREACH(listener, &memory_listeners, link) {
-        if (!listener->log_sync) {
-            continue;
-        }
-        as = listener->address_space;
-        view = address_space_get_flatview(as);
-        FOR_EACH_FLAT_RANGE(fr, view) {
-            if (fr->dirty_log_mask && (!mr || fr->mr == mr)) {
-                MemoryRegionSection mrs = section_from_flat_range(fr, view);
-                listener->log_sync(listener, &mrs);
+        if (listener->log_sync) {
+            as = listener->address_space;
+            view = address_space_get_flatview(as);
+            FOR_EACH_FLAT_RANGE(fr, view) {
+                if (fr->dirty_log_mask && (!mr || fr->mr == mr)) {
+                    MemoryRegionSection mrs = section_from_flat_range(fr, view);
+                    listener->log_sync(listener, &mrs);
+                }
             }
+            flatview_unref(view);
+        } else if (listener->log_sync_global) {
+            /*
+             * No matter whether MR is specified, what we can do here
+             * is to do a global sync, because we are not capable to
+             * sync in a finer granularity.
+             */
+            listener->log_sync_global(listener);
         }
-        flatview_unref(view);
     }
 }
 
@@ -2727,6 +2737,9 @@ void memory_listener_register(MemoryListener *listener, AddressSpace *as)
 {
     MemoryListener *other = NULL;
 
+    /* Only one of them can be defined for a listener */
+    assert(!(listener->log_sync && listener->log_sync_global));
+
     listener->address_space = as;
     if (QTAILQ_EMPTY(&memory_listeners)
         || listener->priority >= QTAILQ_LAST(&memory_listeners)->priority) {
-- 
2.24.1




* [PATCH RFC 4/9] KVM: Create the KVMSlot dirty bitmap on flag changes
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (2 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 17:44   ` Dr. David Alan Gilbert
  2020-02-05 14:17 ` [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log Peter Xu
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

Previously we had two places that create the per-KVMSlot dirty
bitmap:

  1. When a newly created KVMSlot has dirty logging enabled,
  2. When the first log_sync() happens for a memory slot.

The 2nd case is lazy-init, while the 1st case is not (it is a fix for
what the 2nd case missed).

For explicit initialization of dirty bitmaps, what is missing is
creating the dirty bitmap when the slot changes from not-dirty-tracked
to dirty-tracked.  Do that in kvm_slot_update_flags().

With that, we can safely remove the 2nd lazy-init.

This change is needed for the kvm dirty ring, because the dirty ring
does not use the log_sync() interface at all.

While at it, move all the pre-checks into kvm_slot_init_dirty_bitmap().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 4be3cd2352..bb635c775f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -162,6 +162,8 @@ static NotifierList kvm_irqchip_change_notifiers =
 #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
 #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
 
+static void kvm_slot_init_dirty_bitmap(KVMSlot *mem);
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_accel());
@@ -442,6 +444,7 @@ static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
         return 0;
     }
 
+    kvm_slot_init_dirty_bitmap(mem);
     return kvm_set_user_memory_region(kml, mem, false);
 }
 
@@ -526,8 +529,12 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
 /* Allocate the dirty bitmap for a slot  */
-static void kvm_memslot_init_dirty_bitmap(KVMSlot *mem)
+static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
 {
+    if (!(mem->flags & KVM_MEM_LOG_DIRTY_PAGES) || mem->dirty_bmap) {
+        return;
+    }
+
     /*
      * XXX bad kernel interface alert
      * For dirty bitmap, kernel allocates array of size aligned to
@@ -578,11 +585,6 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
             goto out;
         }
 
-        if (!mem->dirty_bmap) {
-            /* Allocate on the first log_sync, once and for all */
-            kvm_memslot_init_dirty_bitmap(mem);
-        }
-
         d.dirty_bitmap = mem->dirty_bmap;
         d.slot = mem->slot | (kml->as_id << 16);
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
@@ -1079,14 +1081,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         mem->start_addr = start_addr;
         mem->ram = ram;
         mem->flags = kvm_mem_flags(mr);
-
-        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
-            /*
-             * Reallocate the bmap; it means it doesn't disappear in
-             * middle of a migrate.
-             */
-            kvm_memslot_init_dirty_bitmap(mem);
-        }
+        kvm_slot_init_dirty_bitmap(mem);
         err = kvm_set_user_memory_region(kml, mem, true);
         if (err) {
             fprintf(stderr, "%s: error registering slot: %s\n", __func__,
-- 
2.24.1




* [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (3 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 4/9] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 17:52   ` Dr. David Alan Gilbert
  2020-02-05 14:17 ` [PATCH RFC 6/9] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

Provide a helper, kvm_slot_get_dirty_log(), to make
kvm_physical_sync_dirty_bitmap() clearer.  We can even cache the
as_id in the KVMSlot when it is created, so that we don't need to
pass it down every time.

While at it, remove the return value of
kvm_physical_sync_dirty_bitmap(), because it should never fail.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 39 +++++++++++++++++++--------------------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index bb635c775f..608216fd53 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -553,6 +553,18 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
     mem->dirty_bmap = g_malloc0(bitmap_size);
 }
 
+/* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
+static void kvm_slot_get_dirty_log(KVMState *s, KVMSlot *slot)
+{
+    struct kvm_dirty_log d = {};
+    int ret;
+
+    d.dirty_bitmap = slot->dirty_bmap;
+    d.slot = slot->slot | (slot->as_id << 16);
+    ret = kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d);
+    assert(ret != -1);
+}
+
 /**
  * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
  *
@@ -564,15 +576,13 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
  * @kml: the KVM memory listener object
  * @section: the memory section to sync the dirty bitmap with
  */
-static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
-                                          MemoryRegionSection *section)
+static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
+                                           MemoryRegionSection *section)
 {
     KVMState *s = kvm_state;
-    struct kvm_dirty_log d = {};
     KVMSlot *mem;
     hwaddr start_addr, size;
     hwaddr slot_size, slot_offset = 0;
-    int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
     while (size) {
@@ -582,27 +592,19 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
         mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
         if (!mem) {
             /* We don't have a slot if we want to trap every access. */
-            goto out;
+            return;
         }
 
-        d.dirty_bitmap = mem->dirty_bmap;
-        d.slot = mem->slot | (kml->as_id << 16);
-        if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
-            DPRINTF("ioctl failed %d\n", errno);
-            ret = -1;
-            goto out;
-        }
+        kvm_slot_get_dirty_log(s, mem);
 
         subsection.offset_within_region += slot_offset;
         subsection.size = int128_make64(slot_size);
-        kvm_get_dirty_pages_log_range(&subsection, d.dirty_bitmap);
+        kvm_get_dirty_pages_log_range(&subsection, mem->dirty_bmap);
 
         slot_offset += slot_size;
         start_addr += slot_size;
         size -= slot_size;
     }
-out:
-    return ret;
 }
 
 /* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
@@ -1077,6 +1079,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     do {
         slot_size = MIN(kvm_max_slot_size, size);
         mem = kvm_alloc_slot(kml);
+        mem->as_id = kml->as_id;
         mem->memory_size = slot_size;
         mem->start_addr = start_addr;
         mem->ram = ram;
@@ -1119,14 +1122,10 @@ static void kvm_log_sync(MemoryListener *listener,
                          MemoryRegionSection *section)
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
-    int r;
 
     kvm_slots_lock(kml);
-    r = kvm_physical_sync_dirty_bitmap(kml, section);
+    kvm_physical_sync_dirty_bitmap(kml, section);
     kvm_slots_unlock(kml);
-    if (r < 0) {
-        abort();
-    }
 }
 
 static void kvm_log_clear(MemoryListener *listener,
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index ac2d1f8b56..4434e15ec7 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -23,6 +23,8 @@ typedef struct KVMSlot
     int old_flags;
     /* Dirty bitmap cache for the slot */
     unsigned long *dirty_bmap;
+    /* Cache of the address space ID */
+    int as_id;
 } KVMSlot;
 
 typedef struct KVMMemoryListener {
-- 
2.24.1




* [PATCH RFC 6/9] KVM: Provide helper to sync dirty bitmap from slot to ramblock
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (4 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 18:47   ` Dr. David Alan Gilbert
  2020-02-05 14:17 ` [PATCH RFC 7/9] KVM: Cache kvm slot dirty bitmap size Peter Xu
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

kvm_physical_sync_dirty_bitmap() calculates the ramblock offset in an
awkward way from the MemoryRegionSection passed in by the caller.
The truth is that for each KVMSlot the ramblock offset never changes
during the slot's lifetime.  Cache the ramblock offset for each
KVMSlot in the structure when the KVMSlot is created.

With that, we can further simplify kvm_physical_sync_dirty_bitmap()
with a helper that syncs the dirty bitmap of a specific KVMSlot into
the ramblock dirty bitmap.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 37 +++++++++++++++++--------------------
 include/sysemu/kvm_int.h |  2 ++
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 608216fd53..f81e7a644b 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -515,15 +515,12 @@ static void kvm_log_stop(MemoryListener *listener,
 }
 
 /* get kvm's dirty pages bitmap and update qemu's */
-static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
-                                         unsigned long *bitmap)
+static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
 {
-    ram_addr_t start = section->offset_within_region +
-                       memory_region_get_ram_addr(section->mr);
-    ram_addr_t pages = int128_get64(section->size) / qemu_real_host_page_size;
+    ram_addr_t start = slot->ram_start_offset;
+    ram_addr_t pages = slot->memory_size / qemu_real_host_page_size;
 
-    cpu_physical_memory_set_dirty_lebitmap(bitmap, start, pages);
-    return 0;
+    cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
 }
 
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
@@ -582,12 +579,10 @@ static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
     KVMState *s = kvm_state;
     KVMSlot *mem;
     hwaddr start_addr, size;
-    hwaddr slot_size, slot_offset = 0;
+    hwaddr slot_size;
 
     size = kvm_align_section(section, &start_addr);
     while (size) {
-        MemoryRegionSection subsection = *section;
-
         slot_size = MIN(kvm_max_slot_size, size);
         mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
         if (!mem) {
@@ -596,12 +591,7 @@ static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
         }
 
         kvm_slot_get_dirty_log(s, mem);
-
-        subsection.offset_within_region += slot_offset;
-        subsection.size = int128_make64(slot_size);
-        kvm_get_dirty_pages_log_range(&subsection, mem->dirty_bmap);
-
-        slot_offset += slot_size;
+        kvm_slot_sync_dirty_pages(mem);
         start_addr += slot_size;
         size -= slot_size;
     }
@@ -1023,7 +1013,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     int err;
     MemoryRegion *mr = section->mr;
     bool writeable = !mr->readonly && !mr->rom_device;
-    hwaddr start_addr, size, slot_size;
+    hwaddr start_addr, size, slot_size, mr_offset;
+    ram_addr_t ram_start_offset;
     void *ram;
 
     if (!memory_region_is_ram(mr)) {
@@ -1041,9 +1032,13 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         return;
     }
 
-    /* use aligned delta to align the ram address */
-    ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
-          (start_addr - section->offset_within_address_space);
+    /* The offset of the kvmslot within the memory region */
+    mr_offset = section->offset_within_region + start_addr -
+        section->offset_within_address_space;
+
+    /* use aligned delta to align the ram address and offset */
+    ram = memory_region_get_ram_ptr(mr) + mr_offset;
+    ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset;
 
     kvm_slots_lock(kml);
 
@@ -1082,6 +1077,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         mem->as_id = kml->as_id;
         mem->memory_size = slot_size;
         mem->start_addr = start_addr;
+        mem->ram_start_offset = ram_start_offset;
         mem->ram = ram;
         mem->flags = kvm_mem_flags(mr);
         kvm_slot_init_dirty_bitmap(mem);
@@ -1092,6 +1088,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             abort();
         }
         start_addr += slot_size;
+        ram_start_offset += slot_size;
         ram += slot_size;
         size -= slot_size;
     } while (size);
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 4434e15ec7..1a19bfef80 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -25,6 +25,8 @@ typedef struct KVMSlot
     unsigned long *dirty_bmap;
     /* Cache of the address space ID */
     int as_id;
+    /* Cache of the offset in ram address space */
+    ram_addr_t ram_start_offset;
 } KVMSlot;
 
 typedef struct KVMMemoryListener {
-- 
2.24.1




* [PATCH RFC 7/9] KVM: Cache kvm slot dirty bitmap size
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (5 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 6/9] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 18:49   ` Dr. David Alan Gilbert
  2020-02-05 14:17 ` [PATCH RFC 8/9] KVM: Add dirty-ring-size property Peter Xu
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

Cache the dirty bitmap size as well, because we'll reference it more
frequently in the future.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c      | 1 +
 include/sysemu/kvm_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f81e7a644b..ea7b8f7ca5 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -548,6 +548,7 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
     hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
                                         /*HOST_LONG_BITS*/ 64) / 8;
     mem->dirty_bmap = g_malloc0(bitmap_size);
+    mem->dirty_bmap_size = bitmap_size;
 }
 
 /* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 1a19bfef80..71c9946ecf 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -23,6 +23,7 @@ typedef struct KVMSlot
     int old_flags;
     /* Dirty bitmap cache for the slot */
     unsigned long *dirty_bmap;
+    unsigned long dirty_bmap_size;
     /* Cache of the address space ID */
     int as_id;
     /* Cache of the offset in ram address space */
-- 
2.24.1




* [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (6 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 7/9] KVM: Cache kvm slot dirty bitmap size Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 20:00   ` Dr. David Alan Gilbert
  2020-03-25 20:14   ` Dr. David Alan Gilbert
  2020-02-05 14:17 ` [PATCH RFC 9/9] KVM: Dirty ring support Peter Xu
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

Add a parameter to set the size of the dirty ring.  If zero, the
dirty ring is disabled.  Otherwise the dirty ring will be enabled
with the specified per-vcpu size.  If the dirty ring cannot be
enabled because the kernel does not support it, we fall back to
dirty logging.  By default the dirty ring is not enabled
(dirty-ring-size==0).
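
As a usage sketch (the value is taken from the cover letter; the rest
of the command line would be supplied as usual):

```
qemu-system-x86_64 -accel kvm,dirty-ring-size=65536
```

This requests a 65536-byte per-vcpu dirty ring.  With a kernel that
does not advertise KVM_CAP_DIRTY_LOG_RING, the code below simply
leaves the ring disabled and the classic dirty logging path is used.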

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
 qemu-options.hx     |  3 +++
 2 files changed, 67 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index ea7b8f7ca5..6d145a8b98 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -127,6 +127,8 @@ struct KVMState
         KVMMemoryListener *ml;
         AddressSpace *as;
     } *as;
+    int kvm_dirty_ring_size;
+    int kvm_dirty_gfn_count;    /* If nonzero, then kvm dirty ring enabled */
 };
 
 KVMState *kvm_state;
@@ -2077,6 +2079,33 @@ static int kvm_init(MachineState *ms)
     s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
     s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;
 
+    /*
+     * Enable KVM dirty ring if supported, otherwise fall back to
+     * dirty logging mode
+     */
+    if (s->kvm_dirty_ring_size > 0) {
+        /* Read the max supported pages */
+        ret = kvm_vm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING);
+        if (ret > 0) {
+            if (s->kvm_dirty_ring_size > ret) {
+                error_report("KVM dirty ring size %d too big (maximum is %d). "
+                             "Please use a smaller value.",
+                             s->kvm_dirty_ring_size, ret);
+                goto err;
+            }
+
+            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0,
+                                    s->kvm_dirty_ring_size);
+            if (ret) {
+                error_report("Enabling of KVM dirty ring failed: %d", ret);
+                goto err;
+            }
+
+            s->kvm_dirty_gfn_count =
+                s->kvm_dirty_ring_size / sizeof(struct kvm_dirty_gfn);
+        }
+    }
+
     kvm_memory_listener_register(s, &s->memory_listener,
                                  &address_space_memory, 0);
     memory_listener_register(&kvm_io_listener,
@@ -3037,6 +3066,33 @@ bool kvm_kernel_irqchip_split(void)
     return kvm_state->kernel_irqchip_split == ON_OFF_AUTO_ON;
 }
 
+static void kvm_get_dirty_ring_size(Object *obj, Visitor *v,
+                                    const char *name, void *opaque,
+                                    Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    int64_t value = s->kvm_dirty_ring_size;
+
+    visit_type_int(v, name, &value, errp);
+}
+
+static void kvm_set_dirty_ring_size(Object *obj, Visitor *v,
+                                    const char *name, void *opaque,
+                                    Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    Error *error = NULL;
+    int64_t value;
+
+    visit_type_int(v, name, &value, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    s->kvm_dirty_ring_size = value;
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
     KVMState *s = KVM_STATE(obj);
@@ -3044,6 +3100,8 @@ static void kvm_accel_instance_init(Object *obj)
     s->kvm_shadow_mem = -1;
     s->kernel_irqchip_allowed = true;
     s->kernel_irqchip_split = ON_OFF_AUTO_AUTO;
+    /* By default off */
+    s->kvm_dirty_ring_size = 0;
 }
 
 static void kvm_accel_class_init(ObjectClass *oc, void *data)
@@ -3065,6 +3123,12 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
         NULL, NULL, &error_abort);
     object_class_property_set_description(oc, "kvm-shadow-mem",
         "KVM shadow MMU size", &error_abort);
+
+    object_class_property_add(oc, "dirty-ring-size", "int",
+        kvm_get_dirty_ring_size, kvm_set_dirty_ring_size,
+        NULL, NULL, &error_abort);
+    object_class_property_set_description(oc, "dirty-ring-size",
+        "KVM dirty ring size (<=0 to disable)", &error_abort);
 }
 
 static const TypeInfo kvm_accel_type = {
diff --git a/qemu-options.hx b/qemu-options.hx
index 224a8e8712..140bd38726 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -119,6 +119,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
     "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
     "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
     "                tb-size=n (TCG translation block cache size)\n"
+    "                dirty-ring-size=n (KVM dirty ring size in Bytes)\n"
     "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
 STEXI
 @item -accel @var{name}[,prop=@var{value}[,...]]
@@ -140,6 +141,8 @@ irqchip completely is not recommended except for debugging purposes.
 Defines the size of the KVM shadow MMU.
 @item tb-size=@var{n}
 Controls the size (in MiB) of the TCG translation block cache.
+@item dirty-ring-size=@val{n}
+Controls the size (in Bytes) of KVM dirty ring (<=0 to disable).
 @item thread=single|multi
 Controls number of TCG threads. When the TCG is multi-threaded there will be one
 thread per vCPU therefor taking advantage of additional host cores. The default
-- 
2.24.1



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH RFC 9/9] KVM: Dirty ring support
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (7 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 8/9] KVM: Add dirty-ring-size property Peter Xu
@ 2020-02-05 14:17 ` Peter Xu
  2020-03-25 20:41   ` Dr. David Alan Gilbert
  2020-02-05 14:51 ` [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) no-reply
  2020-03-03 17:32 ` Peter Xu
  10 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-02-05 14:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Dr . David Alan Gilbert, peterx, Michael S . Tsirkin

KVM dirty ring is a new interface to pass dirty bits from the kernel
to userspace.  Instead of using a bitmap for each memory region, the
dirty ring contains an array of dirtied GPAs to fetch.  For each vcpu
there is one dirty ring bound to it.

There are a few major changes compared to how the old dirty logging
interface works:

  - Granularity of dirty bits

    The KVM dirty ring interface does not offer memory region level
    granularity for collecting dirty bits (i.e., per KVM memory
    slot).  Instead the dirty bits are collected globally for all the
    vcpus at once.  The major effect is on the VGA device, because
    VGA dirty tracking is enabled for as long as the device exists,
    and it used to work at memory region granularity.  Now that
    operation is amplified to a whole-VM sync.  Maybe there's a
    smarter way to do the same thing for VGA with the new interface,
    but so far I don't see it affecting much, at least on regular
    VMs.

  - Collection of dirty bits

    The old dirty logging interface collects KVM dirty bits at
    synchronization time.  The KVM dirty ring interface instead uses
    a standalone thread to do that.  So when another thread (e.g.,
    the migration thread) wants to synchronize the dirty bits, it
    simply kicks the reaper thread and waits until it has flushed all
    the dirty bits to the ramblock dirty bitmap.

For more information please refer to the comments in the code.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c    | 426 ++++++++++++++++++++++++++++++++++++++++-
 accel/kvm/trace-events |   7 +
 include/hw/core/cpu.h  |  10 +
 3 files changed, 440 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6d145a8b98..201617bbb7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -15,6 +15,7 @@
 
 #include "qemu/osdep.h"
 #include <sys/ioctl.h>
+#include <poll.h>
 
 #include <linux/kvm.h>
 
@@ -75,6 +76,47 @@ struct KVMParkedVcpu {
     QLIST_ENTRY(KVMParkedVcpu) node;
 };
 
+enum KVMReaperState {
+    KVM_REAPER_NONE = 0,
+    /* The reaper is sleeping */
+    KVM_REAPER_WAIT,
+    /* The reaper is reaping for dirty pages */
+    KVM_REAPER_REAPING,
+};
+
+/*
+ * KVM reaper instance, responsible for collecting the KVM dirty bits
+ * via the dirty ring.
+ */
+struct KVMDirtyRingReaper {
+    /* The reaper thread */
+    QemuThread reaper_thr;
+    /*
+     * Tell the reaper thread to wake up.  This should be used as a
+     * generic interface to kick the reaper thread, e.g., from vcpu
+     * threads when a vcpu gets an exit due to a full ring.
+     */
+    EventNotifier reaper_event;
+    /*
+     * This should only be used when someone wants to do synchronous
+     * flush of the dirty ring buffers.  Logically we can achieve this
+     * even with the reaper_event only, however that'll make things
+     * complicated.  This extra event can make the sync procedure easy
+     * and clean.
+     */
+    EventNotifier reaper_flush_event;
+    /*
+     * Used in pair with reaper_flush_event, that the sem will be
+     * posted to notify that the previous flush event is handled by
+     * the reaper thread.
+     */
+    QemuSemaphore reaper_flush_sem;
+    /* Iteration number of the reaper thread */
+    volatile uint64_t reaper_iteration;
+    /* Status of the reaper thread */
+    volatile enum KVMReaperState reaper_state;
+};
+
 struct KVMState
 {
     AccelState parent_obj;
@@ -121,7 +163,6 @@ struct KVMState
     void *memcrypt_handle;
     int (*memcrypt_encrypt_data)(void *handle, uint8_t *ptr, uint64_t len);
 
-    /* For "info mtree -f" to tell if an MR is registered in KVM */
     int nr_as;
     struct KVMAs {
         KVMMemoryListener *ml;
@@ -129,6 +170,7 @@ struct KVMState
     } *as;
     int kvm_dirty_ring_size;
     int kvm_dirty_gfn_count;    /* If nonzero, then kvm dirty ring enabled */
+    struct KVMDirtyRingReaper reaper;
 };
 
 KVMState *kvm_state;
@@ -348,6 +390,11 @@ int kvm_destroy_vcpu(CPUState *cpu)
         goto err;
     }
 
+    ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_size);
+    if (ret < 0) {
+        goto err;
+    }
+
     vcpu = g_malloc0(sizeof(*vcpu));
     vcpu->vcpu_id = kvm_arch_vcpu_id(cpu);
     vcpu->kvm_fd = cpu->kvm_fd;
@@ -391,6 +438,7 @@ int kvm_init_vcpu(CPUState *cpu)
     cpu->kvm_fd = ret;
     cpu->kvm_state = s;
     cpu->vcpu_dirty = true;
+    qemu_sem_init(&cpu->kvm_dirty_ring_avail, 0);
 
     mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
     if (mmap_size < 0) {
@@ -412,6 +460,18 @@ int kvm_init_vcpu(CPUState *cpu)
             (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
     }
 
+    if (s->kvm_dirty_gfn_count) {
+        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
+                                   PROT_READ | PROT_WRITE, MAP_SHARED,
+                                   cpu->kvm_fd,
+                                   PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
+        if (cpu->kvm_dirty_gfns == MAP_FAILED) {
+            ret = -errno;
+            DPRINTF("mmap'ing vcpu dirty gfns failed\n");
+            goto err;
+        }
+    }
+
     ret = kvm_arch_init_vcpu(cpu);
 err:
     return ret;
@@ -525,6 +585,11 @@ static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
     cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
 }
 
+static void kvm_slot_reset_dirty_pages(KVMSlot *slot)
+{
+    memset(slot->dirty_bmap, 0, slot->dirty_bmap_size);
+}
+
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
 /* Allocate the dirty bitmap for a slot  */
@@ -1100,6 +1165,305 @@ out:
     kvm_slots_unlock(kml);
 }
 
+static void kvm_dirty_ring_reaper_kick(void)
+{
+    trace_kvm_dirty_ring_reaper_kick("any");
+    event_notifier_set(&kvm_state->reaper.reaper_event);
+}
+
+static void kvm_dirty_ring_reaper_kick_flush(void)
+{
+    trace_kvm_dirty_ring_reaper_kick("flush");
+    event_notifier_set(&kvm_state->reaper.reaper_flush_event);
+}
+
+/* Should be with all slots_lock held for the address spaces */
+static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t as_id,
+                                     uint32_t slot_id, uint64_t offset)
+{
+    KVMMemoryListener *kml;
+    KVMSlot *mem;
+
+    assert(as_id < s->nr_as);
+    /* These fields shouldn't change after VM inits */
+    kml = s->as[as_id].ml;
+    mem = &kml->slots[slot_id];
+    set_bit(offset, mem->dirty_bmap);
+}
+
+static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
+{
+    return gfn->flags == KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
+{
+    gfn->flags = KVM_DIRTY_GFN_F_RESET;
+}
+
+/* Should be with all slots_lock held for the address spaces */
+static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
+{
+    struct kvm_dirty_gfn *dirty_gfns = cpu->kvm_dirty_gfns, *cur;
+    uint32_t gfn_count = s->kvm_dirty_gfn_count;
+    uint32_t count = 0, fetch = cpu->kvm_fetch_index;
+
+    assert(dirty_gfns && gfn_count);
+
+    trace_kvm_dirty_ring_reap_vcpu(cpu->cpu_index);
+
+    while (true) {
+        cur = &dirty_gfns[fetch % gfn_count];
+        if (!dirty_gfn_is_dirtied(cur)) {
+            break;
+        }
+        trace_kvm_dirty_ring_page(cpu->cpu_index, fetch, cur->offset);
+        kvm_dirty_ring_mark_page(s, cur->slot >> 16, cur->slot & 0xffff,
+                                 cur->offset);
+        dirty_gfn_set_collected(cur);
+        fetch++;
+        count++;
+    }
+    cpu->kvm_fetch_index = fetch;
+
+    return count;
+}
+
+static uint64_t kvm_dirty_ring_reap(KVMState *s)
+{
+    KVMMemoryListener *kml;
+    int ret, i, locked_count = s->nr_as;
+    CPUState *cpu;
+    uint64_t total = 0;
+
+    /*
+     * We need to lock all kvm slots for all address spaces here,
+     * because:
+     *
+     * (1) We need to mark dirty for dirty bitmaps in multiple slots
+     *     and for tons of pages, so it's better to take the lock here
+     *     once rather than once per page.  And more importantly,
+     *
+     * (2) We must _NOT_ publish dirty bits to the other threads
+     *     (e.g., the migration thread) via the kvm memory slot dirty
+     *     bitmaps before correctly re-protect those dirtied pages.
+     *     Otherwise we can have potential risk of data corruption if
+     *     the page data is read in the other thread before we do
+     *     reset below.
+     */
+    for (i = 0; i < s->nr_as; i++) {
+        kml = s->as[i].ml;
+        if (!kml) {
+            /*
+             * This is tricky - we grow s->as[] dynamically now.  Take
+             * care of that case.  We also assume that as[] fills up
+             * one by one starting from zero.  Without this, we race
+             * with register_smram_listener.
+             *
+             * TODO: make all these prettier...
+             */
+            locked_count = i;
+            break;
+        }
+        kvm_slots_lock(kml);
+    }
+
+    CPU_FOREACH(cpu) {
+        total += kvm_dirty_ring_reap_one(s, cpu);
+    }
+
+    if (total) {
+        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
+        assert(ret == total);
+    }
+
+    /* Unlock whatever locks that we have locked */
+    for (i = 0; i < locked_count; i++) {
+        kvm_slots_unlock(s->as[i].ml);
+    }
+
+    CPU_FOREACH(cpu) {
+        if (cpu->kvm_dirty_ring_full) {
+            qemu_sem_post(&cpu->kvm_dirty_ring_avail);
+        }
+    }
+
+    return total;
+}
+
+static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
+{
+    /* No need to do anything */
+}
+
+/*
+ * Kick all vcpus out in a synchronized way.  When returned, we
+ * guarantee that every vcpu has been kicked and at least returned to
+ * userspace once.
+ */
+static void kvm_cpu_synchronize_kick_all(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
+    }
+}
+
+/*
+ * Flush all the existing dirty pages to the KVM slot buffers.  When
+ * this call returns, we guarantee that all the touched dirty pages
+ * before calling this function have been put into the per-kvmslot
+ * dirty bitmap.
+ *
+ * To achieve this, we need to:
+ *
+ * (1) Kick all vcpus out; this will make sure that we flush all the
+ *     dirty buffers that are potentially in the hardware (PML) into
+ *     the dirty rings.  After that,
+ *
+ * (2) Kick the reaper thread and make sure it reaps all the dirty
+ *     pages that are in the dirty rings.
+ *
+ * This function must be called with BQL held.
+ */
+static void kvm_dirty_ring_flush(struct KVMDirtyRingReaper *r)
+{
+    uint64_t iteration;
+
+    trace_kvm_dirty_ring_flush(0);
+
+    /*
+     * The function needs to be serialized.  Since this function
+     * should always be with BQL held, serialization is guaranteed.
+     * However, let's be sure of it.
+     */
+    assert(qemu_mutex_iothread_locked());
+
+    /*
+     * First make sure to flush the hardware buffers by kicking all
+     * vcpus out in a synchronous way.
+     */
+    kvm_cpu_synchronize_kick_all();
+
+    iteration = r->reaper_iteration;
+
+    /*
+     * Kick the reaper to collect data.  Here we must make sure that
+     * it goes over a complete WAIT->REAPING->WAIT period so that we
+     * know the reaper has collected all the dirty pages even in the
+     * hardware buffers we just flushed.  To achieve this, we kick the
+     * flush_event.
+     */
+    kvm_dirty_ring_reaper_kick_flush();
+    qemu_sem_wait(&r->reaper_flush_sem);
+
+    /* When we reach here, we must have finished at least one iteration */
+    assert(r->reaper_iteration > iteration);
+
+    trace_kvm_dirty_ring_flush(1);
+}
+
+static void *kvm_dirty_ring_reaper_thread(void *data)
+{
+    KVMState *s = data;
+    struct KVMDirtyRingReaper *r = &s->reaper;
+    struct pollfd *pfd = g_new0(struct pollfd, 2);
+    uint64_t count;
+    int64_t stamp;
+    int ret;
+
+    rcu_register_thread();
+
+    trace_kvm_dirty_ring_reaper("init");
+
+    pfd[0].fd = event_notifier_get_fd(&r->reaper_event);
+    pfd[0].events = POLLIN;
+    pfd[1].fd = event_notifier_get_fd(&r->reaper_flush_event);
+    pfd[1].events = POLLIN;
+
+    while (true) {
+        bool flush_requested = false;
+
+        r->reaper_state = KVM_REAPER_WAIT;
+        trace_kvm_dirty_ring_reaper("wait");
+        /*
+         * TODO: provide a smarter timeout rather than a constant?  If
+         * this timeout is too small it could eat a lot of CPU
+         * resources; if too big then VGA could be less
+         * responsive.  30ms is a value that is not too small so it
+         * won't eat much CPU, while the VGA can still get ~30Hz
+         * refresh rate.
+         */
+        ret = poll(pfd, 2, 30);
+        trace_kvm_dirty_ring_reaper("wakeup");
+        r->reaper_state = KVM_REAPER_REAPING;
+
+        if (ret == -1) {
+            error_report("%s: poll() failed: %s", __func__, strerror(errno));
+            break;
+        }
+
+        /*
+         * Note: we only handle one request at a time.  Also, we'll
+         * clear the event flag before we reap, so each SET to the
+         * event will guarantee that another full-reap procedure will
+         * happen.
+         */
+        if (pfd[0].revents) {
+            ret = event_notifier_test_and_clear(&r->reaper_event);
+            assert(ret);
+        } else if (pfd[1].revents) {
+            ret = event_notifier_test_and_clear(&r->reaper_flush_event);
+            assert(ret);
+            flush_requested = true;
+        }
+
+        stamp = get_clock();
+        count = kvm_dirty_ring_reap(s);
+        stamp = get_clock() - stamp;
+
+        r->reaper_iteration++;
+
+        if (count) {
+            trace_kvm_dirty_ring_reaper_iterate(r->reaper_iteration,
+                                                count, stamp / 1000);
+        }
+
+        /*
+         * If this iteration is to handle a flush event, wake up
+         * the requester of the flush.
+         */
+        if (flush_requested) {
+            qemu_sem_post(&r->reaper_flush_sem);
+        }
+    }
+
+    trace_kvm_dirty_ring_reaper("exit");
+
+    rcu_unregister_thread();
+
+    return NULL;
+}
+
+static int kvm_dirty_ring_reaper_init(KVMState *s)
+{
+    struct KVMDirtyRingReaper *r = &s->reaper;
+    int ret;
+
+    ret = event_notifier_init(&r->reaper_event, false);
+    assert(ret == 0);
+    ret = event_notifier_init(&r->reaper_flush_event, false);
+    assert(ret == 0);
+    qemu_sem_init(&r->reaper_flush_sem, 0);
+
+    qemu_thread_create(&r->reaper_thr, "kvm-reaper",
+                       kvm_dirty_ring_reaper_thread,
+                       s, QEMU_THREAD_JOINABLE);
+
+    return 0;
+}
+
 static void kvm_region_add(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
@@ -1128,6 +1492,36 @@ static void kvm_log_sync(MemoryListener *listener,
     kvm_slots_unlock(kml);
 }
 
+static void kvm_log_sync_global(MemoryListener *l)
+{
+    KVMMemoryListener *kml = container_of(l, KVMMemoryListener, listener);
+    KVMState *s = kvm_state;
+    KVMSlot *mem;
+    int i;
+
+    /* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
+    kvm_dirty_ring_flush(&s->reaper);
+
+    /*
+     * TODO: make this faster when nr_slots is big while there are
+     * only a few used slots (small VMs).
+     */
+    kvm_slots_lock(kml);
+    for (i = 0; i < s->nr_slots; i++) {
+        mem = &kml->slots[i];
+        if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+            kvm_slot_sync_dirty_pages(mem);
+            /*
+             * This is not needed by KVM_GET_DIRTY_LOG because the
+             * ioctl will unconditionally overwrite the whole region.
+             * However kvm dirty ring has no such side effect.
+             */
+            kvm_slot_reset_dirty_pages(mem);
+        }
+    }
+    kvm_slots_unlock(kml);
+}
+
 static void kvm_log_clear(MemoryListener *listener,
                           MemoryRegionSection *section)
 {
@@ -1234,10 +1628,17 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
     kml->listener.region_del = kvm_region_del;
     kml->listener.log_start = kvm_log_start;
     kml->listener.log_stop = kvm_log_stop;
-    kml->listener.log_sync = kvm_log_sync;
-    kml->listener.log_clear = kvm_log_clear;
     kml->listener.priority = 10;
 
+    if (s->kvm_dirty_gfn_count) {
+        /* KVM dirty ring enabled */
+        kml->listener.log_sync_global = kvm_log_sync_global;
+    } else {
+        /* KVM dirty logging enabled */
+        kml->listener.log_sync = kvm_log_sync;
+        kml->listener.log_clear = kvm_log_clear;
+    }
+
     memory_listener_register(&kml->listener, as);
 
     for (i = 0; i < s->nr_as; ++i) {
@@ -2120,6 +2521,13 @@ static int kvm_init(MachineState *ms)
         qemu_balloon_inhibit(true);
     }
 
+    if (s->kvm_dirty_gfn_count) {
+        ret = kvm_dirty_ring_reaper_init(s);
+        if (ret) {
+            goto err;
+        }
+    }
+
     return 0;
 
 err:
@@ -2427,6 +2835,18 @@ int kvm_cpu_exec(CPUState *cpu)
         case KVM_EXIT_INTERNAL_ERROR:
             ret = kvm_handle_internal_error(cpu, run);
             break;
+        case KVM_EXIT_DIRTY_RING_FULL:
+            /*
+             * We shouldn't continue while the dirty ring of this vcpu
+             * is still full; wait to be kicked after KVM_RESET_DIRTY_RINGS.
+             */
+            trace_kvm_dirty_ring_full(cpu->cpu_index);
+            cpu->kvm_dirty_ring_full = true;
+            kvm_dirty_ring_reaper_kick();
+            qemu_sem_wait(&cpu->kvm_dirty_ring_avail);
+            cpu->kvm_dirty_ring_full = false;
+            ret = 0;
+            break;
         case KVM_EXIT_SYSTEM_EVENT:
             switch (run->system_event.type) {
             case KVM_SYSTEM_EVENT_SHUTDOWN:
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 4fb6e59d19..17d6b6a154 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -16,4 +16,11 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
+kvm_dirty_ring_full(int id) "vcpu %d"
+kvm_dirty_ring_reap_vcpu(int id) "vcpu %d"
+kvm_dirty_ring_page(int vcpu, uint32_t slot, uint64_t offset) "vcpu %d fetch %"PRIu32" offset 0x%"PRIx64
+kvm_dirty_ring_reaper(const char *s) "%s"
+kvm_dirty_ring_reaper_iterate(uint64_t iter, uint64_t count, int64_t t) "iteration %"PRIu64" reaped %"PRIu64" pages (took %"PRIi64" us)"
+kvm_dirty_ring_reaper_kick(const char *reason) "%s"
+kvm_dirty_ring_flush(int finished) "%d"
 
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 73e9a869a4..a7cddb7b40 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -342,6 +342,11 @@ struct qemu_work_item;
  * @ignore_memory_transaction_failures: Cached copy of the MachineState
  *    flag of the same name: allows the board to suppress calling of the
  *    CPU do_transaction_failed hook function.
+ * @kvm_dirty_ring_full:
+ *   Whether the kvm dirty ring of this vcpu is soft-full.
+ * @kvm_dirty_ring_avail:
+ *   Semaphore to be posted when the kvm dirty ring of the vcpu is
+ *   available again.
  *
  * State of one CPU core or thread.
  */
@@ -409,9 +414,14 @@ struct CPUState {
      */
     uintptr_t mem_io_pc;
 
+    /* Only used in KVM */
     int kvm_fd;
     struct KVMState *kvm_state;
     struct kvm_run *kvm_run;
+    struct kvm_dirty_gfn *kvm_dirty_gfns;
+    uint32_t kvm_fetch_index;
+    QemuSemaphore kvm_dirty_ring_avail;
+    bool kvm_dirty_ring_full;
 
     /* Used for events with 'vcpu' and *without* the 'disabled' properties */
     DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS);
-- 
2.24.1



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part)
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (8 preceding siblings ...)
  2020-02-05 14:17 ` [PATCH RFC 9/9] KVM: Dirty ring support Peter Xu
@ 2020-02-05 14:51 ` no-reply
  2020-03-03 17:32 ` Peter Xu
  10 siblings, 0 replies; 31+ messages in thread
From: no-reply @ 2020-02-05 14:51 UTC (permalink / raw)
  To: peterx; +Cc: pbonzini, mst, qemu-devel, peterx, dgilbert

Patchew URL: https://patchew.org/QEMU/20200205141749.378044-1-peterx@redhat.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

./qemu-options.texi:86: unknown command `val'
./qemu-options.texi:86: misplaced {
./qemu-options.texi:86: misplaced }
make: *** [Makefile:1007: qemu-doc.txt] Error 1
make: *** Waiting for unfinished jobs....
./qemu-options.texi:86: unknown command `val'
./qemu-options.texi:86: misplaced {
./qemu-options.texi:86: misplaced }
make: *** [Makefile:1000: qemu-doc.html] Error 1
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
    sys.exit(main())
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=48e8f2ce93cb4dfe8067356d075d926c', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-7kdtzngm/src/docker-src.2020-02-05-09.47.34.30265:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=48e8f2ce93cb4dfe8067356d075d926c
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-7kdtzngm/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    3m51.950s
user    0m8.268s


The full log is available at
http://patchew.org/logs/20200205141749.378044-1-peterx@redhat.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part)
  2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
                   ` (9 preceding siblings ...)
  2020-02-05 14:51 ` [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) no-reply
@ 2020-03-03 17:32 ` Peter Xu
  10 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-03 17:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Dr . David Alan Gilbert, Michael S . Tsirkin

On Wed, Feb 05, 2020 at 09:17:40AM -0500, Peter Xu wrote:
> This is still RFC because the kernel counterpart is still under
> review.  However please feel free to read into the code a bit if you
> want; they've even got rich comments so not really in WIP status
> itself.  Any kind of comments are greatly welcomed.
> 
> For anyone who wants to try (we need to upgrade kernel too):
> 
> KVM branch:
>   https://github.com/xzpeter/linux/tree/kvm-dirty-ring
> 
> QEMU branch for testing:
>   https://github.com/xzpeter/qemu/tree/kvm-dirty-ring
> 
> Overview
> ========
> 
> KVM dirty ring is a new interface to pass over dirty bits from kernel
> to the userspace.  Instead of using a bitmap for each memory region,
> the dirty ring contains an array of dirtied GPAs to fetch, one ring
> per vcpu.
> 
> There're a few major changes comparing to how the old dirty logging
> interface would work:
> 
> - Granularity of dirty bits
> 
>   KVM dirty ring interface does not offer memory region level
>   granularity to collect dirty bits (i.e., per KVM memory
>   slot). Instead the dirty bit is collected globally for all the vcpus
>   at once.  The major effect is on VGA part because VGA dirty tracking
>   is enabled as long as the device is created, also it was in memory
>   region granularity.  Now that operation will be amplified to a VM
>   sync.  Maybe there's smarter way to do the same thing in VGA with
>   the new interface, but so far I don't see it affects much at least
>   on regular VMs.
> 
> - Collection of dirty bits
> 
>   The old dirty logging interface collects KVM dirty bits when
>   synchronizing dirty bits.  KVM dirty ring interface instead used a
>   standalone thread to do that.  So when the other thread (e.g., the
>   migration thread) wants to synchronize the dirty bits, it simply
>   kick the thread and wait until it flushes all the dirty bits to the
>   ramblock dirty bitmap.
> 
> A new parameter "dirty-ring-size" is added to "-accel kvm".  By
> default, dirty ring is still disabled (size==0).  To enable it, we
> need to be with:
> 
>   -accel kvm,dirty-ring-size=65536
> 
> This establishes a 64K dirty ring buffer per vcpu.  Then if we
> migrate, it'll switch to dirty ring.

Ping?

I'd like to know whether there's any high level comment about all
these... Considering that the kernel counterpart is still during
review, I guess we don't need to spend much time on that much,
assuming it'll be a ring of dirty addresses.  The rest should be
irrelevant to kernel so I'd glad to know if there's comments on those
parts.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  2020-02-05 14:17 ` [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
@ 2020-03-25 16:34   ` Dr. David Alan Gilbert
  2020-03-25 17:43   ` Peter Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 16:34 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> kvm_vm_ioctl() already handles the errno trick for ioctl(),
> returning -errno when ioctl() returns -1 on error.  Fix the return
> check accordingly.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  accel/kvm/kvm-all.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c111312dfd..4be3cd2352 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -688,14 +688,13 @@ static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start,
>      d.num_pages = bmap_npages;
>      d.slot = mem->slot | (as_id << 16);
>  
> -    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
> -        ret = -errno;
> +    ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);
> +    if (ret) {
>          error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
>                       "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
>                       __func__, d.slot, (uint64_t)d.first_page,
>                       (uint32_t)d.num_pages, ret);
>      } else {
> -        ret = 0;
>          trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
>      }
>  
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener
  2020-02-05 14:17 ` [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener Peter Xu
@ 2020-03-25 16:52   ` Dr. David Alan Gilbert
  2020-03-25 17:02     ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 16:52 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> Some memory listeners may want to do log synchronization without
> being able to specify a range of memory to sync, but only globally.
> Such a memory listener should provide this new method instead of the
> log_sync() method.
> 
> Obviously we could also achieve a similar thing by putting the global
> sync logic into a log_sync() handler.  However, that's not efficient
> because memory_global_dirty_log_sync() would then do the global sync
> N times, where N is the number of flat views.
> 
> Make this new method be exclusive to log_sync().
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

OK, so I guess the idea here is that when you have a ring with dirty
pages on it, you just need to clear all outstanding things on the ring
wherever they came from.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/exec/memory.h | 12 ++++++++++++
>  memory.c              | 33 +++++++++++++++++++++++----------
>  2 files changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index e85b7de99a..c4427094bb 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -533,6 +533,18 @@ struct MemoryListener {
>       */
>      void (*log_sync)(MemoryListener *listener, MemoryRegionSection *section);
>  
> +    /**
> +     * @log_sync_global:
> +     *
> +     * This is the global version of @log_sync when the listener does
> +     * not have a way to synchronize the log with finer granularity.
> +     * When the listener registers with @log_sync_global defined, then
> +     * its @log_sync must be NULL.  Vice versa.
> +     *
> +     * @listener: The #MemoryListener.
> +     */
> +    void (*log_sync_global)(MemoryListener *listener);
> +
>      /**
>       * @log_clear:
>       *
> diff --git a/memory.c b/memory.c
> index aeaa8dcc9e..53828ba00c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2016,6 +2016,10 @@ void memory_region_set_dirty(MemoryRegion *mr, hwaddr addr,
>                                          memory_region_get_dirty_log_mask(mr));
>  }
>  
> +/*
> + * If memory region `mr' is NULL, do global sync.  Otherwise, sync
> + * dirty bitmap for the specified memory region.
> + */
>  static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
>  {
>      MemoryListener *listener;
> @@ -2029,18 +2033,24 @@ static void memory_region_sync_dirty_bitmap(MemoryRegion *mr)
>       * address space once.
>       */
>      QTAILQ_FOREACH(listener, &memory_listeners, link) {
> -        if (!listener->log_sync) {
> -            continue;
> -        }
> -        as = listener->address_space;
> -        view = address_space_get_flatview(as);
> -        FOR_EACH_FLAT_RANGE(fr, view) {
> -            if (fr->dirty_log_mask && (!mr || fr->mr == mr)) {
> -                MemoryRegionSection mrs = section_from_flat_range(fr, view);
> -                listener->log_sync(listener, &mrs);
> +        if (listener->log_sync) {
> +            as = listener->address_space;
> +            view = address_space_get_flatview(as);
> +            FOR_EACH_FLAT_RANGE(fr, view) {
> +                if (fr->dirty_log_mask && (!mr || fr->mr == mr)) {
> +                    MemoryRegionSection mrs = section_from_flat_range(fr, view);
> +                    listener->log_sync(listener, &mrs);
> +                }
>              }
> +            flatview_unref(view);
> +        } else if (listener->log_sync_global) {
> +            /*
> +             * No matter whether MR is specified, what we can do here
> +             * is to do a global sync, because we are not capable to
> +             * sync in a finer granularity.
> +             */
> +            listener->log_sync_global(listener);
>          }
> -        flatview_unref(view);
>      }
>  }
>  
> @@ -2727,6 +2737,9 @@ void memory_listener_register(MemoryListener *listener, AddressSpace *as)
>  {
>      MemoryListener *other = NULL;
>  
> +    /* Only one of them can be defined for a listener */
> +    assert(!(listener->log_sync && listener->log_sync_global));
> +
>      listener->address_space = as;
>      if (QTAILQ_EMPTY(&memory_listeners)
>          || listener->priority >= QTAILQ_LAST(&memory_listeners)->priority) {
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener
  2020-03-25 16:52   ` Dr. David Alan Gilbert
@ 2020-03-25 17:02     ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-25 17:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Wed, Mar 25, 2020 at 04:52:52PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Some memory listeners may want to do log synchronization without
> > being able to specify a range of memory to sync, but only globally.
> > Such a memory listener should provide this new method instead of the
> > log_sync() method.
> > 
> > Obviously we could also achieve a similar thing by putting the global
> > sync logic into a log_sync() handler.  However, that's not efficient
> > because memory_global_dirty_log_sync() would then do the global sync
> > N times, where N is the number of flat views.
> > 
> > Make this new method be exclusive to log_sync().
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> OK, so I guess the idea here is that when you have a ring with dirty
> pages on it, you just need to clear all outstanding things on the ring
> whereever they came from.

Yeah, or put another way, it's about the granularity of log_sync():
the ring layout will only be able to sync in a global manner rather
than per-memslot.  That's actually not great when we only want to
sync some specific small memory region (e.g., the VGA code only wants
to sync the VGA RAM), but we don't have much choice...
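
As a cut-down sketch of the dispatch in the quoted memory.c hunk
(`Listener` here is a hypothetical stand-in for QEMU's MemoryListener,
not the real structure): a fine-grained listener is invoked once per
flat range, while a global listener is invoked exactly once no matter
how many ranges exist:

```c
#include <stddef.h>

typedef struct Listener Listener;
struct Listener {
    /* Exactly one of the two hooks is set, mirroring the new assert */
    void (*log_sync)(Listener *l, int section);  /* per-section sync */
    void (*log_sync_global)(Listener *l);        /* whole-VM sync */
};

int per_section_calls;
int global_calls;

void count_section(Listener *l, int section)
{
    (void)l; (void)section;
    per_section_calls++;
}

void count_global(Listener *l)
{
    (void)l;
    global_calls++;
}

/*
 * Cut-down version of the dispatch loop: a fine-grained listener runs
 * once per flat range, a global one runs exactly once regardless of
 * how many ranges (flat views) exist.
 */
void sync_one_listener(Listener *l, const int *sections, size_t n)
{
    if (l->log_sync) {
        for (size_t i = 0; i < n; i++) {
            l->log_sync(l, sections[i]);
        }
    } else if (l->log_sync_global) {
        l->log_sync_global(l);
    }
}
```

This is why folding the global sync into log_sync() would be wasteful:
it would be called N times, once per flat view.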

> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  2020-02-05 14:17 ` [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
  2020-03-25 16:34   ` Dr. David Alan Gilbert
@ 2020-03-25 17:43   ` Peter Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-25 17:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Paolo Bonzini, Michael Tokarev,
	Dr . David Alan Gilbert, Michael S . Tsirkin

On Wed, Feb 05, 2020 at 09:17:41AM -0500, Peter Xu wrote:
> kvm_vm_ioctl() already handles the errno trick for ioctl(),
> returning -errno when ioctl() returns -1 on error.  Fix the return
> check accordingly.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c111312dfd..4be3cd2352 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -688,14 +688,13 @@ static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start,
>      d.num_pages = bmap_npages;
>      d.slot = mem->slot | (as_id << 16);
>  
> -    if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
> -        ret = -errno;
> +    ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);

Dave raised a question offlist when comparing with KVM_GET_DIRTY_LOG,
regarding 50212d6346 ("Revert "fix return check for KVM_GET_DIRTY_LOG
ioctl"", 2014-04-14) where we wanted to allow KVM_GET_DIRTY_LOG to
fail in some cases.  I didn't find any context for that, and at first
glance I don't understand why or when we'd get -ENOENT while syncing
the dirty log (we should have the BQL held, so I don't see how a
memslot could be gone underneath us).  Does anyone know more?

CCing Peter Maydell and Michael Tokarev too.
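
For reference, the convention this fix relies on can be sketched as
follows (`kvm_vm_ioctl_sketch` is a hypothetical stand-in for QEMU's
kvm_vm_ioctl(), not the real function): the wrapper folds the raw
ioctl() protocol (-1 plus errno on failure) into a single negative-errno
return, so callers check `ret` against 0 rather than -1:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Hypothetical stand-in for QEMU's kvm_vm_ioctl(): convert the raw
 * ioctl() failure protocol (-1 with errno set) into a negative-errno
 * return value, so callers never need to look at errno themselves.
 */
int kvm_vm_ioctl_sketch(int vmfd, unsigned long request, void *arg)
{
    int ret = ioctl(vmfd, request, arg);

    return ret == -1 ? -errno : ret;
}
```

This is exactly why `== -1` on the wrapper's return value was wrong: the
wrapper returns -errno, which is only -1 when errno happens to be EPERM.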

> +    if (ret) {
>          error_report("%s: KVM_CLEAR_DIRTY_LOG failed, slot=%d, "
>                       "start=0x%"PRIx64", size=0x%"PRIx32", errno=%d",
>                       __func__, d.slot, (uint64_t)d.first_page,
>                       (uint32_t)d.num_pages, ret);
>      } else {
> -        ret = 0;
>          trace_kvm_clear_dirty_log(d.slot, d.first_page, d.num_pages);
>      }
>  
> -- 
> 2.24.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 4/9] KVM: Create the KVMSlot dirty bitmap on flag changes
  2020-02-05 14:17 ` [PATCH RFC 4/9] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
@ 2020-03-25 17:44   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 17:44 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> Previously we have two places that will create the per KVMSlot dirty
> bitmap:
> 
>   1. When a newly created KVMSlot has dirty logging enabled,
>   2. When the first log_sync() happens for a memory slot.
> 
> The 2nd case is lazy-init, while the 1st case is not (which is a fix
> of what the 2nd case missed).
> 
> To do explicit initialization of dirty bitmaps, what we're missing is
> to create the dirty bitmap when the slot changed from not-dirty-track
> to dirty-track.  Do that in kvm_slot_update_flags().
> 
> With that, we can safely remove the 2nd lazy-init.
> 
> This change will be needed for kvm dirty ring because kvm dirty ring
> does not use the log_sync() interface at all.
> 
> Since at it, move all the pre-checks into kvm_slot_init_dirty_bitmap().

'While at it' or just Also

> Signed-off-by: Peter Xu <peterx@redhat.com>


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  accel/kvm/kvm-all.c | 23 +++++++++--------------
>  1 file changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 4be3cd2352..bb635c775f 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -162,6 +162,8 @@ static NotifierList kvm_irqchip_change_notifiers =
>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
>  
> +static void kvm_slot_init_dirty_bitmap(KVMSlot *mem);
> +
>  int kvm_get_max_memslots(void)
>  {
>      KVMState *s = KVM_STATE(current_accel());
> @@ -442,6 +444,7 @@ static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
>          return 0;
>      }
>  
> +    kvm_slot_init_dirty_bitmap(mem);
>      return kvm_set_user_memory_region(kml, mem, false);
>  }
>  
> @@ -526,8 +529,12 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
>  #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
>  
>  /* Allocate the dirty bitmap for a slot  */
> -static void kvm_memslot_init_dirty_bitmap(KVMSlot *mem)
> +static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
>  {
> +    if (!(mem->flags & KVM_MEM_LOG_DIRTY_PAGES) || mem->dirty_bmap) {
> +        return;
> +    }
> +
>      /*
>       * XXX bad kernel interface alert
>       * For dirty bitmap, kernel allocates array of size aligned to
> @@ -578,11 +585,6 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>              goto out;
>          }
>  
> -        if (!mem->dirty_bmap) {
> -            /* Allocate on the first log_sync, once and for all */
> -            kvm_memslot_init_dirty_bitmap(mem);
> -        }
> -
>          d.dirty_bitmap = mem->dirty_bmap;
>          d.slot = mem->slot | (kml->as_id << 16);
>          if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
> @@ -1079,14 +1081,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>          mem->start_addr = start_addr;
>          mem->ram = ram;
>          mem->flags = kvm_mem_flags(mr);
> -
> -        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> -            /*
> -             * Reallocate the bmap; it means it doesn't disappear in
> -             * middle of a migrate.
> -             */
> -            kvm_memslot_init_dirty_bitmap(mem);
> -        }
> +        kvm_slot_init_dirty_bitmap(mem);
>          err = kvm_set_user_memory_region(kml, mem, true);
>          if (err) {
>              fprintf(stderr, "%s: error registering slot: %s\n", __func__,
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log
  2020-02-05 14:17 ` [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log Peter Xu
@ 2020-03-25 17:52   ` Dr. David Alan Gilbert
  2020-03-25 18:15     ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 17:52 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> Provide a helper kvm_slot_get_dirty_log() to make the function
> kvm_physical_sync_dirty_bitmap() clearer.  We can even cache the as_id
> into KVMSlot when it is created, so that we don't even need to pass it
> down every time.
> 
> While at it, remove the return value of kvm_physical_sync_dirty_bitmap()
> because it should never fail.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c      | 39 +++++++++++++++++++--------------------
>  include/sysemu/kvm_int.h |  2 ++
>  2 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index bb635c775f..608216fd53 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -553,6 +553,18 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
>      mem->dirty_bmap = g_malloc0(bitmap_size);
>  }
>  
> +/* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
> +static void kvm_slot_get_dirty_log(KVMState *s, KVMSlot *slot)
> +{
> +    struct kvm_dirty_log d = {};
> +    int ret;
> +
> +    d.dirty_bitmap = slot->dirty_bmap;
> +    d.slot = slot->slot | (slot->as_id << 16);
> +    ret = kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d);
> +    assert(ret != -1);

As discussed on irc, that -1 check seems odd given your previous check
but there seems to be some history as to why it was still there.  Hmm.

It also seems very trusting that it can never possibly fail!

Dave

> +}
> +
>  /**
>   * kvm_physical_sync_dirty_bitmap - Sync dirty bitmap from kernel space
>   *
> @@ -564,15 +576,13 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
>   * @kml: the KVM memory listener object
>   * @section: the memory section to sync the dirty bitmap with
>   */
> -static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
> -                                          MemoryRegionSection *section)
> +static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
> +                                           MemoryRegionSection *section)
>  {
>      KVMState *s = kvm_state;
> -    struct kvm_dirty_log d = {};
>      KVMSlot *mem;
>      hwaddr start_addr, size;
>      hwaddr slot_size, slot_offset = 0;
> -    int ret = 0;
>  
>      size = kvm_align_section(section, &start_addr);
>      while (size) {
> @@ -582,27 +592,19 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>          mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
>          if (!mem) {
>              /* We don't have a slot if we want to trap every access. */
> -            goto out;
> +            return;
>          }
>  
> -        d.dirty_bitmap = mem->dirty_bmap;
> -        d.slot = mem->slot | (kml->as_id << 16);
> -        if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
> -            DPRINTF("ioctl failed %d\n", errno);
> -            ret = -1;
> -            goto out;
> -        }
> +        kvm_slot_get_dirty_log(s, mem);
>  
>          subsection.offset_within_region += slot_offset;
>          subsection.size = int128_make64(slot_size);
> -        kvm_get_dirty_pages_log_range(&subsection, d.dirty_bitmap);
> +        kvm_get_dirty_pages_log_range(&subsection, mem->dirty_bmap);
>  
>          slot_offset += slot_size;
>          start_addr += slot_size;
>          size -= slot_size;
>      }
> -out:
> -    return ret;
>  }
>  
>  /* Alignment requirement for KVM_CLEAR_DIRTY_LOG - 64 pages */
> @@ -1077,6 +1079,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>      do {
>          slot_size = MIN(kvm_max_slot_size, size);
>          mem = kvm_alloc_slot(kml);
> +        mem->as_id = kml->as_id;
>          mem->memory_size = slot_size;
>          mem->start_addr = start_addr;
>          mem->ram = ram;
> @@ -1119,14 +1122,10 @@ static void kvm_log_sync(MemoryListener *listener,
>                           MemoryRegionSection *section)
>  {
>      KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
> -    int r;
>  
>      kvm_slots_lock(kml);
> -    r = kvm_physical_sync_dirty_bitmap(kml, section);
> +    kvm_physical_sync_dirty_bitmap(kml, section);
>      kvm_slots_unlock(kml);
> -    if (r < 0) {
> -        abort();
> -    }
>  }
>  
>  static void kvm_log_clear(MemoryListener *listener,
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index ac2d1f8b56..4434e15ec7 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -23,6 +23,8 @@ typedef struct KVMSlot
>      int old_flags;
>      /* Dirty bitmap cache for the slot */
>      unsigned long *dirty_bmap;
> +    /* Cache of the address space ID */
> +    int as_id;
>  } KVMSlot;
>  
>  typedef struct KVMMemoryListener {
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log
  2020-03-25 17:52   ` Dr. David Alan Gilbert
@ 2020-03-25 18:15     ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-25 18:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Wed, Mar 25, 2020 at 05:52:20PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Provide a helper kvm_slot_get_dirty_log() to make the function
> > kvm_physical_sync_dirty_bitmap() clearer.  We can even cache the as_id
> > into KVMSlot when it is created, so that we don't even need to pass it
> > down every time.
> > 
> > While at it, remove the return value of kvm_physical_sync_dirty_bitmap()
> > because it should never fail.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  accel/kvm/kvm-all.c      | 39 +++++++++++++++++++--------------------
> >  include/sysemu/kvm_int.h |  2 ++
> >  2 files changed, 21 insertions(+), 20 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index bb635c775f..608216fd53 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -553,6 +553,18 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
> >      mem->dirty_bmap = g_malloc0(bitmap_size);
> >  }
> >  
> > +/* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
> > +static void kvm_slot_get_dirty_log(KVMState *s, KVMSlot *slot)
> > +{
> > +    struct kvm_dirty_log d = {};
> > +    int ret;
> > +
> > +    d.dirty_bitmap = slot->dirty_bmap;
> > +    d.slot = slot->slot | (slot->as_id << 16);
> > +    ret = kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d);
> > +    assert(ret != -1);
> 
> As discussed on irc, that -1 check seems odd given your previous check
> but there seems to be some history as to why it was still there.  Hmm.
> 
> It also seems very trusting that it can never possibly fail!

Yeah I agree.  I've asked the question under patch 1.

Until we know the answer, I'll replace the assertion with an error
print if it triggers; checking against -1 (-EPERM) explicitly does
look odd...

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 6/9] KVM: Provide helper to sync dirty bitmap from slot to ramblock
  2020-02-05 14:17 ` [PATCH RFC 6/9] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
@ 2020-03-25 18:47   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 18:47 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> kvm_physical_sync_dirty_bitmap() calculates the ramblock offset in an
> awkward way from the MemoryRegionSection that passed in from the
> caller.  The truth is that for each KVMSlot the ramblock offset never
> changes for its lifetime.  Cache the ramblock offset for each KVMSlot
> into the structure when the KVMSlot is created.
> 
> With that, we can further simplify kvm_physical_sync_dirty_bitmap()
> with a helper to sync KVMSlot dirty bitmap to the ramblock dirty
> bitmap of a specific KVMSlot.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  accel/kvm/kvm-all.c      | 37 +++++++++++++++++--------------------
>  include/sysemu/kvm_int.h |  2 ++
>  2 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 608216fd53..f81e7a644b 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -515,15 +515,12 @@ static void kvm_log_stop(MemoryListener *listener,
>  }
>  
>  /* get kvm's dirty pages bitmap and update qemu's */
> -static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
> -                                         unsigned long *bitmap)
> +static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
>  {
> -    ram_addr_t start = section->offset_within_region +
> -                       memory_region_get_ram_addr(section->mr);
> -    ram_addr_t pages = int128_get64(section->size) / qemu_real_host_page_size;
> +    ram_addr_t start = slot->ram_start_offset;
> +    ram_addr_t pages = slot->memory_size / qemu_real_host_page_size;
>  
> -    cpu_physical_memory_set_dirty_lebitmap(bitmap, start, pages);
> -    return 0;
> +    cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
>  }
>  
>  #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
> @@ -582,12 +579,10 @@ static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>      KVMState *s = kvm_state;
>      KVMSlot *mem;
>      hwaddr start_addr, size;
> -    hwaddr slot_size, slot_offset = 0;
> +    hwaddr slot_size;
>  
>      size = kvm_align_section(section, &start_addr);
>      while (size) {
> -        MemoryRegionSection subsection = *section;
> -
>          slot_size = MIN(kvm_max_slot_size, size);
>          mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
>          if (!mem) {
> @@ -596,12 +591,7 @@ static void kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>          }
>  
>          kvm_slot_get_dirty_log(s, mem);
> -
> -        subsection.offset_within_region += slot_offset;
> -        subsection.size = int128_make64(slot_size);
> -        kvm_get_dirty_pages_log_range(&subsection, mem->dirty_bmap);
> -
> -        slot_offset += slot_size;
> +        kvm_slot_sync_dirty_pages(mem);
>          start_addr += slot_size;
>          size -= slot_size;
>      }
> @@ -1023,7 +1013,8 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>      int err;
>      MemoryRegion *mr = section->mr;
>      bool writeable = !mr->readonly && !mr->rom_device;
> -    hwaddr start_addr, size, slot_size;
> +    hwaddr start_addr, size, slot_size, mr_offset;
> +    ram_addr_t ram_start_offset;
>      void *ram;
>  
>      if (!memory_region_is_ram(mr)) {
> @@ -1041,9 +1032,13 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>          return;
>      }
>  
> -    /* use aligned delta to align the ram address */
> -    ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
> -          (start_addr - section->offset_within_address_space);
> +    /* The offset of the kvmslot within the memory region */
> +    mr_offset = section->offset_within_region + start_addr -
> +        section->offset_within_address_space;
> +
> +    /* use aligned delta to align the ram address and offset */
> +    ram = memory_region_get_ram_ptr(mr) + mr_offset;
> +    ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset;
>  
>      kvm_slots_lock(kml);
>  
> @@ -1082,6 +1077,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>          mem->as_id = kml->as_id;
>          mem->memory_size = slot_size;
>          mem->start_addr = start_addr;
> +        mem->ram_start_offset = ram_start_offset;
>          mem->ram = ram;
>          mem->flags = kvm_mem_flags(mr);
>          kvm_slot_init_dirty_bitmap(mem);
> @@ -1092,6 +1088,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
>              abort();
>          }
>          start_addr += slot_size;
> +        ram_start_offset += slot_size;
>          ram += slot_size;
>          size -= slot_size;
>      } while (size);
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 4434e15ec7..1a19bfef80 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -25,6 +25,8 @@ typedef struct KVMSlot
>      unsigned long *dirty_bmap;
>      /* Cache of the address space ID */
>      int as_id;
> +    /* Cache of the offset in ram address space */
> +    ram_addr_t ram_start_offset;
>  } KVMSlot;
>  
>  typedef struct KVMMemoryListener {
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 7/9] KVM: Cache kvm slot dirty bitmap size
  2020-02-05 14:17 ` [PATCH RFC 7/9] KVM: Cache kvm slot dirty bitmap size Peter Xu
@ 2020-03-25 18:49   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 18:49 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> Cache it too because we'll reference it more frequently in the future.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  accel/kvm/kvm-all.c      | 1 +
>  include/sysemu/kvm_int.h | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index f81e7a644b..ea7b8f7ca5 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -548,6 +548,7 @@ static void kvm_slot_init_dirty_bitmap(KVMSlot *mem)
>      hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
>                                          /*HOST_LONG_BITS*/ 64) / 8;
>      mem->dirty_bmap = g_malloc0(bitmap_size);
> +    mem->dirty_bmap_size = bitmap_size;
>  }
>  
>  /* Sync dirty bitmap from kernel to KVMSlot.dirty_bmap */
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 1a19bfef80..71c9946ecf 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -23,6 +23,7 @@ typedef struct KVMSlot
>      int old_flags;
>      /* Dirty bitmap cache for the slot */
>      unsigned long *dirty_bmap;
> +    unsigned long dirty_bmap_size;
>      /* Cache of the address space ID */
>      int as_id;
>      /* Cache of the offset in ram address space */
> -- 
> 2.24.1
> 
> 
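As a side note, the sizing rule that the cached dirty_bmap_size stores
can be sketched as below; this assumes 4K target pages, and
`dirty_bmap_bytes` is a hypothetical helper written for illustration,
not QEMU code:

```c
#include <stdint.h>

/* Assumption: 4K target pages (TARGET_PAGE_BITS == 12 on x86). */
#define SKETCH_TARGET_PAGE_BITS 12

/*
 * Hypothetical helper showing the sizing rule the quoted hunk caches:
 * one bit per page, with the page count rounded up to a multiple of 64
 * because the kernel copies the bitmap out in 64-bit words.
 */
uint64_t dirty_bmap_bytes(uint64_t memory_size)
{
    uint64_t pages = memory_size >> SKETCH_TARGET_PAGE_BITS;

    return ((pages + 63) & ~63ULL) / 8;
}
```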
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-02-05 14:17 ` [PATCH RFC 8/9] KVM: Add dirty-ring-size property Peter Xu
@ 2020-03-25 20:00   ` Dr. David Alan Gilbert
  2020-03-25 20:42     ` Peter Xu
  2020-03-25 20:14   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 20:00 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> Add a parameter for size of dirty ring.  If zero, dirty ring is
> disabled.  Otherwise dirty ring will be enabled with the per-vcpu size
> as specified.  If dirty ring cannot be enabled due to unsupported
> kernel, it'll fall back to dirty logging.  By default, dirty ring is
> not enabled (dirty-ring-size==0).
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
>  qemu-options.hx     |  3 +++
>  2 files changed, 67 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index ea7b8f7ca5..6d145a8b98 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -127,6 +127,8 @@ struct KVMState
>          KVMMemoryListener *ml;
>          AddressSpace *as;
>      } *as;
> +    int kvm_dirty_ring_size;
> +    int kvm_dirty_gfn_count;    /* If nonzero, then kvm dirty ring enabled */
>  };
>  
>  KVMState *kvm_state;
> @@ -2077,6 +2079,33 @@ static int kvm_init(MachineState *ms)
>      s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
>      s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;
>  
> +    /*
> +     * Enable KVM dirty ring if supported, otherwise fall back to
> +     * dirty logging mode
> +     */
> +    if (s->kvm_dirty_ring_size > 0) {
> +        /* Read the max supported pages */
> +        ret = kvm_vm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING);
> +        if (ret > 0) {
> +            if (s->kvm_dirty_ring_size > ret) {
> +                error_report("KVM dirty ring size %d too big (maximum is %d). "
> +                             "Please use a smaller value.",
> +                             s->kvm_dirty_ring_size, ret);
> +                goto err;
> +            }
> +
> +            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0,
> +                                    s->kvm_dirty_ring_size);
> +            if (ret) {
> +                error_report("Enabling of KVM dirty ring failed: %d", ret);
> +                goto err;
> +            }
> +
> +            s->kvm_dirty_gfn_count =
> +                s->kvm_dirty_ring_size / sizeof(struct kvm_dirty_gfn);

What happens if I was to pass dirty-ring-size=1 ?
Then the count would be 0 and things would get upset somewhere?
Do you need to check for a minimum positive value?

> +        }
> +    }
> +
>      kvm_memory_listener_register(s, &s->memory_listener,
>                                   &address_space_memory, 0);
>      memory_listener_register(&kvm_io_listener,
> @@ -3037,6 +3066,33 @@ bool kvm_kernel_irqchip_split(void)
>      return kvm_state->kernel_irqchip_split == ON_OFF_AUTO_ON;
>  }
>  
> +static void kvm_get_dirty_ring_size(Object *obj, Visitor *v,
> +                                    const char *name, void *opaque,
> +                                    Error **errp)
> +{
> +    KVMState *s = KVM_STATE(obj);
> +    int64_t value = s->kvm_dirty_ring_size;
> +
> +    visit_type_int(v, name, &value, errp);
> +}
> +
> +static void kvm_set_dirty_ring_size(Object *obj, Visitor *v,
> +                                    const char *name, void *opaque,
> +                                    Error **errp)
> +{
> +    KVMState *s = KVM_STATE(obj);
> +    Error *error = NULL;
> +    int64_t value;
> +
> +    visit_type_int(v, name, &value, &error);
> +    if (error) {
> +        error_propagate(errp, error);
> +        return;
> +    }
> +
> +    s->kvm_dirty_ring_size = value;
> +}
> +
>  static void kvm_accel_instance_init(Object *obj)
>  {
>      KVMState *s = KVM_STATE(obj);
> @@ -3044,6 +3100,8 @@ static void kvm_accel_instance_init(Object *obj)
>      s->kvm_shadow_mem = -1;
>      s->kernel_irqchip_allowed = true;
>      s->kernel_irqchip_split = ON_OFF_AUTO_AUTO;
> +    /* By default off */
> +    s->kvm_dirty_ring_size = 0;
>  }
>  
>  static void kvm_accel_class_init(ObjectClass *oc, void *data)
> @@ -3065,6 +3123,12 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
>          NULL, NULL, &error_abort);
>      object_class_property_set_description(oc, "kvm-shadow-mem",
>          "KVM shadow MMU size", &error_abort);
> +
> +    object_class_property_add(oc, "dirty-ring-size", "int",
> +        kvm_get_dirty_ring_size, kvm_set_dirty_ring_size,
> +        NULL, NULL, &error_abort);

I don't think someone passing in a non-number should cause an abort;
it should exit, but I don't think it should abort/core.

> +    object_class_property_set_description(oc, "dirty-ring-size",
> +        "KVM dirty ring size (<=0 to disable)", &error_abort);
>  }
>  
>  static const TypeInfo kvm_accel_type = {
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 224a8e8712..140bd38726 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -119,6 +119,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
>      "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
>      "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
>      "                tb-size=n (TCG translation block cache size)\n"
> +    "                dirty-ring-size=n (KVM dirty ring size in Bytes)\n"
>      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
>  STEXI
>  @item -accel @var{name}[,prop=@var{value}[,...]]
> @@ -140,6 +141,8 @@ irqchip completely is not recommended except for debugging purposes.
>  Defines the size of the KVM shadow MMU.
>  @item tb-size=@var{n}
>  Controls the size (in MiB) of the TCG translation block cache.
> +@item dirty-ring-size=@val{n}
> +Controls the size (in Bytes) of KVM dirty ring (<=0 to disable).

I don't see the point in allowing < 0 ; I'd ban it.

Dave


>  @item thread=single|multi
>  Controls number of TCG threads. When the TCG is multi-threaded there will be one
>  thread per vCPU therefor taking advantage of additional host cores. The default
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-02-05 14:17 ` [PATCH RFC 8/9] KVM: Add dirty-ring-size property Peter Xu
  2020-03-25 20:00   ` Dr. David Alan Gilbert
@ 2020-03-25 20:14   ` Dr. David Alan Gilbert
  2020-03-25 20:48     ` Peter Xu
  1 sibling, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 20:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> Add a parameter for size of dirty ring.  If zero, dirty ring is
> disabled.  Otherwise dirty ring will be enabled with the per-vcpu size
> as specified.  If dirty ring cannot be enabled due to unsupported
> kernel, it'll fallback to dirty logging.  By default, dirty ring is
> not enabled (dirty-ring-size==0).
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

<snip>

> diff --git a/qemu-options.hx b/qemu-options.hx
> index 224a8e8712..140bd38726 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -119,6 +119,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
>      "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
>      "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
>      "                tb-size=n (TCG translation block cache size)\n"
> +    "                dirty-ring-size=n (KVM dirty ring size in Bytes)\n"
>      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
>  STEXI
>  @item -accel @var{name}[,prop=@var{value}[,...]]
> @@ -140,6 +141,8 @@ irqchip completely is not recommended except for debugging purposes.
>  Defines the size of the KVM shadow MMU.
>  @item tb-size=@var{n}
>  Controls the size (in MiB) of the TCG translation block cache.
> +@item dirty-ring-size=@val{n}
> +Controls the size (in Bytes) of KVM dirty ring (<=0 to disable).

I think that needs to say 'per vcpu'

Dave

>  @item thread=single|multi
>  Controls number of TCG threads. When the TCG is multi-threaded there will be one
>  thread per vCPU therefor taking advantage of additional host cores. The default
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 9/9] KVM: Dirty ring support
  2020-02-05 14:17 ` [PATCH RFC 9/9] KVM: Dirty ring support Peter Xu
@ 2020-03-25 20:41   ` Dr. David Alan Gilbert
  2020-03-25 21:32     ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-25 20:41 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> KVM dirty ring is a new interface to pass over dirty bits from kernel
> to the userspace.  Instead of using a bitmap for each memory region,
> the dirty ring contains an array of dirtied GPAs to fetch.  For each
> vcpu there will be one dirty ring that binds to it.
> 
> There're a few major changes comparing to how the old dirty logging
> interface would work:
> 
>   - Granularity of dirty bits
> 
>     KVM dirty ring interface does not offer memory region level
>     granularity to collect dirty bits (i.e., per KVM memory slot).
>     Instead the dirty bit is collected globally for all the vcpus at
>     once.  The major effect is on VGA part because VGA dirty tracking
>     is enabled as long as the device is created, also it was in memory
>     region granularity.  Now that operation will be amplified to a VM
>     sync.  Maybe there's smarter way to do the same thing in VGA with
>     the new interface, but so far I don't see it affects much at least
>     on regular VMs.
> 
>   - Collection of dirty bits
> 
>     The old dirty logging interface collects KVM dirty bits when
>     synchronizing dirty bits.  KVM dirty ring interface instead used a
>     standalone thread to do that.  So when the other thread (e.g., the
>     migration thread) wants to synchronize the dirty bits, it simply
>     kick the thread and wait until it flushes all the dirty bits to
>     the ramblock dirty bitmap.
> 
> For more information please refer to the comments in the code.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c    | 426 ++++++++++++++++++++++++++++++++++++++++-
>  accel/kvm/trace-events |   7 +
>  include/hw/core/cpu.h  |  10 +
>  3 files changed, 440 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 6d145a8b98..201617bbb7 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -15,6 +15,7 @@
>  
>  #include "qemu/osdep.h"
>  #include <sys/ioctl.h>
> +#include <poll.h>
>  
>  #include <linux/kvm.h>
>  
> @@ -75,6 +76,47 @@ struct KVMParkedVcpu {
>      QLIST_ENTRY(KVMParkedVcpu) node;
>  };
>  
> +enum KVMReaperState {
> +    KVM_REAPER_NONE = 0,
> +    /* The reaper is sleeping */
> +    KVM_REAPER_WAIT,
> +    /* The reaper is reaping for dirty pages */
> +    KVM_REAPER_REAPING,
> +};

That probably needs to be KVMDirtyRingReaperState
given there are many things that could be reaped.

> +/*
> + * KVM reaper instance, responsible for collecting the KVM dirty bits
> + * via the dirty ring.
> + */
> +struct KVMDirtyRingReaper {
> +    /* The reaper thread */
> +    QemuThread reaper_thr;
> +    /*
> +     * Telling the reaper thread to wakeup.  This should be used as a
> +     * generic interface to kick the reaper thread, like, in vcpu
> +     * threads where it gets a exit due to ring full.
> +     */
> +    EventNotifier reaper_event;

I think I'd just use a simple semaphore for this type of thing.

> +    /*
> +     * This should only be used when someone wants to do synchronous
> +     * flush of the dirty ring buffers.  Logically we can achieve this
> +     * even with the reaper_event only, however that'll make things
> +     * complicated.  This extra event can make the sync procedure easy
> +     * and clean.
> +     */
> +    EventNotifier reaper_flush_event;
> +    /*
> +     * Used in pair with reaper_flush_event, that the sem will be
> +     * posted to notify that the previous flush event is handled by
> +     * the reaper thread.
> +     */
> +    QemuSemaphore reaper_flush_sem;
> +    /* Iteration number of the reaper thread */
> +    volatile uint64_t reaper_iteration;
> +    /* Status of the reaper thread */
> +    volatile enum KVMReaperState reaper_state;
> +};
> +
>  struct KVMState
>  {
>      AccelState parent_obj;
> @@ -121,7 +163,6 @@ struct KVMState
>      void *memcrypt_handle;
>      int (*memcrypt_encrypt_data)(void *handle, uint8_t *ptr, uint64_t len);
>  
> -    /* For "info mtree -f" to tell if an MR is registered in KVM */
>      int nr_as;
>      struct KVMAs {
>          KVMMemoryListener *ml;
> @@ -129,6 +170,7 @@ struct KVMState
>      } *as;
>      int kvm_dirty_ring_size;
>      int kvm_dirty_gfn_count;    /* If nonzero, then kvm dirty ring enabled */
> +    struct KVMDirtyRingReaper reaper;
>  };
>  
>  KVMState *kvm_state;
> @@ -348,6 +390,11 @@ int kvm_destroy_vcpu(CPUState *cpu)
>          goto err;
>      }
>  
> +    ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_size);
> +    if (ret < 0) {
> +        goto err;
> +    }
> +
>      vcpu = g_malloc0(sizeof(*vcpu));
>      vcpu->vcpu_id = kvm_arch_vcpu_id(cpu);
>      vcpu->kvm_fd = cpu->kvm_fd;
> @@ -391,6 +438,7 @@ int kvm_init_vcpu(CPUState *cpu)
>      cpu->kvm_fd = ret;
>      cpu->kvm_state = s;
>      cpu->vcpu_dirty = true;
> +    qemu_sem_init(&cpu->kvm_dirty_ring_avail, 0);
>  
>      mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
>      if (mmap_size < 0) {
> @@ -412,6 +460,18 @@ int kvm_init_vcpu(CPUState *cpu)
>              (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
>      }
>  
> +    if (s->kvm_dirty_gfn_count) {
> +        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
> +                                   PROT_READ | PROT_WRITE, MAP_SHARED,

Is the MAP_SHARED required?

> +                                   cpu->kvm_fd,
> +                                   PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
> +        if (cpu->kvm_dirty_gfns == MAP_FAILED) {
> +            ret = -errno;
> +            DPRINTF("mmap'ing vcpu dirty gfns failed\n");

Include errno?

> +            goto err;
> +        }
> +    }
> +
>      ret = kvm_arch_init_vcpu(cpu);
>  err:
>      return ret;
> @@ -525,6 +585,11 @@ static void kvm_slot_sync_dirty_pages(KVMSlot *slot)
>      cpu_physical_memory_set_dirty_lebitmap(slot->dirty_bmap, start, pages);
>  }
>  
> +static void kvm_slot_reset_dirty_pages(KVMSlot *slot)
> +{
> +    memset(slot->dirty_bmap, 0, slot->dirty_bmap_size);
> +}
> +
>  #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
>  
>  /* Allocate the dirty bitmap for a slot  */
> @@ -1100,6 +1165,305 @@ out:
>      kvm_slots_unlock(kml);
>  }
>  
> +static void kvm_dirty_ring_reaper_kick(void)
> +{
> +    trace_kvm_dirty_ring_reaper_kick("any");
> +    event_notifier_set(&kvm_state->reaper.reaper_event);
> +}
> +
> +static void kvm_dirty_ring_reaper_kick_flush(void)
> +{
> +    trace_kvm_dirty_ring_reaper_kick("flush");
> +    event_notifier_set(&kvm_state->reaper.reaper_flush_event);
> +}
> +
> +/* Should be with all slots_lock held for the address spaces */
> +static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t as_id,
> +                                     uint32_t slot_id, uint64_t offset)
> +{
> +    KVMMemoryListener *kml;
> +    KVMSlot *mem;
> +
> +    assert(as_id < s->nr_as);
> +    /* These fields shouldn't change after VM inits */
> +    kml = s->as[as_id].ml;
> +    mem = &kml->slots[slot_id];
> +    set_bit(offset, mem->dirty_bmap);
> +}
> +
> +static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
> +{
> +    return gfn->flags == KVM_DIRTY_GFN_F_DIRTY;
> +}
> +
> +static void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
> +{
> +    gfn->flags = KVM_DIRTY_GFN_F_RESET;
> +}
> +
> +/* Should be with all slots_lock held for the address spaces */
> +static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
> +{
> +    struct kvm_dirty_gfn *dirty_gfns = cpu->kvm_dirty_gfns, *cur;
> +    uint32_t gfn_count = s->kvm_dirty_gfn_count;
> +    uint32_t count = 0, fetch = cpu->kvm_fetch_index;
> +
> +    assert(dirty_gfns && gfn_count);
> +
> +    trace_kvm_dirty_ring_reap_vcpu(cpu->cpu_index);
> +
> +    while (true) {
> +        cur = &dirty_gfns[fetch % gfn_count];
> +        if (!dirty_gfn_is_dirtied(cur)) {
> +            break;
> +        }
> +        trace_kvm_dirty_ring_page(cpu->cpu_index, fetch, cur->offset);
> +        kvm_dirty_ring_mark_page(s, cur->slot >> 16, cur->slot & 0xffff,
> +                                 cur->offset);
> +        dirty_gfn_set_collected(cur);
> +        fetch++;
> +        count++;
> +    }
> +    cpu->kvm_fetch_index = fetch;
> +
> +    return count;
> +}
> +
> +static uint64_t kvm_dirty_ring_reap(KVMState *s)
> +{
> +    KVMMemoryListener *kml;
> +    int ret, i, locked_count = s->nr_as;
> +    CPUState *cpu;
> +    uint64_t total = 0;
> +
> +    /*
> +     * We need to lock all kvm slots for all address spaces here,
> +     * because:
> +     *
> +     * (1) We need to mark dirty for dirty bitmaps in multiple slots
> +     *     and for tons of pages, so it's better to take the lock here
> +     *     once rather than once per page.  And more importantly,
> +     *
> +     * (2) We must _NOT_ publish dirty bits to the other threads
> +     *     (e.g., the migration thread) via the kvm memory slot dirty
> +     *     bitmaps before correctly re-protect those dirtied pages.
> +     *     Otherwise we can have potential risk of data corruption if
> +     *     the page data is read in the other thread before we do
> +     *     reset below.
> +     */
> +    for (i = 0; i < s->nr_as; i++) {
> +        kml = s->as[i].ml;
> +        if (!kml) {
> +            /*
> +             * This is tricky - we grow s->as[] dynamically now.  Take
> +             * care of that case.  We also assumed the as[] will fill
> +             * one by one starting from zero.  Without this, we race
> +             * with register_smram_listener.
> +             *
> +             * TODO: make all these prettier...
> +             */
> +            locked_count = i;
> +            break;
> +        }
> +        kvm_slots_lock(kml);
> +    }
> +
> +    CPU_FOREACH(cpu) {
> +        total += kvm_dirty_ring_reap_one(s, cpu);
> +    }
> +
> +    if (total) {
> +        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
> +        assert(ret == total);
> +    }
> +
> +    /* Unlock whatever locks that we have locked */
> +    for (i = 0; i < locked_count; i++) {
> +        kvm_slots_unlock(s->as[i].ml);
> +    }
> +
> +    CPU_FOREACH(cpu) {
> +        if (cpu->kvm_dirty_ring_full) {
> +            qemu_sem_post(&cpu->kvm_dirty_ring_avail);
> +        }

Why do you need to wait until here - couldn't you release
each vcpu after you've reaped it?

> +    }
> +
> +    return total;
> +}
> +
> +static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
> +{
> +    /* No need to do anything */
> +}
> +
> +/*
> + * Kick all vcpus out in a synchronized way.  When returned, we
> + * guarantee that every vcpu has been kicked and at least returned to
> + * userspace once.
> + */
> +static void kvm_cpu_synchronize_kick_all(void)
> +{
> +    CPUState *cpu;
> +
> +    CPU_FOREACH(cpu) {
> +        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
> +    }
> +}
> +
> +/*
> + * Flush all the existing dirty pages to the KVM slot buffers.  When
> + * this call returns, we guarantee that all the touched dirty pages
> + * before calling this function have been put into the per-kvmslot
> + * dirty bitmap.
> + *
> + * To achieve this, we need to:
> + *
> + * (1) Kick all vcpus out, this will make sure that we flush all the
> + *     dirty buffers that potentially in the hardware (PML) into the
> + *     dirty rings, after that,
> + *
> + * (2) Kick the reaper thread and make sure it reaps all the dirty
> + *     page that is in the dirty rings.
> + *
> + * This function must be called with BQL held.
> + */
> +static void kvm_dirty_ring_flush(struct KVMDirtyRingReaper *r)
> +{
> +    uint64_t iteration;
> +
> +    trace_kvm_dirty_ring_flush(0);
> +
> +    /*
> +     * The function needs to be serialized.  Since this function
> +     * should always be with BQL held, serialization is guaranteed.
> +     * However, let's be sure of it.
> +     */
> +    assert(qemu_mutex_iothread_locked());
> +
> +    /*
> +     * First make sure to flush the hardware buffers by kicking all
> +     * vcpus out in a synchronous way.
> +     */
> +    kvm_cpu_synchronize_kick_all();
> +
> +    iteration = r->reaper_iteration;
> +
> +    /*
> +     * Kick the reaper to collect data.  Here we must make sure that
> +     * it goes over a complete WAIT->REAPING->WAIT period so that we
> +     * know the reaper has collected all the dirty pages even in the
> +     * hardware buffers we just flushed.  To achieve this, we kick the
> +     * flush_event.
> +     */
> +    kvm_dirty_ring_reaper_kick_flush();
> +    qemu_sem_wait(&r->reaper_flush_sem);
> +
> +    /* When reach here, we must have finished at least one iteration */
> +    assert(r->reaper_iteration > iteration);
> +
> +    trace_kvm_dirty_ring_flush(1);
> +}
> +
> +static void *kvm_dirty_ring_reaper_thread(void *data)
> +{
> +    KVMState *s = data;
> +    struct KVMDirtyRingReaper *r = &s->reaper;
> +    struct pollfd *pfd = g_new0(struct pollfd, 2);
> +    uint64_t count;
> +    int64_t stamp;
> +    int ret;
> +
> +    rcu_register_thread();
> +
> +    trace_kvm_dirty_ring_reaper("init");
> +
> +    pfd[0].fd = event_notifier_get_fd(&r->reaper_event);
> +    pfd[0].events = POLLIN;
> +    pfd[1].fd = event_notifier_get_fd(&r->reaper_flush_event);
> +    pfd[1].events = POLLIN;
> +
> +    while (true) {
> +        bool flush_requested = false;
> +
> +        r->reaper_state = KVM_REAPER_WAIT;
> +        trace_kvm_dirty_ring_reaper("wait");
> +        /*
> +         * TODO: provide a smarter timeout rather than a constant?  If
> +         * this timeout is too small it could eat a lot of CPU
> +         * resource, however if too big then VGA could be less
> +         * responsive.  30ms is a value that is not too small so it
> +         * won't eat much CPU, while the VGA can still get ~30Hz
> +         * refresh rate.
> +         */
> +        ret = poll(pfd, 2, 30);
> +        trace_kvm_dirty_ring_reaper("wakeup");
> +        r->reaper_state = KVM_REAPER_REAPING;
> +
> +        if (ret == -1) {
> +            error_report("%s: poll() failed: %s", __func__, strerror(errno));
> +            break;
> +        }
> +
> +        /*
> +         * Note: we only handle one request at a time.  Also, we'll
> +         * clear the event flag before we reap, so each SET to the
> +         * event will guarantee that another full-reap procedure will
> +         * happen.
> +         */
> +        if (pfd[0].revents) {
> +            ret = event_notifier_test_and_clear(&r->reaper_event);
> +            assert(ret);
> +        } else if (pfd[1].revents) {
> +            ret = event_notifier_test_and_clear(&r->reaper_flush_event);
> +            assert(ret);
> +            flush_requested = true;
> +        }
> +
> +        stamp = get_clock();
> +        count = kvm_dirty_ring_reap(s);
> +        stamp = get_clock() - stamp;
> +
> +        r->reaper_iteration++;
> +
> +        if (count) {
> +            trace_kvm_dirty_ring_reaper_iterate(r->reaper_iteration,
> +                                                count, stamp / 1000);
> +        }
> +
> +        /*
> +         * If this iteration is to handle a flush event, wakeup the
> +         * requester of the flush
> +         */
> +        if (flush_requested) {
> +            qemu_sem_post(&r->reaper_flush_sem);
> +        }
> +    }
> +
> +    trace_kvm_dirty_ring_reaper("exit");
> +
> +    rcu_unregister_thread();
> +
> +    return NULL;
> +}
> +
> +static int kvm_dirty_ring_reaper_init(KVMState *s)
> +{
> +    struct KVMDirtyRingReaper *r = &s->reaper;
> +    int ret;
> +
> +    ret = event_notifier_init(&r->reaper_event, false);
> +    assert(ret == 0);
> +    ret = event_notifier_init(&r->reaper_flush_event, false);
> +    assert(ret == 0);
> +    qemu_sem_init(&r->reaper_flush_sem, 0);
> +
> +    qemu_thread_create(&r->reaper_thr, "kvm-reaper",
> +                       kvm_dirty_ring_reaper_thread,
> +                       s, QEMU_THREAD_JOINABLE);

That's marked as joinable - does it ever get joined?
If the reaper thread does exit on error (I can only see the poll
failure?) - what happens elsewhere - will things still try and kick it?

Dave

> +
> +    return 0;
> +}
> +
>  static void kvm_region_add(MemoryListener *listener,
>                             MemoryRegionSection *section)
>  {
> @@ -1128,6 +1492,36 @@ static void kvm_log_sync(MemoryListener *listener,
>      kvm_slots_unlock(kml);
>  }
>  
> +static void kvm_log_sync_global(MemoryListener *l)
> +{
> +    KVMMemoryListener *kml = container_of(l, KVMMemoryListener, listener);
> +    KVMState *s = kvm_state;
> +    KVMSlot *mem;
> +    int i;
> +
> +    /* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
> +    kvm_dirty_ring_flush(&s->reaper);
> +
> +    /*
> +     * TODO: make this faster when nr_slots is big while there are
> +     * only a few used slots (small VMs).
> +     */
> +    kvm_slots_lock(kml);
> +    for (i = 0; i < s->nr_slots; i++) {
> +        mem = &kml->slots[i];
> +        if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> +            kvm_slot_sync_dirty_pages(mem);
> +            /*
> +             * This is not needed by KVM_GET_DIRTY_LOG because the
> +             * ioctl will unconditionally overwrite the whole region.
> +             * However kvm dirty ring has no such side effect.
> +             */
> +            kvm_slot_reset_dirty_pages(mem);
> +        }
> +    }
> +    kvm_slots_unlock(kml);
> +}
> +
>  static void kvm_log_clear(MemoryListener *listener,
>                            MemoryRegionSection *section)
>  {
> @@ -1234,10 +1628,17 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
>      kml->listener.region_del = kvm_region_del;
>      kml->listener.log_start = kvm_log_start;
>      kml->listener.log_stop = kvm_log_stop;
> -    kml->listener.log_sync = kvm_log_sync;
> -    kml->listener.log_clear = kvm_log_clear;
>      kml->listener.priority = 10;
>  
> +    if (s->kvm_dirty_gfn_count) {
> +        /* KVM dirty ring enabled */
> +        kml->listener.log_sync_global = kvm_log_sync_global;
> +    } else {
> +        /* KVM dirty logging enabled */
> +        kml->listener.log_sync = kvm_log_sync;
> +        kml->listener.log_clear = kvm_log_clear;
> +    }
> +
>      memory_listener_register(&kml->listener, as);
>  
>      for (i = 0; i < s->nr_as; ++i) {
> @@ -2120,6 +2521,13 @@ static int kvm_init(MachineState *ms)
>          qemu_balloon_inhibit(true);
>      }
>  
> +    if (s->kvm_dirty_gfn_count) {
> +        ret = kvm_dirty_ring_reaper_init(s);
> +        if (ret) {
> +            goto err;
> +        }
> +    }
> +
>      return 0;
>  
>  err:
> @@ -2427,6 +2835,18 @@ int kvm_cpu_exec(CPUState *cpu)
>          case KVM_EXIT_INTERNAL_ERROR:
>              ret = kvm_handle_internal_error(cpu, run);
>              break;
> +        case KVM_EXIT_DIRTY_RING_FULL:
> +            /*
> +             * We shouldn't continue if the dirty ring of this vcpu is
> +             * still full.  Got kicked by KVM_RESET_DIRTY_RINGS.
> +             */
> +            trace_kvm_dirty_ring_full(cpu->cpu_index);
> +            cpu->kvm_dirty_ring_full = true;
> +            kvm_dirty_ring_reaper_kick();
> +            qemu_sem_wait(&cpu->kvm_dirty_ring_avail);
> +            cpu->kvm_dirty_ring_full = false;
> +            ret = 0;
> +            break;
>          case KVM_EXIT_SYSTEM_EVENT:
>              switch (run->system_event.type) {
>              case KVM_SYSTEM_EVENT_SHUTDOWN:
> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> index 4fb6e59d19..17d6b6a154 100644
> --- a/accel/kvm/trace-events
> +++ b/accel/kvm/trace-events
> @@ -16,4 +16,11 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> +kvm_dirty_ring_full(int id) "vcpu %d"
> +kvm_dirty_ring_reap_vcpu(int id) "vcpu %d"
> +kvm_dirty_ring_page(int vcpu, uint32_t slot, uint64_t offset) "vcpu %d fetch %"PRIu32" offset 0x%"PRIx64
> +kvm_dirty_ring_reaper(const char *s) "%s"
> +kvm_dirty_ring_reaper_iterate(uint64_t iter, uint64_t count, int64_t t) "iteration %"PRIu64" reaped %"PRIu64" pages (took %"PRIi64" us)"
> +kvm_dirty_ring_reaper_kick(const char *reason) "%s"
> +kvm_dirty_ring_flush(int finished) "%d"
>  
> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
> index 73e9a869a4..a7cddb7b40 100644
> --- a/include/hw/core/cpu.h
> +++ b/include/hw/core/cpu.h
> @@ -342,6 +342,11 @@ struct qemu_work_item;
>   * @ignore_memory_transaction_failures: Cached copy of the MachineState
>   *    flag of the same name: allows the board to suppress calling of the
>   *    CPU do_transaction_failed hook function.
> + * @kvm_dirty_ring_full:
> + *   Whether the kvm dirty ring of this vcpu is soft-full.
> + * @kvm_dirty_ring_avail:
> + *   Semaphore to be posted when the kvm dirty ring of the vcpu is
> + *   available again.
>   *
>   * State of one CPU core or thread.
>   */
> @@ -409,9 +414,14 @@ struct CPUState {
>       */
>      uintptr_t mem_io_pc;
>  
> +    /* Only used in KVM */
>      int kvm_fd;
>      struct KVMState *kvm_state;
>      struct kvm_run *kvm_run;
> +    struct kvm_dirty_gfn *kvm_dirty_gfns;
> +    uint32_t kvm_fetch_index;
> +    QemuSemaphore kvm_dirty_ring_avail;
> +    bool kvm_dirty_ring_full;
>  
>      /* Used for events with 'vcpu' and *without* the 'disabled' properties */
>      DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS);
> -- 
> 2.24.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-03-25 20:00   ` Dr. David Alan Gilbert
@ 2020-03-25 20:42     ` Peter Xu
  2020-03-26 13:41       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-03-25 20:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Wed, Mar 25, 2020 at 08:00:31PM +0000, Dr. David Alan Gilbert wrote:
> > @@ -2077,6 +2079,33 @@ static int kvm_init(MachineState *ms)
> >      s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
> >      s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;
> >  
> > +    /*
> > +     * Enable KVM dirty ring if supported, otherwise fall back to
> > +     * dirty logging mode
> > +     */
> > +    if (s->kvm_dirty_ring_size > 0) {
> > +        /* Read the max supported pages */
> > +        ret = kvm_vm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING);
> > +        if (ret > 0) {
> > +            if (s->kvm_dirty_ring_size > ret) {
> > +                error_report("KVM dirty ring size %d too big (maximum is %d). "
> > +                             "Please use a smaller value.",
> > +                             s->kvm_dirty_ring_size, ret);
> > +                goto err;
> > +            }
> > +
> > +            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0,
> > +                                    s->kvm_dirty_ring_size);
> > +            if (ret) {
> > +                error_report("Enabling of KVM dirty ring failed: %d", ret);
> > +                goto err;
> > +            }
> > +
> > +            s->kvm_dirty_gfn_count =
> > +                s->kvm_dirty_ring_size / sizeof(struct kvm_dirty_gfn);
> 
> What happens if I was to pass dirty-ring-size=1 ?
> Then the count would be 0 and things would get upset somewhere?
> Do you need to check for a minimum positive value?

The above kvm_vm_enable_cap() should fail directly and QEMU will stop.
Yes, we should check it, but the kernel will do that in all cases, so
I just didn't repeat the check explicitly in userspace.

I was planning this to be an advanced feature so the user of this
parameter should know the rules to pass values in.

Of course we can introduce another global knob to enable/disable this
feature as a whole, then I can offer a default value for the size
parameter.  I just didn't bother that in this RFC version, but I can
switch to that if that's preferred.

[...]

> > @@ -3065,6 +3123,12 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
> >          NULL, NULL, &error_abort);
> >      object_class_property_set_description(oc, "kvm-shadow-mem",
> >          "KVM shadow MMU size", &error_abort);
> > +
> > +    object_class_property_add(oc, "dirty-ring-size", "int",
> > +        kvm_get_dirty_ring_size, kvm_set_dirty_ring_size,
> > +        NULL, NULL, &error_abort);
> 
> I don't think someone passing in a non-number should cause an abort;
> it should exit, but I don't think it should abort/core.

It won't - &error_abort is for the operation to add the property.  It
should never fail.

Illegal user input will look like this (a graceful exit):

$ ./x86_64-softmmu/qemu-system-x86_64 -accel kvm,dirty-ring-size=a
qemu-system-x86_64: -accel kvm,dirty-ring-size=a: Parameter 'dirty-ring-size' expects int64

> 
> > +    object_class_property_set_description(oc, "dirty-ring-size",
> > +        "KVM dirty ring size (<=0 to disable)", &error_abort);
> >  }
> >  
> >  static const TypeInfo kvm_accel_type = {
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 224a8e8712..140bd38726 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -119,6 +119,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
> >      "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
> >      "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
> >      "                tb-size=n (TCG translation block cache size)\n"
> > +    "                dirty-ring-size=n (KVM dirty ring size in Bytes)\n"
> >      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
> >  STEXI
> >  @item -accel @var{name}[,prop=@var{value}[,...]]
> > @@ -140,6 +141,8 @@ irqchip completely is not recommended except for debugging purposes.
> >  Defines the size of the KVM shadow MMU.
> >  @item tb-size=@var{n}
> >  Controls the size (in MiB) of the TCG translation block cache.
> > +@item dirty-ring-size=@var{n}
> > +Controls the size (in Bytes) of KVM dirty ring (<=0 to disable).
> 
> I don't see the point in allowing < 0 ; I'd ban it.

Sure; I can switch to a uint64.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-03-25 20:14   ` Dr. David Alan Gilbert
@ 2020-03-25 20:48     ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-25 20:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Wed, Mar 25, 2020 at 08:14:03PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Add a parameter for size of dirty ring.  If zero, dirty ring is
> > disabled.  Otherwise dirty ring will be enabled with the per-vcpu size
> > as specified.  If dirty ring cannot be enabled due to unsupported
> > kernel, it'll fall back to dirty logging.  By default, dirty ring is
> > not enabled (dirty-ring-size==0).
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> 
> <snip>
> 
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 224a8e8712..140bd38726 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -119,6 +119,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
> >      "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
> >      "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
> >      "                tb-size=n (TCG translation block cache size)\n"
> > +    "                dirty-ring-size=n (KVM dirty ring size in Bytes)\n"
> >      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
> >  STEXI
> >  @item -accel @var{name}[,prop=@var{value}[,...]]
> > @@ -140,6 +141,8 @@ irqchip completely is not recommended except for debugging purposes.
> >  Defines the size of the KVM shadow MMU.
> >  @item tb-size=@var{n}
> >  Controls the size (in MiB) of the TCG translation block cache.
> > > +@item dirty-ring-size=@var{n}
> > +Controls the size (in Bytes) of KVM dirty ring (<=0 to disable).
> 
> I think that needs to say 'per vcpu'

Right, fixed.

-- 
Peter Xu




* Re: [PATCH RFC 9/9] KVM: Dirty ring support
  2020-03-25 20:41   ` Dr. David Alan Gilbert
@ 2020-03-25 21:32     ` Peter Xu
  2020-03-26 14:14       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2020-03-25 21:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Wed, Mar 25, 2020 at 08:41:44PM +0000, Dr. David Alan Gilbert wrote:

[...]

> > +enum KVMReaperState {
> > +    KVM_REAPER_NONE = 0,
> > +    /* The reaper is sleeping */
> > +    KVM_REAPER_WAIT,
> > +    /* The reaper is reaping for dirty pages */
> > +    KVM_REAPER_REAPING,
> > +};
> 
> That probably needs to be KVMDirtyRingReaperState
> given there are many things that could be reaped.

Sure.

> 
> > +/*
> > + * KVM reaper instance, responsible for collecting the KVM dirty bits
> > + * via the dirty ring.
> > + */
> > +struct KVMDirtyRingReaper {
> > +    /* The reaper thread */
> > +    QemuThread reaper_thr;
> > +    /*
> > +     * Telling the reaper thread to wakeup.  This should be used as a
> > +     * generic interface to kick the reaper thread, like, in vcpu
> > +     * threads where it gets an exit due to ring full.
> > +     */
> > +    EventNotifier reaper_event;
> 
> I think I'd just used a simple semaphore for this type of thing.

I'm actually uncertain which is cheaper...

In the meantime, I wanted to poll two handles at the same time below
(in kvm_dirty_ring_reaper_thread).  I don't know how to do that with
a semaphore.  Could it do that?

[...]

> > @@ -412,6 +460,18 @@ int kvm_init_vcpu(CPUState *cpu)
> >              (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
> >      }
> >  
> > +    if (s->kvm_dirty_gfn_count) {
> > +        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
> > +                                   PROT_READ | PROT_WRITE, MAP_SHARED,
> 
> Is the MAP_SHARED required?

Yes, it's required.  It's the same when we map the per-vcpu kvm_run.

If we use MAP_PRIVATE, it'll be in a COW fashion - when userspace
writes to the dirty gfns the first time, it'll copy the current dirty
ring page in the kernel, and from then on QEMU will never be able to
see what the kernel writes to the dirty gfn pages.  MAP_SHARED means
userspace and the kernel share exactly the same page(s).

> 
> > +                                   cpu->kvm_fd,
> > +                                   PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
> > +        if (cpu->kvm_dirty_gfns == MAP_FAILED) {
> > +            ret = -errno;
> > +            DPRINTF("mmap'ing vcpu dirty gfns failed\n");
> 
> Include errno?

Will do.

[...]

> > +static uint64_t kvm_dirty_ring_reap(KVMState *s)
> > +{
> > +    KVMMemoryListener *kml;
> > +    int ret, i, locked_count = s->nr_as;
> > +    CPUState *cpu;
> > +    uint64_t total = 0;
> > +
> > +    /*
> > +     * We need to lock all kvm slots for all address spaces here,
> > +     * because:
> > +     *
> > +     * (1) We need to mark dirty for dirty bitmaps in multiple slots
> > +     *     and for tons of pages, so it's better to take the lock here
> > +     *     once rather than once per page.  And more importantly,
> > +     *
> > +     * (2) We must _NOT_ publish dirty bits to the other threads
> > +     *     (e.g., the migration thread) via the kvm memory slot dirty
> > +     *     bitmaps before correctly re-protect those dirtied pages.
> > +     *     Otherwise we can have potential risk of data corruption if
> > +     *     the page data is read in the other thread before we do
> > +     *     reset below.
> > +     */
> > +    for (i = 0; i < s->nr_as; i++) {
> > +        kml = s->as[i].ml;
> > +        if (!kml) {
> > +            /*
> > +             * This is tricky - we grow s->as[] dynamically now.  Take
> > +             * care of that case.  We also assumed the as[] will fill
> > +             * one by one starting from zero.  Without this, we race
> > +             * with register_smram_listener.
> > +             *
> > +             * TODO: make all these prettier...
> > +             */
> > +            locked_count = i;
> > +            break;
> > +        }
> > +        kvm_slots_lock(kml);
> > +    }
> > +
> > +    CPU_FOREACH(cpu) {
> > +        total += kvm_dirty_ring_reap_one(s, cpu);
> > +    }
> > +
> > +    if (total) {
> > +        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
> > +        assert(ret == total);
> > +    }
> > +
> > +    /* Unlock whatever locks that we have locked */
> > +    for (i = 0; i < locked_count; i++) {
> > +        kvm_slots_unlock(s->as[i].ml);
> > +    }
> > +
> > +    CPU_FOREACH(cpu) {
> > +        if (cpu->kvm_dirty_ring_full) {
> > +            qemu_sem_post(&cpu->kvm_dirty_ring_avail);
> > +        }
> 
> Why do you need to wait until here - couldn't you release
> each vcpu after you've reaped it?

We probably still need to wait.  Even after we have reaped all the
dirty bits, we have only marked the pages as "collected"; the buffers
won't be available again until the kernel re-protects those pages
(when the above KVM_RESET_DIRTY_RINGS completes).  Before that,
continuing the vcpu could let it exit again with the same ring-full
event.

[...]

> > +static int kvm_dirty_ring_reaper_init(KVMState *s)
> > +{
> > +    struct KVMDirtyRingReaper *r = &s->reaper;
> > +    int ret;
> > +
> > +    ret = event_notifier_init(&r->reaper_event, false);
> > +    assert(ret == 0);
> > +    ret = event_notifier_init(&r->reaper_flush_event, false);
> > +    assert(ret == 0);
> > +    qemu_sem_init(&r->reaper_flush_sem, 0);
> > +
> > +    qemu_thread_create(&r->reaper_thr, "kvm-reaper",
> > +                       kvm_dirty_ring_reaper_thread,
> > +                       s, QEMU_THREAD_JOINABLE);
> 
> That's marked as joinable - does it ever get joined?
> If the reaper thread does exit on error (I can only see the poll
> failure?) - what happens elsewhere - will things still try and kick it?

The reaper thread is not designed to fail.  If it does fail, it'll
exit without being joined; the alternative would be to abort() directly
in the thread, which doesn't seem any better...

Regarding "why not join() it": I think it's simply because we don't
have a counterpart operation to AccelClass.init_machine() and
kvm_init(). :-) From that POV, I think I'll still use JOINABLE in
the hope that someday a destroy_machine() hook will be ready and
we'll be able to join it.

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-03-25 20:42     ` Peter Xu
@ 2020-03-26 13:41       ` Dr. David Alan Gilbert
  2020-03-26 16:03         ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-26 13:41 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Mar 25, 2020 at 08:00:31PM +0000, Dr. David Alan Gilbert wrote:
> > > @@ -2077,6 +2079,33 @@ static int kvm_init(MachineState *ms)
> > >      s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
> > >      s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;
> > >  
> > > +    /*
> > > +     * Enable KVM dirty ring if supported, otherwise fall back to
> > > +     * dirty logging mode
> > > +     */
> > > +    if (s->kvm_dirty_ring_size > 0) {
> > > +        /* Read the max supported pages */
> > > +        ret = kvm_vm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING);
> > > +        if (ret > 0) {
> > > +            if (s->kvm_dirty_ring_size > ret) {
> > > +                error_report("KVM dirty ring size %d too big (maximum is %d). "
> > > +                             "Please use a smaller value.",
> > > +                             s->kvm_dirty_ring_size, ret);
> > > +                goto err;
> > > +            }
> > > +
> > > +            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0,
> > > +                                    s->kvm_dirty_ring_size);
> > > +            if (ret) {
> > > +                error_report("Enabling of KVM dirty ring failed: %d", ret);
> > > +                goto err;
> > > +            }
> > > +
> > > +            s->kvm_dirty_gfn_count =
> > > +                s->kvm_dirty_ring_size / sizeof(struct kvm_dirty_gfn);
> > 
> > What happens if I was to pass dirty-ring-size=1 ?
> > Then the count would be 0 and things would get upset somewhere?
> > Do you need to check for a minimum positive value?
> 
> The above kvm_vm_enable_cap() should fail directly and QEMU will stop.
> Yes, we should check it, but the kernel will do that in all cases, so I
> just didn't repeat the check explicitly in userspace.

We probably should have that check since you can give them a more
obvious error message.

> I was planning this to be an advanced feature, so users of this
> parameter should know the rules for passing values in.

Advanced users just make advanced mistakes :-)

I did wonder if perhaps this option should be a count of entries rather
than a byte size.

> Of course we can introduce another global knob to enable/disable this
> feature as a whole; then I can offer a default value for the size
> parameter.  I just didn't bother with that in this RFC version, but I
> can switch to that if that's preferred.
> 
> [...]
> 
> > > @@ -3065,6 +3123,12 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
> > >          NULL, NULL, &error_abort);
> > >      object_class_property_set_description(oc, "kvm-shadow-mem",
> > >          "KVM shadow MMU size", &error_abort);
> > > +
> > > +    object_class_property_add(oc, "dirty-ring-size", "int",
> > > +        kvm_get_dirty_ring_size, kvm_set_dirty_ring_size,
> > > +        NULL, NULL, &error_abort);
> > 
> > I don't think someone passing in a non-number should cause an abort;
> > it should exit, but I don't think it should abort/core.
> 
> It won't - &error_abort is for the operation to add the property.  It
> should never fail.
> 
> Illegal user input will look like this (a graceful exit):
> 
> $ ./x86_64-softmmu/qemu-system-x86_64 -accel kvm,dirty-ring-size=a
> qemu-system-x86_64: -accel kvm,dirty-ring-size=a: Parameter 'dirty-ring-size' expects int64

OK good.

> > 
> > > +    object_class_property_set_description(oc, "dirty-ring-size",
> > > +        "KVM dirty ring size (<=0 to disable)", &error_abort);
> > >  }
> > >  
> > >  static const TypeInfo kvm_accel_type = {
> > > diff --git a/qemu-options.hx b/qemu-options.hx
> > > index 224a8e8712..140bd38726 100644
> > > --- a/qemu-options.hx
> > > +++ b/qemu-options.hx
> > > @@ -119,6 +119,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
> > >      "                kernel-irqchip=on|off|split controls accelerated irqchip support (default=on)\n"
> > >      "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
> > >      "                tb-size=n (TCG translation block cache size)\n"
> > > +    "                dirty-ring-size=n (KVM dirty ring size in Bytes)\n"
> > >      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
> > >  STEXI
> > >  @item -accel @var{name}[,prop=@var{value}[,...]]
> > > @@ -140,6 +141,8 @@ irqchip completely is not recommended except for debugging purposes.
> > >  Defines the size of the KVM shadow MMU.
> > >  @item tb-size=@var{n}
> > >  Controls the size (in MiB) of the TCG translation block cache.
> > > +@item dirty-ring-size=@var{n}
> > > +Controls the size (in Bytes) of KVM dirty ring (<=0 to disable).
> > 
> > I don't see the point in allowing < 0 ; I'd ban it.
> 
> Sure; I can switch to a uint64.

Great.

Dave

> Thanks,
> 
> -- 
> Peter Xu
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 9/9] KVM: Dirty ring support
  2020-03-25 21:32     ` Peter Xu
@ 2020-03-26 14:14       ` Dr. David Alan Gilbert
  2020-03-26 16:10         ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2020-03-26 14:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Mar 25, 2020 at 08:41:44PM +0000, Dr. David Alan Gilbert wrote:
> 
> [...]
> 
> > > +enum KVMReaperState {
> > > +    KVM_REAPER_NONE = 0,
> > > +    /* The reaper is sleeping */
> > > +    KVM_REAPER_WAIT,
> > > +    /* The reaper is reaping for dirty pages */
> > > +    KVM_REAPER_REAPING,
> > > +};
> > 
> > That probably needs to be KVMDirtyRingReaperState
> > given there are many things that could be reaped.
> 
> Sure.
> 
> > 
> > > +/*
> > > + * KVM reaper instance, responsible for collecting the KVM dirty bits
> > > + * via the dirty ring.
> > > + */
> > > +struct KVMDirtyRingReaper {
> > > +    /* The reaper thread */
> > > +    QemuThread reaper_thr;
> > > +    /*
> > > +     * Telling the reaper thread to wakeup.  This should be used as a
> > > +     * generic interface to kick the reaper thread, like, in vcpu
> > > +     * threads where it gets an exit due to ring full.
> > > +     */
> > > +    EventNotifier reaper_event;
> > 
> > I think I'd just used a simple semaphore for this type of thing.
> 
> I'm actually uncertain which is cheaper...
> 
> In the meantime, I wanted to poll two handles at the same time below
> (in kvm_dirty_ring_reaper_thread).  I don't know how to do that with
> a semaphore.  Could it do that?

If you're OK with EventNotifier, stick with it; it's just that I'm used
to doing it with a semaphore, e.g. a flag then the semaphore - but
that's fine.

> [...]
> 
> > > @@ -412,6 +460,18 @@ int kvm_init_vcpu(CPUState *cpu)
> > >              (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
> > >      }
> > >  
> > > +    if (s->kvm_dirty_gfn_count) {
> > > +        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
> > > +                                   PROT_READ | PROT_WRITE, MAP_SHARED,
> > 
> > Is the MAP_SHARED required?
> 
> Yes it's required.  It's the same when we map the per-vcpu kvm_run.
> 
> If we use MAP_PRIVATE, it'll be in a COW fashion - when the userspace
> writes to the dirty gfns the 1st time, it'll copy the current dirty
> ring page in the kernel and from now on QEMU will never be able to see
> what the kernel writes to the dirty gfn pages.  MAP_SHARED means the
> userspace and the kernel shares exactly the same page(s).

OK, worth a comment.

> > 
> > > +                                   cpu->kvm_fd,
> > > +                                   PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
> > > +        if (cpu->kvm_dirty_gfns == MAP_FAILED) {
> > > +            ret = -errno;
> > > +            DPRINTF("mmap'ing vcpu dirty gfns failed\n");
> > 
> > Include errno?
> 
> Will do.
> 
> [...]
> 
> > > +static uint64_t kvm_dirty_ring_reap(KVMState *s)
> > > +{
> > > +    KVMMemoryListener *kml;
> > > +    int ret, i, locked_count = s->nr_as;
> > > +    CPUState *cpu;
> > > +    uint64_t total = 0;
> > > +
> > > +    /*
> > > +     * We need to lock all kvm slots for all address spaces here,
> > > +     * because:
> > > +     *
> > > +     * (1) We need to mark dirty for dirty bitmaps in multiple slots
> > > +     *     and for tons of pages, so it's better to take the lock here
> > > +     *     once rather than once per page.  And more importantly,
> > > +     *
> > > +     * (2) We must _NOT_ publish dirty bits to the other threads
> > > +     *     (e.g., the migration thread) via the kvm memory slot dirty
> > > +     *     bitmaps before correctly re-protect those dirtied pages.
> > > +     *     Otherwise we can have potential risk of data corruption if
> > > +     *     the page data is read in the other thread before we do
> > > +     *     reset below.
> > > +     */
> > > +    for (i = 0; i < s->nr_as; i++) {
> > > +        kml = s->as[i].ml;
> > > +        if (!kml) {
> > > +            /*
> > > +             * This is tricky - we grow s->as[] dynamically now.  Take
> > > +             * care of that case.  We also assumed the as[] will fill
> > > +             * one by one starting from zero.  Without this, we race
> > > +             * with register_smram_listener.
> > > +             *
> > > +             * TODO: make all these prettier...
> > > +             */
> > > +            locked_count = i;
> > > +            break;
> > > +        }
> > > +        kvm_slots_lock(kml);
> > > +    }
> > > +
> > > +    CPU_FOREACH(cpu) {
> > > +        total += kvm_dirty_ring_reap_one(s, cpu);
> > > +    }
> > > +
> > > +    if (total) {
> > > +        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
> > > +        assert(ret == total);
> > > +    }
> > > +
> > > +    /* Unlock whatever locks that we have locked */
> > > +    for (i = 0; i < locked_count; i++) {
> > > +        kvm_slots_unlock(s->as[i].ml);
> > > +    }
> > > +
> > > +    CPU_FOREACH(cpu) {
> > > +        if (cpu->kvm_dirty_ring_full) {
> > > +            qemu_sem_post(&cpu->kvm_dirty_ring_avail);
> > > +        }
> > 
> > Why do you need to wait until here - couldn't you release
> > each vcpu after you've reaped it?
> 
> We probably still need to wait.  Even after we have reaped all the
> dirty bits, we have only marked the pages as "collected"; the buffers
> won't be available again until the kernel re-protects those pages
> (when the above KVM_RESET_DIRTY_RINGS completes).  Before that,
> continuing the vcpu could let it exit again with the same ring-full
> event.

Ah OK.

> [...]
> 
> > > +static int kvm_dirty_ring_reaper_init(KVMState *s)
> > > +{
> > > +    struct KVMDirtyRingReaper *r = &s->reaper;
> > > +    int ret;
> > > +
> > > +    ret = event_notifier_init(&r->reaper_event, false);
> > > +    assert(ret == 0);
> > > +    ret = event_notifier_init(&r->reaper_flush_event, false);
> > > +    assert(ret == 0);
> > > +    qemu_sem_init(&r->reaper_flush_sem, 0);
> > > +
> > > +    qemu_thread_create(&r->reaper_thr, "kvm-reaper",
> > > +                       kvm_dirty_ring_reaper_thread,
> > > +                       s, QEMU_THREAD_JOINABLE);
> > 
> > That's marked as joinable - does it ever get joined?
> > If the reaper thread does exit on error (I can only see the poll
> > failure?) - what happens elsewhere - will things still try and kick it?
> 
> The reaper thread is not designed to fail.  If it does fail, it'll
> exit without being joined; the alternative would be to abort() directly
> in the thread, which doesn't seem any better...
> 
> Regarding "why not join() it": I think it's simply because we don't
> have a counterpart operation to AccelClass.init_machine() and
> kvm_init(). :-) From that POV, I think I'll still use JOINABLE in
> the hope that someday a destroy_machine() hook will be ready and
> we'll be able to join it.

OK, that's fine.

Dave

> Thanks,
> 
> -- 
> Peter Xu
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 8/9] KVM: Add dirty-ring-size property
  2020-03-26 13:41       ` Dr. David Alan Gilbert
@ 2020-03-26 16:03         ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-26 16:03 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Thu, Mar 26, 2020 at 01:41:44PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Mar 25, 2020 at 08:00:31PM +0000, Dr. David Alan Gilbert wrote:
> > > > @@ -2077,6 +2079,33 @@ static int kvm_init(MachineState *ms)
> > > >      s->memory_listener.listener.coalesced_io_add = kvm_coalesce_mmio_region;
> > > >      s->memory_listener.listener.coalesced_io_del = kvm_uncoalesce_mmio_region;
> > > >  
> > > > +    /*
> > > > +     * Enable KVM dirty ring if supported, otherwise fall back to
> > > > +     * dirty logging mode
> > > > +     */
> > > > +    if (s->kvm_dirty_ring_size > 0) {
> > > > +        /* Read the max supported pages */
> > > > +        ret = kvm_vm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING);
> > > > +        if (ret > 0) {
> > > > +            if (s->kvm_dirty_ring_size > ret) {
> > > > +                error_report("KVM dirty ring size %d too big (maximum is %d). "
> > > > +                             "Please use a smaller value.",
> > > > +                             s->kvm_dirty_ring_size, ret);
> > > > +                goto err;
> > > > +            }
> > > > +
> > > > +            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0,
> > > > +                                    s->kvm_dirty_ring_size);
> > > > +            if (ret) {
> > > > +                error_report("Enabling of KVM dirty ring failed: %d", ret);
> > > > +                goto err;
> > > > +            }
> > > > +
> > > > +            s->kvm_dirty_gfn_count =
> > > > +                s->kvm_dirty_ring_size / sizeof(struct kvm_dirty_gfn);
> > > 
> > > What happens if I was to pass dirty-ring-size=1 ?
> > > Then the count would be 0 and things would get upset somewhere?
> > > Do you need to check for a minimum positive value?
> > 
> > The above kvm_vm_enable_cap() should fail directly and QEMU will stop.
> > Yes, we should check it, but the kernel will do that in all cases, so I
> > just didn't repeat the check explicitly in userspace.
> 
> We probably should have that check since you can give them a more
> obvious error message.

Yes, we can.  Or I can enhance the error message when
kvm_vm_enable_cap() fails so the user gets the important hints.  I
think maybe that's more important than the explicit check itself.

> 
> > I was planning this to be an advanced feature, so users of this
> > parameter should know the rules for passing values in.
> 
> Advanced users just make advanced mistakes :-)
> 
> I did wonder if perhaps this option should be a count of entries rather
> than a byte size.

Sometimes it's easier to know "how many bytes we used", while at other
times we want to know "how many dirty addresses we can track".
But sure, I can switch, considering users might be more interested
in the latter. :)

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 9/9] KVM: Dirty ring support
  2020-03-26 14:14       ` Dr. David Alan Gilbert
@ 2020-03-26 16:10         ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2020-03-26 16:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Michael S . Tsirkin

On Thu, Mar 26, 2020 at 02:14:36PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Mar 25, 2020 at 08:41:44PM +0000, Dr. David Alan Gilbert wrote:
> > 
> > [...]
> > 
> > > > +enum KVMReaperState {
> > > > +    KVM_REAPER_NONE = 0,
> > > > +    /* The reaper is sleeping */
> > > > +    KVM_REAPER_WAIT,
> > > > +    /* The reaper is reaping for dirty pages */
> > > > +    KVM_REAPER_REAPING,
> > > > +};
> > > 
> > > That probably needs to be KVMDirtyRingReaperState
> > > given there are many things that could be reaped.
> > 
> > Sure.
> > 
> > > 
> > > > +/*
> > > > + * KVM reaper instance, responsible for collecting the KVM dirty bits
> > > > + * via the dirty ring.
> > > > + */
> > > > +struct KVMDirtyRingReaper {
> > > > +    /* The reaper thread */
> > > > +    QemuThread reaper_thr;
> > > > +    /*
> > > > +     * Telling the reaper thread to wakeup.  This should be used as a
> > > > +     * generic interface to kick the reaper thread, like, in vcpu
> > > > +     * threads where it gets an exit due to ring full.
> > > > +     */
> > > > +    EventNotifier reaper_event;
> > > 
> > > I think I'd just used a simple semaphore for this type of thing.
> > 
> > I'm actually uncertain on which is cheaper...
> > 
> > At the meantime, I wanted to poll two handles at the same time below
> > (in kvm_dirty_ring_reaper_thread).  I don't know how to do that with
> > semaphore.  Could it?
> 
> If you're OK with EventNotifier, stick with it; it's just that I'm used
> to doing it with a semaphore, e.g. a flag then the semaphore - but
> that's fine.

Ah yes, flags could work, though we probably need to be careful with
the flags and use atomic accesses to avoid losing a flag update in a
race.

Then I'll keep it, thanks.

> 
> > [...]
> > 
> > > > @@ -412,6 +460,18 @@ int kvm_init_vcpu(CPUState *cpu)
> > > >              (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
> > > >      }
> > > >  
> > > > +    if (s->kvm_dirty_gfn_count) {
> > > > +        cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
> > > > +                                   PROT_READ | PROT_WRITE, MAP_SHARED,
> > > 
> > > Is the MAP_SHARED required?
> > 
> > Yes it's required.  It's the same when we map the per-vcpu kvm_run.
> > 
> > If we use MAP_PRIVATE, it'll be in a COW fashion - when the userspace
> > writes to the dirty gfns the 1st time, it'll copy the current dirty
> > ring page in the kernel and from now on QEMU will never be able to see
> > what the kernel writes to the dirty gfn pages.  MAP_SHARED means the
> > userspace and the kernel shares exactly the same page(s).
> 
> OK, worth a comment.

Sure.

-- 
Peter Xu




Thread overview: 31+ messages
2020-02-05 14:17 [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) Peter Xu
2020-02-05 14:17 ` [PATCH RFC 1/9] KVM: Fixup kvm_log_clear_one_slot() ioctl return check Peter Xu
2020-03-25 16:34   ` Dr. David Alan Gilbert
2020-03-25 17:43   ` Peter Xu
2020-02-05 14:17 ` [PATCH RFC 2/9] linux-headers: Update Peter Xu
2020-02-05 14:17 ` [PATCH RFC 3/9] memory: Introduce log_sync_global() to memory listener Peter Xu
2020-03-25 16:52   ` Dr. David Alan Gilbert
2020-03-25 17:02     ` Peter Xu
2020-02-05 14:17 ` [PATCH RFC 4/9] KVM: Create the KVMSlot dirty bitmap on flag changes Peter Xu
2020-03-25 17:44   ` Dr. David Alan Gilbert
2020-02-05 14:17 ` [PATCH RFC 5/9] KVM: Provide helper to get kvm dirty log Peter Xu
2020-03-25 17:52   ` Dr. David Alan Gilbert
2020-03-25 18:15     ` Peter Xu
2020-02-05 14:17 ` [PATCH RFC 6/9] KVM: Provide helper to sync dirty bitmap from slot to ramblock Peter Xu
2020-03-25 18:47   ` Dr. David Alan Gilbert
2020-02-05 14:17 ` [PATCH RFC 7/9] KVM: Cache kvm slot dirty bitmap size Peter Xu
2020-03-25 18:49   ` Dr. David Alan Gilbert
2020-02-05 14:17 ` [PATCH RFC 8/9] KVM: Add dirty-ring-size property Peter Xu
2020-03-25 20:00   ` Dr. David Alan Gilbert
2020-03-25 20:42     ` Peter Xu
2020-03-26 13:41       ` Dr. David Alan Gilbert
2020-03-26 16:03         ` Peter Xu
2020-03-25 20:14   ` Dr. David Alan Gilbert
2020-03-25 20:48     ` Peter Xu
2020-02-05 14:17 ` [PATCH RFC 9/9] KVM: Dirty ring support Peter Xu
2020-03-25 20:41   ` Dr. David Alan Gilbert
2020-03-25 21:32     ` Peter Xu
2020-03-26 14:14       ` Dr. David Alan Gilbert
2020-03-26 16:10         ` Peter Xu
2020-02-05 14:51 ` [PATCH RFC 0/9] KVM: Dirty ring support (QEMU part) no-reply
2020-03-03 17:32 ` Peter Xu
