All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/4] Dirty ring and auto converge optimization
@ 2022-03-28  1:32 wucy11
  2022-03-28  1:32 ` [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread wucy11
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: wucy11 @ 2022-03-28  1:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: baiyw2, yuanmh12, tugy, David Hildenbrand, huangy81,
	Juan Quintela, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, f4bug, dengpc12, Paolo Bonzini, wucy11

From: Chongyun Wu <wucy11@chinatelecom.cn>

v2:
-patch 1: remove patch_1

v1:
-rebase to qemu/master

Overview
============
This series of patches optimizes the performance of
live migration using the dirty ring and auto-converge.

The optimization mainly covers the following aspects:
1. Dynamically adjust the rate of the dirty ring reaper
thread to reduce the occurrence of ring-full events,
thereby reducing the impact on guests, improving the
efficiency of dirty page collection, and thus improving
migration efficiency.

2. When collecting dirty pages from KVM, do not call
kvm_cpu_synchronize_kick_all while the CPU rate is
limited; call it only once before suspending the
virtual machine. kvm_cpu_synchronize_kick_all becomes
very time-consuming when the CPU is throttled, and few
dirty pages are generated in that state, so calling it
once before suspending the virtual machine is enough to
guarantee that no dirty pages are lost and that
migration stays efficient.

3. Based on how the dirty ring collects dirty pages,
a new dirty page rate calculation method is proposed
to obtain a more accurate dirty page rate.

4. Use the more accurate dirty page rate to calculate
the throttle needed to complete the migration from the
current system bandwidth and migration parameters,
instead of the current time-consuming trial-and-error
approach, greatly reducing migration time.
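
As a rough sketch of point 4 (using QEMU's migration parameter
names; the exact loop is in patch 4): the chosen throttle is the
smallest percentage P, stepping from cpu_throttle_initial by
cpu_throttle_increment up to max_cpu_throttle, such that

    dirty_page_rate * (1 - P/100) < bandwidth * throttle_trigger_threshold / 100

i.e. the throttled dirty page rate must fall below the share of
migration bandwidth that the trigger threshold allows.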

Testing
=======
    Test environment:
    Host: 64 cpus (Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz),
          512G memory,
          10G NIC
    VM: 2 cpus, 4G memory and 8 cpus, 32G memory
    memory stress: run stress(qemu) in VM to generate memory stress

    Test1: Massive online migration(Run each test item 50 to 200 times)
    Test command: virsh -t migrate $vm --live --p2p --unsafe
    --undefinesource --persistent --auto-converge  --migrateuri
    tcp://${data_ip_remote}
    *********** Use optimized dirty ring ***********
    ring_size  mem_stress VM   average_migration_time(ms)
    4096      1G       2C4G     15888
    4096      3G       2C4G     13320
    65536     1G       2C4G     10036
    65536     3G       2C4G     12132
    4096      4G       8C32G    53629
    4096      8G       8C32G    62474
    4096      30G      8C32G    99025
    65536     4G       8C32G    45563
    65536     8G       8C32G    61114
    65536     30G      8C32G    102087
    *********** Use unoptimized dirty ring ***********
    ring_size  mem_stress VM   average_migration_time(ms)
    4096      1G       2C4G     23992
    4096      3G       2C4G     44234
    65536     1G       2C4G     24546
    65536     3G       2C4G     44939
    4096      4G       8C32G    88441
    4096      8G       8C32G    may not complete
    4096      30G      8C32G    602884
    65536     4G       8C32G    335535
    65536     8G       8C32G    1249232
    65536     30G      8C32G    616939
    *********** Use bitmap dirty tracking  ***********
    ring_size  mem_stress VM   average_migration_time(ms)
    0         1G       2C4G     24597
    0         3G       2C4G     45254
    0         4G       8C32G    103773
    0         8G       8C32G    129626
    0         30G      8C32G    588212

Test1 result:
    Compared with the old bitmap method and the unoptimized dirty ring,
    the optimized dirty ring greatly reduces migration time in the
    collected data. The effect is most pronounced when the virtual
    machine memory is large and the memory pressure is high, where it
    achieves a five to six times migration speedup.

    During the test it was also found that, under certain memory
    pressure, migration with the unoptimized dirty ring could not
    complete for a long time; the optimized dirty ring did not
    encounter this problem.

    Test2: qemu guestperf test
    Test command parameters: --auto-converge --stress-mem XX --downtime 300
    --bandwidth 10000
    *********** Use optimized dirty ring ***********
    ring_size stress VM    Significant_perf  max_memory_update cost_time(s)
                           _drop_duration(s) speed(ms/GB)
    4096       3G    2C4G        5.5           2962             23.5
    65536      3G    2C4G        6             3160             25
    4096       3G    8C32G       13            7921             38
    4096       6G    8C32G       16            11.6K            46
    4096       10G   8C32G       12.1          11.2K            47.6
    4096       20G   8C32G       20            20.2K            71
    4096       30G   8C32G       29.5          29K              94.5
    65536      3G    8C32G       14            8700             40
    65536      6G    8C32G       15            12K              46
    65536      10G   8C32G       11.5          11.1k            47.5
    65536      20G   8C32G       21            20.9K            72
    65536      30G   8C32G       29.5          29.1K            94.5
    *********** Use unoptimized dirty ring ***********
    ring_size stress VM    Significant_perf  max_memory_update cost_time(s)
                           _drop_duration(s) speed(ms/GB)
    4096        3G    2C4G        23            2766            46
    65536       3G    2C4G        22.2          3283            46
    4096        3G    8C32G       62            48.8K           106
    4096        6G    8C32G       68            23.87K          124
    4096        10G   8C32G       91            16.87K          190
    4096        20G   8C32G       152.8         28.65K          336.8
    4096        30G   8C32G       187           41.19K          502
    65536       3G    8C32G       71            12.7K           67
    65536       6G    8C32G       63            12K             46
    65536       10G   8C32G       88            25.3k           120
    65536       20G   8C32G       157.3         25K             391
    65536       30G   8C32G       171           30.8K           487
    *********** Use bitmap dirty tracking  ***********
    ring_size stress VM    Significant_perf  max_memory_update cost_time(s)
                           _drop_duration(s) speed(ms/GB)
    0           3G    2C4G        18             3300            38
    0           3G    8C32G       38             7571            66
    0           6G    8C32G       61.5           10.5K           115.5
    0           10G   8C32G       110            13.68k          180
    0           20G   8C32G       161.6          24.4K           280
    0           30G   8C32G       221.5          28.4K           337.5

Test2 result:
    The above test data shows that guestperf performance with the
    optimized dirty ring during migration is significantly better than
    with the unoptimized dirty ring, and slightly better than with the
    bitmap method.

    With the optimized dirty ring, migration time is greatly reduced,
    and the period of significant memory performance degradation is
    significantly shorter than with the bitmap mode or the unoptimized
    dirty ring mode. The optimized dirty ring therefore better limits
    the impact on guest memory access during migration.

Please review, thanks.

Chongyun Wu (4):
  kvm: Dynamically adjust the rate of dirty ring reaper thread
  kvm: Dirty ring autoconverge optimization for
    kvm_cpu_synchronize_kick_all
  kvm: Introduce a dirty rate calculation method based on dirty ring
  migration: Calculate the appropriate throttle for autoconverge

 accel/kvm/kvm-all.c       | 241 +++++++++++++++++++++++++++++++++++++++++-----
 include/exec/cpu-common.h |   2 +
 include/sysemu/kvm.h      |   2 +
 migration/migration.c     |  12 +++
 migration/migration.h     |   2 +
 migration/ram.c           |  64 +++++++++++-
 softmmu/cpus.c            |  18 ++++
 7 files changed, 312 insertions(+), 29 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread
  2022-03-28  1:32 [PATCH v2 0/4] Dirty ring and auto converge optimization wucy11
@ 2022-03-28  1:32 ` wucy11
  2022-04-02  7:04   ` Hyman Huang
  2022-03-28  1:32 ` [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all wucy11
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: wucy11 @ 2022-03-28  1:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: baiyw2, yuanmh12, tugy, David Hildenbrand, huangy81,
	Juan Quintela, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, f4bug, dengpc12, Paolo Bonzini, wucy11

From: Chongyun Wu <wucy11@chinatelecom.cn>

Dynamically adjust the rate of the dirty ring reaper
thread to reduce the occurrence of ring-full events,
thereby reducing the impact on guests, improving the
efficiency of dirty page collection, and thus improving
migration efficiency.

Implementation:
1) Define different collection speeds for the reaper thread.

2) Divide the total number of dirty pages collected each
time by the ring size to get a ratio that indicates the
occupancy rate of dirty pages in the ring. The higher the
ratio, the higher the possibility that the ring will become
full.

3) Map different ratios to different running speeds.
A higher ratio indicates that a higher running speed is
required to collect dirty pages as soon as possible, so
that ring-full events do not pile up and disturb the
guest's workload.
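
For example, with the thresholds used in this patch: with a ring
size of 4096 and 1500 dirty pages reaped in one pass, the ratio is
1500/4096 ~= 0.37; with the 0.1 adjustment threshold this falls in
the [0.3, 0.4) band, so the reaper wants the FAST3 level and sleeps
125000 us between passes instead of the default 1000000 us.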

This patch can significantly reduce the number of ring full
occurrences in the case of high memory dirty page pressure,
and minimize the impact on guests.

Using this patch for the qemu guestperf test, the memory
performance during the migration process is somewhat improved
compared to the bitmap method, and is significantly improved
compared to the unoptimized dirty ring method. For detailed
test data, please refer to the follow-up series of patches.

Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
---
 accel/kvm/kvm-all.c | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 144 insertions(+), 5 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 27864df..65a4de8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -91,6 +91,27 @@ enum KVMDirtyRingReaperState {
     KVM_DIRTY_RING_REAPER_REAPING,
 };
 
+enum KVMDirtyRingReaperRunLevel {
+    /* The reaper runs at default normal speed */
+    KVM_DIRTY_RING_REAPER_RUN_NORMAL = 0,
+    /* The reaper starts to accelerate in different gears */
+    KVM_DIRTY_RING_REAPER_RUN_FAST1,
+    KVM_DIRTY_RING_REAPER_RUN_FAST2,
+    KVM_DIRTY_RING_REAPER_RUN_FAST3,
+    KVM_DIRTY_RING_REAPER_RUN_FAST4,
+    /* The reaper runs at the fastest speed */
+    KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED,
+};
+
+enum KVMDirtyRingReaperSpeedControl {
+    /* Maintain current speed */
+    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP = 0,
+    /* Accelerate current speed */
+    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP,
+    /* Decrease current speed */
+    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN
+};
+
 /*
  * KVM reaper instance, responsible for collecting the KVM dirty bits
  * via the dirty ring.
@@ -100,6 +121,11 @@ struct KVMDirtyRingReaper {
     QemuThread reaper_thr;
     volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
     volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
+    /* Control the running speed of the reaper thread to fit dirty page rate */
+    enum KVMDirtyRingReaperRunLevel run_level;
+    uint64_t ring_full_cnt;
+    float ratio_adjust_threshold;
+    int stable_count_threshold;
 };
 
 struct KVMState
@@ -1449,11 +1475,115 @@ out:
     kvm_slots_unlock();
 }
 
+static uint64_t calcu_sleep_time(KVMState *s,
+                                       uint64_t dirty_count,
+                                       uint64_t ring_full_cnt_last,
+                                       uint32_t *speed_down_cnt)
+{
+    float ratio = 0.0;
+    uint64_t sleep_time = 1000000;
+    enum KVMDirtyRingReaperRunLevel run_level_want;
+    enum KVMDirtyRingReaperSpeedControl speed_control;
+
+    /*
+     * When the number of dirty pages collected exceeds
+     * the given percentage of the ring size, the speed
+     * up action will be triggered.
+     */
+    s->reaper.ratio_adjust_threshold = 0.1;
+    s->reaper.stable_count_threshold = 5;
+
+    ratio = (float)dirty_count / s->kvm_dirty_ring_size;
+
+    if (s->reaper.ring_full_cnt > ring_full_cnt_last) {
+        /* A new ring-full event occurred; speed up the reaper thread */
+        if (s->reaper.run_level != KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED) {
+            s->reaper.run_level++;
+        }
+    } else {
+        /*
+         * If more dirty pages were reaped in this loop and this
+         * state persists for several iterations, try to speed up
+         * the reaper thread. Once the state is stable, decide
+         * which speed level to use.
+         */
+        if (ratio < s->reaper.ratio_adjust_threshold) {
+            run_level_want = KVM_DIRTY_RING_REAPER_RUN_NORMAL;
+        } else if (ratio < s->reaper.ratio_adjust_threshold * 2) {
+            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST1;
+        } else if (ratio < s->reaper.ratio_adjust_threshold * 3) {
+            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST2;
+        } else if (ratio < s->reaper.ratio_adjust_threshold * 4) {
+            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST3;
+        } else if (ratio < s->reaper.ratio_adjust_threshold * 5) {
+            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST4;
+        } else {
+            run_level_want = KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED;
+        }
+
+        /* Decide whether to speed up or slow down */
+        if (run_level_want > s->reaper.run_level) {
+            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP;
+            *speed_down_cnt = 0;
+        } else if (run_level_want < s->reaper.run_level) {
+            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN;
+            (*speed_down_cnt)++;
+        } else {
+            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP;
+        }
+
+        /* Keep the reaper thread running at a suitable speed level */
+        if (speed_control == KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP) {
+            /* When speeding up, do not wait for stability, just do it */
+            s->reaper.run_level++;
+        } else if (speed_control ==
+            KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN) {
+            /* When slowing down, filter out transient states first */
+            if (*speed_down_cnt > s->reaper.stable_count_threshold) {
+                s->reaper.run_level--;
+            }
+        }
+    }
+
+    /* Set the actual running rate of the reaper */
+    switch (s->reaper.run_level) {
+    case KVM_DIRTY_RING_REAPER_RUN_NORMAL:
+        sleep_time = 1000000;
+        break;
+    case KVM_DIRTY_RING_REAPER_RUN_FAST1:
+        sleep_time = 500000;
+        break;
+    case KVM_DIRTY_RING_REAPER_RUN_FAST2:
+        sleep_time = 250000;
+        break;
+    case KVM_DIRTY_RING_REAPER_RUN_FAST3:
+        sleep_time = 125000;
+        break;
+    case KVM_DIRTY_RING_REAPER_RUN_FAST4:
+        sleep_time = 100000;
+        break;
+    case KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED:
+        sleep_time = 80000;
+        break;
+    default:
+        sleep_time = 1000000;
+        error_report("Bad reaper thread run level, use default");
+    }
+
+    return sleep_time;
+}
+
 static void *kvm_dirty_ring_reaper_thread(void *data)
 {
     KVMState *s = data;
     struct KVMDirtyRingReaper *r = &s->reaper;
 
+    uint64_t count = 0;
+    uint64_t sleep_time = 1000000;
+    uint64_t ring_full_cnt_last = 0;
+    /* Filter speed jitter */
+    uint32_t speed_down_cnt = 0;
+
     rcu_register_thread();
 
     trace_kvm_dirty_ring_reaper("init");
@@ -1461,18 +1591,26 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
     while (true) {
         r->reaper_state = KVM_DIRTY_RING_REAPER_WAIT;
         trace_kvm_dirty_ring_reaper("wait");
-        /*
-         * TODO: provide a smarter timeout rather than a constant?
-         */
-        sleep(1);
+
+        ring_full_cnt_last = s->reaper.ring_full_cnt;
+
+        usleep(sleep_time);
 
         trace_kvm_dirty_ring_reaper("wakeup");
         r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
 
         qemu_mutex_lock_iothread();
-        kvm_dirty_ring_reap(s);
+        count = kvm_dirty_ring_reap(s);
         qemu_mutex_unlock_iothread();
 
+        /*
+         * Calculate the appropriate sleep time according to
+         * the current dirty page rate.
+         */
+        sleep_time = calcu_sleep_time(s, count,
+                                      ring_full_cnt_last,
+                                      &speed_down_cnt);
+
         r->reaper_iteration++;
     }
 
@@ -2958,6 +3096,7 @@ int kvm_cpu_exec(CPUState *cpu)
             trace_kvm_dirty_ring_full(cpu->cpu_index);
             qemu_mutex_lock_iothread();
             kvm_dirty_ring_reap(kvm_state);
+            kvm_state->reaper.ring_full_cnt++;
             qemu_mutex_unlock_iothread();
             ret = 0;
             break;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all
  2022-03-28  1:32 [PATCH v2 0/4] Dirty ring and auto converge optimization wucy11
  2022-03-28  1:32 ` [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread wucy11
@ 2022-03-28  1:32 ` wucy11
  2022-04-02  7:22   ` Hyman Huang
  2022-03-28  1:32 ` [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring wucy11
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: wucy11 @ 2022-03-28  1:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: baiyw2, yuanmh12, tugy, David Hildenbrand, huangy81,
	Juan Quintela, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, f4bug, dengpc12, Paolo Bonzini, wucy11

From: Chongyun Wu <wucy11@chinatelecom.cn>

The dirty ring feature needs to call kvm_cpu_synchronize_kick_all
to flush hardware buffers into KVM slots, but when auto-converge
is running, calls to kvm_cpu_synchronize_kick_all become more
and more time-consuming. This significantly reduces the
efficiency of dirty page queries, especially when memory
pressure is high and the throttle percentage is high.

When the CPU throttle percentage is high and
kvm_cpu_synchronize_kick_all is time-consuming, the rate at which
the VM generates dirty pages is also significantly reduced, so
calling kvm_cpu_synchronize_kick_all repeatedly is unnecessary;
calling it once before stopping the VM is enough. This
significantly improves the efficiency of dirty page queries under
high pressure.

Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
---
 accel/kvm/kvm-all.c       | 25 +++++--------------------
 include/exec/cpu-common.h |  2 ++
 migration/migration.c     | 11 +++++++++++
 softmmu/cpus.c            | 18 ++++++++++++++++++
 4 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 65a4de8..5e02700 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -48,6 +48,8 @@
 
 #include "hw/boards.h"
 
+#include "sysemu/cpu-throttle.h"
+
 /* This check must be after config-host.h is included */
 #ifdef CONFIG_EVENTFD
 #include <sys/eventfd.h>
@@ -839,25 +841,6 @@ static uint64_t kvm_dirty_ring_reap(KVMState *s)
     return total;
 }
 
-static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
-{
-    /* No need to do anything */
-}
-
-/*
- * Kick all vcpus out in a synchronized way.  When returned, we
- * guarantee that every vcpu has been kicked and at least returned to
- * userspace once.
- */
-static void kvm_cpu_synchronize_kick_all(void)
-{
-    CPUState *cpu;
-
-    CPU_FOREACH(cpu) {
-        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
-    }
-}
-
 /*
  * Flush all the existing dirty pages to the KVM slot buffers.  When
  * this call returns, we guarantee that all the touched dirty pages
@@ -879,7 +862,9 @@ static void kvm_dirty_ring_flush(void)
      * First make sure to flush the hardware buffers by kicking all
      * vcpus out in a synchronous way.
      */
-    kvm_cpu_synchronize_kick_all();
+    if (!cpu_throttle_get_percentage()) {
+        qemu_kvm_cpu_synchronize_kick_all();
+    }
     kvm_dirty_ring_reap(kvm_state);
     trace_kvm_dirty_ring_flush(1);
 }
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 50a7d29..13045b3 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -160,4 +160,6 @@ extern int singlestep;
 
 void list_cpus(const char *optarg);
 
+void qemu_kvm_cpu_synchronize_kick_all(void);
+
 #endif /* CPU_COMMON_H */
diff --git a/migration/migration.c b/migration/migration.c
index 695f0f2..ca1db88 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -61,6 +61,7 @@
 #include "sysemu/cpus.h"
 #include "yank_functions.h"
 #include "sysemu/qtest.h"
+#include "sysemu/kvm.h"
 
 #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
 
@@ -3183,6 +3184,16 @@ static void migration_completion(MigrationState *s)
 
         if (!ret) {
             bool inactivate = !migrate_colo_enabled();
+            /*
+             * Before stopping the VM, call qemu_kvm_cpu_synchronize_kick_all
+             * to flush hardware buffers into KVM slots for the dirty ring
+             * optimization, since the call is skipped while the CPU is
+             * throttled to improve efficiency.
+             */
+            if (kvm_dirty_ring_enabled()
+                && cpu_throttle_get_percentage()) {
+                qemu_kvm_cpu_synchronize_kick_all();
+            }
             ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
             trace_migration_completion_vm_stop(ret);
             if (ret >= 0) {
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 7b75bb6..d028d83 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -810,3 +810,21 @@ void qmp_inject_nmi(Error **errp)
     nmi_monitor_handle(monitor_get_cpu_index(monitor_cur()), errp);
 }
 
+static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
+{
+    /* No need to do anything */
+}
+
+/*
+ * Kick all vcpus out in a synchronized way.  When returned, we
+ * guarantee that every vcpu has been kicked and at least returned to
+ * userspace once.
+ */
+void qemu_kvm_cpu_synchronize_kick_all(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
+    }
+}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring
  2022-03-28  1:32 [PATCH v2 0/4] Dirty ring and auto converge optimization wucy11
  2022-03-28  1:32 ` [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread wucy11
  2022-03-28  1:32 ` [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all wucy11
@ 2022-03-28  1:32 ` wucy11
  2022-04-02  7:28   ` Hyman Huang
  2022-03-28  1:32 ` [PATCH v2 4/4] migration: Calculate the appropriate throttle for autoconverge wucy11
  2022-04-01 13:13 ` [PATCH v2 0/4] Dirty ring and auto converge optimization Peter Xu
  4 siblings, 1 reply; 15+ messages in thread
From: wucy11 @ 2022-03-28  1:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: baiyw2, yuanmh12, tugy, David Hildenbrand, huangy81,
	Juan Quintela, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, f4bug, dengpc12, Paolo Bonzini, wucy11

From: Chongyun Wu <wucy11@chinatelecom.cn>

A new structure KVMDirtyRingDirtyCounter is introduced in
KVMDirtyRingReaper to record the number of dirty pages
within a period of time.

When kvm_dirty_ring_mark_page collects dirty pages, it
increments the dirty_pages_period count for every page
that is not already marked dirty.

Dividing the dirty_pages_period count by the interval
gives the dirty page rate for that period.

dirty_pages_period_peak_rate tracks the highest dirty
page rate seen, to cope with the collection rate varying
greatly over time and causing large swings in the
computed dirty page rate.

Extensive testing shows that the dirty rate calculated
after kvm_dirty_ring_flush usually matches the actual
pressure, while the per-second rate may fluctuate in
the following seconds, so the peak dirty rate is
recorded as the real dirty page rate.

This dirty page rate is mainly used as input to the
subsequent auto-converge throttle calculation.
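
As a worked example of the calculation (the numbers are
illustrative): if 80,000 pages accumulate in dirty_pages_period
over a 1,200 ms window, the rate is 80,000 * 1000 / 1200 ~= 66,666
pages/s, about 260 MiB/s with 4 KiB pages; the peak rate keeps the
maximum of these per-window values.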

Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
---
 accel/kvm/kvm-all.c  | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/sysemu/kvm.h |  2 ++
 2 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5e02700..a158b1c 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -114,6 +114,13 @@ enum KVMDirtyRingReaperSpeedControl {
     KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN
 };
 
+struct KVMDirtyRingDirtyCounter {
+    int64_t time_last_count;
+    uint64_t dirty_pages_period;
+    int64_t dirty_pages_rate;
+    int64_t dirty_pages_period_peak_rate;
+};
+
 /*
  * KVM reaper instance, responsible for collecting the KVM dirty bits
  * via the dirty ring.
@@ -128,6 +135,7 @@ struct KVMDirtyRingReaper {
     uint64_t ring_full_cnt;
     float ratio_adjust_threshold;
     int stable_count_threshold;
+    struct KVMDirtyRingDirtyCounter counter; /* Calculate dirty pages rate */
 };
 
 struct KVMState
@@ -739,7 +747,9 @@ static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t as_id,
         return;
     }
 
-    set_bit(offset, mem->dirty_bmap);
+    if (!test_and_set_bit(offset, mem->dirty_bmap)) {
+        s->reaper.counter.dirty_pages_period++;
+    }
 }
 
 static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
@@ -783,6 +793,56 @@ static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
     return count;
 }
 
+int64_t kvm_dirty_ring_get_rate(void)
+{
+    return kvm_state->reaper.counter.dirty_pages_rate;
+}
+
+int64_t kvm_dirty_ring_get_peak_rate(void)
+{
+    return kvm_state->reaper.counter.dirty_pages_period_peak_rate;
+}
+
+static void kvm_dirty_ring_reap_count(KVMState *s)
+{
+    int64_t spend_time = 0;
+    int64_t end_time;
+
+    if (!s->reaper.counter.time_last_count) {
+        s->reaper.counter.time_last_count =
+            qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    }
+
+    end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    spend_time = end_time - s->reaper.counter.time_last_count;
+
+    if (!s->reaper.counter.dirty_pages_period ||
+        !spend_time) {
+        return;
+    }
+
+    /*
+     * More than 1 second = 1000 milliseconds,
+     * or triggered by kvm_log_sync_global, which spends
+     * more than 300 milliseconds.
+     */
+    if (spend_time > 1000) {
+        /* Count the dirty page rate during this period */
+        s->reaper.counter.dirty_pages_rate =
+            s->reaper.counter.dirty_pages_period * 1000 / spend_time;
+        /* Update the peak dirty page rate at this period */
+        if (s->reaper.counter.dirty_pages_rate >
+            s->reaper.counter.dirty_pages_period_peak_rate) {
+            s->reaper.counter.dirty_pages_period_peak_rate =
+                s->reaper.counter.dirty_pages_rate;
+        }
+
+        /* Reset counters */
+        s->reaper.counter.dirty_pages_period = 0;
+        s->reaper.counter.time_last_count = 0;
+    }
+}
+
 /* Must be with slots_lock held */
 static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
 {
@@ -793,6 +853,8 @@ static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
 
     stamp = get_clock();
 
+    kvm_dirty_ring_reap_count(s);
+
     CPU_FOREACH(cpu) {
         total += kvm_dirty_ring_reap_one(s, cpu);
     }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a783c78..05846f9 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -582,4 +582,6 @@ bool kvm_cpu_check_are_resettable(void);
 bool kvm_arch_cpu_check_are_resettable(void);
 
 bool kvm_dirty_ring_enabled(void);
+int64_t kvm_dirty_ring_get_rate(void);
+int64_t kvm_dirty_ring_get_peak_rate(void);
 #endif
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/4] migration: Calculate the appropriate throttle for autoconverge
  2022-03-28  1:32 [PATCH v2 0/4] Dirty ring and auto converge optimization wucy11
                   ` (2 preceding siblings ...)
  2022-03-28  1:32 ` [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring wucy11
@ 2022-03-28  1:32 ` wucy11
  2022-04-01 13:13 ` [PATCH v2 0/4] Dirty ring and auto converge optimization Peter Xu
  4 siblings, 0 replies; 15+ messages in thread
From: wucy11 @ 2022-03-28  1:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: baiyw2, yuanmh12, tugy, David Hildenbrand, huangy81,
	Juan Quintela, Richard Henderson, Dr. David Alan Gilbert,
	Peter Xu, f4bug, dengpc12, Paolo Bonzini, wucy11

From: Chongyun Wu <wucy11@chinatelecom.cn>

The current auto-converge algorithm does not compute the CPU
throttle threshold it needs; instead it limits the CPU through
repeated attempts. It starts from an initial throttle value, and
if the migration cannot complete for two consecutive rounds it
increases the threshold and tries again, until the threshold
reaches 99%. If memory pressure is high or migration bandwidth
is low, the throttle gradually climbs from the initial 20% to
99%, which is a long and time-consuming process.

This optimization calculates a matching throttle threshold from
the current migration bandwidth and the current dirty page rate.
When memory pressure is high, it avoids the unnecessary and
time-consuming process of repeatedly raising the limit, and
significantly improves migration efficiency.
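
For instance, assuming QEMU's default migration parameters
(throttle_trigger_threshold=50, cpu_throttle_initial=20,
cpu_throttle_increment=10): with a bandwidth of 1000 MiB/s the
allowed dirty rate is 500 MiB/s, and an (illustrative) measured
dirty page rate of 2000 MiB/s selects 80%, the first throttle P
with 2000 * (1 - P/100) < 500, consistent with the 2G row below.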

Test results after optimization (VM 8C32G, qemu stress tool):
mem_stress  calculated_auto_converge_throttle  bandwidth(MiB/s)
300M               0                          1000M
400M               0                          1000M
1G                50                          1000M
2G                80                          1000M
3G                90                          1000M
4G                90                          1000M
5G                90                          1000M
6G                99                          1000M
10G               99                          1000M
20G               99                          1000M
30G               99                          1000M

Series optimization summary:
Related Patch Series:
[1]kvm,memory: Optimize dirty page collection for dirty ring
[2]kvm: Dynamically control the load of the reaper thread
[3]kvm: Dirty ring autoconverge optimization for kvm_cpu_
synchronize_kick_all
[4]kvm: Introduce a dirty rate calculation method based on dirty
ring
[5]migration: Calculate the appropriate throttle for autoconverge

Test environment:
Host: 64 cpus (Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz),
      512G memory,
      10G NIC
VM: 2 cpus, 4G memory and 8 cpus, 32G memory
memory stress: run stress(qemu) in VM to generate memory stress

Test1: Massive online migration(Run each test item 50 to 200 times)
Test command: virsh -t migrate $vm --live --p2p --unsafe
--undefinesource --persistent --auto-converge  --migrateuri
tcp://${data_ip_remote}
*********** Use optimized dirty ring ***********
ring_size  mem_stress VM   average_migration_time(ms)
4096      1G       2C4G     15888
4096      3G       2C4G     13320
65536     1G       2C4G     10036
65536     3G       2C4G     12132
4096      4G       8C32G    53629
4096      8G       8C32G    62474
4096      30G      8C32G    99025
65536     4G       8C32G    45563
65536     8G       8C32G    61114
65536     30G      8C32G    102087
*********** Use unoptimized dirty ring ***********
ring_size  mem_stress VM   average_migration_time(ms)
4096      1G       2C4G     23992
4096      3G       2C4G     44234
65536     1G       2C4G     24546
65536     3G       2C4G     44939
4096      4G       8C32G    88441
4096      8G       8C32G    may not complete
4096      30G      8C32G    602884
65536     4G       8C32G    335535
65536     8G       8C32G    1249232
65536     30G      8C32G    616939
*********** Use bitmap dirty tracking  ***********
ring_size  mem_stress VM   average_migration_time(ms)
0         1G       2C4G     24597
0         3G       2C4G     45254
0         4G       8C32G    103773
0         8G       8C32G    129626
0         30G      8C32G    588212
Compared with the old bitmap method and the unoptimized dirty ring,
the optimized dirty ring greatly reduces migration time in the
collected data. The effect is most pronounced when the virtual
machine memory is large and the memory pressure is high, where it
achieves a five to six times migration speedup.

During the test it was also found that, under certain memory
pressure, migration with the unoptimized dirty ring could not
complete for a long time; the optimized dirty ring did not
encounter this problem.

Test2: qemu guestperf test
Test command parameters: --auto-converge --stress-mem XX
--downtime 300 --bandwidth 10000
*********** Use optimized dirty ring ***********
ring_size stress VM    Significant_perf  max_memory_update cost_time(s)
                       _drop_duration(s) speed(ms/GB)
4096       3G    2C4G        5.5           2962             23.5
65536      3G    2C4G        6             3160             25
4096       3G    8C32G       13            7921             38
4096       6G    8C32G       16            11.6K            46
4096       10G   8C32G       12.1          11.2K            47.6
4096       20G   8C32G       20            20.2K            71
4096       30G   8C32G       29.5          29K              94.5
65536      3G    8C32G       14            8700             40
65536      6G    8C32G       15            12K              46
65536      10G   8C32G       11.5          11.1k            47.5
65536      20G   8C32G       21            20.9K            72
65536      30G   8C32G       29.5          29.1K            94.5
*********** Use unoptimized dirty ring ***********
ring_size stress VM    Significant_perf  max_memory_update cost_time(s)
                       _drop_duration(s) speed(ms/GB)
4096        3G    2C4G        23            2766            46
65536       3G    2C4G        22.2          3283            46
4096        3G    8C32G       62            48.8K           106
4096        6G    8C32G       68            23.87K          124
4096        10G   8C32G       91            16.87K          190
4096        20G   8C32G       152.8         28.65K          336.8
4096        30G   8C32G       187           41.19K          502
65536       3G    8C32G       71            12.7K           67
65536       6G    8C32G       63            12K             46
65536       10G   8C32G       88            25.3k           120
65536       20G   8C32G       157.3         25K             391
65536       30G   8C32G       171           30.8K           487
*********** Use bitmap dirty tracking  ***********
ring_size stress VM    Significant_perf  max_memory_update cost_time(s)
                       _drop_duration(s) speed(ms/GB)
0           3G    2C4G        18             3300            38
0           3G    8C32G       38             7571            66
0           6G    8C32G       61.5           10.5K           115.5
0           10G   8C32G       110            13.68k          180
0           20G   8C32G       161.6          24.4K           280
0           30G   8C32G       221.5          28.4K           337.5

The above test data shows that guestperf performance with the
optimized dirty ring during migration is significantly better than
with the unoptimized dirty ring, and slightly better than with the
bitmap method.

With the optimized dirty ring, migration time is greatly reduced,
and the period of significant memory performance degradation is
significantly shorter than with the bitmap mode or the unoptimized
dirty ring mode. The optimized dirty ring therefore better limits
the impact on guest memory access during migration.

Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
---
 accel/kvm/kvm-all.c   |  7 ++++--
 migration/migration.c |  1 +
 migration/migration.h |  2 ++
 migration/ram.c       | 64 ++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a158b1c..57126f1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -800,7 +800,11 @@ int64_t kvm_dirty_ring_get_rate(void)
 
 int64_t kvm_dirty_ring_get_peak_rate(void)
 {
-    return kvm_state->reaper.counter.dirty_pages_period_peak_rate;
+    int64_t rate = kvm_state->reaper.counter.dirty_pages_period_peak_rate;
+    /* Reset the peak rate to calculate a new peak rate after this moment */
+    kvm_state->reaper.counter.dirty_pages_period_peak_rate = 0;
+
+    return rate;
 }
 
 static void kvm_dirty_ring_reap_count(KVMState *s)
@@ -836,7 +840,6 @@ static void kvm_dirty_ring_reap_count(KVMState *s)
             s->reaper.counter.dirty_pages_period_peak_rate =
                 s->reaper.counter.dirty_pages_rate;
         }
-
         /* Reset counters */
         s->reaper.counter.dirty_pages_period = 0;
         s->reaper.counter.time_last_count = 0;
diff --git a/migration/migration.c b/migration/migration.c
index ca1db88..78ecf8c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2070,6 +2070,7 @@ void migrate_init(MigrationState *s)
     s->vm_was_running = false;
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
+    s->have_calculated_throttle_pct = false;
 }
 
 int migrate_add_blocker_internal(Error *reason, Error **errp)
diff --git a/migration/migration.h b/migration/migration.h
index 2de861d..7c525c9 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -333,6 +333,8 @@ struct MigrationState {
      * This save hostname when out-going migration starts
      */
     char *hostname;
+    /* Whether the throttle percentage for migration has been calculated */
+    bool have_calculated_throttle_pct;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/ram.c b/migration/ram.c
index 170e522..21642eb 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -63,6 +63,7 @@
 #include "qemu/userfaultfd.h"
 #endif /* defined(__linux__) */
 
+#include "sysemu/kvm.h"
 /***********************************************************/
 /* ram save/restore */
 
@@ -617,6 +618,64 @@ static size_t save_page_header(RAMState *rs, QEMUFile *f,  RAMBlock *block,
 }
 
 /**
+ * Calculate the matched throttle threshold according to
+ * the current migration bandwidth and the current rate of
+ * dirty pages, to avoid time-consuming and pointless attempts.
+ */
+static int calculate_throttle_pct(void)
+{
+    MigrationState *s = migrate_get_current();
+    uint64_t threshold = s->parameters.throttle_trigger_threshold;
+    uint64_t pct_initial = s->parameters.cpu_throttle_initial;
+    uint64_t pct_increment = s->parameters.cpu_throttle_increment;
+    int pct_max = s->parameters.max_cpu_throttle;
+
+    int matched_pct = 0;
+    float factor1 = 0.0;
+    float factor2 = 0.0;
+    int64_t dirty_pages_rate = 0;
+    double bandwidth_expect = 0.0;
+    double dirty_pages_rate_expect = 0.0;
+    double bandwidth = (s->mbps / 8) * 1024 * 1024;
+
+    if (kvm_dirty_ring_enabled()) {
+        dirty_pages_rate = kvm_dirty_ring_get_peak_rate() *
+            qemu_target_page_size();
+    } else {
+        dirty_pages_rate = ram_counters.dirty_pages_rate *
+            qemu_target_page_size();
+    }
+
+    if (dirty_pages_rate) {
+        factor1 = (float)threshold / 100;
+        bandwidth_expect = bandwidth * factor1;
+
+        for (uint64_t i = pct_initial; i <= pct_max;) {
+            factor2 = 1 - (float)i / 100;
+            dirty_pages_rate_expect = dirty_pages_rate * factor2;
+
+            if (bandwidth_expect > dirty_pages_rate_expect) {
+                matched_pct = i;
+                break;
+            }
+            i += pct_increment;
+        }
+
+        if (!matched_pct) {
+            info_report("No matched throttle pct found, "
+                        "maybe pressure is too high; using max");
+            matched_pct = pct_max;
+        }
+    } else {
+        matched_pct = pct_initial;
+    }
+
+    s->have_calculated_throttle_pct = true;
+
+    return matched_pct;
+}
+
+/**
  * mig_throttle_guest_down: throttle down the guest
  *
  * Reduce amount of guest cpu execution to hopefully slow down memory
@@ -629,7 +688,6 @@ static void mig_throttle_guest_down(uint64_t bytes_dirty_period,
                                     uint64_t bytes_dirty_threshold)
 {
     MigrationState *s = migrate_get_current();
-    uint64_t pct_initial = s->parameters.cpu_throttle_initial;
     uint64_t pct_increment = s->parameters.cpu_throttle_increment;
     bool pct_tailslow = s->parameters.cpu_throttle_tailslow;
     int pct_max = s->parameters.max_cpu_throttle;
@@ -638,8 +696,8 @@ static void mig_throttle_guest_down(uint64_t bytes_dirty_period,
     uint64_t cpu_now, cpu_ideal, throttle_inc;
 
     /* We have not started throttling yet. Let's start it. */
-    if (!cpu_throttle_active()) {
-        cpu_throttle_set(pct_initial);
+    if (!s->have_calculated_throttle_pct) {
+        cpu_throttle_set(MIN(calculate_throttle_pct(), pct_max));
     } else {
         /* Throttling already on, just increase the rate */
         if (!pct_tailslow) {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] Dirty ring and auto converge optimization
  2022-03-28  1:32 [PATCH v2 0/4] Dirty ring and auto converge optimization wucy11
                   ` (3 preceding siblings ...)
  2022-03-28  1:32 ` [PATCH v2 4/4] migration: Calculate the appropriate throttle for autoconverge wucy11
@ 2022-04-01 13:13 ` Peter Xu
  2022-04-02  2:13   ` Chongyun Wu
  4 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2022-04-01 13:13 UTC (permalink / raw)
  To: wucy11
  Cc: yuanmh12, tugy, David Hildenbrand, huangy81, Juan Quintela,
	Richard Henderson, qemu-devel, Dr. David Alan Gilbert, dengpc12,
	Paolo Bonzini, baiyw2, f4bug

Chongyun,

On Mon, Mar 28, 2022 at 09:32:10AM +0800, wucy11@chinatelecom.cn wrote:
> From: Chongyun Wu <wucy11@chinatelecom.cn>
> 
> v2:
> -patch 1: remove patch_1
> 
> v1:
> -rebase to qemu/master
> 
> Overview
> ============
> This series of patches optimizes the performance of
> live migration using the dirty ring and auto-converge.
> 
> The optimization mainly covers the following aspects:
> 1. Dynamically adjust the rate of the dirty ring reaper
> thread to reduce the occurrence of ring-full events,
> thereby reducing the impact on guests, improving the
> efficiency of dirty page collection, and thus improving
> migration efficiency.
> 
> 2. When collecting dirty pages from KVM, do not call
> kvm_cpu_synchronize_kick_all while the CPU rate is
> limited; call it only once before suspending the
> virtual machine. kvm_cpu_synchronize_kick_all becomes
> very time-consuming when the CPU is throttled, and few
> dirty pages are generated in that state, so calling it
> once before suspending the virtual machine is enough to
> guarantee that no dirty pages are lost and that
> migration stays efficient.
> 
> 3. Based on how the dirty ring collects dirty pages,
> a new dirty page rate calculation method is proposed
> to obtain a more accurate dirty page rate.
> 
> 4. Use the more accurate dirty page rate to calculate
> the throttle needed to complete the migration from the
> current system bandwidth and migration parameters,
> instead of the current time-consuming trial-and-error
> approach, greatly reducing migration time.

Thanks for the patches.

I'm curious what's the relationship between this series and Yong's?

If talking about throttling, I do think the old auto-converge was kind of
inefficient compared to the new per-vcpu ways of throttling, at least in
either granularity or read tolerance (e.g., a dirty ring based solution
will not block vcpu readers even if the thread is heavily throttled).

We've got quite a few techniques taking care of migration convergence
issues (didn't mention postcopy yet..), and I'm wondering whether at some
point we should be more focused and make a chosen one better, rather than
building different blocks servicing the same purpose.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] Dirty ring and auto converge optimization
  2022-04-01 13:13 ` [PATCH v2 0/4] Dirty ring and auto converge optimization Peter Xu
@ 2022-04-02  2:13   ` Chongyun Wu
  0 siblings, 0 replies; 15+ messages in thread
From: Chongyun Wu @ 2022-04-02  2:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: yuanmh12, tugy, David Hildenbrand, huangy81, Juan Quintela,
	Richard Henderson, qemu-devel, Dr. David Alan Gilbert, dengpc12,
	Paolo Bonzini, baiyw2, f4bug

Thanks for review.

On 4/1/2022 9:13 PM, Peter Xu wrote:
> Chongyun,
> 
> On Mon, Mar 28, 2022 at 09:32:10AM +0800, wucy11@chinatelecom.cn wrote:
>> From: Chongyun Wu <wucy11@chinatelecom.cn>
>>
>> v2:
>> -patch 1: remove patch_1
>>
>> v1:
>> -rebase to qemu/master
>>
>> Overview
>> ============
>> This series of patches optimizes the performance of
>> live migration using the dirty ring and auto-converge.
>>
>> The optimization mainly covers the following aspects:
>> 1. Dynamically adjust the rate of the dirty ring reaper
>> thread to reduce the occurrence of ring-full events,
>> thereby reducing the impact on guests, improving the
>> efficiency of dirty page collection, and thus improving
>> migration efficiency.
>>
>> 2. When collecting dirty pages from KVM, do not call
>> kvm_cpu_synchronize_kick_all while the CPU rate is
>> limited; call it only once before suspending the
>> virtual machine. kvm_cpu_synchronize_kick_all becomes
>> very time-consuming when the CPU is throttled, and few
>> dirty pages are generated in that state, so calling it
>> once before suspending the virtual machine is enough to
>> guarantee that no dirty pages are lost and that
>> migration stays efficient.
>>
>> 3. Based on how the dirty ring collects dirty pages,
>> a new dirty page rate calculation method is proposed
>> to obtain a more accurate dirty page rate.
>>
>> 4. Use the more accurate dirty page rate to calculate
>> the throttle needed to complete the migration from the
>> current system bandwidth and migration parameters,
>> instead of the current time-consuming trial-and-error
>> approach, greatly reducing migration time.
> 
> Thanks for the patches.
> 
> I'm curious what's the relationship between this series and Yong's?
I personally think it is a complementary relationship. Yong's can throttle
per-vcpu: when memory pressure comes from threads on particular vcpus, the
restrictions on the other vcpus are very small, so the impact on guests
during migration is smaller. The auto-converge optimization in the last two
patches of this series handles scenarios where memory pressure is balanced
across the vcpus. Each has its own advantages, and users can choose the
appropriate mode for their application scenario. The first two patches
target the dirty ring itself, and improve performance for both the
auto-converge and Yong's modes.

> 
> If talking about throttling, I do think the old auto-converge was kind of
> inefficient comparing to the new per-vcpu ways of throttling at least in
> either granularity or on read tolerances (e.g., dirty ring based solution
> will not block vcpu readers even if the thread is heavily throttled).
Yes, I agree with that. Studying the dirty ring and running a lot of tests
surfaced several points that limit the dirty ring's advantages, so these
optimizations were made, and testing and verification show that they are
effective.
In this patch series, only the last two patches optimize auto-converge.
The first two patches apply wherever the dirty ring is used, including
Yong's work, and do not conflict with it. Of these, "kvm: Dynamically
adjust the rate of dirty ring reaper thread" plays to the dirty ring's
strengths: when memory pressure is high, speeding up the rate at which the
reaper thread collects dirty pages effectively avoids frequent ring-full
events, which otherwise cause frequent guest exits and degrade guestperf
results. While the migration thread transfers data, the reaper also
completes the synchronization of most dirty pages, so the migration thread
spends less time synchronizing dirty pages, which further speeds up
migration. These two patches should also improve Yong's test results, and
the two optimization points are distinct.

> We've got quite a few techniques taking care of migration convergence
> issues (didn't mention postcopy yet..), and I'm wondering whether at some
> point we should be more focused and make a chosen one better, rather than
> building different blocks servicing the same purpose.
I'm sorry, maybe I should have split these patch series to avoid
misunderstandings. These patches and Yong's should be complementary, and
two of them can also help Yong get some performance improvements.
> 
> Thanks,
> 

-- 
Best Regard,
Chongyun Wu


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread
  2022-03-28  1:32 ` [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread wucy11
@ 2022-04-02  7:04   ` Hyman Huang
  2022-04-02  7:28     ` Chongyun Wu
  0 siblings, 1 reply; 15+ messages in thread
From: Hyman Huang @ 2022-04-02  7:04 UTC (permalink / raw)
  To: wucy11, qemu-devel
  Cc: tugy, David Hildenbrand, yuanmh12, Juan Quintela,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, f4bug,
	dengpc12, Paolo Bonzini, baiyw2



On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
> From: Chongyun Wu <wucy11@chinatelecom.cn>
> 
> Dynamically adjust the rate of the dirty ring reaper
> thread to reduce the occurrence of ring-full events,
> thereby reducing the impact on guests, improving the
> efficiency of dirty page collection, and thus improving
> migration efficiency.
> 
> Implementation:
> 1) Define different collection speeds for the reaper thread.
> 
> 2) Divide the total number of dirty pages collected each
> time by the ring size to get a ratio that indicates the
> occupancy rate of dirty pages in the ring. The higher the
> ratio, the higher the possibility that the ring will become
> full.
> 
> 3) Map different ratios to different running speeds.
> A higher ratio indicates that a higher running speed is
> required to collect dirty pages as soon as possible, so
> that ring-full events do not pile up and disturb the
> guest's workload.

> 
> This patch can significantly reduce the number of ring full
> occurrences in the case of high memory dirty page pressure,
> and minimize the impact on guests.
> 
Increasing the frequency of reaping the dirty ring obviously reduces
the time guest vcpus are blocked, and consequently improves guest
memory performance.

But this also lets the memory-writing vcpus run longer and dirty more
memory, so migration time may become longer. Maybe we should also
measure migration time and compare it with the traditional algorithm.
> Using this patch for the qemu guestperf test, the memory
> performance during the migration process is somewhat improved
> compared to the bitmap method, and is significantly improved
> compared to the unoptimized dirty ring method. For detailed
> test data, please refer to the follow-up series of patches.
> 
> Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
> ---
>   accel/kvm/kvm-all.c | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 144 insertions(+), 5 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 27864df..65a4de8 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -91,6 +91,27 @@ enum KVMDirtyRingReaperState {
>       KVM_DIRTY_RING_REAPER_REAPING,
>   };
>  
> +enum KVMDirtyRingReaperRunLevel {
> +    /* The reaper runs at default normal speed */
> +    KVM_DIRTY_RING_REAPER_RUN_NORMAL = 0,
> +    /* The reaper starts to accelerate in different gears */
> +    KVM_DIRTY_RING_REAPER_RUN_FAST1,
> +    KVM_DIRTY_RING_REAPER_RUN_FAST2,
> +    KVM_DIRTY_RING_REAPER_RUN_FAST3,
> +    KVM_DIRTY_RING_REAPER_RUN_FAST4,
> +    /* The reaper runs at the fastest speed */
> +    KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED,
> +};
> +
> +enum KVMDirtyRingReaperSpeedControl {
> +    /* Maintain current speed */
> +    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP = 0,
> +    /* Accelerate current speed */
> +    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP,
> +    /* Decrease current speed */
> +    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN
> +};
> +
>   /*
>    * KVM reaper instance, responsible for collecting the KVM dirty bits
>    * via the dirty ring.
> @@ -100,6 +121,11 @@ struct KVMDirtyRingReaper {
>       QemuThread reaper_thr;
>       volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
>       volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
> +    /* Control the running speed of the reaper thread to fit dirty page rate */
> +    enum KVMDirtyRingReaperRunLevel run_level;
> +    uint64_t ring_full_cnt;
> +    float ratio_adjust_threshold;
> +    int stable_count_threshold;
Could you add some comments about the introduced fields?
>   };
>   
>   struct KVMState
> @@ -1449,11 +1475,115 @@ out:
>       kvm_slots_unlock();
>   }
>   
[...]
> +static uint64_t calcu_sleep_time(KVMState *s,
> +                                       uint64_t dirty_count,
> +                                       uint64_t ring_full_cnt_last,
> +                                       uint32_t *speed_down_cnt)
Code isn't aligned
> +{
> +    float ratio = 0.0;
> +    uint64_t sleep_time = 1000000;
> +    enum KVMDirtyRingReaperRunLevel run_level_want;
> +    enum KVMDirtyRingReaperSpeedControl speed_control;
> +
> +    /*
> +     * When the number of dirty pages collected exceeds
> +     * the given percentage of the ring size, the speed
> +     * up action will be triggered.
> +     */
> +    s->reaper.ratio_adjust_threshold = 0.1;
> +    s->reaper.stable_count_threshold = 5;
> +
> +    ratio = (float)dirty_count / s->kvm_dirty_ring_size;
> +
> +    if (s->reaper.ring_full_cnt > ring_full_cnt_last) {
> +        /* A new ring-full event occurred; speed up the reaper thread */
> +        if (s->reaper.run_level != KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED) {
> +            s->reaper.run_level++;
> +        }
> +    } else {
> +        /*
> +         * If more dirty pages were reaped in this loop and this
> +         * state persists for several iterations, try to speed up
> +         * the reaper thread. Once the state is stable, decide
> +         * which speed level to use.
> +         */
> +        if (ratio < s->reaper.ratio_adjust_threshold) {
> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_NORMAL;
> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 2) {
> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST1;
> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 3) {
> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST2;
> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 4) {
> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST3;
> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 5) {
> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST4;
> +        } else {
> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED;
> +        }
> +
> +        /* Decide whether to speed up or slow down */
> +        if (run_level_want > s->reaper.run_level) {
> +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP;
> +            *speed_down_cnt = 0;
> +        } else if (run_level_want < s->reaper.run_level) {
> +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN;
> +            (*speed_down_cnt)++;
> +        } else {
> +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP;
> +        }
> +
> +        /* Keep the reaper thread running at a suitable speed level */
> +        if (speed_control == KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP) {
> +            /* When speeding up, do not wait for a stable status, just do it */
> +            s->reaper.run_level++;
> +        } else if (speed_control ==
> +            KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN) {
> +            /* When slowing down, filter out transient status changes */
> +            if (*speed_down_cnt > s->reaper.stable_count_threshold) {
> +                s->reaper.run_level--;
> +            }
> +        }
> +    }
> +
> +    /* Set the actual running rate of the reaper */
> +    switch (s->reaper.run_level) {
> +    case KVM_DIRTY_RING_REAPER_RUN_NORMAL:
> +        sleep_time = 1000000;
> +        break;
> +    case KVM_DIRTY_RING_REAPER_RUN_FAST1:
> +        sleep_time = 500000;
> +        break;
> +    case KVM_DIRTY_RING_REAPER_RUN_FAST2:
> +        sleep_time = 250000;
> +        break;
> +    case KVM_DIRTY_RING_REAPER_RUN_FAST3:
> +        sleep_time = 125000;
> +        break;
> +    case KVM_DIRTY_RING_REAPER_RUN_FAST4:
> +        sleep_time = 100000;
> +        break;
> +    case KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED:
> +        sleep_time = 80000;
> +        break;
> +    default:
> +        sleep_time = 1000000;
> +        error_report("Bad reaper thread run level, use default");
> +    }
> +
> +    return sleep_time;
> +}
> +
I think how to calculate the sleep time needs discussion, including why
we define 5 levels, why we choose the time constants and in what
scenarios this algo works fine.

The other thing is I still think it's nicer to have the simplest
algorithm first, which should be very easy to verify.
>   static void *kvm_dirty_ring_reaper_thread(void *data)
>   {
>       KVMState *s = data;
>       struct KVMDirtyRingReaper *r = &s->reaper;
>   
> +    uint64_t count = 0;
> +    uint64_t sleep_time = 1000000;
> +    uint64_t ring_full_cnt_last = 0;
> +    /* Filter speed jitter */
> +    uint32_t speed_down_cnt = 0;
> +
>       rcu_register_thread();
>   
>       trace_kvm_dirty_ring_reaper("init");
> @@ -1461,18 +1591,26 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
>       while (true) {
>           r->reaper_state = KVM_DIRTY_RING_REAPER_WAIT;
>           trace_kvm_dirty_ring_reaper("wait");
> -        /*
> -         * TODO: provide a smarter timeout rather than a constant?
> -         */
> -        sleep(1);
> +
> +        ring_full_cnt_last = s->reaper.ring_full_cnt;
> +
> +        usleep(sleep_time);
>   
>           trace_kvm_dirty_ring_reaper("wakeup");
>           r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
>   
>           qemu_mutex_lock_iothread();
> -        kvm_dirty_ring_reap(s);
> +        count = kvm_dirty_ring_reap(s);
>           qemu_mutex_unlock_iothread();
>   
> +        /*
> +         * Calculate the appropriate sleep time according to
> +         * the speed of the current dirty page.
> +         */
> +        sleep_time = calcu_sleep_time(s, count,
> +                                      ring_full_cnt_last,
> +                                      &speed_down_cnt);
> +
>           r->reaper_iteration++;
>       }
>   
> @@ -2958,6 +3096,7 @@ int kvm_cpu_exec(CPUState *cpu)
>               trace_kvm_dirty_ring_full(cpu->cpu_index);
>               qemu_mutex_lock_iothread();
>               kvm_dirty_ring_reap(kvm_state);
> +            kvm_state->reaper.ring_full_cnt++;
>               qemu_mutex_unlock_iothread();
>               ret = 0;
>               break;

Thanks.
-- 
Best regards

Hyman Huang(黄勇)



* Re: [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all
  2022-03-28  1:32 ` [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all wucy11
@ 2022-04-02  7:22   ` Hyman Huang
  2022-05-13 16:41     ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Hyman Huang @ 2022-04-02  7:22 UTC (permalink / raw)
  To: wucy11, qemu-devel
  Cc: tugy, David Hildenbrand, yuanmh12, Juan Quintela,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, f4bug,
	dengpc12, Paolo Bonzini, baiyw2



On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
> From: Chongyun Wu <wucy11@chinatelecom.cn>
> 
> The dirty ring feature needs to call kvm_cpu_synchronize_kick_all
> to flush hardware buffers into KVM slots, but when auto-converge
> runs, the kvm_cpu_synchronize_kick_all call becomes more
> and more time-consuming. This will significantly reduce the
> efficiency of dirty page queries, especially when memory
> pressure is high and the speed limit is high.
> 
> When the CPU speed limit is high and kvm_cpu_synchronize_kick_all
> is time-consuming, the rate of dirty pages generated by the VM
> will also be significantly reduced, so it is not necessary to
> call kvm_cpu_synchronize_kick_all at this time; calling it once
> before stopping the VM is enough. This will significantly improve the
> efficiency of dirty page queries under high pressure.
Again, I hope we could consider the migration time instead of just
guest performance.
> 
> Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
> ---
>   accel/kvm/kvm-all.c       | 25 +++++--------------------
>   include/exec/cpu-common.h |  2 ++
>   migration/migration.c     | 11 +++++++++++
>   softmmu/cpus.c            | 18 ++++++++++++++++++
>   4 files changed, 36 insertions(+), 20 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 65a4de8..5e02700 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -48,6 +48,8 @@
>   
>   #include "hw/boards.h"
>   
> +#include "sysemu/cpu-throttle.h"
> +
>   /* This check must be after config-host.h is included */
>   #ifdef CONFIG_EVENTFD
>   #include <sys/eventfd.h>
> @@ -839,25 +841,6 @@ static uint64_t kvm_dirty_ring_reap(KVMState *s)
>       return total;
>   }
>   
> -static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
> -{
> -    /* No need to do anything */
> -}
> -
> -/*
> - * Kick all vcpus out in a synchronized way.  When returned, we
> - * guarantee that every vcpu has been kicked and at least returned to
> - * userspace once.
> - */
> -static void kvm_cpu_synchronize_kick_all(void)
> -{
> -    CPUState *cpu;
> -
> -    CPU_FOREACH(cpu) {
> -        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
> -    }
> -}
> -
>   /*
>    * Flush all the existing dirty pages to the KVM slot buffers.  When
>    * this call returns, we guarantee that all the touched dirty pages
> @@ -879,7 +862,9 @@ static void kvm_dirty_ring_flush(void)
>        * First make sure to flush the hardware buffers by kicking all
>        * vcpus out in a synchronous way.
>        */
> -    kvm_cpu_synchronize_kick_all();
> +    if (!cpu_throttle_get_percentage()) {
> +        qemu_kvm_cpu_synchronize_kick_all();
> +    }
For the code logic itself, kvm_dirty_ring_flush aims to flush all
existing dirty pages, same as I reviewed in v1's first commit
"kvm,memory: Optimize dirty page collection for dirty ring"; we
shouldn't consider the migration logic in this path.
>       kvm_dirty_ring_reap(kvm_state);
>       trace_kvm_dirty_ring_flush(1);
>   }
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 50a7d29..13045b3 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -160,4 +160,6 @@ extern int singlestep;
>   
>   void list_cpus(const char *optarg);
>   
> +void qemu_kvm_cpu_synchronize_kick_all(void);
> +
>   #endif /* CPU_COMMON_H */
> diff --git a/migration/migration.c b/migration/migration.c
> index 695f0f2..ca1db88 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -61,6 +61,7 @@
>   #include "sysemu/cpus.h"
>   #include "yank_functions.h"
>   #include "sysemu/qtest.h"
> +#include "sysemu/kvm.h"
>   
>   #define MAX_THROTTLE  (128 << 20)      /* Migration transfer speed throttling */
>   
> @@ -3183,6 +3184,16 @@ static void migration_completion(MigrationState *s)
>   
>           if (!ret) {
>               bool inactivate = !migrate_colo_enabled();
> +            /*
> +             * Before stopping the VM, call qemu_kvm_cpu_synchronize_kick_all
> +             * to flush hardware buffers into KVM slots for the dirty ring
> +             * optimization, since the call is skipped while the CPU
> +             * throttle is active to improve efficiency.
> +             */
> +            if (kvm_dirty_ring_enabled()
> +                && cpu_throttle_get_percentage()) {
> +                qemu_kvm_cpu_synchronize_kick_all();
> +            }
>               ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>               trace_migration_completion_vm_stop(ret);
>               if (ret >= 0) {
> diff --git a/softmmu/cpus.c b/softmmu/cpus.c
> index 7b75bb6..d028d83 100644
> --- a/softmmu/cpus.c
> +++ b/softmmu/cpus.c
> @@ -810,3 +810,21 @@ void qmp_inject_nmi(Error **errp)
>       nmi_monitor_handle(monitor_get_cpu_index(monitor_cur()), errp);
>   }
>   
[...]
> +static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
> +{
> +    /* No need to do anything */
> +}
> +
> +/*
> + * Kick all vcpus out in a synchronized way.  When returned, we
> + * guarantee that every vcpu has been kicked and at least returned to
> + * userspace once.
> + */
> +void qemu_kvm_cpu_synchronize_kick_all(void)
> +{
> +    CPUState *cpu;
> +
> +    CPU_FOREACH(cpu) {
> +        run_on_cpu(cpu, do_kvm_cpu_synchronize_kick, RUN_ON_CPU_NULL);
> +    }
> +}
For the code style itself, IMHO, exporting helpers should be split into
a standalone commit.
-- 
Best regards

Hyman Huang(黄勇)



* Re: [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread
  2022-04-02  7:04   ` Hyman Huang
@ 2022-04-02  7:28     ` Chongyun Wu
  2022-05-13 16:50       ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Chongyun Wu @ 2022-04-02  7:28 UTC (permalink / raw)
  To: Hyman Huang, qemu-devel
  Cc: tugy, David Hildenbrand, yuanmh12, Juan Quintela,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, f4bug,
	dengpc12, Paolo Bonzini, baiyw2

on 4/2/2022 3:04 PM, Hyman Huang wrote:
> 
> 
> On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
>> From: Chongyun Wu <wucy11@chinatelecom.cn>
>>
>> Dynamically adjust the dirty ring collection thread to
>> reduce the occurrence of ring-full events, thereby reducing the
>> impact on the guest, improving the efficiency of dirty
>> page collection, and thus improving migration efficiency.
>>
>> Implementation:
>> 1) Define different collection speeds for the reap thread.
>>
>> 2) Divide the total number of dirty pages collected each
>> time by the ring size to get a ratio which indicates the
>> occupancy rate of dirty pages in the ring. The higher the
>> ratio, the higher the possibility that the ring will be full.
>>
>> 3) Different ratios correspond to different running speeds.
>> A higher ratio value indicates that a higher running speed
>> is required to collect dirty pages as soon as possible, so
>> that not too many ring-full events are generated, which
>> would disturb the guest workload.
> 
>>
>> This patch can significantly reduce the number of ring full
>> occurrences in the case of high memory dirty page pressure,
>> and minimize the impact on guests.
>>
> Increasing the frequency of reaping the dirty ring can obviously reduce
> guest vcpu block time and consequently improve guest memory performance.
> 
> But this also makes the memory-writing vcpus run longer and dirty more
> memory, so the migration time may become longer. Maybe we
> should also focus on the migration time and compare with the traditional algo.

Below are qemu guestperf test results; with this patch the performance is
improved. The max_memory_update_speed(ms/GB) affects the guest's normal
use of memory. The impact of memory efficiency on the guest is sometimes
even greater than the impact on the CPU. Therefore, it is necessary to
reduce the impact on the guest during the migration process.

As for the concern that doing so will increase the migration time, this
has not been observed in actual testing so far. And the effect of this
patch series in reducing migration time is also obvious.

guestperf drop duration(s): Duration of significant drop in guestperf
performance
|---------------+-------+--------+-------+-------------+-------------------|
| dirty         |  ring | memory |  VM   | guestperf   | max_memory_update |
| tracking      |  size | stress |       | drop        | speed(ms/GB)      |
| mode version  |       |        |       | duration(s) |                   |
|---------------+-------+--------+-------+-------------+-------------------|
| bitmap mode   |     0 |     3G |  2C4G |          18 |              3300 |
| dirty ring    |  4096 |     3G |  2C4G |          23 |              2766 |
| **optimized** |  4096 |     3G |  2C4G |         5.5 |              2962 |
| dirty ring    | 65536 |     3G |  2C4G |        22.2 |              3283 |
| **optimized** | 65536 |     3G |  2C4G |           6 |              3160 |
|---------------+-------+--------+-------+-------------+-------------------|
| bitmap mode   |     0 |     3G | 8C32G |          38 |              7571 |
| dirty ring    |  4096 |     3G | 8C32G |          62 |             48.8K |
| **optimized** |  4096 |     3G | 8C32G |          13 |              7921 |
| dirty ring    | 65536 |     3G | 8C32G |          71 |             12.7K |
| **optimized** | 65536 |     3G | 8C32G |          14 |              8700 |
|---------------+-------+--------+-------+-------------+-------------------|
| bitmap mode   |     0 |     6G | 8C32G |        61.5 |             10.5K |
| dirty ring    |  4096 |     6G | 8C32G |          68 |            23.87K |
| **optimized** |  4096 |     6G | 8C32G |          16 |             11.6K |
| dirty ring    | 65536 |     6G | 8C32G |          63 |               12K |
| **optimized** | 65536 |     6G | 8C32G |          15 |               12K |
|---------------+-------+--------+-------+-------------+-------------------|
| bitmap mode   |     0 |    10G | 8C32G |         110 |            13.68K |
| dirty ring    |  4096 |    10G | 8C32G |          91 |            16.87K |
| **optimized** |  4096 |    10G | 8C32G |        12.1 |             11.2K |
| dirty ring    | 65536 |    10G | 8C32G |          88 |             25.3K |
| **optimized** | 65536 |    10G | 8C32G |        11.5 |             11.1K |
|---------------+-------+--------+-------+-------------+-------------------|
| bitmap mode   |     0 |    20G | 8C32G |       161.6 |             24.4K |
| dirty ring    |  4096 |    20G | 8C32G |       152.8 |            28.65K |
| **optimized** |  4096 |    20G | 8C32G |          20 |             20.2K |
| dirty ring    | 65536 |    20G | 8C32G |       157.3 |               25K |
| **optimized** | 65536 |    20G | 8C32G |          21 |             20.9K |
|---------------+-------+--------+-------+-------------+-------------------|
| bitmap mode   |     0 |    30G | 8C32G |       221.5 |             28.4K |
| dirty ring    |  4096 |    30G | 8C32G |         187 |            41.19K |
| **optimized** |  4096 |    30G | 8C32G |        29.5 |               29K |
| dirty ring    | 65536 |    30G | 8C32G |         171 |             30.8K |
| **optimized** | 65536 |    30G | 8C32G |        29.5 |             29.1K |
|---------------+-------+--------+-------+-------------+-------------------|

>> Using this patch for the qemu guestperf test, the memory
>> performance during the migration process is somewhat improved
>> compared to the bitmap method, and is significantly improved
>> compared to the unoptimized dirty ring method. For detailed
>> test data, please refer to the follow-up series of patches.
>>
>> Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
>> ---
>>   accel/kvm/kvm-all.c | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 144 insertions(+), 5 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 27864df..65a4de8 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -91,6 +91,27 @@ enum KVMDirtyRingReaperState {
>>       KVM_DIRTY_RING_REAPER_REAPING,
>>   };
>>
>> +enum KVMDirtyRingReaperRunLevel {
>> +    /* The reaper runs at default normal speed */
>> +    KVM_DIRTY_RING_REAPER_RUN_NORMAL = 0,
>> +    /* The reaper starts to accelerate in different gears */
>> +    KVM_DIRTY_RING_REAPER_RUN_FAST1,
>> +    KVM_DIRTY_RING_REAPER_RUN_FAST2,
>> +    KVM_DIRTY_RING_REAPER_RUN_FAST3,
>> +    KVM_DIRTY_RING_REAPER_RUN_FAST4,
>> +    /* The reaper runs at the fastest speed */
>> +    KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED,
>> +};
>> +
>> +enum KVMDirtyRingReaperSpeedControl {
>> +    /* Maintain current speed */
>> +    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP = 0,
>> +    /* Accelerate current speed */
>> +    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP,
>> +    /* Decrease current speed */
>> +    KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN
>> +};
>> +
>>   /*
>>    * KVM reaper instance, responsible for collecting the KVM dirty bits
>>    * via the dirty ring.
>> @@ -100,6 +121,11 @@ struct KVMDirtyRingReaper {
>>       QemuThread reaper_thr;
>>       volatile uint64_t reaper_iteration; /* iteration number of reaper thr */
>>       volatile enum KVMDirtyRingReaperState reaper_state; /* reap thr state */
>> +    /* Control the running speed of the reaper thread to fit dirty page rate */
>> +    enum KVMDirtyRingReaperRunLevel run_level;
>> +    uint64_t ring_full_cnt;
>> +    float ratio_adjust_threshold;
>> +    int stable_count_threshold;
> Could you add some comments about the introduced field?
Thanks for pointing this out.

>>   };
>>   struct KVMState
>> @@ -1449,11 +1475,115 @@ out:
>>       kvm_slots_unlock();
>>   }
> [...]
>> +static uint64_t calcu_sleep_time(KVMState *s,
>> +                                       uint64_t dirty_count,
>> +                                       uint64_t ring_full_cnt_last,
>> +                                       uint32_t *speed_down_cnt)
> Code isn't aligned
Thanks for pointing this out.

>> +{
>> +    float ratio = 0.0;
>> +    uint64_t sleep_time = 1000000;
>> +    enum KVMDirtyRingReaperRunLevel run_level_want;
>> +    enum KVMDirtyRingReaperSpeedControl speed_control;
>> +
>> +    /*
>> +     * When the number of dirty pages collected exceeds
>> +     * the given percentage of the ring size, the
>> +     * speed-up action will be triggered.
>> +     */
>> +    s->reaper.ratio_adjust_threshold = 0.1;
>> +    s->reaper.stable_count_threshold = 5;
>> +
>> +    ratio = (float)dirty_count / s->kvm_dirty_ring_size;
>> +
>> +    if (s->reaper.ring_full_cnt > ring_full_cnt_last) {
>> +        /* A new ring-full event means the reaper thread must speed up */
>> +        if (s->reaper.run_level != KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED) {
>> +            s->reaper.run_level++;
>> +        }
>> +    } else {
>> +        /*
>> +         * If more dirty pages are collected in this loop and this
>> +         * status continues for several iterations, try to speed up
>> +         * the reaper thread. If the status is stable, decide which
>> +         * speed level to use.
>> +         */
>> +        if (ratio < s->reaper.ratio_adjust_threshold) {
>> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_NORMAL;
>> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 2) {
>> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST1;
>> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 3) {
>> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST2;
>> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 4) {
>> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST3;
>> +        } else if (ratio < s->reaper.ratio_adjust_threshold * 5) {
>> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST4;
>> +        } else {
>> +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED;
>> +        }
>> +
>> +        /* Decide whether to speed up or slow down */
>> +        if (run_level_want > s->reaper.run_level) {
>> +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP;
>> +            *speed_down_cnt = 0;
>> +        } else if (run_level_want < s->reaper.run_level) {
>> +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN;
>> +            (*speed_down_cnt)++;
>> +        } else {
>> +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP;
>> +        }
>> +
>> +        /* Keep the reaper thread running at a suitable speed level */
>> +        if (speed_control == KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP) {
>> +            /* When speeding up, do not wait for a stable status, just do it */
>> +            s->reaper.run_level++;
>> +        } else if (speed_control ==
>> +            KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN) {
>> +            /* When slowing down, filter out transient status changes */
>> +            if (*speed_down_cnt > s->reaper.stable_count_threshold) {
>> +                s->reaper.run_level--;
>> +            }
>> +        }
>> +    }
>> +
>> +    /* Set the actual running rate of the reaper */
>> +    switch (s->reaper.run_level) {
>> +    case KVM_DIRTY_RING_REAPER_RUN_NORMAL:
>> +        sleep_time = 1000000;
>> +        break;
>> +    case KVM_DIRTY_RING_REAPER_RUN_FAST1:
>> +        sleep_time = 500000;
>> +        break;
>> +    case KVM_DIRTY_RING_REAPER_RUN_FAST2:
>> +        sleep_time = 250000;
>> +        break;
>> +    case KVM_DIRTY_RING_REAPER_RUN_FAST3:
>> +        sleep_time = 125000;
>> +        break;
>> +    case KVM_DIRTY_RING_REAPER_RUN_FAST4:
>> +        sleep_time = 100000;
>> +        break;
>> +    case KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED:
>> +        sleep_time = 80000;
>> +        break;
>> +    default:
>> +        sleep_time = 1000000;
>> +        error_report("Bad reaper thread run level, use default");
>> +    }
>> +
>> +    return sleep_time;
>> +}
>> +
> I think how to calculate the sleep time needs discussion, including why
> we define 5 levels, why we choose the time constants and in what
> scenarios this algo works fine.


> 
> The other thing is I still think it's nicer to have the simplest algorithm
> first, which should be very easy to verify.
>>   static void *kvm_dirty_ring_reaper_thread(void *data)
>>   {
>>       KVMState *s = data;
>>       struct KVMDirtyRingReaper *r = &s->reaper;
>> +    uint64_t count = 0;
>> +    uint64_t sleep_time = 1000000;
>> +    uint64_t ring_full_cnt_last = 0;
>> +    /* Filter speed jitter */
>> +    uint32_t speed_down_cnt = 0;
>> +
>>       rcu_register_thread();
>>       trace_kvm_dirty_ring_reaper("init");
>> @@ -1461,18 +1591,26 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
>>       while (true) {
>>           r->reaper_state = KVM_DIRTY_RING_REAPER_WAIT;
>>           trace_kvm_dirty_ring_reaper("wait");
>> -        /*
>> -         * TODO: provide a smarter timeout rather than a constant?
>> -         */
>> -        sleep(1);
>> +
>> +        ring_full_cnt_last = s->reaper.ring_full_cnt;
>> +
>> +        usleep(sleep_time);
>>           trace_kvm_dirty_ring_reaper("wakeup");
>>           r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
>>           qemu_mutex_lock_iothread();
>> -        kvm_dirty_ring_reap(s);
>> +        count = kvm_dirty_ring_reap(s);
>>           qemu_mutex_unlock_iothread();
>> +        /*
>> +         * Calculate the appropriate sleep time according to
>> +         * the speed of the current dirty page.
>> +         */
>> +        sleep_time = calcu_sleep_time(s, count,
>> +                                      ring_full_cnt_last,
>> +                                      &speed_down_cnt);
>> +
>>           r->reaper_iteration++;
>>       }
>> @@ -2958,6 +3096,7 @@ int kvm_cpu_exec(CPUState *cpu)
>>               trace_kvm_dirty_ring_full(cpu->cpu_index);
>>               qemu_mutex_lock_iothread();
>>               kvm_dirty_ring_reap(kvm_state);
>> +            kvm_state->reaper.ring_full_cnt++;
>>               qemu_mutex_unlock_iothread();
>>               ret = 0;
>>               break;
> 
> Thanks.

-- 
Best regards,
Chongyun Wu



* Re: [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring
  2022-03-28  1:32 ` [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring wucy11
@ 2022-04-02  7:28   ` Hyman Huang
  2022-04-02  8:59     ` Chongyun Wu
  0 siblings, 1 reply; 15+ messages in thread
From: Hyman Huang @ 2022-04-02  7:28 UTC (permalink / raw)
  To: wucy11, qemu-devel
  Cc: tugy, David Hildenbrand, yuanmh12, Juan Quintela,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu, f4bug,
	dengpc12, Paolo Bonzini, baiyw2



On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
> From: Chongyun Wu <wucy11@chinatelecom.cn>
> 
> A new structure KVMDirtyRingDirtyCounter is introduced in
> KVMDirtyRingReaper to record the number of dirty pages
> within a period of time.
> 
> When kvm_dirty_ring_mark_page collects dirty pages, if it
> finds that the current dirty pages are not duplicates, it
> increases the dirty_pages_period count.
> 
> Divide the dirty_pages_period count by the interval to get
> the dirty page rate for this period.
> 
> And use dirty_pages_period_peak_rate to count the highest
> dirty page rate, to solve the problem that the dirty page
> collection rate may change greatly during a period of time,
> resulting in a large change in the dirty page rate.
> 
> Through sufficient testing, it is found that the dirty rate
> calculated after kvm_dirty_ring_flush usually matches the actual
> pressure, and the dirty rate counted per second may change in the
> subsequent seconds, so record the peak dirty rate as the real
> dirty pages rate.
> 
> This dirty page rate is mainly used as input to the subsequent
> auto-converge speed limit throttle calculation.
As per Peter's advice, I think the better way is exporting or adapting
the existing implementation of dirty page rate calculation instead of
building different blocks.
> 
> Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
> ---
>   accel/kvm/kvm-all.c  | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>   include/sysemu/kvm.h |  2 ++
>   2 files changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 5e02700..a158b1c 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -114,6 +114,13 @@ enum KVMDirtyRingReaperSpeedControl {
>       KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN
>   };
>   
> +struct KVMDirtyRingDirtyCounter {
> +    int64_t time_last_count;
> +    uint64_t dirty_pages_period;
> +    int64_t dirty_pages_rate;
> +    int64_t dirty_pages_period_peak_rate;
> +};
> +
>   /*
>    * KVM reaper instance, responsible for collecting the KVM dirty bits
>    * via the dirty ring.
> @@ -128,6 +135,7 @@ struct KVMDirtyRingReaper {
>       uint64_t ring_full_cnt;
>       float ratio_adjust_threshold;
>       int stable_count_threshold;
> +    struct KVMDirtyRingDirtyCounter counter; /* Calculate dirty pages rate */
>   };
>   
>   struct KVMState
> @@ -739,7 +747,9 @@ static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t as_id,
>           return;
>       }
>   
> -    set_bit(offset, mem->dirty_bmap);
> +    if (!test_and_set_bit(offset, mem->dirty_bmap)) {
> +        s->reaper.counter.dirty_pages_period++;
> +    }
>   }
>   
>   static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
> @@ -783,6 +793,56 @@ static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
>       return count;
>   }
>   
> +int64_t kvm_dirty_ring_get_rate(void)
> +{
> +    return kvm_state->reaper.counter.dirty_pages_rate;
> +}
> +
> +int64_t kvm_dirty_ring_get_peak_rate(void)
> +{
> +    return kvm_state->reaper.counter.dirty_pages_period_peak_rate;
> +}
> +
> +static void kvm_dirty_ring_reap_count(KVMState *s)
> +{
> +    int64_t spend_time = 0;
> +    int64_t end_time;
> +
> +    if (!s->reaper.counter.time_last_count) {
> +        s->reaper.counter.time_last_count =
> +            qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +    }
> +
> +    end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +    spend_time = end_time - s->reaper.counter.time_last_count;
> +
> +    if (!s->reaper.counter.dirty_pages_period ||
> +        !spend_time) {
> +        return;
> +    }
> +
> +    /*
> +     * More than 1 second = 1000 milliseconds,
> +     * or triggered by kvm_log_sync_global, which can spend
> +     * more than 300 milliseconds.
> +     */
> +    if (spend_time > 1000) {
> +        /* Count the dirty page rate during this period */
> +        s->reaper.counter.dirty_pages_rate =
> +            s->reaper.counter.dirty_pages_period * 1000 / spend_time;
> +        /* Update the peak dirty page rate at this period */
> +        if (s->reaper.counter.dirty_pages_rate >
> +            s->reaper.counter.dirty_pages_period_peak_rate) {
> +            s->reaper.counter.dirty_pages_period_peak_rate =
> +                s->reaper.counter.dirty_pages_rate;
> +        }
> +
> +        /* Reset counters */
> +        s->reaper.counter.dirty_pages_period = 0;
> +        s->reaper.counter.time_last_count = 0;
> +    }
> +}
> +
>   /* Must be with slots_lock held */
>   static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
>   {
> @@ -793,6 +853,8 @@ static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
>   
>       stamp = get_clock();
>   
> +    kvm_dirty_ring_reap_count(s);
> +
>       CPU_FOREACH(cpu) {
>           total += kvm_dirty_ring_reap_one(s, cpu);
>       }
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index a783c78..05846f9 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -582,4 +582,6 @@ bool kvm_cpu_check_are_resettable(void);
>   bool kvm_arch_cpu_check_are_resettable(void);
>   
>   bool kvm_dirty_ring_enabled(void);
> +int64_t kvm_dirty_ring_get_rate(void);
> +int64_t kvm_dirty_ring_get_peak_rate(void);
>   #endif

-- 
Best regards

Hyman Huang(黄勇)



* Re: [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring
  2022-04-02  7:28   ` Hyman Huang
@ 2022-04-02  8:59     ` Chongyun Wu
  2022-05-13 17:25       ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Chongyun Wu @ 2022-04-02  8:59 UTC (permalink / raw)
  To: Hyman Huang, qemu-devel
  Cc: tugy, David Hildenbrand, yuanmh12, Juan Quintela,
	Richard Henderson, Dr. David Alan Gilbert, Peter Xu,
	Philippe Mathieu-Daudé,
	dengpc12, Paolo Bonzini, baiyw2


on 4/2/2022 3:28 PM, Hyman Huang wrote:
> 
> 
> On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
>> From: Chongyun Wu <wucy11@chinatelecom.cn>
>>
>> A new structure KVMDirtyRingDirtyCounter is introduced in
>> KVMDirtyRingReaper to record the number of dirty pages
>> within a period of time.
>>
>> When kvm_dirty_ring_mark_page collects dirty pages, if it
>> finds that the current dirty pages are not duplicates, it
>> increases the dirty_pages_period count.
>>
>> Divide the dirty_pages_period count by the interval to get
>> the dirty page rate for this period.
>>
>> And use dirty_pages_period_peak_rate to count the highest
>> dirty page rate, to solve the problem that the dirty page
>> collection rate may change greatly during a period of time,
>> resulting in a large change in the dirty page rate.
>>
>> Through sufficient testing, it is found that the dirty rate
>> calculated after kvm_dirty_ring_flush usually matches the actual
>> pressure, and the dirty rate counted per second may change in the
>> subsequent seconds, so record the peak dirty rate as the real
>> dirty pages rate.
>>
>> This dirty page rate is mainly used as input to the subsequent
>> auto-converge speed limit throttle calculation.
> As per Peter's advice, I think the better way is exporting or adapting the existing
> implementation of dirty page rate calculation instead of building different blocks.
Yes, that's right. But this case is a little different.
QEMU already has a variety of dirty page rate calculation methods, used
in different scenarios. This method is mainly meant to calculate an
appropriate speed limit threshold during live migration with the dirty
ring. It makes full use of the dirty ring's characteristics: statistics
are gathered while dirty pages are collected, which the old bitmap
method cannot do, and it adds little extra overhead (no new threads to
start, for example). There is also no need to supply parameters such as
a suitable sampling period, which keeps it simple and convenient. In
testing, the actual memory stress applied can be calculated relatively
accurately.
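
To make the intended use concrete, here is a simplified sketch of how such
a rate can be turned into a throttle value (calc_throttle_pct() and its
formula are illustrative only, not the exact patch 4 code):

    /*
     * Illustrative sketch: if the guest dirties dirty_rate_pages
     * pages/sec and migration can send bandwidth_bytes bytes/sec,
     * the vcpus must be throttled until the surviving dirty rate
     * fits the link.
     */
    static int calc_throttle_pct(int64_t dirty_rate_pages,
                                 int64_t bandwidth_bytes)
    {
        int64_t dirty_bytes = dirty_rate_pages * TARGET_PAGE_SIZE;

        if (dirty_bytes <= bandwidth_bytes) {
            return 0;    /* already convergent, no throttle needed */
        }
        /* Suppress the fraction of dirtying the link cannot absorb */
        return MIN(100 - bandwidth_bytes * 100 / dirty_bytes, 99);
    }

kvm_dirty_ring_get_peak_rate() would feed dirty_rate_pages here, which is
why the peak rather than the instantaneous rate is recorded.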

>>
>> Signed-off-by: Chongyun Wu <wucy11@chinatelecom.cn>
>> ---
>>   accel/kvm/kvm-all.c  | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   include/sysemu/kvm.h |  2 ++
>>   2 files changed, 65 insertions(+), 1 deletion(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 5e02700..a158b1c 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -114,6 +114,13 @@ enum KVMDirtyRingReaperSpeedControl {
>>       KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN
>>   };
>> +struct KVMDirtyRingDirtyCounter {
>> +    int64_t time_last_count;
>> +    uint64_t dirty_pages_period;
>> +    int64_t dirty_pages_rate;
>> +    int64_t dirty_pages_period_peak_rate;
>> +};
>> +
>>   /*
>>    * KVM reaper instance, responsible for collecting the KVM dirty bits
>>    * via the dirty ring.
>> @@ -128,6 +135,7 @@ struct KVMDirtyRingReaper {
>>       uint64_t ring_full_cnt;
>>       float ratio_adjust_threshold;
>>       int stable_count_threshold;
>> +    struct KVMDirtyRingDirtyCounter counter; /* Calculate dirty pages rate */
>>   };
>>   struct KVMState
>> @@ -739,7 +747,9 @@ static void kvm_dirty_ring_mark_page(KVMState *s, uint32_t 
>> as_id,
>>           return;
>>       }
>> -    set_bit(offset, mem->dirty_bmap);
>> +    if (!test_and_set_bit(offset, mem->dirty_bmap)) {
>> +        s->reaper.counter.dirty_pages_period++;
>> +    }
>>   }
>>   static bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
>> @@ -783,6 +793,56 @@ static uint32_t kvm_dirty_ring_reap_one(KVMState *s, 
>> CPUState *cpu)
>>       return count;
>>   }
>> +int64_t kvm_dirty_ring_get_rate(void)
>> +{
>> +    return kvm_state->reaper.counter.dirty_pages_rate;
>> +}
>> +
>> +int64_t kvm_dirty_ring_get_peak_rate(void)
>> +{
>> +    return kvm_state->reaper.counter.dirty_pages_period_peak_rate;
>> +}
>> +
>> +static void kvm_dirty_ring_reap_count(KVMState *s)
>> +{
>> +    int64_t spend_time = 0;
>> +    int64_t end_time;
>> +
>> +    if (!s->reaper.counter.time_last_count) {
>> +        s->reaper.counter.time_last_count =
>> +            qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> +    }
>> +
>> +    end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> +    spend_time = end_time - s->reaper.counter.time_last_count;
>> +
>> +    if (!s->reaper.counter.dirty_pages_period ||
>> +        !spend_time) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * More than 1 second = 1000 milliseconds,
>> +     * or triggered by kvm_log_sync_global, which can spend
>> +     * more than 300 milliseconds.
>> +     */
>> +    if (spend_time > 1000) {
>> +        /* Count the dirty page rate during this period */
>> +        s->reaper.counter.dirty_pages_rate =
>> +            s->reaper.counter.dirty_pages_period * 1000 / spend_time;
>> +        /* Update the peak dirty page rate at this period */
>> +        if (s->reaper.counter.dirty_pages_rate >
>> +            s->reaper.counter.dirty_pages_period_peak_rate) {
>> +            s->reaper.counter.dirty_pages_period_peak_rate =
>> +                s->reaper.counter.dirty_pages_rate;
>> +        }
>> +
>> +        /* Reset counters */
>> +        s->reaper.counter.dirty_pages_period = 0;
>> +        s->reaper.counter.time_last_count = 0;
>> +    }
>> +}
>> +
>>   /* Must be with slots_lock held */
>>   static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
>>   {
>> @@ -793,6 +853,8 @@ static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
>>       stamp = get_clock();
>> +    kvm_dirty_ring_reap_count(s);
>> +
>>       CPU_FOREACH(cpu) {
>>           total += kvm_dirty_ring_reap_one(s, cpu);
>>       }
>> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
>> index a783c78..05846f9 100644
>> --- a/include/sysemu/kvm.h
>> +++ b/include/sysemu/kvm.h
>> @@ -582,4 +582,6 @@ bool kvm_cpu_check_are_resettable(void);
>>   bool kvm_arch_cpu_check_are_resettable(void);
>>   bool kvm_dirty_ring_enabled(void);
>> +int64_t kvm_dirty_ring_get_rate(void);
>> +int64_t kvm_dirty_ring_get_peak_rate(void);
>>   #endif
> 

-- 
Best regards,
Chongyun Wu



* Re: [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all
  2022-04-02  7:22   ` Hyman Huang
@ 2022-05-13 16:41     ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2022-05-13 16:41 UTC (permalink / raw)
  To: Hyman Huang
  Cc: wucy11, qemu-devel, Paolo Bonzini, Richard Henderson,
	Juan Quintela, Dr. David Alan Gilbert, David Hildenbrand, f4bug,
	tugy, dengpc12, yuanmh12, baiyw2

On Sat, Apr 02, 2022 at 03:22:21PM +0800, Hyman Huang wrote:
> On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
> > @@ -879,7 +862,9 @@ static void kvm_dirty_ring_flush(void)
> >        * First make sure to flush the hardware buffers by kicking all
> >        * vcpus out in a synchronous way.
> >        */
> > -    kvm_cpu_synchronize_kick_all();
> > +    if (!cpu_throttle_get_percentage()) {
> > +        qemu_kvm_cpu_synchronize_kick_all();
> > +    }
> For the code logic itself, kvm_dirty_ring_flush aims to flush all existing
> dirty pages, same as I reviewed in v1's first commit "kvm,memory: Optimize
> dirty page collection for dirty ring"; we shouldn't consider the migration
> logic in this path.

I agree with Yong's analysis.  I'm afraid this patch is breaking the api
here.

Say, memory_region_sync_dirty_bitmap() should be the memory api to flush
all dirty memory; we can't simply ignore the flushing just because it's
slow when we're throttling the cpus.  Things will already start to break,
e.g., the dirty rate calculation based on dirty ring relies on all dirty
pages being accounted after the sync() call, and this patch will make the
reported value smaller than the reality.  It's not a major deal breaker in
dirty rate measurements but it's just one example that we're potentially
breaking the memory api.

AFAIU this is probably another reason why I don't really like the
old auto-converge solution, because it's kind of brutal in this case by
putting the thread into a long sleep.

One fix could be that we do periodically sleep in auto-converge code so
that the vcpu threads can still handle any cpu sync requests, though I
didn't really check the cpu code path.
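
As a sketch of that (whether processing queued work from inside the
throttle work item is re-entrancy safe is exactly the part I didn't
check), the long wait in cpu_throttle_thread() could be sliced:

  /*
   * Instead of one long wait, sleep in small slices so that a pending
   * synchronous run_on_cpu() request (e.g. the dirty ring kick) does
   * not have to wait for the whole throttle window.  The 10ms slice
   * is an arbitrary illustration.
   */
  int64_t remaining_ms = sleeptime_ns / SCALE_MS;

  while (remaining_ms > 0 && !cpu->stop) {
      int64_t slice_ms = MIN(remaining_ms, 10);

      qemu_cond_timedwait_iothread(cpu->halt_cond, slice_ms);
      /* Let queued vcpu work run between slices, if that is safe here */
      process_queued_cpu_work(cpu);
      remaining_ms -= slice_ms;
  }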

It's also racy in that cpu_throttle_get_percentage() is racy itself:
imagine a case where cpu_throttle_get_percentage() always returns >0 _but_
it starts to return 0 when migration_complete() - we'll not call
qemu_kvm_cpu_synchronize_kick_all() even for completion and it can crash
the VM.
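
One way to avoid that particular race (a sketch only, based on the hunk
above) would be to make the completion-path kick unconditional rather
than keyed on the throttle state:

  /* In migration_completion(), before vm_stop_force_state(): */
  if (kvm_dirty_ring_enabled()) {
      /*
       * Always kick, instead of only when cpu_throttle_get_percentage()
       * is non-zero, so a throttle that just dropped to 0 cannot leave
       * the hardware buffers unflushed.
       */
      qemu_kvm_cpu_synchronize_kick_all();
  }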

-- 
Peter Xu




* Re: [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread
  2022-04-02  7:28     ` Chongyun Wu
@ 2022-05-13 16:50       ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2022-05-13 16:50 UTC (permalink / raw)
  To: Chongyun Wu
  Cc: Hyman Huang, qemu-devel, Paolo Bonzini, Richard Henderson,
	Juan Quintela, Dr. David Alan Gilbert, David Hildenbrand, f4bug,
	tugy, dengpc12, yuanmh12, baiyw2

On Sat, Apr 02, 2022 at 03:28:13PM +0800, Chongyun Wu wrote:
> > > +{
> > > +    float ratio = 0.0;
> > > +    uint64_t sleep_time = 1000000;
> > > +    enum KVMDirtyRingReaperRunLevel run_level_want;
> > > +    enum KVMDirtyRingReaperSpeedControl speed_control;
> > > +
> > > +    /*
> > > +     * When the number of dirty pages collected exceeds
> > > +     * the given percentage of the ring size, the
> > > +     * speed-up action will be triggered.
> > > +     */
> > > +    s->reaper.ratio_adjust_threshold = 0.1;
> > > +    s->reaper.stable_count_threshold = 5;
> > > +
> > > +    ratio = (float)dirty_count / s->kvm_dirty_ring_size;
> > > +
> > > +    if (s->reaper.ring_full_cnt > ring_full_cnt_last) {
> > > +        /* A new ring-full event means the reaper thread must speed up */
> > > +        if (s->reaper.run_level != KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED) {
> > > +            s->reaper.run_level++;
> > > +        }
> > > +    } else {
> > > +        /*
> > > +         * If more dirty pages are collected in this loop and this
> > > +         * status continues for several iterations, try to speed up
> > > +         * the reaper thread. If the status is stable, decide which
> > > +         * speed level to use.
> > > +         */
> > > +        if (ratio < s->reaper.ratio_adjust_threshold) {
> > > +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_NORMAL;
> > > +        } else if (ratio < s->reaper.ratio_adjust_threshold * 2) {
> > > +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST1;
> > > +        } else if (ratio < s->reaper.ratio_adjust_threshold * 3) {
> > > +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST2;
> > > +        } else if (ratio < s->reaper.ratio_adjust_threshold * 4) {
> > > +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST3;
> > > +        } else if (ratio < s->reaper.ratio_adjust_threshold * 5) {
> > > +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_FAST4;
> > > +        } else {
> > > +            run_level_want = KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED;
> > > +        }
> > > +
> > > +        /* Decide whether to speed up or slow down */
> > > +        if (run_level_want > s->reaper.run_level) {
> > > +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP;
> > > +            *speed_down_cnt = 0;
> > > +        } else if (run_level_want < s->reaper.run_level) {
> > > +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN;
> > > +            (*speed_down_cnt)++;
> > > +        } else {
> > > +            speed_control = KVM_DIRTY_RING_REAPER_SPEED_CONTROL_KEEP;
> > > +        }
> > > +
> > > +        /* Keep the reaper thread running at a suitable speed level */
> > > +        if (speed_control == KVM_DIRTY_RING_REAPER_SPEED_CONTROL_UP) {
> > > +            /* When speeding up, do not wait for a stable status, just do it */
> > > +            s->reaper.run_level++;
> > > +        } else if (speed_control ==
> > > +            KVM_DIRTY_RING_REAPER_SPEED_CONTROL_DOWN) {
> > > +            /* When slowing down, filter out transient status changes */
> > > +            if (*speed_down_cnt > s->reaper.stable_count_threshold) {
> > > +                s->reaper.run_level--;
> > > +            }
> > > +        }
> > > +    }
> > > +
> > > +    /* Set the actual running rate of the reaper */
> > > +    switch (s->reaper.run_level) {
> > > +    case KVM_DIRTY_RING_REAPER_RUN_NORMAL:
> > > +        sleep_time = 1000000;
> > > +        break;
> > > +    case KVM_DIRTY_RING_REAPER_RUN_FAST1:
> > > +        sleep_time = 500000;
> > > +        break;
> > > +    case KVM_DIRTY_RING_REAPER_RUN_FAST2:
> > > +        sleep_time = 250000;
> > > +        break;
> > > +    case KVM_DIRTY_RING_REAPER_RUN_FAST3:
> > > +        sleep_time = 125000;
> > > +        break;
> > > +    case KVM_DIRTY_RING_REAPER_RUN_FAST4:
> > > +        sleep_time = 100000;
> > > +        break;
> > > +    case KVM_DIRTY_RING_REAPER_RUN_MAX_SPEED:
> > > +        sleep_time = 80000;
> > > +        break;
> > > +    default:
> > > +        sleep_time = 1000000;
> > > +        error_report("Bad reaper thread run level, use default");
> > > +    }
> > > +
> > > +    return sleep_time;
> > > +}
> > > +
> > I think how to calculate the sleep time needs discussion, including why
> > we define 5 levels, why we choose the time constants and in what
> > scenarios this algo works fine.
> 
> 
> > 
> > The other thing is i still think it's nicer we have the simplest
> > algorithm firstly, which should be very easy to verify.

Chongyun,

I saw that you left some empty lines above but didn't really answer this
question.  I am with Yong.. there're just too many magic numbers.

I don't have a strong opinion on dropping all of them, maybe your point is
you did tons of testing and you found these numbers are the best (though I
doubt it.. but if it's true please let me know) but at least I think it can
be put into a structure like:

  struct dirty_ring_interval_table {
    /* This can be 0->5; we really don't need those macros.. */
    int level;
    /* This marks over which dirty ratio we use this level */
    float threshold_ratio;
    /* This is the time to sleep if this level is chosen */
    int sleep_us;
    ...
  };

IIUC the whole logic above has a major point in that we shrink the sleep
time more aggressively than we grow it. I think it can be done much more
simply than this algorithm, e.g., we mark a global threshold of say 0.8
dirty ratio out of the ring size, then:

  (1) if dirty ratio > 0.8 we halve the sleep time (delay /= 2)
  (2) if dirty ratio <= 0.2 we increase the sleep time (delay += 20ms)

We also give a limit to the sleep time, say max 1 sec, min 100ms.  Would that
also work very well already with e.g. 20 LOCs rather than 100+ with lots of
magic numbers and if/else's?
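
For illustration, an untested sketch of what I mean (reaper_adjust_sleep()
is a made-up name, MIN/MAX are the osdep.h macros, and the constants are
just the example numbers above):

  static uint64_t reaper_adjust_sleep(uint64_t sleep_us,
                                      uint64_t dirty_count,
                                      uint64_t ring_size)
  {
      double ratio = (double)dirty_count / ring_size;

      if (ratio > 0.8) {
          /* Ring filling up quickly: shrink the sleep aggressively */
          sleep_us /= 2;
      } else if (ratio <= 0.2) {
          /* Ring mostly idle: grow the sleep gently */
          sleep_us += 20 * 1000;
      }

      /* Clamp the sleep time to [100ms, 1s] */
      return MIN(MAX(sleep_us, 100 * 1000), 1000 * 1000);
  }

The reaper loop would then just call usleep(reaper_adjust_sleep(...)) with
the count returned by kvm_dirty_ring_reap().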

Thanks,

-- 
Peter Xu




* Re: [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring
  2022-04-02  8:59     ` Chongyun Wu
@ 2022-05-13 17:25       ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2022-05-13 17:25 UTC (permalink / raw)
  To: Chongyun Wu
  Cc: Hyman Huang, qemu-devel, Paolo Bonzini, Richard Henderson,
	Juan Quintela, Dr. David Alan Gilbert, David Hildenbrand,
	Philippe Mathieu-Daudé,
	tugy, dengpc12, yuanmh12, baiyw2

On Sat, Apr 02, 2022 at 04:59:24PM +0800, Chongyun Wu wrote:
> 
> on 4/2/2022 3:28 PM, Hyman Huang wrote:
> > 
> > 
> > On 2022/3/28 9:32, wucy11@chinatelecom.cn wrote:
> > > From: Chongyun Wu <wucy11@chinatelecom.cn>
> > > 
> > > A new structure KVMDirtyRingDirtyCounter is introduced in
> > > KVMDirtyRingReaper to record the number of dirty pages
> > > within a period of time.
> > > 
> > > When kvm_dirty_ring_mark_page collects dirty pages, if it
> > > finds that the current dirty pages are not duplicates, it
> > > increases the dirty_pages_period count.
> > > 
> > > Divide the dirty_pages_period count by the interval to get
> > > the dirty page rate for this period.
> > > 
> > > And use dirty_pages_period_peak_rate to count the highest
> > > dirty page rate, to solve the problem that the dirty page
> > > collection rate may change greatly during a period of time,
> > > resulting in a large change in the dirty page rate.
> > > 
> > > Through sufficient testing, it is found that the dirty rate
> > > calculated after kvm_dirty_ring_flush usually matches the actual
> > > pressure, and the dirty rate counted per second may change in the
> > > subsequent seconds, so record the peak dirty rate as the real
> > > dirty pages rate.
> > > 
> > > This dirty page rate is mainly used as input to the subsequent
> > > auto-converge speed limit throttle calculation.
> > As per Peter's advice, I think the better way is exporting or adapting the
> > existing implementation of dirty page rate calculation instead of
> > building different blocks.
> Yes, that's right. But this case is a little different.
> QEMU already has a variety of dirty page rate calculation methods, used
> in different scenarios. This method is mainly meant to calculate an
> appropriate speed limit threshold during live migration with the dirty
> ring. It makes full use of the dirty ring's characteristics: statistics
> are gathered while dirty pages are collected, which the old bitmap
> method cannot do, and it adds little extra overhead (no new threads to
> start, for example). There is also no need to supply parameters such as
> a suitable sampling period, which keeps it simple and convenient. In
> testing, the actual memory stress applied can be calculated relatively
> accurately.

Please see commit 7786ae40ba4e7d5b9; we already have per-vcpu data.
If we want per-VM data we'd want to do that there too, together within
kvm_dirty_ring_reap_one().
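
E.g. something like this (a sketch; cpu->dirty_pages is, if I recall the
field name right, the per-vcpu counter that commit maintains in
kvm_dirty_ring_reap_one(), and the per-VM sum is the new part):

  static uint64_t kvm_total_dirty_pages(void)
  {
      CPUState *cpu;
      uint64_t total = 0;

      /* Sum the per-vcpu counters that are already maintained */
      CPU_FOREACH(cpu) {
          total += cpu->dirty_pages;
      }
      return total;
  }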

Regarding "make best use of dirty ring": it's per-vcpu already and actually
when I thought about optimizing auto-converge why not start using per-vcpu
data to do per-vcpu throttling already?  I'm lost on why it goes back to a
per-vm approach?

I'm also curious how this compares to Yong's dirty limit approach?
Dirty limit allows setting a dirty rate upon all vcpus in one shot.
Design-wise, that does sound superior.. not only because it's
per-vcpu, but also because it's not sleep()ing but trapping only writes in
the vmexit with small intervals.  Would you have time to compare these two
solutions?

I fully trust and appreciate your test results and I believe this performs
better than the old auto-converge; it's just that we can't merge solutions
A & B if they look similar and solve the same problem, even if both work
better. We need to choose one way to go in the end.

Thanks,

-- 
Peter Xu




end of thread

Thread overview: 15+ messages
2022-03-28  1:32 [PATCH v2 0/4] Dirty ring and auto converge optimization wucy11
2022-03-28  1:32 ` [PATCH v2 1/4] kvm: Dynamically adjust the rate of dirty ring reaper thread wucy11
2022-04-02  7:04   ` Hyman Huang
2022-04-02  7:28     ` Chongyun Wu
2022-05-13 16:50       ` Peter Xu
2022-03-28  1:32 ` [PATCH v2 2/4] kvm: Dirty ring autoconverge optimization for kvm_cpu_synchronize_kick_all wucy11
2022-04-02  7:22   ` Hyman Huang
2022-05-13 16:41     ` Peter Xu
2022-03-28  1:32 ` [PATCH v2 3/4] kvm: Introduce a dirty rate calculation method based on dirty ring wucy11
2022-04-02  7:28   ` Hyman Huang
2022-04-02  8:59     ` Chongyun Wu
2022-05-13 17:25       ` Peter Xu
2022-03-28  1:32 ` [PATCH v2 4/4] migration: Calculate the appropriate throttle for autoconverge wucy11
2022-04-01 13:13 ` [PATCH v2 0/4] Dirty ring and auto converge optimization Peter Xu
2022-04-02  2:13   ` Chongyun Wu
