From: huangy81@chinatelecom.cn
To: qemu-devel <qemu-devel@nongnu.org>
Cc: "Peter Xu" <peterx@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Laurent Vivier" <laurent@vivier.eu>,
	"Eric Blake" <eblake@redhat.com>,
	"Juan Quintela" <quintela@redhat.com>,
	"Thomas Huth" <thuth@redhat.com>,
	"Peter Maydell" <peter.maydell@linaro.org>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	"Hyman Huang(黄勇)" <huangy81@chinatelecom.cn>
Subject: [PATCH v2 00/11] migration: introduce dirtylimit capability
Date: Mon, 21 Nov 2022 11:26:32 -0500
Message-ID: <cover.1669047366.git.huangy81@chinatelecom.cn>

From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>

v2:
This version makes the following modifications compared with
version 1:
1. Fix the overflow issue reported by Peter Maydell when computing MB
   (a generic illustration of this class of bug follows the list).
2. Add a parameter check for the HMP "set_vcpu_dirty_limit" command.
3. Fix the race between the dirty ring reaper thread and the
   QEMU main thread.
4. Add migration parameter checks for x-vcpu-dirty-limit-period
   and vcpu-dirty-limit.
5. Forbid the HMP/QMP commands set_vcpu_dirty_limit and
   cancel_vcpu_dirty_limit during dirty-limit live migration, as part
   of implementing the dirty-limit convergence algorithm.
6. Add a capability check to ensure auto-converge and dirty-limit
   are mutually exclusive.
7. Check that the KVM dirty ring size is configured before setting
   the dirty-limit migration parameters.
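
As a side note on item 1, the following minimal C program illustrates
one common cause of this class of overflow; it is a hypothetical
example, not the actual QEMU fix. Converting a page count to MB with a
32-bit intermediate wraps before the division, so the multiplication
needs to be widened to 64 bits first:

/* Hypothetical illustration (not the actual QEMU code): converting a
 * page count to MB with a 32-bit intermediate wraps around, so the
 * multiplication must be widened to 64 bits first. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned int dirty_pages = 2 * 1024 * 1024;   /* 2M pages * 4 KiB = 8 GiB */
    unsigned int page_size   = 4096;

    /* 32-bit product wraps to 0 before the division. */
    uint32_t wrong_mb = dirty_pages * page_size / (1024 * 1024);
    /* Widening first keeps the full byte count: 8192 MB. */
    uint64_t right_mb = (uint64_t)dirty_pages * page_size / (1024 * 1024);

    printf("32-bit: %" PRIu32 " MB, 64-bit: %" PRIu64 " MB\n",
           wrong_mb, right_mb);
    return 0;
}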

More comprehensive testing was done compared with version 1.

The test environment is as follows:
-------------------------------------------------------------
a. Host hardware info:

CPU:
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2

NUMA node0 CPU(s):               0-15,32-47
NUMA node1 CPU(s):               16-31,48-63

Memory:
Hynix 503 GiB

Interface:
Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
Speed: 1000Mb/s

b. Host software info:

OS: ctyunos release 2
Kernel: 4.19.90-2102.2.0.0066.ctl2.x86_64
Libvirt baseline version:  libvirt-6.9.0
Qemu baseline version: qemu-5.0

c. VM scale
CPU: 4
Memory: 4G
-------------------------------------------------------------

All the supplementary test data shown below are based on the above
test environment.

In version 1, we posted UnixBench test data as follows:

$ taskset -c 8-15 ./Run -i 2 -c 8 {unixbench test item}

host cpu: Intel(R) Xeon(R) Platinum 8378A
host interface speed: 1000Mb/s
  |---------------------+--------+------------+---------------|
  | UnixBench test item | Normal | Dirtylimit | Auto-converge |
  |---------------------+--------+------------+---------------|
  | dhry2reg            | 32800  | 32786      | 25292         |
  | whetstone-double    | 10326  | 10315      | 9847          |
  | pipe                | 15442  | 15271      | 14506         |
  | context1            | 7260   | 6235       | 4514          |
  | spawn               | 3663   | 3317       | 3249          |
  | syscall             | 4669   | 4667       | 3841          |
  |---------------------+--------+------------+---------------|

In version 2, we post supplementary test data that does not use
taskset, making the scenario more general:

$ ./Run

per-vcpu data:
  |---------------------+--------+------------+---------------|
  | UnixBench test item | Normal | Dirtylimit | Auto-converge |
  |---------------------+--------+------------+---------------|
  | dhry2reg            | 2991   | 2902       | 1722          |
  | whetstone-double    | 1018   | 1006       | 627           |
  | Execl Throughput    | 955    | 320        | 660           |
  | File Copy - 1       | 2362   | 805        | 1325          |
  | File Copy - 2       | 1500   | 1406       | 643           |  
  | File Copy - 3       | 4778   | 2160       | 1047          | 
  | Pipe Throughput     | 1181   | 1170       | 842           |
  | Context Switching   | 192    | 224        | 198           |
  | Process Creation    | 490    | 145        | 95            |
  | Shell Scripts - 1   | 1284   | 565        | 610           |
  | Shell Scripts - 2   | 2368   | 900        | 1040          |
  | System Call Overhead| 983    | 948        | 698           |
  | Index Score         | 1263   | 815        | 600           |
  |---------------------+--------+------------+---------------|
Note:
  File Copy - 1: File Copy 1024 bufsize 2000 maxblocks
  File Copy - 2: File Copy 256 bufsize 500 maxblocks 
  File Copy - 3: File Copy 4096 bufsize 8000 maxblocks 
  Shell Scripts - 1: Shell Scripts (1 concurrent)
  Shell Scripts - 2: Shell Scripts (8 concurrent)

Based on the above data, we can conclude that dirty-limit improves the
benchmark results in almost every respect; the "System Benchmarks Index
Score" shows roughly a 35% performance improvement over auto-converge
during live migration.

4-vCPU parallel data (the test VM has 4 vCPUs and 4 GiB of memory):
  |---------------------+--------+------------+---------------|
  | UnixBench test item | Normal | Dirtylimit | Auto-converge |
  |---------------------+--------+------------+---------------|
  | dhry2reg            | 7975   | 7146       | 5071          |
  | whetstone-double    | 3982   | 3561       | 2124          |
  | Execl Throughput    | 1882   | 1205       | 768           |
  | File Copy - 1       | 1061   | 865        | 498           |
  | File Copy - 2       | 676    | 491        | 519           |  
  | File Copy - 3       | 2260   | 923        | 1329          | 
  | Pipe Throughput     | 3026   | 3009       | 1616          |
  | Context Switching   | 1219   | 1093       | 695           |
  | Process Creation    | 947    | 307        | 446           |
  | Shell Scripts - 1   | 2469   | 977        | 989           |
  | Shell Scripts - 2   | 2667   | 1275       | 984           |
  | System Call Overhead| 1592   | 1459       | 692           |
  | Index Score         | 1976   | 1294       | 997           |
  |---------------------+--------+------------+---------------|

For the parallel data, the "System Benchmarks Index Score" also shows
roughly a 29% performance improvement.

In version 1, we posted the following migration total time data:

host cpu: Intel(R) Xeon(R) Platinum 8378A
host interface speed: 1000Mb/s
  |-----------------------+----------------+-------------------|
  | dirty memory size(MB) | Dirtylimit(ms) | Auto-converge(ms) |
  |-----------------------+----------------+-------------------|
  | 60                    | 2014           | 2131              |
  | 70                    | 5381           | 12590             |
  | 90                    | 6037           | 33545             |
  | 110                   | 7660           | [*]               |
  |-----------------------+----------------+-------------------|
  [*]: Migration did not converge in this case.

In version 2, we post more comprehensive migration total time test
data.

The workload dirties N MB across 4 vCPUs and sleeps S us every time
1 MB of data has been updated (a minimal sketch of this workload
follows the table below). Each condition is tested twice; the data is
as follows:

  |-----------+--------+--------+----------------+-------------------|
  | ring size | N (MB) | S (us) | Dirtylimit(ms) | Auto-converge(ms) |
  |-----------+--------+--------+----------------+-------------------|
  | 1024      | 1024   | 1000   | 44951          | 191780            |
  | 1024      | 1024   | 1000   | 44546          | 185341            |
  | 1024      | 1024   | 500    | 46505          | 203545            |
  | 1024      | 1024   | 500    | 45469          | 909945            |
  | 1024      | 1024   | 0      | 61858          | [*]               |
  | 1024      | 1024   | 0      | 57922          | [*]               |
  | 1024      | 2048   | 0      | 91982          | [*]               |
  | 1024      | 2048   | 0      | 90388          | [*]               |
  | 2048      | 128    | 10000  | 14511          | 25971             |
  | 2048      | 128    | 10000  | 13472          | 26294             |
  | 2048      | 1024   | 10000  | 44244          | 26294             |
  | 2048      | 1024   | 10000  | 45099          | 157701            |
  | 2048      | 1024   | 500    | 51105          | [*]               |
  | 2048      | 1024   | 500    | 49648          | [*]               |
  | 2048      | 1024   | 0      | 229031         | [*]               |
  | 2048      | 1024   | 0      | 154282         | [*]               |
  |-----------+--------+--------+----------------+-------------------|
  [*]: Migration did not converge in this case.

Note that the larger the ring size, the less sensitively dirty-limit
responds, so the optimal ring size should be chosen based on test data
for VMs of different scales.
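
For reference, the dirtying workload described above can be sketched
roughly as follows. This is a minimal, hypothetical C program, not the
exact tool we used: N MB are split evenly across 4 worker threads, and
each thread sleeps S microseconds after every 1 MB it rewrites:

/* Hypothetical sketch of the dirtying workload (not the exact tool used):
 * 4 threads share an N MB buffer; each thread repeatedly rewrites its
 * quarter of the buffer, sleeping S microseconds after every 1 MB. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

static size_t n_mb = 1024;          /* N: size of the dirty working set in MB */
static useconds_t sleep_us = 1000;  /* S: pause after each 1 MB written */

static void *dirty_worker(void *arg)
{
    char *region = arg;
    size_t per_thread_mb = n_mb / 4;
    unsigned char val = 0;

    for (;;) {
        for (size_t i = 0; i < per_thread_mb; i++) {
            memset(region + i * MB, val++, MB);  /* dirty 1 MB of guest RAM */
            if (sleep_us) {
                usleep(sleep_us);
            }
        }
    }
    return NULL;
}

int main(void)
{
    char *buf = malloc(n_mb * MB);
    pthread_t tid[4];

    for (int i = 0; i < 4; i++) {
        pthread_create(&tid[i], NULL, dirty_worker,
                       buf + i * (n_mb / 4) * MB);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(tid[i], NULL);
    }
    return 0;
}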

We also tested the effect of the "x-vcpu-dirty-limit-period" parameter
on migration total time. Each condition was tested twice; the data is
shown below:

  |-----------+--------+--------+-------------+----------------------|
  | ring size | N (MB) | S (us) | Period (ms) | migration time (ms)  |
  |-----------+--------+--------+-------------+----------------------|
  | 2048      | 1024   | 10000  | 100         | [*]                  |
  | 2048      | 1024   | 10000  | 100         | [*]                  |
  | 2048      | 1024   | 10000  | 300         | 156795               |
  | 2048      | 1024   | 10000  | 300         | 118179               |
  | 2048      | 1024   | 10000  | 500         | 44244                |
  | 2048      | 1024   | 10000  | 500         | 45099                |
  | 2048      | 1024   | 10000  | 700         | 41871                |
  | 2048      | 1024   | 10000  | 700         | 42582                |
  | 2048      | 1024   | 10000  | 1000        | 41430                |
  | 2048      | 1024   | 10000  | 1000        | 40383                |
  | 2048      | 1024   | 10000  | 1500        | 42030                |
  | 2048      | 1024   | 10000  | 1500        | 42598                |
  | 2048      | 1024   | 10000  | 2000        | 41694                |
  | 2048      | 1024   | 10000  | 2000        | 42403                |
  | 2048      | 1024   | 10000  | 3000        | 43538                |
  | 2048      | 1024   | 10000  | 3000        | 43010                |
  |-----------+--------+--------+-------------+----------------------|

The data shows that x-vcpu-dirty-limit-period should be set to about
1000 ms under the above conditions.

Please review; any comments and suggestions are much appreciated. Thanks,

Yong

Hyman Huang (11):
  dirtylimit: Fix overflow when computing MB
  softmmu/dirtylimit: Add parameter check for hmp "set_vcpu_dirty_limit"
  kvm-all: Do not allow reap vcpu dirty ring buffer if not ready
  qapi/migration: Introduce x-vcpu-dirty-limit-period parameter
  qapi/migration: Introduce vcpu-dirty-limit parameters
  migration: Introduce dirty-limit capability
  migration: Implement dirty-limit convergence algo
  migration: Export dirty-limit time info
  tests: Add migration dirty-limit capability test
  tests/migration: Introduce dirty-ring-size option into guestperf
  tests/migration: Introduce dirty-limit into guestperf

 accel/kvm/kvm-all.c                     |  36 ++++++++
 include/sysemu/dirtylimit.h             |   2 +
 migration/migration.c                   |  85 ++++++++++++++++++
 migration/migration.h                   |   1 +
 migration/ram.c                         |  62 ++++++++++---
 migration/trace-events                  |   1 +
 monitor/hmp-cmds.c                      |  26 ++++++
 qapi/migration.json                     |  60 +++++++++++--
 softmmu/dirtylimit.c                    |  75 +++++++++++++++-
 tests/migration/guestperf/comparison.py |  24 +++++
 tests/migration/guestperf/engine.py     |  24 ++++-
 tests/migration/guestperf/hardware.py   |   8 +-
 tests/migration/guestperf/progress.py   |  17 +++-
 tests/migration/guestperf/scenario.py   |  11 ++-
 tests/migration/guestperf/shell.py      |  25 +++++-
 tests/qtest/migration-test.c            | 154 ++++++++++++++++++++++++++++++++
 16 files changed, 577 insertions(+), 34 deletions(-)

-- 
1.8.3.1



Thread overview: 33+ messages
2022-11-21 16:26 huangy81 [this message]
2022-11-21 16:26 ` [PATCH v2 01/11] dirtylimit: Fix overflow when computing MB huangy81
2022-11-29 23:17   ` Peter Xu
2022-12-03  8:56   ` Markus Armbruster
2022-11-21 16:26 ` [PATCH v2 02/11] softmmu/dirtylimit: Add parameter check for hmp "set_vcpu_dirty_limit" huangy81
2022-11-29 23:17   ` Peter Xu
2022-12-03  9:01   ` Markus Armbruster
2022-11-21 16:26 ` [PATCH v2 03/11] kvm-all: Do not allow reap vcpu dirty ring buffer if not ready huangy81
2022-11-29 22:42   ` Peter Xu
2022-11-30  3:11     ` Hyman Huang
2022-11-21 16:26 ` [PATCH v2 04/11] qapi/migration: Introduce x-vcpu-dirty-limit-period parameter huangy81
2022-11-29 22:49   ` Peter Xu
2022-12-03  9:06   ` Markus Armbruster
2022-12-03  9:11   ` Markus Armbruster
2022-11-21 16:26 ` [PATCH v2 05/11] qapi/migration: Introduce vcpu-dirty-limit parameters huangy81
2022-11-29 23:58   ` Peter Xu
2022-12-03  9:13   ` Markus Armbruster
2022-12-03  9:21     ` Markus Armbruster
2022-11-21 16:26 ` [PATCH v2 06/11] migration: Introduce dirty-limit capability huangy81
2022-11-29 23:58   ` Peter Xu
2022-12-03  9:24   ` Markus Armbruster
2022-11-21 16:26 ` [PATCH v2 07/11] migration: Implement dirty-limit convergence algo huangy81
2022-11-29 23:17   ` Peter Xu
2022-12-01  1:13     ` Hyman
2022-11-21 16:26 ` [PATCH v2 08/11] migration: Export dirty-limit time info huangy81
2022-11-30  0:09   ` Peter Xu
2022-12-01  2:09     ` Hyman
2022-12-03  9:14   ` Hyman
2022-12-03  9:42     ` Markus Armbruster
2022-12-03  9:28   ` Markus Armbruster
2022-11-21 16:26 ` [PATCH v2 09/11] tests: Add migration dirty-limit capability test huangy81
2022-11-21 16:26 ` [PATCH v2 10/11] tests/migration: Introduce dirty-ring-size option into guestperf huangy81
2022-11-21 16:26 ` [PATCH v2 11/11] tests/migration: Introduce dirty-limit " huangy81
