* [PATCH v17 0/8] support dirty restraint on vCPU
@ 2022-03-02 17:55 huangy81
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
0 siblings, 1 reply; 22+ messages in thread
From: huangy81 @ 2022-03-02 17:55 UTC (permalink / raw)
To: qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Hyman, Juan Quintela,
Richard Henderson, Markus Armbruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
v17
- rebase on master
- fix qmp-cmd-test
v16
- rebase on master
- drop the unused typedef syntax in [PATCH v15 6/7]
- add the Reviewed-by and Acked-by tags
v15
- rebase on master
- drop the 'init_time_ms' parameter in function vcpu_calculate_dirtyrate
- drop the 'setup' field in dirtylimit_state and call dirtylimit_process
directly, which makes the code cleaner
- clean up code in dirtylimit_adjust_throttle
- fix missing dirtylimit_state_unlock() in dirtylimit_process and
dirtylimit_query_all
- add some comments
Please review. Thanks,
Regards
Yong
v14
- v13 was sent by accident; resend the patchset.
v13
- rebase on master
- pass NULL to kvm_dirty_ring_reap in commit
"refactor per-vcpu dirty ring reaping" to keep the logic unchanged.
In other words, we still try our best to reap as many PFNs as possible
if dirtylimit is not in service.
- move the cpu list gen id changes into a separate patch.
- release the lock before sleep during dirty page rate calculation.
- move the dirty ring size fetch logic into a separate patch.
- drop the DIRTYLIMIT_LINEAR_ADJUSTMENT_WATERMARK macro.
- substitute the bh with a function pointer when implementing dirtylimit.
- merge dirtylimit_start/stop into dirtylimit_change.
- change the "cpu-index" parameter type to "int" for consistency.
- fix some syntax errors in the documentation.
Please review. Thanks,
Yong
v12
- rebase on master
- add a new commit to refactor per-vcpu dirty ring reaping, which resolves
the "vcpu misses the chance to sleep" problem
- remove the dirtylimit_thread and implement the throttle in a bottom half instead.
- let the dirty ring reaper thread keep sleeping while dirtylimit is in service
- introduce cpu_list_generation_id to identify cpu_list changes.
- keep holding the cpu_list_lock during dirty_stat_wait to prevent vcpu plug/unplug
while calculating the dirty page rate
- move the dirtylimit global initializations out of dirtylimit_set_vcpu and do
some code cleanup
- add DIRTYLIMIT_LINEAR_ADJUSTMENT_WATERMARK to avoid oscillation when throttling
- remove the unmatched count field in dirtylimit_state
- add a stub to fix the build on non-x86
- refactor the documentation
Thanks Peter and Markus for reviewing the previous versions, please review.
Thanks,
Yong
v11
- rebase on master
- add a commit "refactor dirty page rate calculation" so that the dirty page rate
limit can reuse the calculation logic.
- handle the cpu hotplug/unplug case in the dirty page rate calculation logic.
- modify the qmp commands according to Markus's advice.
- introduce a standalone file dirtylimit.c to implement the dirty page rate limit
- check if the dirty limit is in service via the dirtylimit_state pointer instead
of a global variable
- introduce dirtylimit_mutex to protect dirtylimit_state
- do some code and docs cleanup
See the commits for more detail. Thanks Markus and Peter very much for the code
review and for the experienced and insightful advice; most modifications are
based on it.
v10:
- rebase on master
- make the following modifications on patch [1/3]:
1. Make the "dirtylimit-calc" thread joinable and join it after quitting.
2. Add a finalize function to free dirtylimit_calc_state.
3. Do some code cleanup.
- make the following modifications on patch [2/3]:
1. Remove the original implementation of the throttle, according to
Peter's advice.
2. Introduce a negative feedback system and implement the throttle
on all vcpus in one thread named "dirtylimit".
3. Simplify the algorithm when calculating throttle_us_per_full:
increase/decrease linearly when there is a wide difference
between the quota and the current dirty page rate, and by
a fixed time slice when the difference is narrow. This makes
the throttle respond faster and reach the quota smoothly.
4. Introduce an unfit_cnt in the algorithm to make sure the throttle
really takes effect.
5. Set the max sleep time to 99 times "ring_full_time_us".
6. Make the "dirtylimit" thread joinable and join it after quitting.
- make the following modifications on patch [3/3]:
1. Remove the unplugged-cpu handling logic.
2. "query-vcpu-dirty-limit" only returns dirty limit information for
vcpus that have dirtylimit enabled.
3. Remove the "dirtylimit_setup" function.
4. Trigger the dirtylimit and initialize the global state only
when someone enables dirtylimit, and finalize it after the last
dirtylimit is canceled.
5. Redefine the qmp commands vcpu-dirty-limit/query-vcpu-dirty-limit:
enabling/disabling dirtylimit uses the single command "vcpu-dirty-limit";
it applies to the specified vcpu only if "cpu-index" is given,
otherwise all vcpus are affected.
6. Redefine the hmp commands vcpu_dirty_limit/info vcpu_dirty_limit
- other points about the code review:
1. "merge the code of calculating the dirty page rate"
I think it may not be suitable to touch 'calc-dirty-rate',
because 'calc-dirty-rate' stops the log sync after calculating
the dirty rate, while 'dirtylimit-calc' does not until the last
dirtylimit is canceled. If we merged GLOBAL_DIRTY_LIMIT into
GLOBAL_DIRTY_DIRTYRATE, the two would interact with each other.
2. The new implementation of the throttle algorithm, inspired by
Peter, responds faster and consumes fewer CPU resources than the
old one; we made impressive progress.
One point worth discussing is that the new throttle logic is
"passive": a vcpu sleeps only after its dirty ring is full,
unlike "auto-converge", which kicks the vcpu in a fixed time
slice. If the vcpu is memory-write intensive and the ring size
is large, it will produce dirty memory while the ring fills and
the throttle does not work as well; in other words, the throttle
depends on the dirty ring size.
I actually tested the new algorithm in two cases:
case 1: dirty-ring-size: 4096, dirtyrate: 1170MB/s
result: minimum quota dirtyrate is 25MB/s or even less
minimum vcpu util is 6%
case 2: dirty-ring-size: 65536, dirtyrate: 1170MB/s
result: minimum quota dirtyrate is 256MB/s
minimum vcpu util is 24%
I post this just for discussion; I think this is not a big deal
because if we set dirty-ring-size to the maximum value (65536),
we assume the server's bandwidth is capable of handling it.
3. I hard-code the acceptable deviation value to 25MB/s, see the
macro DIRTYLIMIT_TOLERANCE_RANGE. I'm struggling to decide
whether to make it configurable.
4. Another point is unplugged-cpu handling: the current algorithm
affects unplugged vcpus. If we set a dirty limit on one, we fork
the two threads "dirtylimit" and "dirtylimit-calc", which do
nothing until the vcpu is hot-plugged, at which point the dirty
limit works. I think the logic is OK, but there may be different
advice.
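The negative-feedback adjustment described in point 3 of the patch [2/3] notes can be sketched as a small standalone function. This is only an illustrative model: the constants and the function name `adjust_throttle` are made up for the example and are not taken from the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants -- not the values used in the patchset. */
#define LINEAR_STEP_US   1000   /* coarse step when far from the quota */
#define SLICE_STEP_US     100   /* fine step when close to the quota   */
#define WIDE_DIFF_MB       50   /* threshold between the two modes     */

/* Return the adjusted per-ring-full sleep time in microseconds:
 * big steps when the current rate is far from the quota, small
 * fixed slices when it is close, so the rate settles smoothly. */
static int64_t adjust_throttle(int64_t sleep_us,
                               uint64_t quota_mbps,
                               uint64_t current_mbps)
{
    int64_t diff = (int64_t)current_mbps - (int64_t)quota_mbps;

    if (diff > WIDE_DIFF_MB) {
        sleep_us += LINEAR_STEP_US;   /* far above quota: big increase */
    } else if (diff > 0) {
        sleep_us += SLICE_STEP_US;    /* slightly above: small increase */
    } else if (diff < -WIDE_DIFF_MB) {
        sleep_us -= LINEAR_STEP_US;   /* far below quota: big decrease */
    } else if (diff < 0) {
        sleep_us -= SLICE_STEP_US;    /* slightly below: small decrease */
    }
    return sleep_us < 0 ? 0 : sleep_us;
}
```

Called once per adjustment period, this converges toward the quota without the oscillation a single fixed step size would cause.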
- To let developers play with it more easily, here is the hmp usage example:
(qemu) vcpu_dirty_limit -g on -1 500
[Please use 'info vcpu_dirty_limit' to query dirty limit for virtual CPU]
(qemu) info vcpu_dirty_limit
vcpu[0], limit rate 500 (MB/s), current rate 415 (MB/s)
vcpu[1], limit rate 500 (MB/s), current rate 496 (MB/s)
vcpu[2], limit rate 500 (MB/s), current rate 0 (MB/s)
vcpu[3], limit rate 500 (MB/s), current rate 0 (MB/s)
(qemu) vcpu_dirty_limit -g off
[Please use 'info vcpu_dirty_limit' to query dirty limit for virtual CPU]
(qemu) info vcpu_dirty_limit
Dirty page limit not enabled!
(qemu) vcpu_dirty_limit on 0 300
[Please use 'info vcpu_dirty_limit' to query dirty limit for virtual CPU]
(qemu) vcpu_dirty_limit on 1 500
[Please use 'info vcpu_dirty_limit' to query dirty limit for virtual CPU]
(qemu) info vcpu_dirty_limit
vcpu[0], limit rate 300 (MB/s), current rate 342 (MB/s)
vcpu[1], limit rate 500 (MB/s), current rate 485 (MB/s)
(qemu) vcpu_dirty_limit off 0
[Please use 'info vcpu_dirty_limit' to query dirty limit for virtual CPU]
(qemu) info vcpu_dirty_limit
vcpu[1], limit rate 500 (MB/s), current rate 528 (MB/s)
(qemu) vcpu_dirty_limit off 1
[Please use 'info vcpu_dirty_limit' to query dirty limit for virtual CPU]
(qemu) info vcpu_dirty_limit
Dirty page limit not enabled!
Thanks very much for the instructive algorithm suggestion from Peter,
and the comments and other code review from Markus.
Please review, thanks!
v9:
- rebase on master
- fix the meson directory change, keeping it untouched.
v8:
- rebase on master
- polish the error message and remove the "unlikely" compilation syntax
according to the advice given by Markus.
- keep the dirty tracking enabled during "dirtylimit-calc" lifecycle
so that the overhead can be reduced according to the advice given by
Peter.
- merge the "set/cancel" qmp commands into one named "vcpu-dirty-limit"
and introduce qmp command "query-vcpu-dirty-limit" to query dirty
limit information about virtual CPU, according to the advice given by
Peter.
- check that the vcpu index is valid and handle the unplugged case before
enabling or disabling the dirty limit for a virtual CPU.
- introduce hmp commands so developers can play with them easier, use
"vcpu_dirty_limit" to enable dirty limit and "info vcpu_dirty_limit"
to query.
The patch [2/3] has not been touched so far. Any corrections and
suggestions are welcome.
Please review, thanks!
v7:
- rebase on master
- polish the comments and error messages according to the
advice given by Markus
- introduce the dirtylimit_enabled function to pre-check whether the
dirty page limit is enabled before canceling.
v6:
- rebase on master
- fix dirtylimit setup crash found by Markus
- polish the comments according to the advice given by Markus
- adjust the qemu qmp command tag to 7.0
v5:
- rebase on master
- adjust the throttle algorithm by removing the tuning in the
RESTRAINT_RATIO case so that the dirty page rate can reach the quota
more quickly.
- fix percentage update in throttle iteration.
v4:
- rebase on master
- modify the following points according to the advice given by Markus
1. move the definition into migration.json
2. polish the comments of set-dirty-limit
3. do the syntax check and change dirty rate to dirty page rate
Thanks for the careful reviews made by Markus.
Please review, thanks!
v3:
- rebase on master
- modify the following points according to the advice given by Markus
1. remove the DirtyRateQuotaVcpu and use its field as an option directly
2. add comments to show details of what dirtylimit setup does
3. explain how to use dirtylimit in combination with the existing qmp
commands "calc-dirty-rate" and "query-dirty-rate" in the documentation.
Thanks for the careful reviews made by Markus.
Please review, thanks!
Hyman
v2:
- rebase on master
- modify the following points according to the advice given by Juan
1. rename dirtyrestraint to dirtylimit
2. implement the full lifecycle functions of dirtylimit_calc, including
dirtylimit_calc and dirtylimit_calc_quit
3. introduce a 'quit' field in dirtylimit_calc_state to implement
dirtylimit_calc_quit
4. remove the ready_cond and ready_mtx since they may not be suitable
5. put the 'record_dirtypage' function code at the beginning of the
file
6. remove the unnecessary return;
- other modifications have been made after code review
1. introduce 'bmap' and 'nr' field in dirtylimit_state to record the
number of running thread forked by dirtylimit
2. stop the dirtyrate calculation thread if all the dirtylimit threads
are stopped
3. do some renaming:
dirtyrate calculation thread -> dirtylimit-calc
dirtylimit thread -> dirtylimit-{cpu_index}
function name do_dirtyrestraint -> dirtylimit_check
qmp command dirty-restraint -> set-dirty-limit
qmp command dirty-restraint-cancel -> cancel-dirty-limit
header file dirtyrestraint.h -> dirtylimit.h
Please review, thanks!
Thanks for the accurate and timely advice given by Juan. We would really
appreciate any corrections and suggestions about this patchset.
Best regards!
Hyman
v1:
This patchset introduces a mechanism to impose dirty restraint
on vCPUs, aiming to keep each vCPU running within a certain dirty
page rate given by the user. Dirty restraint on vCPUs may be an
alternative method to implement convergence logic for live migration,
which could, in theory, improve guest memory performance during
migration compared with the traditional method.
For the current live migration implementation, the convergence
logic throttles all vCPUs of the VM, which has some side effects:
- 'read processes' on a vCPU are unnecessarily penalized
- the throttle increases its percentage step by step, and seems to
struggle to find the optimal throttle percentage when the
dirty rate is high
- it is hard to predict the remaining migration time if the
throttling percentage reaches 99%
To a certain extent, the dirty restraint mechanism can fix these
effects by throttling at vCPU granularity during migration.
The implementation is rather straightforward. We calculate the
vCPU dirty rate periodically via the Dirty Ring mechanism, as
commit 0e21bf246 "implement dirty-ring dirtyrate calculation"
does. For a vCPU specified to have dirty restraint imposed, we
throttle it periodically as auto-converge does; after each round
of throttling, we compare the quota dirty rate with the current
dirty rate, and if the current rate is not under the quota, we
increase the throttling percentage until it is.
This patchset is the basis for implementing a new auto-converge
method for live migration. We introduce two qmp commands to
impose/cancel the dirty restraint on a specified vCPU, so it can
also serve as an independent API for upper-layer applications such
as libvirt, which can use it to implement convergence logic during
live migration, supplemented with the qmp 'calc-dirty-rate' command
or whatever else.
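The convergence loop described above (raise the throttling percentage until the current rate drops under the quota) can be sketched as one periodic adjustment step. The function name and the step value are made up for illustration and are not the patchset's code:

```c
#include <assert.h>
#include <stdint.h>

#define PCT_STEP  5    /* illustrative increment per period */
#define PCT_MAX  99    /* throttling percentage never exceeds 99 */

/* One periodic adjustment: raise the throttle percentage while the
 * current dirty page rate is still above the quota; otherwise keep it. */
static int throttle_pct_next(int pct, uint64_t quota_mbps,
                             uint64_t current_mbps)
{
    if (current_mbps > quota_mbps && pct < PCT_MAX) {
        pct += PCT_STEP;
        if (pct > PCT_MAX) {
            pct = PCT_MAX;
        }
    }
    return pct;
}
```

Repeated every measurement period, this mirrors the "increase the throttling percentage until the current rate is under the quota" behavior.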
We post this patchset as an RFC, and any corrections and suggestions
about the implementation, API, throttling algorithm or anything else
are very much appreciated!
Please review, thanks!
Best Regards !
Hyman Huang (8):
accel/kvm/kvm-all: Refactor per-vcpu dirty ring reaping
cpus: Introduce cpu_list_generation_id
migration/dirtyrate: Refactor dirty page rate calculation
softmmu/dirtylimit: Implement vCPU dirtyrate calculation periodically
accel/kvm/kvm-all: Introduce kvm_dirty_ring_size function
softmmu/dirtylimit: Implement virtual CPU throttle
softmmu/dirtylimit: Implement dirty page rate limit
tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test
accel/kvm/kvm-all.c | 45 +++-
accel/stubs/kvm-stub.c | 5 +
cpus-common.c | 8 +
hmp-commands-info.hx | 13 +
hmp-commands.hx | 32 +++
include/exec/cpu-common.h | 1 +
include/exec/memory.h | 5 +-
include/hw/core/cpu.h | 6 +
include/monitor/hmp.h | 3 +
include/sysemu/dirtylimit.h | 37 +++
include/sysemu/dirtyrate.h | 28 +++
include/sysemu/kvm.h | 2 +
migration/dirtyrate.c | 227 +++++++++++------
migration/dirtyrate.h | 7 +-
qapi/migration.json | 80 ++++++
softmmu/dirtylimit.c | 602 ++++++++++++++++++++++++++++++++++++++++++++
softmmu/meson.build | 1 +
softmmu/trace-events | 7 +
tests/qtest/qmp-cmd-test.c | 2 +
19 files changed, 1012 insertions(+), 99 deletions(-)
create mode 100644 include/sysemu/dirtylimit.h
create mode 100644 include/sysemu/dirtyrate.h
create mode 100644 softmmu/dirtylimit.c
--
1.8.3.1
* [PATCH v17 1/8] accel/kvm/kvm-all: Refactor per-vcpu dirty ring reaping
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Add an optional 'CPUState' argument to kvm_dirty_ring_reap so
that it can cover the single-vcpu dirty-ring-reaping scenario.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
accel/kvm/kvm-all.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 0e66ebb..7b06b8a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -756,17 +756,20 @@ static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
}
/* Must be with slots_lock held */
-static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
+static uint64_t kvm_dirty_ring_reap_locked(KVMState *s, CPUState* cpu)
{
int ret;
- CPUState *cpu;
uint64_t total = 0;
int64_t stamp;
stamp = get_clock();
- CPU_FOREACH(cpu) {
- total += kvm_dirty_ring_reap_one(s, cpu);
+ if (cpu) {
+ total = kvm_dirty_ring_reap_one(s, cpu);
+ } else {
+ CPU_FOREACH(cpu) {
+ total += kvm_dirty_ring_reap_one(s, cpu);
+ }
}
if (total) {
@@ -787,7 +790,7 @@ static uint64_t kvm_dirty_ring_reap_locked(KVMState *s)
* Currently for simplicity, we must hold BQL before calling this. We can
* consider to drop the BQL if we're clear with all the race conditions.
*/
-static uint64_t kvm_dirty_ring_reap(KVMState *s)
+static uint64_t kvm_dirty_ring_reap(KVMState *s, CPUState *cpu)
{
uint64_t total;
@@ -807,7 +810,7 @@ static uint64_t kvm_dirty_ring_reap(KVMState *s)
* reset below.
*/
kvm_slots_lock();
- total = kvm_dirty_ring_reap_locked(s);
+ total = kvm_dirty_ring_reap_locked(s, cpu);
kvm_slots_unlock();
return total;
@@ -854,7 +857,7 @@ static void kvm_dirty_ring_flush(void)
* vcpus out in a synchronous way.
*/
kvm_cpu_synchronize_kick_all();
- kvm_dirty_ring_reap(kvm_state);
+ kvm_dirty_ring_reap(kvm_state, NULL);
trace_kvm_dirty_ring_flush(1);
}
@@ -1398,7 +1401,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
* Not easy. Let's cross the fingers until it's fixed.
*/
if (kvm_state->kvm_dirty_ring_size) {
- kvm_dirty_ring_reap_locked(kvm_state);
+ kvm_dirty_ring_reap_locked(kvm_state, NULL);
} else {
kvm_slot_get_dirty_log(kvm_state, mem);
}
@@ -1470,7 +1473,7 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
qemu_mutex_lock_iothread();
- kvm_dirty_ring_reap(s);
+ kvm_dirty_ring_reap(s, NULL);
qemu_mutex_unlock_iothread();
r->reaper_iteration++;
@@ -2956,7 +2959,7 @@ int kvm_cpu_exec(CPUState *cpu)
*/
trace_kvm_dirty_ring_full(cpu->cpu_index);
qemu_mutex_lock_iothread();
- kvm_dirty_ring_reap(kvm_state);
+ kvm_dirty_ring_reap(kvm_state, NULL);
qemu_mutex_unlock_iothread();
ret = 0;
break;
--
1.8.3.1
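The NULL-vs-specific-cpu dispatch added by this patch follows a common pattern: one argument selects either a single element or the whole collection. A minimal standalone model (using an index with a negative sentinel instead of a CPUState pointer; all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 4
static int dirty_count[NCPU] = {3, 1, 4, 1};

/* Collect and clear the dirty count of a single vcpu. */
static int reap_one(int idx)
{
    int n = dirty_count[idx];
    dirty_count[idx] = 0;
    return n;
}

/* idx < 0 plays the role of the NULL CPUState*: reap every vcpu;
 * otherwise reap only the requested one. */
static int reap(int idx)
{
    int total = 0;

    if (idx >= 0) {
        total = reap_one(idx);
    } else {
        for (int i = 0; i < NCPU; i++) {
            total += reap_one(i);
        }
    }
    return total;
}
```

Passing the sentinel keeps all existing "reap everything" call sites unchanged, which is why the patch passes NULL at those sites.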
* [PATCH v17 2/8] cpus: Introduce cpu_list_generation_id
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Introduce cpu_list_generation_id to track the cpu list generation so
that cpu hotplug/unplug can be detected during measurement of the
dirty page rate.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
cpus-common.c | 8 ++++++++
include/exec/cpu-common.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/cpus-common.c b/cpus-common.c
index 6e73d3e..31c6415 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -73,6 +73,12 @@ static int cpu_get_free_index(void)
}
CPUTailQ cpus = QTAILQ_HEAD_INITIALIZER(cpus);
+static unsigned int cpu_list_generation_id;
+
+unsigned int cpu_list_generation_id_get(void)
+{
+ return cpu_list_generation_id;
+}
void cpu_list_add(CPUState *cpu)
{
@@ -84,6 +90,7 @@ void cpu_list_add(CPUState *cpu)
assert(!cpu_index_auto_assigned);
}
QTAILQ_INSERT_TAIL_RCU(&cpus, cpu, node);
+ cpu_list_generation_id++;
}
void cpu_list_remove(CPUState *cpu)
@@ -96,6 +103,7 @@ void cpu_list_remove(CPUState *cpu)
QTAILQ_REMOVE_RCU(&cpus, cpu, node);
cpu->cpu_index = UNASSIGNED_CPU_INDEX;
+ cpu_list_generation_id++;
}
CPUState *qemu_get_cpu(int index)
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index de5f444..eb33642 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -20,6 +20,7 @@ extern intptr_t qemu_host_page_mask;
void qemu_init_cpu_list(void);
void cpu_list_lock(void);
void cpu_list_unlock(void);
+unsigned int cpu_list_generation_id_get(void);
void tcg_flush_softmmu_tlb(CPUState *cs);
--
1.8.3.1
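The generation counter enables a snapshot-and-retry pattern around measurements that cannot hold the cpu list lock while sleeping. A simplified self-contained sketch of that pattern (the mock `sample` function and `measure` loop are illustrative, not the QEMU code):

```c
#include <assert.h>

static unsigned int cpu_list_generation_id;

static unsigned int cpu_list_generation_id_get(void)
{
    return cpu_list_generation_id;
}

/* Bumped on every cpu_list_add()/cpu_list_remove(), as in the patch. */
static void cpu_list_changed(void)
{
    cpu_list_generation_id++;
}

/* Mock measurement round: the first round races with a hotplug event. */
static int sample_rounds;
static int sample(void)
{
    if (++sample_rounds == 1) {
        cpu_list_changed();   /* hotplug happened mid-measurement */
    }
    return sample_rounds;
}

/* Run a measurement, retrying whenever the cpu list changed meanwhile. */
static int measure(void)
{
    unsigned int gen_id;
    int result;

    do {
        gen_id = cpu_list_generation_id_get();
        result = sample();    /* may sleep; hotplug can race with it */
    } while (gen_id != cpu_list_generation_id_get());

    return result;
}
```

A stale generation id invalidates the round, so a measurement that raced with hotplug is simply discarded and redone.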
* [PATCH v17 3/8] migration/dirtyrate: Refactor dirty page rate calculation
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Abstract out the dirty log change logic into the function
global_dirty_log_change.
Abstract out the dirty-ring-based dirty page rate calculation
logic into the function vcpu_calculate_dirtyrate.
Abstract out the mathematical dirty page rate calculation into
do_calculate_dirtyrate, decoupling it from DirtyStat.
Rename set_sample_page_period to dirty_stat_wait, which is
better understood and will be reused in dirtylimit.
Handle the cpu hotplug/unplug scenario during measurement of the
dirty page rate.
Export utility functions outside migration.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
include/sysemu/dirtyrate.h | 28 ++++++
migration/dirtyrate.c | 227 ++++++++++++++++++++++++++++-----------------
migration/dirtyrate.h | 7 +-
3 files changed, 174 insertions(+), 88 deletions(-)
create mode 100644 include/sysemu/dirtyrate.h
diff --git a/include/sysemu/dirtyrate.h b/include/sysemu/dirtyrate.h
new file mode 100644
index 0000000..4d3b9a4
--- /dev/null
+++ b/include/sysemu/dirtyrate.h
@@ -0,0 +1,28 @@
+/*
+ * dirty page rate helper functions
+ *
+ * Copyright (c) 2022 CHINA TELECOM CO.,LTD.
+ *
+ * Authors:
+ * Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_DIRTYRATE_H
+#define QEMU_DIRTYRATE_H
+
+typedef struct VcpuStat {
+ int nvcpu; /* number of vcpu */
+ DirtyRateVcpu *rates; /* array of dirty rate for each vcpu */
+} VcpuStat;
+
+int64_t vcpu_calculate_dirtyrate(int64_t calc_time_ms,
+ VcpuStat *stat,
+ unsigned int flag,
+ bool one_shot);
+
+void global_dirty_log_change(unsigned int flag,
+ bool start);
+#endif
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index d65e744..79348de 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -46,7 +46,7 @@ static struct DirtyRateStat DirtyStat;
static DirtyRateMeasureMode dirtyrate_mode =
DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING;
-static int64_t set_sample_page_period(int64_t msec, int64_t initial_time)
+static int64_t dirty_stat_wait(int64_t msec, int64_t initial_time)
{
int64_t current_time;
@@ -60,6 +60,132 @@ static int64_t set_sample_page_period(int64_t msec, int64_t initial_time)
return msec;
}
+static inline void record_dirtypages(DirtyPageRecord *dirty_pages,
+ CPUState *cpu, bool start)
+{
+ if (start) {
+ dirty_pages[cpu->cpu_index].start_pages = cpu->dirty_pages;
+ } else {
+ dirty_pages[cpu->cpu_index].end_pages = cpu->dirty_pages;
+ }
+}
+
+static int64_t do_calculate_dirtyrate(DirtyPageRecord dirty_pages,
+ int64_t calc_time_ms)
+{
+ uint64_t memory_size_MB;
+ uint64_t increased_dirty_pages =
+ dirty_pages.end_pages - dirty_pages.start_pages;
+
+ memory_size_MB = (increased_dirty_pages * TARGET_PAGE_SIZE) >> 20;
+
+ return memory_size_MB * 1000 / calc_time_ms;
+}
+
+void global_dirty_log_change(unsigned int flag, bool start)
+{
+ qemu_mutex_lock_iothread();
+ if (start) {
+ memory_global_dirty_log_start(flag);
+ } else {
+ memory_global_dirty_log_stop(flag);
+ }
+ qemu_mutex_unlock_iothread();
+}
+
+/*
+ * global_dirty_log_sync
+ * 1. sync dirty log from kvm
+ * 2. stop dirty tracking if needed.
+ */
+static void global_dirty_log_sync(unsigned int flag, bool one_shot)
+{
+ qemu_mutex_lock_iothread();
+ memory_global_dirty_log_sync();
+ if (one_shot) {
+ memory_global_dirty_log_stop(flag);
+ }
+ qemu_mutex_unlock_iothread();
+}
+
+static DirtyPageRecord *vcpu_dirty_stat_alloc(VcpuStat *stat)
+{
+ CPUState *cpu;
+ DirtyPageRecord *records;
+ int nvcpu = 0;
+
+ CPU_FOREACH(cpu) {
+ nvcpu++;
+ }
+
+ stat->nvcpu = nvcpu;
+ stat->rates = g_malloc0(sizeof(DirtyRateVcpu) * nvcpu);
+
+ records = g_malloc0(sizeof(DirtyPageRecord) * nvcpu);
+
+ return records;
+}
+
+static void vcpu_dirty_stat_collect(VcpuStat *stat,
+ DirtyPageRecord *records,
+ bool start)
+{
+ CPUState *cpu;
+
+ CPU_FOREACH(cpu) {
+ record_dirtypages(records, cpu, start);
+ }
+}
+
+int64_t vcpu_calculate_dirtyrate(int64_t calc_time_ms,
+ VcpuStat *stat,
+ unsigned int flag,
+ bool one_shot)
+{
+ DirtyPageRecord *records;
+ int64_t init_time_ms;
+ int64_t duration;
+ int64_t dirtyrate;
+ int i = 0;
+ unsigned int gen_id;
+
+retry:
+ init_time_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+ cpu_list_lock();
+ gen_id = cpu_list_generation_id_get();
+ records = vcpu_dirty_stat_alloc(stat);
+ vcpu_dirty_stat_collect(stat, records, true);
+ cpu_list_unlock();
+
+ duration = dirty_stat_wait(calc_time_ms, init_time_ms);
+
+ global_dirty_log_sync(flag, one_shot);
+
+ cpu_list_lock();
+ if (gen_id != cpu_list_generation_id_get()) {
+ g_free(records);
+ g_free(stat->rates);
+ cpu_list_unlock();
+ goto retry;
+ }
+ vcpu_dirty_stat_collect(stat, records, false);
+ cpu_list_unlock();
+
+ for (i = 0; i < stat->nvcpu; i++) {
+ dirtyrate = do_calculate_dirtyrate(records[i], duration);
+
+ stat->rates[i].id = i;
+ stat->rates[i].dirty_rate = dirtyrate;
+
+ trace_dirtyrate_do_calculate_vcpu(i, dirtyrate);
+ }
+
+ g_free(records);
+
+ return duration;
+}
+
static bool is_sample_period_valid(int64_t sec)
{
if (sec < MIN_FETCH_DIRTYRATE_TIME_SEC ||
@@ -396,44 +522,6 @@ static bool compare_page_hash_info(struct RamblockDirtyInfo *info,
return true;
}
-static inline void record_dirtypages(DirtyPageRecord *dirty_pages,
- CPUState *cpu, bool start)
-{
- if (start) {
- dirty_pages[cpu->cpu_index].start_pages = cpu->dirty_pages;
- } else {
- dirty_pages[cpu->cpu_index].end_pages = cpu->dirty_pages;
- }
-}
-
-static void dirtyrate_global_dirty_log_start(void)
-{
- qemu_mutex_lock_iothread();
- memory_global_dirty_log_start(GLOBAL_DIRTY_DIRTY_RATE);
- qemu_mutex_unlock_iothread();
-}
-
-static void dirtyrate_global_dirty_log_stop(void)
-{
- qemu_mutex_lock_iothread();
- memory_global_dirty_log_sync();
- memory_global_dirty_log_stop(GLOBAL_DIRTY_DIRTY_RATE);
- qemu_mutex_unlock_iothread();
-}
-
-static int64_t do_calculate_dirtyrate_vcpu(DirtyPageRecord dirty_pages)
-{
- uint64_t memory_size_MB;
- int64_t time_s;
- uint64_t increased_dirty_pages =
- dirty_pages.end_pages - dirty_pages.start_pages;
-
- memory_size_MB = (increased_dirty_pages * TARGET_PAGE_SIZE) >> 20;
- time_s = DirtyStat.calc_time;
-
- return memory_size_MB / time_s;
-}
-
static inline void record_dirtypages_bitmap(DirtyPageRecord *dirty_pages,
bool start)
{
@@ -444,11 +532,6 @@ static inline void record_dirtypages_bitmap(DirtyPageRecord *dirty_pages,
}
}
-static void do_calculate_dirtyrate_bitmap(DirtyPageRecord dirty_pages)
-{
- DirtyStat.dirty_rate = do_calculate_dirtyrate_vcpu(dirty_pages);
-}
-
static inline void dirtyrate_manual_reset_protect(void)
{
RAMBlock *block = NULL;
@@ -492,71 +575,49 @@ static void calculate_dirtyrate_dirty_bitmap(struct DirtyRateConfig config)
DirtyStat.start_time = start_time / 1000;
msec = config.sample_period_seconds * 1000;
- msec = set_sample_page_period(msec, start_time);
+ msec = dirty_stat_wait(msec, start_time);
DirtyStat.calc_time = msec / 1000;
/*
- * dirtyrate_global_dirty_log_stop do two things.
+ * do two things.
* 1. fetch dirty bitmap from kvm
* 2. stop dirty tracking
*/
- dirtyrate_global_dirty_log_stop();
+ global_dirty_log_sync(GLOBAL_DIRTY_DIRTY_RATE, true);
record_dirtypages_bitmap(&dirty_pages, false);
- do_calculate_dirtyrate_bitmap(dirty_pages);
+ DirtyStat.dirty_rate = do_calculate_dirtyrate(dirty_pages, msec);
}
static void calculate_dirtyrate_dirty_ring(struct DirtyRateConfig config)
{
- CPUState *cpu;
- int64_t msec = 0;
- int64_t start_time;
+ int64_t duration;
uint64_t dirtyrate = 0;
uint64_t dirtyrate_sum = 0;
- DirtyPageRecord *dirty_pages;
- int nvcpu = 0;
int i = 0;
- CPU_FOREACH(cpu) {
- nvcpu++;
- }
-
- dirty_pages = malloc(sizeof(*dirty_pages) * nvcpu);
-
- DirtyStat.dirty_ring.nvcpu = nvcpu;
- DirtyStat.dirty_ring.rates = malloc(sizeof(DirtyRateVcpu) * nvcpu);
-
- dirtyrate_global_dirty_log_start();
-
- CPU_FOREACH(cpu) {
- record_dirtypages(dirty_pages, cpu, true);
- }
-
- start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
- DirtyStat.start_time = start_time / 1000;
+ /* start log sync */
+ global_dirty_log_change(GLOBAL_DIRTY_DIRTY_RATE, true);
- msec = config.sample_period_seconds * 1000;
- msec = set_sample_page_period(msec, start_time);
- DirtyStat.calc_time = msec / 1000;
+ DirtyStat.start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
- dirtyrate_global_dirty_log_stop();
+ /* calculate vcpu dirtyrate */
+ duration = vcpu_calculate_dirtyrate(config.sample_period_seconds * 1000,
+ &DirtyStat.dirty_ring,
+ GLOBAL_DIRTY_DIRTY_RATE,
+ true);
- CPU_FOREACH(cpu) {
- record_dirtypages(dirty_pages, cpu, false);
- }
+ DirtyStat.calc_time = duration / 1000;
+ /* calculate vm dirtyrate */
for (i = 0; i < DirtyStat.dirty_ring.nvcpu; i++) {
- dirtyrate = do_calculate_dirtyrate_vcpu(dirty_pages[i]);
- trace_dirtyrate_do_calculate_vcpu(i, dirtyrate);
-
- DirtyStat.dirty_ring.rates[i].id = i;
+ dirtyrate = DirtyStat.dirty_ring.rates[i].dirty_rate;
DirtyStat.dirty_ring.rates[i].dirty_rate = dirtyrate;
dirtyrate_sum += dirtyrate;
}
DirtyStat.dirty_rate = dirtyrate_sum;
- free(dirty_pages);
}
static void calculate_dirtyrate_sample_vm(struct DirtyRateConfig config)
@@ -574,7 +635,7 @@ static void calculate_dirtyrate_sample_vm(struct DirtyRateConfig config)
rcu_read_unlock();
msec = config.sample_period_seconds * 1000;
- msec = set_sample_page_period(msec, initial_time);
+ msec = dirty_stat_wait(msec, initial_time);
DirtyStat.start_time = initial_time / 1000;
DirtyStat.calc_time = msec / 1000;
diff --git a/migration/dirtyrate.h b/migration/dirtyrate.h
index 69d4c5b..594a5c0 100644
--- a/migration/dirtyrate.h
+++ b/migration/dirtyrate.h
@@ -13,6 +13,8 @@
#ifndef QEMU_MIGRATION_DIRTYRATE_H
#define QEMU_MIGRATION_DIRTYRATE_H
+#include "sysemu/dirtyrate.h"
+
/*
* Sample 512 pages per GB as default.
*/
@@ -65,11 +67,6 @@ typedef struct SampleVMStat {
uint64_t total_block_mem_MB; /* size of total sampled pages in MB */
} SampleVMStat;
-typedef struct VcpuStat {
- int nvcpu; /* number of vcpu */
- DirtyRateVcpu *rates; /* array of dirty rate for each vcpu */
-} VcpuStat;
-
/*
* Store calculation statistics for each measure.
*/
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v17 4/8] softmmu/dirtylimit: Implement vCPU dirtyrate calculation periodically
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
` (2 preceding siblings ...)
2022-03-02 17:55 ` [PATCH v17 3/8] migration/dirtyrate: Refactor dirty page rate calculation huangy81
@ 2022-03-02 17:55 ` huangy81
2022-03-02 17:55 ` [PATCH v17 5/8] accel/kvm/kvm-all: Introduce kvm_dirty_ring_size function huangy81
` (3 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: huangy81 @ 2022-03-02 17:55 UTC (permalink / raw)
To: qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Hyman, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Introduce GLOBAL_DIRTY_LIMIT, a third reason for dirty
tracking, used to calculate the dirty page rate periodically
for dirty page rate limiting.
Add dirtylimit.c to implement the periodic dirty page rate
calculation, which will be used for the dirty page rate limit.
Add dirtylimit.h to export utility functions for the dirty
page rate limit implementation.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
include/exec/memory.h | 5 +-
include/sysemu/dirtylimit.h | 22 +++++++++
softmmu/dirtylimit.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
softmmu/meson.build | 1 +
4 files changed, 143 insertions(+), 1 deletion(-)
create mode 100644 include/sysemu/dirtylimit.h
create mode 100644 softmmu/dirtylimit.c
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4d5997e..88ca510 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -69,7 +69,10 @@ static inline void fuzz_dma_read_cb(size_t addr,
/* Dirty tracking enabled because measuring dirty rate */
#define GLOBAL_DIRTY_DIRTY_RATE (1U << 1)
-#define GLOBAL_DIRTY_MASK (0x3)
+/* Dirty tracking enabled because dirty limit */
+#define GLOBAL_DIRTY_LIMIT (1U << 2)
+
+#define GLOBAL_DIRTY_MASK (0x7)
extern unsigned int global_dirty_tracking;
diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
new file mode 100644
index 0000000..da459f0
--- /dev/null
+++ b/include/sysemu/dirtylimit.h
@@ -0,0 +1,22 @@
+/*
+ * Dirty page rate limit common functions
+ *
+ * Copyright (c) 2022 CHINA TELECOM CO.,LTD.
+ *
+ * Authors:
+ * Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_DIRTYRLIMIT_H
+#define QEMU_DIRTYRLIMIT_H
+
+#define DIRTYLIMIT_CALC_TIME_MS 1000 /* 1000ms */
+
+int64_t vcpu_dirty_rate_get(int cpu_index);
+void vcpu_dirty_rate_stat_start(void);
+void vcpu_dirty_rate_stat_stop(void);
+void vcpu_dirty_rate_stat_initialize(void);
+void vcpu_dirty_rate_stat_finalize(void);
+#endif
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
new file mode 100644
index 0000000..6102e8c
--- /dev/null
+++ b/softmmu/dirtylimit.c
@@ -0,0 +1,116 @@
+/*
+ * Dirty page rate limit implementation code
+ *
+ * Copyright (c) 2022 CHINA TELECOM CO.,LTD.
+ *
+ * Authors:
+ * Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/main-loop.h"
+#include "qapi/qapi-commands-migration.h"
+#include "sysemu/dirtyrate.h"
+#include "sysemu/dirtylimit.h"
+#include "exec/memory.h"
+#include "hw/boards.h"
+
+struct {
+ VcpuStat stat;
+ bool running;
+ QemuThread thread;
+} *vcpu_dirty_rate_stat;
+
+static void vcpu_dirty_rate_stat_collect(void)
+{
+ VcpuStat stat;
+ int i = 0;
+
+ /* calculate vcpu dirtyrate */
+ vcpu_calculate_dirtyrate(DIRTYLIMIT_CALC_TIME_MS,
+ &stat,
+ GLOBAL_DIRTY_LIMIT,
+ false);
+
+ for (i = 0; i < stat.nvcpu; i++) {
+ vcpu_dirty_rate_stat->stat.rates[i].id = i;
+ vcpu_dirty_rate_stat->stat.rates[i].dirty_rate =
+ stat.rates[i].dirty_rate;
+ }
+
+ free(stat.rates);
+}
+
+static void *vcpu_dirty_rate_stat_thread(void *opaque)
+{
+ rcu_register_thread();
+
+ /* start log sync */
+ global_dirty_log_change(GLOBAL_DIRTY_LIMIT, true);
+
+ while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
+ vcpu_dirty_rate_stat_collect();
+ }
+
+ /* stop log sync */
+ global_dirty_log_change(GLOBAL_DIRTY_LIMIT, false);
+
+ rcu_unregister_thread();
+ return NULL;
+}
+
+int64_t vcpu_dirty_rate_get(int cpu_index)
+{
+ DirtyRateVcpu *rates = vcpu_dirty_rate_stat->stat.rates;
+ return qatomic_read(&rates[cpu_index].dirty_rate);
+}
+
+void vcpu_dirty_rate_stat_start(void)
+{
+ if (qatomic_read(&vcpu_dirty_rate_stat->running)) {
+ return;
+ }
+
+ qatomic_set(&vcpu_dirty_rate_stat->running, 1);
+ qemu_thread_create(&vcpu_dirty_rate_stat->thread,
+ "dirtyrate-stat",
+ vcpu_dirty_rate_stat_thread,
+ NULL,
+ QEMU_THREAD_JOINABLE);
+}
+
+void vcpu_dirty_rate_stat_stop(void)
+{
+ qatomic_set(&vcpu_dirty_rate_stat->running, 0);
+ qemu_mutex_unlock_iothread();
+ qemu_thread_join(&vcpu_dirty_rate_stat->thread);
+ qemu_mutex_lock_iothread();
+}
+
+void vcpu_dirty_rate_stat_initialize(void)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+ int max_cpus = ms->smp.max_cpus;
+
+ vcpu_dirty_rate_stat =
+ g_malloc0(sizeof(*vcpu_dirty_rate_stat));
+
+ vcpu_dirty_rate_stat->stat.nvcpu = max_cpus;
+ vcpu_dirty_rate_stat->stat.rates =
+ g_malloc0(sizeof(DirtyRateVcpu) * max_cpus);
+
+ vcpu_dirty_rate_stat->running = false;
+}
+
+void vcpu_dirty_rate_stat_finalize(void)
+{
+ free(vcpu_dirty_rate_stat->stat.rates);
+ vcpu_dirty_rate_stat->stat.rates = NULL;
+
+ free(vcpu_dirty_rate_stat);
+ vcpu_dirty_rate_stat = NULL;
+}
diff --git a/softmmu/meson.build b/softmmu/meson.build
index 39f766c..ab1b407 100644
--- a/softmmu/meson.build
+++ b/softmmu/meson.build
@@ -15,6 +15,7 @@ specific_ss.add(when: 'CONFIG_SOFTMMU', if_true: [files(
'vl.c',
'cpu-timers.c',
'runstate-action.c',
+ 'dirtylimit.c',
)])
specific_ss.add(when: ['CONFIG_SOFTMMU', 'CONFIG_TCG'], if_true: [files(
--
1.8.3.1
* [PATCH v17 5/8] accel/kvm/kvm-all: Introduce kvm_dirty_ring_size function
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
` (3 preceding siblings ...)
2022-03-02 17:55 ` [PATCH v17 4/8] softmmu/dirtylimit: Implement vCPU dirtyrate calculation periodically huangy81
@ 2022-03-02 17:55 ` huangy81
2022-03-02 17:55 ` [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle huangy81
` (2 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: huangy81 @ 2022-03-02 17:55 UTC (permalink / raw)
To: qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Hyman, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Introduce the kvm_dirty_ring_size utility function to help
calculate the dirty ring full time.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Acked-by: Peter Xu <peterx@redhat.com>
---
accel/kvm/kvm-all.c | 5 +++++
accel/stubs/kvm-stub.c | 5 +++++
include/sysemu/kvm.h | 2 ++
3 files changed, 12 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7b06b8a..8821d80 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2312,6 +2312,11 @@ bool kvm_dirty_ring_enabled(void)
return kvm_state->kvm_dirty_ring_size ? true : false;
}
+uint32_t kvm_dirty_ring_size(void)
+{
+ return kvm_state->kvm_dirty_ring_size;
+}
+
static int kvm_init(MachineState *ms)
{
MachineClass *mc = MACHINE_GET_CLASS(ms);
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 5319573..1128cb2 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -152,4 +152,9 @@ bool kvm_dirty_ring_enabled(void)
{
return false;
}
+
+uint32_t kvm_dirty_ring_size(void)
+{
+ return 0;
+}
#endif
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 6eb39a0..bc3f0b5 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -563,4 +563,6 @@ bool kvm_cpu_check_are_resettable(void);
bool kvm_arch_cpu_check_are_resettable(void);
bool kvm_dirty_ring_enabled(void);
+
+uint32_t kvm_dirty_ring_size(void);
#endif
--
1.8.3.1
* [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
` (4 preceding siblings ...)
2022-03-02 17:55 ` [PATCH v17 5/8] accel/kvm/kvm-all: Introduce kvm_dirty_ring_size function huangy81
@ 2022-03-02 17:55 ` huangy81
2022-05-16 17:13 ` manish.mishra
2022-03-02 17:55 ` [PATCH v17 7/8] softmmu/dirtylimit: Implement dirty page rate limit huangy81
2022-03-02 17:55 ` [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test huangy81
7 siblings, 1 reply; 22+ messages in thread
From: huangy81 @ 2022-03-02 17:55 UTC (permalink / raw)
To: qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Hyman, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Set up a negative feedback system for the vCPU thread when it
handles a KVM_EXIT_DIRTY_RING_FULL exit, by introducing the
throttle_us_per_full field in struct CPUState. The vCPU sleeps
throttle_us_per_full microseconds to throttle itself while
dirtylimit is in service.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
accel/kvm/kvm-all.c | 19 ++-
include/hw/core/cpu.h | 6 +
include/sysemu/dirtylimit.h | 15 +++
softmmu/dirtylimit.c | 291 ++++++++++++++++++++++++++++++++++++++++++++
softmmu/trace-events | 7 ++
5 files changed, 337 insertions(+), 1 deletion(-)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 8821d80..98e43e6 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -45,6 +45,7 @@
#include "qemu/guest-random.h"
#include "sysemu/hw_accel.h"
#include "kvm-cpus.h"
+#include "sysemu/dirtylimit.h"
#include "hw/boards.h"
@@ -476,6 +477,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
cpu->kvm_state = s;
cpu->vcpu_dirty = true;
cpu->dirty_pages = 0;
+ cpu->throttle_us_per_full = 0;
mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
if (mmap_size < 0) {
@@ -1469,6 +1471,11 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
*/
sleep(1);
+ /* keep sleeping so that dirtylimit is not interfered with by the reaper */
+ if (dirtylimit_in_service()) {
+ continue;
+ }
+
trace_kvm_dirty_ring_reaper("wakeup");
r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
@@ -2964,8 +2971,18 @@ int kvm_cpu_exec(CPUState *cpu)
*/
trace_kvm_dirty_ring_full(cpu->cpu_index);
qemu_mutex_lock_iothread();
- kvm_dirty_ring_reap(kvm_state, NULL);
+ /* We throttle the vCPU by making it sleep once it exits from the
+ * kernel due to a full dirty ring. In the dirtylimit scenario,
+ * reaping all vCPUs after a single vCPU's ring gets full would
+ * miss the sleep, so just reap the vCPU whose ring is full.
+ */
+ if (dirtylimit_in_service()) {
+ kvm_dirty_ring_reap(kvm_state, cpu);
+ } else {
+ kvm_dirty_ring_reap(kvm_state, NULL);
+ }
qemu_mutex_unlock_iothread();
+ dirtylimit_vcpu_execute(cpu);
ret = 0;
break;
case KVM_EXIT_SYSTEM_EVENT:
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 76ab3b8..dbeb31a 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -411,6 +411,12 @@ struct CPUState {
*/
bool throttle_thread_scheduled;
+ /*
+ * Sleep throttle_us_per_full microseconds once dirty ring is full
+ * if dirty page rate limit is enabled.
+ */
+ int64_t throttle_us_per_full;
+
bool ignore_memory_transaction_failures;
/* Used for user-only emulation of prctl(PR_SET_UNALIGN). */
diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
index da459f0..8d2c1f3 100644
--- a/include/sysemu/dirtylimit.h
+++ b/include/sysemu/dirtylimit.h
@@ -19,4 +19,19 @@ void vcpu_dirty_rate_stat_start(void);
void vcpu_dirty_rate_stat_stop(void);
void vcpu_dirty_rate_stat_initialize(void);
void vcpu_dirty_rate_stat_finalize(void);
+
+void dirtylimit_state_lock(void);
+void dirtylimit_state_unlock(void);
+void dirtylimit_state_initialize(void);
+void dirtylimit_state_finalize(void);
+bool dirtylimit_in_service(void);
+bool dirtylimit_vcpu_index_valid(int cpu_index);
+void dirtylimit_process(void);
+void dirtylimit_change(bool start);
+void dirtylimit_set_vcpu(int cpu_index,
+ uint64_t quota,
+ bool enable);
+void dirtylimit_set_all(uint64_t quota,
+ bool enable);
+void dirtylimit_vcpu_execute(CPUState *cpu);
#endif
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 6102e8c..76d0b44 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -18,6 +18,26 @@
#include "sysemu/dirtylimit.h"
#include "exec/memory.h"
#include "hw/boards.h"
+#include "sysemu/kvm.h"
+#include "trace.h"
+
+/*
+ * Dirtylimit stops working if the dirty page rate error
+ * value is less than DIRTYLIMIT_TOLERANCE_RANGE
+ */
+#define DIRTYLIMIT_TOLERANCE_RANGE 25 /* MB/s */
+/*
+ * Increase or decrease the vcpu sleep time linearly if the
+ * dirty page rate error percentage is over
+ * DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT.
+ * Otherwise, adjust the sleep time by a fixed amount.
+ */
+#define DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT 50
+/*
+ * Max vcpu sleep time percentage during a cycle
+ * composed of dirty ring full and sleep time.
+ */
+#define DIRTYLIMIT_THROTTLE_PCT_MAX 99
struct {
VcpuStat stat;
@@ -25,6 +45,30 @@ struct {
QemuThread thread;
} *vcpu_dirty_rate_stat;
+typedef struct VcpuDirtyLimitState {
+ int cpu_index;
+ bool enabled;
+ /*
+ * Quota dirty page rate, unit is MB/s
+ * zero if not enabled.
+ */
+ uint64_t quota;
+} VcpuDirtyLimitState;
+
+struct {
+ VcpuDirtyLimitState *states;
+ /* Max cpus number configured by user */
+ int max_cpus;
+ /* Number of vcpu under dirtylimit */
+ int limited_nvcpu;
+} *dirtylimit_state;
+
+/* protect dirtylimit_state */
+static QemuMutex dirtylimit_mutex;
+
+/* dirtylimit thread quit if dirtylimit_quit is true */
+static bool dirtylimit_quit;
+
static void vcpu_dirty_rate_stat_collect(void)
{
VcpuStat stat;
@@ -54,6 +98,9 @@ static void *vcpu_dirty_rate_stat_thread(void *opaque)
while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
vcpu_dirty_rate_stat_collect();
+ if (dirtylimit_in_service()) {
+ dirtylimit_process();
+ }
}
/* stop log sync */
@@ -86,9 +133,11 @@ void vcpu_dirty_rate_stat_start(void)
void vcpu_dirty_rate_stat_stop(void)
{
qatomic_set(&vcpu_dirty_rate_stat->running, 0);
+ dirtylimit_state_unlock();
qemu_mutex_unlock_iothread();
qemu_thread_join(&vcpu_dirty_rate_stat->thread);
qemu_mutex_lock_iothread();
+ dirtylimit_state_lock();
}
void vcpu_dirty_rate_stat_initialize(void)
@@ -114,3 +163,245 @@ void vcpu_dirty_rate_stat_finalize(void)
free(vcpu_dirty_rate_stat);
vcpu_dirty_rate_stat = NULL;
}
+
+void dirtylimit_state_lock(void)
+{
+ qemu_mutex_lock(&dirtylimit_mutex);
+}
+
+void dirtylimit_state_unlock(void)
+{
+ qemu_mutex_unlock(&dirtylimit_mutex);
+}
+
+static void
+__attribute__((__constructor__)) dirtylimit_mutex_init(void)
+{
+ qemu_mutex_init(&dirtylimit_mutex);
+}
+
+static inline VcpuDirtyLimitState *dirtylimit_vcpu_get_state(int cpu_index)
+{
+ return &dirtylimit_state->states[cpu_index];
+}
+
+void dirtylimit_state_initialize(void)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+ int max_cpus = ms->smp.max_cpus;
+ int i;
+
+ dirtylimit_state = g_malloc0(sizeof(*dirtylimit_state));
+
+ dirtylimit_state->states =
+ g_malloc0(sizeof(VcpuDirtyLimitState) * max_cpus);
+
+ for (i = 0; i < max_cpus; i++) {
+ dirtylimit_state->states[i].cpu_index = i;
+ }
+
+ dirtylimit_state->max_cpus = max_cpus;
+ trace_dirtylimit_state_initialize(max_cpus);
+}
+
+void dirtylimit_state_finalize(void)
+{
+ g_free(dirtylimit_state->states);
+ dirtylimit_state->states = NULL;
+
+ g_free(dirtylimit_state);
+ dirtylimit_state = NULL;
+
+ trace_dirtylimit_state_finalize();
+}
+
+bool dirtylimit_in_service(void)
+{
+ return !!dirtylimit_state;
+}
+
+bool dirtylimit_vcpu_index_valid(int cpu_index)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+
+ return !(cpu_index < 0 ||
+ cpu_index >= ms->smp.max_cpus);
+}
+
+static inline int64_t dirtylimit_dirty_ring_full_time(uint64_t dirtyrate)
+{
+ static uint64_t max_dirtyrate;
+ uint32_t dirty_ring_size = kvm_dirty_ring_size();
+ uint64_t dirty_ring_size_memory_MB =
+ dirty_ring_size * TARGET_PAGE_SIZE >> 20;
+
+ if (max_dirtyrate < dirtyrate) {
+ max_dirtyrate = dirtyrate;
+ }
+
+ return dirty_ring_size_memory_MB * 1000000 / max_dirtyrate;
+}
+
+static inline bool dirtylimit_done(uint64_t quota,
+ uint64_t current)
+{
+ uint64_t min, max;
+
+ min = MIN(quota, current);
+ max = MAX(quota, current);
+
+ return (max - min) <= DIRTYLIMIT_TOLERANCE_RANGE;
+}
+
+static inline bool
+dirtylimit_need_linear_adjustment(uint64_t quota,
+ uint64_t current)
+{
+ uint64_t min, max;
+
+ min = MIN(quota, current);
+ max = MAX(quota, current);
+
+ return ((max - min) * 100 / max) > DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT;
+}
+
+static void dirtylimit_set_throttle(CPUState *cpu,
+ uint64_t quota,
+ uint64_t current)
+{
+ int64_t ring_full_time_us = 0;
+ uint64_t sleep_pct = 0;
+ uint64_t throttle_us = 0;
+
+ if (current == 0) {
+ cpu->throttle_us_per_full = 0;
+ return;
+ }
+
+ ring_full_time_us = dirtylimit_dirty_ring_full_time(current);
+
+ if (dirtylimit_need_linear_adjustment(quota, current)) {
+ if (quota < current) {
+ sleep_pct = (current - quota) * 100 / current;
+ throttle_us =
+ ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
+ cpu->throttle_us_per_full += throttle_us;
+ } else {
+ sleep_pct = (quota - current) * 100 / quota;
+ throttle_us =
+ ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
+ cpu->throttle_us_per_full -= throttle_us;
+ }
+
+ trace_dirtylimit_throttle_pct(cpu->cpu_index,
+ sleep_pct,
+ throttle_us);
+ } else {
+ if (quota < current) {
+ cpu->throttle_us_per_full += ring_full_time_us / 10;
+ } else {
+ cpu->throttle_us_per_full -= ring_full_time_us / 10;
+ }
+ }
+
+ /*
+ * TODO: in the large kvm_dirty_ring_size case (e.g. 65536 or more),
+ * the current dirty page rate may never reach the quota; should we
+ * stop increasing the sleep time?
+ */
+ cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
+ ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
+
+ cpu->throttle_us_per_full = MAX(cpu->throttle_us_per_full, 0);
+}
+
+static void dirtylimit_adjust_throttle(CPUState *cpu)
+{
+ uint64_t quota = 0;
+ uint64_t current = 0;
+ int cpu_index = cpu->cpu_index;
+
+ quota = dirtylimit_vcpu_get_state(cpu_index)->quota;
+ current = vcpu_dirty_rate_get(cpu_index);
+
+ if (!dirtylimit_done(quota, current)) {
+ dirtylimit_set_throttle(cpu, quota, current);
+ }
+
+ return;
+}
+
+void dirtylimit_process(void)
+{
+ CPUState *cpu;
+
+ if (!qatomic_read(&dirtylimit_quit)) {
+ dirtylimit_state_lock();
+
+ if (!dirtylimit_in_service()) {
+ dirtylimit_state_unlock();
+ return;
+ }
+
+ CPU_FOREACH(cpu) {
+ if (!dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled) {
+ continue;
+ }
+ dirtylimit_adjust_throttle(cpu);
+ }
+ dirtylimit_state_unlock();
+ }
+}
+
+void dirtylimit_change(bool start)
+{
+ if (start) {
+ qatomic_set(&dirtylimit_quit, 0);
+ } else {
+ qatomic_set(&dirtylimit_quit, 1);
+ }
+}
+
+void dirtylimit_set_vcpu(int cpu_index,
+ uint64_t quota,
+ bool enable)
+{
+ trace_dirtylimit_set_vcpu(cpu_index, quota);
+
+ if (enable) {
+ dirtylimit_state->states[cpu_index].quota = quota;
+ if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
+ dirtylimit_state->limited_nvcpu++;
+ }
+ } else {
+ dirtylimit_state->states[cpu_index].quota = 0;
+ if (dirtylimit_state->states[cpu_index].enabled) {
+ dirtylimit_state->limited_nvcpu--;
+ }
+ }
+
+ dirtylimit_state->states[cpu_index].enabled = enable;
+}
+
+void dirtylimit_set_all(uint64_t quota,
+ bool enable)
+{
+ MachineState *ms = MACHINE(qdev_get_machine());
+ int max_cpus = ms->smp.max_cpus;
+ int i;
+
+ for (i = 0; i < max_cpus; i++) {
+ dirtylimit_set_vcpu(i, quota, enable);
+ }
+}
+
+void dirtylimit_vcpu_execute(CPUState *cpu)
+{
+ if (dirtylimit_in_service() &&
+ dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled &&
+ cpu->throttle_us_per_full) {
+ trace_dirtylimit_vcpu_execute(cpu->cpu_index,
+ cpu->throttle_us_per_full);
+ usleep(cpu->throttle_us_per_full);
+ }
+}
diff --git a/softmmu/trace-events b/softmmu/trace-events
index 9c88887..22606dc 100644
--- a/softmmu/trace-events
+++ b/softmmu/trace-events
@@ -31,3 +31,10 @@ runstate_set(int current_state, const char *current_state_str, int new_state, co
system_wakeup_request(int reason) "reason=%d"
qemu_system_shutdown_request(int reason) "reason=%d"
qemu_system_powerdown_request(void) ""
+
+#dirtylimit.c
+dirtylimit_state_initialize(int max_cpus) "dirtylimit state initialize: max cpus %d"
+dirtylimit_state_finalize(void)
+dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
+dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
+dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
--
1.8.3.1
* [PATCH v17 7/8] softmmu/dirtylimit: Implement dirty page rate limit
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
` (5 preceding siblings ...)
2022-03-02 17:55 ` [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle huangy81
@ 2022-03-02 17:55 ` huangy81
2022-03-02 17:55 ` [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test huangy81
7 siblings, 0 replies; 22+ messages in thread
From: huangy81 @ 2022-03-02 17:55 UTC (permalink / raw)
To: qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Hyman, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Implement periodic dirty page rate calculation based on the
dirty ring, and throttle the virtual CPU until it reaches the
quota dirty page rate given by the user.
Introduce the QMP commands "set-vcpu-dirty-limit",
"cancel-vcpu-dirty-limit" and "query-vcpu-dirty-limit"
to enable, disable and query the dirty page limit for virtual CPUs.
Meanwhile, introduce the corresponding HMP commands
"set_vcpu_dirty_limit", "cancel_vcpu_dirty_limit" and
"info vcpu_dirty_limit" to make the feature easier to use.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
hmp-commands-info.hx | 13 ++++
hmp-commands.hx | 32 +++++++++
include/monitor/hmp.h | 3 +
qapi/migration.json | 80 +++++++++++++++++++++
softmmu/dirtylimit.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 323 insertions(+)
diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index e90f20a..61b23d2 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -865,6 +865,19 @@ SRST
Display the vcpu dirty rate information.
ERST
+ {
+ .name = "vcpu_dirty_limit",
+ .args_type = "",
+ .params = "",
+ .help = "show dirty page limit information of all vCPU",
+ .cmd = hmp_info_vcpu_dirty_limit,
+ },
+
+SRST
+ ``info vcpu_dirty_limit``
+ Display the vcpu dirty page limit information.
+ERST
+
#if defined(TARGET_I386)
{
.name = "sgx",
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 70a9136..5bedee2 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1744,3 +1744,35 @@ ERST
"\n\t\t\t -b to specify dirty bitmap as method of calculation)",
.cmd = hmp_calc_dirty_rate,
},
+
+SRST
+``set_vcpu_dirty_limit``
+ Set dirty page rate limit on virtual CPU, the information about all the
+ virtual CPU dirty limit status can be observed with ``info vcpu_dirty_limit``
+ command.
+ERST
+
+ {
+ .name = "set_vcpu_dirty_limit",
+ .args_type = "dirty_rate:l,cpu_index:l?",
+ .params = "dirty_rate [cpu_index]",
+ .help = "set dirty page rate limit, use cpu_index to set limit"
+ "\n\t\t\t\t\t on a specified virtual cpu",
+ .cmd = hmp_set_vcpu_dirty_limit,
+ },
+
+SRST
+``cancel_vcpu_dirty_limit``
+ Cancel dirty page rate limit on virtual CPU, the information about all the
+ virtual CPU dirty limit status can be observed with ``info vcpu_dirty_limit``
+ command.
+ERST
+
+ {
+ .name = "cancel_vcpu_dirty_limit",
+ .args_type = "cpu_index:l?",
+ .params = "[cpu_index]",
+ .help = "cancel dirty page rate limit, use cpu_index to cancel"
+ "\n\t\t\t\t\t limit on a specified virtual cpu",
+ .cmd = hmp_cancel_vcpu_dirty_limit,
+ },
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 96d0148..478820e 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -131,6 +131,9 @@ void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
void hmp_replay_seek(Monitor *mon, const QDict *qdict);
void hmp_info_dirty_rate(Monitor *mon, const QDict *qdict);
void hmp_calc_dirty_rate(Monitor *mon, const QDict *qdict);
+void hmp_set_vcpu_dirty_limit(Monitor *mon, const QDict *qdict);
+void hmp_cancel_vcpu_dirty_limit(Monitor *mon, const QDict *qdict);
+void hmp_info_vcpu_dirty_limit(Monitor *mon, const QDict *qdict);
void hmp_human_readable_text_helper(Monitor *mon,
HumanReadableText *(*qmp_handler)(Error **));
diff --git a/qapi/migration.json b/qapi/migration.json
index 5975a0e..2ccbb92 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1861,6 +1861,86 @@
{ 'command': 'query-dirty-rate', 'returns': 'DirtyRateInfo' }
##
+# @DirtyLimitInfo:
+#
+# Dirty page rate limit information of a virtual CPU.
+#
+# @cpu-index: index of a virtual CPU.
+#
+# @limit-rate: upper limit of dirty page rate (MB/s) for a virtual
+# CPU, 0 means unlimited.
+#
+# @current-rate: current dirty page rate (MB/s) for a virtual CPU.
+#
+# Since: 7.0
+#
+##
+{ 'struct': 'DirtyLimitInfo',
+ 'data': { 'cpu-index': 'int',
+ 'limit-rate': 'uint64',
+ 'current-rate': 'uint64' } }
+
+##
+# @set-vcpu-dirty-limit:
+#
+# Set the upper limit of dirty page rate for virtual CPUs.
+#
+# Requires KVM with accelerator property "dirty-ring-size" set.
+# A virtual CPU's dirty page rate is a measure of its memory load.
+# To observe dirty page rates, use @calc-dirty-rate.
+#
+# @cpu-index: index of a virtual CPU, default is all.
+#
+# @dirty-rate: upper limit of dirty page rate (MB/s) for virtual CPUs.
+#
+# Since: 7.0
+#
+# Example:
+# {"execute": "set-vcpu-dirty-limit",
+# "arguments": { "dirty-rate": 200,
+# "cpu-index": 1 } }
+#
+##
+{ 'command': 'set-vcpu-dirty-limit',
+ 'data': { '*cpu-index': 'int',
+ 'dirty-rate': 'uint64' } }
+
+##
+# @cancel-vcpu-dirty-limit:
+#
+# Cancel the upper limit of dirty page rate for virtual CPUs.
+#
+# Cancel the dirty page limit for the vCPU which has been set with
+# the set-vcpu-dirty-limit command. Note that this command requires
+# dirty ring support, the same as "set-vcpu-dirty-limit".
+#
+# @cpu-index: index of a virtual CPU, default is all.
+#
+# Since: 7.0
+#
+# Example:
+# {"execute": "cancel-vcpu-dirty-limit",
+# "arguments": { "cpu-index": 1 } }
+#
+##
+{ 'command': 'cancel-vcpu-dirty-limit',
+ 'data': { '*cpu-index': 'int'} }
+
+##
+# @query-vcpu-dirty-limit:
+#
+# Returns information about virtual CPU dirty page rate limits, if any.
+#
+# Since: 7.0
+#
+# Example:
+# {"execute": "query-vcpu-dirty-limit"}
+#
+##
+{ 'command': 'query-vcpu-dirty-limit',
+ 'returns': [ 'DirtyLimitInfo' ] }
+
+##
# @snapshot-save:
#
# Save a VM snapshot
diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
index 76d0b44..365bd43 100644
--- a/softmmu/dirtylimit.c
+++ b/softmmu/dirtylimit.c
@@ -14,8 +14,12 @@
#include "qapi/error.h"
#include "qemu/main-loop.h"
#include "qapi/qapi-commands-migration.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/error.h"
#include "sysemu/dirtyrate.h"
#include "sysemu/dirtylimit.h"
+#include "monitor/hmp.h"
+#include "monitor/monitor.h"
#include "exec/memory.h"
#include "hw/boards.h"
#include "sysemu/kvm.h"
@@ -405,3 +409,194 @@ void dirtylimit_vcpu_execute(CPUState *cpu)
usleep(cpu->throttle_us_per_full);
}
}
+
+static void dirtylimit_init(void)
+{
+ dirtylimit_state_initialize();
+ dirtylimit_change(true);
+ vcpu_dirty_rate_stat_initialize();
+ vcpu_dirty_rate_stat_start();
+}
+
+static void dirtylimit_cleanup(void)
+{
+ vcpu_dirty_rate_stat_stop();
+ vcpu_dirty_rate_stat_finalize();
+ dirtylimit_change(false);
+ dirtylimit_state_finalize();
+}
+
+void qmp_cancel_vcpu_dirty_limit(bool has_cpu_index,
+ int64_t cpu_index,
+ Error **errp)
+{
+ if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
+ return;
+ }
+
+ if (has_cpu_index && !dirtylimit_vcpu_index_valid(cpu_index)) {
+ error_setg(errp, "incorrect cpu index specified");
+ return;
+ }
+
+ if (!dirtylimit_in_service()) {
+ return;
+ }
+
+ dirtylimit_state_lock();
+
+ if (has_cpu_index) {
+ dirtylimit_set_vcpu(cpu_index, 0, false);
+ } else {
+ dirtylimit_set_all(0, false);
+ }
+
+ if (!dirtylimit_state->limited_nvcpu) {
+ dirtylimit_cleanup();
+ }
+
+ dirtylimit_state_unlock();
+}
+
+void hmp_cancel_vcpu_dirty_limit(Monitor *mon, const QDict *qdict)
+{
+ int64_t cpu_index = qdict_get_try_int(qdict, "cpu_index", -1);
+ Error *err = NULL;
+
+ qmp_cancel_vcpu_dirty_limit(!!(cpu_index != -1), cpu_index, &err);
+ if (err) {
+ hmp_handle_error(mon, err);
+ return;
+ }
+
+ monitor_printf(mon, "[Please use 'info vcpu_dirty_limit' to query "
+ "dirty limit for virtual CPU]\n");
+}
+
+void qmp_set_vcpu_dirty_limit(bool has_cpu_index,
+ int64_t cpu_index,
+ uint64_t dirty_rate,
+ Error **errp)
+{
+ if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
+ error_setg(errp, "dirty page limit feature requires KVM with"
+ " accelerator property 'dirty-ring-size' set");
+ return;
+ }
+
+ if (has_cpu_index && !dirtylimit_vcpu_index_valid(cpu_index)) {
+ error_setg(errp, "incorrect cpu index specified");
+ return;
+ }
+
+ if (!dirty_rate) {
+ qmp_cancel_vcpu_dirty_limit(has_cpu_index, cpu_index, errp);
+ return;
+ }
+
+ dirtylimit_state_lock();
+
+ if (!dirtylimit_in_service()) {
+ dirtylimit_init();
+ }
+
+ if (has_cpu_index) {
+ dirtylimit_set_vcpu(cpu_index, dirty_rate, true);
+ } else {
+ dirtylimit_set_all(dirty_rate, true);
+ }
+
+ dirtylimit_state_unlock();
+}
+
+void hmp_set_vcpu_dirty_limit(Monitor *mon, const QDict *qdict)
+{
+ int64_t dirty_rate = qdict_get_int(qdict, "dirty_rate");
+ int64_t cpu_index = qdict_get_try_int(qdict, "cpu_index", -1);
+ Error *err = NULL;
+
+ qmp_set_vcpu_dirty_limit(!!(cpu_index != -1), cpu_index, dirty_rate, &err);
+ if (err) {
+ hmp_handle_error(mon, err);
+ return;
+ }
+
+ monitor_printf(mon, "[Please use 'info vcpu_dirty_limit' to query "
+ "dirty limit for virtual CPU]\n");
+}
+
+static struct DirtyLimitInfo *dirtylimit_query_vcpu(int cpu_index)
+{
+ DirtyLimitInfo *info = NULL;
+
+ info = g_malloc0(sizeof(*info));
+ info->cpu_index = cpu_index;
+ info->limit_rate = dirtylimit_vcpu_get_state(cpu_index)->quota;
+ info->current_rate = vcpu_dirty_rate_get(cpu_index);
+
+ return info;
+}
+
+static struct DirtyLimitInfoList *dirtylimit_query_all(void)
+{
+ int i, index;
+ DirtyLimitInfo *info = NULL;
+ DirtyLimitInfoList *head = NULL, **tail = &head;
+
+ dirtylimit_state_lock();
+
+ if (!dirtylimit_in_service()) {
+ dirtylimit_state_unlock();
+ return NULL;
+ }
+
+ for (i = 0; i < dirtylimit_state->max_cpus; i++) {
+ index = dirtylimit_state->states[i].cpu_index;
+ if (dirtylimit_vcpu_get_state(index)->enabled) {
+ info = dirtylimit_query_vcpu(index);
+ QAPI_LIST_APPEND(tail, info);
+ }
+ }
+
+ dirtylimit_state_unlock();
+
+ return head;
+}
+
+struct DirtyLimitInfoList *qmp_query_vcpu_dirty_limit(Error **errp)
+{
+ if (!dirtylimit_in_service()) {
+ error_setg(errp, "dirty page limit not enabled");
+ return NULL;
+ }
+
+ return dirtylimit_query_all();
+}
+
+void hmp_info_vcpu_dirty_limit(Monitor *mon, const QDict *qdict)
+{
+ DirtyLimitInfoList *limit, *head, *info = NULL;
+ Error *err = NULL;
+
+ if (!dirtylimit_in_service()) {
+ monitor_printf(mon, "Dirty page limit not enabled!\n");
+ return;
+ }
+
+ info = qmp_query_vcpu_dirty_limit(&err);
+ if (err) {
+ hmp_handle_error(mon, err);
+ return;
+ }
+
+ head = info;
+ for (limit = head; limit != NULL; limit = limit->next) {
+ monitor_printf(mon, "vcpu[%"PRIi64"], limit rate %"PRIi64 " (MB/s),"
+ " current rate %"PRIi64 " (MB/s)\n",
+ limit->value->cpu_index,
+ limit->value->limit_rate,
+ limit->value->current_rate);
+ }
+
+ qapi_free_DirtyLimitInfoList(info);
+}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
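For reference, the commands added in this series are meant to be driven
over QMP; a hypothetical session against the QAPI schema above might look
like this (the cpu-index and rate values are illustrative):

```json
{ "execute": "set-vcpu-dirty-limit",
  "arguments": { "cpu-index": 1, "dirty-rate": 200 } }
{ "return": {} }
{ "execute": "query-vcpu-dirty-limit" }
{ "return": [ { "cpu-index": 1, "limit-rate": 200, "current-rate": 62 } ] }
```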
* [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
` (6 preceding siblings ...)
2022-03-02 17:55 ` [PATCH v17 7/8] softmmu/dirtylimit: Implement dirty page rate limit huangy81
@ 2022-03-02 17:55 ` huangy81
2022-03-03 5:58 ` Markus Armbruster
7 siblings, 1 reply; 22+ messages in thread
From: huangy81 @ 2022-03-02 17:55 UTC (permalink / raw)
To: qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Hyman, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
query-vcpu-dirty-limit success depends on enabling dirty
page rate limit, so just add it to the list of skipped
commands to ensure qmp-cmd-test runs successfully.
Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
tests/qtest/qmp-cmd-test.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tests/qtest/qmp-cmd-test.c b/tests/qtest/qmp-cmd-test.c
index 7f103ea..4b216a0 100644
--- a/tests/qtest/qmp-cmd-test.c
+++ b/tests/qtest/qmp-cmd-test.c
@@ -110,6 +110,8 @@ static bool query_is_ignored(const char *cmd)
"query-sev-capabilities",
"query-sgx",
"query-sgx-capabilities",
+ /* Success depends on enabling dirty page rate limit */
+ "query-vcpu-dirty-limit",
NULL
};
int i;
--
1.8.3.1
* Re: [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test
2022-03-02 17:55 ` [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test huangy81
@ 2022-03-03 5:58 ` Markus Armbruster
2022-03-03 6:11 ` Hyman Huang
0 siblings, 1 reply; 22+ messages in thread
From: Markus Armbruster @ 2022-03-03 5:58 UTC (permalink / raw)
To: huangy81
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela,
Richard Henderson, qemu-devel, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
huangy81@chinatelecom.cn writes:
> From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
>
> query-vcpu-dirty-limit success depends on enabling dirty
> page rate limit, so just add it to the list of skipped
> commands to ensure qmp-cmd-test runs successfully.
>
> Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
> tests/qtest/qmp-cmd-test.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/tests/qtest/qmp-cmd-test.c b/tests/qtest/qmp-cmd-test.c
> index 7f103ea..4b216a0 100644
> --- a/tests/qtest/qmp-cmd-test.c
> +++ b/tests/qtest/qmp-cmd-test.c
> @@ -110,6 +110,8 @@ static bool query_is_ignored(const char *cmd)
> "query-sev-capabilities",
> "query-sgx",
> "query-sgx-capabilities",
> + /* Success depends on enabling dirty page rate limit */
> + "query-vcpu-dirty-limit",
> NULL
> };
> int i;
I think you need to squash this into the patch that adds the command, so
"make check" passes after each commit.
As far as I can tell, there is no other test coverage. That gap needs
to be plugged.
* Re: [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test
2022-03-03 5:58 ` Markus Armbruster
@ 2022-03-03 6:11 ` Hyman Huang
0 siblings, 0 replies; 22+ messages in thread
From: Hyman Huang @ 2022-03-03 6:11 UTC (permalink / raw)
To: Markus Armbruster
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela,
Richard Henderson, qemu-devel, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
在 2022/3/3 13:58, Markus Armbruster 写道:
> huangy81@chinatelecom.cn writes:
>
>> From: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
>>
>> query-vcpu-dirty-limit success depends on enabling dirty
>> page rate limit, so just add it to the list of skipped
>> commands to ensure qmp-cmd-test runs successfully.
>>
>> Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
>> Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> ---
>> tests/qtest/qmp-cmd-test.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/tests/qtest/qmp-cmd-test.c b/tests/qtest/qmp-cmd-test.c
>> index 7f103ea..4b216a0 100644
>> --- a/tests/qtest/qmp-cmd-test.c
>> +++ b/tests/qtest/qmp-cmd-test.c
>> @@ -110,6 +110,8 @@ static bool query_is_ignored(const char *cmd)
>> "query-sev-capabilities",
>> "query-sgx",
>> "query-sgx-capabilities",
>> + /* Success depends on enabling dirty page rate limit */
>> + "query-vcpu-dirty-limit",
>> NULL
>> };
>> int i;
>
> I think you need to squash this into the patch that adds the command, so
> "make check" passes after each commit.
>
> As far as I can tell, there is no other test coverage. That gap needs
> to be plugged.
>
Indeed, please ignore this version. I have done this in version 18. :)
--
Best regards
Hyman Huang(黄勇)
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-03-02 17:55 ` [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle huangy81
@ 2022-05-16 17:13 ` manish.mishra
2022-05-17 8:19 ` Hyman Huang
0 siblings, 1 reply; 22+ messages in thread
From: manish.mishra @ 2022-05-16 17:13 UTC (permalink / raw)
To: huangy81, qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
Hi Hyman Huang,
I had a few doubts regarding this patch series.
1. Why did we choose a dirty rate limit per vCPU? It becomes very hard for
the user to decide a per-vCPU dirty rate limit. For example, say we have a
1 Gbps network and a 10-vCPU VM, and someone wants the convergence criterion
to be that the total dirty rate of the VM is less than half of the available
bandwidth. To ensure convergence in this case, the user has to set the
per-vCPU dirty rate limit to 1 Gbps / 2 / 10 = 50 Mbps. But assume the VM
has only one thread actively dirtying memory; then much of the available
quota is wasted. So would it not be better to control the dirty rate limit
per VM instead of per vCPU?
2. Also, here we adaptively adjust the sleep time based on the currently
observed dirty rate and the dirty rate limit. Could it be more forceful?
Assume we have a dirty rate limit of 10 pages per second and
auto-converge/dirty rate limit was triggered at time 0. Now if, at time
10 s, the number of pages dirtied is more than 100 pages, we sleep for an
interpolated amount of time. Basically, at every dirty ring exit we can
check whether the current number of pages dirtied exceeds what should be
allowed by this time.
thanks
Manish Mishra
On 02/03/22 11:25 pm, huangy81@chinatelecom.cn wrote:
> From: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>
> Setup a negative feedback system when vCPU thread
> handling KVM_EXIT_DIRTY_RING_FULL exit by introducing
> throttle_us_per_full field in struct CPUState. Sleep
> throttle_us_per_full microseconds to throttle vCPU
> if dirtylimit is in service.
>
> Signed-off-by: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
> Reviewed-by: Peter Xu<peterx@redhat.com>
> ---
> accel/kvm/kvm-all.c | 19 ++-
> include/hw/core/cpu.h | 6 +
> include/sysemu/dirtylimit.h | 15 +++
> softmmu/dirtylimit.c | 291 ++++++++++++++++++++++++++++++++++++++++++++
> softmmu/trace-events | 7 ++
> 5 files changed, 337 insertions(+), 1 deletion(-)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 8821d80..98e43e6 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -45,6 +45,7 @@
> #include "qemu/guest-random.h"
> #include "sysemu/hw_accel.h"
> #include "kvm-cpus.h"
> +#include "sysemu/dirtylimit.h"
>
> #include "hw/boards.h"
>
> @@ -476,6 +477,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
> cpu->kvm_state = s;
> cpu->vcpu_dirty = true;
> cpu->dirty_pages = 0;
> + cpu->throttle_us_per_full = 0;
>
> mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
> if (mmap_size < 0) {
> @@ -1469,6 +1471,11 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
> */
> sleep(1);
>
> + /* keep sleeping so that dirtylimit is not interfered with by the reaper */
> + if (dirtylimit_in_service()) {
> + continue;
> + }
> +
> trace_kvm_dirty_ring_reaper("wakeup");
> r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
>
> @@ -2964,8 +2971,18 @@ int kvm_cpu_exec(CPUState *cpu)
> */
> trace_kvm_dirty_ring_full(cpu->cpu_index);
> qemu_mutex_lock_iothread();
> - kvm_dirty_ring_reap(kvm_state, NULL);
> + /* We throttle the vCPU by making it sleep once it exits to
> + * user space due to dirty ring full. In the dirtylimit scenario,
> + * reaping all vCPUs after a single vCPU's dirty ring gets full
> + * results in missed sleeps, so just reap the vCPU whose ring is full.
> + */
> + if (dirtylimit_in_service()) {
> + kvm_dirty_ring_reap(kvm_state, cpu);
> + } else {
> + kvm_dirty_ring_reap(kvm_state, NULL);
> + }
> qemu_mutex_unlock_iothread();
> + dirtylimit_vcpu_execute(cpu);
> ret = 0;
> break;
> case KVM_EXIT_SYSTEM_EVENT:
> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
> index 76ab3b8..dbeb31a 100644
> --- a/include/hw/core/cpu.h
> +++ b/include/hw/core/cpu.h
> @@ -411,6 +411,12 @@ struct CPUState {
> */
> bool throttle_thread_scheduled;
>
> + /*
> + * Sleep throttle_us_per_full microseconds once dirty ring is full
> + * if dirty page rate limit is enabled.
> + */
> + int64_t throttle_us_per_full;
> +
> bool ignore_memory_transaction_failures;
>
> /* Used for user-only emulation of prctl(PR_SET_UNALIGN). */
> diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
> index da459f0..8d2c1f3 100644
> --- a/include/sysemu/dirtylimit.h
> +++ b/include/sysemu/dirtylimit.h
> @@ -19,4 +19,19 @@ void vcpu_dirty_rate_stat_start(void);
> void vcpu_dirty_rate_stat_stop(void);
> void vcpu_dirty_rate_stat_initialize(void);
> void vcpu_dirty_rate_stat_finalize(void);
> +
> +void dirtylimit_state_lock(void);
> +void dirtylimit_state_unlock(void);
> +void dirtylimit_state_initialize(void);
> +void dirtylimit_state_finalize(void);
> +bool dirtylimit_in_service(void);
> +bool dirtylimit_vcpu_index_valid(int cpu_index);
> +void dirtylimit_process(void);
> +void dirtylimit_change(bool start);
> +void dirtylimit_set_vcpu(int cpu_index,
> + uint64_t quota,
> + bool enable);
> +void dirtylimit_set_all(uint64_t quota,
> + bool enable);
> +void dirtylimit_vcpu_execute(CPUState *cpu);
> #endif
> diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
> index 6102e8c..76d0b44 100644
> --- a/softmmu/dirtylimit.c
> +++ b/softmmu/dirtylimit.c
> @@ -18,6 +18,26 @@
> #include "sysemu/dirtylimit.h"
> #include "exec/memory.h"
> #include "hw/boards.h"
> +#include "sysemu/kvm.h"
> +#include "trace.h"
> +
> +/*
> + * Dirtylimit stops working if the dirty page rate error
> + * value is less than DIRTYLIMIT_TOLERANCE_RANGE
> + */
> +#define DIRTYLIMIT_TOLERANCE_RANGE 25 /* MB/s */
> +/*
> + * Increase or decrease the vcpu sleep time linearly if the
> + * dirty page rate error percentage is over
> + * DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT.
> + * Otherwise, add or subtract a fixed vcpu sleep time.
> + */
> +#define DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT 50
> +/*
> + * Max vcpu sleep time percentage during a cycle
> + * composed of dirty ring full and sleep time.
> + */
> +#define DIRTYLIMIT_THROTTLE_PCT_MAX 99
>
> struct {
> VcpuStat stat;
> @@ -25,6 +45,30 @@ struct {
> QemuThread thread;
> } *vcpu_dirty_rate_stat;
>
> +typedef struct VcpuDirtyLimitState {
> + int cpu_index;
> + bool enabled;
> + /*
> + * Quota dirty page rate, unit is MB/s
> + * zero if not enabled.
> + */
> + uint64_t quota;
> +} VcpuDirtyLimitState;
> +
> +struct {
> + VcpuDirtyLimitState *states;
> + /* Max cpus number configured by user */
> + int max_cpus;
> + /* Number of vcpu under dirtylimit */
> + int limited_nvcpu;
> +} *dirtylimit_state;
> +
> +/* protect dirtylimit_state */
> +static QemuMutex dirtylimit_mutex;
> +
> +/* dirtylimit thread quit if dirtylimit_quit is true */
> +static bool dirtylimit_quit;
> +
> static void vcpu_dirty_rate_stat_collect(void)
> {
> VcpuStat stat;
> @@ -54,6 +98,9 @@ static void *vcpu_dirty_rate_stat_thread(void *opaque)
>
> while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
> vcpu_dirty_rate_stat_collect();
> + if (dirtylimit_in_service()) {
> + dirtylimit_process();
> + }
> }
>
> /* stop log sync */
> @@ -86,9 +133,11 @@ void vcpu_dirty_rate_stat_start(void)
> void vcpu_dirty_rate_stat_stop(void)
> {
> qatomic_set(&vcpu_dirty_rate_stat->running, 0);
> + dirtylimit_state_unlock();
> qemu_mutex_unlock_iothread();
> qemu_thread_join(&vcpu_dirty_rate_stat->thread);
> qemu_mutex_lock_iothread();
> + dirtylimit_state_lock();
> }
>
> void vcpu_dirty_rate_stat_initialize(void)
> @@ -114,3 +163,245 @@ void vcpu_dirty_rate_stat_finalize(void)
> free(vcpu_dirty_rate_stat);
> vcpu_dirty_rate_stat = NULL;
> }
> +
> +void dirtylimit_state_lock(void)
> +{
> + qemu_mutex_lock(&dirtylimit_mutex);
> +}
> +
> +void dirtylimit_state_unlock(void)
> +{
> + qemu_mutex_unlock(&dirtylimit_mutex);
> +}
> +
> +static void
> +__attribute__((__constructor__)) dirtylimit_mutex_init(void)
> +{
> + qemu_mutex_init(&dirtylimit_mutex);
> +}
> +
> +static inline VcpuDirtyLimitState *dirtylimit_vcpu_get_state(int cpu_index)
> +{
> + return &dirtylimit_state->states[cpu_index];
> +}
> +
> +void dirtylimit_state_initialize(void)
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> + int max_cpus = ms->smp.max_cpus;
> + int i;
> +
> + dirtylimit_state = g_malloc0(sizeof(*dirtylimit_state));
> +
> + dirtylimit_state->states =
> + g_malloc0(sizeof(VcpuDirtyLimitState) * max_cpus);
> +
> + for (i = 0; i < max_cpus; i++) {
> + dirtylimit_state->states[i].cpu_index = i;
> + }
> +
> + dirtylimit_state->max_cpus = max_cpus;
> + trace_dirtylimit_state_initialize(max_cpus);
> +}
> +
> +void dirtylimit_state_finalize(void)
> +{
> + free(dirtylimit_state->states);
> + dirtylimit_state->states = NULL;
> +
> + free(dirtylimit_state);
> + dirtylimit_state = NULL;
> +
> + trace_dirtylimit_state_finalize();
> +}
> +
> +bool dirtylimit_in_service(void)
> +{
> + return !!dirtylimit_state;
> +}
> +
> +bool dirtylimit_vcpu_index_valid(int cpu_index)
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> +
> + return !(cpu_index < 0 ||
> + cpu_index >= ms->smp.max_cpus);
> +}
> +
> +static inline int64_t dirtylimit_dirty_ring_full_time(uint64_t dirtyrate)
> +{
> + static uint64_t max_dirtyrate;
> + uint32_t dirty_ring_size = kvm_dirty_ring_size();
> + uint64_t dirty_ring_size_memory_MB =
> + dirty_ring_size * TARGET_PAGE_SIZE >> 20;
> +
> + if (max_dirtyrate < dirtyrate) {
> + max_dirtyrate = dirtyrate;
> + }
> +
> + return dirty_ring_size_memory_MB * 1000000 / max_dirtyrate;
> +}
> +
> +static inline bool dirtylimit_done(uint64_t quota,
> + uint64_t current)
> +{
> + uint64_t min, max;
> +
> + min = MIN(quota, current);
> + max = MAX(quota, current);
> +
> + return ((max - min) <= DIRTYLIMIT_TOLERANCE_RANGE) ? true : false;
> +}
> +
> +static inline bool
> +dirtylimit_need_linear_adjustment(uint64_t quota,
> + uint64_t current)
> +{
> + uint64_t min, max;
> +
> + min = MIN(quota, current);
> + max = MAX(quota, current);
> +
> + return ((max - min) * 100 / max) > DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT;
> +}
> +
> +static void dirtylimit_set_throttle(CPUState *cpu,
> + uint64_t quota,
> + uint64_t current)
> +{
> + int64_t ring_full_time_us = 0;
> + uint64_t sleep_pct = 0;
> + uint64_t throttle_us = 0;
> +
> + if (current == 0) {
> + cpu->throttle_us_per_full = 0;
> + return;
> + }
> +
> + ring_full_time_us = dirtylimit_dirty_ring_full_time(current);
> +
> + if (dirtylimit_need_linear_adjustment(quota, current)) {
> + if (quota < current) {
> + sleep_pct = (current - quota) * 100 / current;
> + throttle_us =
> + ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
> + cpu->throttle_us_per_full += throttle_us;
> + } else {
> + sleep_pct = (quota - current) * 100 / quota;
> + throttle_us =
> + ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
> + cpu->throttle_us_per_full -= throttle_us;
> + }
> +
> + trace_dirtylimit_throttle_pct(cpu->cpu_index,
> + sleep_pct,
> + throttle_us);
> + } else {
> + if (quota < current) {
> + cpu->throttle_us_per_full += ring_full_time_us / 10;
> + } else {
> + cpu->throttle_us_per_full -= ring_full_time_us / 10;
> + }
> + }
> +
> + /*
> + * TODO: in the big kvm_dirty_ring_size case (e.g. 65536, or other scenarios),
> + * the current dirty page rate may never reach the quota; should we
> + * stop increasing the sleep time?
> + */
> + cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
> + ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
> +
> + cpu->throttle_us_per_full = MAX(cpu->throttle_us_per_full, 0);
> +}
> +
> +static void dirtylimit_adjust_throttle(CPUState *cpu)
> +{
> + uint64_t quota = 0;
> + uint64_t current = 0;
> + int cpu_index = cpu->cpu_index;
> +
> + quota = dirtylimit_vcpu_get_state(cpu_index)->quota;
> + current = vcpu_dirty_rate_get(cpu_index);
> +
> + if (!dirtylimit_done(quota, current)) {
> + dirtylimit_set_throttle(cpu, quota, current);
> + }
> +
> + return;
> +}
> +
> +void dirtylimit_process(void)
> +{
> + CPUState *cpu;
> +
> + if (!qatomic_read(&dirtylimit_quit)) {
> + dirtylimit_state_lock();
> +
> + if (!dirtylimit_in_service()) {
> + dirtylimit_state_unlock();
> + return;
> + }
> +
> + CPU_FOREACH(cpu) {
> + if (!dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled) {
> + continue;
> + }
> + dirtylimit_adjust_throttle(cpu);
> + }
> + dirtylimit_state_unlock();
> + }
> +}
> +
> +void dirtylimit_change(bool start)
> +{
> + if (start) {
> + qatomic_set(&dirtylimit_quit, 0);
> + } else {
> + qatomic_set(&dirtylimit_quit, 1);
> + }
> +}
> +
> +void dirtylimit_set_vcpu(int cpu_index,
> + uint64_t quota,
> + bool enable)
> +{
> + trace_dirtylimit_set_vcpu(cpu_index, quota);
> +
> + if (enable) {
> + dirtylimit_state->states[cpu_index].quota = quota;
> + if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
> + dirtylimit_state->limited_nvcpu++;
> + }
> + } else {
> + dirtylimit_state->states[cpu_index].quota = 0;
> + if (dirtylimit_state->states[cpu_index].enabled) {
> + dirtylimit_state->limited_nvcpu--;
> + }
> + }
> +
> + dirtylimit_state->states[cpu_index].enabled = enable;
> +}
> +
> +void dirtylimit_set_all(uint64_t quota,
> + bool enable)
> +{
> + MachineState *ms = MACHINE(qdev_get_machine());
> + int max_cpus = ms->smp.max_cpus;
> + int i;
> +
> + for (i = 0; i < max_cpus; i++) {
> + dirtylimit_set_vcpu(i, quota, enable);
> + }
> +}
> +
> +void dirtylimit_vcpu_execute(CPUState *cpu)
> +{
> + if (dirtylimit_in_service() &&
> + dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled &&
> + cpu->throttle_us_per_full) {
> + trace_dirtylimit_vcpu_execute(cpu->cpu_index,
> + cpu->throttle_us_per_full);
> + usleep(cpu->throttle_us_per_full);
> + }
> +}
> diff --git a/softmmu/trace-events b/softmmu/trace-events
> index 9c88887..22606dc 100644
> --- a/softmmu/trace-events
> +++ b/softmmu/trace-events
> @@ -31,3 +31,10 @@ runstate_set(int current_state, const char *current_state_str, int new_state, co
> system_wakeup_request(int reason) "reason=%d"
> qemu_system_shutdown_request(int reason) "reason=%d"
> qemu_system_powerdown_request(void) ""
> +
> +#dirtylimit.c
> +dirtylimit_state_initialize(int max_cpus) "dirtylimit state initialize: max cpus %d"
> +dirtylimit_state_finalize(void)
> +dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
> +dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
> +dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-16 17:13 ` manish.mishra
@ 2022-05-17 8:19 ` Hyman Huang
2022-05-23 16:44 ` manish.mishra
0 siblings, 1 reply; 22+ messages in thread
From: Hyman Huang @ 2022-05-17 8:19 UTC (permalink / raw)
To: manish.mishra, qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
Thanks Manish for the comments; I'll give my explanation, and any
supplements are welcome.
在 2022/5/17 1:13, manish.mishra 写道:
> Hi Hyman Huang,
>
> I had few doubts regarding this patch series.
For the first point, I'm guessing that you want to figure out how we
should set the vCPU dirty limit correctly during live migration to make
it converge.
This can be achieved by setting a single dirtylimit value on all vCPUs.
The value need not be exactly half of the available bandwidth, because
the dirtylimit is a sufficient condition for migration success, not a
necessary one.
We can set the dirtylimit to the minimum the user can tolerate; in most
cases migration converges earlier and does the switchover with the real
dirty rate greater than the dirtylimit. This works because QEMU checks
the convergence criteria every iteration; once they are met, QEMU does
the switchover no matter which convergence algorithm is in use.
>
> 1. Why we choose for dirty rate limit per vcpu. I mean it becomes very
> hard for user to decide per
>
> vcpu dirty rate limit. For e.g. we have 1Gbps network and 10 vcpu
> vm. Now if someone wants to
>
> keep criteria for convergence as total dirty rate of VM should be
> lesser than half of available
>
> bandwidth. For this case to ensure convergence user has to give
> dirty rate limit per vcpu
>
> as 1Gbps/ 2 / 10 = 50Mbps. But assume then that VM has only 1
> thread which is actively
>
> dirtying memory, in that case so much of available quota will be
> wasted.
This is a good and frequent question about dirtylimit. As mentioned
above, throttling occurs only when the dirty ring is full and the vCPU
exits to user space.
If a vCPU is set up with a dirtylimit during live migration but does not
dirty memory, it may never get throttled.
The dirtylimit only throttles vCPUs that dirty memory at a rate greater
than the limit.
> So would it not be better to use dirty rate limit control per VM
> instead of per vCPU?
>
> 2. Also Here we are adaptively trying to adjust sleep time based on
> current obsered dirty rate and
>
> dirty rate limit. Can it be more forceful like. Assume we have
> dirty rate limit of 10pages
>
> per sec and auto-converge/ dirty rate limit was triggered at time
> 0. Now at any point of time assume
>
> at time 10 sec if number of pages dirtyed are more than 100pages we
> sleep for interpolated amount
>
> of time. Basically at every dirty ring exit we can check if current
> number of pages dirtied are more than
>
> what should be allowed by this time?
Yes, indeed, but as mentioned above, a dirty ring exit gives QEMU a hint
that the vCPU is dirtying memory, so that is when we should check it.
I posted the dirtylimit capability series as an RFC; maybe it can help
me explain the usage of the vCPU dirty limit. It can be found here:
Thanks,
Yong
>
> thanks
>
> Manish Mishra
>
> On 02/03/22 11:25 pm, huangy81@chinatelecom.cn wrote:
>> From: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>>
>> Setup a negative feedback system when vCPU thread
>> handling KVM_EXIT_DIRTY_RING_FULL exit by introducing
>> throttle_us_per_full field in struct CPUState. Sleep
>> throttle_us_per_full microseconds to throttle vCPU
>> if dirtylimit is in service.
>>
>> Signed-off-by: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>> Reviewed-by: Peter Xu<peterx@redhat.com>
>> ---
>> accel/kvm/kvm-all.c | 19 ++-
>> include/hw/core/cpu.h | 6 +
>> include/sysemu/dirtylimit.h | 15 +++
>> softmmu/dirtylimit.c | 291 ++++++++++++++++++++++++++++++++++++++++++++
>> softmmu/trace-events | 7 ++
>> 5 files changed, 337 insertions(+), 1 deletion(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 8821d80..98e43e6 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -45,6 +45,7 @@
>> #include "qemu/guest-random.h"
>> #include "sysemu/hw_accel.h"
>> #include "kvm-cpus.h"
>> +#include "sysemu/dirtylimit.h"
>>
>> #include "hw/boards.h"
>>
>> @@ -476,6 +477,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>> cpu->kvm_state = s;
>> cpu->vcpu_dirty = true;
>> cpu->dirty_pages = 0;
>> + cpu->throttle_us_per_full = 0;
>>
>> mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
>> if (mmap_size < 0) {
>> @@ -1469,6 +1471,11 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
>> */
>> sleep(1);
>>
>> + /* keep sleeping so that dirtylimit is not interfered with by the reaper */
>> + if (dirtylimit_in_service()) {
>> + continue;
>> + }
>> +
>> trace_kvm_dirty_ring_reaper("wakeup");
>> r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
>>
>> @@ -2964,8 +2971,18 @@ int kvm_cpu_exec(CPUState *cpu)
>> */
>> trace_kvm_dirty_ring_full(cpu->cpu_index);
>> qemu_mutex_lock_iothread();
>> - kvm_dirty_ring_reap(kvm_state, NULL);
>> + /* We throttle the vCPU by making it sleep once it exits to
>> + * user space due to dirty ring full. In the dirtylimit scenario,
>> + * reaping all vCPUs after a single vCPU's dirty ring gets full
>> + * results in missed sleeps, so just reap the vCPU whose ring is full.
>> + */
>> + if (dirtylimit_in_service()) {
>> + kvm_dirty_ring_reap(kvm_state, cpu);
>> + } else {
>> + kvm_dirty_ring_reap(kvm_state, NULL);
>> + }
>> qemu_mutex_unlock_iothread();
>> + dirtylimit_vcpu_execute(cpu);
>> ret = 0;
>> break;
>> case KVM_EXIT_SYSTEM_EVENT:
>> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
>> index 76ab3b8..dbeb31a 100644
>> --- a/include/hw/core/cpu.h
>> +++ b/include/hw/core/cpu.h
>> @@ -411,6 +411,12 @@ struct CPUState {
>> */
>> bool throttle_thread_scheduled;
>>
>> + /*
>> + * Sleep throttle_us_per_full microseconds once dirty ring is full
>> + * if dirty page rate limit is enabled.
>> + */
>> + int64_t throttle_us_per_full;
>> +
>> bool ignore_memory_transaction_failures;
>>
>> /* Used for user-only emulation of prctl(PR_SET_UNALIGN). */
>> diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
>> index da459f0..8d2c1f3 100644
>> --- a/include/sysemu/dirtylimit.h
>> +++ b/include/sysemu/dirtylimit.h
>> @@ -19,4 +19,19 @@ void vcpu_dirty_rate_stat_start(void);
>> void vcpu_dirty_rate_stat_stop(void);
>> void vcpu_dirty_rate_stat_initialize(void);
>> void vcpu_dirty_rate_stat_finalize(void);
>> +
>> +void dirtylimit_state_lock(void);
>> +void dirtylimit_state_unlock(void);
>> +void dirtylimit_state_initialize(void);
>> +void dirtylimit_state_finalize(void);
>> +bool dirtylimit_in_service(void);
>> +bool dirtylimit_vcpu_index_valid(int cpu_index);
>> +void dirtylimit_process(void);
>> +void dirtylimit_change(bool start);
>> +void dirtylimit_set_vcpu(int cpu_index,
>> + uint64_t quota,
>> + bool enable);
>> +void dirtylimit_set_all(uint64_t quota,
>> + bool enable);
>> +void dirtylimit_vcpu_execute(CPUState *cpu);
>> #endif
>> diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
>> index 6102e8c..76d0b44 100644
>> --- a/softmmu/dirtylimit.c
>> +++ b/softmmu/dirtylimit.c
>> @@ -18,6 +18,26 @@
>> #include "sysemu/dirtylimit.h"
>> #include "exec/memory.h"
>> #include "hw/boards.h"
>> +#include "sysemu/kvm.h"
>> +#include "trace.h"
>> +
>> +/*
>> + * Dirtylimit stops working if the dirty page rate error
>> + * value is less than DIRTYLIMIT_TOLERANCE_RANGE
>> + */
>> +#define DIRTYLIMIT_TOLERANCE_RANGE 25 /* MB/s */
>> +/*
>> + * Increase or decrease the vcpu sleep time linearly if the
>> + * dirty page rate error percentage is over
>> + * DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT.
>> + * Otherwise, add or subtract a fixed vcpu sleep time.
>> + */
>> +#define DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT 50
>> +/*
>> + * Max vcpu sleep time percentage during a cycle
>> + * composed of dirty ring full and sleep time.
>> + */
>> +#define DIRTYLIMIT_THROTTLE_PCT_MAX 99
>>
>> struct {
>> VcpuStat stat;
>> @@ -25,6 +45,30 @@ struct {
>> QemuThread thread;
>> } *vcpu_dirty_rate_stat;
>>
>> +typedef struct VcpuDirtyLimitState {
>> + int cpu_index;
>> + bool enabled;
>> + /*
>> + * Quota dirty page rate, unit is MB/s
>> + * zero if not enabled.
>> + */
>> + uint64_t quota;
>> +} VcpuDirtyLimitState;
>> +
>> +struct {
>> + VcpuDirtyLimitState *states;
>> + /* Max cpus number configured by user */
>> + int max_cpus;
>> + /* Number of vcpu under dirtylimit */
>> + int limited_nvcpu;
>> +} *dirtylimit_state;
>> +
>> +/* protect dirtylimit_state */
>> +static QemuMutex dirtylimit_mutex;
>> +
>> +/* dirtylimit thread quit if dirtylimit_quit is true */
>> +static bool dirtylimit_quit;
>> +
>> static void vcpu_dirty_rate_stat_collect(void)
>> {
>> VcpuStat stat;
>> @@ -54,6 +98,9 @@ static void *vcpu_dirty_rate_stat_thread(void *opaque)
>>
>> while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
>> vcpu_dirty_rate_stat_collect();
>> + if (dirtylimit_in_service()) {
>> + dirtylimit_process();
>> + }
>> }
>>
>> /* stop log sync */
>> @@ -86,9 +133,11 @@ void vcpu_dirty_rate_stat_start(void)
>> void vcpu_dirty_rate_stat_stop(void)
>> {
>> qatomic_set(&vcpu_dirty_rate_stat->running, 0);
>> + dirtylimit_state_unlock();
>> qemu_mutex_unlock_iothread();
>> qemu_thread_join(&vcpu_dirty_rate_stat->thread);
>> qemu_mutex_lock_iothread();
>> + dirtylimit_state_lock();
>> }
>>
>> void vcpu_dirty_rate_stat_initialize(void)
>> @@ -114,3 +163,245 @@ void vcpu_dirty_rate_stat_finalize(void)
>> free(vcpu_dirty_rate_stat);
>> vcpu_dirty_rate_stat = NULL;
>> }
>> +
>> +void dirtylimit_state_lock(void)
>> +{
>> + qemu_mutex_lock(&dirtylimit_mutex);
>> +}
>> +
>> +void dirtylimit_state_unlock(void)
>> +{
>> + qemu_mutex_unlock(&dirtylimit_mutex);
>> +}
>> +
>> +static void
>> +__attribute__((__constructor__)) dirtylimit_mutex_init(void)
>> +{
>> + qemu_mutex_init(&dirtylimit_mutex);
>> +}
>> +
>> +static inline VcpuDirtyLimitState *dirtylimit_vcpu_get_state(int cpu_index)
>> +{
>> + return &dirtylimit_state->states[cpu_index];
>> +}
>> +
>> +void dirtylimit_state_initialize(void)
>> +{
>> + MachineState *ms = MACHINE(qdev_get_machine());
>> + int max_cpus = ms->smp.max_cpus;
>> + int i;
>> +
>> + dirtylimit_state = g_malloc0(sizeof(*dirtylimit_state));
>> +
>> + dirtylimit_state->states =
>> + g_malloc0(sizeof(VcpuDirtyLimitState) * max_cpus);
>> +
>> + for (i = 0; i < max_cpus; i++) {
>> + dirtylimit_state->states[i].cpu_index = i;
>> + }
>> +
>> + dirtylimit_state->max_cpus = max_cpus;
>> + trace_dirtylimit_state_initialize(max_cpus);
>> +}
>> +
>> +void dirtylimit_state_finalize(void)
>> +{
>> + g_free(dirtylimit_state->states);
>> + dirtylimit_state->states = NULL;
>> +
>> + g_free(dirtylimit_state);
>> + dirtylimit_state = NULL;
>> +
>> + trace_dirtylimit_state_finalize();
>> +}
>> +
>> +bool dirtylimit_in_service(void)
>> +{
>> + return !!dirtylimit_state;
>> +}
>> +
>> +bool dirtylimit_vcpu_index_valid(int cpu_index)
>> +{
>> + MachineState *ms = MACHINE(qdev_get_machine());
>> +
>> + return !(cpu_index < 0 ||
>> + cpu_index >= ms->smp.max_cpus);
>> +}
>> +
>> +static inline int64_t dirtylimit_dirty_ring_full_time(uint64_t dirtyrate)
>> +{
>> + static uint64_t max_dirtyrate;
>> + uint32_t dirty_ring_size = kvm_dirty_ring_size();
>> + uint64_t dirty_ring_size_memory_MB =
>> + dirty_ring_size * TARGET_PAGE_SIZE >> 20;
>> +
>> + if (max_dirtyrate < dirtyrate) {
>> + max_dirtyrate = dirtyrate;
>> + }
>> +
>> + return dirty_ring_size_memory_MB * 1000000 / max_dirtyrate;
>> +}
>> +
>> +static inline bool dirtylimit_done(uint64_t quota,
>> + uint64_t current)
>> +{
>> + uint64_t min, max;
>> +
>> + min = MIN(quota, current);
>> + max = MAX(quota, current);
>> +
>> + return ((max - min) <= DIRTYLIMIT_TOLERANCE_RANGE) ? true : false;
>> +}
>> +
>> +static inline bool
>> +dirtylimit_need_linear_adjustment(uint64_t quota,
>> + uint64_t current)
>> +{
>> + uint64_t min, max;
>> +
>> + min = MIN(quota, current);
>> + max = MAX(quota, current);
>> +
>> + return ((max - min) * 100 / max) > DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT;
>> +}
>> +
>> +static void dirtylimit_set_throttle(CPUState *cpu,
>> + uint64_t quota,
>> + uint64_t current)
>> +{
>> + int64_t ring_full_time_us = 0;
>> + uint64_t sleep_pct = 0;
>> + uint64_t throttle_us = 0;
>> +
>> + if (current == 0) {
>> + cpu->throttle_us_per_full = 0;
>> + return;
>> + }
>> +
>> + ring_full_time_us = dirtylimit_dirty_ring_full_time(current);
>> +
>> + if (dirtylimit_need_linear_adjustment(quota, current)) {
>> + if (quota < current) {
>> + sleep_pct = (current - quota) * 100 / current;
>> + throttle_us =
>> + ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
>> + cpu->throttle_us_per_full += throttle_us;
>> + } else {
>> + sleep_pct = (quota - current) * 100 / quota;
>> + throttle_us =
>> + ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
>> + cpu->throttle_us_per_full -= throttle_us;
>> + }
>> +
>> + trace_dirtylimit_throttle_pct(cpu->cpu_index,
>> + sleep_pct,
>> + throttle_us);
>> + } else {
>> + if (quota < current) {
>> + cpu->throttle_us_per_full += ring_full_time_us / 10;
>> + } else {
>> + cpu->throttle_us_per_full -= ring_full_time_us / 10;
>> + }
>> + }
>> +
>> + /*
>> + * TODO: in the large kvm_dirty_ring_size case (e.g. 65536, or other
>> + * scenarios), the current dirty page rate may never reach the quota;
>> + * should we stop increasing the sleep time?
>> + */
>> + cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
>> + ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
>> +
>> + cpu->throttle_us_per_full = MAX(cpu->throttle_us_per_full, 0);
>> +}
>> +
>> +static void dirtylimit_adjust_throttle(CPUState *cpu)
>> +{
>> + uint64_t quota = 0;
>> + uint64_t current = 0;
>> + int cpu_index = cpu->cpu_index;
>> +
>> + quota = dirtylimit_vcpu_get_state(cpu_index)->quota;
>> + current = vcpu_dirty_rate_get(cpu_index);
>> +
>> + if (!dirtylimit_done(quota, current)) {
>> + dirtylimit_set_throttle(cpu, quota, current);
>> + }
>> +
>> + return;
>> +}
>> +
>> +void dirtylimit_process(void)
>> +{
>> + CPUState *cpu;
>> +
>> + if (!qatomic_read(&dirtylimit_quit)) {
>> + dirtylimit_state_lock();
>> +
>> + if (!dirtylimit_in_service()) {
>> + dirtylimit_state_unlock();
>> + return;
>> + }
>> +
>> + CPU_FOREACH(cpu) {
>> + if (!dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled) {
>> + continue;
>> + }
>> + dirtylimit_adjust_throttle(cpu);
>> + }
>> + dirtylimit_state_unlock();
>> + }
>> +}
>> +
>> +void dirtylimit_change(bool start)
>> +{
>> + if (start) {
>> + qatomic_set(&dirtylimit_quit, 0);
>> + } else {
>> + qatomic_set(&dirtylimit_quit, 1);
>> + }
>> +}
>> +
>> +void dirtylimit_set_vcpu(int cpu_index,
>> + uint64_t quota,
>> + bool enable)
>> +{
>> + trace_dirtylimit_set_vcpu(cpu_index, quota);
>> +
>> + if (enable) {
>> + dirtylimit_state->states[cpu_index].quota = quota;
>> + if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
>> + dirtylimit_state->limited_nvcpu++;
>> + }
>> + } else {
>> + dirtylimit_state->states[cpu_index].quota = 0;
>> + if (dirtylimit_state->states[cpu_index].enabled) {
>> + dirtylimit_state->limited_nvcpu--;
>> + }
>> + }
>> +
>> + dirtylimit_state->states[cpu_index].enabled = enable;
>> +}
>> +
>> +void dirtylimit_set_all(uint64_t quota,
>> + bool enable)
>> +{
>> + MachineState *ms = MACHINE(qdev_get_machine());
>> + int max_cpus = ms->smp.max_cpus;
>> + int i;
>> +
>> + for (i = 0; i < max_cpus; i++) {
>> + dirtylimit_set_vcpu(i, quota, enable);
>> + }
>> +}
>> +
>> +void dirtylimit_vcpu_execute(CPUState *cpu)
>> +{
>> + if (dirtylimit_in_service() &&
>> + dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled &&
>> + cpu->throttle_us_per_full) {
>> + trace_dirtylimit_vcpu_execute(cpu->cpu_index,
>> + cpu->throttle_us_per_full);
>> + usleep(cpu->throttle_us_per_full);
>> + }
>> +}
>> diff --git a/softmmu/trace-events b/softmmu/trace-events
>> index 9c88887..22606dc 100644
>> --- a/softmmu/trace-events
>> +++ b/softmmu/trace-events
>> @@ -31,3 +31,10 @@ runstate_set(int current_state, const char *current_state_str, int new_state, co
>> system_wakeup_request(int reason) "reason=%d"
>> qemu_system_shutdown_request(int reason) "reason=%d"
>> qemu_system_powerdown_request(void) ""
>> +
>> +#dirtylimit.c
>> +dirtylimit_state_initialize(int max_cpus) "dirtylimit state initialize: max cpus %d"
>> +dirtylimit_state_finalize(void)
>> +dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>> +dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>> +dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
--
Best regards
Hyman Huang(黄勇)
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-17 8:19 ` Hyman Huang
@ 2022-05-23 16:44 ` manish.mishra
2022-05-25 15:38 ` Hyman Huang
0 siblings, 1 reply; 22+ messages in thread
From: manish.mishra @ 2022-05-23 16:44 UTC (permalink / raw)
To: Hyman Huang, qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
On 17/05/22 1:49 pm, Hyman Huang wrote:
> Thanks, Manish, for the comments. I'll give my explanation; any additions are welcome.
Really sorry for such a late reply, Hyman; this slipped my mind.
>
> On 2022/5/17 1:13, manish.mishra wrote:
>> Hi Hyman Huang,
>>
>> I had few doubts regarding this patch series.
> For the first point, I'm guessing that you want to figure out how we should set the vCPU dirty limit correctly during live migration to make it converge.
>
> This can be achieved by setting a single dirtylimit value on all vCPUs. The value need not be precisely half of the available bandwidth, since the dirtylimit is a sufficient condition for migration success, not a necessary one.
>
> We can set the dirtylimit to the minimum the user can tolerate. In most cases migration converges early and does the switchover with the real dirty rate still greater than the dirtylimit. This works because QEMU checks the convergence criteria every iteration; once the condition is met, QEMU does the switchover no matter what the convergence algorithm is.
Yes, got it, Hyman. My question was along these lines: if we control the dirty rate per vCPU, the total dirty rate of the VM becomes very unpredictable. For example, if we set a dirty rate limit of 50 MB/s per vCPU on a 10-vCPU VM, the VM's total dirty rate can be anywhere from 0 to 500 MB/s depending on how many vCPUs are active and how busy they are. A per-VM dirty rate limit control would be much more predictable for the user. I mean, we could keep account of the VM's total dirty rate and the individual per-vCPU rates, then assign throttle_sleep according to their weights to keep the total within the per-VM limit. But that can definitely be targeted in the future and should not be a blocker for now.
>>
>> 1. Why we choose for dirty rate limit per vcpu. I mean it becomes very hard for user to decide per
>>
>> vcpu dirty rate limit. For e.g. we have 1Gbps network and 10 vcpu vm. Now if someone wants to
>>
>> keep criteria for convergence as total dirty rate of VM should be lesser than half of available
>>
>> bandwidth. For this case to ensure convergence user has to give dirty rate limit per vcpu
>>
>> as 1Gbps/ 2 / 10 = 50Mbps. But assume then that VM has only 1 thread which is actively
>>
>> dirtying memory, in that case so much of available quota will be wasted.
> This is a good and frequently asked question about dirtylimit. As mentioned above, throttling occurs only when the dirty ring fills up and the vCPU exits to user space.
> If a vCPU is set up with dirtylimit during live migration but does not dirty memory, it may never get throttled.
> The dirtylimit only throttles those vCPUs that dirty memory at a rate greater than the limit.
>
>> So would it not be better to use dirty rate limit control per VM
>> instead of per vcpu?
>>
>> 2. Also Here we are adaptively trying to adjust sleep time based on current obsered dirty rate and
>>
>> dirty rate limit. Can it be more forceful like. Assume we have dirty rate limit of 10pages
>>
>> per sec and auto-converge/ dirty rate limit was triggered at time 0. Now at any point of time assume
>>
>> at time 10 sec if number of pages dirtyed are more than 100pages we sleep for interpolated amount
>>
>> of time. Basically at every dirty ring exit we can check if current number of pages dirtied are more than
>>
>> what should be allowed by this time?
> Yes, indeed. But as mentioned above, a dirty ring exit gives QEMU a hint that the vCPU is dirtying memory, so that is when we should check it.
>
> I posted the dirtylimit capability series as an RFC; maybe it can help explain the usage of the vCPU dirty limit. It can be found here:
> https://lore.kernel.org/qemu-devel/cover.1652762652.git.huangy81@chinatelecom.cn/
> Thanks,
> Yong
Thanks, I read this.
Also, I had a few additional things in mind.
1. I see there is no limit on cpu->throttle_us_per_full. I see the line below, but ring_full_time_us can be a very high value, so in some rare cases cpu->throttle_us_per_full can become very high. I know a few database applications that cannot tolerate continuous sleeps of more than 2 seconds. I agree the user should not configure a very low dirty rate limit to avoid such a situation, but the user may not have enough insight into the algorithm, so we had better keep internal limits?
cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
2. Also, this algorithm only limits dirtying done by guest writes. There can be memory dirtying done by virtio-based devices which is accounted only at the QEMU level, so it may not be accounted through the dirty rings; do we have a plan for that in the future? This is not an issue for auto-converge, since it slows the whole VM, but the dirty rate limit only slows guest writes.
>>
>> thanks
>>
>> Manish Mishra
>>
>> On 02/03/22 11:25 pm, huangy81@chinatelecom.cn wrote:
>>> From: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>>>
>>> Setup a negative feedback system when vCPU thread
>>> handling KVM_EXIT_DIRTY_RING_FULL exit by introducing
>>> throttle_us_per_full field in struct CPUState. Sleep
>>> throttle_us_per_full microseconds to throttle vCPU
>>> if dirtylimit is in service.
>>>
>>> Signed-off-by: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>>> Reviewed-by: Peter Xu<peterx@redhat.com>
>>> ---
>>> accel/kvm/kvm-all.c | 19 ++-
>>> include/hw/core/cpu.h | 6 +
>>> include/sysemu/dirtylimit.h | 15 +++
>>> softmmu/dirtylimit.c | 291 ++++++++++++++++++++++++++++++++++++++++++++
>>> softmmu/trace-events | 7 ++
>>> 5 files changed, 337 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 8821d80..98e43e6 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -45,6 +45,7 @@
>>> #include "qemu/guest-random.h"
>>> #include "sysemu/hw_accel.h"
>>> #include "kvm-cpus.h"
>>> +#include "sysemu/dirtylimit.h"
>>> #include "hw/boards.h"
>>> @@ -476,6 +477,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>>> cpu->kvm_state = s;
>>> cpu->vcpu_dirty = true;
>>> cpu->dirty_pages = 0;
>>> + cpu->throttle_us_per_full = 0;
>>> mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
>>> if (mmap_size < 0) {
>>> @@ -1469,6 +1471,11 @@ static void *kvm_dirty_ring_reaper_thread(void *data)
>>> */
>>> sleep(1);
>>> + /* keep sleeping so that dirtylimit is not disturbed by the reaper */
>>> + if (dirtylimit_in_service()) {
>>> + continue;
>>> + }
>>> +
>>> trace_kvm_dirty_ring_reaper("wakeup");
>>> r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
>>> @@ -2964,8 +2971,18 @@ int kvm_cpu_exec(CPUState *cpu)
>>> */
>>> trace_kvm_dirty_ring_full(cpu->cpu_index);
>>> qemu_mutex_lock_iothread();
>>> - kvm_dirty_ring_reap(kvm_state, NULL);
>>> + /* We throttle the vCPU by making it sleep once it exits from the
>>> + * kernel due to a dirty ring full. In the dirtylimit scenario,
>>> + * reaping all vCPUs after a single vCPU's dirty ring gets full
>>> + * would let vCPUs miss their chance to sleep, so just reap the
>>> + * vCPU whose ring is full.
>>> + */
>>> + if (dirtylimit_in_service()) {
>>> + kvm_dirty_ring_reap(kvm_state, cpu);
>>> + } else {
>>> + kvm_dirty_ring_reap(kvm_state, NULL);
>>> + }
>>> qemu_mutex_unlock_iothread();
>>> + dirtylimit_vcpu_execute(cpu);
>>> ret = 0;
>>> break;
>>> case KVM_EXIT_SYSTEM_EVENT:
>>> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
>>> index 76ab3b8..dbeb31a 100644
>>> --- a/include/hw/core/cpu.h
>>> +++ b/include/hw/core/cpu.h
>>> @@ -411,6 +411,12 @@ struct CPUState {
>>> */
>>> bool throttle_thread_scheduled;
>>> + /*
>>> + * Sleep throttle_us_per_full microseconds once dirty ring is full
>>> + * if dirty page rate limit is enabled.
>>> + */
>>> + int64_t throttle_us_per_full;
>>> +
>>> bool ignore_memory_transaction_failures;
>>> /* Used for user-only emulation of prctl(PR_SET_UNALIGN). */
>>> diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
>>> index da459f0..8d2c1f3 100644
>>> --- a/include/sysemu/dirtylimit.h
>>> +++ b/include/sysemu/dirtylimit.h
>>> @@ -19,4 +19,19 @@ void vcpu_dirty_rate_stat_start(void);
>>> void vcpu_dirty_rate_stat_stop(void);
>>> void vcpu_dirty_rate_stat_initialize(void);
>>> void vcpu_dirty_rate_stat_finalize(void);
>>> +
>>> +void dirtylimit_state_lock(void);
>>> +void dirtylimit_state_unlock(void);
>>> +void dirtylimit_state_initialize(void);
>>> +void dirtylimit_state_finalize(void);
>>> +bool dirtylimit_in_service(void);
>>> +bool dirtylimit_vcpu_index_valid(int cpu_index);
>>> +void dirtylimit_process(void);
>>> +void dirtylimit_change(bool start);
>>> +void dirtylimit_set_vcpu(int cpu_index,
>>> + uint64_t quota,
>>> + bool enable);
>>> +void dirtylimit_set_all(uint64_t quota,
>>> + bool enable);
>>> +void dirtylimit_vcpu_execute(CPUState *cpu);
>>> #endif
>>> diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
>>> index 6102e8c..76d0b44 100644
>>> --- a/softmmu/dirtylimit.c
>>> +++ b/softmmu/dirtylimit.c
>>> @@ -18,6 +18,26 @@
>>> #include "sysemu/dirtylimit.h"
>>> #include "exec/memory.h"
>>> #include "hw/boards.h"
>>> +#include "sysemu/kvm.h"
>>> +#include "trace.h"
>>> +
>>> +/*
>>> + * Dirtylimit stops working if the dirty page rate error
>>> + * value is less than DIRTYLIMIT_TOLERANCE_RANGE
>>> + */
>>> +#define DIRTYLIMIT_TOLERANCE_RANGE 25 /* MB/s */
>>> +/*
>>> + * Add or subtract vcpu sleep time linearly if the dirty
>>> + * page rate error percentage is over
>>> + * DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT.
>>> + * Otherwise, add or subtract a fixed vcpu sleep time.
>>> + */
>>> +#define DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT 50
>>> +/*
>>> + * Max vcpu sleep time percentage during a cycle
>>> + * composed of dirty ring full and sleep time.
>>> + */
>>> +#define DIRTYLIMIT_THROTTLE_PCT_MAX 99
>>> struct {
>>> VcpuStat stat;
>>> @@ -25,6 +45,30 @@ struct {
>>> QemuThread thread;
>>> } *vcpu_dirty_rate_stat;
>>> +typedef struct VcpuDirtyLimitState {
>>> + int cpu_index;
>>> + bool enabled;
>>> + /*
>>> + * Quota dirty page rate, unit is MB/s
>>> + * zero if not enabled.
>>> + */
>>> + uint64_t quota;
>>> +} VcpuDirtyLimitState;
>>> +
>>> +struct {
>>> + VcpuDirtyLimitState *states;
>>> + /* Max cpus number configured by user */
>>> + int max_cpus;
>>> + /* Number of vcpu under dirtylimit */
>>> + int limited_nvcpu;
>>> +} *dirtylimit_state;
>>> +
>>> +/* protect dirtylimit_state */
>>> +static QemuMutex dirtylimit_mutex;
>>> +
>>> +/* dirtylimit thread quit if dirtylimit_quit is true */
>>> +static bool dirtylimit_quit;
>>> +
>>> static void vcpu_dirty_rate_stat_collect(void)
>>> {
>>> VcpuStat stat;
>>> @@ -54,6 +98,9 @@ static void *vcpu_dirty_rate_stat_thread(void *opaque)
>>> while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
>>> vcpu_dirty_rate_stat_collect();
>>> + if (dirtylimit_in_service()) {
>>> + dirtylimit_process();
>>> + }
>>> }
>>> /* stop log sync */
>>> @@ -86,9 +133,11 @@ void vcpu_dirty_rate_stat_start(void)
>>> void vcpu_dirty_rate_stat_stop(void)
>>> {
>>> qatomic_set(&vcpu_dirty_rate_stat->running, 0);
>>> + dirtylimit_state_unlock();
>>> qemu_mutex_unlock_iothread();
>>> qemu_thread_join(&vcpu_dirty_rate_stat->thread);
>>> qemu_mutex_lock_iothread();
>>> + dirtylimit_state_lock();
>>> }
>>> void vcpu_dirty_rate_stat_initialize(void)
>>> @@ -114,3 +163,245 @@ void vcpu_dirty_rate_stat_finalize(void)
>>> free(vcpu_dirty_rate_stat);
>>> vcpu_dirty_rate_stat = NULL;
>>> }
>>> +
>>> +void dirtylimit_state_lock(void)
>>> +{
>>> + qemu_mutex_lock(&dirtylimit_mutex);
>>> +}
>>> +
>>> +void dirtylimit_state_unlock(void)
>>> +{
>>> + qemu_mutex_unlock(&dirtylimit_mutex);
>>> +}
>>> +
>>> +static void
>>> +__attribute__((__constructor__)) dirtylimit_mutex_init(void)
>>> +{
>>> + qemu_mutex_init(&dirtylimit_mutex);
>>> +}
>>> +
>>> +static inline VcpuDirtyLimitState *dirtylimit_vcpu_get_state(int cpu_index)
>>> +{
>>> + return &dirtylimit_state->states[cpu_index];
>>> +}
>>> +
>>> +void dirtylimit_state_initialize(void)
>>> +{
>>> + MachineState *ms = MACHINE(qdev_get_machine());
>>> + int max_cpus = ms->smp.max_cpus;
>>> + int i;
>>> +
>>> + dirtylimit_state = g_malloc0(sizeof(*dirtylimit_state));
>>> +
>>> + dirtylimit_state->states =
>>> + g_malloc0(sizeof(VcpuDirtyLimitState) * max_cpus);
>>> +
>>> + for (i = 0; i < max_cpus; i++) {
>>> + dirtylimit_state->states[i].cpu_index = i;
>>> + }
>>> +
>>> + dirtylimit_state->max_cpus = max_cpus;
>>> + trace_dirtylimit_state_initialize(max_cpus);
>>> +}
>>> +
>>> +void dirtylimit_state_finalize(void)
>>> +{
>>> + g_free(dirtylimit_state->states);
>>> + dirtylimit_state->states = NULL;
>>> +
>>> + g_free(dirtylimit_state);
>>> + dirtylimit_state = NULL;
>>> +
>>> + trace_dirtylimit_state_finalize();
>>> +}
>>> +
>>> +bool dirtylimit_in_service(void)
>>> +{
>>> + return !!dirtylimit_state;
>>> +}
>>> +
>>> +bool dirtylimit_vcpu_index_valid(int cpu_index)
>>> +{
>>> + MachineState *ms = MACHINE(qdev_get_machine());
>>> +
>>> + return !(cpu_index < 0 ||
>>> + cpu_index >= ms->smp.max_cpus);
>>> +}
>>> +
>>> +static inline int64_t dirtylimit_dirty_ring_full_time(uint64_t dirtyrate)
>>> +{
>>> + static uint64_t max_dirtyrate;
>>> + uint32_t dirty_ring_size = kvm_dirty_ring_size();
>>> + uint64_t dirty_ring_size_memory_MB =
>>> + dirty_ring_size * TARGET_PAGE_SIZE >> 20;
>>> +
>>> + if (max_dirtyrate < dirtyrate) {
>>> + max_dirtyrate = dirtyrate;
>>> + }
>>> +
>>> + return dirty_ring_size_memory_MB * 1000000 / max_dirtyrate;
>>> +}
>>> +
>>> +static inline bool dirtylimit_done(uint64_t quota,
>>> + uint64_t current)
>>> +{
>>> + uint64_t min, max;
>>> +
>>> + min = MIN(quota, current);
>>> + max = MAX(quota, current);
>>> +
>>> + return ((max - min) <= DIRTYLIMIT_TOLERANCE_RANGE) ? true : false;
>>> +}
>>> +
>>> +static inline bool
>>> +dirtylimit_need_linear_adjustment(uint64_t quota,
>>> + uint64_t current)
>>> +{
>>> + uint64_t min, max;
>>> +
>>> + min = MIN(quota, current);
>>> + max = MAX(quota, current);
>>> +
>>> + return ((max - min) * 100 / max) > DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT;
>>> +}
>>> +
>>> +static void dirtylimit_set_throttle(CPUState *cpu,
>>> + uint64_t quota,
>>> + uint64_t current)
>>> +{
>>> + int64_t ring_full_time_us = 0;
>>> + uint64_t sleep_pct = 0;
>>> + uint64_t throttle_us = 0;
>>> +
>>> + if (current == 0) {
>>> + cpu->throttle_us_per_full = 0;
>>> + return;
>>> + }
>>> +
>>> + ring_full_time_us = dirtylimit_dirty_ring_full_time(current);
>>> +
>>> + if (dirtylimit_need_linear_adjustment(quota, current)) {
>>> + if (quota < current) {
>>> + sleep_pct = (current - quota) * 100 / current;
>>> + throttle_us =
>>> + ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
>>> + cpu->throttle_us_per_full += throttle_us;
>>> + } else {
>>> + sleep_pct = (quota - current) * 100 / quota;
>>> + throttle_us =
>>> + ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
>>> + cpu->throttle_us_per_full -= throttle_us;
>>> + }
>>> +
>>> + trace_dirtylimit_throttle_pct(cpu->cpu_index,
>>> + sleep_pct,
>>> + throttle_us);
>>> + } else {
>>> + if (quota < current) {
>>> + cpu->throttle_us_per_full += ring_full_time_us / 10;
>>> + } else {
>>> + cpu->throttle_us_per_full -= ring_full_time_us / 10;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * TODO: in the large kvm_dirty_ring_size case (e.g. 65536, or other
>>> + * scenarios), the current dirty page rate may never reach the quota;
>>> + * should we stop increasing the sleep time?
>>> + */
>>> + cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
>>> + ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
>>> +
>>> + cpu->throttle_us_per_full = MAX(cpu->throttle_us_per_full, 0);
>>> +}
>>> +
>>> +static void dirtylimit_adjust_throttle(CPUState *cpu)
>>> +{
>>> + uint64_t quota = 0;
>>> + uint64_t current = 0;
>>> + int cpu_index = cpu->cpu_index;
>>> +
>>> + quota = dirtylimit_vcpu_get_state(cpu_index)->quota;
>>> + current = vcpu_dirty_rate_get(cpu_index);
>>> +
>>> + if (!dirtylimit_done(quota, current)) {
>>> + dirtylimit_set_throttle(cpu, quota, current);
>>> + }
>>> +
>>> + return;
>>> +}
>>> +
>>> +void dirtylimit_process(void)
>>> +{
>>> + CPUState *cpu;
>>> +
>>> + if (!qatomic_read(&dirtylimit_quit)) {
>>> + dirtylimit_state_lock();
>>> +
>>> + if (!dirtylimit_in_service()) {
>>> + dirtylimit_state_unlock();
>>> + return;
>>> + }
>>> +
>>> + CPU_FOREACH(cpu) {
>>> + if (!dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled) {
>>> + continue;
>>> + }
>>> + dirtylimit_adjust_throttle(cpu);
>>> + }
>>> + dirtylimit_state_unlock();
>>> + }
>>> +}
>>> +
>>> +void dirtylimit_change(bool start)
>>> +{
>>> + if (start) {
>>> + qatomic_set(&dirtylimit_quit, 0);
>>> + } else {
>>> + qatomic_set(&dirtylimit_quit, 1);
>>> + }
>>> +}
>>> +
>>> +void dirtylimit_set_vcpu(int cpu_index,
>>> + uint64_t quota,
>>> + bool enable)
>>> +{
>>> + trace_dirtylimit_set_vcpu(cpu_index, quota);
>>> +
>>> + if (enable) {
>>> + dirtylimit_state->states[cpu_index].quota = quota;
>>> + if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
>>> + dirtylimit_state->limited_nvcpu++;
>>> + }
>>> + } else {
>>> + dirtylimit_state->states[cpu_index].quota = 0;
>>> + if (dirtylimit_state->states[cpu_index].enabled) {
>>> + dirtylimit_state->limited_nvcpu--;
>>> + }
>>> + }
>>> +
>>> + dirtylimit_state->states[cpu_index].enabled = enable;
>>> +}
>>> +
>>> +void dirtylimit_set_all(uint64_t quota,
>>> + bool enable)
>>> +{
>>> + MachineState *ms = MACHINE(qdev_get_machine());
>>> + int max_cpus = ms->smp.max_cpus;
>>> + int i;
>>> +
>>> + for (i = 0; i < max_cpus; i++) {
>>> + dirtylimit_set_vcpu(i, quota, enable);
>>> + }
>>> +}
>>> +
>>> +void dirtylimit_vcpu_execute(CPUState *cpu)
>>> +{
>>> + if (dirtylimit_in_service() &&
>>> + dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled &&
>>> + cpu->throttle_us_per_full) {
>>> + trace_dirtylimit_vcpu_execute(cpu->cpu_index,
>>> + cpu->throttle_us_per_full);
>>> + usleep(cpu->throttle_us_per_full);
>>> + }
>>> +}
>>> diff --git a/softmmu/trace-events b/softmmu/trace-events
>>> index 9c88887..22606dc 100644
>>> --- a/softmmu/trace-events
>>> +++ b/softmmu/trace-events
>>> @@ -31,3 +31,10 @@ runstate_set(int current_state, const char *current_state_str, int new_state, co
>>> system_wakeup_request(int reason) "reason=%d"
>>> qemu_system_shutdown_request(int reason) "reason=%d"
>>> qemu_system_powerdown_request(void) ""
>>> +
>>> +#dirtylimit.c
>>> +dirtylimit_state_initialize(int max_cpus) "dirtylimit state initialize: max cpus %d"
>>> +dirtylimit_state_finalize(void)
>>> +dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>>> +dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>>> +dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
>
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-23 16:44 ` manish.mishra
@ 2022-05-25 15:38 ` Hyman Huang
2022-05-25 15:55 ` Peter Xu
0 siblings, 1 reply; 22+ messages in thread
From: Hyman Huang @ 2022-05-25 15:38 UTC (permalink / raw)
To: manish.mishra, qemu-devel
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela,
Richard Henderson, Markus ArmBruster, Peter Xu, Paolo Bonzini,
Philippe Mathieu-Daudé
On 2022/5/24 0:44, manish.mishra wrote:
>
> On 17/05/22 1:49 pm, Hyman Huang wrote:
>> Thanks Manish for the comment, i'll give my explanation and any
>> supplement are welcomed.
> Really sorry for such late reply Hyman, this slipped my mind.
>>
>> On 2022/5/17 1:13, manish.mishra wrote:
>>> Hi Hyman Huang,
>>>
>>> I had few doubts regarding this patch series.
>> For the first point, m'm rudely guessing that you want to figure out
>> how should we set the vcpu dirty limit correctly during live migration
>> to make it convergent.
>>
>> This can be achieved by set a single dirtylimit value on all vcpus,
>> the value need not be equivalent of the half of available bandwidth so
>> precisely since the dirtylimit is sufficient conditions of migration
>> sucess, but not necessary condition.
>>
>> We can set the dirtylimit as the minimum of what user can tolerate, in
>> most case, migration can achieve convergent in advance and do the
>> switchover with the real dirtyrate greater than dirtylimit. This can
>> be implemented because Qemu will check the criteria every iteration,
>> once it meet the condition, Qemu will do the switch over no matter
>> what convergent algo is.
>
>
> Yes got it Hyman, my question was in direction that if we control dirty
> rate per vcpu, total dirty of VM become very unpredictable. For example
> if we have set dirty rate limit of each vcpu 50MBps for 10vcpu VM. Then
> total dirty rate of VM can be anywhere from 0-500MBps based on how many
> vcpu are active and how much. So if we had dirty rate limit control per
> VM it would have been much more predictable for user to use. I mean we
> can keep account of total dirty rate of VM and individual dirty rate and
> then assign throttle_sleep according to their weights to keep total
> dirty rate within limit of per vm dirty rate limit. But definately it
> can be targetted in future and should not be a blocker for now.
I got it. In my view, this patchset doesn't aim to control the dirty
page rate precisely, but to provide a method to limit it, so the two
views don't conflict. Dirtylimit focuses on throttling vCPUs that
write, and the quota dirty page rate is introduced just so the
throttle algorithm has homogeneous metric parameters. As for
controlling the dirty page rate precisely, a fresh new algorithm could
implement it (as you describe: monitor the VM dirty page rate, collect
stats and calculate weights, assign throttle_sleep, and so on).
>
>>>
>>> 1. Why we choose for dirty rate limit per vcpu. I mean it becomes
>>> very hard for user to decide per
>>>
>>> vcpu dirty rate limit. For e.g. we have 1Gbps network and 10
>>> vcpu vm. Now if someone wants to
>>>
>>> keep criteria for convergence as total dirty rate of VM should
>>> be lesser than half of available
>>>
>>> bandwidth. For this case to ensure convergence user has to give
>>> dirty rate limit per vcpu
>>>
>>> as 1Gbps/ 2 / 10 = 50Mbps. But assume then that VM has only 1
>>> thread which is actively
>>>
>>> dirtying memory, in that case so much of available quota will be
>>> wasted.
>> This is a good and frequent question about dirtylimit, as mentioned
>> above, throttle occurs only when dirty ring full and exit to user space.
>> A vcpu is set up with dirtylimit during live migration, but it does
>> not dirty memory, it may never get throttled.
>> The dirtylimit only throttle those vcpu who dirty memory and dirtyrate
>> greater then dirtylimit.
>>
>> So would not it be
>>>
>>> better to use dirty rate limit control per VM instead of vcpu?
>>>
>>> 2. Also Here we are adaptively trying to adjust sleep time based on
>>> current obsered dirty rate and
>>>
>>> dirty rate limit. Can it be more forceful like. Assume we have
>>> dirty rate limit of 10pages
>>>
>>> per sec and auto-converge/ dirty rate limit was triggered at
>>> time 0. Now at any point of time assume
>>>
>>> at time 10 sec if number of pages dirtyed are more than 100pages
>>> we sleep for interpolated amount
>>>
>>> of time. Basically at every dirty ring exit we can check if
>>> current number of pages dirtied are more than
>>>
>>> what should be allowed by this time?
>> Yes, indeed, but as memtioned above, if dirty ring exit, it give Qemu
>> a hint that vcpu is dirting memory, we should check it.
>>
>> I post the series of dirtylimit capability for RFC, may be it can help
>> me to explain the usage of vcpu dirty limit, it can be found here:
>> https://lore.kernel.org/qemu-devel/cover.1652762652.git.huangy81@chinatelecom.cn/
>>
>> Thanks,
>> Yong
> thanks I read this.
>
> Also i had few additional things in mind.
>
> 1. I see there is no limit on cpu->throttle_us_per_full. I see below
> line but then ring_full_time_us can be very high value so in some rare
> cases cpu->throttle_us_per_full can be very high. I know few database
> applications which can not tolerate continous sleep of more than 2 secs.
> I agree user should not configure very low dirty rate limit to avoid
> such situation but then user may not have enough idea of algorithm so
> better we keep out internal limits?
>
> cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
> ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
The ring_full_time_us is affected by two factors, the dirty ring size and
the max dirty page rate, as in the following formula:
ring_full_time_us = dirty ring size * page size / max_dirtyrate;
The maximum dirty ring size is 65536, so from this perspective
ring_full_time_us is bounded, though the bound is not hard-coded via a
macro. The scenario where ring_full_time_us is high only occurs when the
dirty ring size is configured with the max value 65536. I configured the
max dirty ring size and enabled the dirtylimit_vcpu_execute trace event,
and it showed that throttle_us_per_full <= 1s in my test server
environment. Indeed, I agree that ring_full_time_us can be very high in
some cases, but choosing a good limit also needs a lot of experience. I
suggest we leave the code untouched and keep the algorithm as simple as
possible until we are sure that an application is affected.
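To make the bound concrete, here is a rough sketch of the arithmetic above (illustrative helper names, not QEMU's; assumes 4KiB pages and mirrors the DIRTYLIMIT_THROTTLE_PCT_MAX clamp from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define THROTTLE_PCT_MAX 99 /* mirrors DIRTYLIMIT_THROTTLE_PCT_MAX */

/* Worst-case time in microseconds for a vCPU to fill its dirty ring at
 * the max observed dirty page rate (MB/s). */
static int64_t ring_full_time_us(uint32_t ring_size, uint64_t max_dirtyrate)
{
    uint64_t ring_memory_mb = (uint64_t)ring_size * PAGE_SIZE >> 20;
    return ring_memory_mb * 1000000 / max_dirtyrate;
}

/* The MIN() clamp keeps sleep time within THROTTLE_PCT_MAX percent of a
 * ring-full-plus-sleep cycle, i.e. at most 99x the ring full time; it is
 * a duty-cycle bound, not an absolute wall-clock bound. */
static int64_t throttle_us_upper_bound(uint32_t ring_size,
                                       uint64_t max_dirtyrate)
{
    return ring_full_time_us(ring_size, max_dirtyrate) * THROTTLE_PCT_MAX;
}
```

With the maximum ring size of 65536 pages (256MiB of coverage) and a max dirty page rate of 256 MB/s, the ring full time comes out to 1s, matching the trace observation above.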
>
> 2. Also, this algorithm only controls or limits the dirty rate for guest
> writes. There can be some memory dirtying done by virtio-based devices
> which is accounted only at the QEMU level, so it may not be accounted
> through dirty rings; do we have a plan for that in the future? Those are
> not issues for auto-converge, as it slows the full VM, but the dirty rate
> limit only slows guest writes.
>
>
From the migration point of view, the time spent migrating memory is far
greater than the time spent migrating devices emulated by QEMU. I think
we can address that when migrating devices costs time of the same
magnitude as migrating memory.
As to auto-converge, it throttles a vCPU by kicking it and forcing it to
sleep periodically. The two seem to have little difference in internal
method, but auto-converge is kind of "offensive" when applying restraint.
I'll read the auto-converge implementation code and look into the problem
you pointed out.
Thanks
>>>
>>> thanks
>>>
>>> Manish Mishra
>>>
>>> On 02/03/22 11:25 pm, huangy81@chinatelecom.cn wrote:
>>>> From: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>>>>
>>>> Setup a negative feedback system when vCPU thread
>>>> handling KVM_EXIT_DIRTY_RING_FULL exit by introducing
>>>> throttle_us_per_full field in struct CPUState. Sleep
>>>> throttle_us_per_full microseconds to throttle vCPU
>>>> if dirtylimit is in service.
>>>>
>>>> Signed-off-by: Hyman Huang(黄勇)<huangy81@chinatelecom.cn>
>>>> Reviewed-by: Peter Xu<peterx@redhat.com>
>>>> ---
>>>> accel/kvm/kvm-all.c | 19 ++-
>>>> include/hw/core/cpu.h | 6 +
>>>> include/sysemu/dirtylimit.h | 15 +++
>>>> softmmu/dirtylimit.c | 291
>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>> softmmu/trace-events | 7 ++
>>>> 5 files changed, 337 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>> index 8821d80..98e43e6 100644
>>>> --- a/accel/kvm/kvm-all.c
>>>> +++ b/accel/kvm/kvm-all.c
>>>> @@ -45,6 +45,7 @@
>>>> #include "qemu/guest-random.h"
>>>> #include "sysemu/hw_accel.h"
>>>> #include "kvm-cpus.h"
>>>> +#include "sysemu/dirtylimit.h"
>>>> #include "hw/boards.h"
>>>> @@ -476,6 +477,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>>>> cpu->kvm_state = s;
>>>> cpu->vcpu_dirty = true;
>>>> cpu->dirty_pages = 0;
>>>> + cpu->throttle_us_per_full = 0;
>>>> mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
>>>> if (mmap_size < 0) {
>>>> @@ -1469,6 +1471,11 @@ static void
>>>> *kvm_dirty_ring_reaper_thread(void *data)
>>>> */
>>>> sleep(1);
>>>> + /* keep sleeping so that dirtylimit not be interfered by
>>>> reaper */
>>>> + if (dirtylimit_in_service()) {
>>>> + continue;
>>>> + }
>>>> +
>>>> trace_kvm_dirty_ring_reaper("wakeup");
>>>> r->reaper_state = KVM_DIRTY_RING_REAPER_REAPING;
>>>> @@ -2964,8 +2971,18 @@ int kvm_cpu_exec(CPUState *cpu)
>>>> */
>>>> trace_kvm_dirty_ring_full(cpu->cpu_index);
>>>> qemu_mutex_lock_iothread();
>>>> - kvm_dirty_ring_reap(kvm_state, NULL);
>>>> + /* We throttle vCPU by making it sleep once it exit
>>>> from kernel
>>>> + * due to dirty ring full. In the dirtylimit scenario,
>>>> reaping
>>>> + * all vCPUs after a single vCPU dirty ring get full
>>>> result in
>>>> + * the miss of sleep, so just reap the ring-fulled vCPU.
>>>> + */
>>>> + if (dirtylimit_in_service()) {
>>>> + kvm_dirty_ring_reap(kvm_state, cpu);
>>>> + } else {
>>>> + kvm_dirty_ring_reap(kvm_state, NULL);
>>>> + }
>>>> qemu_mutex_unlock_iothread();
>>>> + dirtylimit_vcpu_execute(cpu);
>>>> ret = 0;
>>>> break;
>>>> case KVM_EXIT_SYSTEM_EVENT:
>>>> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
>>>> index 76ab3b8..dbeb31a 100644
>>>> --- a/include/hw/core/cpu.h
>>>> +++ b/include/hw/core/cpu.h
>>>> @@ -411,6 +411,12 @@ struct CPUState {
>>>> */
>>>> bool throttle_thread_scheduled;
>>>> + /*
>>>> + * Sleep throttle_us_per_full microseconds once dirty ring is full
>>>> + * if dirty page rate limit is enabled.
>>>> + */
>>>> + int64_t throttle_us_per_full;
>>>> +
>>>> bool ignore_memory_transaction_failures;
>>>> /* Used for user-only emulation of prctl(PR_SET_UNALIGN). */
>>>> diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
>>>> index da459f0..8d2c1f3 100644
>>>> --- a/include/sysemu/dirtylimit.h
>>>> +++ b/include/sysemu/dirtylimit.h
>>>> @@ -19,4 +19,19 @@ void vcpu_dirty_rate_stat_start(void);
>>>> void vcpu_dirty_rate_stat_stop(void);
>>>> void vcpu_dirty_rate_stat_initialize(void);
>>>> void vcpu_dirty_rate_stat_finalize(void);
>>>> +
>>>> +void dirtylimit_state_lock(void);
>>>> +void dirtylimit_state_unlock(void);
>>>> +void dirtylimit_state_initialize(void);
>>>> +void dirtylimit_state_finalize(void);
>>>> +bool dirtylimit_in_service(void);
>>>> +bool dirtylimit_vcpu_index_valid(int cpu_index);
>>>> +void dirtylimit_process(void);
>>>> +void dirtylimit_change(bool start);
>>>> +void dirtylimit_set_vcpu(int cpu_index,
>>>> + uint64_t quota,
>>>> + bool enable);
>>>> +void dirtylimit_set_all(uint64_t quota,
>>>> + bool enable);
>>>> +void dirtylimit_vcpu_execute(CPUState *cpu);
>>>> #endif
>>>> diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
>>>> index 6102e8c..76d0b44 100644
>>>> --- a/softmmu/dirtylimit.c
>>>> +++ b/softmmu/dirtylimit.c
>>>> @@ -18,6 +18,26 @@
>>>> #include "sysemu/dirtylimit.h"
>>>> #include "exec/memory.h"
>>>> #include "hw/boards.h"
>>>> +#include "sysemu/kvm.h"
>>>> +#include "trace.h"
>>>> +
>>>> +/*
>>>> + * Dirtylimit stop working if dirty page rate error
>>>> + * value less than DIRTYLIMIT_TOLERANCE_RANGE
>>>> + */
>>>> +#define DIRTYLIMIT_TOLERANCE_RANGE 25 /* MB/s */
>>>> +/*
>>>> + * Plus or minus vcpu sleep time linearly if dirty
>>>> + * page rate error value percentage over
>>>> + * DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT.
>>>> + * Otherwise, plus or minus a fixed vcpu sleep time.
>>>> + */
>>>> +#define DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT 50
>>>> +/*
>>>> + * Max vcpu sleep time percentage during a cycle
>>>> + * composed of dirty ring full and sleep time.
>>>> + */
>>>> +#define DIRTYLIMIT_THROTTLE_PCT_MAX 99
>>>> struct {
>>>> VcpuStat stat;
>>>> @@ -25,6 +45,30 @@ struct {
>>>> QemuThread thread;
>>>> } *vcpu_dirty_rate_stat;
>>>> +typedef struct VcpuDirtyLimitState {
>>>> + int cpu_index;
>>>> + bool enabled;
>>>> + /*
>>>> + * Quota dirty page rate, unit is MB/s
>>>> + * zero if not enabled.
>>>> + */
>>>> + uint64_t quota;
>>>> +} VcpuDirtyLimitState;
>>>> +
>>>> +struct {
>>>> + VcpuDirtyLimitState *states;
>>>> + /* Max cpus number configured by user */
>>>> + int max_cpus;
>>>> + /* Number of vcpu under dirtylimit */
>>>> + int limited_nvcpu;
>>>> +} *dirtylimit_state;
>>>> +
>>>> +/* protect dirtylimit_state */
>>>> +static QemuMutex dirtylimit_mutex;
>>>> +
>>>> +/* dirtylimit thread quit if dirtylimit_quit is true */
>>>> +static bool dirtylimit_quit;
>>>> +
>>>> static void vcpu_dirty_rate_stat_collect(void)
>>>> {
>>>> VcpuStat stat;
>>>> @@ -54,6 +98,9 @@ static void *vcpu_dirty_rate_stat_thread(void
>>>> *opaque)
>>>> while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
>>>> vcpu_dirty_rate_stat_collect();
>>>> + if (dirtylimit_in_service()) {
>>>> + dirtylimit_process();
>>>> + }
>>>> }
>>>> /* stop log sync */
>>>> @@ -86,9 +133,11 @@ void vcpu_dirty_rate_stat_start(void)
>>>> void vcpu_dirty_rate_stat_stop(void)
>>>> {
>>>> qatomic_set(&vcpu_dirty_rate_stat->running, 0);
>>>> + dirtylimit_state_unlock();
>>>> qemu_mutex_unlock_iothread();
>>>> qemu_thread_join(&vcpu_dirty_rate_stat->thread);
>>>> qemu_mutex_lock_iothread();
>>>> + dirtylimit_state_lock();
>>>> }
>>>> void vcpu_dirty_rate_stat_initialize(void)
>>>> @@ -114,3 +163,245 @@ void vcpu_dirty_rate_stat_finalize(void)
>>>> free(vcpu_dirty_rate_stat);
>>>> vcpu_dirty_rate_stat = NULL;
>>>> }
>>>> +
>>>> +void dirtylimit_state_lock(void)
>>>> +{
>>>> + qemu_mutex_lock(&dirtylimit_mutex);
>>>> +}
>>>> +
>>>> +void dirtylimit_state_unlock(void)
>>>> +{
>>>> + qemu_mutex_unlock(&dirtylimit_mutex);
>>>> +}
>>>> +
>>>> +static void
>>>> +__attribute__((__constructor__)) dirtylimit_mutex_init(void)
>>>> +{
>>>> + qemu_mutex_init(&dirtylimit_mutex);
>>>> +}
>>>> +
>>>> +static inline VcpuDirtyLimitState *dirtylimit_vcpu_get_state(int
>>>> cpu_index)
>>>> +{
>>>> + return &dirtylimit_state->states[cpu_index];
>>>> +}
>>>> +
>>>> +void dirtylimit_state_initialize(void)
>>>> +{
>>>> + MachineState *ms = MACHINE(qdev_get_machine());
>>>> + int max_cpus = ms->smp.max_cpus;
>>>> + int i;
>>>> +
>>>> + dirtylimit_state = g_malloc0(sizeof(*dirtylimit_state));
>>>> +
>>>> + dirtylimit_state->states =
>>>> + g_malloc0(sizeof(VcpuDirtyLimitState) * max_cpus);
>>>> +
>>>> + for (i = 0; i < max_cpus; i++) {
>>>> + dirtylimit_state->states[i].cpu_index = i;
>>>> + }
>>>> +
>>>> + dirtylimit_state->max_cpus = max_cpus;
>>>> + trace_dirtylimit_state_initialize(max_cpus);
>>>> +}
>>>> +
>>>> +void dirtylimit_state_finalize(void)
>>>> +{
>>>> + free(dirtylimit_state->states);
>>>> + dirtylimit_state->states = NULL;
>>>> +
>>>> + free(dirtylimit_state);
>>>> + dirtylimit_state = NULL;
>>>> +
>>>> + trace_dirtylimit_state_finalize();
>>>> +}
>>>> +
>>>> +bool dirtylimit_in_service(void)
>>>> +{
>>>> + return !!dirtylimit_state;
>>>> +}
>>>> +
>>>> +bool dirtylimit_vcpu_index_valid(int cpu_index)
>>>> +{
>>>> + MachineState *ms = MACHINE(qdev_get_machine());
>>>> +
>>>> + return !(cpu_index < 0 ||
>>>> + cpu_index >= ms->smp.max_cpus);
>>>> +}
>>>> +
>>>> +static inline int64_t dirtylimit_dirty_ring_full_time(uint64_t
>>>> dirtyrate)
>>>> +{
>>>> + static uint64_t max_dirtyrate;
>>>> + uint32_t dirty_ring_size = kvm_dirty_ring_size();
>>>> + uint64_t dirty_ring_size_meory_MB =
>>>> + dirty_ring_size * TARGET_PAGE_SIZE >> 20;
>>>> +
>>>> + if (max_dirtyrate < dirtyrate) {
>>>> + max_dirtyrate = dirtyrate;
>>>> + }
>>>> +
>>>> + return dirty_ring_size_meory_MB * 1000000 / max_dirtyrate;
>>>> +}
>>>> +
>>>> +static inline bool dirtylimit_done(uint64_t quota,
>>>> + uint64_t current)
>>>> +{
>>>> + uint64_t min, max;
>>>> +
>>>> + min = MIN(quota, current);
>>>> + max = MAX(quota, current);
>>>> +
>>>> + return ((max - min) <= DIRTYLIMIT_TOLERANCE_RANGE) ? true : false;
>>>> +}
>>>> +
>>>> +static inline bool
>>>> +dirtylimit_need_linear_adjustment(uint64_t quota,
>>>> + uint64_t current)
>>>> +{
>>>> + uint64_t min, max;
>>>> +
>>>> + min = MIN(quota, current);
>>>> + max = MAX(quota, current);
>>>> +
>>>> + return ((max - min) * 100 / max) >
>>>> DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT;
>>>> +}
>>>> +
>>>> +static void dirtylimit_set_throttle(CPUState *cpu,
>>>> + uint64_t quota,
>>>> + uint64_t current)
>>>> +{
>>>> + int64_t ring_full_time_us = 0;
>>>> + uint64_t sleep_pct = 0;
>>>> + uint64_t throttle_us = 0;
>>>> +
>>>> + if (current == 0) {
>>>> + cpu->throttle_us_per_full = 0;
>>>> + return;
>>>> + }
>>>> +
>>>> + ring_full_time_us = dirtylimit_dirty_ring_full_time(current);
>>>> +
>>>> + if (dirtylimit_need_linear_adjustment(quota, current)) {
>>>> + if (quota < current) {
>>>> + sleep_pct = (current - quota) * 100 / current;
>>>> + throttle_us =
>>>> + ring_full_time_us * sleep_pct / (double)(100 -
>>>> sleep_pct);
>>>> + cpu->throttle_us_per_full += throttle_us;
>>>> + } else {
>>>> + sleep_pct = (quota - current) * 100 / quota;
>>>> + throttle_us =
>>>> + ring_full_time_us * sleep_pct / (double)(100 -
>>>> sleep_pct);
>>>> + cpu->throttle_us_per_full -= throttle_us;
>>>> + }
>>>> +
>>>> + trace_dirtylimit_throttle_pct(cpu->cpu_index,
>>>> + sleep_pct,
>>>> + throttle_us);
>>>> + } else {
>>>> + if (quota < current) {
>>>> + cpu->throttle_us_per_full += ring_full_time_us / 10;
>>>> + } else {
>>>> + cpu->throttle_us_per_full -= ring_full_time_us / 10;
>>>> + }
>>>> + }
>>>> +
>>>> + /*
>>>> + * TODO: in the big kvm_dirty_ring_size case (eg: 65536, or
>>>> other scenario),
>>>> + * current dirty page rate may never reach the quota, we
>>>> should stop
>>>> + * increasing sleep time?
>>>> + */
>>>> + cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
>>>> + ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
>>>> +
>>>> + cpu->throttle_us_per_full = MAX(cpu->throttle_us_per_full, 0);
>>>> +}
>>>> +
>>>> +static void dirtylimit_adjust_throttle(CPUState *cpu)
>>>> +{
>>>> + uint64_t quota = 0;
>>>> + uint64_t current = 0;
>>>> + int cpu_index = cpu->cpu_index;
>>>> +
>>>> + quota = dirtylimit_vcpu_get_state(cpu_index)->quota;
>>>> + current = vcpu_dirty_rate_get(cpu_index);
>>>> +
>>>> + if (!dirtylimit_done(quota, current)) {
>>>> + dirtylimit_set_throttle(cpu, quota, current);
>>>> + }
>>>> +
>>>> + return;
>>>> +}
>>>> +
>>>> +void dirtylimit_process(void)
>>>> +{
>>>> + CPUState *cpu;
>>>> +
>>>> + if (!qatomic_read(&dirtylimit_quit)) {
>>>> + dirtylimit_state_lock();
>>>> +
>>>> + if (!dirtylimit_in_service()) {
>>>> + dirtylimit_state_unlock();
>>>> + return;
>>>> + }
>>>> +
>>>> + CPU_FOREACH(cpu) {
>>>> + if (!dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled) {
>>>> + continue;
>>>> + }
>>>> + dirtylimit_adjust_throttle(cpu);
>>>> + }
>>>> + dirtylimit_state_unlock();
>>>> + }
>>>> +}
>>>> +
>>>> +void dirtylimit_change(bool start)
>>>> +{
>>>> + if (start) {
>>>> + qatomic_set(&dirtylimit_quit, 0);
>>>> + } else {
>>>> + qatomic_set(&dirtylimit_quit, 1);
>>>> + }
>>>> +}
>>>> +
>>>> +void dirtylimit_set_vcpu(int cpu_index,
>>>> + uint64_t quota,
>>>> + bool enable)
>>>> +{
>>>> + trace_dirtylimit_set_vcpu(cpu_index, quota);
>>>> +
>>>> + if (enable) {
>>>> + dirtylimit_state->states[cpu_index].quota = quota;
>>>> + if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
>>>> + dirtylimit_state->limited_nvcpu++;
>>>> + }
>>>> + } else {
>>>> + dirtylimit_state->states[cpu_index].quota = 0;
>>>> + if (dirtylimit_state->states[cpu_index].enabled) {
>>>> + dirtylimit_state->limited_nvcpu--;
>>>> + }
>>>> + }
>>>> +
>>>> + dirtylimit_state->states[cpu_index].enabled = enable;
>>>> +}
>>>> +
>>>> +void dirtylimit_set_all(uint64_t quota,
>>>> + bool enable)
>>>> +{
>>>> + MachineState *ms = MACHINE(qdev_get_machine());
>>>> + int max_cpus = ms->smp.max_cpus;
>>>> + int i;
>>>> +
>>>> + for (i = 0; i < max_cpus; i++) {
>>>> + dirtylimit_set_vcpu(i, quota, enable);
>>>> + }
>>>> +}
>>>> +
>>>> +void dirtylimit_vcpu_execute(CPUState *cpu)
>>>> +{
>>>> + if (dirtylimit_in_service() &&
>>>> + dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled &&
>>>> + cpu->throttle_us_per_full) {
>>>> + trace_dirtylimit_vcpu_execute(cpu->cpu_index,
>>>> + cpu->throttle_us_per_full);
>>>> + usleep(cpu->throttle_us_per_full);
>>>> + }
>>>> +}
>>>> diff --git a/softmmu/trace-events b/softmmu/trace-events
>>>> index 9c88887..22606dc 100644
>>>> --- a/softmmu/trace-events
>>>> +++ b/softmmu/trace-events
>>>> @@ -31,3 +31,10 @@ runstate_set(int current_state, const char
>>>> *current_state_str, int new_state, co
>>>> system_wakeup_request(int reason) "reason=%d"
>>>> qemu_system_shutdown_request(int reason) "reason=%d"
>>>> qemu_system_powerdown_request(void) ""
>>>> +
>>>> +#dirtylimit.c
>>>> +dirtylimit_state_initialize(int max_cpus) "dirtylimit state
>>>> initialize: max cpus %d"
>>>> +dirtylimit_state_finalize(void)
>>>> +dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t
>>>> time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust
>>>> time %"PRIi64 " us"
>>>> +dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set
>>>> dirty page rate limit %"PRIu64
>>>> +dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us)
>>>> "CPU[%d] sleep %"PRIi64 " us"
>>
--
Best regards
Hyman Huang(黄勇)
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-25 15:38 ` Hyman Huang
@ 2022-05-25 15:55 ` Peter Xu
2022-05-26 2:51 ` Jason Wang
0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2022-05-25 15:55 UTC (permalink / raw)
To: Hyman Huang
Cc: manish.mishra, qemu-devel, Eduardo Habkost, David Hildenbrand,
Juan Quintela, Richard Henderson, Markus ArmBruster,
Paolo Bonzini, Philippe Mathieu-Daudé,
Jason Wang
On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
> > 2. Also this algorithm only control or limits dirty rate by guest
> > writes. There can be some memory dirtying done by virtio based devices
> > which is accounted only at qemu level so may not be accounted through
> > dirty rings so do we have plan for that in future? Those are not issue
> > for auto-converge as it slows full VM but dirty rate limit only slows
> > guest writes.
> >
> From the migration point of view, time spent on migrating memory is far
> greater than migrating devices emulated by qemu. I think we can do that when
> migrating device costs the same magnitude time as migrating memory.
>
> As to auto-converge, it throttle vcpu by kicking it and force it to sleep
> periodically. The two seems has no much difference from the perspective of
> internal method but the auto-converge is kind of "offensive" when doing
> restraint. I'll read the auto-converge implementation code and figure out
> the problem you point out.
This seems to be not virtio-specific, but can apply to any device DMA
writing to guest mem (if not including vfio). But indeed virtio can
normally be faster.
I'm also curious how fast a device DMA could dirty memory. This could be
a question to answer for all vcpu-based throttling approaches (including
the quota-based approach that was proposed on the KVM list). Maybe for
kernel virtio drivers we can have some easier estimation? My guess is
it'll be much harder for DPDK-in-guest (aka userspace drivers) because
IIUC that could use a large chunk of guest mem.
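As a back-of-envelope illustration of the question, a pessimistic bound on packet-DMA dirtying (pure arithmetic; the one-distinct-page-per-packet assumption is mine and is far worse than real rx-buffer reuse):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Upper bound on DMA dirtying in MB/s, assuming (pessimistically) that
 * every packet lands on a distinct guest page; real backends recycle a
 * bounded set of rx buffers, so the true rate is much lower. */
static uint64_t worst_case_dirty_mb_per_sec(uint64_t pps)
{
    return pps * PAGE_SIZE >> 20;
}
```

At ~20Mpps this pessimistic bound is ~78GB/s, which mostly shows that the real answer is dominated by how much distinct memory the backend actually touches, not by the packet rate alone.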
[copy Jason too]
--
Peter Xu
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-25 15:55 ` Peter Xu
@ 2022-05-26 2:51 ` Jason Wang
2022-06-01 16:15 ` manish.mishra
2022-06-13 9:58 ` manish.mishra
0 siblings, 2 replies; 22+ messages in thread
From: Jason Wang @ 2022-05-26 2:51 UTC (permalink / raw)
To: Peter Xu
Cc: Hyman Huang, manish.mishra, qemu-devel, Eduardo Habkost,
David Hildenbrand, Juan Quintela, Richard Henderson,
Markus ArmBruster, Paolo Bonzini, Philippe Mathieu-Daudé
On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
> > > 2. Also this algorithm only control or limits dirty rate by guest
> > > writes. There can be some memory dirtying done by virtio based devices
> > > which is accounted only at qemu level so may not be accounted through
> > > dirty rings so do we have plan for that in future? Those are not issue
> > > for auto-converge as it slows full VM but dirty rate limit only slows
> > > guest writes.
> > >
> > From the migration point of view, time spent on migrating memory is far
> > greater than migrating devices emulated by qemu. I think we can do that when
> > migrating device costs the same magnitude time as migrating memory.
> >
> > As to auto-converge, it throttle vcpu by kicking it and force it to sleep
> > periodically. The two seems has no much difference from the perspective of
> > internal method but the auto-converge is kind of "offensive" when doing
> > restraint. I'll read the auto-converge implementation code and figure out
> > the problem you point out.
>
> This seems to be not virtio-specific, but can be applied to any device DMA
> writting to guest mem (if not including vfio). But indeed virtio can be
> normally faster.
>
> I'm also curious how fast a device DMA could dirty memories. This could be
> a question to answer to all vcpu-based throttling approaches (including the
> quota based approach that was proposed on KVM list). Maybe for kernel
> virtio drivers we can have some easier estimation?
As you said below, it really depends on the speed of the backend.
> My guess is it'll be
> much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
> could use a large chunk of guest mem.
Probably, for vhost-user backend, it could be ~20Mpps or even higher.
Thanks
>
> [copy Jason too]
>
> --
> Peter Xu
>
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-26 2:51 ` Jason Wang
@ 2022-06-01 16:15 ` manish.mishra
2022-06-13 9:58 ` manish.mishra
1 sibling, 0 replies; 22+ messages in thread
From: manish.mishra @ 2022-06-01 16:15 UTC (permalink / raw)
To: Jason Wang, Peter Xu
Cc: Hyman Huang, qemu-devel, Eduardo Habkost, David Hildenbrand,
Juan Quintela, Richard Henderson, Markus ArmBruster,
Paolo Bonzini, Philippe Mathieu-Daudé
On 26/05/22 8:21 am, Jason Wang wrote:
> On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>> On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
>>>> 2. Also this algorithm only control or limits dirty rate by guest
>>>> writes. There can be some memory dirtying done by virtio based devices
>>>> which is accounted only at qemu level so may not be accounted through
>>>> dirty rings so do we have plan for that in future? Those are not issue
>>>> for auto-converge as it slows full VM but dirty rate limit only slows
>>>> guest writes.
>>>>
>>> From the migration point of view, time spent on migrating memory is far
>>> greater than migrating devices emulated by qemu. I think we can do that when
>>> migrating device costs the same magnitude time as migrating memory.
>>>
>>> As to auto-converge, it throttle vcpu by kicking it and force it to sleep
>>> periodically. The two seems has no much difference from the perspective of
>>> internal method but the auto-converge is kind of "offensive" when doing
>>> restraint. I'll read the auto-converge implementation code and figure out
>>> the problem you point out.
>> This seems to be not virtio-specific, but can be applied to any device DMA
>> writting to guest mem (if not including vfio). But indeed virtio can be
>> normally faster.
>>
>> I'm also curious how fast a device DMA could dirty memories. This could be
>> a question to answer to all vcpu-based throttling approaches (including the
>> quota based approach that was proposed on KVM list). Maybe for kernel
>> virtio drivers we can have some easier estimation?
> As you said below, it really depends on the speed of the backend.
>
>> My guess is it'll be
>> much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
>> could use a large chunk of guest mem.
> Probably, for vhost-user backend, it could be ~20Mpps or even higher.
>
> Thanks
>
>> [copy Jason too]
I will try to get some numbers for this by next week. Jason, I just wanted to get more
context on why it should be ~20Mpps; can it be as much as the throughput limit of
storage/network in the worst case?
Also, we were internally discussing keeping this kind of throttling not as an alternative
to auto-converge, but somehow running it orthogonally to auto-converge with some modifications.
In cases where most dirtying is by guest writes, auto-converge will not be active anyway, as it
decides to throttle based on the ratio of dirty/(2*transferred), which, if enforced correctly by
e.g. the dirty rate limit, will be ~1. This is the easiest approach we could think of for a start,
but it can definitely be improved in the future. Maybe something similar can be done for this
dirty limit approach too?
Surely not for this patch series, but in the future.
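A minimal sketch of one reading of that trigger condition (a simplified, assumed form of the ratio mentioned above; not QEMU's exact bookkeeping):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Auto-converge starts throttling when bytes dirtied in an iteration
 * exceed half the bytes transferred. If a dirty rate limit already keeps
 * dirtying at or below that level, auto-converge stays inactive, so the
 * two mechanisms can run side by side. */
static bool autoconverge_would_throttle(uint64_t dirty_bytes,
                                        uint64_t transferred_bytes)
{
    return dirty_bytes > transferred_bytes / 2;
}
```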
thanks
Manish Mishra
>> --
>> Peter Xu
>>
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-05-26 2:51 ` Jason Wang
2022-06-01 16:15 ` manish.mishra
@ 2022-06-13 9:58 ` manish.mishra
2022-06-13 14:33 ` Peter Xu
1 sibling, 1 reply; 22+ messages in thread
From: manish.mishra @ 2022-06-13 9:58 UTC (permalink / raw)
To: Jason Wang, Peter Xu
Cc: Hyman Huang, qemu-devel, Eduardo Habkost, David Hildenbrand,
Juan Quintela, Richard Henderson, Markus ArmBruster,
Paolo Bonzini, Philippe Mathieu-Daudé
[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]
On 26/05/22 8:21 am, Jason Wang wrote:
> On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>> On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
>>>> 2. Also this algorithm only control or limits dirty rate by guest
>>>> writes. There can be some memory dirtying done by virtio based devices
>>>> which is accounted only at qemu level so may not be accounted through
>>>> dirty rings so do we have plan for that in future? Those are not issue
>>>> for auto-converge as it slows full VM but dirty rate limit only slows
>>>> guest writes.
>>>>
>>> From the migration point of view, time spent on migrating memory is far
>>> greater than migrating devices emulated by qemu. I think we can do that when
>>> migrating device costs the same magnitude time as migrating memory.
>>>
>>> As to auto-converge, it throttle vcpu by kicking it and force it to sleep
>>> periodically. The two seems has no much difference from the perspective of
>>> internal method but the auto-converge is kind of "offensive" when doing
>>> restraint. I'll read the auto-converge implementation code and figure out
>>> the problem you point out.
>> This seems to be not virtio-specific, but can be applied to any device DMA
>> writting to guest mem (if not including vfio). But indeed virtio can be
>> normally faster.
>>
>> I'm also curious how fast a device DMA could dirty memories. This could be
>> a question to answer to all vcpu-based throttling approaches (including the
>> quota based approach that was proposed on KVM list). Maybe for kernel
>> virtio drivers we can have some easier estimation?
> As you said below, it really depends on the speed of the backend.
>
>> My guess is it'll be
>> much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
>> could use a large chunk of guest mem.
> Probably, for vhost-user backend, it could be ~20Mpps or even higher.
Sorry for the late response on this. We experimented with IO on a virtio-scsi based disk.
We could see a dirty rate of ~500MBps on my system, and most of that was not tracked
in kvm_dirty_log. Also, for reference, I am attaching the test we used to avoid tracking
in KVM (as attached file).
>
> Thanks
>
>> [copy Jason too]
>>
>> --
>> Peter Xu
>>
[-- Attachment #2: non-kvm-dirty-test.c --]
[-- Type: text/plain, Size: 1424 bytes --]
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define GB (1024 * 1024 * 1024)

int main()
{
    char *buff;
    size_t size;
    struct stat stat;
    // Take file of size at least double of RAM size to
    // achieve max dirty rate possible.
    const char *file_name = "file_10_gb";
    int fd;
    size_t i = 0, count = 0;
    struct timespec ts1, ts0;
    double time_diff;

    fd = open(file_name, O_RDONLY);
    if (fd == -1) {
        perror("Error opening file");
        exit(1);
    }

    fstat(fd, &stat);
    size = stat.st_size;
    printf("File size %ld\n", (long)size);

    buff = (char *)mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buff == MAP_FAILED) {
        perror("Mmap Error");
        exit(1);
    }

    (void)clock_gettime(CLOCK_MONOTONIC, &ts0);

    while (1) {
        char c;

        i = (i + PAGE_SIZE) % size;
        c = buff[i];
        count++;
        // Check on every 10K pages for rate.
        if (count % 10000 == 0) {
            (void)clock_gettime(CLOCK_MONOTONIC, &ts1);
            time_diff = ((double)ts1.tv_sec + ts1.tv_nsec * 1.0e-9) -
                        ((double)ts0.tv_sec + ts0.tv_nsec * 1.0e-9);
            printf("Expected Dirty rate %f\n",
                   (10000.0 * PAGE_SIZE) / GB / time_diff);
            ts0 = ts1;
        }
    }

    close(fd);
    return 0;
}
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-06-13 9:58 ` manish.mishra
@ 2022-06-13 14:33 ` Peter Xu
2022-06-13 15:33 ` manish.mishra
0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2022-06-13 14:33 UTC (permalink / raw)
To: manish.mishra
Cc: Jason Wang, Hyman Huang, qemu-devel, Eduardo Habkost,
David Hildenbrand, Juan Quintela, Richard Henderson,
Markus ArmBruster, Paolo Bonzini, Philippe Mathieu-Daudé
On Mon, Jun 13, 2022 at 03:28:34PM +0530, manish.mishra wrote:
>
> On 26/05/22 8:21 am, Jason Wang wrote:
> > On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
> > > > > 2. Also this algorithm only control or limits dirty rate by guest
> > > > > writes. There can be some memory dirtying done by virtio based devices
> > > > > which is accounted only at qemu level so may not be accounted through
> > > > > dirty rings so do we have plan for that in future? Those are not issue
> > > > > for auto-converge as it slows full VM but dirty rate limit only slows
> > > > > guest writes.
> > > > >
> > > > From the migration point of view, time spent on migrating memory is far
> > > > greater than migrating devices emulated by qemu. I think we can do that when
> > > > migrating device costs the same magnitude time as migrating memory.
> > > >
> > > > As to auto-converge, it throttle vcpu by kicking it and force it to sleep
> > > > periodically. The two seems has no much difference from the perspective of
> > > > internal method but the auto-converge is kind of "offensive" when doing
> > > > restraint. I'll read the auto-converge implementation code and figure out
> > > > the problem you point out.
> > > This seems to be not virtio-specific, but can be applied to any device DMA
> > > writting to guest mem (if not including vfio). But indeed virtio can be
> > > normally faster.
> > >
> > > I'm also curious how fast a device DMA could dirty memories. This could be
> > > a question to answer to all vcpu-based throttling approaches (including the
> > > quota based approach that was proposed on KVM list). Maybe for kernel
> > > virtio drivers we can have some easier estimation?
> > As you said below, it really depends on the speed of the backend.
> >
> > > My guess is it'll be
> > > much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
> > > could use a large chunk of guest mem.
> > Probably, for vhost-user backend, it could be ~20Mpps or even higher.
>
> Sorry for late response on this. We did experiment with IO on virtio-scsi based disk.
Thanks for trying this and sharing it out.
>
> We could see dirty rate of ~500MBps on my system and most of that was not tracked
> as kvm_dirty_log. Also for reference i am attaching test we used to avoid tacking
> in KVM. (as attached file).
The number looks sane as it seems to be the sequential bandwidth for a
disk, though I'm not 100% sure it'll work as expected, since you mmap()ed
the region with private pages rather than shared. So I'm wondering whether
the below will happen (also based on the fact that you mapped a file twice
the size of guest mem, as you mentioned in the comment):
(1) Swap out will start to trigger after you read a lot of data into the
mem already, then old-read pages will be swapped out to disk (and
hopefully the swap device does not reside on the same virtio-scsi
disk or it'll be even more complicated scenario of mixture IOs..),
meanwhile when you finish reading a round and start to read from
offset 0 swap-in will start to happen too. Swapping can slow down
things already, and I'm wondering whether the 500MB/s was really
caused by the swapout rather than backend disk reads. More below.
(2) Another attribute of private pages AFAICT is after you read it once
it does not need to be read again from the virtio-scsi disks. In
other words, I'm thinking whether starting from the 2nd iteration
your program won't trigger any DMA at all but purely torturing the
swap device.
Maybe changing MAP_PRIVATE to MAP_SHARED can emulate better on what we want
to measure, but I'm also not 100% sure on whether it could be accurate..
Thanks,
>
> >
> > Thanks
> >
> > > [copy Jason too]
> > >
> > > --
> > > Peter Xu
> > >
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
> #include <sys/time.h>
> #include <time.h>
> #include <unistd.h>
>
> #define PAGE_SIZE 4096
> #define GB (1024 * 1024 * 1024)
>
> int main()
> {
>     char *buff;
>     size_t size;
>     struct stat stat;
>     // Take a file of size at least double the RAM size to
>     // achieve the max dirty rate possible.
>     const char *file_name = "file_10_gb";
>     int fd;
>     size_t i = 0, count = 0;
>     struct timespec ts1, ts0;
>     double time_diff;
>
>     fd = open(file_name, O_RDONLY);
>     if (fd == -1) {
>         perror("Error opening file");
>         exit(1);
>     }
>
>     fstat(fd, &stat);
>     size = stat.st_size;
>     printf("File size %ld\n", (long)size);
>
>     buff = (char *)mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0);
>     if (buff == MAP_FAILED) {
>         perror("Mmap Error");
>         exit(1);
>     }
>
>     (void)clock_gettime(CLOCK_MONOTONIC, &ts0);
>
>     while (1) {
>         char c;
>
>         i = (i + PAGE_SIZE) % size;
>         c = buff[i];
>         (void)c;  /* read-only touch of one byte per page */
>         count++;
>         // Check the rate every 10K pages.
>         if (count % 10000 == 0) {
>             (void)clock_gettime(CLOCK_MONOTONIC, &ts1);
>             time_diff = ((double)ts1.tv_sec + ts1.tv_nsec * 1.0e-9) - ((double)ts0.tv_sec + ts0.tv_nsec * 1.0e-9);
>             printf("Expected Dirty rate %f\n", (10000.0 * PAGE_SIZE) / GB / time_diff);
>             ts0 = ts1;
>         }
>     }
>
>     close(fd);
>     return 0;
> }
--
Peter Xu
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-06-13 14:33 ` Peter Xu
@ 2022-06-13 15:33 ` manish.mishra
2022-06-13 15:58 ` Peter Xu
0 siblings, 1 reply; 22+ messages in thread
From: manish.mishra @ 2022-06-13 15:33 UTC (permalink / raw)
To: Peter Xu
Cc: Jason Wang, Hyman Huang, qemu-devel, Eduardo Habkost,
David Hildenbrand, Juan Quintela, Richard Henderson,
Markus Armbruster, Paolo Bonzini, Philippe Mathieu-Daudé
On 13/06/22 8:03 pm, Peter Xu wrote:
> On Mon, Jun 13, 2022 at 03:28:34PM +0530, manish.mishra wrote:
>> On 26/05/22 8:21 am, Jason Wang wrote:
>>> On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
>>>> On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
>>>>>> 2. Also, this algorithm only controls or limits the dirty rate of guest
>>>>>> writes. There can be some memory dirtying done by virtio-based devices
>>>>>> which is accounted for only at the qemu level, so it may not be accounted
>>>>>> for through dirty rings; do we have a plan for that in the future? That is
>>>>>> not an issue for auto-converge as it slows the full VM, but dirty rate
>>>>>> limit only slows guest writes.
>>>>>>
>>>>> From the migration point of view, time spent on migrating memory is far
>>>>> greater than on migrating devices emulated by qemu. I think we can do that
>>>>> when migrating devices costs time of the same magnitude as migrating memory.
>>>>>
>>>>> As to auto-converge, it throttles the vcpu by kicking it and forcing it to
>>>>> sleep periodically. The two seem to have little difference in terms of
>>>>> internal method, but auto-converge is kind of "offensive" when doing
>>>>> restraint. I'll read the auto-converge implementation code and figure out
>>>>> the problem you pointed out.
>>>> This seems not to be virtio-specific, but can apply to any device DMA
>>>> writing to guest mem (if not including vfio).  But indeed virtio can
>>>> normally be faster.
>>>>
>>>> I'm also curious how fast a device DMA could dirty memories. This could be
>>>> a question to answer to all vcpu-based throttling approaches (including the
>>>> quota based approach that was proposed on KVM list). Maybe for kernel
>>>> virtio drivers we can have some easier estimation?
>>> As you said below, it really depends on the speed of the backend.
>>>
>>>> My guess is it'll be
>>>> much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
>>>> could use a large chunk of guest mem.
>>> Probably, for vhost-user backend, it could be ~20Mpps or even higher.
>> Sorry for the late response on this. We did an experiment with IO on a virtio-scsi based disk.
> Thanks for trying this and sharing it out.
>
>> We could see a dirty rate of ~500MBps on my system, and most of that was not tracked
>> as kvm_dirty_log. Also for reference I am attaching the test we used to avoid tracking
>> in KVM (as attached file).
> The number looks sane as it seems to be the sequential bandwidth for a
> disk, though I'm not 100% sure it'll work as expected, since you mmap()ed
> the region with private pages rather than shared, so I'm wondering whether
> the following will happen (also based on the fact that you mapped twice
> the size of guest mem, as you mentioned in the comment):
>
> (1) Swap out will start to trigger after you have already read a lot of
>     data into the mem, then old-read pages will be swapped out to disk
>     (and hopefully the swap device does not reside on the same virtio-scsi
>     disk, or it'll be an even more complicated scenario of mixed IOs..);
>     meanwhile when you finish reading a round and start to read from
>     offset 0, swap-in will start to happen too.  Swapping can slow
>     things down already, and I'm wondering whether the 500MB/s was really
>     caused by the swapout rather than by backend disk reads.  More below.
>
> (2) Another attribute of private pages AFAICT is that after you read a
>     page once it does not need to be read again from the virtio-scsi
>     disk.  In other words, I'm thinking whether starting from the 2nd
>     iteration your program won't trigger any DMA at all but will purely
>     be torturing the swap device.
>
> Maybe changing MAP_PRIVATE to MAP_SHARED can better emulate what we want
> to measure, but I'm also not 100% sure on whether it could be accurate..
>
> Thanks,
>
Thanks Peter. Yes, agreed, MAP_SHARED should be used here; sorry I missed that 😁.
Yes, my purpose in taking a file size larger than RAM_SIZE was to cause
frequent page cache flushes and re-population of page-cache pages, not to
trigger swaps. I checked on my VM that I had swapping disabled; maybe
MAP_PRIVATE did not make a difference because it was read-only.
I tested again with MAP_SHARED and it comes out around ~500MBps.
Thanks
Manish Mishra
>>> Thanks
>>>
>>>> [copy Jason too]
>>>>
>>>> --
>>>> Peter Xu
>>>>
>> #include <fcntl.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <sys/stat.h>
>> #include <sys/time.h>
>> #include <time.h>
>> #include <unistd.h>
>>
>> #define PAGE_SIZE 4096
>> #define GB (1024 * 1024 * 1024)
>>
>> int main()
>> {
>>     char *buff;
>>     size_t size;
>>     struct stat stat;
>>     // Take a file of size at least double the RAM size to
>>     // achieve the max dirty rate possible.
>>     const char *file_name = "file_10_gb";
>>     int fd;
>>     size_t i = 0, count = 0;
>>     struct timespec ts1, ts0;
>>     double time_diff;
>>
>>     fd = open(file_name, O_RDONLY);
>>     if (fd == -1) {
>>         perror("Error opening file");
>>         exit(1);
>>     }
>>
>>     fstat(fd, &stat);
>>     size = stat.st_size;
>>     printf("File size %ld\n", (long)size);
>>
>>     buff = (char *)mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0);
>>     if (buff == MAP_FAILED) {
>>         perror("Mmap Error");
>>         exit(1);
>>     }
>>
>>     (void)clock_gettime(CLOCK_MONOTONIC, &ts0);
>>
>>     while (1) {
>>         char c;
>>
>>         i = (i + PAGE_SIZE) % size;
>>         c = buff[i];
>>         (void)c;  /* read-only touch of one byte per page */
>>         count++;
>>         // Check the rate every 10K pages.
>>         if (count % 10000 == 0) {
>>             (void)clock_gettime(CLOCK_MONOTONIC, &ts1);
>>             time_diff = ((double)ts1.tv_sec + ts1.tv_nsec * 1.0e-9) - ((double)ts0.tv_sec + ts0.tv_nsec * 1.0e-9);
>>             printf("Expected Dirty rate %f\n", (10000.0 * PAGE_SIZE) / GB / time_diff);
>>             ts0 = ts1;
>>         }
>>     }
>>
>>     close(fd);
>>     return 0;
>> }
>
* Re: [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle
2022-06-13 15:33 ` manish.mishra
@ 2022-06-13 15:58 ` Peter Xu
0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-06-13 15:58 UTC (permalink / raw)
To: manish.mishra
Cc: Jason Wang, Hyman Huang, qemu-devel, Eduardo Habkost,
David Hildenbrand, Juan Quintela, Richard Henderson,
Markus Armbruster, Paolo Bonzini, Philippe Mathieu-Daudé
On Mon, Jun 13, 2022 at 09:03:24PM +0530, manish.mishra wrote:
>
> On 13/06/22 8:03 pm, Peter Xu wrote:
> > On Mon, Jun 13, 2022 at 03:28:34PM +0530, manish.mishra wrote:
> > > On 26/05/22 8:21 am, Jason Wang wrote:
> > > > On Wed, May 25, 2022 at 11:56 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > On Wed, May 25, 2022 at 11:38:26PM +0800, Hyman Huang wrote:
> > > > > > > 2. Also, this algorithm only controls or limits the dirty rate of guest
> > > > > > > writes. There can be some memory dirtying done by virtio-based devices
> > > > > > > which is accounted for only at the qemu level, so it may not be accounted
> > > > > > > for through dirty rings; do we have a plan for that in the future? That is
> > > > > > > not an issue for auto-converge as it slows the full VM, but dirty rate
> > > > > > > limit only slows guest writes.
> > > > > > >
> > > > > > From the migration point of view, time spent on migrating memory is far
> > > > > > greater than on migrating devices emulated by qemu. I think we can do that
> > > > > > when migrating devices costs time of the same magnitude as migrating memory.
> > > > > >
> > > > > > As to auto-converge, it throttles the vcpu by kicking it and forcing it to
> > > > > > sleep periodically. The two seem to have little difference in terms of
> > > > > > internal method, but auto-converge is kind of "offensive" when doing
> > > > > > restraint. I'll read the auto-converge implementation code and figure out
> > > > > > the problem you pointed out.
> > > > > This seems not to be virtio-specific, but can apply to any device DMA
> > > > > writing to guest mem (if not including vfio).  But indeed virtio can
> > > > > normally be faster.
> > > > >
> > > > > I'm also curious how fast a device DMA could dirty memories. This could be
> > > > > a question to answer to all vcpu-based throttling approaches (including the
> > > > > quota based approach that was proposed on KVM list). Maybe for kernel
> > > > > virtio drivers we can have some easier estimation?
> > > > As you said below, it really depends on the speed of the backend.
> > > >
> > > > > My guess is it'll be
> > > > > much harder for DPDK-in-guest (aka userspace drivers) because IIUC that
> > > > > could use a large chunk of guest mem.
> > > > Probably, for vhost-user backend, it could be ~20Mpps or even higher.
> > > Sorry for the late response on this. We did an experiment with IO on a virtio-scsi based disk.
> > Thanks for trying this and sharing it out.
> >
> > > We could see a dirty rate of ~500MBps on my system, and most of that was not tracked
> > > as kvm_dirty_log. Also for reference I am attaching the test we used to avoid tracking
> > > in KVM (as attached file).
> > The number looks sane as it seems to be the sequential bandwidth for a
> > disk, though I'm not 100% sure it'll work as expected, since you mmap()ed
> > the region with private pages rather than shared, so I'm wondering whether
> > the following will happen (also based on the fact that you mapped twice
> > the size of guest mem, as you mentioned in the comment):
> >
> > (1) Swap out will start to trigger after you have already read a lot of
> >     data into the mem, then old-read pages will be swapped out to disk
> >     (and hopefully the swap device does not reside on the same virtio-scsi
> >     disk, or it'll be an even more complicated scenario of mixed IOs..);
> >     meanwhile when you finish reading a round and start to read from
> >     offset 0, swap-in will start to happen too.  Swapping can slow
> >     things down already, and I'm wondering whether the 500MB/s was really
> >     caused by the swapout rather than by backend disk reads.  More below.
> >
> > (2) Another attribute of private pages AFAICT is that after you read a
> >     page once it does not need to be read again from the virtio-scsi
> >     disk.  In other words, I'm thinking whether starting from the 2nd
> >     iteration your program won't trigger any DMA at all but will purely
> >     be torturing the swap device.
> >
> > Maybe changing MAP_PRIVATE to MAP_SHARED can better emulate what we want
> > to measure, but I'm also not 100% sure on whether it could be accurate..
> >
> > Thanks,
> >
> Thanks Peter. Yes, agreed, MAP_SHARED should be used here; sorry I missed that 😁.
>
> Yes, my purpose in taking a file size larger than RAM_SIZE was to cause
> frequent page cache flushes and re-population of page-cache pages, not to
> trigger swaps. I checked on my VM that I had swapping disabled; maybe
> MAP_PRIVATE did not make a difference because it was read-only.
Makes sense. And yeah, I overlooked the RO part - indeed the page cache will
be used for RO pages as long as they are never written, so it'll behave like
shared. Otherwise, with swap fully off, you should have hit OOM anyway and
the process probably would get killed sooner or later. :)
>
> I tested again with MAP_SHARED and it comes out around ~500MBps.
Makes sense. I'd guess that's the limitation of the virtio-scsi backend;
IOW, the logical limit of a device dirtying memory could be unbounded
(e.g., when we put the virtio backend onto a ramdisk).
--
Peter Xu
end of thread
Thread overview: 22+ messages
2022-03-02 17:55 [PATCH v17 0/8] support dirty restraint on vCPU huangy81
[not found] ` <cover.1646243447.git.huangy81@chinatelecom.cn>
2022-03-02 17:55 ` [PATCH v17 1/8] accel/kvm/kvm-all: Refactor per-vcpu dirty ring reaping huangy81
2022-03-02 17:55 ` [PATCH v17 2/8] cpus: Introduce cpu_list_generation_id huangy81
2022-03-02 17:55 ` [PATCH v17 3/8] migration/dirtyrate: Refactor dirty page rate calculation huangy81
2022-03-02 17:55 ` [PATCH v17 4/8] softmmu/dirtylimit: Implement vCPU dirtyrate calculation periodically huangy81
2022-03-02 17:55 ` [PATCH v17 5/8] accel/kvm/kvm-all: Introduce kvm_dirty_ring_size function huangy81
2022-03-02 17:55 ` [PATCH v17 6/8] softmmu/dirtylimit: Implement virtual CPU throttle huangy81
2022-05-16 17:13 ` manish.mishra
2022-05-17 8:19 ` Hyman Huang
2022-05-23 16:44 ` manish.mishra
2022-05-25 15:38 ` Hyman Huang
2022-05-25 15:55 ` Peter Xu
2022-05-26 2:51 ` Jason Wang
2022-06-01 16:15 ` manish.mishra
2022-06-13 9:58 ` manish.mishra
2022-06-13 14:33 ` Peter Xu
2022-06-13 15:33 ` manish.mishra
2022-06-13 15:58 ` Peter Xu
2022-03-02 17:55 ` [PATCH v17 7/8] softmmu/dirtylimit: Implement dirty page rate limit huangy81
2022-03-02 17:55 ` [PATCH v17 8/8] tests/qtest/qmp-cmd-test: Ignore query-vcpu-dirty-limit test huangy81
2022-03-03 5:58 ` Markus Armbruster
2022-03-03 6:11 ` Hyman Huang