* [PATCH v10 0/3] Per-vCPU dirty quota-based throttling
@ 2024-02-21 19:51 Shivam Kumar
  2024-02-21 19:51 ` [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Shivam Kumar @ 2024-02-21 19:51 UTC (permalink / raw)
  To: maz, pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will
  Cc: kvm, Shivam Kumar

This patchset introduces a new mechanism (dirty-quota-based
throttling) to throttle the rate at which memory pages can be dirtied.
This is done by setting a limit on the number of bytes that each vCPU
is allowed to dirty before it is allocated additional quota.

This new throttling mechanism is exposed to userspace through a new
KVM capability, KVM_CAP_DIRTY_QUOTA. If this capability is enabled by
userspace, each vCPU will exit to userspace (with exit reason
KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) as soon as its dirty quota is
exhausted (in other words, a given vCPU will exit to userspace as soon
as it has dirtied as many bytes as the limit set for it). When the
vCPU exits to userspace, userspace may increase the dirty quota of the
vCPU (after optionally sleeping for an appropriate period of time) so
that it can continue dirtying more memory.
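
For illustration, the userspace side of this flow might look roughly
like the sketch below (it assumes the uAPI introduced in this series;
compute_dirty_quota() and throttle_delay_us() are hypothetical
placeholder policies, and error handling is omitted):

	/* Enable dirty-quota throttling on the VM; args[0] != 0 enables it. */
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_QUOTA,
		.args[0] = 1,
	};
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* Per-vCPU run loop; 'run' is the mmap'ed struct kvm_run. */
	run->dirty_quota_bytes = compute_dirty_quota();
	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);

		if (run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
			/*
			 * Optionally sleep to throttle this vCPU, then grant
			 * it a fresh quota and re-enter the guest.
			 */
			usleep(throttle_delay_us());
			run->dirty_quota_bytes = compute_dirty_quota();
			continue;
		}
		/* ... handle other exit reasons ... */
	}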

Dirty-quota-based throttling is a very effective choice for live
migration, for the following reasons:

1. With dirty-quota-based throttling, we can precisely set the amount
of memory we can afford to dirty for the migration to converge within
a reasonable time. This behaviour is much more effective than the
current state-of-the-art auto-converge mechanism, which implements
time-based throttling (making vCPUs sleep for some time to throttle
dirtying): some workloads can dirty a huge amount of memory even if
their vCPUs are given only a very small interval to run, causing
migrations to take longer and possibly fail to converge.

2. While the current auto-converge mechanism makes the whole VM sleep
to throttle memory dirtying, we can selectively throttle vCPUs with
dirty-quota-based throttling (i.e. only causing vCPUs that are
dirtying more than a threshold to sleep). Furthermore, if we choose
very small intervals to compute and enforce the dirty quota, we can
achieve micro-stunning (i.e. stunning the vCPUs precisely when they
are dirtying memory). Both of these behaviors help the
dirty-quota-based scheme to throttle only those vCPUs that are
dirtying memory, and only while they are dirtying it. Hence,
while the current auto-converge scheme is prone to throttling reads
and writes equally, dirty-quota-based throttling has minimal impact on
read performance.

3. Dirty-quota-based throttling can adapt quickly to changes in
network bandwidth if it is enforced in very small intervals.  In other
words, we can consider the current available network bandwidth when
computing an appropriate dirty quota for the next interval.
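
As a purely illustrative example of such a policy (it is not part of
this series), userspace could give each vCPU an equal share of the
bytes the migration stream is expected to absorb during the next
interval, based on the currently measured bandwidth:

	#include <stdint.h>

	/* Hypothetical userspace helper; names and policy are placeholders. */
	static uint64_t dirty_quota_for_next_interval(uint64_t bw_bytes_per_sec,
						      uint64_t interval_us,
						      unsigned int nr_vcpus)
	{
		uint64_t budget = bw_bytes_per_sec * interval_us / 1000000;

		return budget / nr_vcpus;
	}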

The benefits of dirty-quota-based throttling are not limited to live
migration.  The dirty-quota mechanism can also be leveraged to
support other use cases that would benefit from effective throttling
of memory writes.  The update_dirty_quota hook in the implementation
can be used outside the context of live migration, but note that such
alternative uses must also write-protect the memory.

We have evaluated dirty-quota-based throttling using two key metrics:
A. Live migration performance (time to migrate)
B. Guest performance during live migration

We have used a synthetic workload that dirties memory sequentially in
a loop. It is characterised by three variables m, n and l. A given
instance of this workload (m=x,n=y,l=z) is a workload dirtying x GB of
memory with y threads at a rate of z GBps. In the following table, b
is the network bandwidth configured for the live migration, t_curr is
the total time to migrate with the current throttling logic, and t_dq
is the total time to migrate with dirty-quota-based throttling.

    A. Live migration performance

+--------+----+----------+----------+---------------+----------+----------+
| m (GB) |  n | l (GBps) | b (MBps) |    t_curr (s) | t_dq (s) | Diff (%) |
+--------+----+----------+----------+---------------+----------+----------+
|      8 |  2 |     8.00 |      640 |         60.38 |    15.22 |     74.8 |
|     16 |  4 |     1.26 |      640 |         75.99 |    32.22 |     57.6 |
|     32 |  6 |     0.10 |      640 |         49.81 |    49.80 |      0.0 |
|     48 |  8 |     2.20 |      640 |        287.78 |   115.65 |     59.8 |
|     32 |  6 |    32.00 |      640 |        364.30 |    84.26 |     76.9 |
|      8 |  2 |     8.00 |      128 |        452.91 |    94.99 |     79.0 |
|    512 | 32 |     0.10 |      640 |        868.94 |   841.92 |      3.1 |
|     16 |  4 |     1.26 |       64 |       1538.94 |   426.21 |     72.3 |
|     32 |  6 |     1.80 |     1024 |       1406.80 |   452.82 |     67.8 |
|    512 | 32 |     7.20 |      640 |       4561.30 |   906.60 |     80.1 |
|    128 | 16 |     3.50 |      128 |       7009.98 |  1689.61 |     75.9 |
|     16 |  4 |    16.00 |       64 | "Unconverged" |   461.47 |      N/A |
|     32 |  6 |    32.00 |      128 | "Unconverged" |   454.27 |      N/A |
|    512 | 32 |   512.00 |      640 | "Unconverged" |   917.37 |      N/A |
|    128 | 16 |   128.00 |      128 | "Unconverged" |  1946.00 |      N/A |
+--------+----+----------+----------+---------------+----------+----------+

    B. Guest performance:

+---------------------+-------------------+-------------------+----------+
|        Case         | Guest Runtime (%) | Guest Runtime (%) | Diff (%) |
|                     |     (Current)     |   (Dirty Quota)   |          |
+---------------------+-------------------+-------------------+----------+
| Write-intensive     | 26.4              | 35.3              |     33.7 |
+---------------------+-------------------+-------------------+----------+
| Read-write-balanced | 40.6              | 70.8              |     74.4 |
+---------------------+-------------------+-------------------+----------+
| Read-intensive      | 63.1              | 81.8              |     29.6 |
+---------------------+-------------------+-------------------+----------+

Guest Runtime (in percentage) in the above table is the percentage of
time a guest vCPU is actually running, averaged across all vCPUs of
the guest. For B, we have run variants of the aforementioned
synthetic workload, dirtying memory sequentially in a loop on some
threads and just reading memory sequentially on the other threads. We
have also conducted similar experiments with more realistic benchmarks
and workloads, e.g. redis, and obtained similar results.

Dirty-quota-based throttling was presented in KVM Forum 2021. Please
find the details here:
https://kvmforum2021.sched.com/event/ke4A/dirty-quota-based-vm-live-migration-auto-converge-manish-mishra-shivam-kumar-nutanix-india

The current v10 patchset includes the following changes over v9:

1. Use vma_pagesize as the dirty granularity for updating dirty quota
on arm64.
2. Do not update dirty quota for instances where the hypervisor is
writing into guest memory. Accounting for these instances in vCPUs'
dirty quota is unfair to the vCPUs. Also, some of these instances,
such as record_steal_time, frequently try to redundantly mark the same
set of pages dirty again and again. To avoid these distortions, we had
previously relied on checking the dirty bitmap to avoid redundantly
updating quotas. Since we have now decoupled dirty-quota-based
throttling from the live-migration dirty-tracking path, we have
resolved this issue by simply avoiding the mis-accounting caused by
these hypervisor-induced writes to guest memory.  Through extensive
experiments, we have verified that this new approach is approximately
as effective as the prior approach that relied on checking the dirty
bitmap.

v1:
https://lore.kernel.org/kvm/20211114145721.209219-1-shivam.kumar1@xxxxxxxxxxx/
v2: https://lore.kernel.org/kvm/Ydx2EW6U3fpJoJF0@xxxxxxxxxx/T/
v3: https://lore.kernel.org/kvm/YkT1kzWidaRFdQQh@xxxxxxxxxx/T/
v4:
https://lore.kernel.org/all/20220521202937.184189-1-shivam.kumar1@xxxxxxxxxxx/
v5: https://lore.kernel.org/all/202209130532.2BJwW65L-lkp@xxxxxxxxx/T/
v6:
https://lore.kernel.org/all/20220915101049.187325-1-shivam.kumar1@xxxxxxxxxxx/
v7:
https://lore.kernel.org/all/a64d9818-c68d-1e33-5783-414e9a9bdbd1@xxxxxxxxxxx/t/
v8:
https://lore.kernel.org/all/20230225204758.17726-1-shivam.kumar1@nutanix.com/
v9:
https://lore.kernel.org/kvm/20230504144328.139462-1-shivam.kumar1@nutanix.com/

Thanks,
Shivam

Shivam Kumar (3):
  KVM: Implement dirty quota-based throttling of vcpus
  KVM: x86: Dirty quota-based throttling of vcpus
  KVM: arm64: Dirty quota-based throttling of vcpus

 Documentation/virt/kvm/api.rst | 17 +++++++++++++++++
 arch/arm64/kvm/Kconfig         |  1 +
 arch/arm64/kvm/arm.c           |  5 +++++
 arch/arm64/kvm/mmu.c           |  1 +
 arch/x86/kvm/Kconfig           |  1 +
 arch/x86/kvm/mmu/mmu.c         |  6 +++++-
 arch/x86/kvm/mmu/spte.c        |  1 +
 arch/x86/kvm/vmx/vmx.c         |  3 +++
 arch/x86/kvm/x86.c             |  6 +++++-
 include/linux/kvm_host.h       |  9 +++++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 tools/include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 27 +++++++++++++++++++++++++++
 14 files changed, 87 insertions(+), 2 deletions(-)

-- 
2.22.3



* [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2024-02-21 19:51 [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
@ 2024-02-21 19:51 ` Shivam Kumar
  2024-02-22  2:00   ` Anish Moorthy
  2024-04-16 16:59   ` Sean Christopherson
  2024-02-21 19:51 ` [PATCH v10 2/3] KVM: x86: Dirty " Shivam Kumar
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Shivam Kumar @ 2024-02-21 19:51 UTC (permalink / raw)
  To: maz, pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Define dirty_quota_bytes variable to track and throttle memory
dirtying for every vcpu. This variable stores the number of bytes the
vcpu is allowed to dirty. To dirty more, the vcpu needs to request
more quota by exiting to userspace.

Implement update_dirty_quota function which

i) Decreases dirty_quota_bytes by arch-specific page size whenever a
page is dirtied.
ii) Raises a KVM request KVM_REQ_DIRTY_QUOTA_EXIT whenever the dirty
quota is exhausted (i.e. dirty_quota_bytes <= 0).

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 Documentation/virt/kvm/api.rst | 17 +++++++++++++++++
 include/linux/kvm_host.h       |  9 +++++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 tools/include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 27 +++++++++++++++++++++++++++
 6 files changed, 65 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3ec0b7a455a0..1858db8b0698 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7031,6 +7031,23 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+	/*
+	 * Number of bytes the vCPU is allowed to dirty if KVM_CAP_DIRTY_QUOTA is
+	 * enabled. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if this quota
+	 * is exhausted, i.e. dirty_quota_bytes <= 0.
+	 */
+	long dirty_quota_bytes;
+
+Please note that enforcing the quota is best effort. Dirty quota is reduced by
+arch-specific page size when any guest page is dirtied. Also, the guest may dirty
+multiple pages before KVM can recheck the quota, e.g. when PML is enabled.
+
+::
+  };
+
+
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..994ecc4e5194 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -167,6 +167,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_VM_DEAD			(1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UNBLOCK			2
 #define KVM_REQ_DIRTY_RING_SOFT_FULL	3
+#define KVM_REQ_DIRTY_QUOTA_EXIT	4
 #define KVM_REQUEST_ARCH_BASE		8
 
 /*
@@ -831,6 +832,7 @@ struct kvm {
 	bool dirty_ring_with_bitmap;
 	bool vm_bugged;
 	bool vm_dead;
+	bool dirty_quota_enabled;
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
@@ -1291,6 +1293,13 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
+#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
+void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes);
+#else
+static inline void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes)
+{
+}
+#endif
 void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c3308536482b..217f19100003 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -210,6 +210,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -491,6 +492,12 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of bytes the vCPU is allowed to dirty if KVM_CAP_DIRTY_QUOTA is
+	 * enabled. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if this quota
+	 * is exhausted, i.e. dirty_quota_bytes <= 0.
+	 */
+	long dirty_quota_bytes;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1155,6 +1162,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_DIRTY_QUOTA 236
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index c3308536482b..cf880e26f55f 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1155,6 +1155,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_DIRTY_QUOTA 236
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 184dab4ee871..c4071cb14d15 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -22,6 +22,9 @@ config HAVE_KVM_IRQ_ROUTING
 config HAVE_KVM_DIRTY_RING
        bool
 
+config HAVE_KVM_DIRTY_QUOTA
+       bool
+
 # Only strongly ordered architectures can select this, as it doesn't
 # put any explicit constraint on userspace ordering. They can also
 # select the _ACQ_REL version.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 10bfc88a69f7..9a1e67187735 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3626,6 +3626,19 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
+void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes)
+{
+	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+	if (!vcpu || (vcpu->kvm != kvm) || !READ_ONCE(kvm->dirty_quota_enabled))
+		return;
+
+	vcpu->run->dirty_quota_bytes -= page_size_bytes;
+	if (vcpu->run->dirty_quota_bytes <= 0)
+		kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
+}
+EXPORT_SYMBOL_GPL(update_dirty_quota);
+
 void mark_page_dirty_in_slot(struct kvm *kvm,
 			     const struct kvm_memory_slot *memslot,
 		 	     gfn_t gfn)
@@ -3656,6 +3669,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = gfn_to_memslot(kvm, gfn);
+	update_dirty_quota(kvm, PAGE_SIZE);
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(mark_page_dirty);
@@ -3665,6 +3679,7 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
 	struct kvm_memory_slot *memslot;
 
 	memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	update_dirty_quota(vcpu->kvm, PAGE_SIZE);
 	mark_page_dirty_in_slot(vcpu->kvm, memslot, gfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
@@ -4877,6 +4892,8 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
+	case KVM_CAP_DIRTY_QUOTA:
+		return !!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA);
 	default:
 		break;
 	}
@@ -5027,6 +5044,16 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 
 		return r;
 	}
+	case KVM_CAP_DIRTY_QUOTA: {
+		int r = -EINVAL;
+
+		if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA)) {
+			WRITE_ONCE(kvm->dirty_quota_enabled, cap->args[0]);
+			r = 0;
+		}
+
+		return r;
+	}
 	default:
 		return kvm_vm_ioctl_enable_cap(kvm, cap);
 	}
-- 
2.22.3



* [PATCH v10 2/3] KVM: x86: Dirty quota-based throttling of vcpus
  2024-02-21 19:51 [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
  2024-02-21 19:51 ` [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
@ 2024-02-21 19:51 ` Shivam Kumar
  2024-04-16 17:44   ` Sean Christopherson
  2024-02-21 19:51 ` [PATCH v10 3/3] KVM: arm64: " Shivam Kumar
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Shivam Kumar @ 2024-02-21 19:51 UTC (permalink / raw)
  To: maz, pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Call update_dirty_quota with the appropriate arch-specific page size
whenever a page is marked dirty. Process the KVM request
KVM_REQ_DIRTY_QUOTA_EXIT (raised by update_dirty_quota) to exit to
userspace with exit reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/x86/kvm/Kconfig    | 1 +
 arch/x86/kvm/mmu/mmu.c  | 6 +++++-
 arch/x86/kvm/mmu/spte.c | 1 +
 arch/x86/kvm/vmx/vmx.c  | 3 +++
 arch/x86/kvm/x86.c      | 6 +++++-
 5 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 87e3da7b0439..791456233f28 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
+	select HAVE_KVM_DIRTY_QUOTA
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..fa0b3853ee31 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
 	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
 		return false;
 
-	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
+	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
+		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
 		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);
+	}
 
 	return true;
 }
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..550f9c1d03af 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -241,6 +241,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
 		WARN_ON_ONCE(level > PG_LEVEL_4K);
+		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(level)));
 		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1111d9d08903..e2f8764c16ff 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5864,6 +5864,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 		 */
 		if (__xfer_to_guest_mode_work_pending())
 			return 1;
+
+		if (kvm_test_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu))
+			return 1;
 	}
 
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48a61d283406..4f36c0efb542 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10829,7 +10829,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = 0;
 			goto out;
 		}
-
+		if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+			r = 0;
+			goto out;
+		}
 		/*
 		 * KVM_REQ_HV_STIMER has to be processed after
 		 * KVM_REQ_CLOCK_UPDATE, because Hyper-V SynIC timers
-- 
2.22.3



* [PATCH v10 3/3] KVM: arm64: Dirty quota-based throttling of vcpus
  2024-02-21 19:51 [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
  2024-02-21 19:51 ` [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
  2024-02-21 19:51 ` [PATCH v10 2/3] KVM: x86: Dirty " Shivam Kumar
@ 2024-02-21 19:51 ` Shivam Kumar
  2024-03-21  5:48 ` [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
  2024-04-16 17:44 ` Sean Christopherson
  4 siblings, 0 replies; 14+ messages in thread
From: Shivam Kumar @ 2024-02-21 19:51 UTC (permalink / raw)
  To: maz, pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Call update_dirty_quota with the appropriate arch-specific page size
whenever a page is marked dirty. Process the KVM request
KVM_REQ_DIRTY_QUOTA_EXIT (raised by update_dirty_quota) to exit to
userspace with exit reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/arm64/kvm/Kconfig | 1 +
 arch/arm64/kvm/arm.c   | 5 +++++
 arch/arm64/kvm/mmu.c   | 1 +
 3 files changed, 7 insertions(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 27ca89b628a0..f66d872d0830 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -39,6 +39,7 @@ menuconfig KVM
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
 	select XARRAY_MULTI
+	select HAVE_KVM_DIRTY_QUOTA
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a25265aca432..dde02c372551 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -872,6 +872,11 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_dirty_ring_check_request(vcpu))
 			return 0;
+
+		if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d14504821b79..77088bf9a502 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1579,6 +1579,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret) {
 		kvm_set_pfn_dirty(pfn);
+		update_dirty_quota(kvm, vma_pagesize);
 		mark_page_dirty_in_slot(kvm, memslot, gfn);
 	}
 
-- 
2.22.3



* Re: [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2024-02-21 19:51 ` [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
@ 2024-02-22  2:00   ` Anish Moorthy
  2024-04-16 16:52     ` Sean Christopherson
  2024-04-16 16:59   ` Sean Christopherson
  1 sibling, 1 reply; 14+ messages in thread
From: Anish Moorthy @ 2024-02-22  2:00 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: maz, pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will, kvm, Shaju Abraham, Manish Mishra,
	Anurag Madnawat

I just saw this on the mailing list and had a couple of minor thoughts;
apologies if I'm contradicting any of the feedback you've received on
previous versions.

On Wed, Feb 21, 2024 at 12:01 PM Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
>
> Define dirty_quota_bytes variable to track and throttle memory
> dirtying for every vcpu. This variable stores the number of bytes the
> vcpu is allowed to dirty. To dirty more, the vcpu needs to request
> more quota by exiting to userspace.
>
> Implement update_dirty_quota function which

Tiny nit, but can we just rename this to "reduce_dirty_quota"? It's
easy to see what an "update" is, but might as well make it even
clearer.

> +#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> +void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes);
> +#else
> +static inline void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes)
> +{
> +}
> +#endif

Is there a reason to #ifdef like this instead of just having a single
definition and doing

> void update_dirty_quota(,,,) {
>     if (!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_QUOTA)) return;
>     // actual body here
> }

in the body? I figure the compiler elides the no-op call, though I've
never bothered to check...

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 10bfc88a69f7..9a1e67187735 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3626,6 +3626,19 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
>  }
>  EXPORT_SYMBOL_GPL(kvm_clear_guest);
>
> +void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes)
> +{
> +       struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

Can we just make update_dirty_quota() take a kvm_vcpu* instead of a
kvm* as its first parameter? Since the quota is per-vcpu, that seems
to make sense, and most of the callers of this function look like

> update_dirty_quota(vcpu->kvm, some_size_here);

anyways. The only one that's not is the addition in mark_page_dirty()

>  void mark_page_dirty_in_slot(struct kvm *kvm,
>                              const struct kvm_memory_slot *memslot,
>                              gfn_t gfn)
> @@ -3656,6 +3669,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
>         struct kvm_memory_slot *memslot;
>
>         memslot = gfn_to_memslot(kvm, gfn);
> +       update_dirty_quota(kvm, PAGE_SIZE);
>         mark_page_dirty_in_slot(kvm, memslot, gfn);
>  }

Is mark_page_dirty() allowed to be used outside of a vCPU context? The
lack of a vcpu* makes me think it is -- I assume we don't want to charge
vCPUs for accesses they're not making.

Unfortunately we do seem to use it *in* vCPU contexts (see
kvm_update_stolen_time() on arm64?), although not on x86 AFAICT.


* Re: [PATCH v10 0/3] Per-vCPU dirty quota-based throttling
  2024-02-21 19:51 [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
                   ` (2 preceding siblings ...)
  2024-02-21 19:51 ` [PATCH v10 3/3] KVM: arm64: " Shivam Kumar
@ 2024-03-21  5:48 ` Shivam Kumar
  2024-04-04  9:19   ` Marc Zyngier
  2024-04-16 17:44 ` Sean Christopherson
  4 siblings, 1 reply; 14+ messages in thread
From: Shivam Kumar @ 2024-03-21  5:48 UTC (permalink / raw)
  To: maz, pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, Aravind Retnakaran,
	Carl Waldspurger [C],
	David Vrabel, david, will
  Cc: kvm


> On 22-Feb-2024, at 1:22 AM, Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> 
> The current v10 patchset includes the following changes over v9:
> 
> 1. Use vma_pagesize as the dirty granularity for updating dirty quota
> on arm64.
> 2. Do not update dirty quota for instances where the hypervisor is
> writing into guest memory. Accounting for these instances in vCPUs'
> dirty quota is unfair to the vCPUs. Also, some of these instances,
> such as record_steal_time, frequently try to redundantly mark the same
> set of pages dirty again and again. To avoid these distortions, we had
> previously relied on checking the dirty bitmap to avoid redundantly
> updating quotas. Since we have now decoupled dirty-quota-based
> throttling from the live-migration dirty-tracking path, we have
> resolved this issue by simply avoiding the mis-accounting caused by
> these hypervisor-induced writes to guest memory.  Through extensive
> experiments, we have verified that this new approach is approximately
> as effective as the prior approach that relied on checking the dirty
> bitmap.
> 

Hi Marc,

I’ve tried my best to address all the concerns raised in the previous patchset. I’d really appreciate it if you could share your thoughts and any feedback you might have on this one.

Thanks,
Shivam


* Re: [PATCH v10 0/3] Per-vCPU dirty quota-based throttling
  2024-03-21  5:48 ` [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
@ 2024-04-04  9:19   ` Marc Zyngier
  2024-04-18 10:46     ` Shivam Kumar
  0 siblings, 1 reply; 14+ messages in thread
From: Marc Zyngier @ 2024-04-04  9:19 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, Aravind Retnakaran,
	Carl Waldspurger [C],
	David Vrabel, david, will, kvm

On Thu, 21 Mar 2024 05:48:01 +0000,
Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> 
> 
> > On 22-Feb-2024, at 1:22 AM, Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> > 
> > The current v10 patchset includes the following changes over v9:
> > 
> > 1. Use vma_pagesize as the dirty granularity for updating dirty quota
> > on arm64.
> > 2. Do not update dirty quota for instances where the hypervisor is
> > writing into guest memory. Accounting for these instances in vCPUs'
> > dirty quota is unfair to the vCPUs. Also, some of these instances,
> > such as record_steal_time, frequently try to redundantly mark the same
> > set of pages dirty again and again. To avoid these distortions, we had
> > previously relied on checking the dirty bitmap to avoid redundantly
> > updating quotas. Since we have now decoupled dirty-quota-based
> > throttling from the live-migration dirty-tracking path, we have
> > resolved this issue by simply avoiding the mis-accounting caused by
> > these hypervisor-induced writes to guest memory.  Through extensive
> > experiments, we have verified that this new approach is approximately
> > as effective as the prior approach that relied on checking the dirty
> > bitmap.
> > 
> 
> Hi Marc,
> 
> I’ve tried my best to address all the concerns raised in the
> previous patchset. I’d really appreciate it if you could share your
> thoughts and any feedback you might have on this one.

I'll get to it at some point. However, given that it has taken you the
best part of a year to respin this, I need to page it all back in,
which is going to take a bit of time as well.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2024-02-22  2:00   ` Anish Moorthy
@ 2024-04-16 16:52     ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-04-16 16:52 UTC (permalink / raw)
  To: Anish Moorthy
  Cc: Shivam Kumar, maz, pbonzini, james.morse, suzuki.poulose,
	oliver.upton, yuzenghui, catalin.marinas, aravind.retnakaran,
	carl.waldspurger, david.vrabel, david, will, kvm, Shaju Abraham,
	Manish Mishra, Anurag Madnawat

On Wed, Feb 21, 2024, Anish Moorthy wrote:
> > @@ -3656,6 +3669,7 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> >         struct kvm_memory_slot *memslot;
> >
> >         memslot = gfn_to_memslot(kvm, gfn);
> > +       update_dirty_quota(kvm, PAGE_SIZE);
> >         mark_page_dirty_in_slot(kvm, memslot, gfn);
> >  }
> 
> Is mark_page_dirty() allowed to be used outside of a vCPU context?

It's allowed, but only because we don't have a better option, i.e. it's more
tolerated than allowed. :-)

> The lack of a vcpu* makes me think it is- I assume we don't want to charge
> vCPUs for accesses they're not making.
> 
> Unfortunately we do seem to use it *in* vCPU contexts (see
> kvm_update_stolen_time() on arm64?), although not on x86 AFAICT.

Use what?  mark_page_dirty_in_slot()?  x86 _only_ uses it from vCPU context.


* Re: [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2024-02-21 19:51 ` [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
  2024-02-22  2:00   ` Anish Moorthy
@ 2024-04-16 16:59   ` Sean Christopherson
  2024-04-18 10:36     ` Shivam Kumar
  1 sibling, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-04-16 16:59 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: maz, pbonzini, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will, kvm, Shaju Abraham, Manish Mishra,
	Anurag Madnawat

On Wed, Feb 21, 2024, Shivam Kumar wrote:
> @@ -1291,6 +1293,13 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
>  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
>  bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
>  unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
> +#ifdef CONFIG_HAVE_KVM_DIRTY_QUOTA
> +void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes);
> +#else
> +static inline void update_dirty_quota(struct kvm *kvm, unsigned long page_size_bytes)
> +{
> +}
> +#endif
>  void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
>  void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
>  
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index c3308536482b..217f19100003 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -210,6 +210,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_NOTIFY           37
>  #define KVM_EXIT_LOONGARCH_IOCSR  38
>  #define KVM_EXIT_MEMORY_FAULT     39
> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 40
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -491,6 +492,12 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +	/*
> +	 * Number of bytes the vCPU is allowed to dirty if KVM_CAP_DIRTY_QUOTA is
> +	 * enabled. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if this quota
> +	 * is exhausted, i.e. dirty_quota_bytes <= 0.
> +	 */
> +	long dirty_quota_bytes;

This needs to be a u64 so that the size is consistent for 32-bit and 64-bit
userspace vs. kernel.


* Re: [PATCH v10 2/3] KVM: x86: Dirty quota-based throttling of vcpus
  2024-02-21 19:51 ` [PATCH v10 2/3] KVM: x86: Dirty " Shivam Kumar
@ 2024-04-16 17:44   ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-04-16 17:44 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: maz, pbonzini, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will, kvm, Shaju Abraham, Manish Mishra,
	Anurag Madnawat

On Wed, Feb 21, 2024, Shivam Kumar wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2d6cdeab1f8a..fa0b3853ee31 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3397,8 +3397,12 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
>  	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
>  		return false;
>  
> -	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
> +	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
> +		struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +		update_dirty_quota(vcpu->kvm, (1L << SPTE_LEVEL_SHIFT(sp->role.level)));
>  		mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);

Forcing KVM to manually call update_dirty_quota() whenever mark_page_dirty_in_slot()
is invoked is not maintainable, as we inevitably will forget to update the quota
and probably not notice.  We've already had bugs escape where KVM fails to mark
gfns dirty, and those flows are much more testable.

Stepping back, I feel like this series has gone off the rails a bit.
 
I understand Marc's objections to the uAPI not differentiating between page sizes,
but simply updating the quota based on KVM's page size is also flawed.  E.g. if
the guest is backed with 1GiB pages, odds are very good that the dirty quotas are
going to be completely out of whack due to the first vCPU that writes a given 1GiB
region being charged with the entire 1GiB page.

And without a way to trigger detection of writes, e.g. by enabling PML or write-
protecting memory, I don't see how userspace can build anything on the "bytes
dirtied" information.

From v7[*], Marc was specifically objecting to the proposed API effectively being
presented as a general purpose API, but in reality the API was heavily reliant
on dirty logging being enabled.

 : My earlier comments still stand: the proposed API is not usable as a
 : general purpose memory-tracking API because it counts faults instead
 : of memory, making it inadequate except for the most trivial cases.
 : And I cannot believe you were serious when you mentioned that you were
 : happy to make that the API.

To avoid going in circles, I think we need to first agree on the scope of the uAPI.
Specifically, do we want to shoot for a generic write-tracking API, or do we want
something that is explicitly tied to dirty logging?


Marc,

If we figured out a clean-ish way to tie the "gfns dirtied" information to
dirty logging, i.e. didn't misconstrue the counts as generally useful data, would
that be acceptable?  While I like the idea of a generic solution, I don't see a
path to an implementation that isn't deeply flawed without basically doing dirty
logging, i.e. without forcing the use of non-huge pages and write-protecting memory
to intercept "new" writes based on input from userspace.

[*] https://lore.kernel.org/all/20221113170507.208810-2-shivam.kumar1@nutanix.com


* Re: [PATCH v10 0/3] Per-vCPU dirty quota-based throttling
  2024-02-21 19:51 [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
                   ` (3 preceding siblings ...)
  2024-03-21  5:48 ` [PATCH v10 0/3] Per-vCPU dirty quota-based throttling Shivam Kumar
@ 2024-04-16 17:44 ` Sean Christopherson
  2024-04-18 10:42   ` Shivam Kumar
  4 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-04-16 17:44 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: maz, pbonzini, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, aravind.retnakaran, carl.waldspurger,
	david.vrabel, david, will, kvm

On Wed, Feb 21, 2024, Shivam Kumar wrote:
> v1:
> https://lore.kernel.org/kvm/20211114145721.209219-1-shivam.kumar1@xxxxxxxxxxx/
> v2: https://lore.kernel.org/kvm/Ydx2EW6U3fpJoJF0@xxxxxxxxxx/T/
> v3: https://lore.kernel.org/kvm/YkT1kzWidaRFdQQh@xxxxxxxxxx/T/
> v4:
> https://lore.kernel.org/all/20220521202937.184189-1-shivam.kumar1@xxxxxxxxxxx/
> v5: https://lore.kernel.org/all/202209130532.2BJwW65L-lkp@xxxxxxxxx/T/
> v6:
> https://lore.kernel.org/all/20220915101049.187325-1-shivam.kumar1@xxxxxxxxxxx/
> v7:
> https://lore.kernel.org/all/a64d9818-c68d-1e33-5783-414e9a9bdbd1@xxxxxxxxxxx/t/

These links are all busted, which was actually quite annoying because I wanted to
go back and look at Marc's input.

> v8:
> https://lore.kernel.org/all/20230225204758.17726-1-shivam.kumar1@nutanix.com/
> v9:
> https://lore.kernel.org/kvm/20230504144328.139462-1-shivam.kumar1@nutanix.com/


* Re: [PATCH v10 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2024-04-16 16:59   ` Sean Christopherson
@ 2024-04-18 10:36     ` Shivam Kumar
  0 siblings, 0 replies; 14+ messages in thread
From: Shivam Kumar @ 2024-04-18 10:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: maz, pbonzini, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, Aravind Retnakaran,
	Carl Waldspurger [C],
	David Vrabel, david, will, kvm, Shaju Abraham, Manish Mishra,
	Anurag Madnawat


> On 16-Apr-2024, at 10:29 PM, Sean Christopherson <seanjc@google.com> wrote:
> On Wed, Feb 21, 2024, Shivam Kumar wrote:
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index c3308536482b..217f19100003 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -210,6 +210,7 @@ struct kvm_xen_exit {
>> #define KVM_EXIT_NOTIFY           37
>> #define KVM_EXIT_LOONGARCH_IOCSR  38
>> #define KVM_EXIT_MEMORY_FAULT     39
>> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 40
>> 
>> /* For KVM_EXIT_INTERNAL_ERROR */
>> /* Emulate instruction failed. */
>> @@ -491,6 +492,12 @@ struct kvm_run {
>> 		struct kvm_sync_regs regs;
>> 		char padding[SYNC_REGS_SIZE_BYTES];
>> 	} s;
>> +	/*
>> +	 * Number of bytes the vCPU is allowed to dirty if KVM_CAP_DIRTY_QUOTA is
>> +	 * enabled. KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if this quota
>> +	 * is exhausted, i.e. dirty_quota_bytes <= 0.
>> +	 */
>> +	long dirty_quota_bytes;
> 
> This needs to be a u64 so that the size is consistent for 32-bit and 64-bit
> userspace vs. kernel.
Ack.

Thanks,
Shivam.



* Re: [PATCH v10 0/3] Per-vCPU dirty quota-based throttling
  2024-04-16 17:44 ` Sean Christopherson
@ 2024-04-18 10:42   ` Shivam Kumar
  0 siblings, 0 replies; 14+ messages in thread
From: Shivam Kumar @ 2024-04-18 10:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: maz, pbonzini, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, Aravind Retnakaran,
	Carl Waldspurger [C],
	David Vrabel, david, will, kvm



> On 16-Apr-2024, at 11:14 PM, Sean Christopherson <seanjc@google.com> wrote:
> On Wed, Feb 21, 2024, Shivam Kumar wrote:
>> v1:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_kvm_20211114145721.209219-2D1-2Dshivam.kumar1-40xxxxxxxxxxx_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=buLjKsfeC2-NhTOg3Gq9bQJg9XFUMlvJsi6vYIiVI9k&e= 
>> v2: https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_kvm_Ydx2EW6U3fpJoJF0-40xxxxxxxxxx_T_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=UUUIpjYiKj6G3_SlR40R9KS6UmuIlLU089Ai6SdPrC8&e= 
>> v3: https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_kvm_YkT1kzWidaRFdQQh-40xxxxxxxxxx_T_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=oQqOZNHdDOMAEkLEKPjwiffKaQdK3T4kZf_DRRUTuxo&e= 
>> v4:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_all_20220521202937.184189-2D1-2Dshivam.kumar1-40xxxxxxxxxxx_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=4fJ-Dzy7gsEnExqmGF0nP8K41YdVWUC3v9urCMn8RQI&e= 
>> v5: https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_all_202209130532.2BJwW65L-2Dlkp-40xxxxxxxxx_T_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=5GXvSQngNeqX62nS-3Yve0-bCtHxKYLFfl4AZiFO-u0&e= 
>> v6:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_all_20220915101049.187325-2D1-2Dshivam.kumar1-40xxxxxxxxxxx_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=S8mqK70ZETRAaQ0pmpYz9fzoJDYcDVMSgMtcUmCL4fE&e= 
>> v7:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_all_a64d9818-2Dc68d-2D1e33-2D5783-2D414e9a9bdbd1-40xxxxxxxxxxx_t_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=4hVFP4-J13xyn-OcN0apTCh8iKZRosf5OJTQePXBMB8&m=npf2bNeivHu5BXcy66M81khdW0sy4qDh5d4kC_VThlzr1X2JvYVuDHMBYmNYzXMM&s=R9mCz9k87Sbv1QYREMeuD4l9fH-duqb1RInN3lmRBeo&e= 
> 
> These links are all busted, which was actually quite annoying because I wanted to
> go back and look at Marc's input.
Extremely sorry about that. Will fix them. I didn’t realise this when I copied the links from the previous patch.

Thanks,
Shivam


* Re: [PATCH v10 0/3] Per-vCPU dirty quota-based throttling
  2024-04-04  9:19   ` Marc Zyngier
@ 2024-04-18 10:46     ` Shivam Kumar
  0 siblings, 0 replies; 14+ messages in thread
From: Shivam Kumar @ 2024-04-18 10:46 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: pbonzini, seanjc, james.morse, suzuki.poulose, oliver.upton,
	yuzenghui, catalin.marinas, Aravind Retnakaran,
	Carl Waldspurger [C],
	David Vrabel, david, will, kvm


> On 04-Apr-2024, at 2:49 PM, Marc Zyngier <maz@kernel.org> wrote:
> On Thu, 21 Mar 2024 05:48:01 +0000,
> Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
>> 
>> 
>>> On 22-Feb-2024, at 1:22 AM, Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
>>> 
>>> The current v10 patchset includes the following changes over v9:
>>> 
>>> 1. Use vma_pagesize as the dirty granularity for updating dirty quota
>>> on arm64.
>>> 2. Do not update dirty quota for instances where the hypervisor is
>>> writing into guest memory. Accounting for these instances in vCPUs'
>>> dirty quota is unfair to the vCPUs. Also, some of these instances,
>>> such as record_steal_time, frequently try to redundantly mark the same
>>> set of pages dirty again and again. To avoid these distortions, we had
>>> previously relied on checking the dirty bitmap to avoid redundantly
>>> updating quotas. Since we have now decoupled dirty-quota-based
>>> throttling from the live-migration dirty-tracking path, we have
>>> resolved this issue by simply avoiding the mis-accounting caused by
>>> these hypervisor-induced writes to guest memory.  Through extensive
>>> experiments, we have verified that this new approach is approximately
>>> as effective as the prior approach that relied on checking the dirty
>>> bitmap.
>>> 
>> 
>> Hi Marc,
>> 
>> I’ve tried my best to address all the concerns raised in the
>> previous patchset. I’d really appreciate it if you could share your
>> thoughts and any feedback you might have on this one.
> 
> I'll get to it at some point. However, given that it has taken you the
> best part of a year to respin this, I need to page it all back in,
> which is going to take a bit of time as well.
> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
> 
No problem. Thank you for acknowledging.

Thanks,
Shivam


