* [PATCH v4 0/4] KVM: Dirty quota-based throttling
@ 2022-05-21 20:29 Shivam Kumar
  2022-05-21 20:29 ` [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Shivam Kumar @ 2022-05-21 20:29 UTC (permalink / raw)
  To: pbonzini, seanjc, maz, james.morse, borntraeger, david; +Cc: kvm, Shivam Kumar

This is v4 of the dirty quota series, with the following changes over v3:

i) Changes in documentation based on the feedback received.
ii) Handling some corner cases, e.g. adding a dirty quota check in VMX's
big emulation loop for invalid guest state when unrestricted guest is
disabled.
iii) Moving the s390x and arm64 changes to separate commits.

v1:
https://lore.kernel.org/kvm/20211114145721.209219-1-shivam.kumar1@nutanix.com/
v2: https://lore.kernel.org/kvm/Ydx2EW6U3fpJoJF0@google.com/T/
v3: https://lore.kernel.org/kvm/YkT1kzWidaRFdQQh@google.com/T/

Shivam Kumar (4):
  KVM: Implement dirty quota-based throttling of vcpus
  KVM: arm64: Dirty quota-based throttling of vcpus
  KVM: s390x: Dirty quota-based throttling of vcpus
  KVM: selftests: Add selftests for dirty quota throttling

 Documentation/virt/kvm/api.rst                | 32 ++++++++++++++++
 arch/arm64/kvm/arm.c                          |  3 ++
 arch/s390/kvm/kvm-s390.c                      |  3 ++
 arch/x86/kvm/mmu/spte.c                       |  4 +-
 arch/x86/kvm/vmx/vmx.c                        |  3 ++
 arch/x86/kvm/x86.c                            |  4 ++
 include/linux/kvm_host.h                      | 15 ++++++++
 include/linux/kvm_types.h                     |  1 +
 include/uapi/linux/kvm.h                      | 12 ++++++
 tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
 .../selftests/kvm/include/kvm_util_base.h     |  4 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |  7 +++-
 13 files changed, 154 insertions(+), 7 deletions(-)

-- 
2.22.3


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-21 20:29 [PATCH v4 0/4] KVM: Dirty quota-based throttling Shivam Kumar
@ 2022-05-21 20:29 ` Shivam Kumar
  2022-05-24  7:11   ` Marc Zyngier
  2022-05-21 20:29 ` [PATCH v4 2/4] KVM: arm64: Dirty " Shivam Kumar
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Shivam Kumar @ 2022-05-21 20:29 UTC (permalink / raw)
  To: pbonzini, seanjc, maz, james.morse, borntraeger, david
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Define variables to track and throttle memory dirtying for every vcpu.

dirty_count:    Number of pages the vcpu has dirtied since its creation,
                while dirty logging is enabled.
dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
                more, it needs to request more quota by exiting to
                userspace.

Implement the flow for throttling based on dirty quota.

i) Increment dirty_count for the vcpu whenever it dirties a page.
ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
count equals/exceeds dirty quota) to request more dirty quota.
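
The check described in steps (i) and (ii) can be sketched as a small
standalone helper (the struct and helper names here are hypothetical; the
in-kernel version operates on vcpu->run->dirty_quota and
vcpu->stat.generic.pages_dirtied):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone mirror of the per-vCPU state used by the check. */
struct vcpu_quota {
	uint64_t dirty_quota;	/* 0 means throttling is disabled */
	uint64_t pages_dirtied;	/* pages dirtied since vCPU creation */
};

/*
 * Sketch of kvm_vcpu_check_dirty_quota(): returns 1 if the vCPU may keep
 * running, 0 if it must exit to userspace to request more quota (i.e. the
 * dirty count has reached or exceeded the quota).
 */
static int check_dirty_quota(const struct vcpu_quota *v)
{
	if (!v->dirty_quota || v->pages_dirtied < v->dirty_quota)
		return 1;
	return 0;
}
```

A quota of 0 disables the check entirely, which is why the series treats a
zero dirty_quota as "throttling off" rather than "no dirtying allowed".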

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/spte.c        |  4 ++--
 arch/x86/kvm/vmx/vmx.c         |  3 +++
 arch/x86/kvm/x86.c             |  4 ++++
 include/linux/kvm_host.h       | 15 +++++++++++++++
 include/linux/kvm_types.h      |  1 +
 include/uapi/linux/kvm.h       | 12 ++++++++++++
 virt/kvm/kvm_main.c            |  7 ++++++-
 8 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f3172376ec3..a9317ed31d06 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
+If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU
+has exhausted its dirty quota. The 'dirty_quota_exit' member of the kvm_run
+structure makes the following information available to userspace:
+	'count' field: the current count of pages dirtied by the VCPU, which can
+	be skewed based on the size of the pages accessed by each vCPU.
+	'quota' field: the observed dirty quota just before the exit to userspace.
+Userspace can design a strategy to distribute the overall dirtying budget for
+the VM among the vcpus. Based on the strategy and the current state of dirty
+quota throttling, userspace can decide to either update (increase) the quota
+or to put the VCPU to sleep for some time.
+
 ::
 
 		/* Fix the size of the union. */
@@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
 ::
 
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
+	 * is reached/exceeded.
+	 */
+	__u64 dirty_quota;
+Please note that enforcing the quota is best effort, as the guest may dirty
+multiple pages before KVM can recheck the quota.  However, unless KVM is using
+a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
+KVM will detect quota exhaustion within a handful of dirtied pages.  If a
+hardware ring buffer is used, the overrun is bounded by the size of the buffer
+(512 entries for PML).
+
+::
   };
 
 
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 73cfe62fdad1..01f0d2a04796 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
 		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
 
-	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
+	if (spte & PT_WRITABLE_MASK) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
-		WARN_ON(level > PG_LEVEL_4K);
+		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));
 		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b730d799c26e..5cbe4992692a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 		 */
 		if (__xfer_to_guest_mode_work_pending())
 			return 1;
+
+		if (!kvm_vcpu_check_dirty_quota(vcpu))
+			return 0;
 	}
 
 	return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb4029660bd9..0b35b8cc0274 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
+		r = kvm_vcpu_check_dirty_quota(vcpu);
+		if (!r)
+			break;
+
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
 		} else {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f11039944c08..ca1ac970a6cf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
 }
 
+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	u64 dirty_quota = READ_ONCE(run->dirty_quota);
+	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
+
+	if (!dirty_quota || (pages_dirtied < dirty_quota))
+		return 1;
+
+	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+	run->dirty_quota_exit.count = pages_dirtied;
+	run->dirty_quota_exit.quota = dirty_quota;
+	return 0;
+}
+
 /*
  * Some of the bitops functions do not support too long bitmaps.
  * This number must be determined not to exceed such limits.
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index dceac12c1ce5..7f42486b0405 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
 	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
 	u64 blocking;
+	u64 pages_dirtied;
 };
 
 #define KVM_STATS_NAME_SIZE	48
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 507ee1f2aa96..1d9531efe1fb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -487,6 +488,11 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -508,6 +514,12 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * liftime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
+	 * quota is reached/exceeded.
+	 */
+	__u64 dirty_quota;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0afc016cc54d..041ab464405d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 		return;
 #endif
 
-	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
+	if (!memslot)
+		return;
+
+	vcpu->stat.generic.pages_dirtied++;
+
+	if (kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-- 
2.22.3



* [PATCH v4 2/4] KVM: arm64: Dirty quota-based throttling of vcpus
  2022-05-21 20:29 [PATCH v4 0/4] KVM: Dirty quota-based throttling Shivam Kumar
  2022-05-21 20:29 ` [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
@ 2022-05-21 20:29 ` Shivam Kumar
  2022-05-24  7:14   ` Marc Zyngier
  2022-05-21 20:29 ` [PATCH v4 3/4] KVM: s390x: " Shivam Kumar
  2022-05-21 20:29 ` [PATCH v4 4/4] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
  3 siblings, 1 reply; 24+ messages in thread
From: Shivam Kumar @ 2022-05-21 20:29 UTC (permalink / raw)
  To: pbonzini, seanjc, maz, james.morse, borntraeger, david
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Exit to userspace whenever the dirty quota is exhausted (i.e. dirty count
equals/exceeds dirty quota) to request more dirty quota.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/arm64/kvm/arm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ecc5958e27fe..5b6a239b83a5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -848,6 +848,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	ret = 1;
 	run->exit_reason = KVM_EXIT_UNKNOWN;
 	while (ret > 0) {
+		ret = kvm_vcpu_check_dirty_quota(vcpu);
+		if (!ret)
+			break;
 		/*
 		 * Check conditions before entering the guest
 		 */
-- 
2.22.3



* [PATCH v4 3/4] KVM: s390x: Dirty quota-based throttling of vcpus
  2022-05-21 20:29 [PATCH v4 0/4] KVM: Dirty quota-based throttling Shivam Kumar
  2022-05-21 20:29 ` [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
  2022-05-21 20:29 ` [PATCH v4 2/4] KVM: arm64: Dirty " Shivam Kumar
@ 2022-05-21 20:29 ` Shivam Kumar
  2022-05-21 20:29 ` [PATCH v4 4/4] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
  3 siblings, 0 replies; 24+ messages in thread
From: Shivam Kumar @ 2022-05-21 20:29 UTC (permalink / raw)
  To: pbonzini, seanjc, maz, james.morse, borntraeger, david
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Exit to userspace whenever the dirty quota is exhausted (i.e. dirty count
equals/exceeds dirty quota) to request more dirty quota.

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/s390/kvm/kvm-s390.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2296b1ff1e02..9cc0e0583ef4 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3994,6 +3994,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu)
 static int vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	int rc, cpuflags;
+	rc = kvm_vcpu_check_dirty_quota(vcpu);
+	if (!rc)
+		return -EREMOTE;
 
 	/*
 	 * On s390 notifications for arriving pages will be delivered directly
-- 
2.22.3



* [PATCH v4 4/4] KVM: selftests: Add selftests for dirty quota throttling
  2022-05-21 20:29 [PATCH v4 0/4] KVM: Dirty quota-based throttling Shivam Kumar
                   ` (2 preceding siblings ...)
  2022-05-21 20:29 ` [PATCH v4 3/4] KVM: s390x: " Shivam Kumar
@ 2022-05-21 20:29 ` Shivam Kumar
  3 siblings, 0 replies; 24+ messages in thread
From: Shivam Kumar @ 2022-05-21 20:29 UTC (permalink / raw)
  To: pbonzini, seanjc, maz, james.morse, borntraeger, david
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Add selftests for dirty quota throttling with an optional -q parameter
to configure the value by which the dirty quota is incremented after
each dirty quota exit. With very small intervals, a smaller dirty quota
value can ensure that the dirty quota exit path is exercised. A zero
value disables dirty quota throttling, so that dirty logging without
dirty quota throttling can also be tested.
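
The per-exit invariants the selftest checks, and the refill policy it
applies, can be sketched standalone (helper names here are illustrative;
the 512-entry PML bound matches the PML_BUFFER_SIZE constant introduced
in the patch below):

```c
#include <assert.h>
#include <stdint.h>

/* Intel PML can dirty up to 512 pages past the quota before KVM
 * notices exhaustion; mirrors PML_BUFFER_SIZE in the patch. */
#define PML_BUFFER_SIZE 512

/*
 * Sketch of the invariant asserted on each dirty-quota exit: the dirty
 * count must have reached the quota, but may overrun it by at most the
 * PML buffer size.  Returns 1 if the exit state is plausible.
 */
static int dirty_quota_exit_is_valid(uint64_t count, uint64_t quota)
{
	return count >= quota && count <= quota + PML_BUFFER_SIZE;
}

/* Refill policy used by the test: let the vCPU dirty `increment` more
 * pages before the next exit. */
static uint64_t refill_dirty_quota(uint64_t count, uint64_t increment)
{
	return count + increment;
}
```

On PML-capable hardware the count may legitimately overrun the quota,
which is why the upper bound is quota + 512 rather than quota itself.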

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
 .../selftests/kvm/include/kvm_util_base.h     |  4 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
 3 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3fcd89e195c7..e75d826e21fb 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -65,6 +65,8 @@
 
 #define SIG_IPI SIGUSR1
 
+#define TEST_DIRTY_QUOTA_INCREMENT		8
+
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
  * sync_global_to/from_guest() are used when accessing from
@@ -191,6 +193,7 @@ static enum log_mode_t host_log_mode_option = LOG_MODE_ALL;
 static enum log_mode_t host_log_mode;
 static pthread_t vcpu_thread;
 static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
+static uint64_t test_dirty_quota_increment = TEST_DIRTY_QUOTA_INCREMENT;
 
 static void vcpu_kick(void)
 {
@@ -210,6 +213,13 @@ static void sem_wait_until(sem_t *sem)
 	while (ret == -1 && errno == EINTR);
 }
 
+static void set_dirty_quota(struct kvm_vm *vm, uint64_t dirty_quota)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	vcpu_set_dirty_quota(run, dirty_quota);
+}
+
 static bool clear_log_supported(void)
 {
 	return kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
@@ -260,9 +270,13 @@ static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
 		    "vcpu run failed: errno=%d", err);
 
-	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
-		    "Invalid guest sync status: exit_reason=%s\n",
-		    exit_reason_str(run->exit_reason));
+	if (test_dirty_quota_increment &&
+		run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED)
+		vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
+	else
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			"Invalid guest sync status: exit_reason=%s\n",
+			exit_reason_str(run->exit_reason));
 
 	vcpu_handle_sync_stop();
 }
@@ -377,6 +391,9 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 		/* We should allow this to continue */
 		;
+	} else if (test_dirty_quota_increment &&
+		run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
+		vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
 	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
 		   (ret == -1 && err == EINTR)) {
 		/* Update the flag first before pause */
@@ -773,6 +790,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	sync_global_to_guest(vm, guest_test_virt_mem);
 	sync_global_to_guest(vm, guest_num_pages);
 
+	/* Initialise dirty quota */
+	if (test_dirty_quota_increment)
+		set_dirty_quota(vm, test_dirty_quota_increment);
+
 	/* Start the iterations */
 	iteration = 1;
 	sync_global_to_guest(vm, iteration);
@@ -814,6 +835,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
 	log_mode_before_vcpu_join();
+	/* Terminate dirty quota throttling */
+	if (test_dirty_quota_increment)
+		set_dirty_quota(vm, 0);
 	pthread_join(vcpu_thread, NULL);
 
 	pr_info("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
@@ -835,6 +859,8 @@ static void help(char *name)
 	printf(" -c: specify dirty ring size, in number of entries\n");
 	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
 	       TEST_DIRTY_RING_COUNT);
+	printf(" -q: specify incremental dirty quota (default: %"PRIu32")\n",
+	       TEST_DIRTY_QUOTA_INCREMENT);
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
 	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
@@ -863,11 +889,14 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "c:q:hi:I:p:m:M:")) != -1) {
 		switch (opt) {
 		case 'c':
 			test_dirty_ring_count = strtol(optarg, NULL, 10);
 			break;
+		case 'q':
+			test_dirty_quota_increment = strtol(optarg, NULL, 10);
+			break;
 		case 'i':
 			p.iterations = strtol(optarg, NULL, 10);
 			break;
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 4ed6aa049a91..b70732998329 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -395,4 +395,8 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
 
 uint32_t guest_get_vcpuid(void);
 
+void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota);
+void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
+			uint64_t test_dirty_quota_increment);
+
 #endif /* SELFTEST_KVM_UTIL_BASE_H */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d8cf851ab119..74a354dea8a7 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -19,6 +19,7 @@
 #include <linux/kernel.h>
 
 #define KVM_UTIL_MIN_PFN	2
+#define PML_BUFFER_SIZE	512
 
 static int vcpu_mmap_sz(void);
 
@@ -2286,6 +2287,7 @@ static struct exit_reason {
 	{KVM_EXIT_X86_RDMSR, "RDMSR"},
 	{KVM_EXIT_X86_WRMSR, "WRMSR"},
 	{KVM_EXIT_XEN, "XEN"},
+	{KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, "DIRTY_QUOTA_EXHAUSTED"},
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
 #endif
@@ -2517,3 +2519,37 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid)
 
 	return ioctl(vcpu->fd, KVM_GET_STATS_FD, NULL);
 }
+
+void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota)
+{
+	run->dirty_quota = dirty_quota;
+
+	if (dirty_quota)
+		pr_info("Dirty quota throttling enabled with initial quota %"PRIu64"\n",
+			dirty_quota);
+	else
+		pr_info("Dirty quota throttling disabled\n");
+}
+
+void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
+			uint64_t test_dirty_quota_increment)
+{
+	uint64_t quota = run->dirty_quota_exit.quota;
+	uint64_t count = run->dirty_quota_exit.count;
+
+	/*
+	 * Due to Intel's Page Modification Logging, number of pages dirtied by
+	 * the vcpu can exceed its dirty quota by PML buffer size.
+	 */
+	TEST_ASSERT(count <= quota + PML_BUFFER_SIZE, "Invalid number of pages "
+		"dirtied: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
+
+	TEST_ASSERT(count >= quota, "Dirty quota exit happened with quota yet "
+		"to be exhausted: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
+
+	if (count > quota)
+		pr_info("Dirty quota exit with unequal quota and count: "
+			"count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
+
+	run->dirty_quota = count + test_dirty_quota_increment;
+}
-- 
2.22.3



* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-21 20:29 ` [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
@ 2022-05-24  7:11   ` Marc Zyngier
  2022-05-26  9:33     ` Shivam Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2022-05-24  7:11 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Sat, 21 May 2022 21:29:36 +0100,
Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> 
> Define variables to track and throttle memory dirtying for every vcpu.
> 
> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>                 while dirty logging is enabled.
> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>                 more, it needs to request more quota by exiting to
>                 userspace.
> 
> Implement the flow for throttling based on dirty quota.
> 
> i) Increment dirty_count for the vcpu whenever it dirties a page.
> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
> count equals/exceeds dirty quota) to request more dirty quota.
> 
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>  Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/spte.c        |  4 ++--
>  arch/x86/kvm/vmx/vmx.c         |  3 +++
>  arch/x86/kvm/x86.c             |  4 ++++

Please split the x86 support in its own patch.

>  include/linux/kvm_host.h       | 15 +++++++++++++++
>  include/linux/kvm_types.h      |  1 +
>  include/uapi/linux/kvm.h       | 12 ++++++++++++
>  virt/kvm/kvm_main.c            |  7 ++++++-
>  8 files changed, 75 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f3172376ec3..a9317ed31d06 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
> +makes the following information available to the userspace:
> +	'count' field: the current count of pages dirtied by the VCPU, can be
> +        skewed based on the size of the pages accessed by each vCPU.
> +	'quota' field: the observed dirty quota just before the exit to userspace.
> +The userspace can design a strategy to allocate the overall scope of dirtying
> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> +quota throttling, the userspace can make a decision to either update (increase)
> +the quota or to put the VCPU to sleep for some time.
> +
>  ::
>  
>  		/* Fix the size of the union. */
> @@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>  
>  ::
>  
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
> +	 * is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;
> +Please note that enforcing the quota is best effort, as the guest may dirty
> +multiple pages before KVM can recheck the quota.  However, unless KVM is using
> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> +KVM will detect quota exhaustion within a handful of dirtied pages.  If a
> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
> +(512 entries for PML).
> +
> +::
>    };
>  
>  
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 73cfe62fdad1..01f0d2a04796 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>  		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
>  
> -	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
> +	if (spte & PT_WRITABLE_MASK) {
>  		/* Enforced by kvm_mmu_hugepage_adjust. */
> -		WARN_ON(level > PG_LEVEL_4K);
> +		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));
>  		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
>  	}
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b730d799c26e..5cbe4992692a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>  		 */
>  		if (__xfer_to_guest_mode_work_pending())
>  			return 1;
> +
> +		if (!kvm_vcpu_check_dirty_quota(vcpu))
> +			return 0;
>  	}
>  
>  	return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb4029660bd9..0b35b8cc0274 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  	vcpu->arch.l1tf_flush_l1d = true;
>  
>  	for (;;) {
> +		r = kvm_vcpu_check_dirty_quota(vcpu);

I really wonder why this has to be checked on each and every run. Why
isn't this a request set by the core code instead?

> +		if (!r)
> +			break;
> +
>  		if (kvm_vcpu_running(vcpu)) {
>  			r = vcpu_enter_guest(vcpu);
>  		} else {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f11039944c08..ca1ac970a6cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>  }
>  
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)

Why is this inline instead of a normal call?

> +{
> +	struct kvm_run *run = vcpu->run;
> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;

What happens when page_dirtied becomes large and dirty_quota has to
wrap to allow further progress?

> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;
> +}
> +
>  /*
>   * Some of the bitops functions do not support too long bitmaps.
>   * This number must be determined not to exceed such limits.
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index dceac12c1ce5..7f42486b0405 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
>  	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>  	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>  	u64 blocking;
> +	u64 pages_dirtied;
>  };
>  
>  #define KVM_STATS_NAME_SIZE	48
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 507ee1f2aa96..1d9531efe1fb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_X86_BUS_LOCK     33
>  #define KVM_EXIT_XEN              34
>  #define KVM_EXIT_RISCV_SBI        35
> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -487,6 +488,11 @@ struct kvm_run {
>  			unsigned long args[6];
>  			unsigned long ret[2];
>  		} riscv_sbi;
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -508,6 +514,12 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * liftime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the

lifetime

> +	 * quota is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;

How is the feature detected from userspace? I don't see how it can
make use of it without knowing about it the first place.

>  };
>  
>  /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0afc016cc54d..041ab464405d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>  		return;
>  #endif
>  
> -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> +	if (!memslot)
> +		return;
> +
> +	vcpu->stat.generic.pages_dirtied++;
> +
> +	if (kvm_slot_dirty_track_enabled(memslot)) {
>  		unsigned long rel_gfn = gfn - memslot->base_gfn;
>  		u32 slot = (memslot->as_id << 16) | memslot->id;
>  

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v4 2/4] KVM: arm64: Dirty quota-based throttling of vcpus
  2022-05-21 20:29 ` [PATCH v4 2/4] KVM: arm64: Dirty " Shivam Kumar
@ 2022-05-24  7:14   ` Marc Zyngier
  2022-05-26  9:16     ` Shivam Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2022-05-24  7:14 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Sat, 21 May 2022 21:29:38 +0100,
Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> 
> Exit to userspace whenever the dirty quota is exhausted (i.e. dirty count
> equals/exceeds dirty quota) to request more dirty quota.
> 
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>  arch/arm64/kvm/arm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ecc5958e27fe..5b6a239b83a5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -848,6 +848,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  	ret = 1;
>  	run->exit_reason = KVM_EXIT_UNKNOWN;
>  	while (ret > 0) {
> +		ret = kvm_vcpu_check_dirty_quota(vcpu);
> +		if (!ret)
> +			break;
>  		/*
>  		 * Check conditions before entering the guest
>  		 */

Why do we need yet another check on the fast path? It seems to me that
this is what requests are for, so I'm definitely not keen on this
approach. I certainly do not want any extra overhead for something
that is only used on migration. If anything, it is the migration path
that should incur the overhead.

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v4 2/4] KVM: arm64: Dirty quota-based throttling of vcpus
  2022-05-24  7:14   ` Marc Zyngier
@ 2022-05-26  9:16     ` Shivam Kumar
  2022-05-26 17:58       ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Shivam Kumar @ 2022-05-26  9:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: pbonzini, seanjc, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat


On 24/05/22 12:44 pm, Marc Zyngier wrote:
> On Sat, 21 May 2022 21:29:38 +0100,
> Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
>> Exit to userspace whenever the dirty quota is exhausted (i.e. dirty count
>> equals/exceeds dirty quota) to request more dirty quota.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>> ---
>>   arch/arm64/kvm/arm.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index ecc5958e27fe..5b6a239b83a5 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -848,6 +848,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>   	ret = 1;
>>   	run->exit_reason = KVM_EXIT_UNKNOWN;
>>   	while (ret > 0) {
>> +		ret = kvm_vcpu_check_dirty_quota(vcpu);
>> +		if (!ret)
>> +			break;
>>   		/*
>>   		 * Check conditions before entering the guest
>>   		 */
> Why do we need yet another check on the fast path? It seems to me that
> this is what requests are for, so I'm definitely not keen on this
> approach. I certainly do not want any extra overhead for something
> that is only used on migration. If anything, it is the migration path
> that should incur the overhead.
>
> 	M.
I'll try implementing this with requests. Thanks.
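
For reference, the request mechanism boils down to a per-vCPU bitmap of pending
work that the run loop drains before entering the guest. Below is a standalone
userspace model of that pattern, not the actual KVM code: the names
REQ_DIRTY_QUOTA_EXIT and struct vcpu_model are illustrative, and the real
kernel helpers are kvm_make_request()/kvm_check_request().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REQ_DIRTY_QUOTA_EXIT 0  /* hypothetical request bit */

struct vcpu_model {
	uint64_t requests;       /* bitmap of pending requests */
	uint64_t pages_dirtied;
	uint64_t dirty_quota;
};

void make_request(struct vcpu_model *vcpu, int req)
{
	vcpu->requests |= (1ULL << req);
}

/* The run loop checks and clears pending requests before guest entry. */
bool check_and_clear_request(struct vcpu_model *vcpu, int req)
{
	if (!(vcpu->requests & (1ULL << req)))
		return false;
	vcpu->requests &= ~(1ULL << req);
	return true;
}

/*
 * The dirty path raises the request when the quota is hit, so the run
 * loop itself carries no extra check on the fast path.
 */
void mark_page_dirty(struct vcpu_model *vcpu)
{
	vcpu->pages_dirtied++;
	if (vcpu->dirty_quota && vcpu->pages_dirtied >= vcpu->dirty_quota)
		make_request(vcpu, REQ_DIRTY_QUOTA_EXIT);
}
```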


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-24  7:11   ` Marc Zyngier
@ 2022-05-26  9:33     ` Shivam Kumar
  2022-05-26 13:30       ` Marc Zyngier
  0 siblings, 1 reply; 24+ messages in thread
From: Shivam Kumar @ 2022-05-26  9:33 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: pbonzini, seanjc, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat


On 24/05/22 12:41 pm, Marc Zyngier wrote:
> On Sat, 21 May 2022 21:29:36 +0100,
> Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
>> Define variables to track and throttle memory dirtying for every vcpu.
>>
>> dirty_count:    Number of pages the vcpu has dirtied since its creation,
>>                  while dirty logging is enabled.
>> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
>>                  more, it needs to request more quota by exiting to
>>                  userspace.
>>
>> Implement the flow for throttling based on dirty quota.
>>
>> i) Increment dirty_count for the vcpu whenever it dirties a page.
>> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
>> count equals/exceeds dirty quota) to request more dirty quota.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>> ---
>>   Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
>>   arch/x86/kvm/mmu/spte.c        |  4 ++--
>>   arch/x86/kvm/vmx/vmx.c         |  3 +++
>>   arch/x86/kvm/x86.c             |  4 ++++
> Please split the x86 support in its own patch.
I'd merged the changes into one commit based on the feedback received on 
previous patchsets (as the change is very simple). I hadn't received any 
review from arm and s390 maintainers, so I separated those changes in 
different commits. Thanks.
>
>>   include/linux/kvm_host.h       | 15 +++++++++++++++
>>   include/linux/kvm_types.h      |  1 +
>>   include/uapi/linux/kvm.h       | 12 ++++++++++++
>>   virt/kvm/kvm_main.c            |  7 ++++++-
>>   8 files changed, 75 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 9f3172376ec3..a9317ed31d06 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return
>>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>>   spec refer, https://github.com/riscv/riscv-sbi-doc .
>>   
>> +::
>> +
>> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
>> +		struct {
>> +			__u64 count;
>> +			__u64 quota;
>> +		} dirty_quota_exit;
>> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
>> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
>> +makes the following information available to the userspace:
>> +	'count' field: the current count of pages dirtied by the VCPU, can be
>> +        skewed based on the size of the pages accessed by each vCPU.
>> +	'quota' field: the observed dirty quota just before the exit to userspace.
>> +The userspace can design a strategy to allocate the overall scope of dirtying
>> +for the VM among the vcpus. Based on the strategy and the current state of dirty
>> +quota throttling, the userspace can make a decision to either update (increase)
>> +the quota or to put the VCPU to sleep for some time.
>> +
>>   ::
>>   
>>   		/* Fix the size of the union. */
>> @@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>>   
>>   ::
>>   
>> +	/*
>> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
>> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
>> +	 * is reached/exceeded.
>> +	 */
>> +	__u64 dirty_quota;
>> +Please note that enforcing the quota is best effort, as the guest may dirty
>> +multiple pages before KVM can recheck the quota.  However, unless KVM is using
>> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
>> +KVM will detect quota exhaustion within a handful of dirtied pages.  If a
>> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
>> +(512 entries for PML).
>> +
>> +::
>>     };
>>   
>>   
>> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
>> index 73cfe62fdad1..01f0d2a04796 100644
>> --- a/arch/x86/kvm/mmu/spte.c
>> +++ b/arch/x86/kvm/mmu/spte.c
>> @@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>   		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>>   		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
>>   
>> -	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
>> +	if (spte & PT_WRITABLE_MASK) {
>>   		/* Enforced by kvm_mmu_hugepage_adjust. */
>> -		WARN_ON(level > PG_LEVEL_4K);
>> +		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));
>>   		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
>>   	}
>>   
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index b730d799c26e..5cbe4992692a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>>   		 */
>>   		if (__xfer_to_guest_mode_work_pending())
>>   			return 1;
>> +
>> +		if (!kvm_vcpu_check_dirty_quota(vcpu))
>> +			return 0;
>>   	}
>>   
>>   	return 1;
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index eb4029660bd9..0b35b8cc0274 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>>   	vcpu->arch.l1tf_flush_l1d = true;
>>   
>>   	for (;;) {
>> +		r = kvm_vcpu_check_dirty_quota(vcpu);
> I really wonder why this has to be checked on each and every run. Why
> isn't this a request set by the core code instead?
Ack. Thanks.
>
>> +		if (!r)
>> +			break;
>> +
>>   		if (kvm_vcpu_running(vcpu)) {
>>   			r = vcpu_enter_guest(vcpu);
>>   		} else {
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index f11039944c08..ca1ac970a6cf 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>>   }
>>   
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> Why is this inline instead of a normal call?
It's a lightweight function and I found similar functions marked inline 
in the code. Thanks.
>> +{
>> +	struct kvm_run *run = vcpu->run;
>> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +
>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +		return 1;
> What happens when page_dirtied becomes large and dirty_quota has to
> wrap to allow further progress?
Every time the quota is exhausted, userspace is expected to set it to 
pages_dirtied + new quota. So, pages_dirtied will always follow dirty 
quota. I'll be sending the qemu patches soon. Thanks.
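
To make the intended protocol concrete, here is a standalone model of both
halves, as I understand them: the check mirrors the proposed
kvm_vcpu_check_dirty_quota(), and the refill is what userspace would do on each
exhaustion exit. This is a compilable sketch, not the kernel code, and the
allotment value is purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the kvm_run fields involved in the proposed ABI. */
struct run_model {
	uint64_t dirty_quota;
	struct { uint64_t count, quota; } dirty_quota_exit;
};

/* Returns 1 to keep running, 0 to exit to userspace, as in the patch. */
int check_dirty_quota(struct run_model *run, uint64_t pages_dirtied)
{
	uint64_t quota = run->dirty_quota;

	if (!quota || pages_dirtied < quota)
		return 1;

	run->dirty_quota_exit.count = pages_dirtied;
	run->dirty_quota_exit.quota = quota;
	return 0;
}

/* Userspace side: grant a fresh allotment on each exhaustion exit. */
void refill_quota(struct run_model *run, uint64_t allotment)
{
	run->dirty_quota = run->dirty_quota_exit.count + allotment;
}
```

A quota of zero disables throttling, matching the "if (!dirty_quota)" case in
the proposed kernel check.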
>
>> +
>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +	run->dirty_quota_exit.count = pages_dirtied;
>> +	run->dirty_quota_exit.quota = dirty_quota;
>> +	return 0;
>> +}
>> +
>>   /*
>>    * Some of the bitops functions do not support too long bitmaps.
>>    * This number must be determined not to exceed such limits.
>> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
>> index dceac12c1ce5..7f42486b0405 100644
>> --- a/include/linux/kvm_types.h
>> +++ b/include/linux/kvm_types.h
>> @@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
>>   	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
>>   	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
>>   	u64 blocking;
>> +	u64 pages_dirtied;
>>   };
>>   
>>   #define KVM_STATS_NAME_SIZE	48
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 507ee1f2aa96..1d9531efe1fb 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
>>   #define KVM_EXIT_X86_BUS_LOCK     33
>>   #define KVM_EXIT_XEN              34
>>   #define KVM_EXIT_RISCV_SBI        35
>> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
>>   
>>   /* For KVM_EXIT_INTERNAL_ERROR */
>>   /* Emulate instruction failed. */
>> @@ -487,6 +488,11 @@ struct kvm_run {
>>   			unsigned long args[6];
>>   			unsigned long ret[2];
>>   		} riscv_sbi;
>> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
>> +		struct {
>> +			__u64 count;
>> +			__u64 quota;
>> +		} dirty_quota_exit;
>>   		/* Fix the size of the union. */
>>   		char padding[256];
>>   	};
>> @@ -508,6 +514,12 @@ struct kvm_run {
>>   		struct kvm_sync_regs regs;
>>   		char padding[SYNC_REGS_SIZE_BYTES];
>>   	} s;
>> +	/*
>> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
>> +	 * liftime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
> lifetime
Ack. Thanks.
>> +	 * quota is reached/exceeded.
>> +	 */
>> +	__u64 dirty_quota;
> How is the feature detected from userspace? I don't see how it can
> make use of it without knowing about it in the first place.
I hope changing the kernel headers would suffice. We had started the 
patch series with a KVM CAP for dirty quota but dropped it as per the 
feedback received. Thanks.
>>   };
>>   
>>   /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 0afc016cc54d..041ab464405d 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
>>   		return;
>>   #endif
>>   
>> -	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>> +	if (!memslot)
>> +		return;
>> +
>> +	vcpu->stat.generic.pages_dirtied++;
>> +
>> +	if (kvm_slot_dirty_track_enabled(memslot)) {
>>   		unsigned long rel_gfn = gfn - memslot->base_gfn;
>>   		u32 slot = (memslot->as_id << 16) | memslot->id;
>>   
> Thanks,
>
> 	M.
Thank you so much for the review. Looking forward to further feedback 
and suggestions.



* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-26  9:33     ` Shivam Kumar
@ 2022-05-26 13:30       ` Marc Zyngier
  2022-05-26 15:44         ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2022-05-26 13:30 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, 26 May 2022 10:33:04 +0100,
Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> 
> 
> On 24/05/22 12:41 pm, Marc Zyngier wrote:
> > On Sat, 21 May 2022 21:29:36 +0100,
> > Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> >> Define variables to track and throttle memory dirtying for every vcpu.
> >> 
> >> dirty_count:    Number of pages the vcpu has dirtied since its creation,
> >>                  while dirty logging is enabled.
> >> dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
> >>                  more, it needs to request more quota by exiting to
> >>                  userspace.
> >> 
> >> Implement the flow for throttling based on dirty quota.
> >> 
> >> i) Increment dirty_count for the vcpu whenever it dirties a page.
> >> ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
> >> count equals/exceeds dirty quota) to request more dirty quota.
> >> 
> >> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> >> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> >> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> >> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> >> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> >> ---
> >>   Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
> >>   arch/x86/kvm/mmu/spte.c        |  4 ++--
> >>   arch/x86/kvm/vmx/vmx.c         |  3 +++
> >>   arch/x86/kvm/x86.c             |  4 ++++
> > Please split the x86 support in its own patch.
> I'd merged the changes into one commit based on the feedback received
> on previous patchsets (as the change is very simple). I hadn't
> received any review from arm and s390 maintainers, so I separated
> those changes in different commits. Thanks.

Well, you don't get much reviewing if you don't Cc people. I still
think having the generic code on its own is the right thing to do,
each architecture being added separately. That's irrespective of the
simplicity of the patch (I have seen patches with only two lines that
contained 3 bugs).

> > 
> >>   include/linux/kvm_host.h       | 15 +++++++++++++++
> >>   include/linux/kvm_types.h      |  1 +
> >>   include/uapi/linux/kvm.h       | 12 ++++++++++++
> >>   virt/kvm/kvm_main.c            |  7 ++++++-
> >>   8 files changed, 75 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 9f3172376ec3..a9317ed31d06 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -6125,6 +6125,24 @@ array field represents return values. The userspace should update the return
> >>   values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >>   spec refer, https://github.com/riscv/riscv-sbi-doc .
> >>   +::
> >> +
> >> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> >> +		struct {
> >> +			__u64 count;
> >> +			__u64 quota;
> >> +		} dirty_quota_exit;
> >> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
> >> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
> >> +makes the following information available to the userspace:
> >> +	'count' field: the current count of pages dirtied by the VCPU, can be
> >> +        skewed based on the size of the pages accessed by each vCPU.
> >> +	'quota' field: the observed dirty quota just before the exit to userspace.
> >> +The userspace can design a strategy to allocate the overall scope of dirtying
> >> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> >> +quota throttling, the userspace can make a decision to either update (increase)
> >> +the quota or to put the VCPU to sleep for some time.
> >> +
> >>   ::
> >>     		/* Fix the size of the union. */
> >> @@ -6159,6 +6177,20 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
> >>     ::
> >>   +	/*
> >> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> >> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
> >> +	 * is reached/exceeded.
> >> +	 */
> >> +	__u64 dirty_quota;
> >> +Please note that enforcing the quota is best effort, as the guest may dirty
> >> +multiple pages before KVM can recheck the quota.  However, unless KVM is using
> >> +a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> >> +KVM will detect quota exhaustion within a handful of dirtied pages.  If a
> >> +hardware ring buffer is used, the overrun is bounded by the size of the buffer
> >> +(512 entries for PML).
> >> +
> >> +::
> >>     };
> >>     diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> >> index 73cfe62fdad1..01f0d2a04796 100644
> >> --- a/arch/x86/kvm/mmu/spte.c
> >> +++ b/arch/x86/kvm/mmu/spte.c
> >> @@ -182,9 +182,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >>   		  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> >>   		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> >>   -	if ((spte & PT_WRITABLE_MASK) &&
> >> kvm_slot_dirty_track_enabled(slot)) {
> >> +	if (spte & PT_WRITABLE_MASK) {
> >>   		/* Enforced by kvm_mmu_hugepage_adjust. */
> >> -		WARN_ON(level > PG_LEVEL_4K);
> >> +		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));
> >>   		mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
> >>   	}
> >>   diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >> index b730d799c26e..5cbe4992692a 100644
> >> --- a/arch/x86/kvm/vmx/vmx.c
> >> +++ b/arch/x86/kvm/vmx/vmx.c
> >> @@ -5507,6 +5507,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >>   		 */
> >>   		if (__xfer_to_guest_mode_work_pending())
> >>   			return 1;
> >> +
> >> +		if (!kvm_vcpu_check_dirty_quota(vcpu))
> >> +			return 0;
> >>   	}
> >>     	return 1;
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index eb4029660bd9..0b35b8cc0274 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> >>   	vcpu->arch.l1tf_flush_l1d = true;
> >>     	for (;;) {
> >> +		r = kvm_vcpu_check_dirty_quota(vcpu);
> > I really wonder why this has to be checked on each and every run. Why
> > isn't this a request set by the core code instead?
> Ack. Thanks.
> > 
> >> +		if (!r)
> >> +			break;
> >> +
> >>   		if (kvm_vcpu_running(vcpu)) {
> >>   			r = vcpu_enter_guest(vcpu);
> >>   		} else {
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index f11039944c08..ca1ac970a6cf 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> >>   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
> >>   }
> >>   +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu
> >> *vcpu)
> > Why is this inline instead of a normal call?
> It's a lightweight function and I found similar functions marked
> inline in the code. Thanks.

That's not a justification. If you mark something inline, this is
either because it is performance critical (that's not the case here),
or that it must be inlined by construction (not the case either).

> >> +{
> >> +	struct kvm_run *run = vcpu->run;
> >> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> >> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> >> +
> >> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> >> +		return 1;
> > What happens when page_dirtied becomes large and dirty_quota has to
> > wrap to allow further progress?
> Every time the quota is exhausted, userspace is expected to set it to
> pages_dirtied + new quota. So, pages_dirtied will always follow dirty
> quota. I'll be sending the qemu patches soon. Thanks.

Right, so let's assume that page_dirtied=0xffffffffffffffff (yes, I
have dirtied that many pages). I add two page of quota, so
dirty_quota=1. pages_dirtied > dirty_quota, and we're back to
userspace forever. Epic fail.
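
Concretely (a standalone illustration, not kernel code): if userspace follows
the "quota = pages_dirtied + increment" protocol at that point, the u64
addition wraps and the proposed check can never pass again.

```c
#include <assert.h>
#include <stdint.h>

/* 1 if the vCPU may keep running under the proposed check, else 0. */
int may_run(uint64_t pages_dirtied, uint64_t dirty_quota)
{
	return !dirty_quota || pages_dirtied < dirty_quota;
}

void demo_quota_wrap(void)
{
	uint64_t pages_dirtied = UINT64_MAX;
	/* Userspace grants two more pages: count + 2 wraps around to 1. */
	uint64_t dirty_quota = pages_dirtied + 2;

	assert(dirty_quota == 1);
	/* The vCPU exits to userspace forever, despite the fresh grant. */
	assert(!may_run(pages_dirtied, dirty_quota));
}
```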

> >
> >> +
> >> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> >> +	run->dirty_quota_exit.count = pages_dirtied;
> >> +	run->dirty_quota_exit.quota = dirty_quota;
> >> +	return 0;
> >> +}
> >> +
> >>   /*
> >>    * Some of the bitops functions do not support too long bitmaps.
> >>    * This number must be determined not to exceed such limits.
> >> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> >> index dceac12c1ce5..7f42486b0405 100644
> >> --- a/include/linux/kvm_types.h
> >> +++ b/include/linux/kvm_types.h
> >> @@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
> >>   	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
> >>   	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
> >>   	u64 blocking;
> >> +	u64 pages_dirtied;
> >>   };
> >>     #define KVM_STATS_NAME_SIZE	48
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 507ee1f2aa96..1d9531efe1fb 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
> >>   #define KVM_EXIT_X86_BUS_LOCK     33
> >>   #define KVM_EXIT_XEN              34
> >>   #define KVM_EXIT_RISCV_SBI        35
> >> +#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
> >>     /* For KVM_EXIT_INTERNAL_ERROR */
> >>   /* Emulate instruction failed. */
> >> @@ -487,6 +488,11 @@ struct kvm_run {
> >>   			unsigned long args[6];
> >>   			unsigned long ret[2];
> >>   		} riscv_sbi;
> >> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> >> +		struct {
> >> +			__u64 count;
> >> +			__u64 quota;
> >> +		} dirty_quota_exit;
> >>   		/* Fix the size of the union. */
> >>   		char padding[256];
> >>   	};
> >> @@ -508,6 +514,12 @@ struct kvm_run {
> >>   		struct kvm_sync_regs regs;
> >>   		char padding[SYNC_REGS_SIZE_BYTES];
> >>   	} s;
> >> +	/*
> >> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> >> +	 * liftime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
> > lifetime
> Ack. Thanks.
> >> +	 * quota is reached/exceeded.
> >> +	 */
> >> +	__u64 dirty_quota;
> > How is the feature detected from userspace? I don't see how it can
> > make use of it without knowing about it in the first place.
> I hope changing the kernel headers would suffice. We had started the
> patch series with a KVM CAP for dirty quota but dropped it as per the
> feedback received. Thanks.

I don't see how anyone can make use of this feature without a
capability. How do you know whether you're ever going to see the exit
if you can't know if the kernel supports the feature?

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-26 13:30       ` Marc Zyngier
@ 2022-05-26 15:44         ` Sean Christopherson
  2022-05-26 16:16           ` Marc Zyngier
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-26 15:44 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Shivam Kumar, pbonzini, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, May 26, 2022, Marc Zyngier wrote:
> > >> +{
> > >> +	struct kvm_run *run = vcpu->run;
> > >> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> > >> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> > >> +
> > >> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> > >> +		return 1;
> > > What happens when page_dirtied becomes large and dirty_quota has to
> > > wrap to allow further progress?
> > Every time the quota is exhausted, userspace is expected to set it to
> > pages_dirtied + new quota. So, pages_dirtied will always follow dirty
> > quota. I'll be sending the qemu patches soon. Thanks.
> 
> Right, so let's assume that page_dirtied=0xffffffffffffffff (yes, I
> have dirtied that many pages).

Really?  Written that many bytes from a guest?  Maybe.  But actually marked that
many pages dirty in hardware, let alone in KVM?  And on a single CPU?

By my back of the napkin math, a 4096 CPU system running at 16ghz with each CPU
able to access one page of memory per cycle would take ~3 days to access 2^64
pages.

Assuming a ridiculously optimistic ~20 cycles to walk page tables, fetch the cache
line from memory, insert into the TLB, and mark the PTE dirty, that's still ~60
days to actually dirty that many pages in hardware.

Let's again be comically optimistic and assume KVM can somehow propagate a dirty
bit from hardware PTEs to the dirty bitmap/ring in another ~20 cycles.  That brings
us to ~1200 days.

But the stat is per vCPU, so that actually means it would take ~13.8k years for a
single vCPU/CPU to dirty 2^64 pages... running at a ludicrous 16ghz on a CPU with
latencies that are likely an order of magnitude faster than anything that exists
today.
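
For the curious, the napkin math above checks out; a quick sanity sketch with
the same assumptions (4096 CPUs, 16 GHz, one page per cycle, two ~20-cycle
penalty factors):

```c
#include <assert.h>

/* Redo the back-of-the-napkin figures above in long double arithmetic. */
void napkin_math(void)
{
	long double pages   = 18446744073709551616.0L; /* 2^64 */
	long double rate    = 4096.0L * 16e9L;         /* pages/sec, all CPUs */
	long double days    = pages / rate / 86400.0L; /* 1 page per cycle */
	long double days20  = days * 20.0L;            /* ~20-cycle page walk */
	long double days400 = days20 * 20.0L;          /* + dirty-bit propagation */
	long double years_one_vcpu = days400 * 4096.0L / 365.0L;

	assert(days > 3.0L && days < 4.0L);             /* "~3 days"    */
	assert(days20 > 60.0L && days20 < 70.0L);       /* "~60 days"   */
	assert(days400 > 1200.0L && days400 < 1400.0L); /* "~1200 days" */
	assert(years_one_vcpu > 13000.0L && years_one_vcpu < 15000.0L);
}
```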


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-26 15:44         ` Sean Christopherson
@ 2022-05-26 16:16           ` Marc Zyngier
  2022-05-26 17:46             ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Marc Zyngier @ 2022-05-26 16:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Shivam Kumar, pbonzini, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, 26 May 2022 16:44:13 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, May 26, 2022, Marc Zyngier wrote:
> > > >> +{
> > > >> +	struct kvm_run *run = vcpu->run;
> > > >> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> > > >> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> > > >> +
> > > >> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> > > >> +		return 1;
> > > > What happens when page_dirtied becomes large and dirty_quota has to
> > > > wrap to allow further progress?
> > > Every time the quota is exhausted, userspace is expected to set it to
> > > pages_dirtied + new quota. So, pages_dirtied will always follow dirty
> > > quota. I'll be sending the qemu patches soon. Thanks.
> > 
> > Right, so let's assume that page_dirtied=0xffffffffffffffff (yes, I
> > have dirtied that many pages).
> 
> Really?  Written that many bytes from a guest?  Maybe.  But actually
> marked that many pages dirty in hardware, let alone in KVM?  And on
> a single CPU?
>
> By my back of the napkin math, a 4096 CPU system running at 16ghz
> with each CPU able to access one page of memory per cycle would take
> ~3 days to access 2^64 pages.
> 
> Assuming a ridiculously optimistic ~20 cycles to walk page tables,
> fetch the cache line from memory, insert into the TLB, and mark the
> PTE dirty, that's still ~60 days to actually dirty that many pages
> in hardware.
> 
> Let's again be comically optimistic and assume KVM can somehow
> propagate a dirty bit from hardware PTEs to the dirty bitmap/ring in
> another ~20 cycles.  That brings us to ~1200 days.
> 
> But the stat is per vCPU, so that actually means it would take
> ~13.8k years for a single vCPU/CPU to dirty 2^64 pages... running at
> > a ludicrous 16ghz on a CPU with latencies that are likely an order
> of magnitude faster than anything that exists today.

Congratulations, you can multiply! ;-)

It just shows that the proposed API is pretty bad, because instead of
working as a credit, it works as a ceiling, based on a value that is
dependent on the vcpu's previous state (forcing userspace to recompute
the next quota on each exit), and with undocumented, arbitrary limits
as a bonus.

I don't like it, and probably won't like it in 13.8k years either.

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-26 16:16           ` Marc Zyngier
@ 2022-05-26 17:46             ` Sean Christopherson
  2022-05-30 11:05               ` Shivam Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-05-26 17:46 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Shivam Kumar, pbonzini, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, May 26, 2022, Marc Zyngier wrote:
> On Thu, 26 May 2022 16:44:13 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Thu, May 26, 2022, Marc Zyngier wrote:
> > > > >> +{
> > > > >> +	struct kvm_run *run = vcpu->run;
> > > > >> +	u64 dirty_quota = READ_ONCE(run->dirty_quota);
> > > > >> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> > > > >> +
> > > > >> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> > > > >> +		return 1;
> > > > > What happens when page_dirtied becomes large and dirty_quota has to
> > > > > wrap to allow further progress?
> > > > Every time the quota is exhausted, userspace is expected to set it to
> > > > pages_dirtied + new quota. So, pages_dirtied will always follow dirty
> > > > quota. I'll be sending the qemu patches soon. Thanks.
> > > 
> > > Right, so let's assume that page_dirtied=0xffffffffffffffff (yes, I
> > > have dirtied that many pages).
> > 
> > Really?  Written that many bytes from a guest?  Maybe.  But actually
> > marked that many pages dirty in hardware, let alone in KVM?  And on
> > a single CPU?
> >
> > By my back of the napkin math, a 4096 CPU system running at 16ghz
> > with each CPU able to access one page of memory per cycle would take
> > ~3 days to access 2^64 pages.
> > 
> > Assuming a ridiculously optimistic ~20 cycles to walk page tables,
> > fetch the cache line from memory, insert into the TLB, and mark the
> > PTE dirty, that's still ~60 days to actually dirty that many pages
> > in hardware.
> > 
> > Let's again be comically optimistic and assume KVM can somehow
> > propagate a dirty bit from hardware PTEs to the dirty bitmap/ring in
> > another ~20 cycles.  That brings us to ~1200 days.
> > 
> > But the stat is per vCPU, so that actually means it would take
> > ~13.8k years for a single vCPU/CPU to dirty 2^64 pages... running at
> > a ludicrous 16ghz on a CPU with latencies that are a likely an order
> > of magnitude faster than anything that exists today.
> 
> Congratulations, you can multiply! ;-)

Don't ask me how long it took for me to do that math :-D

> It just shows that the proposed API is pretty bad, because instead of
> working as a credit, it works as a ceiling, based on a value that is
> dependent on the vCPU's previous state (forcing userspace to recompute
> the next quota on each exit),

My understanding is that the intended use case is dependent on previous vCPU state by
design.  E.g. a vCPU that is aggressively dirtying memory will be given a different
quota than a vCPU that has historically not dirtied much memory.

Even if the quotas are fixed, "recomputing" the new quota is simply:

	run->dirty_quota = run->dirty_quota_exit.count + PER_VCPU_DIRTY_QUOTA

The main motivation for the ceiling approach is that it's simpler for KVM to implement
since KVM never writes vcpu->run->dirty_quota.  E.g. userspace may observe a "spurious"
exit if KVM reads the quota right before it's changed by userspace, but KVM doesn't
have to guard against clobbering the quota.

A credit based implementation would require (a) snapshotting the credit at
some point during KVM_RUN, (b) disallowing changing the quota credit while the
vCPU is running, or (c) an expensive atomic sequence (expensive on x86 at least)
to avoid clobbering an update from userspace.  (a) is undesirable because it delays
recognition of the new quota, especially if KVM doesn't re-read the quota in the
tight run loop.  And AIUI, changing the quota while the vCPU is running is a desired
use case, so that rules out (b).  The cost of (c) is not the end of the world, but
IMO the benefits are marginal.

E.g. if we go with a request-based implementation where kvm_vcpu_check_dirty_quota()
is called in mark_page_dirty_in_slot(), then the ceiling-based approach is:

  static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
  {
        u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);

	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
		return;

	/*
	 * Snapshot the quota to report it to userspace.  The dirty count will
	 * be captured when the request is processed.
	 */
	vcpu->dirty_quota = dirty_quota;
	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
  }

whereas the credit-based approach would need to be something like:

  static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
  {
        u64 old_dirty_quota;

	if (!READ_ONCE(vcpu->run->dirty_quota_enabled))
		return;

	old_dirty_quota = atomic64_fetch_add_unless(vcpu->run->dirty_quota, -1, 0);

	/* Quota is not yet exhausted, or was already exhausted. */
	if (old_dirty_quota != 1)
		return;

	/*
	 * The dirty count will be re-captured in dirty_count when the request
	 * is processed so that userspace can compute the number of pages that
	 * were dirtied after the quota was exhausted.
	 */
	vcpu->dirty_count_at_exhaustion = vcpu->stat.generic.pages_dirtied;
	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
  }

> and with undocumented, arbitrary limits as a bonus.

Eh, documenting that userspace needs to be aware of a theoretically-possible wrap
is easy enough.  If we're truly worried about a wrap scenario, it'd be trivial to
add a flag/ioctl() to let userspace reset vcpu->stat.generic.pages_dirtied.
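To make the ceiling semantics concrete, here is a hypothetical stand-alone model of the check and the userspace recompute discussed above (the function names and the per-vCPU constant are illustrative, not from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the ceiling check: KVM exits to userspace once a
 * non-zero quota has been reached by the monotonic pages_dirtied stat.
 */
bool quota_exhausted(uint64_t pages_dirtied, uint64_t dirty_quota)
{
	return dirty_quota && pages_dirtied >= dirty_quota;
}

/* Userspace recompute on each KVM_EXIT_DIRTY_QUOTA_EXHAUSTED exit. */
uint64_t next_quota(uint64_t exit_count, uint64_t per_vcpu_quota)
{
	/* Wraps modulo 2^64 after the (theoretical) counter overflow. */
	return exit_count + per_vcpu_quota;
}
```

If the counter ever did wrap, the recomputed ceiling would land below the current count and the vCPU would re-exit on every dirtied page, which is the pathological case the reset flag/ioctl() would paper over.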


* Re: [PATCH v4 2/4] KVM: arm64: Dirty quota-based throttling of vcpus
  2022-05-26  9:16     ` Shivam Kumar
@ 2022-05-26 17:58       ` Sean Christopherson
  0 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2022-05-26 17:58 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: Marc Zyngier, pbonzini, james.morse, borntraeger, david, kvm,
	Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, May 26, 2022, Shivam Kumar wrote:
> 
> On 24/05/22 12:44 pm, Marc Zyngier wrote:
> > On Sat, 21 May 2022 21:29:38 +0100,
> > Shivam Kumar <shivam.kumar1@nutanix.com> wrote:
> > > Exit to userspace whenever the dirty quota is exhausted (i.e. dirty count
> > > equals/exceeds dirty quota) to request more dirty quota.
> > > 
> > > Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> > > Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> > > Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> > > Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> > > Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> > > ---
> > >   arch/arm64/kvm/arm.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index ecc5958e27fe..5b6a239b83a5 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -848,6 +848,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > >   	ret = 1;
> > >   	run->exit_reason = KVM_EXIT_UNKNOWN;
> > >   	while (ret > 0) {
> > > +		ret = kvm_vcpu_check_dirty_quota(vcpu);
> > > +		if (!ret)
> > > +			break;
> > >   		/*
> > >   		 * Check conditions before entering the guest
> > >   		 */
> > Why do we need yet another check on the fast path? It seems to me that
> > this is what requests are for, so I'm definitely not keen on this
> > approach. I certainly do not want any extra overhead for something
> > that is only used on migration. If anything, it is the migration path
> > that should incur the overhead.
> > 
> > 	M.
> I'll try implementing this with requests. Thanks.

I've no objection to using a request, avoiding the extra checks is quite nice,
and x86's fastpath loop would Just Work (though VMX's handle_invalid_guest_state()
still needs manual checking).

The only gotchas are that it'll likely require a small amount of #ifdeffery, and
the quota will need to be "temporarily" captured in 'struct kvm_vcpu' because
there's no guarantee that the next exit to userspace will be due to the dirty
quota, i.e. something else might overwrite the exit union in vcpu->run.


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-26 17:46             ` Sean Christopherson
@ 2022-05-30 11:05               ` Shivam Kumar
  2022-07-05  7:21                 ` Shivam Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Shivam Kumar @ 2022-05-30 11:05 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier
  Cc: pbonzini, james.morse, borntraeger, david, kvm, Shaju Abraham,
	Manish Mishra, Anurag Madnawat


>> It just shows that the proposed API is pretty bad, because instead of
>> working as a credit, it works as a ceiling, based on a value that is
>> dependent on the vCPU's previous state (forcing userspace to recompute
>> the next quota on each exit),
> My understanding is that the intended use case is dependent on previous vCPU state by
> design.  E.g. a vCPU that is aggressively dirtying memory will be given a different
> quota than a vCPU that has historically not dirtied much memory.
>
> Even if the quotas are fixed, "recomputing" the new quota is simply:
>
> 	run->dirty_quota = run->dirty_quota_exit.count + PER_VCPU_DIRTY_QUOTA
>
> The main motivation for the ceiling approach is that it's simpler for KVM to implement
> since KVM never writes vcpu->run->dirty_quota.  E.g. userspace may observe a "spurious"
> exit if KVM reads the quota right before it's changed by userspace, but KVM doesn't
> have to guard against clobbering the quota.
>
> A credit based implementation would require (a) snapshotting the credit at
> some point during KVM_RUN, (b) disallowing changing the quota credit while the
> vCPU is running, or (c) an expensive atomic sequence (expensive on x86 at least)
> to avoid clobbering an update from userspace.  (a) is undesirable because it delays
> recognition of the new quota, especially if KVM doesn't re-read the quota in the
> tight run loop.  And AIUI, changing the quota while the vCPU is running is a desired
> use case, so that rules out (b).  The cost of (c) is not the end of the world, but
> IMO the benefits are marginal.
>
> E.g. if we go with a request-based implementation where kvm_vcpu_check_dirty_quota()
> is called in mark_page_dirty_in_slot(), then the ceiling-based approach is:
>
>    static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>    {
>          u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>
> 	if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
> 		return;
>
> 	/*
> 	 * Snapshot the quota to report it to userspace.  The dirty count will
> 	 * be captured when the request is processed.
> 	 */
> 	vcpu->dirty_quota = dirty_quota;
> 	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>    }
>
> whereas the credit-based approach would need to be something like:
>
>    static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>    {
>          u64 old_dirty_quota;
>
> 	if (!READ_ONCE(vcpu->run->dirty_quota_enabled))
> 		return;
>
> 	old_dirty_quota = atomic64_fetch_add_unless(vcpu->run->dirty_quota, -1, 0);
>
> 	/* Quota is not yet exhausted, or was already exhausted. */
> 	if (old_dirty_quota != 1)
> 		return;
>
> 	/*
> 	 * The dirty count will be re-captured in dirty_count when the request
> 	 * is processed so that userspace can compute the number of pages that
> 	 * were dirtied after the quota was exhausted.
> 	 */
> 	vcpu->dirty_count_at_exhaustion = vcpu->stat.generic.pages_dirtied;
> 	kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>    }
>
>> and with undocumented, arbitrary limits as a bonus.
> Eh, documenting that userspace needs to be aware of a theoretically-possible wrap
> is easy enough.  If we're truly worried about a wrap scenario, it'd be trivial to
> add a flag/ioctl() to let userspace reset vcpu->stat.generic.pages_dirtied.
Thank you so much Marc and Sean for the feedback. We can implement an 
ioctl to reset 'pages_dirtied' and mention in the documentation that 
userspace should reset it after each migration (regardless of whether 
the migration fails/succeeds). I hope this will address the concern raised.

We moved away from the credit-based approach due to the multiple atomic 
reads it requires. For now, we are sticking to not changing the quota 
while the vcpu is running. But we do need to track the pages dirtied at 
the time of the last quota update for the credit-based approach, and we 
need to share this with userspace, as pages_dirtied can overflow in 
certain circumstances and we might have to adjust the throttling 
accordingly. With this, the check would look like:

u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
u64 last_pages_dirtied = READ_ONCE(vcpu->run->last_pages_dirtied);
u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;

if (pages_dirtied - last_pages_dirtied < dirty_quota)
	return;

// Code to set exit reason here

Please post your suggestions. Thanks.
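As a sanity check of the wrap behavior of that window comparison, here is a hypothetical stand-alone model (names mirror the snippet above; this is not the kernel code):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the proposed check: compare the pages dirtied since the last
 * quota update against the quota.  Because the subtraction is unsigned
 * and therefore modulo 2^64, the window stays correct even if the
 * pages_dirtied counter wraps between quota updates.
 */
bool dirty_quota_exhausted(uint64_t pages_dirtied,
			   uint64_t last_pages_dirtied,
			   uint64_t dirty_quota)
{
	return pages_dirtied - last_pages_dirtied >= dirty_quota;
}
```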


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-30 11:05               ` Shivam Kumar
@ 2022-07-05  7:21                 ` Shivam Kumar
  2022-07-14 18:04                   ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Shivam Kumar @ 2022-07-05  7:21 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier
  Cc: pbonzini, james.morse, borntraeger, david, kvm, Shaju Abraham,
	Manish Mishra, Anurag Madnawat



On 30/05/22 4:35 pm, Shivam Kumar wrote:
> 
>>> It just shows that the proposed API is pretty bad, because instead of
>>> working as a credit, it works as a ceiling, based on a value that is
>>> dependent on the vCPU's previous state (forcing userspace to recompute
>>> the next quota on each exit),
>> My understanding is that the intended use case is dependent on previous 
>> vCPU state by
>> design.  E.g. a vCPU that is aggressively dirtying memory will be 
>> given a different
>> quota than a vCPU that has historically not dirtied much memory.
>>
>> Even if the quotas are fixed, "recomputing" the new quota is simply:
>>
>>     run->dirty_quota = run->dirty_quota_exit.count + PER_VCPU_DIRTY_QUOTA
>>
>> The main motivation for the ceiling approach is that it's simpler for 
>> KVM to implement
>> since KVM never writes vcpu->run->dirty_quota.  E.g. userspace may 
>> observe a "spurious"
>> exit if KVM reads the quota right before it's changed by userspace, 
>> but KVM doesn't
>> have to guard against clobbering the quota.
>>
>> A credit based implementation would require (a) snapshotting the 
>> credit at
>> some point during KVM_RUN, (b) disallowing changing the quota 
>> credit while the
>> vCPU is running, or (c) an expensive atomic sequence (expensive on x86 
>> at least)
>> to avoid clobbering an update from userspace.  (a) is undesirable 
>> because it delays
>> recognition of the new quota, especially if KVM doesn't re-read the 
>> quota in the
>> tight run loop.  And AIUI, changing the quota while the vCPU is 
>> running is a desired
>> use case, so that rules out (b).  The cost of (c) is not the end of 
>> the world, but
>> IMO the benefits are marginal.
>>
>> E.g. if we go with a request-based implementation where 
>> kvm_vcpu_check_dirty_quota()
>> is called in mark_page_dirty_in_slot(), then the ceiling-based 
>> approach is:
>>
>>    static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>>    {
>>          u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>>
>>     if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
>>         return;
>>
>>     /*
>>      * Snapshot the quota to report it to userspace.  The dirty count 
>> will
>>      * be captured when the request is processed.
>>      */
>>     vcpu->dirty_quota = dirty_quota;
>>     kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>>    }
>>
>> whereas the credit-based approach would need to be something like:
>>
>>    static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>>    {
>>          u64 old_dirty_quota;
>>
>>     if (!READ_ONCE(vcpu->run->dirty_quota_enabled))
>>         return;
>>
>>     old_dirty_quota = 
>> atomic64_fetch_add_unless(vcpu->run->dirty_quota, -1, 0);
>>
>>     /* Quota is not yet exhausted, or was already exhausted. */
>>     if (old_dirty_quota != 1)
>>         return;
>>
>>     /*
>>      * The dirty count will be re-captured in dirty_count when the 
>> request
>>      * is processed so that userspace can compute the number of pages 
>> that
>>      * were dirtied after the quota was exhausted.
>>      */
>>     vcpu->dirty_count_at_exhaustion = vcpu->stat.generic.pages_dirtied;
>>     kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>>    }
>>
>>> and with undocumented, arbitrary limits as a bonus.
>> Eh, documenting that userspace needs to be aware of a 
>> theoretically-possible wrap
>> is easy enough.  If we're truly worried about a wrap scenario, it'd be 
>> trivial to
>> add a flag/ioctl() to let userspace reset 
>> vcpu->stat.generic.pages_dirtied.
> Thank you so much Marc and Sean for the feedback. We can implement an 
> ioctl to reset 'pages_dirtied' and mention in the documentation that 
> userspace should reset it after each migration (regardless of whether 
> the migration fails/succeeds). I hope this will address the concern raised.
> 
> We moved away from the credit-based approach due to multiple atomic 
> reads. For now, we are sticking to not changing the quota while the vcpu 
> is running. But, we need to track the pages dirtied at the time of the 
> last quota update for credit-based approach. We also need to share this 
> with the userspace as pages dirtied can overflow in certain 
> circumstances and we might have to adjust the throttling accordingly. 
> With this, the check would look like:
> 
> u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> u64 last_pages_dirtied = READ_ONCE(vcpu->run->last_pages_dirtied);
> u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> 
> if (pages_dirtied - last_pages_dirtied < dirty_quota)
>      return;
> 
> // Code to set exit reason here
> 
> Please post your suggestions. Thanks.

Hi, here's a summary of what needs to be changed and what should be kept 
as it is (purely my opinion based on the discussions we have had so far):

i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm 
requests in dirty quota check. I hope that the ceiling-based approach, 
with proper documentation and an ioctl exposed for resetting 
'dirty_quota' and 'pages_dirtied', is good enough. Please post your 
suggestions if you think otherwise.
ii) The change in VMX's handle_invalid_guest_state() remains as it is.
iii) For now, we are lazily updating dirty quota, i.e. we are updating 
it only when the vcpu exits to userspace with the exit reason 
KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. So, we don't require an additional 
variable to capture the dirty quota at the time of dirty quota check. 
This may change in the future though.
iv) Adding a KVM capability for dirty quota so that userspace can detect 
if this feature is available or not. Please let me know if we can do it 
differently than having a KVM cap.

I'm happy to send v5 if the above points look good to you. If you have 
further suggestions, please let me know. I'm looking forward to them. 
Thank you so much for the review so far.


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-05  7:21                 ` Shivam Kumar
@ 2022-07-14 18:04                   ` Peter Xu
  2022-07-14 20:48                     ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-07-14 18:04 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: Sean Christopherson, Marc Zyngier, pbonzini, james.morse,
	borntraeger, david, kvm, Shaju Abraham, Manish Mishra,
	Anurag Madnawat

Hi, Shivam,

On Tue, Jul 05, 2022 at 12:51:01PM +0530, Shivam Kumar wrote:
> Hi, here's a summary of what needs to be changed and what should be kept as
> it is (purely my opinion based on the discussions we have had so far):
> 
> i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm requests
> in dirty quota check. I hope that the ceiling-based approach, with proper
> documentation and an ioctl exposed for resetting 'dirty_quota' and
> 'pages_dirtied', is good enough. Please post your suggestions if you think
> otherwise.

An ioctl just for this could be an overkill to me.

Currently you expose only "quota" in kvm_run; then on vmexit the exit
fields contain both "quota" and "count".  I've always thought that's a bit
redundant.

What I'm thinking is:

  (1) Expose both "quota" and "count" in kvm_run, then:

      "quota" should only be written by userspace and read by kernel.
      "count" should only be written by kernel and read by the userspace. [*]

      [*] One special case is when userspace finds that there's a risk of
      quota & count overflow, then the userspace:

        - Kick the vcpu out (so the kernel won't write to "count" anymore)
        - Update both "quota" and "count" to safe values
        - Resume the KVM_RUN

  (2) When quota reached, we don't need to copy quota/count in vmexit
      fields, since the userspace can read the realtime values in kvm_run.

Would this work?
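A user-space model of the ownership split and the overflow reset sequence above might look like this (all names are hypothetical; the real fields would live in kvm_run, and the reset runs only with the vCPU kicked out of KVM_RUN):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the proposed kvm_run sharing: "quota" is written
 * only by userspace, "count" only by the kernel, except during the
 * overflow reset, which runs while the vCPU is stopped.
 */
struct shared_run {
	uint64_t quota;	/* written by userspace, read by kernel */
	uint64_t count;	/* written by kernel, read by userspace */
};

/* Kernel side: bump count on each dirtied page; true means exit to user. */
bool mark_page_dirty(struct shared_run *run)
{
	return ++run->count >= run->quota;
}

/* Userspace side, with the vCPU kicked out: rebase both to safe values. */
void overflow_reset(struct shared_run *run, uint64_t new_quota)
{
	run->count = 0;
	run->quota = new_quota;
}
```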

> ii) The change in VMX's handle_invalid_guest_state() remains as it is.
> iii) For now, we are lazily updating dirty quota, i.e. we are updating it
> only when the vcpu exits to userspace with the exit reason
> KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.

At least with the above design, IMHO we can update kvm_run.quota in real time
without kicking vcpu threads (the quota only increases, except during the reset
phase for overflow).  Or is there any other reason you need to kick the vcpu
out when updating the quota?

Thanks,

-- 
Peter Xu



* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-14 18:04                   ` Peter Xu
@ 2022-07-14 20:48                     ` Sean Christopherson
  2022-07-14 22:19                       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-07-14 20:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Jul 14, 2022, Peter Xu wrote:
> Hi, Shivam,
> 
> On Tue, Jul 05, 2022 at 12:51:01PM +0530, Shivam Kumar wrote:
> > Hi, here's a summary of what needs to be changed and what should be kept as
> > it is (purely my opinion based on the discussions we have had so far):
> > 
> > i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm requests
> > in dirty quota check. I hope that the ceiling-based approach, with proper
> > documentation and an ioctl exposed for resetting 'dirty_quota' and
> > 'pages_dirtied', is good enough. Please post your suggestions if you think
> > otherwise.
> 
> An ioctl just for this could be an overkill to me.
>
> Currently you expose only "quota" to kvm_run, then when vmexit you have
> exit fields contain both "quota" and "count".  I always think it's a bit
> redundant.
> 
> What I'm thinking is:
> 
>   (1) Expose both "quota" and "count" in kvm_run, then:
> 
>       "quota" should only be written by userspace and read by kernel.
>       "count" should only be written by kernel and read by the userspace. [*]
> 
>       [*] One special case is when the userspace found that there's risk of
>       quota & count overflow, then the userspace:
> 
>         - Kick the vcpu out (so the kernel won't write to "count" anymore)
>         - Update both "quota" and "count" to safe values
>         - Resume the KVM_RUN
> 
>   (2) When quota reached, we don't need to copy quota/count in vmexit
>       fields, since the userspace can read the realtime values in kvm_run.
> 
> Would this work?

Technically, yes, practically speaking, no.  If KVM doesn't provide the quota
that _KVM_ saw at the time of exit, then there's no sane way to audit KVM exits
due to KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.  Providing the quota ensures userspace sees
sane, coherent data if there's a race between KVM checking the quota and userspace
updating the quota.  If KVM doesn't provide the quota, then userspace can see an
exit with "count < quota".

Even if userspace is ok with such races, it will be extremely difficult to detect
KVM issues if we mess something up because such behavior would have to be allowed
by KVM's ABI.

> > ii) The change in VMX's handle_invalid_guest_state() remains as it is.
> > iii) For now, we are lazily updating dirty quota, i.e. we are updating it
> > only when the vcpu exits to userspace with the exit reason
> > KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.
> 
> At least with above design, IMHO we can update kvm_run.quota in real time
> without kicking vcpu threads (the quota only increases except the reset
> phase for overflow).  Or is there any other reason you need to kick vcpu
> out when updating quota?

I'm not convinced overflow is actually possible.  IMO the current (v4) patch but
with a REQUEST instead of an in-line check is sufficient.


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-14 20:48                     ` Sean Christopherson
@ 2022-07-14 22:19                       ` Peter Xu
  2022-07-15 16:23                         ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-07-14 22:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Jul 14, 2022 at 08:48:04PM +0000, Sean Christopherson wrote:
> On Thu, Jul 14, 2022, Peter Xu wrote:
> > Hi, Shivam,
> > 
> > On Tue, Jul 05, 2022 at 12:51:01PM +0530, Shivam Kumar wrote:
> > > Hi, here's a summary of what needs to be changed and what should be kept as
> > > it is (purely my opinion based on the discussions we have had so far):
> > > 
> > > i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm requests
> > > in dirty quota check. I hope that the ceiling-based approach, with proper
> > > documentation and an ioctl exposed for resetting 'dirty_quota' and
> > > 'pages_dirtied', is good enough. Please post your suggestions if you think
> > > otherwise.
> > 
> > An ioctl just for this could be an overkill to me.
> >
> > Currently you expose only "quota" to kvm_run, then when vmexit you have
> > exit fields contain both "quota" and "count".  I always think it's a bit
> > redundant.
> > 
> > What I'm thinking is:
> > 
> >   (1) Expose both "quota" and "count" in kvm_run, then:
> > 
> >       "quota" should only be written by userspace and read by kernel.
> >       "count" should only be written by kernel and read by the userspace. [*]
> > 
> >       [*] One special case is when the userspace found that there's risk of
> >       quota & count overflow, then the userspace:
> > 
> >         - Kick the vcpu out (so the kernel won't write to "count" anymore)
> >         - Update both "quota" and "count" to safe values
> >         - Resume the KVM_RUN
> > 
> >   (2) When quota reached, we don't need to copy quota/count in vmexit
> >       fields, since the userspace can read the realtime values in kvm_run.
> > 
> > Would this work?
> 
> Technically, yes, practically speaking, no.  If KVM doesn't provide the quota
> that _KVM_ saw at the time of exit, then there's no sane way to audit KVM exits
> due to KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.  Providing the quota ensures userspace sees
> sane, coherent data if there's a race between KVM checking the quota and userspace
> updating the quota.  If KVM doesn't provide the quota, then userspace can see an
> exit with "count < quota".

This is a rare false positive, which should be acceptable in this case (the
same as a vmexit with count==quota when we had just planned to boost the
quota).  IMHO it's better than always kicking the vcpu, since the overhead of
such a false positive is only a vmexit and nothing else.

> 
> Even if userspace is ok with such races, it will be extremely difficult to detect
> KVM issues if we mess something up because such behavior would have to be allowed
> by KVM's ABI.

Could you elaborate?  We have quite a few places sharing these between
user/kernel on kvm_run, no?

> 
> > > ii) The change in VMX's handle_invalid_guest_state() remains as it is.
> > > iii) For now, we are lazily updating dirty quota, i.e. we are updating it
> > > only when the vcpu exits to userspace with the exit reason
> > > KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.
> > 
> > At least with above design, IMHO we can update kvm_run.quota in real time
> > without kicking vcpu threads (the quota only increases except the reset
> > phase for overflow).  Or is there any other reason you need to kick vcpu
> > out when updating quota?
> 
> I'm not convinced overflow is actually possible.

Yeah, on that part I second you.  I don't really think it'll ever happen
in practice, but I'd still like to make sure we won't introduce an ioctl
for the overflow only.

-- 
Peter Xu



* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-14 22:19                       ` Peter Xu
@ 2022-07-15 16:23                         ` Sean Christopherson
  2022-07-15 16:56                           ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-07-15 16:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Jul 14, 2022, Peter Xu wrote:
> On Thu, Jul 14, 2022 at 08:48:04PM +0000, Sean Christopherson wrote:
> > On Thu, Jul 14, 2022, Peter Xu wrote:
> > > Hi, Shivam,
> > > 
> > > On Tue, Jul 05, 2022 at 12:51:01PM +0530, Shivam Kumar wrote:
> > > > Hi, here's a summary of what needs to be changed and what should be kept as
> > > > it is (purely my opinion based on the discussions we have had so far):
> > > > 
> > > > i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm requests
> > > > in dirty quota check. I hope that the ceiling-based approach, with proper
> > > > documentation and an ioctl exposed for resetting 'dirty_quota' and
> > > > 'pages_dirtied', is good enough. Please post your suggestions if you think
> > > > otherwise.
> > > 
> > > An ioctl just for this could be an overkill to me.
> > >
> > > Currently you expose only "quota" to kvm_run, then when vmexit you have
> > > exit fields contain both "quota" and "count".  I always think it's a bit
> > > redundant.
> > > 
> > > What I'm thinking is:
> > > 
> > >   (1) Expose both "quota" and "count" in kvm_run, then:
> > > 
> > >       "quota" should only be written by userspace and read by kernel.
> > >       "count" should only be written by kernel and read by the userspace. [*]
> > > 
> > >       [*] One special case is when the userspace found that there's risk of
> > >       quota & count overflow, then the userspace:
> > > 
> > >         - Kick the vcpu out (so the kernel won't write to "count" anymore)
> > >         - Update both "quota" and "count" to safe values
> > >         - Resume the KVM_RUN
> > > 
> > >   (2) When quota reached, we don't need to copy quota/count in vmexit
> > >       fields, since the userspace can read the realtime values in kvm_run.
> > > 
> > > Would this work?
> > 
> > Technically, yes, practically speaking, no.  If KVM doesn't provide the quota
> > that _KVM_ saw at the time of exit, then there's no sane way to audit KVM exits
> > due to KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.  Providing the quota ensures userspace sees
> > sane, coherent data if there's a race between KVM checking the quota and userspace
> > updating the quota.  If KVM doesn't provide the quota, then userspace can see an
> > exit with "count < quota".
> 
> This is rare false positive which should be acceptable in this case (the
> same as vmexit with count==quota but we just planned to boost the quota),
> IMHO it's better than always kicking the vcpu, since the overhead for such
> false is only a vmexit but nothing else.

Oh, we're in complete agreement on that front.  I'm only objecting to forcing
userspace to read the realtime quota+count.  I want KVM to provide a snapshot of
the quota+count so that if there's a KVM bug, e.g. KVM spuriously exits, then
there is zero ambiguity, as the quota+count in the kvm_run exit fields will hold
the exact (possibly garbage) values KVM acted on.  Without a snapshot, if there
were a bug where KVM spuriously exited, root causing or even detecting the bug
would be difficult if userspace were dynamically updating the quota, as changing
the quota would have destroyed the evidence of KVM's bug.

It's unlikely we'll ever have such a bug, but filling the exit fields is cheap, and
because it's a union, the "redundant" fields don't consume extra space in kvm_run.

And the reasoning behind not having kvm_run.dirty_count is that it's fully
redundant if KVM provides a stat, and IMO such a stat will be quite helpful for
things beyond dirty quotas, e.g. being able to see which vCPUs are dirtying memory
from the command line for debug purposes.

> > Even if userspace is ok with such races, it will be extremely difficult to detect
> > KVM issues if we mess something up because such behavior would have to be allowed
> > by KVM's ABI.
> 
> Could you elaborate?  We have quite a few places sharing these between
> user/kernel on kvm_run, no?

I think I answered this above, let me know if I didn't.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-15 16:23                         ` Sean Christopherson
@ 2022-07-15 16:56                           ` Peter Xu
  2022-07-15 17:07                             ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-07-15 16:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Fri, Jul 15, 2022 at 04:23:54PM +0000, Sean Christopherson wrote:
> On Thu, Jul 14, 2022, Peter Xu wrote:
> > On Thu, Jul 14, 2022 at 08:48:04PM +0000, Sean Christopherson wrote:
> > > On Thu, Jul 14, 2022, Peter Xu wrote:
> > > > Hi, Shivam,
> > > > 
> > > > On Tue, Jul 05, 2022 at 12:51:01PM +0530, Shivam Kumar wrote:
> > > > > Hi, here's a summary of what needs to be changed and what should be kept as
> > > > > it is (purely my opinion based on the discussions we have had so far):
> > > > > 
> > > > > i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm requests
> > > > > in dirty quota check. I hope that the ceiling-based approach, with proper
> > > > > documentation and an ioctl exposed for resetting 'dirty_quota' and
> > > > > 'pages_dirtied', is good enough. Please post your suggestions if you think
> > > > > otherwise.
> > > > 
> > > > An ioctl just for this seems like overkill to me.
> > > >
> > > > Currently you expose only "quota" in kvm_run, then on vmexit the exit
> > > > fields contain both "quota" and "count".  I always thought that was a
> > > > bit redundant.
> > > > 
> > > > What I'm thinking is:
> > > > 
> > > >   (1) Expose both "quota" and "count" in kvm_run, then:
> > > > 
> > > >       "quota" should only be written by userspace and read by kernel.
> > > >       "count" should only be written by kernel and read by the userspace. [*]
> > > > 
> > > >       [*] One special case is when userspace finds that there's a risk
> > > >       of quota & count overflow, in which case userspace:
> > > > 
> > > >         - Kick the vcpu out (so the kernel won't write to "count" anymore)
> > > >         - Update both "quota" and "count" to safe values
> > > >         - Resume the KVM_RUN
> > > > 
> > > >   (2) When the quota is reached, we don't need to copy quota/count into
> > > >       the vmexit fields, since userspace can read the realtime values
> > > >       in kvm_run.
> > > > 
> > > > Would this work?
> > > 
> > > Technically, yes, practically speaking, no.  If KVM doesn't provide the quota
> > > that _KVM_ saw at the time of exit, then there's no sane way to audit KVM exits
> > > due to KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.  Providing the quota ensures userspace sees
> > > sane, coherent data if there's a race between KVM checking the quota and userspace
> > > updating the quota.  If KVM doesn't provide the quota, then userspace can see an
> > > exit with "count < quota".
> > 
> > This is a rare false positive, which should be acceptable in this case (the
> > same as a vmexit with count==quota when we had just planned to boost the
> > quota).  IMHO it's better than always kicking the vcpu, since the overhead
> > for such a false positive is only a vmexit and nothing else.
> 
> Oh, we're in complete agreement on that front.  I'm only objecting to forcing
> userspace to read the realtime quota+count.  I want KVM to provide a snapshot of
> the quota+count so that if there's a KVM bug, e.g. KVM spuriously exits, then
> there is zero ambiguity as the quota+count in the kvm_run exit field will hold
> invalid/garbage data.

That would take a double accident: a spurious exit, plus the exit reason
accidentally being set to exactly QUOTA_FULL. :)  But I see what you mean.

> Without a snapshot, if there were a bug where KVM spuriously
> exited, root causing or even detecting the bug would be difficult if userspace is
> dynamically updating the quota as changing the quota would have destroyed the
> evidence of KVM's bug.
> 
> It's unlikely we'll ever have such a bug, but filling the exit fields is cheap, and
> because it's a union, the "redundant" fields don't consume extra space in kvm_run.

Yes, no objection if you think that's better; the overhead is pretty trivial
indeed.

> 
> And the reasoning behind not having kvm_run.dirty_count is that it's fully
> redundant if KVM provides a stat, and IMO such a stat will be quite helpful for
> things beyond dirty quotas, e.g. being able to see which vCPUs are dirtying memory
> from the command line for debug purposes.

Not with overflow in mind?  Note that I totally agree the overflow may
never happen, but I think it makes sense to consider it for a complete
design of the ceiling-based approach.  Think of the Millennium bug; we never
know what will happen in the future.

So no objection either to having stats for dirty pages; it's just that if we
still want to cover the overflow issue, we'd make dirty_count writable, and
then it's still better placed in kvm_run, imho.

-- 
Peter Xu



* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-15 16:56                           ` Peter Xu
@ 2022-07-15 17:07                             ` Sean Christopherson
  2022-07-15 17:26                               ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Sean Christopherson @ 2022-07-15 17:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Fri, Jul 15, 2022, Peter Xu wrote:
> On Fri, Jul 15, 2022 at 04:23:54PM +0000, Sean Christopherson wrote:
> > And the reasoning behind not having kvm_run.dirty_count is that it's fully
> > redundant if KVM provides a stat, and IMO such a stat will be quite helpful for
> > things beyond dirty quotas, e.g. being able to see which vCPUs are dirtying memory
> > from the command line for debug purposes.
> 
> Not with overflow in mind?  Note that I totally agree the overflow may
> never happen, but I think it makes sense to consider it for a complete
> design of the ceiling-based approach.  Think of the Millennium bug; we never
> know what will happen in the future.
> 
> So no objection either to having stats for dirty pages; it's just that if we
> still want to cover the overflow issue, we'd make dirty_count writable, and
> then it's still better placed in kvm_run, imho.

Yeah, never say never, but unless my math is wrong, overflow isn't happening anytime
soon.  And if future CPUs can overflow the number of dirty pages, then they'll be
able to overflow a number of stats, at which point I think we'll want a generic
ioctl() to reset _all_ stats.


* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-15 17:07                             ` Sean Christopherson
@ 2022-07-15 17:26                               ` Peter Xu
  2022-07-15 22:38                                 ` Sean Christopherson
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-07-15 17:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Fri, Jul 15, 2022 at 05:07:45PM +0000, Sean Christopherson wrote:
> On Fri, Jul 15, 2022, Peter Xu wrote:
> > On Fri, Jul 15, 2022 at 04:23:54PM +0000, Sean Christopherson wrote:
> > > And the reasoning behind not having kvm_run.dirty_count is that it's fully
> > > redundant if KVM provides a stat, and IMO such a stat will be quite helpful for
> > > things beyond dirty quotas, e.g. being able to see which vCPUs are dirtying memory
> > > from the command line for debug purposes.
> > 
> > Not with overflow in mind?  Note that I totally agree the overflow may
> > never happen, but I think it makes sense to consider it for a complete
> > design of the ceiling-based approach.  Think of the Millennium bug; we never
> > know what will happen in the future.
> > 
> > So no objection either to having stats for dirty pages; it's just that if we
> > still want to cover the overflow issue, we'd make dirty_count writable, and
> > then it's still better placed in kvm_run, imho.
> 
> Yeah, never say never, but unless my math is wrong, overflow isn't happening anytime
> soon.  And if future CPUs can overflow the number of dirty pages, then they'll be
> able to overflow a number of stats, at which point I think we'll want a generic
> ioctl() to reset _all_ stats.

It's slightly different, as this affects functionality, unlike most stats.
I'll leave that to Shivam's preference.  If we go that way, I hope we can at
least have a WARN_ON_ONCE() to detect overflows of "count".

Thanks,

-- 
Peter Xu



* Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
  2022-07-15 17:26                               ` Peter Xu
@ 2022-07-15 22:38                                 ` Sean Christopherson
  0 siblings, 0 replies; 24+ messages in thread
From: Sean Christopherson @ 2022-07-15 22:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: Shivam Kumar, Marc Zyngier, pbonzini, james.morse, borntraeger,
	david, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Fri, Jul 15, 2022, Peter Xu wrote:
> On Fri, Jul 15, 2022 at 05:07:45PM +0000, Sean Christopherson wrote:
> > On Fri, Jul 15, 2022, Peter Xu wrote:
> > > On Fri, Jul 15, 2022 at 04:23:54PM +0000, Sean Christopherson wrote:
> > > > And the reasoning behind not having kvm_run.dirty_count is that it's fully
> > > > redundant if KVM provides a stat, and IMO such a stat will be quite helpful for
> > > > things beyond dirty quotas, e.g. being able to see which vCPUs are dirtying memory
> > > > from the command line for debug purposes.
> > > 
> > > Not with overflow in mind?  Note that I totally agree the overflow may
> > > never happen, but I think it makes sense to consider it for a complete
> > > design of the ceiling-based approach.  Think of the Millennium bug; we never
> > > know what will happen in the future.
> > > 
> > > So no objection either to having stats for dirty pages; it's just that if we
> > > still want to cover the overflow issue, we'd make dirty_count writable, and
> > > then it's still better placed in kvm_run, imho.
> > 
> > Yeah, never say never, but unless my math is wrong, overflow isn't happening anytime
> > soon.  And if future CPUs can overflow the number of dirty pages, then they'll be
> > able to overflow a number of stats, at which point I think we'll want a generic
> > ioctl() to reset _all_ stats.
> 
> It's slightly different, as this affects functionality, unlike most stats.

Now that KVM exposes stats via ioctls, I don't think we should assume _anything_
about how userspace consumes stats.  I.e. KVM should guarantee sane behavior and
correctness for all stats.

In other words, this is different in that it _directly_ affects KVM
functionality, but I think at this point we should assume that any and all
stats can indirectly affect KVM functionality.

> I'll leave that to Shivam's preference.  If we go that way, I hope we can at
> least have a WARN_ON_ONCE() to detect overflows of "count".

I still think it's silly, but I definitely won't object to:

	WARN_ON_ONCE(!++vcpu->stat.generic.pages_dirtied);


end of thread, other threads:[~2022-07-15 22:38 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-21 20:29 [PATCH v4 0/4] KVM: Dirty quota-based throttling Shivam Kumar
2022-05-21 20:29 ` [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
2022-05-24  7:11   ` Marc Zyngier
2022-05-26  9:33     ` Shivam Kumar
2022-05-26 13:30       ` Marc Zyngier
2022-05-26 15:44         ` Sean Christopherson
2022-05-26 16:16           ` Marc Zyngier
2022-05-26 17:46             ` Sean Christopherson
2022-05-30 11:05               ` Shivam Kumar
2022-07-05  7:21                 ` Shivam Kumar
2022-07-14 18:04                   ` Peter Xu
2022-07-14 20:48                     ` Sean Christopherson
2022-07-14 22:19                       ` Peter Xu
2022-07-15 16:23                         ` Sean Christopherson
2022-07-15 16:56                           ` Peter Xu
2022-07-15 17:07                             ` Sean Christopherson
2022-07-15 17:26                               ` Peter Xu
2022-07-15 22:38                                 ` Sean Christopherson
2022-05-21 20:29 ` [PATCH v4 2/4] KVM: arm64: Dirty " Shivam Kumar
2022-05-24  7:14   ` Marc Zyngier
2022-05-26  9:16     ` Shivam Kumar
2022-05-26 17:58       ` Sean Christopherson
2022-05-21 20:29 ` [PATCH v4 3/4] KVM: s390x: " Shivam Kumar
2022-05-21 20:29 ` [PATCH v4 4/4] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
