* [PATCH v3 0/3] KVM: Dirty quota-based throttling
@ 2022-03-06 22:08 Shivam Kumar
  2022-03-06 22:08 ` [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Shivam Kumar @ 2022-03-06 22:08 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: kvm, Shivam Kumar

This is v3 of the dirty quota series, with a few changes to the
previous implementation and the following additions:

i) Added blurb for dirty quota in the KVM API documentation.

ii) Added KVM selftests for dirty quota throttling.

Shivam Kumar (3):
  KVM: Implement dirty quota-based throttling of vcpus
  KVM: Documentation: Update kvm_run structure for dirty quota
  KVM: selftests: Add selftests for dirty quota throttling

 Documentation/virt/kvm/api.rst                | 28 ++++++++++++++
 arch/arm64/kvm/arm.c                          |  3 ++
 arch/s390/kvm/kvm-s390.c                      |  3 ++
 arch/x86/kvm/x86.c                            |  4 ++
 include/linux/kvm_host.h                      | 15 ++++++++
 include/linux/kvm_types.h                     |  1 +
 include/uapi/linux/kvm.h                      | 12 ++++++
 tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
 .../selftests/kvm/include/kvm_util_base.h     |  4 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |  7 +++-
 11 files changed, 145 insertions(+), 5 deletions(-)

-- 
2.22.3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-03-06 22:08 [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
@ 2022-03-06 22:08 ` Shivam Kumar
  2022-03-31  0:28   ` Sean Christopherson
  2022-05-02 22:14   ` Peter Xu
  2022-03-06 22:08 ` [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota Shivam Kumar
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Shivam Kumar @ 2022-03-06 22:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Define variables to track and throttle memory dirtying for every vcpu.

dirty_count:    Number of pages the vcpu has dirtied since its creation,
                while dirty logging is enabled.
dirty_quota:    Number of pages the vcpu is allowed to dirty. To dirty
                more, it needs to request more quota by exiting to
                userspace.

Implement the flow for throttling based on dirty quota.

i) Increment dirty_count for the vcpu whenever it dirties a page.
ii) Exit to userspace whenever the dirty quota is exhausted (i.e. dirty
count equals/exceeds dirty quota) to request more dirty quota.
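For illustration, the kernel-side check and the userspace replenish step
can be modeled in plain C. This is a standalone sketch, not the actual
KVM code: the struct below mocks only the relevant kvm_run fields, and
the names (mock_kvm_run, mock_replenish_quota) are made up for the
example.

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the relevant kvm_run fields; not the real uapi layout. */
struct mock_kvm_run {
	uint32_t exit_reason;
	uint64_t dirty_quota;
	struct {
		uint64_t count;
		uint64_t quota;
	} dirty_quota_exit;
};

#define MOCK_EXIT_DIRTY_QUOTA_EXHAUSTED 36

/*
 * Mirrors the kernel-side check: returns 1 to keep running the vcpu,
 * 0 when the quota is exhausted and an exit to userspace is needed.
 * A quota of 0 means throttling is disabled.
 */
static int mock_check_dirty_quota(struct mock_kvm_run *run,
				  uint64_t pages_dirtied)
{
	uint64_t dirty_quota = run->dirty_quota;

	if (!dirty_quota || pages_dirtied < dirty_quota)
		return 1;

	run->exit_reason = MOCK_EXIT_DIRTY_QUOTA_EXHAUSTED;
	run->dirty_quota_exit.count = pages_dirtied;
	run->dirty_quota_exit.quota = dirty_quota;
	return 0;
}

/*
 * Mirrors the userspace side: grant a fresh allowance on a quota exit.
 * The new quota is absolute, i.e. the current dirty count plus the
 * increment.
 */
static void mock_replenish_quota(struct mock_kvm_run *run, uint64_t increment)
{
	run->dirty_quota = run->dirty_quota_exit.count + increment;
}
```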

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 arch/arm64/kvm/arm.c      |  3 +++
 arch/s390/kvm/kvm-s390.c  |  3 +++
 arch/x86/kvm/x86.c        |  4 ++++
 include/linux/kvm_host.h  | 15 +++++++++++++++
 include/linux/kvm_types.h |  1 +
 include/uapi/linux/kvm.h  | 12 ++++++++++++
 virt/kvm/kvm_main.c       |  7 ++++++-
 7 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ecc5958e27fe..5b6a239b83a5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -848,6 +848,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	ret = 1;
 	run->exit_reason = KVM_EXIT_UNKNOWN;
 	while (ret > 0) {
+		ret = kvm_vcpu_check_dirty_quota(vcpu);
+		if (!ret)
+			break;
 		/*
 		 * Check conditions before entering the guest
 		 */
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2296b1ff1e02..9cc0e0583ef4 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3994,6 +3994,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu)
 static int vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	int rc, cpuflags;
+	rc = kvm_vcpu_check_dirty_quota(vcpu);
+	if (!rc)
+		return -EREMOTE;
 
 	/*
 	 * On s390 notifications for arriving pages will be delivered directly
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb4029660bd9..0b35b8cc0274 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
+		r = kvm_vcpu_check_dirty_quota(vcpu);
+		if (!r)
+			break;
+
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
 		} else {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f11039944c08..b1c599c78c42 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
 }
 
+static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
+{
+	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
+	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
+	struct kvm_run *run = vcpu->run;
+
+	if (!dirty_quota || (pages_dirtied < dirty_quota))
+		return 1;
+
+	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
+	run->dirty_quota_exit.count = pages_dirtied;
+	run->dirty_quota_exit.quota = dirty_quota;
+	return 0;
+}
+
 /*
  * Some of the bitops functions do not support too long bitmaps.
  * This number must be determined not to exceed such limits.
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index dceac12c1ce5..7f42486b0405 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -106,6 +106,7 @@ struct kvm_vcpu_stat_generic {
 	u64 halt_poll_fail_hist[HALT_POLL_HIST_COUNT];
 	u64 halt_wait_hist[HALT_POLL_HIST_COUNT];
 	u64 blocking;
+	u64 pages_dirtied;
 };
 
 #define KVM_STATS_NAME_SIZE	48
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 507ee1f2aa96..1d9531efe1fb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_DIRTY_QUOTA_EXHAUSTED 36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -487,6 +488,11 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -508,6 +514,12 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the
+	 * quota is reached/exceeded.
+	 */
+	__u64 dirty_quota;
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0afc016cc54d..041ab464405d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3163,7 +3163,12 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
 		return;
 #endif
 
-	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
+	if (!memslot)
+		return;
+
+	vcpu->stat.generic.pages_dirtied++;
+
+	if (kvm_slot_dirty_track_enabled(memslot)) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
 		u32 slot = (memslot->as_id << 16) | memslot->id;
 
-- 
2.22.3



* [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-03-06 22:08 [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
  2022-03-06 22:08 ` [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
@ 2022-03-06 22:08 ` Shivam Kumar
  2022-03-31  0:40   ` Sean Christopherson
  2022-03-06 22:08 ` [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
  2022-03-19 18:21 ` [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
  3 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-03-06 22:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Update the kvm_run structure with a brief description of dirty
quota members and how dirty quota throttling works.
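As a purely illustrative sketch of the "strategy" mentioned below, one
of the simplest policies a VMM could use is to split a VM-wide dirty
budget evenly among the vCPUs; the even split and the budget numbers
are assumptions for the example, not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Evenly split a VM-wide dirty-page budget for one throttling interval
 * among the vCPUs.  The result is the quota increment to grant a vCPU
 * each time it exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.
 */
static uint64_t per_vcpu_quota_increment(uint64_t vm_dirty_budget,
					 uint32_t nr_vcpus)
{
	if (!nr_vcpus)
		return 0;
	return vm_dirty_budget / nr_vcpus;
}
```

A real policy could instead weight the split by each vCPU's observed
dirty rate, or put a vCPU to sleep rather than immediately raising its
quota.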

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f3172376ec3..50e001473b1f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6125,6 +6125,23 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
+		struct {
+			__u64 count;
+			__u64 quota;
+		} dirty_quota_exit;
+If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
+exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
+makes the following information available to the userspace:
+	'count' field: the current count of pages dirtied by the VCPU,
+	'quota' field: the observed dirty quota just before the exit to userspace.
+The userspace can design a strategy to allocate the overall scope of dirtying
+for the VM among the vcpus. Based on the strategy and the current state of dirty
+quota throttling, the userspace can make a decision to either update (increase)
+the quota or to put the VCPU to sleep for some time.
+
 ::
 
 		/* Fix the size of the union. */
@@ -6159,6 +6176,17 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
 ::
 
+	/*
+	 * Number of pages the vCPU is allowed to have dirtied over its entire
+	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
+	 * is reached/exceeded.
+	 */
+	__u64 dirty_quota;
+Please note that this quota cannot be strictly enforced if PML is enabled, and
+the VCPU may end up dirtying pages more than its quota. The difference however
+is bounded by the PML buffer size.
+
+::
   };
 
 
-- 
2.22.3



* [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling
  2022-03-06 22:08 [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
  2022-03-06 22:08 ` [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
  2022-03-06 22:08 ` [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota Shivam Kumar
@ 2022-03-06 22:08 ` Shivam Kumar
  2022-04-18  4:55   ` Shivam Kumar
  2022-03-19 18:21 ` [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
  3 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-03-06 22:08 UTC (permalink / raw)
  To: pbonzini, seanjc
  Cc: kvm, Shivam Kumar, Shaju Abraham, Manish Mishra, Anurag Madnawat

Add selftests for dirty quota throttling with an optional -q parameter
to configure by what value the dirty quota should be incremented after
each dirty quota exit. With very small intervals, a small dirty quota
increment ensures that the dirty quota exit path is exercised. A zero
value disables dirty quota throttling, so that dirty logging without
dirty quota throttling can also be tested.
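The core validity check the test applies on each dirty quota exit can
be restated as a standalone predicate (a sketch of the assertions in
the patch below; PML_BUFFER_SIZE refers to the 512-entry Intel Page
Modification Logging buffer):

```c
#include <assert.h>
#include <stdint.h>

#define PML_BUFFER_SIZE 512

/*
 * A dirty quota exit is valid if the quota was actually exhausted
 * (count >= quota) and any overrun is bounded by the PML buffer size,
 * since with PML the vcpu can dirty up to a buffer's worth of pages
 * before KVM rechecks the quota.
 */
static int dirty_quota_exit_is_valid(uint64_t count, uint64_t quota)
{
	return count >= quota && count <= quota + PML_BUFFER_SIZE;
}
```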

Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
 .../selftests/kvm/include/kvm_util_base.h     |  4 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
 3 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 3fcd89e195c7..e75d826e21fb 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -65,6 +65,8 @@
 
 #define SIG_IPI SIGUSR1
 
+#define TEST_DIRTY_QUOTA_INCREMENT		8
+
 /*
  * Guest/Host shared variables. Ensure addr_gva2hva() and/or
  * sync_global_to/from_guest() are used when accessing from
@@ -191,6 +193,7 @@ static enum log_mode_t host_log_mode_option = LOG_MODE_ALL;
 static enum log_mode_t host_log_mode;
 static pthread_t vcpu_thread;
 static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
+static uint64_t test_dirty_quota_increment = TEST_DIRTY_QUOTA_INCREMENT;
 
 static void vcpu_kick(void)
 {
@@ -210,6 +213,13 @@ static void sem_wait_until(sem_t *sem)
 	while (ret == -1 && errno == EINTR);
 }
 
+static void set_dirty_quota(struct kvm_vm *vm, uint64_t dirty_quota)
+{
+	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+
+	vcpu_set_dirty_quota(run, dirty_quota);
+}
+
 static bool clear_log_supported(void)
 {
 	return kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
@@ -260,9 +270,13 @@ static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
 		    "vcpu run failed: errno=%d", err);
 
-	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
-		    "Invalid guest sync status: exit_reason=%s\n",
-		    exit_reason_str(run->exit_reason));
+	if (test_dirty_quota_increment &&
+		run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED)
+		vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
+	else
+		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
+			"Invalid guest sync status: exit_reason=%s\n",
+			exit_reason_str(run->exit_reason));
 
 	vcpu_handle_sync_stop();
 }
@@ -377,6 +391,9 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
 	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
 		/* We should allow this to continue */
 		;
+	} else if (test_dirty_quota_increment &&
+		run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
+		vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
 	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
 		   (ret == -1 && err == EINTR)) {
 		/* Update the flag first before pause */
@@ -773,6 +790,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	sync_global_to_guest(vm, guest_test_virt_mem);
 	sync_global_to_guest(vm, guest_num_pages);
 
+	/* Initialise dirty quota */
+	if (test_dirty_quota_increment)
+		set_dirty_quota(vm, test_dirty_quota_increment);
+
 	/* Start the iterations */
 	iteration = 1;
 	sync_global_to_guest(vm, iteration);
@@ -814,6 +835,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
 	log_mode_before_vcpu_join();
+	/* Terminate dirty quota throttling */
+	if (test_dirty_quota_increment)
+		set_dirty_quota(vm, 0);
 	pthread_join(vcpu_thread, NULL);
 
 	pr_info("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
@@ -835,6 +859,8 @@ static void help(char *name)
 	printf(" -c: specify dirty ring size, in number of entries\n");
 	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
 	       TEST_DIRTY_RING_COUNT);
+	printf(" -q: specify incremental dirty quota (default: %"PRIu32")\n",
+	       TEST_DIRTY_QUOTA_INCREMENT);
 	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
 	       TEST_HOST_LOOP_N);
 	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
@@ -863,11 +889,14 @@ int main(int argc, char *argv[])
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "c:q:hi:I:p:m:M:")) != -1) {
 		switch (opt) {
 		case 'c':
 			test_dirty_ring_count = strtol(optarg, NULL, 10);
 			break;
+		case 'q':
+			test_dirty_quota_increment = strtol(optarg, NULL, 10);
+			break;
 		case 'i':
 			p.iterations = strtol(optarg, NULL, 10);
 			break;
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 4ed6aa049a91..b70732998329 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -395,4 +395,8 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
 
 uint32_t guest_get_vcpuid(void);
 
+void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota);
+void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
+			uint64_t test_dirty_quota_increment);
+
 #endif /* SELFTEST_KVM_UTIL_BASE_H */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d8cf851ab119..fa77558d745e 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -19,6 +19,7 @@
 #include <linux/kernel.h>
 
 #define KVM_UTIL_MIN_PFN	2
+#define PML_BUFFER_SIZE	512
 
 static int vcpu_mmap_sz(void);
 
@@ -2286,6 +2287,7 @@ static struct exit_reason {
 	{KVM_EXIT_X86_RDMSR, "RDMSR"},
 	{KVM_EXIT_X86_WRMSR, "WRMSR"},
 	{KVM_EXIT_XEN, "XEN"},
+	{KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, "DIRTY_QUOTA_EXHAUSTED"},
 #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
 	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
 #endif
@@ -2517,3 +2519,37 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid)
 
 	return ioctl(vcpu->fd, KVM_GET_STATS_FD, NULL);
 }
+
+void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota)
+{
+	run->dirty_quota = dirty_quota;
+
+	if (dirty_quota)
+		pr_info("Dirty quota throttling enabled with initial quota %"PRIu64"\n",
+			dirty_quota);
+	else
+		pr_info("Dirty quota throttling disabled\n");
+}
+
+void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
+			uint64_t test_dirty_quota_increment)
+{
+	uint64_t quota = run->dirty_quota_exit.quota;
+	uint64_t count = run->dirty_quota_exit.count;
+
+	/*
+	 * Due to PML, number of pages dirtied by the vcpu can exceed its dirty
+	 * quota by PML buffer size.
+	 */
+	TEST_ASSERT(count <= quota + PML_BUFFER_SIZE, "Invalid number of pages "
+		    "dirtied: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
+
+	TEST_ASSERT(count >= quota, "Dirty quota exit happened with quota yet to "
+		    "be exhausted: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
+
+	if (count > quota)
+		pr_info("Dirty quota exit with unequal quota and count: "
+			"count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
+
+	run->dirty_quota = count + test_dirty_quota_increment;
+}
-- 
2.22.3



* Re: [PATCH v3 0/3] KVM: Dirty quota-based throttling
  2022-03-06 22:08 [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
                   ` (2 preceding siblings ...)
  2022-03-06 22:08 ` [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
@ 2022-03-19 18:21 ` Shivam Kumar
  3 siblings, 0 replies; 26+ messages in thread
From: Shivam Kumar @ 2022-03-19 18:21 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: kvm


On 07/03/22 3:38 am, Shivam Kumar wrote:
> This is v3 of the dirty quota series, with a few changes to the
> previous implementation and the following additions:
>
> i) Added blurb for dirty quota in the KVM API documentation.
>
> ii) Added KVM selftests for dirty quota throttling.
>
> Shivam Kumar (3):
>    KVM: Implement dirty quota-based throttling of vcpus
>    KVM: Documentation: Update kvm_run structure for dirty quota
>    KVM: selftests: Add selftests for dirty quota throttling
>
>   Documentation/virt/kvm/api.rst                | 28 ++++++++++++++
>   arch/arm64/kvm/arm.c                          |  3 ++
>   arch/s390/kvm/kvm-s390.c                      |  3 ++
>   arch/x86/kvm/x86.c                            |  4 ++
>   include/linux/kvm_host.h                      | 15 ++++++++
>   include/linux/kvm_types.h                     |  1 +
>   include/uapi/linux/kvm.h                      | 12 ++++++
>   tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
>   .../selftests/kvm/include/kvm_util_base.h     |  4 ++
>   tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
>   virt/kvm/kvm_main.c                           |  7 +++-
>   11 files changed, 145 insertions(+), 5 deletions(-)
>
Grateful for the last reviews. It would be great to receive some
feedback on this; it will help me move forward. Thank you.



* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-03-06 22:08 ` [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
@ 2022-03-31  0:28   ` Sean Christopherson
  2022-03-31  7:20     ` Shivam Kumar
  2022-05-02 22:14   ` Peter Xu
  1 sibling, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2022-03-31  0:28 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

This whole series needs to be Cc'd to the arm64 and s390 folks.  The easiest way
to that is to use scripts/get_maintainers.pl, which will grab the appropriate
people.  There are a variety of options you can use to tailor it to your style.
E.g. for KVM I do

  --nogit --nogit-fallback --norolestats --nofixes --pattern-depth=1

for To:, and then add

  --nom

for Cc:.  The --pattern-depth=1 tells it to not recurse up so that it doesn't
include the x86 maintainers for arch/x86/kvm patches.

I'd Cc them manually, but I think it'll be easier to just post v4.

On Sun, Mar 06, 2022, Shivam Kumar wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb4029660bd9..0b35b8cc0274 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  	vcpu->arch.l1tf_flush_l1d = true;
>  
>  	for (;;) {
> +		r = kvm_vcpu_check_dirty_quota(vcpu);
> +		if (!r)
> +			break;
> +
>  		if (kvm_vcpu_running(vcpu)) {
>  			r = vcpu_enter_guest(vcpu);
>  		} else {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f11039944c08..b1c599c78c42 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>  	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>  }
>  
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> +{
> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +	struct kvm_run *run = vcpu->run;

Might as well use "run" when reading the dirty quota.

> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;

I don't love returning 0/1 from a function that suggests it returns a bool, but
I do agree it's better than actually returning a bool.  I also don't have a better
name, so I'm just whining in the hope that Paolo or someone else has an idea :-)

> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;
> +	return 0;
> +}


* Re: [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-03-06 22:08 ` [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota Shivam Kumar
@ 2022-03-31  0:40   ` Sean Christopherson
  2022-03-31  7:30     ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2022-03-31  0:40 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Sun, Mar 06, 2022, Shivam Kumar wrote:
> Update the kvm_run structure with a brief description of dirty
> quota members and how dirty quota throttling works.

This should be squashed with patch 1.  I actually had to look ahead to this patch
because I forgot the details since I last reviewed this :-)

> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>  Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f3172376ec3..50e001473b1f 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6125,6 +6125,23 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> +		/* KVM_EXIT_DIRTY_QUOTA_EXHAUSTED */
> +		struct {
> +			__u64 count;
> +			__u64 quota;
> +		} dirty_quota_exit;
> +If exit reason is KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, it indicates that the VCPU has
> +exhausted its dirty quota. The 'dirty_quota_exit' member of kvm_run structure
> +makes the following information available to the userspace:
> +	'count' field: the current count of pages dirtied by the VCPU,
> +	'quota' field: the observed dirty quota just before the exit to userspace.
> +The userspace can design a strategy to allocate the overall scope of dirtying
> +for the VM among the vcpus. Based on the strategy and the current state of dirty
> +quota throttling, the userspace can make a decision to either update (increase)
> +the quota or to put the VCPU to sleep for some time.
> +
>  ::
>  
>  		/* Fix the size of the union. */
> @@ -6159,6 +6176,17 @@ values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
>  
>  ::
>  
> +	/*
> +	 * Number of pages the vCPU is allowed to have dirtied over its entire
> +	 * lifetime.  KVM_RUN exits with KVM_EXIT_DIRTY_QUOTA_EXHAUSTED if the quota
> +	 * is reached/exceeded.
> +	 */
> +	__u64 dirty_quota;
> +Please note that this quota cannot be strictly enforced if PML is enabled, and
> +the VCPU may end up dirtying pages more than its quota. The difference however
> +is bounded by the PML buffer size.

If you want to be pedantic, I doubt KVM can strictly enforce the quota even if PML
is disabled.  E.g. I can all but guarantee that it's possible to dirty multiple
pages during a single exit.  Probably also worth spelling out PML and genericizing
things.  Maybe

  Please note that enforcing the quota is best effort, as the guest may dirty
  multiple pages before KVM can recheck the quota.  However, unless KVM is using
  a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
  KVM will detect quota exhaustion within a handful of dirtied pages.  If a
  hardware ring buffer is used, the overrun is bounded by the size of the buffer
  (512 entries for PML).


* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-03-31  0:28   ` Sean Christopherson
@ 2022-03-31  7:20     ` Shivam Kumar
  2022-03-31 15:37       ` Sean Christopherson
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-03-31  7:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 31/03/22 5:58 am, Sean Christopherson wrote:
> This whole series needs to be Cc'd to the arm64 and s390 folks.  The easiest way
> to that is to use scripts/get_maintainers.pl, which will grab the appropriate
> people.  There are a variety of options you can use to tailor it to your style.
> E.g. for KVM I do
>
>    --nogit --nogit-fallback --norolestats --nofixes --pattern-depth=1
>
> for To:, and then add
>
>    --nom
>
> for Cc:.  The --pattern-depth=1 tells it to not recurse up so that it doesn't
> include the x86 maintainers for arch/x86/kvm patches.
>
> I'd Cc them manually, but I think it'll be easier to just post v4.

Thanks. I'm waiting for some reviews on the selftests (the third patch
of this series). As soon as I receive some, I'll send v4.

>
> On Sun, Mar 06, 2022, Shivam Kumar wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index eb4029660bd9..0b35b8cc0274 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10257,6 +10257,10 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>>   	vcpu->arch.l1tf_flush_l1d = true;
>>   
>>   	for (;;) {
>> +		r = kvm_vcpu_check_dirty_quota(vcpu);
>> +		if (!r)
>> +			break;
>> +
>>   		if (kvm_vcpu_running(vcpu)) {
>>   			r = vcpu_enter_guest(vcpu);
>>   		} else {
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index f11039944c08..b1c599c78c42 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -530,6 +530,21 @@ static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
>>   	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
>>   }
>>   
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +	struct kvm_run *run = vcpu->run;
> Might as well use "run" when reading the dirty quota.
Sure. Thanks.
>
>> +
>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +		return 1;
> I don't love returning 0/1 from a function that suggests it returns a bool, but
> I do agree it's better than actually returning a bool.  I also don't have a better
> name, so I'm just whining in the hope that Paolo or someone else has an idea :-)
I've seen plenty of check functions returning 0/1, but please do let me
know if there's a convention to use a bool in such scenarios. I'm also
looking for a better name, but this one also looks good enough to me.
>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +	run->dirty_quota_exit.count = pages_dirtied;
>> +	run->dirty_quota_exit.quota = dirty_quota;
>> +	return 0;
>> +}


* Re: [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-03-31  0:40   ` Sean Christopherson
@ 2022-03-31  7:30     ` Shivam Kumar
  2022-03-31 15:24       ` Sean Christopherson
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-03-31  7:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 31/03/22 6:10 am, Sean Christopherson wrote:
> On Sun, Mar 06, 2022, Shivam Kumar wrote:
>> Update the kvm_run structure with a brief description of dirty
>> quota members and how dirty quota throttling works.
> This should be squashed with patch 1.  I actually had to look ahead to this patch
> because I forgot the details since I last reviewed this :-)
Ack. Thanks.
>> +	__u64 dirty_quota;
>> +Please note that this quota cannot be strictly enforced if PML is enabled, and
>> +the VCPU may end up dirtying pages more than its quota. The difference however
>> +is bounded by the PML buffer size.
> If you want to be pedantic, I doubt KVM can strictly enforce the quota even if PML
> is disabled.  E.g. I can all but guarantee that it's possible to dirty multiple
> pages during a single exit.  Probably also worth spelling out PML and genericizing
> things.  Maybe
>
>    Please note that enforcing the quota is best effort, as the guest may dirty
>    multiple pages before KVM can recheck the quota.  However, unless KVM is using
>    a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
>    KVM will detect quota exhaustion within a handful of dirtied pages.  If a
>    hardware ring buffer is used, the overrun is bounded by the size of the buffer
>    (512 entries for PML).
Thank you for the blurb. Looks good to me, though I'm curious about the exits
that can dirty multiple pages.

Thank you so much for the review, Sean. I'm hoping I could get a one-pass
review from you on the third patch in this series (selftests for dirty quota).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-03-31  7:30     ` Shivam Kumar
@ 2022-03-31 15:24       ` Sean Christopherson
  2022-04-01 13:49         ` Sean Christopherson
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2022-03-31 15:24 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Mar 31, 2022, Shivam Kumar wrote:
> 
> On 31/03/22 6:10 am, Sean Christopherson wrote:
> > On Sun, Mar 06, 2022, Shivam Kumar wrote:
> > > Update the kvm_run structure with a brief description of dirty
> > > quota members and how dirty quota throttling works.
> > This should be squashed with patch 1.  I actually had to look ahead to this patch
> > because I forgot the details since I last reviewed this :-)
> Ack. Thanks.
> > > +	__u64 dirty_quota;
> > > +Please note that this quota cannot be strictly enforced if PML is enabled, and
> > > +the VCPU may end up dirtying pages more than its quota. The difference however
> > > +is bounded by the PML buffer size.
> > If you want to be pedantic, I doubt KVM can strictly enforce the quota even if PML
> > is disabled.  E.g. I can all but guarantee that it's possible to dirty multiple
> > pages during a single exit.  Probably also worth spelling out PML and genericizing
> > things.  Maybe
> > 
> >    Please note that enforcing the quota is best effort, as the guest may dirty
> >    multiple pages before KVM can recheck the quota.  However, unless KVM is using
> >    a hardware-based dirty ring buffer, e.g. Intel's Page Modification Logging,
> >    KVM will detect quota exhaustion within a handful of dirtied pages.  If a
> >    hardware ring buffer is used, the overrun is bounded by the size of the buffer
> >    (512 entries for PML).
> Thank you for the blurb. Looks good to me, though I'm curious about the exits
> that can dirty multiple pages.

Anything that touches multiple pages.  nested_mark_vmcs12_pages_dirty() is an
easy example.  Emulating L2 with nested TDP.  An emulated instruction that splits
a page.  I'm pretty sure FNAME(sync_page) could dirty an entire page's worth of
SPTEs, and that's waaay too deep to bail from.

Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() guards the call
to mark_page_dirty_in_slot() with kvm_slot_dirty_track_enabled(), which means it
won't honor the dirty quota unless dirty logging is enabled.  Probably not an issue
for the intended use case, but it'll result in wrong stats, and technically the
dirty quota can be enabled without dirty logging being enabled.

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4739b53c9734..df0349be388b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -182,7 +182,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                  "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
                  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));

-       if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
+       if (spte & PT_WRITABLE_MASK) {
                /* Enforced by kvm_mmu_hugepage_adjust. */
                WARN_ON(level > PG_LEVEL_4K);
                mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);


And thinking more about silly edge cases, VMX's big emulation loop for invalid
guest state when unrestricted guest is disabled should probably explicitly check
the dirty quota.  Again, I doubt it matters to anyone's use case, but since it's
treated as a full run loop for things like pending signals, it'd be good to be
consistent.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 84a7500cd80c..5e1ae373634c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5511,6 +5511,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
                 */
                if (__xfer_to_guest_mode_work_pending())
                        return 1;
+
+               if (!kvm_vcpu_check_dirty_quota(vcpu))
+                       return 0;
        }

        return 1;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-03-31  7:20     ` Shivam Kumar
@ 2022-03-31 15:37       ` Sean Christopherson
  2022-04-06 12:32         ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2022-03-31 15:37 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Mar 31, 2022, Shivam Kumar wrote:
> > > +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> > > +		return 1;
> > I don't love returning 0/1 from a function that suggests it returns a bool, but
> > I do agree it's better than actually returning a bool.  I also don't have a better
> > name, so I'm just whining in the hope that Paolo or someone else has an idea :-)
> I've seen plenty of check functions returning 0/1 but please do let me know
> if there's a convention to use a bool in such scenarios.

The preferred style for KVM is to return a bool for helpers that are obviously
testing something, e.g. functions with names like "is_valid", "check_request", etc...
But we also very strongly prefer not returning bools from functions that have
side effects or can fail, i.e. don't use a bool to indicate success.

KVM has a third, gross use case of 0/1, where 0 means "exit to userspace" and 1
means "re-enter the guest".  Unfortunately, it's so ubiquitous that replacing it
with a proper enum is all but guaranteed to introduce bugs, and the 0/1 behavior
allows KVM to do things like "if (!some_function())".

This helper falls into this last category of KVM's special 0/1 handling.  The
reason I don't love the name is the "check" part, which also puts it into "this
is a check helper".  But returning a bool would be even worse because the helper
does more than just check the quota, it also fills in the exit reason.
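The 0/1 convention Sean describes can be illustrated with a toy, non-kernel example (check_a/check_b are hypothetical stand-ins for run-loop checks like the dirty-quota helper; this is not actual KVM code):

```c
#include <assert.h>

/*
 * Toy model of KVM's 0/1 run-loop convention: 0 means "exit to
 * userspace", 1 means "re-enter the guest", so callers can chain
 * checks and bail out on the first zero.
 */
static int want_userspace_exit;	/* set by a "check" that wants to bail */

static int check_a(void) { return 1; }	/* always allows re-entry */
static int check_b(void) { return !want_userspace_exit; }

static int vcpu_run_once(void)
{
	if (!check_a())
		return 0;	/* propagate "exit to userspace" */
	if (!check_b())
		return 0;
	return 1;		/* all checks passed: re-enter the guest */
}
```

The "if (!some_function())" pattern mentioned above falls out naturally from this shape.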

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-03-31 15:24       ` Sean Christopherson
@ 2022-04-01 13:49         ` Sean Christopherson
  2022-04-06 12:39           ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2022-04-01 13:49 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Mar 31, 2022, Sean Christopherson wrote:
> Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() guards the call
> to mark_page_dirty_in_slot() with kvm_slot_dirty_track_enabled(), which means it
> won't honor the dirty quota unless dirty logging is enabled.  Probably not an issue
> for the intended use case, but it'll result in wrong stats, and technically the
> dirty quota can be enabled without dirty logging being enabled.
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4739b53c9734..df0349be388b 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -182,7 +182,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>                   "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>                   get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> 
> -       if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
> +       if (spte & PT_WRITABLE_MASK) {
>                 /* Enforced by kvm_mmu_hugepage_adjust. */
>                 WARN_ON(level > PG_LEVEL_4K);
>                 mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);

This pseudopatch is buggy; the WARN_ON() will obviously fire.  The easiest thing
would be to move the condition into the WARN_ON:

		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));

That brings up another thing that's worth documenting: the dirty_count will be
skewed based on the size of the pages accessed by each vCPU.  I still think having
the stat always count will be useful though.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-03-31 15:37       ` Sean Christopherson
@ 2022-04-06 12:32         ` Shivam Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Shivam Kumar @ 2022-04-06 12:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 31/03/22 9:07 pm, Sean Christopherson wrote:
> On Thu, Mar 31, 2022, Shivam Kumar wrote:
>>>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>>>> +		return 1;
>>> I don't love returning 0/1 from a function that suggests it returns a bool, but
>>> I do agree it's better than actually returning a bool.  I also don't have a better
>>> name, so I'm just whining in the hope that Paolo or someone else has an idea :-)
>> I've seen plenty of check functions returning 0/1 but please do let me know
>> if there's a convention to use a bool in such scenarios.
> The preferred style for KVM is to return a bool for helpers that are obviously
> testing something, e.g. functions with names like "is_valid", "check_request", etc...
> But we also very strongly prefer not returning bools from functions that have
> side effects or can fail, i.e. don't use a bool to indicate success.
>
> KVM has a third, gross use case of 0/1, where 0 means "exit to userspace" and 1
> means "re-enter the guest".  Unfortunately, it's so ubiquitous that replacing it
> with a proper enum is all but guaranteed to introduce bugs, and the 0/1 behavior
> allows KVM to do things like "if (!some_function())".
>
> This helper falls into this last category of KVM's special 0/1 handling.  The
> reason I don't love the name is the "check" part, which also puts it into "this
> is a check helper".  But returning a bool would be even worse because the helper
> does more than just check the quota, it also fills in the exit reason.
Does kvm_vcpu_exit_on_dirty_quota_reached make sense? We will invert the
output: 1 will mean "exit to userspace" and 0 will mean "re-enter the guest".
This is verbose, but it's the best I could think of.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-04-01 13:49         ` Sean Christopherson
@ 2022-04-06 12:39           ` Shivam Kumar
  2022-04-06 12:44             ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-04-06 12:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 01/04/22 7:19 pm, Sean Christopherson wrote:
> On Thu, Mar 31, 2022, Sean Christopherson wrote:
>> Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() guards the call
>> to mark_page_dirty_in_slot() with kvm_slot_dirty_track_enabled(), which means it
>> won't honor the dirty quota unless dirty logging is enabled.  Probably not an issue
>> for the intended use case, but it'll result in wrong stats, and technically the
>> dirty quota can be enabled without dirty logging being enabled.
>>
>> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
>> index 4739b53c9734..df0349be388b 100644
>> --- a/arch/x86/kvm/mmu/spte.c
>> +++ b/arch/x86/kvm/mmu/spte.c
>> @@ -182,7 +182,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>>                    "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
>>                    get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
>>
>> -       if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
>> +       if (spte & PT_WRITABLE_MASK) {
>>                  /* Enforced by kvm_mmu_hugepage_adjust. */
>>                  WARN_ON(level > PG_LEVEL_4K);
>>                  mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
> This pseudopatch is buggy, the WARN_ON() will obviously fire.  Easiest thing would
> be to move the condition into the WARN_ON.
>
> 		WARN_ON(level > PG_LEVEL_4K && kvm_slot_dirty_track_enabled(slot));

Thank you. Will add this.

>
> That brings up another thing that's worth documenting: the dirty_count will be
> skewed based on the size of the pages accessed by each vCPU.  I still think having
> the stat always count will be useful though.
I thought it was obvious :)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota
  2022-04-06 12:39           ` Shivam Kumar
@ 2022-04-06 12:44             ` Shivam Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Shivam Kumar @ 2022-04-06 12:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 06/04/22 6:09 pm, Shivam Kumar wrote:
>
> On 01/04/22 7:19 pm, Sean Christopherson wrote:
>> On Thu, Mar 31, 2022, Sean Christopherson wrote:
>>> Oof, looking at sync_page(), that's a bug in patch 1.  make_spte() 
>>> guards the call
>>>                  WARN_ON(level > PG_LEVEL_4K);
>>> skewed based on the size of the pages accessed by each vCPU. I still 
>>> think having
>>> the stat always count will be useful though.
> I thought it was obvious :)
Hi Sean, I'm grateful for your feedback. I wish you could post some feedback
on the selftests too (third patch in this patchset).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling
  2022-03-06 22:08 ` [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
@ 2022-04-18  4:55   ` Shivam Kumar
  2022-04-18  4:59     ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-04-18  4:55 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 07/03/22 3:38 am, Shivam Kumar wrote:
> Add selftests for dirty quota throttling with an optional -q parameter
> to configure by what value the dirty quota should be incremented after
> each dirty quota exit. With very small intervals, a smaller dirty quota
> value can ensure that the dirty quota exit code is tested. A zero value
> disables dirty quota throttling, so that dirty logging without dirty
> quota throttling can be tested.
>
> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>   tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
>   .../selftests/kvm/include/kvm_util_base.h     |  4 ++
>   tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
>   3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
> index 3fcd89e195c7..e75d826e21fb 100644
> --- a/tools/testing/selftests/kvm/dirty_log_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
> @@ -65,6 +65,8 @@
>   
>   #define SIG_IPI SIGUSR1
>   
> +#define TEST_DIRTY_QUOTA_INCREMENT		8
> +
>   /*
>    * Guest/Host shared variables. Ensure addr_gva2hva() and/or
>    * sync_global_to/from_guest() are used when accessing from
> @@ -191,6 +193,7 @@ static enum log_mode_t host_log_mode_option = LOG_MODE_ALL;
>   static enum log_mode_t host_log_mode;
>   static pthread_t vcpu_thread;
>   static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
> +static uint64_t test_dirty_quota_increment = TEST_DIRTY_QUOTA_INCREMENT;
>   
>   static void vcpu_kick(void)
>   {
> @@ -210,6 +213,13 @@ static void sem_wait_until(sem_t *sem)
>   	while (ret == -1 && errno == EINTR);
>   }
>   
> +static void set_dirty_quota(struct kvm_vm *vm, uint64_t dirty_quota)
> +{
> +	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> +
> +	vcpu_set_dirty_quota(run, dirty_quota);
> +}
> +
>   static bool clear_log_supported(void)
>   {
>   	return kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
> @@ -260,9 +270,13 @@ static void default_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
>   	TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
>   		    "vcpu run failed: errno=%d", err);
>   
> -	TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> -		    "Invalid guest sync status: exit_reason=%s\n",
> -		    exit_reason_str(run->exit_reason));
> +	if (test_dirty_quota_increment &&
> +		run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED)
> +		vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
> +	else
> +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> +			"Invalid guest sync status: exit_reason=%s\n",
> +			exit_reason_str(run->exit_reason));
>   
>   	vcpu_handle_sync_stop();
>   }
> @@ -377,6 +391,9 @@ static void dirty_ring_after_vcpu_run(struct kvm_vm *vm, int ret, int err)
>   	if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
>   		/* We should allow this to continue */
>   		;
> +	} else if (test_dirty_quota_increment &&
> +		run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
> +		vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
>   	} else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
>   		   (ret == -1 && err == EINTR)) {
>   		/* Update the flag first before pause */
> @@ -773,6 +790,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	sync_global_to_guest(vm, guest_test_virt_mem);
>   	sync_global_to_guest(vm, guest_num_pages);
>   
> +	/* Initialise dirty quota */
> +	if (test_dirty_quota_increment)
> +		set_dirty_quota(vm, test_dirty_quota_increment);
> +
>   	/* Start the iterations */
>   	iteration = 1;
>   	sync_global_to_guest(vm, iteration);
> @@ -814,6 +835,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>   	/* Tell the vcpu thread to quit */
>   	host_quit = true;
>   	log_mode_before_vcpu_join();
> +	/* Terminate dirty quota throttling */
> +	if (test_dirty_quota_increment)
> +		set_dirty_quota(vm, 0);
>   	pthread_join(vcpu_thread, NULL);
>   
>   	pr_info("Total bits checked: dirty (%"PRIu64"), clear (%"PRIu64"), "
> @@ -835,6 +859,8 @@ static void help(char *name)
>   	printf(" -c: specify dirty ring size, in number of entries\n");
>   	printf("     (only useful for dirty-ring test; default: %"PRIu32")\n",
>   	       TEST_DIRTY_RING_COUNT);
> +	printf(" -q: specify incremental dirty quota (default: %"PRIu32")\n",
> +	       TEST_DIRTY_QUOTA_INCREMENT);
>   	printf(" -i: specify iteration counts (default: %"PRIu64")\n",
>   	       TEST_HOST_LOOP_N);
>   	printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
> @@ -863,11 +889,14 @@ int main(int argc, char *argv[])
>   
>   	guest_modes_append_default();
>   
> -	while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
> +	while ((opt = getopt(argc, argv, "c:q:hi:I:p:m:M:")) != -1) {
>   		switch (opt) {
>   		case 'c':
>   			test_dirty_ring_count = strtol(optarg, NULL, 10);
>   			break;
> +		case 'q':
> +			test_dirty_quota_increment = strtol(optarg, NULL, 10);
> +			break;
>   		case 'i':
>   			p.iterations = strtol(optarg, NULL, 10);
>   			break;
> diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
> index 4ed6aa049a91..b70732998329 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util_base.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
> @@ -395,4 +395,8 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
>   
>   uint32_t guest_get_vcpuid(void);
>   
> +void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota);
> +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
> +			uint64_t test_dirty_quota_increment);
> +
>   #endif /* SELFTEST_KVM_UTIL_BASE_H */
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index d8cf851ab119..fa77558d745e 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -19,6 +19,7 @@
>   #include <linux/kernel.h>
>   
>   #define KVM_UTIL_MIN_PFN	2
> +#define PML_BUFFER_SIZE	512
>   
>   static int vcpu_mmap_sz(void);
>   
> @@ -2286,6 +2287,7 @@ static struct exit_reason {
>   	{KVM_EXIT_X86_RDMSR, "RDMSR"},
>   	{KVM_EXIT_X86_WRMSR, "WRMSR"},
>   	{KVM_EXIT_XEN, "XEN"},
> +	{KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, "DIRTY_QUOTA_EXHAUSTED"},
>   #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
>   	{KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
>   #endif
> @@ -2517,3 +2519,37 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid)
>   
>   	return ioctl(vcpu->fd, KVM_GET_STATS_FD, NULL);
>   }
> +
> +void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota)
> +{
> +	run->dirty_quota = dirty_quota;
> +
> +	if (dirty_quota)
> +		pr_info("Dirty quota throttling enabled with initial quota %"PRIu64"\n",
> +			dirty_quota);
> +	else
> +		pr_info("Dirty quota throttling disabled\n");
> +}
> +
> +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
> +			uint64_t test_dirty_quota_increment)
> +{
> +	uint64_t quota = run->dirty_quota_exit.quota;
> +	uint64_t count = run->dirty_quota_exit.count;
> +
> +	/*
> +	 * Due to PML, number of pages dirtied by the vcpu can exceed its dirty
> +	 * quota by PML buffer size.
> +	 */
> +	TEST_ASSERT(count <= quota + PML_BUFFER_SIZE,
> +		"Invalid number of pages dirtied: count=%"PRIu64", quota=%"PRIu64"\n",
> +		count, quota);
> +
> +	TEST_ASSERT(count >= quota,
> +		"Dirty quota exit happened with quota yet to be exhausted: count=%"PRIu64", quota=%"PRIu64"\n",
> +		count, quota);
> +
> +	if (count > quota)
> +		pr_info("Dirty quota exit with unequal quota and count: count=%"PRIu64", quota=%"PRIu64"\n",
> +			count, quota);
> +
> +	run->dirty_quota = count + test_dirty_quota_increment;
> +}
I'll be grateful if I could get some reviews on this patch. Will help me 
move forward. Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling
  2022-04-18  4:55   ` Shivam Kumar
@ 2022-04-18  4:59     ` Shivam Kumar
  2022-04-18 16:17       ` Sean Christopherson
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-04-18  4:59 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 18/04/22 10:25 am, Shivam Kumar wrote:
>
> On 07/03/22 3:38 am, Shivam Kumar wrote:
>> Add selftests for dirty quota throttling with an optional -q parameter
>> to configure by what value the dirty quota should be incremented after
>> each dirty quota exit. With very small intervals, a smaller dirty quota
>> value can ensure that the dirty quota exit code is tested. A zero value
>> disables dirty quota throttling, so that dirty logging without dirty
>> quota throttling can be tested.
>>
>> Suggested-by: Shaju Abraham <shaju.abraham@nutanix.com>
>> Suggested-by: Manish Mishra <manish.mishra@nutanix.com>
>> Co-developed-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Anurag Madnawat <anurag.madnawat@nutanix.com>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>> ---
>>   tools/testing/selftests/kvm/dirty_log_test.c  | 37 +++++++++++++++++--
>>   .../selftests/kvm/include/kvm_util_base.h     |  4 ++
>>   tools/testing/selftests/kvm/lib/kvm_util.c    | 36 ++++++++++++++++++
>>   3 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
>> b/tools/testing/selftests/kvm/dirty_log_test.c
>> index 3fcd89e195c7..e75d826e21fb 100644
>> --- a/tools/testing/selftests/kvm/dirty_log_test.c
>> +++ b/tools/testing/selftests/kvm/dirty_log_test.c
>> @@ -65,6 +65,8 @@
>>     #define SIG_IPI SIGUSR1
>>   +#define TEST_DIRTY_QUOTA_INCREMENT        8
>> +
>>   /*
>>    * Guest/Host shared variables. Ensure addr_gva2hva() and/or
>>    * sync_global_to/from_guest() are used when accessing from
>> @@ -191,6 +193,7 @@ static enum log_mode_t host_log_mode_option = 
>> LOG_MODE_ALL;
>>   static enum log_mode_t host_log_mode;
>>   static pthread_t vcpu_thread;
>>   static uint32_t test_dirty_ring_count = TEST_DIRTY_RING_COUNT;
>> +static uint64_t test_dirty_quota_increment = 
>> TEST_DIRTY_QUOTA_INCREMENT;
>>     static void vcpu_kick(void)
>>   {
>> @@ -210,6 +213,13 @@ static void sem_wait_until(sem_t *sem)
>>       while (ret == -1 && errno == EINTR);
>>   }
>>   +static void set_dirty_quota(struct kvm_vm *vm, uint64_t dirty_quota)
>> +{
>> +    struct kvm_run *run = vcpu_state(vm, VCPU_ID);
>> +
>> +    vcpu_set_dirty_quota(run, dirty_quota);
>> +}
>> +
>>   static bool clear_log_supported(void)
>>   {
>>       return kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
>> @@ -260,9 +270,13 @@ static void default_after_vcpu_run(struct kvm_vm 
>> *vm, int ret, int err)
>>       TEST_ASSERT(ret == 0 || (ret == -1 && err == EINTR),
>>               "vcpu run failed: errno=%d", err);
>>   -    TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> -            "Invalid guest sync status: exit_reason=%s\n",
>> -            exit_reason_str(run->exit_reason));
>> +    if (test_dirty_quota_increment &&
>> +        run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED)
>> +        vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
>> +    else
>> +        TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> +            "Invalid guest sync status: exit_reason=%s\n",
>> +            exit_reason_str(run->exit_reason));
>>         vcpu_handle_sync_stop();
>>   }
>> @@ -377,6 +391,9 @@ static void dirty_ring_after_vcpu_run(struct 
>> kvm_vm *vm, int ret, int err)
>>       if (get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC) {
>>           /* We should allow this to continue */
>>           ;
>> +    } else if (test_dirty_quota_increment &&
>> +        run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
>> +        vcpu_handle_dirty_quota_exit(run, test_dirty_quota_increment);
>>       } else if (run->exit_reason == KVM_EXIT_DIRTY_RING_FULL ||
>>              (ret == -1 && err == EINTR)) {
>>           /* Update the flag first before pause */
>> @@ -773,6 +790,10 @@ static void run_test(enum vm_guest_mode mode, 
>> void *arg)
>>       sync_global_to_guest(vm, guest_test_virt_mem);
>>       sync_global_to_guest(vm, guest_num_pages);
>>   +    /* Initialise dirty quota */
>> +    if (test_dirty_quota_increment)
>> +        set_dirty_quota(vm, test_dirty_quota_increment);
>> +
>>       /* Start the iterations */
>>       iteration = 1;
>>       sync_global_to_guest(vm, iteration);
>> @@ -814,6 +835,9 @@ static void run_test(enum vm_guest_mode mode, 
>> void *arg)
>>       /* Tell the vcpu thread to quit */
>>       host_quit = true;
>>       log_mode_before_vcpu_join();
>> +    /* Terminate dirty quota throttling */
>> +    if (test_dirty_quota_increment)
>> +        set_dirty_quota(vm, 0);
>>       pthread_join(vcpu_thread, NULL);
>>         pr_info("Total bits checked: dirty (%"PRIu64"), clear 
>> (%"PRIu64"), "
>> @@ -835,6 +859,8 @@ static void help(char *name)
>>       printf(" -c: specify dirty ring size, in number of entries\n");
>>       printf("     (only useful for dirty-ring test; default: 
>> %"PRIu32")\n",
>>              TEST_DIRTY_RING_COUNT);
>> +    printf(" -q: specify incremental dirty quota (default: %"PRIu32")\n",
>> +           TEST_DIRTY_QUOTA_INCREMENT);
>>       printf(" -i: specify iteration counts (default: %"PRIu64")\n",
>>              TEST_HOST_LOOP_N);
>>       printf(" -I: specify interval in ms (default: %"PRIu64" ms)\n",
>> @@ -863,11 +889,14 @@ int main(int argc, char *argv[])
>>         guest_modes_append_default();
>>   -    while ((opt = getopt(argc, argv, "c:hi:I:p:m:M:")) != -1) {
>> +    while ((opt = getopt(argc, argv, "c:q:hi:I:p:m:M:")) != -1) {
>>           switch (opt) {
>>           case 'c':
>>               test_dirty_ring_count = strtol(optarg, NULL, 10);
>>               break;
>> +        case 'q':
>> +            test_dirty_quota_increment = strtol(optarg, NULL, 10);
>> +            break;
>>           case 'i':
>>               p.iterations = strtol(optarg, NULL, 10);
>>               break;
>> diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
>> b/tools/testing/selftests/kvm/include/kvm_util_base.h
>> index 4ed6aa049a91..b70732998329 100644
>> --- a/tools/testing/selftests/kvm/include/kvm_util_base.h
>> +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
>> @@ -395,4 +395,8 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t 
>> vcpuid);
>>     uint32_t guest_get_vcpuid(void);
>>   +void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota);
>> +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
>> +            uint64_t test_dirty_quota_increment);
>> +
>>   #endif /* SELFTEST_KVM_UTIL_BASE_H */
>> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
>> b/tools/testing/selftests/kvm/lib/kvm_util.c
>> index d8cf851ab119..fa77558d745e 100644
>> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
>> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/kernel.h>
>>     #define KVM_UTIL_MIN_PFN    2
>> +#define PML_BUFFER_SIZE    512
>>     static int vcpu_mmap_sz(void);
>>   @@ -2286,6 +2287,7 @@ static struct exit_reason {
>>       {KVM_EXIT_X86_RDMSR, "RDMSR"},
>>       {KVM_EXIT_X86_WRMSR, "WRMSR"},
>>       {KVM_EXIT_XEN, "XEN"},
>> +    {KVM_EXIT_DIRTY_QUOTA_EXHAUSTED, "DIRTY_QUOTA_EXHAUSTED"},
>>   #ifdef KVM_EXIT_MEMORY_NOT_PRESENT
>>       {KVM_EXIT_MEMORY_NOT_PRESENT, "MEMORY_NOT_PRESENT"},
>>   #endif
>> @@ -2517,3 +2519,37 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, 
>> uint32_t vcpuid)
>>         return ioctl(vcpu->fd, KVM_GET_STATS_FD, NULL);
>>   }
>> +
>> +void vcpu_set_dirty_quota(struct kvm_run *run, uint64_t dirty_quota)
>> +{
>> +    run->dirty_quota = dirty_quota;
>> +
>> +    if (dirty_quota)
>> +        pr_info("Dirty quota throttling enabled with initial quota 
>> %"PRIu64"\n",
>> +            dirty_quota);
>> +    else
>> +        pr_info("Dirty quota throttling disabled\n");
>> +}
>> +
>> +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
>> +            uint64_t test_dirty_quota_increment)
>> +{
>> +    uint64_t quota = run->dirty_quota_exit.quota;
>> +    uint64_t count = run->dirty_quota_exit.count;
>> +
>> +    /*
>> +     * Due to PML, the number of pages dirtied by the vcpu can exceed
>> +     * its dirty quota by the PML buffer size.
>> +     */
>> +    TEST_ASSERT(count <= quota + PML_BUFFER_SIZE,
>> +        "Invalid number of pages dirtied: count=%"PRIu64", quota=%"PRIu64"\n",
>> +        count, quota);
Sean, I don't think this would be valid anymore because, as you mentioned,
the vcpu can dirty multiple pages in one vmexit. I could use your help here.
>> +
>> +    TEST_ASSERT(count >= quota,
>> +        "Dirty quota exit happened with quota yet to be exhausted: count=%"PRIu64", quota=%"PRIu64"\n",
>> +        count, quota);
>> +
>> +    if (count > quota)
>> +        pr_info("Dirty quota exit with unequal quota and count: count=%"PRIu64", quota=%"PRIu64"\n",
>> +            count, quota);
>> +
>> +    run->dirty_quota = count + test_dirty_quota_increment;
>> +}
> I'll be grateful if I could get some reviews on this patch. Will help 
> me move forward. Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling
  2022-04-18  4:59     ` Shivam Kumar
@ 2022-04-18 16:17       ` Sean Christopherson
  2022-04-28  7:00         ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2022-04-18 16:17 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Mon, Apr 18, 2022, Shivam Kumar wrote:
> > > +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
> > > +            uint64_t test_dirty_quota_increment)
> > > +{
> > > +    uint64_t quota = run->dirty_quota_exit.quota;
> > > +    uint64_t count = run->dirty_quota_exit.count;
> > > +
> > > +    /*
> > > +     * Due to PML, number of pages dirtied by the vcpu can exceed its dirty
> > > +     * quota by PML buffer size.
> > > +     */
> > > +    TEST_ASSERT(count <= quota + PML_BUFFER_SIZE, "Invalid number of pages
> > > +        dirtied: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
> Sean, I don't think this would be valid anymore because as you mentioned, the
> vcpu can dirty multiple pages in one vmexit. I could use your help here.

TL;DR: Should be fine, but s390 likely needs an exception.

Practically speaking the 512 entry fuzziness is all but guaranteed to prevent
false failures.

But, unconditionally allowing for overflow of 512 entries also means the test is
unlikely to ever detect violations.  So to provide meaningful coverage, this needs
to allow overflow if and only if PML is enabled.

And that brings us back to false failures due to _legitimate_ scenarios where a vCPU
can dirty multiple pages.  Emphasis on legitimate, because except for an s390 edge
case, I don't think this test's guest code does anything that would dirty multiple
pages in a single instruction, e.g. there's no emulation, shouldn't be any descriptor
table side effects, etc...  So unless I'm missing something, KVM should be able to
precisely handle the core run loop.

s390 does appear to have a caveat:

	/*
	 * On s390x, all pages of a 1M segment are initially marked as dirty
	 * when a page of the segment is written to for the very first time.
	 * To compensate this specialty in this test, we need to touch all
	 * pages during the first iteration.
	 */
	for (i = 0; i < guest_num_pages; i++) {
		addr = guest_test_virt_mem + i * guest_page_size;
		*(uint64_t *)addr = READ_ONCE(iteration);
	}

IIUC, subsequent iterations will be ok, but the first iteration needs to allow
for overflow of 256 (AFAIK the test only uses 4kb pages on s390).
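One way to encode that policy in the selftest is to compute the allowed overflow from the architecture state instead of hard-coding it, so the assertion stays tight when PML is off. A rough sketch, not the posted patch; `pml_enabled` and `s390_first_iteration` are hypothetical flags the test would have to plumb through:

```c
#include <assert.h>
#include <stdint.h>

#define PML_BUFFER_SIZE		512	/* x86 PML buffer entries */
#define S390_SEGMENT_PAGES	256	/* 1M segment / 4KB pages on s390 */

/*
 * Upper bound on pages_dirtied at a dirty quota exit: exact unless PML
 * batches dirty pages (x86), or unless this is the first iteration on
 * s390, where the first write to a 1M segment marks all of its pages
 * dirty at once.
 */
static uint64_t max_dirty_count(uint64_t quota, int pml_enabled,
				int s390_first_iteration)
{
	uint64_t slack = 0;

	if (pml_enabled)
		slack += PML_BUFFER_SIZE;
	if (s390_first_iteration)
		slack += S390_SEGMENT_PAGES;

	return quota + slack;
}
```

The `TEST_ASSERT(count <= quota + PML_BUFFER_SIZE, ...)` check would then become `TEST_ASSERT(count <= max_dirty_count(quota, ...), ...)`, which still detects quota violations on configurations without PML.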

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling
  2022-04-18 16:17       ` Sean Christopherson
@ 2022-04-28  7:00         ` Shivam Kumar
  2022-04-28 23:59           ` Sean Christopherson
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-04-28  7:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 18/04/22 9:47 pm, Sean Christopherson wrote:
> On Mon, Apr 18, 2022, Shivam Kumar wrote:
>>>> +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
>>>> +            uint64_t test_dirty_quota_increment)
>>>> +{
>>>> +    uint64_t quota = run->dirty_quota_exit.quota;
>>>> +    uint64_t count = run->dirty_quota_exit.count;
>>>> +
>>>> +    /*
>>>> +     * Due to PML, number of pages dirtied by the vcpu can exceed its dirty
>>>> +     * quota by PML buffer size.
>>>> +     */
>>>> +    TEST_ASSERT(count <= quota + PML_BUFFER_SIZE, "Invalid number of pages
>>>> +        dirtied: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
>> Sean, I don't think this would be valid anymore because as you mentioned, the
>> vcpu can dirty multiple pages in one vmexit. I could use your help here.
> TL;DR: Should be fine, but s390 likely needs an exception.
>
> Practically speaking the 512 entry fuzziness is all but guaranteed to prevent
> false failures.
>
> But, unconditionally allowing for overflow of 512 entries also means the test is
> unlikely to ever detect violations.  So to provide meaningful coverage, this needs
> to allow overflow if and only if PML is enabled.
>
> And that brings us back to false failures due to _legitimate_ scenarios where a vCPU
> can dirty multiple pages.  Emphasis on legitimate, because except for an s390 edge
> case, I don't think this test's guest code does anything that would dirty multiple
> pages in a single instruction, e.g. there's no emulation, shouldn't be any descriptor
> table side effects, etc...  So unless I'm missing something, KVM should be able to
> precisely handle the core run loop.
>
> s390 does appear to have a caveat:
>
> 	/*
> 	 * On s390x, all pages of a 1M segment are initially marked as dirty
> 	 * when a page of the segment is written to for the very first time.
> 	 * To compensate this specialty in this test, we need to touch all
> 	 * pages during the first iteration.
> 	 */
> 	for (i = 0; i < guest_num_pages; i++) {
> 		addr = guest_test_virt_mem + i * guest_page_size;
> 		*(uint64_t *)addr = READ_ONCE(iteration);
> 	}
>
> IIUC, subsequent iterations will be ok, but the first iteration needs to allow
> for overflow of 256 (AFAIK the test only uses 4kb pages on s390).
Hi Sean, I need some advice before sending v4. In my opinion, I should 
organise my patchset in a way that the first n-1 patches have changes 
for x86 and the last patch has the changes for s390 and arm64. This can 
help me move forward with the x86 arch and get help and reviews from 
s390 and arm64 maintainers in parallel. Please let me know if this 
makes sense.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling
  2022-04-28  7:00         ` Shivam Kumar
@ 2022-04-28 23:59           ` Sean Christopherson
  0 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2022-04-28 23:59 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: pbonzini, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Thu, Apr 28, 2022, Shivam Kumar wrote:
> 
> On 18/04/22 9:47 pm, Sean Christopherson wrote:
> > On Mon, Apr 18, 2022, Shivam Kumar wrote:
> > > > > +void vcpu_handle_dirty_quota_exit(struct kvm_run *run,
> > > > > +            uint64_t test_dirty_quota_increment)
> > > > > +{
> > > > > +    uint64_t quota = run->dirty_quota_exit.quota;
> > > > > +    uint64_t count = run->dirty_quota_exit.count;
> > > > > +
> > > > > +    /*
> > > > > +     * Due to PML, number of pages dirtied by the vcpu can exceed its dirty
> > > > > +     * quota by PML buffer size.
> > > > > +     */
> > > > > +    TEST_ASSERT(count <= quota + PML_BUFFER_SIZE, "Invalid number of pages
> > > > > +        dirtied: count=%"PRIu64", quota=%"PRIu64"\n", count, quota);
> > > Sean, I don't think this would be valid anymore because as you mentioned, the
> > > vcpu can dirty multiple pages in one vmexit. I could use your help here.
> > TL;DR: Should be fine, but s390 likely needs an exception.
> > 
> > Practically speaking the 512 entry fuzziness is all but guaranteed to prevent
> > false failures.
> > 
> > But, unconditionally allowing for overflow of 512 entries also means the test is
> > unlikely to ever detect violations.  So to provide meaningful coverage, this needs
> > to allow overflow if and only if PML is enabled.
> > 
> > And that brings us back to false failures due to _legitimate_ scenarios where a vCPU
> > can dirty multiple pages.  Emphasis on legitimate, because except for an s390 edge
> > case, I don't think this test's guest code does anything that would dirty multiple
> > pages in a single instruction, e.g. there's no emulation, shouldn't be any descriptor
> > table side effects, etc...  So unless I'm missing something, KVM should be able to
> > precisely handle the core run loop.
> > 
> > s390 does appear to have a caveat:
> > 
> > 	/*
> > 	 * On s390x, all pages of a 1M segment are initially marked as dirty
> > 	 * when a page of the segment is written to for the very first time.
> > 	 * To compensate this specialty in this test, we need to touch all
> > 	 * pages during the first iteration.
> > 	 */
> > 	for (i = 0; i < guest_num_pages; i++) {
> > 		addr = guest_test_virt_mem + i * guest_page_size;
> > 		*(uint64_t *)addr = READ_ONCE(iteration);
> > 	}
> > 
> > IIUC, subsequent iterations will be ok, but the first iteration needs to allow
> > for overflow of 256 (AFAIK the test only uses 4kb pages on s390).
> Hi Sean, I need some advice before sending v4. In my opinion, I should
> organise my patchset in a way that the first n-1 patches have changes for
> x86 and the last patch has the changes for s390 and arm64. This can help
> me move forward with the x86 arch and get help and reviews from s390 and
> arm64 maintainers in parallel. Please let me know if this makes sense.

Works for me.  It probably makes sense to split s390 and arm64 too, that way you
don't need a v5 if one wants the feature and the other does not.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-03-06 22:08 ` [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
  2022-03-31  0:28   ` Sean Christopherson
@ 2022-05-02 22:14   ` Peter Xu
  2022-05-03  7:22     ` Shivam Kumar
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Xu @ 2022-05-02 22:14 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

Hi, Shivam,

On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> +{
> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> +	struct kvm_run *run = vcpu->run;
> +
> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> +		return 1;
> +
> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> +	run->dirty_quota_exit.count = pages_dirtied;
> +	run->dirty_quota_exit.quota = dirty_quota;

Pure question: why this needs to be returned to userspace?  Is this value
set from userspace?

> +	return 0;
> +}

The other high level question is whether you have considered using the ring
full event to achieve similar goal?

Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
gets full.  I think there's a problem that the ring size can not be
randomly set but must be a power of 2.  Also, there is a maximum size of
ring allowed at least.

However since the ring size can be fairly small (e.g. 4096 entries) it can
still achieve some kind of accuracy.  For example, the userspace can
quickly kick the vcpu back to VM_RUN only until it sees that it reaches
some quota (and actually that's how dirty-limit is implemented on QEMU,
contributed by China Telecom):

https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
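For illustration, the kick-back loop described above reduces to a per-vcpu counter in userspace. This is a sketch with made-up names; a real implementation would also harvest the ring entries and reset the counter at each interval boundary:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 4096	/* constant per-vcpu dirty ring size, power of 2 */

/*
 * Each KVM_EXIT_DIRTY_RING_FULL means roughly RING_SIZE more pages were
 * dirtied.  Userspace collects the ring, bumps its counter, and only
 * re-enters KVM_RUN while the per-interval quota is not yet consumed,
 * so the throttling granule is the ring size.
 */
static int may_resume_after_ring_full(uint64_t *dirtied_this_interval,
				      uint64_t interval_quota)
{
	*dirtied_this_interval += RING_SIZE;
	return *dirtied_this_interval < interval_quota;
}
```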

Is there perhaps some explicit reason that dirty ring cannot be used?

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-02 22:14   ` Peter Xu
@ 2022-05-03  7:22     ` Shivam Kumar
  2022-05-03 13:43       ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-05-03  7:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: pbonzini, seanjc, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 03/05/22 3:44 am, Peter Xu wrote:
> Hi, Shivam,
>
> On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>> +{
>> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>> +	struct kvm_run *run = vcpu->run;
>> +
>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>> +		return 1;
>> +
>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>> +	run->dirty_quota_exit.count = pages_dirtied;
>> +	run->dirty_quota_exit.quota = dirty_quota;
> Pure question: why this needs to be returned to userspace?  Is this value
> set from userspace?
>
1) The quota needs to be replenished once exhausted.
2) The vcpu should be made to sleep if it has consumed its quota pretty 
quick.

Both these actions are performed on the userspace side, where we expect 
a thread calculating the quota at very small regular intervals based on 
network bandwidth information. This can enable us to micro-stun the vcpus 
(steal their runtime just the moment they were dirtying heavily).

We have implemented a "common quota" approach, i.e. transferring any 
unused quota to a common pool so that it can be consumed by any vcpu in 
the next interval on FCFS basis.

It seemed fit to implement all this logic on the userspace side and just 
keep the "dirty count" and the "logic to exit to userspace whenever the 
vcpu has consumed its quota" on the kernel side. The count is required 
on the userspace side because there are cases where a vcpu can actually 
dirty more than its quota (e.g. if PML is enabled). Hence, this 
information can be useful on the userspace side and can be used to 
re-adjust the next quotas.
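Purely as an illustration of the userspace side described above (all names here are made up, and the real VMM logic also decides how long to stun the vcpu before re-entering KVM_RUN):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-interval state kept by the VMM's rate thread. */
struct quota_pool {
	uint64_t per_vcpu_quota;	/* fresh quota granted each interval */
	uint64_t common_pool;		/* unused quota donated by other vcpus */
};

/*
 * On KVM_EXIT_DIRTY_QUOTA_EXHAUSTED: compute the new absolute quota to
 * write back into run->dirty_quota, draining the common pool first
 * (FCFS), and report how far the vcpu overshot its old quota (e.g.
 * because PML batches dirty pages).
 */
static uint64_t replenish_quota(struct quota_pool *pool, uint64_t count,
				uint64_t quota, uint64_t *overrun)
{
	uint64_t grant = pool->per_vcpu_quota;

	*overrun = count > quota ? count - quota : 0;

	grant += pool->common_pool;
	pool->common_pool = 0;

	/* dirty_quota is cumulative: pages already dirtied plus the grant. */
	return count + grant;
}
```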

Thank you for the question. Please let me know if you have further concerns.

>> +	return 0;
>> +}
> The other high level question is whether you have considered using the ring
> full event to achieve similar goal?
>
> Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
> gets full.  I think there's a problem that the ring size can not be
> randomly set but must be a power of 2.  Also, there is a maximum size of
> ring allowed at least.
>
> However since the ring size can be fairly small (e.g. 4096 entries) it can
> still achieve some kind of accuracy.  For example, the userspace can
> quickly kick the vcpu back to VM_RUN only until it sees that it reaches
> some quota (and actually that's how dirty-limit is implemented on QEMU,
> contributed by China Telecom):
>
> https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
>
> Is there perhaps some explicit reason that dirty ring cannot be used?
>
> Thanks!
When we started this series, AFAIK it was not possible to set the dirty 
ring size once the vcpus are created. So, we couldn't dynamically set 
dirty ring size. Also, since we are going for micro-stunning and the 
allowed dirties in such small intervals can be pretty low, it can cause 
issues if we can only use a dirty quota which is a power of 2. For 
instance, if the dirty quota is to be set to 9, we can only set it to 16 
(if we round up) and if dirty quota is to be set to 15 we can only set 
it to 8 (if we round down). I hope you'd agree that this can make a huge 
difference.

Also, this approach works for both dirty bitmap and dirty ring interface 
which can help in extending this solution to other architectures.

I'm very grateful for the questions. Looking forward to more feedback. 
Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-03  7:22     ` Shivam Kumar
@ 2022-05-03 13:43       ` Peter Xu
  2022-05-04  6:33         ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2022-05-03 13:43 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Tue, May 03, 2022 at 12:52:26PM +0530, Shivam Kumar wrote:
> 
> On 03/05/22 3:44 am, Peter Xu wrote:
> > Hi, Shivam,
> > 
> > On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
> > > +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> > > +{
> > > +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> > > +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> > > +	struct kvm_run *run = vcpu->run;
> > > +
> > > +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> > > +		return 1;
> > > +
> > > +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> > > +	run->dirty_quota_exit.count = pages_dirtied;
> > > +	run->dirty_quota_exit.quota = dirty_quota;
> > Pure question: why this needs to be returned to userspace?  Is this value
> > set from userspace?
> > 
> 1) The quota needs to be replenished once exhausted.
> 2) The vcpu should be made to sleep if it has consumed its quota pretty
> quick.
> 
> Both these actions are performed on the userspace side, where we expect a
> thread calculating the quota at very small regular intervals based on
> > > network bandwidth information. This can enable us to micro-stun the vcpus
> (steal their runtime just the moment they were dirtying heavily).
> 
> > > We have implemented a "common quota" approach, i.e. transferring any unused
> quota to a common pool so that it can be consumed by any vcpu in the next
> interval on FCFS basis.
> 
> It seemed fit to implement all this logic on the userspace side and just
> keep the "dirty count" and the "logic to exit to userspace whenever the vcpu
> has consumed its quota" on the kernel side. The count is required on the
> userspace side because there are cases where a vcpu can actually dirty more
> than its quota (e.g. if PML is enabled). Hence, this information can be
> useful on the userspace side and can be used to re-adjust the next quotas.

I agree this information is useful.  Though my question was that if the
userspace should have a copy (per-vcpu) of that already then it's not
needed to pass it over to it anymore?

> 
> Thank you for the question. Please let me know if you have further concerns.
> 
> > > +	return 0;
> > > +}
> > The other high level question is whether you have considered using the ring
> > full event to achieve similar goal?
> > 
> > Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
> > gets full.  I think there's a problem that the ring size can not be
> > randomly set but must be a power of 2.  Also, there is a maximum size of
> > ring allowed at least.
> > 
> > However since the ring size can be fairly small (e.g. 4096 entries) it can
> > still achieve some kind of accuracy.  For example, the userspace can
> > quickly kick the vcpu back to VM_RUN only until it sees that it reaches
> > some quota (and actually that's how dirty-limit is implemented on QEMU,
> > contributed by China Telecom):
> > 
> > https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
> > 
> > Is there perhaps some explicit reason that dirty ring cannot be used?
> > 
> > Thanks!
> When we started this series, AFAIK it was not possible to set the dirty ring
> size once the vcpus are created. So, we couldn't dynamically set dirty ring
> size.

Agreed.  The ring size can only be set when startup and can't be changed.

> Also, since we are going for micro-stunning and the allowed dirties in
> such small intervals can be pretty low, it can cause issues if we can
> only use a dirty quota which is a power of 2. For instance, if the dirty
> quota is to be set to 9, we can only set it to 16 (if we round up) and if
> dirty quota is to be set to 15 we can only set it to 8 (if we round
> down). I hope you'd agree that this can make a huge difference.

Yes. As discussed above, I didn't expect the ring size to be the quota
per-se, so what I'm wondering is whether we can leverage a small and
constant sized ring to emulate the behavior of a quota with any size, but
with a minimum granule of the dirty ring size.

> 
> Also, this approach works for both dirty bitmap and dirty ring interface
> which can help in extending this solution to other architectures.

Is there any specific arch that you're interested outside x86?

Logically we can also think about extending dirty ring to other archs, but
there were indeed challenges where some pages can be dirtied without a vcpu
context, and that's why it was only supported initially on x86.

I think it should not be a problem for the quota solution, because it's
backed up by the dirty bitmap so no dirty page will be overlooked for
migration purpose, which is definitely a benefit.  But I'm still curious
whether you looked into any specific archs already (x86 doesn't have such
problem) so that maybe there's some quota you still want to apply elsewhere
where there's no vcpu context.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-03 13:43       ` Peter Xu
@ 2022-05-04  6:33         ` Shivam Kumar
  2022-05-04 17:26           ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: Shivam Kumar @ 2022-05-04  6:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: pbonzini, seanjc, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 03/05/22 7:13 pm, Peter Xu wrote:
> On Tue, May 03, 2022 at 12:52:26PM +0530, Shivam Kumar wrote:
>> On 03/05/22 3:44 am, Peter Xu wrote:
>>> Hi, Shivam,
>>>
>>> On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
>>>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>>>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>>>> +	struct kvm_run *run = vcpu->run;
>>>> +
>>>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>>>> +		return 1;
>>>> +
>>>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>>>> +	run->dirty_quota_exit.count = pages_dirtied;
>>>> +	run->dirty_quota_exit.quota = dirty_quota;
>>> Pure question: why this needs to be returned to userspace?  Is this value
>>> set from userspace?
>>>
>> 1) The quota needs to be replenished once exhausted.
>> 2) The vcpu should be made to sleep if it has consumed its quota pretty
>> quick.
>>
>> Both these actions are performed on the userspace side, where we expect a
>> thread calculating the quota at very small regular intervals based on
>> network bandwidth information. This can enable us to micro-stun the vcpus 
>> (steal their runtime just the moment they were dirtying heavily).
>>
>> We have implemented a "common quota" approach, i.e. transferring any unused 
>> quota to a common pool so that it can be consumed by any vcpu in the next
>> interval on FCFS basis.
>>
>> It seemed fit to implement all this logic on the userspace side and just
>> keep the "dirty count" and the "logic to exit to userspace whenever the vcpu
>> has consumed its quota" on the kernel side. The count is required on the
>> userspace side because there are cases where a vcpu can actually dirty more
>> than its quota (e.g. if PML is enabled). Hence, this information can be
>> useful on the userspace side and can be used to re-adjust the next quotas.
> I agree this information is useful.  Though my question was that if the
> userspace should have a copy (per-vcpu) of that already then it's not
> needed to pass it over to it anymore?
This is how we started but then based on the feedback from Sean, we 
moved 'pages_dirtied' to vcpu stats as it can be a useful stat. The 
'dirty_quota' variable is already shared with userspace as it is in the 
vcpu run struct and hence the quota can be modified by userspace on the 
go. So, it made sense to pass both the variables at the time of exit 
(the vcpu might be exiting with an old copy of dirty quota, which the 
userspace needs to know).

Thanks.
>> Thank you for the question. Please let me know if you have further concerns.
>>
>>>> +	return 0;
>>>> +}
>>> The other high level question is whether you have considered using the ring
>>> full event to achieve similar goal?
>>>
>>> Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
>>> gets full.  I think there's a problem that the ring size can not be
>>> randomly set but must be a power of 2.  Also, there is a maximum size of
>>> ring allowed at least.
>>>
>>> However since the ring size can be fairly small (e.g. 4096 entries) it can
>>> still achieve some kind of accuracy.  For example, the userspace can
>>> quickly kick the vcpu back to VM_RUN only until it sees that it reaches
>>> some quota (and actually that's how dirty-limit is implemented on QEMU,
>>> contributed by China Telecom):
>>>
>>> https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
>>>
>>> Is there perhaps some explicit reason that dirty ring cannot be used?
>>>
>>> Thanks!
>> When we started this series, AFAIK it was not possible to set the dirty ring
>> size once the vcpus are created. So, we couldn't dynamically set dirty ring
>> size.
> Agreed.  The ring size can only be set when startup and can't be changed.
>
>> Also, since we are going for micro-stunning and the allowed dirties in
>> such small intervals can be pretty low, it can cause issues if we can
>> only use a dirty quota which is a power of 2. For instance, if the dirty
>> quota is to be set to 9, we can only set it to 16 (if we round up) and if
>> dirty quota is to be set to 15 we can only set it to 8 (if we round
>> down). I hope you'd agree that this can make a huge difference.
> Yes. As discussed above, I didn't expect the ring size to be the quota
> per-se, so what I'm wondering is whether we can leverage a small and
> constant sized ring to emulate the behavior of a quota with any size, but
> with a minimum granule of the dirty ring size.
This would be an interesting thing to try. I've already planned efforts 
to optimise it for dirty ring interface. Thank you for this suggestion.

Side question: Is there any plan to make it possible to dynamically 
update the dirty ring size?
>> Also, this approach works for both dirty bitmap and dirty ring interface
>> which can help in extending this solution to other architectures.
> Is there any specific arch that you're interested outside x86?
x86 is the first priority but this patchset targets s390 and arm as well.
>
> Logically we can also think about extending dirty ring to other archs, but
> there were indeed challenges where some pages can be dirtied without a vcpu
> context, and that's why it was only supported initially on x86.
This is an interesting problem and we are aware of it. We have a couple 
of ideas but they are very raw as of now.
>
> I think it should not be a problem for the quota solution, because it's
> backed up by the dirty bitmap so no dirty page will be overlooked for
> migration purpose, which is definitely a benefit.  But I'm still curious
> whether you looked into any specific archs already (x86 doesn't have such
> problem) so that maybe there's some quota you still want to apply elsewhere
> where there's no vcpu context.
Yes, this is kind of similar to one of the ideas we have thought of. 
Though, there are many things which need a lot of brainstorming, e.g. 
the ratio in which we can split the overall quota to accommodate 
dirties with no vcpu context.
> Thanks,
Thanks again for these invaluable comments, Peter.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-04  6:33         ` Shivam Kumar
@ 2022-05-04 17:26           ` Peter Xu
  2022-05-05 15:17             ` Shivam Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2022-05-04 17:26 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: pbonzini, seanjc, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat

On Wed, May 04, 2022 at 12:03:57PM +0530, Shivam Kumar wrote:
> 
> On 03/05/22 7:13 pm, Peter Xu wrote:
> > On Tue, May 03, 2022 at 12:52:26PM +0530, Shivam Kumar wrote:
> > > On 03/05/22 3:44 am, Peter Xu wrote:
> > > > Hi, Shivam,
> > > > 
> > > > On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
> > > > > +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> > > > > +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> > > > > +	struct kvm_run *run = vcpu->run;
> > > > > +
> > > > > +	if (!dirty_quota || (pages_dirtied < dirty_quota))
> > > > > +		return 1;
> > > > > +
> > > > > +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
> > > > > +	run->dirty_quota_exit.count = pages_dirtied;
> > > > > +	run->dirty_quota_exit.quota = dirty_quota;
> > > > Pure question: why this needs to be returned to userspace?  Is this value
> > > > set from userspace?
> > > > 
> > > 1) The quota needs to be replenished once exhausted.
> > > 2) The vcpu should be made to sleep if it has consumed its quota pretty
> > > quick.
> > > 
> > > Both these actions are performed on the userspace side, where we expect a
> > > thread calculating the quota at very small regular intervals based on
> > > network bandwidth information. This can enable us to micro-stun the vcpus
> > > (steal their runtime just the moment they were dirtying heavily).
> > > 
> > > We have implemented a "common quota" approach, i.e. transferring any unused
> > > quota to a common pool so that it can be consumed by any vcpu in the next
> > > interval on FCFS basis.
> > > 
> > > It seemed fit to implement all this logic on the userspace side and just
> > > keep the "dirty count" and the "logic to exit to userspace whenever the vcpu
> > > has consumed its quota" on the kernel side. The count is required on the
> > > userspace side because there are cases where a vcpu can actually dirty more
> > > than its quota (e.g. if PML is enabled). Hence, this information can be
> > > useful on the userspace side and can be used to re-adjust the next quotas.
> > I agree this information is useful.  Though my question was that if the
> > userspace should have a copy (per-vcpu) of that already then it's not
> > needed to pass it over to it anymore?
> This is how we started but then based on the feedback from Sean, we moved
> 'pages_dirtied' to vcpu stats as it can be a useful stat. The 'dirty_quota'
> variable is already shared with userspace as it is in the vcpu run struct
> and hence the quota can be modified by userspace on the go. So, it made
> sense to pass both the variables at the time of exit (the vcpu might be
> exiting with an old copy of dirty quota, which the userspace needs to know).

Correct.

My point was the userspace could simply cache the old quota too in the
userspace vcpu struct even if there's the real quota value in vcpu run.

No strong opinion, but normally if this info is optional to userspace I
think it's cleaner to not pass it over again to keep the kernel ABI simple.

> 
> Thanks.
> > > Thank you for the question. Please let me know if you have further concerns.
> > > 
> > > > > +	return 0;
> > > > > +}
> > > > The other high level question is whether you have considered using the ring
> > > > full event to achieve similar goal?
> > > > 
> > > > Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
> > > > gets full.  I think there's a problem that the ring size can not be
> > > > randomly set but must be a power of 2.  Also, there is a maximum size of
> > > > ring allowed at least.
> > > > 
> > > > However since the ring size can be fairly small (e.g. 4096 entries) it can
> > > > still achieve some kind of accuracy.  For example, the userspace can
> > > > quickly kick the vcpu back to VM_RUN only until it sees that it reaches
> > > > some quota (and actually that's how dirty-limit is implemented on QEMU,
> > > > contributed by China Telecom):
> > > > 
> > > > https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
> > > > 
> > > > Is there perhaps some explicit reason that dirty ring cannot be used?
> > > > 
> > > > Thanks!
> > > When we started this series, AFAIK it was not possible to set the dirty ring
> > > size once the vcpus are created. So, we couldn't dynamically set dirty ring
> > > size.
> > Agreed.  The ring size can only be set at startup and can't be changed.
> > 
> > > Also, since we are going for micro-stunning and the allowed dirties in
> > > such small intervals can be pretty low, it can cause issues if we can
> > > only use a dirty quota which is a power of 2. For instance, if the dirty
> > > quota is to be set to 9, we can only set it to 16 (if we round up) and if
> > > dirty quota is to be set to 15 we can only set it to 8 (if we round
> > > down). I hope you'd agree that this can make a huge difference.
> > Yes. As discussed above, I didn't expect the ring size to be the quota
> > per-se, so what I'm wondering is whether we can leverage a small and
> > constant sized ring to emulate the behavior of a quota with any size, but
> > with a minimum granule of the dirty ring size.
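The emulation suggested above (a small, constant, power-of-2 ring used to approximate a quota of any size) can be sketched as arithmetic: userspace keeps re-entering the vcpu on each ring-full exit until the harvested total first reaches the quota, so the overshoot is bounded by one ring. This is a simulation under assumed names, not real KVM calls.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed, power-of-2 ring size (entries = dirtied pages). */
#define RING_SIZE 1024u

/* Ring-full exits needed before throttling for a quota of any size. */
static uint64_t exits_needed(uint64_t quota)
{
	return (quota + RING_SIZE - 1) / RING_SIZE; /* round up */
}

/* Pages actually allowed before the throttle kicks in. */
static uint64_t pages_allowed(uint64_t quota)
{
	return exits_needed(quota) * RING_SIZE;
}
```

The granule is thus the ring size: the quota can be any value, but the enforcement error is at most RING_SIZE pages.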
> This would be an interesting thing to try. I've already planned efforts to
> optimise it for dirty ring interface. Thank you for this suggestion.
> 
> Side question: Is there any plan to make it possible to dynamically update
> the dirty ring size?

No plan that I'm aware of..

After I checked: kvm_dirty_ring_get_rsvd_entries() limits our current ring
size, which is hardware dependent on PML.  It seems at least 1024 will
still be a work-for-all case, but not sure how it'll work in reality since
then soft_limit of the dirty ring will be relatively small so it'll kick to
userspace more often.  Maybe that's not a huge problem for a throttle
scenario.  In that case the granule will be <=4MB if it'll work.
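The "<=4MB" granule above is just ring entries times page size; a 1024-entry ring of 4 KiB pages gives exactly 4 MiB. A trivial check of that arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Throttle granule in bytes for a given ring size and page size. */
static uint64_t granule_bytes(uint64_t ring_entries, uint64_t page_size)
{
	return ring_entries * page_size;
}
```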

> > > Also, this approach works for both dirty bitmap and dirty ring interface
> > > which can help in extending this solution to other architectures.
> > Is there any specific arch that you're interested outside x86?
> x86 is the first priority but this patchset targets s390 and arm as well.

I see.

> > 
> > Logically we can also think about extending dirty ring to other archs, but
> > there were indeed challenges where some pages can be dirtied without a vcpu
> > context, and that's why it was only supported initially on x86.
> This is an interesting problem and we are aware of it. We have a couple of
> ideas but they are very raw as of now.

I think a default (no-vcpu) ring will solve most of the issues, but that
requires some thought, e.g. the major difference between ring and bitmap
is that a ring can be full while a bitmap cannot. We need to think carefully
about that when it comes.

The other thing is IIRC s390 has been using dirty bits on the pgtables
(either radix or hash based) to trap dirty, so that'll be challenging too
when connected with a ring interface because it could make the whole thing
much less helpful..

So from that pov I think your solution sounds reasonable in that it
decouples the feature from the interface, and it also looks simple.

> > 
> > I think it should not be a problem for the quota solution, because it's
> > backed up by the dirty bitmap so no dirty page will be overlooked for
> > migration purpose, which is definitely a benefit.  But I'm still curious
> > whether you looked into any specific archs already (x86 doesn't have such
> > problem) so that maybe there's some quota you still want to apply elsewhere
> > where there's no vcpu context.
> Yes, this is kind of similar to one of the ideas we have thought. Though,
> there are many things which need a lot of brainstorming, e.g. the ratio in
> which we can split the overall quota to accommodate dirtying with no vcpu
> context.

I'm slightly worried it'll make things even more complicated.

Only once we're thinking seriously about non-x86 platforms (since again x86
doesn't have this issue afaict..): I think one thing we could do is to dig
out all these cases and think about whether they need any quota at all.

IOW, whether the no-vcpu dirty contexts are prone to causing VM live
migration convergence issues.  If the answer is no, IMHO a simpler approach
is to ignore those dirty pages for quota purposes.

I think that's a major benefit of your approach comparing to the full dirty
ring approach, because you're always backed by the dirty bitmap so there's
no way of data loss.  Dirty ring's one major challenge is how to make sure
all dirty PFNs get recorded and meanwhile we don't interrupt kvm workflow
in general.

One thing I'd really appreciate is if this solution would be accepted from
the kernel POV and if you plan to work on a qemu version of it, please
consider reading the work from Hyman Huang from China Telecom on the dirty
limit solution (which is currently based on dirty ring):

https://lore.kernel.org/qemu-devel/cover.1648748793.git.huangy81@chinatelecom.cn/

Since they'll be very similar approaches to solving the same problem.
Hopefully we could unify the work and not have two fully separate
approaches even if they should really share something in common.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus
  2022-05-04 17:26           ` Peter Xu
@ 2022-05-05 15:17             ` Shivam Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Shivam Kumar @ 2022-05-05 15:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: pbonzini, seanjc, kvm, Shaju Abraham, Manish Mishra, Anurag Madnawat


On 04/05/22 10:56 pm, Peter Xu wrote:
> On Wed, May 04, 2022 at 12:03:57PM +0530, Shivam Kumar wrote:
>> On 03/05/22 7:13 pm, Peter Xu wrote:
>>> On Tue, May 03, 2022 at 12:52:26PM +0530, Shivam Kumar wrote:
>>>> On 03/05/22 3:44 am, Peter Xu wrote:
>>>>> Hi, Shivam,
>>>>>
>>>>> On Sun, Mar 06, 2022 at 10:08:48PM +0000, Shivam Kumar wrote:
>>>>>> +static inline int kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>>>>>> +	u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
>>>>>> +	struct kvm_run *run = vcpu->run;
>>>>>> +
>>>>>> +	if (!dirty_quota || (pages_dirtied < dirty_quota))
>>>>>> +		return 1;
>>>>>> +
>>>>>> +	run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
>>>>>> +	run->dirty_quota_exit.count = pages_dirtied;
>>>>>> +	run->dirty_quota_exit.quota = dirty_quota;
>>>>> Pure question: why this needs to be returned to userspace?  Is this value
>>>>> set from userspace?
>>>>>
>>>> 1) The quota needs to be replenished once exhausted.
>>>> 2) The vcpu should be made to sleep if it has consumed its quota pretty
>>>> quickly.
>>>>
>>>> Both these actions are performed on the userspace side, where we expect a
>>>> thread calculating the quota at very small regular intervals based on
>>>> network bandwidth information. This can enable us to micro-stun the vcpus
>>>> (steal their runtime just the moment they were dirtying heavily).
>>>>
>>>> We have implemented a "common quota" approach, i.e. transferring any unused
>>>> quota to a common pool so that it can be consumed by any vcpu in the next
>>>> interval on FCFS basis.
>>>>
>>>> It seemed fit to implement all this logic on the userspace side and just
>>>> keep the "dirty count" and the "logic to exit to userspace whenever the vcpu
>>>> has consumed its quota" on the kernel side. The count is required on the
>>>> userspace side because there are cases where a vcpu can actually dirty more
>>>> than its quota (e.g. if PML is enabled). Hence, this information can be
>>>> useful on the userspace side and can be used to re-adjust the next quotas.
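The "common quota" scheme described above — unused quota donated to a shared pool, then claimed by any vcpu on a first-come-first-served basis — could be modeled as below. This is an illustrative sketch, not part of the posted patches; all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Shared pool of pages left unused at the end of an interval. */
struct quota_pool {
	uint64_t spare;
};

/* A vcpu that dirtied fewer pages than its quota donates the remainder. */
static void donate(struct quota_pool *p, uint64_t quota, uint64_t used)
{
	if (used < quota)
		p->spare += quota - used;
}

/* FCFS claim: a vcpu takes up to 'want' extra pages from the pool. */
static uint64_t claim(struct quota_pool *p, uint64_t want)
{
	uint64_t got = want < p->spare ? want : p->spare;

	p->spare -= got;
	return got;
}
```

In a real implementation the pool would need per-interval reset and locking across vcpu threads; that bookkeeping is omitted here.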
>>> I agree this information is useful.  Though my question was: if the
>>> userspace already has a (per-vcpu) copy of that, does it still need to
>>> be passed over?
>> This is how we started but then based on the feedback from Sean, we moved
>> 'pages_dirtied' to vcpu stats as it can be a useful stat. The 'dirty_quota'
>> variable is already shared with userspace as it is in the vcpu run struct
>> and hence the quota can be modified by userspace on the go. So, it made
>> sense to pass both the variables at the time of exit (the vcpu might be
>> exiting with an old copy of dirty quota, which the userspace needs to know).
> Correct.
>
> My point was the userspace could simply cache the old quota too in the
> userspace vcpu struct even if there's the real quota value in vcpu run.
>
> No strong opinion, but normally if this info is optional to userspace I
> think it's cleaner to not pass it over again to keep the kernel ABI simple.
Ack. Though this implementation path aimed at not reserving extra memory
for old dirty quota values and at keeping the solution robust to multiple
changes (multiple old versions) of the dirty quota during a dirty quota
exit, which is rare. There was no other strong reason.
>
> Thanks.
>>>> Thank you for the question. Please let me know if you have further concerns.
>>>>
>>>>>> +	return 0;
>>>>>> +}
>>>>> The other high level question is whether you have considered using the ring
>>>>> full event to achieve similar goal?
>>>>>
>>>>> Right now KVM_EXIT_DIRTY_RING_FULL event is generated when per-vcpu ring
>>>>> gets full.  I think there's a problem that the ring size can not be
>>>>> randomly set but must be a power of 2.  Also, there is a maximum size of
>>>>> ring allowed at least.
>>>>>
>>>>> However since the ring size can be fairly small (e.g. 4096 entries) it can
>>>>> still achieve some kind of accuracy.  For example, the userspace can
>>>>> quickly kick the vcpu back to VM_RUN only until it sees that it reaches
>>>>> some quota (and actually that's how dirty-limit is implemented on QEMU,
>>>>> contributed by China Telecom):
>>>>>
>>>>> https://lore.kernel.org/qemu-devel/cover.1646243252.git.huangy81@chinatelecom.cn/
>>>>>
>>>>> Is there perhaps some explicit reason that dirty ring cannot be used?
>>>>>
>>>>> Thanks!
>>>> When we started this series, AFAIK it was not possible to set the dirty ring
>>>> size once the vcpus are created. So, we couldn't dynamically set dirty ring
>>>> size.
>>> Agreed.  The ring size can only be set at startup and can't be changed.
>>>
>>>> Also, since we are going for micro-stunning and the allowed dirties in
>>>> such small intervals can be pretty low, it can cause issues if we can
>>>> only use a dirty quota which is a power of 2. For instance, if the dirty
>>>> quota is to be set to 9, we can only set it to 16 (if we round up) and if
>>>> dirty quota is to be set to 15 we can only set it to 8 (if we round
>>>> down). I hope you'd agree that this can make a huge difference.
>>> Yes. As discussed above, I didn't expect the ring size to be the quota
>>> per-se, so what I'm wondering is whether we can leverage a small and
>>> constant sized ring to emulate the behavior of a quota with any size, but
>>> with a minimum granule of the dirty ring size.
>> This would be an interesting thing to try. I've already planned efforts to
>> optimise it for dirty ring interface. Thank you for this suggestion.
>>
>> Side question: Is there any plan to make it possible to dynamically update
>> the dirty ring size?
> No plan that I'm aware of..
>
> After I checked: kvm_dirty_ring_get_rsvd_entries() limits our current ring
> size, which is hardware dependent on PML.  It seems at least 1024 will
> still be a work-for-all case, but not sure how it'll work in reality since
> then soft_limit of the dirty ring will be relatively small so it'll kick to
> userspace more often.  Maybe that's not a huge problem for a throttle
> scenario.  In that case the granule will be <=4MB if it'll work.
Ack. Thanks.
>
>>>> Also, this approach works for both dirty bitmap and dirty ring interface
>>>> which can help in extending this solution to other architectures.
>>> Is there any specific arch that you're interested outside x86?
>> x86 is the first priority but this patchset targets s390 and arm as well.
> I see.
>
>>> Logically we can also think about extending dirty ring to other archs, but
>>> there were indeed challenges where some pages can be dirtied without a vcpu
>>> context, and that's why it was only supported initially on x86.
>> This is an interesting problem and we are aware of it. We have a couple of
>> ideas but they are very raw as of now.
> I think a default (no-vcpu) ring will solve most of the issues, but that
> requires some thought, e.g. the major difference between ring and bitmap
> is that a ring can be full while a bitmap cannot. We need to think carefully
> about that when it comes.
>
> The other thing is IIRC s390 has been using dirty bits on the pgtables
> (either radix or hash based) to trap dirty, so that'll be challenging too
> when connected with a ring interface because it could make the whole thing
> much less helpful..
>
> So from that pov I think your solution sounds reasonable in that it
> decouples the feature from the interface, and it also looks simple.
Ack. Thanks.
>
>>> I think it should not be a problem for the quota solution, because it's
>>> backed up by the dirty bitmap so no dirty page will be overlooked for
>>> migration purpose, which is definitely a benefit.  But I'm still curious
>>> whether you looked into any specific archs already (x86 doesn't have such
>>> problem) so that maybe there's some quota you still want to apply elsewhere
>>> where there's no vcpu context.
>> Yes, this is kind of similar to one of the ideas we have thought. Though,
>> there are many things which need a lot of brainstorming, e.g. the ratio in
>> which we can split the overall quota to accommodate dirtying with no vcpu
>> context.
> I'm slightly worried it'll make things even more complicated.
>
> Only once we're thinking seriously about non-x86 platforms (since again x86
> doesn't have this issue afaict..): I think one thing we could do is to dig
> out all these cases and think about whether they need any quota at all.
>
> IOW, whether the no-vcpu dirty contexts are prone to causing VM live
> migration convergence issues.  If the answer is no, IMHO a simpler approach
> is to ignore those dirty pages for quota purposes.
Yes, we are running some experiments to identify cases where enough
dirtying can happen without a vcpu context to prevent migration from
converging.
> I think that's a major benefit of your approach comparing to the full dirty
> ring approach, because you're always backed by the dirty bitmap so there's
> no way of data loss.  Dirty ring's one major challenge is how to make sure
> all dirty PFNs get recorded and meanwhile we don't interrupt kvm workflow
> in general.
>
> One thing I'd really appreciate is if this solution would be accepted from
> the kernel POV and if you plan to work on a qemu version of it, please
> consider reading the work from Hyman Huang from China Telecom on the dirty
> limit solution (which is currently based on dirty ring):
>
> https://lore.kernel.org/qemu-devel/cover.1648748793.git.huangy81@chinatelecom.cn/
>
> Since they'll be very similar approaches to solving the same problem.
> Hopefully we could unify the work and not have two fully separate
> approaches even if they should really share something in common.
Definitely, this is already on my reading list. Thanks.
>
> Thanks,
>
Thank you for the comments.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-05-05 15:18 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-06 22:08 [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
2022-03-06 22:08 ` [PATCH v3 1/3] KVM: Implement dirty quota-based throttling of vcpus Shivam Kumar
2022-03-31  0:28   ` Sean Christopherson
2022-03-31  7:20     ` Shivam Kumar
2022-03-31 15:37       ` Sean Christopherson
2022-04-06 12:32         ` Shivam Kumar
2022-05-02 22:14   ` Peter Xu
2022-05-03  7:22     ` Shivam Kumar
2022-05-03 13:43       ` Peter Xu
2022-05-04  6:33         ` Shivam Kumar
2022-05-04 17:26           ` Peter Xu
2022-05-05 15:17             ` Shivam Kumar
2022-03-06 22:08 ` [PATCH v3 2/3] KVM: Documentation: Update kvm_run structure for dirty quota Shivam Kumar
2022-03-31  0:40   ` Sean Christopherson
2022-03-31  7:30     ` Shivam Kumar
2022-03-31 15:24       ` Sean Christopherson
2022-04-01 13:49         ` Sean Christopherson
2022-04-06 12:39           ` Shivam Kumar
2022-04-06 12:44             ` Shivam Kumar
2022-03-06 22:08 ` [PATCH v3 3/3] KVM: selftests: Add selftests for dirty quota throttling Shivam Kumar
2022-04-18  4:55   ` Shivam Kumar
2022-04-18  4:59     ` Shivam Kumar
2022-04-18 16:17       ` Sean Christopherson
2022-04-28  7:00         ` Shivam Kumar
2022-04-28 23:59           ` Sean Christopherson
2022-03-19 18:21 ` [PATCH v3 0/3] KVM: Dirty quota-based throttling Shivam Kumar
