* [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
@ 2022-11-03 20:44 David Matlack
  2022-11-07 21:21 ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: David Matlack @ 2022-11-03 20:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Sean Christopherson, kvm, David Matlack

Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked,
as it will just be faulted back in at the same 4KiB granularity when
accessed by a vCPU. This may need to be changed if KVM ever supports
2MiB (or larger) dirty tracking granularity, or faulting huge pages
during dirty tracking for reads/executes. However for now, these zaps
are entirely wasteful.

This commit does nominally increase the CPU usage of the NX recovery
worker by about 1% when testing with a 16-slot VM.

Signed-off-by: David Matlack <dmatlack@google.com>
---
In order to check if this commit increases the CPU usage of the NX
recovery worker thread I used a modified version of execute_perf_test
[1] that supports splitting guest memory into multiple slots and reports
/proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just
before tearing down the VM. The goal was to force a large number of NX
Huge Page recoveries and see if the recovery worker used any more CPU.

Test Setup:

  echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
  echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio

Test Command:

  ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o

        | kvm-nx-lpage-re:se.sum_exec_runtime (ms) |
        | ---------------------------------------- |
Run     | Before             | After               |
------- | ------------------ | ------------------- |
1       | 730.084105         | 724.375314          |
2       | 728.751339         | 740.581988          |
3       | 736.264720         | 757.078163          |

Comparing the median results, this commit results in about a 1% increase
in the CPU usage of the NX recovery worker.

[1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/

v2:
 - Only skip NX Huge Pages that are actively being dirty tracked [Paolo]

v1: https://lore.kernel.org/kvm/20221027200316.2221027-1-dmatlack@google.com/

 arch/x86/kvm/mmu/mmu.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 82bc6321e58e..1c443f9aeb4b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6831,6 +6831,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 {
 	unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
+	struct kvm_memory_slot *slot;
 	int rcu_idx;
 	struct kvm_mmu_page *sp;
 	unsigned int ratio;
@@ -6865,7 +6866,21 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 				      struct kvm_mmu_page,
 				      possible_nx_huge_page_link);
 		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
-		if (is_tdp_mmu_page(sp))
+		WARN_ON_ONCE(!sp->role.direct);
+
+		slot = gfn_to_memslot(kvm, sp->gfn);
+		WARN_ON_ONCE(!slot);
+
+		/*
+		 * Unaccount and do not attempt to recover any NX Huge Pages
+		 * that are being dirty tracked, as they would just be faulted
+		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
+		 * recovered, along with all the other huge pages in the slot,
+		 * when dirty logging is disabled.
+		 */
+		if (slot && kvm_slot_dirty_track_enabled(slot))
+			unaccount_nx_huge_page(kvm, sp);
+		else if (is_tdp_mmu_page(sp))
 			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
 		else
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

-- 
2.38.1.431.g37b22c650d-goog



* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-03 20:44 [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages David Matlack
@ 2022-11-07 21:21 ` Sean Christopherson
  2022-11-17 16:28   ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2022-11-07 21:21 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm

On Thu, Nov 03, 2022, David Matlack wrote:
> Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked,
> as it will just be faulted back in at the same 4KiB granularity when
> accessed by a vCPU. This may need to be changed if KVM ever supports
> 2MiB (or larger) dirty tracking granularity, or faulting huge pages
> during dirty tracking for reads/executes. However for now, these zaps
> are entirely wasteful.
> 
> This commit does nominally increase the CPU usage of the NX recovery
> worker by about 1% when testing with a 16-slot VM.

Probably should include the benefits of the change (avoiding zaps) if you're going
to list the downsides.  I suspect you'll have a hard time quantifying the
positives because the behavior is heavily dependent on the zapping frequency, which
is controlled by userspace, but having some understanding of the tradeoffs is
important, otherwise it's basically impossible to know whether or not that 1% CPU
usage is worth eating.

Hmm, and the memslot heuristic doesn't address the recovery worker holding mmu_lock
for write.  On a non-preemptible kernel, rwlock_needbreak() is always false, e.g.
the worker won't yield to vCPUs that are trying to handle non-fast page faults.
The worker should eventually reach steady state by unaccounting everything, but
that might take a while.
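
To illustrate, the yield check in the recovery loop looks roughly like the
sketch below (simplified; the flush/RCU bookkeeping the real loop does
before yielding is omitted):

	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
		/*
		 * On a non-preemptible kernel (no CONFIG_PREEMPTION),
		 * rwlock_needbreak() compiles to "return 0", so contention
		 * from vCPU page faults never triggers this break; only
		 * need_resched() does.
		 */
		cond_resched_rwlock_write(&kvm->mmu_lock);
	}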

An alternative idea to the memslot heuristic would be to add a knob to allow
disabling the recovery thread on a per-VM basis.  Userspace should know that it's
dirty logging a given VM for migration.

E.g.

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/mmu.c          | 12 +++++++-----
 arch/x86/kvm/x86.c              |  9 +++++++++
 include/uapi/linux/kvm.h        |  4 ++++
 tools/include/uapi/linux/kvm.h  |  4 ++++
 5 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7551b6f9c31c..2b982cea11ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1341,6 +1341,7 @@ struct kvm_arch {
 	u32 max_vcpu_ids;
 
 	bool disable_nx_huge_pages;
+	bool disable_nx_huge_page_recovery;
 
 	/*
 	 * Memory caches used to allocate shadow pages when performing eager
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f81539061d6..44d2cce14f38 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6856,12 +6856,13 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-static long get_nx_lpage_recovery_timeout(u64 start_time)
+static long get_nx_lpage_recovery_timeout(struct kvm *kvm, u64 start_time)
 {
 	bool enabled;
 	uint period;
 
-	enabled = calc_nx_huge_pages_recovery_period(&period);
+	enabled = !READ_ONCE(kvm->arch.disable_nx_huge_page_recovery) &&
+		  calc_nx_huge_pages_recovery_period(&period);
 
 	return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64()
 		       : MAX_SCHEDULE_TIMEOUT;
@@ -6874,12 +6875,12 @@ static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
 
 	while (true) {
 		start_time = get_jiffies_64();
-		remaining_time = get_nx_lpage_recovery_timeout(start_time);
+		remaining_time = get_nx_lpage_recovery_timeout(kvm, start_time);
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		while (!kthread_should_stop() && remaining_time > 0) {
 			schedule_timeout(remaining_time);
-			remaining_time = get_nx_lpage_recovery_timeout(start_time);
+			remaining_time = get_nx_lpage_recovery_timeout(kvm, start_time);
 			set_current_state(TASK_INTERRUPTIBLE);
 		}
 
@@ -6888,7 +6889,8 @@ static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
 		if (kthread_should_stop())
 			return 0;
 
-		kvm_recover_nx_lpages(kvm);
+		if (!READ_ONCE(kvm->arch.disable_nx_huge_page_recovery))
+			kvm_recover_nx_lpages(kvm);
 	}
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 521b433f978c..04f287c65ac3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4409,6 +4409,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VAPIC:
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -6376,6 +6377,14 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY:
+		r = -EINVAL;
+		if (cap->args[0] & ~KVM_X86_DISABLE_NX_HUGE_PAGE_RECOVERY)
+			break;
+
+		WRITE_ONCE(kvm->arch.disable_nx_huge_page_recovery, !!cap->args[0]);
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d5d4419139a..097661a867d9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2198,6 +2199,9 @@ struct kvm_stats_desc {
 #define KVM_X86_NOTIFY_VMEXIT_ENABLED		(1ULL << 0)
 #define KVM_X86_NOTIFY_VMEXIT_USER		(1ULL << 1)
 
+/* Available with KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY */
+#define KVM_X86_DISABLE_NX_HUGE_PAGE_RECOVERY	(1ULL << 0)
+
 /* Available with KVM_CAP_S390_ZPCI_OP */
 #define KVM_S390_ZPCI_OP         _IOW(KVMIO,  0xd1, struct kvm_s390_zpci_op)
 
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 0d5d4419139a..097661a867d9 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1178,6 +1178,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_ZPCI_OP 221
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
+#define KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY 224
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2198,6 +2199,9 @@ struct kvm_stats_desc {
 #define KVM_X86_NOTIFY_VMEXIT_ENABLED		(1ULL << 0)
 #define KVM_X86_NOTIFY_VMEXIT_USER		(1ULL << 1)
 
+/* Available with KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY */
+#define KVM_X86_DISABLE_NX_HUGE_PAGE_RECOVERY	(1ULL << 0)
+
 /* Available with KVM_CAP_S390_ZPCI_OP */
 #define KVM_S390_ZPCI_OP         _IOW(KVMIO,  0xd1, struct kvm_s390_zpci_op)
 

base-commit: 7f7bac08d9e31cd6e2c0ea1685c86ec6f1e7e03c
-- 
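
For reference, consuming the proposed capability from userspace would
presumably look something like the sketch below (the KVM_CAP_/KVM_X86_ names
only exist with the patch above applied; the helper is hypothetical):

	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/*
	 * Disable NX huge page recovery while a VM is being dirty logged for
	 * migration, and re-enable it afterwards.
	 */
	static int set_nx_recovery_disabled(int vm_fd, bool disable)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_VM_DISABLE_NX_HUGE_PAGE_RECOVERY,
			.args[0] = disable ? KVM_X86_DISABLE_NX_HUGE_PAGE_RECOVERY : 0,
		};

		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}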


> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
> In order to check if this commit increases the CPU usage of the NX
> recovery worker thread I used a modified version of execute_perf_test
> [1] that supports splitting guest memory into multiple slots and reports
> /proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just
> before tearing down the VM. The goal was to force a large number of NX
> Huge Page recoveries and see if the recovery worker used any more CPU.
> 
> Test Setup:
> 
>   echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
>   echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio
> 
> Test Command:
> 
>   ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o
> 
>         | kvm-nx-lpage-re:se.sum_exec_runtime (ms) |
>         | ---------------------------------------- |
> Run     | Before             | After               |
> ------- | ------------------ | ------------------- |
> 1       | 730.084105         | 724.375314          |
> 2       | 728.751339         | 740.581988          |
> 3       | 736.264720         | 757.078163          |
> 
> Comparing the median results, this commit results in about a 1% increase
> in the CPU usage of the NX recovery worker.
> 
> [1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/
> 
> v2:
>  - Only skip NX Huge Pages that are actively being dirty tracked [Paolo]
> 
> v1: https://lore.kernel.org/kvm/20221027200316.2221027-1-dmatlack@google.com/
> 
>  arch/x86/kvm/mmu/mmu.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 82bc6321e58e..1c443f9aeb4b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6831,6 +6831,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
>  static void kvm_recover_nx_huge_pages(struct kvm *kvm)
>  {
>  	unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
> +	struct kvm_memory_slot *slot;
>  	int rcu_idx;
>  	struct kvm_mmu_page *sp;
>  	unsigned int ratio;
> @@ -6865,7 +6866,21 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
>  				      struct kvm_mmu_page,
>  				      possible_nx_huge_page_link);
>  		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
> -		if (is_tdp_mmu_page(sp))
> +		WARN_ON_ONCE(!sp->role.direct);

This is an unrelated change, no?  In the sense that nothing below will break if
the SP isn't direct.

> +
> +		slot = gfn_to_memslot(kvm, sp->gfn);
> +		WARN_ON_ONCE(!slot);
> +
> +		/*
> +		 * Unaccount and do not attempt to recover any NX Huge Pages
> +		 * that are being dirty tracked, as they would just be faulted
> +		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
> +		 * recovered, along with all the other huge pages in the slot,
> +		 * when dirty logging is disabled.
> +		 */
> +		if (slot && kvm_slot_dirty_track_enabled(slot))

Not sure it's cleaner, but this could be:

		if (!WARN_ON_ONCE(!slot) && kvm_slot_dirty_track_enabled(slot))

to show that KVM doesn't blow up if the above WARN fires.

> +			unaccount_nx_huge_page(kvm, sp);
> +		else if (is_tdp_mmu_page(sp))
>  			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
>  		else
>  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> 
> -- 
> 2.38.1.431.g37b22c650d-goog
> 


* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-07 21:21 ` Sean Christopherson
@ 2022-11-17 16:28   ` Paolo Bonzini
  2022-11-17 16:39     ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2022-11-17 16:28 UTC (permalink / raw)
  To: Sean Christopherson, David Matlack; +Cc: kvm

On 11/7/22 22:21, Sean Christopherson wrote:
> 
> Hmm, and the memslot heuristic doesn't address the recovery worker holding mmu_lock
> for write.  On a non-preemptible kernel, rwlock_needbreak() is always false, e.g.
> the worker won't yield to vCPUs that are trying to handle non-fast page faults.
> The worker should eventually reach steady state by unaccounting everything, but
> that might take a while.

I'm not sure what you mean here?  The recovery worker will still 
decrease to_zap by 1 on every unaccounted NX hugepage, and go to sleep 
after it reaches 0.

Also, David's test used a 10-second halving time for the recovery 
thread.  With the 1 hour time the effect would be much smaller.  Perhaps
the 1 hour time used by default by KVM is overly conservative, but 1% over
10 seconds is certainly a much larger effect than 1% over 1 hour.

So, I'm queuing the patch.

Paolo

> An alternative idea to the memslot heuristic would be to add a knob to allow
> disabling the recovery thread on a per-VM basis.  Userspace should know that it's
> dirty logging a given VM for migration.



* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-17 16:28   ` Paolo Bonzini
@ 2022-11-17 16:39     ` Sean Christopherson
  2022-11-17 16:57       ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2022-11-17 16:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: David Matlack, kvm

On Thu, Nov 17, 2022, Paolo Bonzini wrote:
> On 11/7/22 22:21, Sean Christopherson wrote:
> > 
> > Hmm, and the memslot heuristic doesn't address the recovery worker holding mmu_lock
> > for write.  On a non-preemptible kernel, rwlock_needbreak() is always false, e.g.
> > the worker won't yield to vCPUs that are trying to handle non-fast page faults.
> > The worker should eventually reach steady state by unaccounting everything, but
> > that might take a while.
> 
> I'm not sure what you mean here?  The recovery worker will still decrease
> to_zap by 1 on every unaccounted NX hugepage, and go to sleep after it
> reaches 0.

Right, what I'm saying is that this approach is still sub-optimal because it does
all that work while holding mmu_lock for write.

> Also, David's test used a 10-second halving time for the recovery thread.
> With the 1 hour time the effect would be much smaller.  Perhaps the 1 hour
> time used by default by KVM is overly conservative, but 1% over 10 seconds
> is certainly a much larger effect than 1% over 1 hour.

It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
operations on other tasks/vCPUs.  Given that this is related to dirty logging,
odds are very good that there will be a variety of operations in flight, e.g.
KVM_GET_DIRTY_LOG.  If the recovery ratio is aggressive, and/or there are a lot
of pages to recover, the recovery thread could hold mmu_lock until a resched is
needed.


* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-17 16:39     ` Sean Christopherson
@ 2022-11-17 16:57       ` Paolo Bonzini
  2022-11-17 17:03         ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2022-11-17 16:57 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: David Matlack, kvm

On 11/17/22 17:39, Sean Christopherson wrote:
> Right, what I'm saying is that this approach is still sub-optimal because it does
> all that work while holding mmu_lock for write.
> 
>> Also, David's test used a 10-second halving time for the recovery thread.
>> With the 1 hour time the effect would be much smaller.  Perhaps the 1 hour
>> time used by default by KVM is overly conservative, but 1% over 10 seconds
>> is certainly a much larger effect than 1% over 1 hour.
> 
> It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
> operations on other tasks/vCPUs.  Given that this is related to dirty logging,
> odds are very good that there will be a variety of operations in flight, e.g.
> KVM_GET_DIRTY_LOG.  If the recovery ratio is aggressive, and/or there are a lot
> of pages to recover, the recovery thread could hold mmu_lock until a reched is
> needed.

If you need that, you need to configure your kernel to be preemptible, 
at least voluntarily.  That's in general a good idea for KVM, given its 
rwlock-happiness.

And the patch is not making it worse, is it?  Yes, you have to look up 
the memslot, but the work to do that should be less than what you save 
by not zapping the page.

Perhaps we could add to struct kvm a counter of the number of log-pages 
memslots.  While a correct value would only be readable with slots_lock 
taken, the NX recovery thread is content with an approximate value and 
therefore can retrieve it with READ_ONCE or atomic_read().
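
Something along these lines (a very rough sketch; the field, the flag-change
hook and the local variable are hypothetical, and it is untested):

	/* in struct kvm */
	atomic_t nr_dirty_logging_memslots;

	/* wherever KVM_MEM_LOG_DIRTY_PAGES is toggled, under slots_lock */
	if (!(old_flags & KVM_MEM_LOG_DIRTY_PAGES) &&
	    (new_flags & KVM_MEM_LOG_DIRTY_PAGES))
		atomic_inc(&kvm->nr_dirty_logging_memslots);
	else if ((old_flags & KVM_MEM_LOG_DIRTY_PAGES) &&
		 !(new_flags & KVM_MEM_LOG_DIRTY_PAGES))
		atomic_dec(&kvm->nr_dirty_logging_memslots);

	/* in kvm_recover_nx_huge_pages(): an approximate value is enough to
	 * decide whether the gfn_to_memslot() check is worth doing at all */
	dirty_logging = atomic_read(&kvm->nr_dirty_logging_memslots) > 0;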

Paolo



* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-17 16:57       ` Paolo Bonzini
@ 2022-11-17 17:03         ` Sean Christopherson
  2022-11-17 17:15           ` David Matlack
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2022-11-17 17:03 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: David Matlack, kvm

On Thu, Nov 17, 2022, Paolo Bonzini wrote:
> On 11/17/22 17:39, Sean Christopherson wrote:
> > Right, what I'm saying is that this approach is still sub-optimal because it does
> > all that work while holding mmu_lock for write.
> > 
> > > Also, David's test used a 10-second halving time for the recovery thread.
> > > With the 1 hour time the effect would be much smaller.  Perhaps the 1 hour
> > > time used by default by KVM is overly conservative, but 1% over 10 seconds
> > > is certainly a much larger effect than 1% over 1 hour.
> > 
> > It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
> > operations on other tasks/vCPUs.  Given that this is related to dirty logging,
> > odds are very good that there will be a variety of operations in flight, e.g.
> > KVM_GET_DIRTY_LOG.  If the recovery ratio is aggressive, and/or there are a lot
> > of pages to recover, the recovery thread could hold mmu_lock until a reched is
> > needed.
> 
> If you need that, you need to configure your kernel to be preemptible, at
> least voluntarily.  That's in general a good idea for KVM, given its
> rwlock-happiness.

IMO, it's not that simple.  We always "need" better live migration performance,
but we don't need/want preemption in general.

> And the patch is not making it worse, is it?  Yes, you have to look up the
> memslot, but the work to do that should be less than what you save by not
> zapping the page.

Yes, my objection  is that we're adding a heuristic to guess at userspace's
intentions (it's probably a good guess, but still) and the resulting behavior isn't
optimal.  Giving userspace an explicit knob seems straightforward and would address
both of those issues, why not go that route?

> Perhaps we could add to struct kvm a counter of the number of log-pages
> memslots.  While a correct value would only be readable with slots_lock
> taken, the NX recovery thread is content with an approximate value and
> therefore can retrieve it with READ_ONCE or atomic_read().
> 
> Paolo
> 


* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-17 17:03         ` Sean Christopherson
@ 2022-11-17 17:15           ` David Matlack
  2022-11-17 19:07             ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: David Matlack @ 2022-11-17 17:15 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm

On Thu, Nov 17, 2022 at 9:04 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 17, 2022, Paolo Bonzini wrote:
> > On 11/17/22 17:39, Sean Christopherson wrote:
> > > Right, what I'm saying is that this approach is still sub-optimal because it does
> > > all that work while holding mmu_lock for write.
> > >
> > > > Also, David's test used a 10-second halving time for the recovery thread.
> > > > With the 1 hour time the effect would be much smaller.  Perhaps the 1 hour
> > > > time used by default by KVM is overly conservative, but 1% over 10 seconds
> > > > is certainly a much larger effect than 1% over 1 hour.
> > >
> > > It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
> > > operations on other tasks/vCPUs.  Given that this is related to dirty logging,
> > > odds are very good that there will be a variety of operations in flight, e.g.
> > > KVM_GET_DIRTY_LOG.  If the recovery ratio is aggressive, and/or there are a lot
> > > of pages to recover, the recovery thread could hold mmu_lock until a reched is
> > > needed.
> >
> > If you need that, you need to configure your kernel to be preemptible, at
> > least voluntarily.  That's in general a good idea for KVM, given its
> > rwlock-happiness.
>
> IMO, it's not that simple.  We always "need" better live migration performance,
> but we don't need/want preemption in general.
>
> > And the patch is not making it worse, is it?  Yes, you have to look up the
> > memslot, but the work to do that should be less than what you save by not
> > zapping the page.
>
> Yes, my objection  is that we're adding a heuristic to guess at userspace's
> intentions (it's probably a good guess, but still) and the resulting behavior isn't
> optimal.  Giving userspace an explicit knob seems straightforward and would address
> both of those issues, why not go that route?

In this case KVM knows that zapping dirty-tracked pages is completely
useless, regardless of what userspace is doing, so there's no
guessing.

A userspace knob requires userspace guess at KVM's implementation
details. e.g. KVM could theoretically support faulting in read
accesses and execute accesses as write-protected huge pages during
dirty logging. Or KVM might support 2MiB+ dirty logging. In both
cases a binary userspace knob might not be the best fit.

I agree that, even with this patch, KVM is still suboptimal because it
is holding the MMU lock to do all these checks. But this patch should
at least be a step in the right direction for reducing customer
hiccups during live migration.

Also as for the CPU usage, I did a terrible job of explaining the
impact. It's a 1% increase over the current usage, but the current
usage is extremely low even with my way overly aggressive settings.
Specifically, the CPU usage of the NX recovery worker increased from
0.73 CPU-seconds to 0.74 CPU-seconds over a 2.5 minute runtime.


* Re: [PATCH v2] KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages
  2022-11-17 17:15           ` David Matlack
@ 2022-11-17 19:07             ` Sean Christopherson
  0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2022-11-17 19:07 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm

On Thu, Nov 17, 2022, David Matlack wrote:
> On Thu, Nov 17, 2022 at 9:04 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Nov 17, 2022, Paolo Bonzini wrote:
> > > On 11/17/22 17:39, Sean Christopherson wrote:
> > > > Right, what I'm saying is that this approach is still sub-optimal because it does
> > > > all that work while holding mmu_lock for write.
> > > >
> > > > > Also, David's test used a 10-second halving time for the recovery thread.
> > > > > With the 1 hour time the effect would be much smaller.  Perhaps the 1 hour
> > > > > time used by default by KVM is overly conservative, but 1% over 10 seconds
> > > > > is certainly a much larger effect than 1% over 1 hour.
> > > >
> > > > It's not the CPU usage I'm thinking of, it's the unnecessary blockage of MMU
> > > > operations on other tasks/vCPUs.  Given that this is related to dirty logging,
> > > > odds are very good that there will be a variety of operations in flight, e.g.
> > > > KVM_GET_DIRTY_LOG.  If the recovery ratio is aggressive, and/or there are a lot
> > > > of pages to recover, the recovery thread could hold mmu_lock until a reched is
> > > > needed.
> > >
> > > If you need that, you need to configure your kernel to be preemptible, at
> > > least voluntarily.  That's in general a good idea for KVM, given its
> > > rwlock-happiness.
> >
> > IMO, it's not that simple.  We always "need" better live migration performance,
> > but we don't need/want preemption in general.
> >
> > > And the patch is not making it worse, is it?  Yes, you have to look up the
> > > memslot, but the work to do that should be less than what you save by not
> > > zapping the page.
> >
> > Yes, my objection  is that we're adding a heuristic to guess at userspace's
> > intentions (it's probably a good guess, but still) and the resulting behavior isn't
> > optimal.  Giving userspace an explicit knob seems straightforward and would address
> > both of those issues, why not go that route?
> 
> In this case KVM knows that zapping dirty-tracked pages is completely
> useless, regardless of what userspace is doing, so there's no
> guessing.
> 
> A userspace knob requires userspace guess at KVM's implementation
> details. e.g. KVM could theoretically support faulting in read
> accesses and execute accesses as write-protected huge pages during
> dirty logging. Or KVM might support 2MiB+ dirty logging. In both
> cases a binary userspace knob might not be the best fit.

Hmm, maybe.  If userspace is migrating a VM, zapping shadow pages to try and
allow NX huge pages may be undesirable irrespective of KVM internals.  E.g. even
if KVM supports 2MiB dirty logging, zapping an entire 2MiB region of guest memory
to _maybe_ install a huge page while the guest is already likely experiencing
jitter is probably a net negative.

I do agree that they are somewhat complementary though, e.g. even if userspace is
aware of the per-VM knob, userspace might want to allow reaping during migration
for whatever reason.  Or conversely, userspace might want to temporarily disable
reaping for reasons completely unrelated to migration.

> I agree that, even with this patch, KVM is still suboptimal because it
> is holding the MMU lock to do all these checks. But this patch should
> at least be a step in the right direction for reducing customer
> hiccups during live migration.

True.

> Also as for the CPU usage, I did a terrible job of explaining the
> impact. It's a 1% increase over the current usage, but the current
> usage is extremely low even with my way overly aggressive settings.
> Specifically, the CPU usage of the NX recovery worker increased from
> 0.73 CPU-seconds to 0.74 CPU-seconds over a 2.5 minute runtime.

Heh, that does change things a bit.

Objection officially withdrawn, allowing userspace to turn off the reaper can be
done on top if it actually adds value.

