linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page
@ 2023-06-02  1:15 Sean Christopherson
  2023-06-02  1:15 ` [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Sean Christopherson
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-06-02  1:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jason Gunthorpe, Alistair Popple, Robin Murphy

Convert VMX's handling of mmu_notifier invalidations of the APIC-access page
from invalidate_range() to KVM's standard invalidate_range_{start,end}().

KVM (ab)uses invalidate_range() to fudge around not stalling vCPUs until
relevant in-flight invalidations complete.  Abusing invalidate_range() works,
but it requires one-off code in KVM, sets a bad precedent in KVM, and is
blocking improvements to mmu_notifier's definition of invalidate_range()
due to KVM's usage diverging wildly from the original intent of notifying
IOMMUs of changes to shared page tables.

Clean up the mess by hooking x86's implementation of kvm_unmap_gfn_range()
and stalling vCPUs by re-requesting KVM_REQ_APIC_PAGE_RELOAD until the
invalidation completes.
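[Editor's sketch] The stall works because KVM's generic mmu_notifier hooks maintain a
sequence count that is bumped when an invalidation completes, plus an in-progress
counter, and consumers retry whenever either indicates that a just-looked-up pfn may
be stale.  As a rough illustration only — hypothetical `toy_*` names, single-threaded,
with the real code's smp_rmb()/smp_wmb() pairing and mmu_lock omitted (see
virt/kvm/kvm_main.c for the actual implementation) — the protocol looks like:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the mmu_notifier retry protocol.  Hypothetical names and
 * a single-threaded sketch; the real code orders these accesses with
 * memory barriers and checks the retry condition under mmu_lock.
 */
struct toy_kvm {
	unsigned long mmu_invalidate_seq;	/* bumped when an invalidation completes */
	long mmu_invalidate_in_progress;	/* non-zero between start() and end() */
};

/* ->invalidate_range_start(): mark an invalidation as in-flight. */
void toy_invalidate_start(struct toy_kvm *kvm)
{
	kvm->mmu_invalidate_in_progress++;
}

/* ->invalidate_range_end(): retire the invalidation and bump the count. */
void toy_invalidate_end(struct toy_kvm *kvm)
{
	kvm->mmu_invalidate_seq++;
	kvm->mmu_invalidate_in_progress--;
}

/* Consumer side: snapshot the count before retrieving the pfn... */
unsigned long toy_read_seq(struct toy_kvm *kvm)
{
	return kvm->mmu_invalidate_seq;
}

/*
 * ...and, before installing the pfn, retry if an invalidation is still
 * in flight or completed after the snapshot was taken.
 */
bool toy_retry(struct toy_kvm *kvm, unsigned long mmu_seq)
{
	return kvm->mmu_invalidate_in_progress ||
	       kvm->mmu_invalidate_seq != mmu_seq;
}
```

A vCPU snapshots the count, fetches the pfn from the primary MMU, and installs it
only if nothing was invalidated in between; otherwise it re-requests
KVM_REQ_APIC_PAGE_RELOAD and tries again later, which is the stall described above.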

Sean Christopherson (3):
  KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access
    page
  KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares

 arch/x86/kvm/mmu/mmu.c   |  4 ++++
 arch/x86/kvm/vmx/vmx.c   | 50 ++++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/x86.c       | 14 -----------
 include/linux/kvm_host.h |  3 ---
 virt/kvm/kvm_main.c      | 18 ---------------
 5 files changed, 49 insertions(+), 40 deletions(-)


base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
@ 2023-06-02  1:15 ` Sean Christopherson
  2023-06-06  2:11   ` Alistair Popple
  2023-06-07  7:40   ` yu.c.zhang
  2023-06-02  1:15 ` [PATCH 2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Sean Christopherson
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-06-02  1:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jason Gunthorpe, Alistair Popple, Robin Murphy

Re-request an APIC-access page reload if there is a relevant mmu_notifier
invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
vCPUs until the backing pfn for the APIC-access page is "officially"
stable.  Relying on the primary MMU to not make changes after invoking
->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
would also trigger an ->invalidate_range(), but using ->invalidate_range()
to fudge around KVM not honoring past and in-progress invalidations is a
bit hacky.

Honoring invalidations will allow using KVM's standard mmu_notifier hooks
to detect APIC-access page reloads, which will in turn allow removing
KVM's implementation of ->invalidate_range() (the APIC-access page case is
a true one-off).

Opportunistically add a comment to explain why doing nothing if a memslot
isn't found is functionally correct.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 50 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..59195f0dc7a5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6708,7 +6708,12 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 
 static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 {
-	struct page *page;
+	const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *slot;
+	unsigned long mmu_seq;
+	kvm_pfn_t pfn;
 
 	/* Defer reload until vmcs01 is the current VMCS. */
 	if (is_guest_mode(vcpu)) {
@@ -6720,18 +6725,53 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 		return;
 
-	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
-	if (is_error_page(page))
+	/*
+	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
+	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
+	 * on the pfn lookup's validation of the memslot to ensure a valid hva
+	 * is used for the retry check.
+	 */
+	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
+	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
 		return;
 
-	vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+	/*
+	 * Ensure that the mmu_notifier sequence count is read before KVM
+	 * retrieves the pfn from the primary MMU.  Note, the memslot is
+	 * protected by SRCU, not the mmu_notifier.  Pairs with the smp_wmb()
+	 * in kvm_mmu_invalidate_end().
+	 */
+	mmu_seq = kvm->mmu_invalidate_seq;
+	smp_rmb();
+
+	/*
+	 * No need to retry if the memslot does not exist or is invalid.  KVM
+	 * controls the APIC-access page memslot, and only deletes the memslot
+	 * if APICv is permanently inhibited, i.e. the memslot won't reappear.
+	 */
+	pfn = gfn_to_pfn_memslot(slot, gfn);
+	if (is_error_noslot_pfn(pfn))
+		return;
+
+	read_lock(&vcpu->kvm->mmu_lock);
+	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
+				     gfn_to_hva_memslot(slot, gfn))) {
+		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
+		read_unlock(&vcpu->kvm->mmu_lock);
+		goto out;
+	}
+
+	vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
+	read_unlock(&vcpu->kvm->mmu_lock);
+
 	vmx_flush_tlb_current(vcpu);
 
+out:
 	/*
 	 * Do not pin apic access page in memory, the MMU notifier
 	 * will call us again if it is migrated or swapped out.
 	 */
-	put_page(page);
+	kvm_release_pfn_clean(pfn);
 }
 
 static void vmx_hwapic_isr_update(int max_isr)
-- 
2.41.0.rc2.161.g9c6817b8e7-goog



* [PATCH 2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page
  2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
  2023-06-02  1:15 ` [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Sean Christopherson
@ 2023-06-02  1:15 ` Sean Christopherson
  2023-06-06  2:20   ` Alistair Popple
  2023-06-02  1:15 ` [PATCH 3/3] KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares Sean Christopherson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-06-02  1:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jason Gunthorpe, Alistair Popple, Robin Murphy

Now that KVM honors past and in-progress mmu_notifier invalidations when
reloading the APIC-access page, use KVM's "standard" invalidation hooks
to trigger a reload and delete the one-off usage of invalidate_range().

Aside from eliminating one-off code in KVM, dropping KVM's use of
invalidate_range() will allow common mmu_notifier to redefine the API to
be more strictly focused on invalidating secondary TLBs that share the
primary MMU's page tables.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   |  3 +++
 arch/x86/kvm/x86.c       | 14 --------------
 include/linux/kvm_host.h |  3 ---
 virt/kvm/kvm_main.c      | 18 ------------------
 4 files changed, 3 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..01a11ce68e57 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1600,6 +1600,9 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
 
+	if (range->slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
 	return flush;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ceb7c5e9cf9e..141a62e59d2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10431,20 +10431,6 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 		vcpu, (u64 *)vcpu->arch.ioapic_handled_vectors);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-					    unsigned long start, unsigned long end)
-{
-	unsigned long apic_address;
-
-	/*
-	 * The physical address of apic access page is stored in the VMCS.
-	 * Update it when it becomes invalid.
-	 */
-	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
-	if (start <= apic_address && apic_address < end)
-		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
-}
-
 void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
 {
 	static_call_cond(kvm_x86_guest_memory_reclaimed)(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..cb66f4100be7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2237,9 +2237,6 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-					    unsigned long start, unsigned long end);
-
 void kvm_arch_guest_memory_reclaimed(struct kvm *kvm);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb5c13eee193..844ff6b0b21d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -154,11 +154,6 @@ static unsigned long long kvm_active_vms;
 
 static DEFINE_PER_CPU(cpumask_var_t, cpu_kick_mask);
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-						   unsigned long start, unsigned long end)
-{
-}
-
 __weak void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
 {
 }
@@ -521,18 +516,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
-					      struct mm_struct *mm,
-					      unsigned long start, unsigned long end)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int idx;
-
-	idx = srcu_read_lock(&kvm->srcu);
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
-	srcu_read_unlock(&kvm->srcu, idx);
-}
-
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
@@ -892,7 +875,6 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-	.invalidate_range	= kvm_mmu_notifier_invalidate_range,
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-- 
2.41.0.rc2.161.g9c6817b8e7-goog



* [PATCH 3/3] KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares
  2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
  2023-06-02  1:15 ` [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Sean Christopherson
  2023-06-02  1:15 ` [PATCH 2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Sean Christopherson
@ 2023-06-02  1:15 ` Sean Christopherson
  2023-06-05 10:15 ` [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Paolo Bonzini
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-06-02  1:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jason Gunthorpe, Alistair Popple, Robin Murphy

Request an APIC-access page reload when the backing page is migrated (or
unmapped) if and only if vendor code actually plugs the backing pfn into
structures that reside outside of KVM's MMU.  This avoids kicking all
vCPUs in the (hopefully infrequent) scenario where the backing page is
migrated/invalidated.

Unlike VMX's APICv, SVM's AVIC doesn't plug the backing pfn directly into
the VMCB and so doesn't need a hook to invalidate an out-of-MMU "mapping".

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 01a11ce68e57..beb507d82adf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1600,7 +1600,8 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
 
-	if (range->slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT)
+	if (kvm_x86_ops.set_apic_access_page_addr &&
+	    range->slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 
 	return flush;
-- 
2.41.0.rc2.161.g9c6817b8e7-goog



* Re: [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page
  2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
                   ` (2 preceding siblings ...)
  2023-06-02  1:15 ` [PATCH 3/3] KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares Sean Christopherson
@ 2023-06-05 10:15 ` Paolo Bonzini
  2023-06-06 16:43 ` Jason Gunthorpe
  2023-06-07  0:55 ` Sean Christopherson
  5 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2023-06-05 10:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, Jason Gunthorpe, Alistair Popple, Robin Murphy

On 6/2/23 03:15, Sean Christopherson wrote:
> Convert VMX's handling of mmu_notifier invalidations of the APIC-access page
> from invalidate_range() to KVM's standard invalidate_range_{start,end}().
> 
> [...]


Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo



* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-02  1:15 ` [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Sean Christopherson
@ 2023-06-06  2:11   ` Alistair Popple
  2023-06-06 17:22     ` Sean Christopherson
  2023-06-07  7:40   ` yu.c.zhang
  1 sibling, 1 reply; 19+ messages in thread
From: Alistair Popple @ 2023-06-06  2:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe, Robin Murphy


Thanks for doing this. I'm not overly familiar with KVM implementation
but am familiar with mmu notifiers so read through the KVM usage. Looks
like KVM is sort of doing something similar to what mmu_interval_notifiers
do and I wonder if some of that could be shared. Anyway I didn't spot
anything wrong here so feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Sean Christopherson <seanjc@google.com> writes:

> Re-request an APIC-access page reload if there is a relevant mmu_notifier
> invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
> vCPUs until the backing pfn for the APIC-access page is "officially"
> stable.  Relying on the primary MMU to not make changes after invoking
> ->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
> would also trigger an ->invalidate_range(), but using ->invalidate_range()
> to fudge around KVM not honoring past and in-progress invalidations is a
> bit hacky.
>
> [...]


* Re: [PATCH 2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page
  2023-06-02  1:15 ` [PATCH 2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Sean Christopherson
@ 2023-06-06  2:20   ` Alistair Popple
  0 siblings, 0 replies; 19+ messages in thread
From: Alistair Popple @ 2023-06-06  2:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe, Robin Murphy


Again, I'm no KVM expert, but I read through this and it all looked correct
and made sense, so feel free to add:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Sean Christopherson <seanjc@google.com> writes:

> Now that KVM honors past and in-progress mmu_notifier invalidations when
> reloading the APIC-access page, use KVM's "standard" invalidation hooks
> to trigger a reload and delete the one-off usage of invalidate_range().
>
> [...]



* Re: [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page
  2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
                   ` (3 preceding siblings ...)
  2023-06-05 10:15 ` [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Paolo Bonzini
@ 2023-06-06 16:43 ` Jason Gunthorpe
  2023-06-07  0:55 ` Sean Christopherson
  5 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2023-06-06 16:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Alistair Popple, Robin Murphy

On Thu, Jun 01, 2023 at 06:15:15PM -0700, Sean Christopherson wrote:
> Convert VMX's handling of mmu_notifier invalidations of the APIC-access page
> from invalidate_range() to KVM's standard invalidate_range_{start,end}().
> 
> [...]

I don't know much about KVM, but this looks like what I had in mind
and is a good way to use mmu notifiers.

Thanks,
Jason


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-06  2:11   ` Alistair Popple
@ 2023-06-06 17:22     ` Sean Christopherson
  2023-06-06 17:39       ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-06-06 17:22 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe, Robin Murphy

On Tue, Jun 06, 2023, Alistair Popple wrote:
> 
> Thanks for doing this. I'm not overly familiar with KVM implementation
> but am familiar with mmu notifiers so read through the KVM usage. Looks
> like KVM is sort of doing something similar to what mmu_interval_notifiers
> do and I wonder if some of that could be shared.

They're very, very similar.  At a glance, KVM likely could be moved over to the
common interval tree implementation, but it would require a substantial amount of
work in KVM, and various extensions to the mmu notifiers too.  I definitely wouldn't
be opposed to converging KVM if someone wants to put in the effort, but at the same
time I wouldn't recommend that anyone actually do the work as the ROI is likely
very low.


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-06 17:22     ` Sean Christopherson
@ 2023-06-06 17:39       ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2023-06-06 17:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Alistair Popple, Paolo Bonzini, kvm, linux-kernel, Robin Murphy

On Tue, Jun 06, 2023 at 10:22:14AM -0700, Sean Christopherson wrote:
> On Tue, Jun 06, 2023, Alistair Popple wrote:
> > 
> > Thanks for doing this. I'm not overly familiar with KVM implementation
> > but am familiar with mmu notifiers so read through the KVM usage. Looks
> > like KVM is sort of doing something similar to what mmu_interval_notifiers
> > do and I wonder if some of that could be shared.
> 
> They're very, very similar.  At a glance, KVM likely could be moved over to the
> common interval tree implementation, but it would require a substantial amount of
> work in KVM, and various extensions to the mmu notifiers too.  I definitely wouldn't
> be opposed to converging KVM if someone wants to put in the effort, but at the same
> time I wouldn't recommend that anyone actually do the work as the ROI is likely
> very low.

I looked at it lightly a long time ago and came to sort of the same
conclusion; it might even be a negative for KVM, as it isn't quite as
tightly inlined and lock-free.

Jason


* Re: [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page
  2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
                   ` (4 preceding siblings ...)
  2023-06-06 16:43 ` Jason Gunthorpe
@ 2023-06-07  0:55 ` Sean Christopherson
  5 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-06-07  0:55 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Alistair Popple, Robin Murphy, Jason Gunthorpe

On Thu, 01 Jun 2023 18:15:15 -0700, Sean Christopherson wrote:
> Convert VMX's handling of mmu_notifier invalidations of the APIC-access page
> from invalidate_range() to KVM's standard invalidate_range_{start,end}().
> 
> KVM (ab)uses invalidate_range() to fudge around not stalling vCPUs until
> relevant in-flight invalidations complete.  Abusing invalidate_range() works,
> but it requires one-off code in KVM, sets a bad precedent in KVM, and is
> blocking improvements to mmu_notifier's definition of invalidate_range()
> due to KVM's usage diverging wildly from the original intent of notifying
> IOMMUs of changes to shared page tables.
> 
> [...]

Applied to kvm-x86 vmx, thanks!

[1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
      https://github.com/kvm-x86/linux/commit/878940b33d76
[2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page
      https://github.com/kvm-x86/linux/commit/0a8a5f2c8c26
[3/3] KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares
      https://github.com/kvm-x86/linux/commit/0a3869e14d4a

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-02  1:15 ` [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Sean Christopherson
  2023-06-06  2:11   ` Alistair Popple
@ 2023-06-07  7:40   ` yu.c.zhang
  2023-06-07 14:30     ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: yu.c.zhang @ 2023-06-07  7:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

On Thu, Jun 01, 2023 at 06:15:16PM -0700, Sean Christopherson wrote:
> Re-request an APIC-access page reload if there is a relevant mmu_notifier
> invalidation in-progress when KVM retrieves the backing pfn, i.e. stall
> vCPUs until the backing pfn for the APIC-access page is "officially"
> stable.  Relying on the primary MMU to not make changes after invoking
> ->invalidate_range() works, e.g. any additional changes to a PRESENT PTE
> would also trigger an ->invalidate_range(), but using ->invalidate_range()
> to fudge around KVM not honoring past and in-progress invalidations is a
> bit hacky.
> 
> Honoring invalidations will allow using KVM's standard mmu_notifier hooks
> to detect APIC-access page reloads, which will in turn allow removing
> KVM's implementation of ->invalidate_range() (the APIC-access page case is
> a true one-off).
> 
> Opportunistically add a comment to explain why doing nothing if a memslot
> isn't found is functionally correct.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 50 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 44fb619803b8..59195f0dc7a5 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6708,7 +6708,12 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
>  
>  static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  {
> -	struct page *page;
> +	const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_memslots *slots = kvm_memslots(kvm);
> +	struct kvm_memory_slot *slot;
> +	unsigned long mmu_seq;
> +	kvm_pfn_t pfn;
>  
>  	/* Defer reload until vmcs01 is the current VMCS. */
>  	if (is_guest_mode(vcpu)) {
> @@ -6720,18 +6725,53 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  	    SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
>  		return;
>  
> -	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> -	if (is_error_page(page))
> +	/*
> +	 * Grab the memslot so that the hva lookup for the mmu_notifier retry
> +	 * is guaranteed to use the same memslot as the pfn lookup, i.e. rely
> +	 * on the pfn lookup's validation of the memslot to ensure a valid hva
> +	 * is used for the retry check.
> +	 */
> +	slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
> +	if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
>  		return;
>  
> -	vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +	/*
> +	 * Ensure that the mmu_notifier sequence count is read before KVM
> +	 * retrieves the pfn from the primary MMU.  Note, the memslot is
> +	 * protected by SRCU, not the mmu_notifier.  Pairs with the smp_wmb()
> +	 * in kvm_mmu_invalidate_end().
> +	 */
> +	mmu_seq = kvm->mmu_invalidate_seq;
> +	smp_rmb();
> +
> +	/*
> +	 * No need to retry if the memslot does not exist or is invalid.  KVM
> +	 * controls the APIC-access page memslot, and only deletes the memslot
> +	 * if APICv is permanently inhibited, i.e. the memslot won't reappear.
> +	 */
> +	pfn = gfn_to_pfn_memslot(slot, gfn);
> +	if (is_error_noslot_pfn(pfn))
> +		return;
> +
> +	read_lock(&vcpu->kvm->mmu_lock);
> +	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> +				     gfn_to_hva_memslot(slot, gfn))) {
> +		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);

Are the mmu_invalidate_retry_hva() and the following request meant to stall
the vCPU if there's an ongoing invalidation?

If yes, may I ask how would the vCPU be stalled?

Below are my understandings and confusions about this process. I must have
missed something, waiting to be educated... :) 

When the backing page of APIC access page is to be reclaimed:
1> kvm_mmu_notifier_invalidate_range_start() -> __kvm_handle_hva_range() will
increase the kvm->mmu_invalidate_in_progress and account the start/end of this
page in kvm_mmu_invalidate_begin().
2> And then kvm_unmap_gfn_range() will zap the TDP, and send the request,
KVM_REQ_APIC_PAGE_RELOAD, to all vCPUs.
3> While a vCPU tries to reload the APIC access page before entering the guest,
kvm->mmu_invalidate_in_progress will be checked in mmu_invalidate_retry_hva(),
but it is possible (or is it?) that kvm->mmu_invalidate_in_progress is still
positive, so KVM_REQ_APIC_PAGE_RELOAD is set again. No reload, and no TLB flush.
4> So what if the vCPU resumes with the KVM_REQ_APIC_PAGE_RELOAD & KVM_REQ_TLB_FLUSH
flags being set, yet the APIC access page is not reloaded and the TLB is not flushed?
Or, will this happen at all?

One more dumb question - why does KVM not just pin the APIC access page? 

> +		read_unlock(&vcpu->kvm->mmu_lock);
> +		goto out;
> +	}
> +
> +	vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
> +	read_unlock(&vcpu->kvm->mmu_lock);
> +
>  	vmx_flush_tlb_current(vcpu);
>  
> +out:
>  	/*
>  	 * Do not pin apic access page in memory, the MMU notifier
>  	 * will call us again if it is migrated or swapped out.
>  	 */
> -	put_page(page);
> +	kvm_release_pfn_clean(pfn);
>  }
>  
>  static void vmx_hwapic_isr_update(int max_isr)
> -- 
> 2.41.0.rc2.161.g9c6817b8e7-goog
> 

B.R.
Yu


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-07  7:40   ` yu.c.zhang
@ 2023-06-07 14:30     ` Sean Christopherson
  2023-06-07 17:23       ` Yu Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-06-07 14:30 UTC (permalink / raw)
  To: yu.c.zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

On Wed, Jun 07, 2023, yu.c.zhang@linux.intel.com wrote:
> On Thu, Jun 01, 2023 at 06:15:16PM -0700, Sean Christopherson wrote:
> > +	pfn = gfn_to_pfn_memslot(slot, gfn);
> > +	if (is_error_noslot_pfn(pfn))
> > +		return;
> > +
> > +	read_lock(&vcpu->kvm->mmu_lock);
> > +	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> > +				     gfn_to_hva_memslot(slot, gfn))) {
> > +		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
> 
> Are the mmu_invalidate_retry_hva() and the following request meant to stall
> the vCPU if there's on going invalidation? 

Yep.

> If yes, may I ask how would the vCPU be stalled?
> 
> Below are my understandings and confusions about this process. I must have
> missed something, waiting to be educated... :) 
> 
> When the backing page of APIC access page is to be reclaimed:
> 1> kvm_mmu_notifier_invalidate_range_start() -> __kvm_handle_hva_range() will
> increase the kvm->mmu_invalidate_in_progress and account the start/end of this
> page in kvm_mmu_invalidate_begin().
> 2> And then kvm_unmap_gfn_range() will zap the TDP, and send the request,
> KVM_REQ_APIC_PAGE_RELOAD, to all vCPUs.
> 3> While a vCPU tries to reload the APIC access page before entering the guest,
> kvm->mmu_invalidate_in_progress will be checked in mmu_invalidate_retry_hva(),
> but it is possible(or is it?) that the kvm->mmu_invalidate_in_progess is still
> positive, so KVM_REQ_APIC_PAGE_RELOAD is set again. No reload, and no TLB flush.
> 4> So what if the vCPU resumes with KVM_REQ_APIC_PAGE_RELOAD & KVM_REQ_TLB_FLUSH
> flags being set, yet APIC access page is not reloaded and TLB is not flushed? Or,
> will this happen?

Pending requests block KVM from actually entering the guest.  If a request comes
in after vcpu_enter_guest()'s initial handling of requests, KVM will bail before
VM-Enter and go back through the entire "outer" run loop.

This isn't necessarily the most efficient way to handle the stall, e.g. KVM does
a fair bit of prep for VM-Enter before detecting the pending request.  The
alternative would be to have kvm_vcpu_reload_apic_access_page() return a value
instructing vcpu_enter_guest() whether to bail immediately or continue on.  I
elected for the re-request approach because (a) it didn't require redefining the
kvm_x86_ops vendor hook, (b) this should be a rare situation and not performance
critical overall, and (c) there's no guarantee that bailing immediately would
actually yield better performance from the guest's perspective, e.g. if there are
other pending requests/work, then KVM can handle those items while the vCPU is
stalled instead of waiting until the invalidation completes to proceed.

> One more dumb question - why does KVM not just pin the APIC access page?

Definitely not a dumb question, I asked myself the same thing multiple times when
looking at this :-)  Pinning the page would be easier, and KVM actually did that
in the original implementation.  The issue is in how KVM allocates the backing
page.  It's not a traditional kernel allocation, but is instead anonymous memory
allocated by way of vm_mmap(), i.e. for all intents and purposes it's a user
allocation.  That means the kernel expects it to be a regular movable page, e.g.
it's entirely possible the page (if it were pinned) could be the only page in a
2MiB chunk preventing the kernel from migrating/compacting and creating a hugepage.

In hindsight, I'm not entirely convinced that unpinning the page was the right
choice, as it resulted in a handful of nasty bugs.  But, now that we've fixed all
those bugs (knock wood), there's no good argument for undoing all of that work.
Because while the code is subtle and requires hooks in a few paths, it's not *that*
complex and for the most part doesn't require active maintenance.
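
[Editor's note: to make the compaction concern concrete, here is a toy
userspace model. It is purely illustrative - the names and constants are
invented, and this is not kernel code.]

```c
#include <stdbool.h>

#define PAGES_PER_2M 512

/* Toy model: a 2MiB range can be collapsed into a hugepage (or migrated
 * for compaction) only if every constituent 4KiB page is movable, i.e.
 * not pinned. */
static bool can_collapse(const bool pinned[PAGES_PER_2M])
{
	for (int i = 0; i < PAGES_PER_2M; i++)
		if (pinned[i])
			return false;
	return true;
}
```

A pinned APIC-access backing page would play the role of the single pinned
entry here: one immovable 4KiB page in an otherwise movable 2MiB chunk is
enough to defeat compaction of the whole chunk.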

static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	if (kvm_request_pending(vcpu)) {  <= check if any request is pending

		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
			kvm_vcpu_reload_apic_access_page(vcpu); <= re-requests APIC_PAGE_RELOAD

                ...
	}

	...

	preempt_disable();

	static_call(kvm_x86_prepare_switch_to_guest)(vcpu);

        <host => guest bookkeeping>

	if (kvm_vcpu_exit_request(vcpu)) {  <= detects the new pending request
		vcpu->mode = OUTSIDE_GUEST_MODE;
		smp_wmb();
		local_irq_enable();
		preempt_enable();
		kvm_vcpu_srcu_read_lock(vcpu);
		r = 1;
		goto cancel_injection;  <= bails from actually entering the guest
	}

	if (req_immediate_exit) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		static_call(kvm_x86_request_immediate_exit)(vcpu);
	}

	for (;;) {
		<inner run / VM-Enter loop>
	}

	<VM-Exit path>

	r = static_call(kvm_x86_handle_exit)(vcpu, exit_fastpath);
	return r;

cancel_injection:
	if (req_immediate_exit)
		kvm_make_request(KVM_REQ_EVENT, vcpu);
	static_call(kvm_x86_cancel_injection)(vcpu);
	if (unlikely(vcpu->arch.apic_attention))
		kvm_lapic_sync_from_vapic(vcpu);
out:
	return r;
}
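
[Editor's note: the invalidation-tracking side referenced above
(kvm_mmu_invalidate_begin/end and mmu_invalidate_retry_hva) can be modeled
in a few lines of userspace C. This is an illustrative sketch with invented
names; the real kernel code additionally relies on smp_rmb()/smp_wmb()
pairing and mmu_lock, which are elided here.]

```c
#include <stdbool.h>

/* Toy model of KVM's mmu_notifier invalidation bookkeeping. */
struct mm_model {
	unsigned long invalidate_seq;	/* bumped by invalidate_end() */
	int invalidate_in_progress;	/* elevated between begin/end */
	unsigned long range_start, range_end;
};

static void invalidate_begin(struct mm_model *mm,
			     unsigned long start, unsigned long end)
{
	mm->invalidate_in_progress++;
	mm->range_start = start;
	mm->range_end = end;
}

static void invalidate_end(struct mm_model *mm)
{
	mm->invalidate_seq++;
	mm->invalidate_in_progress--;
}

/* Mirrors the shape of mmu_invalidate_retry_hva(): the caller snapshots
 * invalidate_seq before the pfn lookup, then must retry if an invalidation
 * covering @hva is in flight, or if any invalidation completed after the
 * snapshot was taken. */
static bool invalidate_retry(const struct mm_model *mm,
			     unsigned long seq, unsigned long hva)
{
	if (mm->invalidate_in_progress &&
	    hva >= mm->range_start && hva < mm->range_end)
		return true;
	return mm->invalidate_seq != seq;
}
```

A consumer that snapshots the sequence count, looks up the pfn, and then
sees invalidate_retry() return true must discard the pfn and try again -
which is exactly what re-requesting KVM_REQ_APIC_PAGE_RELOAD accomplishes,
with the outer run loop providing the "try again".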


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-07 14:30     ` Sean Christopherson
@ 2023-06-07 17:23       ` Yu Zhang
  2023-06-07 17:53         ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Zhang @ 2023-06-07 17:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

> Pending requests block KVM from actually entering the guest.  If a request comes
> in after vcpu_enter_guest()'s initial handling of requests, KVM will bail before
> VM-Enter and go back through the entire "outer" run loop.
> 
> This isn't necessarily the most efficient way to handle the stall, e.g. KVM does
> a fair bit of prep for VM-Enter before detecting the pending request.  The
> alternative would be to have kvm_vcpu_reload_apic_access_page() return value
> instructing vcpu_enter_guest() whether to bail immediately or continue on.  I
> elected for the re-request approach because (a) it didn't require redefining the
> kvm_x86_ops vendor hook, (b) this should be a rare situation and not performance
> critical overall, and (c) there's no guarantee that bailing immediately would
> actually yield better performance from the guest's perspective, e.g. if there are
> other pending requests/work, then the KVM can handle those items while the vCPU
> is stalled instead of waiting until the invalidation completes to proceed.
> 

Wah! Thank you so much! Especially for the code snippets below! :)

> > One more dumb question - why does KVM not just pin the APIC access page?
> 
> Definitely not a dumb question, I asked myself the same thing multiple times when
> looking at this :-)  Pinning the page would be easier, and KVM actually did that
> in the original implementation.  The issue is in how KVM allocates the backing
> page.  It's not a traditional kernel allocation, but is instead anonymous memory
> allocated by way of vm_mmap(), i.e. for all intents and purposes it's a user
> allocation.  That means the kernel expects it to be a regular movable page, e.g.
> it's entirely possible the page (if it were pinned) could be the only page in a
> 2MiB chunk preventing the kernel from migrating/compacting and creating a hugepage.
> 
> In hindsight, I'm not entirely convinced that unpinning the page was the right
> choice, as it resulted in a handful of nasty bugs.  But, now that we've fixed all
> those bugs (knock wood), there's no good argument for undoing all of that work.
> Because while the code is subtle and requires hooks in a few paths, it's not *that*
> complex and for the most part doesn't require active maintenance.
> 

Thanks again! One more thing that bothers me when reading the mmu notifier
code is the TLB flush request. After the APIC access page is reloaded, the
TLB will be flushed (a single-context EPT invalidation on not-so-outdated
CPUs) in vmx_set_apic_access_page_addr(). But the mmu notifier will send the
KVM_REQ_TLB_FLUSH as well, by kvm_mmu_notifier_invalidate_range_start() ->
__kvm_handle_hva_range(), therefore causing the vCPU to trigger another TLB
flush - normally a global EPT invalidation I guess.

But, is this necessary?

Could we try to return false in kvm_unmap_gfn_range() to indicate no more
flush is needed, if the range to be unmapped falls within guest APIC base,
and leaving the TLB invalidation work to vmx_set_apic_access_page_addr()?

But there are multiple places in vmx_set_apic_access_page_addr() that return
early (e.g., if xAPIC mode is disabled for this vCPU) without triggering a TLB
flush, so I am not sure whether doing so would cause more problems... Any comment?
Thanks!

B.R.
Yu



* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-07 17:23       ` Yu Zhang
@ 2023-06-07 17:53         ` Sean Christopherson
  2023-06-08  7:00           ` Yu Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-06-07 17:53 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

On Thu, Jun 08, 2023, Yu Zhang wrote:
> Thanks again! One more thing that bothers me when reading the mmu notifier,
> is about the TLB flush request. After the APIC access page is reloaded, the
> TLB will be flushed (a single-context EPT invalidation on not-so-outdated
> CPUs) in vmx_set_apic_access_page_addr(). But the mmu notifier will send the
> KVM_REQ_TLB_FLUSH as well, by kvm_mmu_notifier_invalidate_range_start() ->
> __kvm_handle_hva_range(), therefore causing the vCPU to trigger another TLB
> flush - normally a global EPT invalidation I guess.

Yes.

> But, is this necessary?

Flushing when KVM zaps SPTEs is definitely necessary.  But the flush in
vmx_set_apic_access_page_addr() *should* be redundant.

> Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> flush is needed, if the range to be unmapped falls within guest APIC base,
> and leaving the TLB invalidation work to vmx_set_apic_access_page_addr()?

No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
only the current root, i.e. on the current EP4TA.  kvm_unmap_gfn_range() isn't
tied to a single vCPU and so needs to flush all roots.  We could in theory more
precisely track which roots need to be flushed, but in practice it's highly
unlikely to matter as there is typically only one "main" root when TDP (EPT) is
in use.  In other words, KVM could avoid unnecessarily flushing entries for other
roots, but it would incur non-trivial complexity, and the probability of the
precise flushing having a measurable impact on guest performance is quite low, at
least outside of nested scenarios.

But as above, flushing in vmx_set_apic_access_page_addr() shouldn't be necessary.
If there were SPTEs, then KVM would already have zapped and flushed.  If there
weren't SPTEs, then it should have been impossible for the guest to have valid
TLB entries.  KVM needs to flush when VIRTUALIZE_APIC_ACCESSES is toggled on, as
the CPU could have non-vAPIC TLB entries, but that's handled by vmx_set_virtual_apic_mode().

I'll send a follow-up patch to drop the flush from vmx_set_apic_access_page_addr(),
I don't *think* I'm missing an edge case...
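
[Editor's note: the distinction drawn above between a current-root flush and
a full flush can be captured with a toy model. This is a purely illustrative
sketch; the names and per-root arrays are invented, not kernel code.]

```c
#include <stdbool.h>
#include <string.h>

#define NROOTS 4
#define NPAGES 16

/* Toy model of per-root TLBs: each root (EP4TA) caches its own
 * translations, so zapping a mapping reachable from several roots
 * requires flushing every one of them. */
static bool tlb[NROOTS][NPAGES];

static void tlb_fill(int root, int page)
{
	tlb[root][page] = true;
}

/* Analogous to KVM_REQ_TLB_FLUSH_CURRENT: flushes one root only. */
static void flush_current(int root)
{
	memset(tlb[root], 0, sizeof(tlb[root]));
}

/* Analogous to a full flush of all roots. */
static void flush_all(void)
{
	memset(tlb, 0, sizeof(tlb));
}

static bool stale(int root, int page)
{
	return tlb[root][page];
}
```

Returning false from kvm_unmap_gfn_range() and relying on the per-vCPU
vmx_flush_tlb_current() would correspond to calling only flush_current()
above - leaving stale entries behind in every other root.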


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-07 17:53         ` Sean Christopherson
@ 2023-06-08  7:00           ` Yu Zhang
  2023-06-13 19:07             ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Zhang @ 2023-06-08  7:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

> 
> Flushing when KVM zaps SPTEs is definitely necessary.  But the flush in
> vmx_set_apic_access_page_addr() *should* be redundant.
> 
> > Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> > flush is needed, if the range to be unmapped falls within guest APIC base,
> > and leaving the TLB invalidation work to vmx_set_apic_access_page_addr()?
> 
> No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
> only the current root, i.e. on the current EP4TA.  kvm_unmap_gfn_range() isn't
> tied to a single vCPU and so needs to flush all roots.  We could in theory more
> precisely track which roots needs to be flushed, but in practice it's highly
> unlikely to matter as there is typically only one "main" root when TDP (EPT) is
> in use.  In other words, KVM could avoid unnecessarily flushing entries for other
> roots, but it would incur non-trivial complexity, and the probability of the
> precise flushing having a measurable impact on guest performance is quite low, at
> least outside of nested scenarios.

Well, I can understand that the invalidation must be performed for both the
current EP4TA and the nested EP4TA (EPT02) when the host tries to reclaim a
normal page, because L1 may assign this page to L2. But for the APIC base
address, will L1 map this address into L2?

Also, what if virtualized APIC accesses are to be supported in L2, and the
backing page is being reclaimed in L0? I saw that nested_get_vmcs12_pages()
will check vmcs12 and set the APIC-access address in VMCS02, but I am not
sure whether this routine will be triggered by the mmu notifier...

B.R.
Yu

> 
> But as above, flushing in vmx_set_apic_access_page_addr() shouldn't be necessary.
> If there were SPTEs, then KVM would already have zapped and flushed.  If there
> weren't SPTEs, then it should have been impossible for the guest to have valid
> TLB entries.  KVM needs to flush when VIRTUALIZE_APIC_ACCESSES is toggled on, as
> the CPU could have non-vAPIC TLB entries, but that's handled by vmx_set_virtual_apic_mode().
> 
> I'll send a follow-up patch to drop the flush from vmx_set_apic_access_page_addr(),
> I don't *think* I'm missing an edge case...


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-08  7:00           ` Yu Zhang
@ 2023-06-13 19:07             ` Sean Christopherson
  2023-06-17  3:45               ` Yu Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2023-06-13 19:07 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

On Thu, Jun 08, 2023, Yu Zhang wrote:
> > 
> > Flushing when KVM zaps SPTEs is definitely necessary.  But the flush in
> > vmx_set_apic_access_page_addr() *should* be redundant.
> > 
> > > Could we try to return false in kvm_unmap_gfn_range() to indicate no more
> > > flush is needed, if the range to be unmapped falls within guest APIC base,
> > > and leaving the TLB invalidation work to vmx_set_apic_access_page_addr()?
> > 
> > No, because vmx_flush_tlb_current(), a.k.a. KVM_REQ_TLB_FLUSH_CURRENT, flushes
> > only the current root, i.e. on the current EP4TA.  kvm_unmap_gfn_range() isn't
> > tied to a single vCPU and so needs to flush all roots.  We could in theory more
> > precisely track which roots needs to be flushed, but in practice it's highly
> > unlikely to matter as there is typically only one "main" root when TDP (EPT) is
> > in use.  In other words, KVM could avoid unnecessarily flushing entries for other
> > roots, but it would incur non-trivial complexity, and the probability of the
> > precise flushing having a measurable impact on guest performance is quite low, at
> > least outside of nested scenarios.
> 
> Well, I can understand the invalidation shall be performed for both current EP4TA,
> and the nested EP4TA(EPT02) when host retries to reclaim a normal page, because L1
> may assign this page to L2. But for APIC base address, will L1 map this address to
> L2?

L1 can do whatever it wants.  E.g. L1 could passthrough its APIC to L2, in which
case, yes, L1 will map its APIC base into L2.  KVM (as L0) however doesn't support
mapping the APIC-access page into L2.  KVM *could* support utilizing APICv to
accelerate L2 when L1 has done a full APIC passthrough, but AFAIK no one has
requested such support.  Functionally, an APIC passthrough setup for L1=>L2 will
work, but KVM will trap and emulate APIC accesses from L2 instead of utilizing
hardware acceleration.

More commonly, L1 will use APICv for L2 and thus have an APIC-access page for L2,
and KVM will map _that_ page into L2.

> Also, what if the virtualize APIC access is to be supported in L2,

As above, KVM never maps the APIC-access page that KVM (as L0) manages into L2.

> and the backing page is being reclaimed in L0? I saw
> nested_get_vmcs12_pages() will check vmcs12 and set the APIC access address
> in VMCS02, but not sure if this routine will be triggered by the mmu
> notifier...

Pages from vmcs12 that are referenced by physical address in the VMCS are pinned
(where "pinned" means KVM holds a reference to the page) by kvm_vcpu_map().  I.e.
the page will not be migrated, and if userspace unmaps the page, userspace might
break its VM, but that's true for any guest memory that userspace unexpectedly
unmaps, and there won't be any use-after-free issues.


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-13 19:07             ` Sean Christopherson
@ 2023-06-17  3:45               ` Yu Zhang
  2023-06-22 23:02                 ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Zhang @ 2023-06-17  3:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

> 
> > and the backing page is being reclaimed in L0? I saw
> > nested_get_vmcs12_pages() will check vmcs12 and set the APIC access address
> > in VMCS02, but not sure if this routine will be triggered by the mmu
> > notifier...
> 
> Pages from vmcs12 that are referenced by physical address in the VMCS are pinned
> (where "pinned" means KVM holds a reference to the page) by kvm_vcpu_map().  I.e.
> the page will not be migrated, and if userspace unmaps the page, userspace might
> break its VM, but that's true for any guest memory that userspace unexpectedly
> unmaps, and there won't be any no use-after-free issues.
> 
Thanks, Sean. 

About kvm_vcpu_map() - is it necessary for the APIC-access address? L0 only
needs to get its pfn, and does not care about the hva or struct page. Could
we just use gfn_to_pfn() to retrieve the pfn, and kvm_release_pfn_clean() to
unpin it later?

B.R.
Yu


* Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
  2023-06-17  3:45               ` Yu Zhang
@ 2023-06-22 23:02                 ` Sean Christopherson
  0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2023-06-22 23:02 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, Jason Gunthorpe,
	Alistair Popple, Robin Murphy

On Sat, Jun 17, 2023, Yu Zhang wrote:
> > 
> > > and the backing page is being reclaimed in L0? I saw
> > > nested_get_vmcs12_pages() will check vmcs12 and set the APIC access address
> > > in VMCS02, but not sure if this routine will be triggered by the mmu
> > > notifier...
> > 
> > Pages from vmcs12 that are referenced by physical address in the VMCS are pinned
> > (where "pinned" means KVM holds a reference to the page) by kvm_vcpu_map().  I.e.
> > the page will not be migrated, and if userspace unmaps the page, userspace might
> > break its VM, but that's true for any guest memory that userspace unexpectedly
> > unmaps, and there won't be any no use-after-free issues.
> > 
> Thanks, Sean. 
> 
> About the kvm_vcpu_map(), is it necessary for APIC access address?

No, not strictly necessary.

> L0 only needs to get its pfn, and does not care about the hva or struct page. Could
> we just use gfn_to_pfn() to retrieve the pfn, and kvm_release_pfn_clean() to
> unpin it later? 

Yep, that would suffice.  The main reason I converted the APIC access page to use
kvm_vcpu_map()[1] was that it was the easiest way to play nice with struct page memory.

I don't think this is worth doing right now, as the practical downside is really
just that setups that hide memory from the kernel do an unnecessary memremap().
I'd much prefer to "fix" this wart when we (hopefully) overhaul all these APIs[2].

[1] fe1911aa443e ("KVM: nVMX: Use kvm_vcpu_map() to get/pin vmcs12's APIC-access page")
[2] https://lore.kernel.org/all/ZGvUsf7lMkrNDHuE@google.com


end of thread, other threads:[~2023-06-22 23:02 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-02  1:15 [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Sean Christopherson
2023-06-02  1:15 ` [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Sean Christopherson
2023-06-06  2:11   ` Alistair Popple
2023-06-06 17:22     ` Sean Christopherson
2023-06-06 17:39       ` Jason Gunthorpe
2023-06-07  7:40   ` yu.c.zhang
2023-06-07 14:30     ` Sean Christopherson
2023-06-07 17:23       ` Yu Zhang
2023-06-07 17:53         ` Sean Christopherson
2023-06-08  7:00           ` Yu Zhang
2023-06-13 19:07             ` Sean Christopherson
2023-06-17  3:45               ` Yu Zhang
2023-06-22 23:02                 ` Sean Christopherson
2023-06-02  1:15 ` [PATCH 2/3] KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Sean Christopherson
2023-06-06  2:20   ` Alistair Popple
2023-06-02  1:15 ` [PATCH 3/3] KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares Sean Christopherson
2023-06-05 10:15 ` [PATCH 0/3] KVM: x86: Use "standard" mmu_notifier hook for APIC page Paolo Bonzini
2023-06-06 16:43 ` Jason Gunthorpe
2023-06-07  0:55 ` Sean Christopherson
