linux-kernel.vger.kernel.org archive mirror
* [PATCH] KVM: X86: fix tlb_flush_guest()
@ 2021-05-27  2:39 Lai Jiangshan
  2021-05-27 12:55 ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Lai Jiangshan @ 2021-05-27  2:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Radim Krčmář,
	kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
the hypervisor to do an operation equivalent to native_flush_tlb_global()
or invpcid_flush_all() on the specified guest CPU.

When TDP is enabled, it is sufficient to just flush the hardware
TLB of the specified guest CPU.

But when using shadow paging, the hypervisor has to sync the
shadow pagetable first, before flushing the hardware TLB, so that
it can truly emulate the operation of invpcid_flush_all() in the guest.

The problem has existed since the first implementation of KVM_VCPU_FLUSH_TLB
in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
shootdowns").  But I don't think it was a real-world problem at that time,
since the guest flushed the local CPU's TLB first, before queuing
KVM_VCPU_FLUSH_TLB to other CPUs.  That means the hypervisor synced the
shadow pagetable before seeing the corresponding KVM_VCPU_FLUSH_TLBs.

After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
concurrently"), the guest no longer flushes the local CPU's TLB first, so
the hypervisor can handle another VCPU's KVM_VCPU_FLUSH_TLB earlier than
the local VCPU's TLB flush and might flush the hardware TLB without syncing
the shadow pagetable beforehand.
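
To illustrate the ordering problem, here is a rough sketch of the two flows
involved (simplified pseudocode; the *_sketch helpers are made up for
illustration and are not the real guest/host functions):

    /* Guest side, sketch of kvm_flush_tlb_multi() after 4ce94eabac16: */
    static void guest_flush_tlb_multi_sketch(const struct cpumask *cpumask)
    {
            int cpu;

            /* Preempted remote vCPUs just get the paravirt hint ... */
            for_each_cpu(cpu, cpumask) {
                    if (vcpu_is_preempted_sketch(cpu))
                            set_steal_time_flag_sketch(cpu, KVM_VCPU_FLUSH_TLB);
            }
            /* ... and the local TLB flush is no longer done before this point. */
    }

    /* Host side, sketch of record_steal_time() consuming the hint: */
    static void host_record_steal_time_sketch(struct kvm_vcpu *vcpu)
    {
            if (steal_time_flags_sketch(vcpu) & KVM_VCPU_FLUSH_TLB)
                    kvm_vcpu_flush_tlb_guest(vcpu); /* hardware TLB only, no shadow sync */
    }

The hint can therefore be consumed while the shadow pagetable is still stale.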

Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
 arch/x86/kvm/vmx/vmx.c |  8 +++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 05eca131eaf2..f4523c859245 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3575,6 +3575,20 @@ void svm_flush_tlb(struct kvm_vcpu *vcpu)
 		svm->current_vmcb->asid_generation--;
 }
 
+static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * When NPT is enabled, just flush the ASID.
+	 *
+	 * When NPT is not enabled, the operation should be equal to
+	 * native_flush_tlb_global(), invpcid_flush_all() in guest.
+	 */
+	if (npt_enabled)
+		svm_flush_tlb(vcpu);
+	else
+		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+}
+
 static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4486,7 +4500,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.tlb_flush_all = svm_flush_tlb,
 	.tlb_flush_current = svm_flush_tlb,
 	.tlb_flush_gva = svm_flush_tlb_gva,
-	.tlb_flush_guest = svm_flush_tlb,
+	.tlb_flush_guest = svm_flush_tlb_guest,
 
 	.run = svm_vcpu_run,
 	.handle_exit = handle_exit,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4bceb5ca3a89..1913504e3472 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3049,8 +3049,14 @@ static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	 * are required to flush GVA->{G,H}PA mappings from the TLB if vpid is
 	 * disabled (VM-Enter with vpid enabled and vpid==0 is disallowed),
 	 * i.e. no explicit INVVPID is necessary.
+	 *
+	 * When EPT is not enabled, the operation should be equal to
+	 * native_flush_tlb_global(), invpcid_flush_all() in guest.
 	 */
-	vpid_sync_context(to_vmx(vcpu)->vpid);
+	if (enable_ept)
+		vpid_sync_context(to_vmx(vcpu)->vpid);
+	else
+		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
 }
 
 void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27  2:39 [PATCH] KVM: X86: fix tlb_flush_guest() Lai Jiangshan
@ 2021-05-27 12:55 ` Paolo Bonzini
  2021-05-27 16:13   ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2021-05-27 12:55 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Lai Jiangshan, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Radim Krčmář,
	kvm

On 27/05/21 04:39, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> the hypervisor do the operation that equals to native_flush_tlb_global()
> or invpcid_flush_all() in the specified guest CPU.
> 
> When TDP is enabled, there is no problem to just flush the hardware
> TLB of the specified guest CPU.
> 
> But when using shadowpaging, the hypervisor should have to sync the
> shadow pagetable at first before flushing the hardware TLB so that
> it can truely emulate the operation of invpcid_flush_all() in guest.

Can you explain why?

Also, it is simpler to handle this in kvm_vcpu_flush_tlb_guest(), using "if 
(tdp_enabled)".  This also provides a single, good place to add a comment 
with the explanation of what invalid entries KVM_REQ_MMU_RELOAD is preventing.

Paolo

> The problem exists since the first implementation of KVM_VCPU_FLUSH_TLB
> in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
> shootdowns").  But I don't think it would be a real world problem that
> time since the local CPU's tlb is flushed at first in guest before queuing
> KVM_VCPU_FLUSH_TLB to other CPUs.  It means that the hypervisor syncs the
> shadow pagetable before seeing the corresponding KVM_VCPU_FLUSH_TLBs.
> 
> After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
> concurrently"), the guest doesn't flush local CPU's tlb at first and
> the hypervisor can handle other VCPU's KVM_VCPU_FLUSH_TLB earlier than
> local VCPU's tlb flush and might flush the hardware tlb without syncing
> the shadow pagetable beforehand.
> 
> Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
>   arch/x86/kvm/vmx/vmx.c |  8 +++++++-
>   2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 05eca131eaf2..f4523c859245 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3575,6 +3575,20 @@ void svm_flush_tlb(struct kvm_vcpu *vcpu)
>   		svm->current_vmcb->asid_generation--;
>   }
>   
> +static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * When NPT is enabled, just flush the ASID.
> +	 *
> +	 * When NPT is not enabled, the operation should be equal to
> +	 * native_flush_tlb_global(), invpcid_flush_all() in guest.
> +	 */
> +	if (npt_enabled)
> +		svm_flush_tlb(vcpu);
> +	else
> +		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +}
> +
>   static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>   {
>   	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4486,7 +4500,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>   	.tlb_flush_all = svm_flush_tlb,
>   	.tlb_flush_current = svm_flush_tlb,
>   	.tlb_flush_gva = svm_flush_tlb_gva,
> -	.tlb_flush_guest = svm_flush_tlb,
> +	.tlb_flush_guest = svm_flush_tlb_guest,
>   
>   	.run = svm_vcpu_run,
>   	.handle_exit = handle_exit,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4bceb5ca3a89..1913504e3472 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3049,8 +3049,14 @@ static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   	 * are required to flush GVA->{G,H}PA mappings from the TLB if vpid is
>   	 * disabled (VM-Enter with vpid enabled and vpid==0 is disallowed),
>   	 * i.e. no explicit INVVPID is necessary.
> +	 *
> +	 * When EPT is not enabled, the operation should be equal to
> +	 * native_flush_tlb_global(), invpcid_flush_all() in guest.
>   	 */
> -	vpid_sync_context(to_vmx(vcpu)->vpid);
> +	if (enable_ept)
> +		vpid_sync_context(to_vmx(vcpu)->vpid);
> +	else
> +		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>   }
>   
>   void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27 12:55 ` Paolo Bonzini
@ 2021-05-27 16:13   ` Sean Christopherson
  2021-05-27 16:14     ` Sean Christopherson
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-05-27 16:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Radim Krčmář,
	kvm

+Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
         issue; this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
         flushing.

On Thu, May 27, 2021, Paolo Bonzini wrote:
> On 27/05/21 04:39, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > 
> > For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> > the hypervisor do the operation that equals to native_flush_tlb_global()
> > or invpcid_flush_all() in the specified guest CPU.
> > 
> > When TDP is enabled, there is no problem to just flush the hardware
> > TLB of the specified guest CPU.
> > 
> > But when using shadowpaging, the hypervisor should have to sync the
> > shadow pagetable at first before flushing the hardware TLB so that
> > it can truely emulate the operation of invpcid_flush_all() in guest.
> 
> Can you explain why?

KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
that require a TLB flush to take effect, e.g. making a writable page read-only,
KVM waits until the guest explicitly does said flush to propagate the changes to
the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
page being writable when the guest expects it to be read-only.
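
A rough sketch of that flow (illustrative only, heavily simplified from the
real mmu code; the function names and flush_hw_tlb_sketch() are placeholders):

    /* Guest writes a PTE: KVM may leave the shadow page unsynchronized. */
    static void on_guest_pte_write_sketch(struct kvm_mmu_page *sp)
    {
            sp->unsync = true;      /* shadow PTEs now stale vs. guest PTEs */
    }

    /* Guest TLB flush: KVM must resync before the flush "completes". */
    static void on_guest_tlb_flush_sketch(struct kvm_vcpu *vcpu)
    {
            kvm_mmu_sync_roots(vcpu);       /* propagate guest PTE changes */
            flush_hw_tlb_sketch(vcpu);      /* e.g. INVVPID or a new ASID */
    }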

> Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
> (tdp_enabled).  This provides also a single, good place to add a comment
> with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.

Ya.  

KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
offset the performance gains of the paravirtualized flush.

And making a request won't work without revamping the order of request handling
in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
serviced before KVM_REQ_STEAL_UPDATE.
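
Roughly, heavily trimmed from vcpu_enter_guest() (the exact code has many
more requests in between):

    if (kvm_request_pending(vcpu)) {
            if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
                    kvm_mmu_unload(vcpu);
            ...
            if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
                    kvm_mmu_sync_roots(vcpu);
            ...
            if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
                    record_steal_time(vcpu); /* raises MMU requests too late */
            ...
    }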

Cleaning up and documenting the MMU related requests is on my todo list, but the
immediate fix should be tiny and I can do my cleanups on top.

I believe the minimal fix is:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 81ab3b8f22e5..b0072063f9bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
 static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
        ++vcpu->stat.tlb_flush;
+
+       if (!tdp_enabled)
+               kvm_mmu_sync_roots(vcpu);
        static_call(kvm_x86_tlb_flush_guest)(vcpu);
 }
 


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27 16:13   ` Sean Christopherson
@ 2021-05-27 16:14     ` Sean Christopherson
  2021-05-27 19:28       ` Sean Christopherson
  2021-05-28  0:18     ` Lai Jiangshan
  2021-05-29 22:12     ` Maxim Levitsky
  2 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-05-27 16:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Maxim Levitsky, kvm

+Maxim for real this time...

On Thu, May 27, 2021, Sean Christopherson wrote:
> +Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
>          issue, this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
> 	 flushing.
> 
> On Thu, May 27, 2021, Paolo Bonzini wrote:
> > On 27/05/21 04:39, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> > > the hypervisor do the operation that equals to native_flush_tlb_global()
> > > or invpcid_flush_all() in the specified guest CPU.
> > > 
> > > When TDP is enabled, there is no problem to just flush the hardware
> > > TLB of the specified guest CPU.
> > > 
> > > But when using shadowpaging, the hypervisor should have to sync the
> > > shadow pagetable at first before flushing the hardware TLB so that
> > > it can truely emulate the operation of invpcid_flush_all() in guest.
> > 
> > Can you explain why?
> 
> KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
> that require a TLB flush to take effect, e.g. making a writable page read-only,
> KVM waits until the guest explicitly does said flush to propagate the changes to
> the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
> page being writable when the guest expects it to be read-only.
> 
> > Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
> > (tdp_enabled).  This provides also a single, good place to add a comment
> > with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.
> 
> Ya.  
> 
> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> offset the performance gains of the paravirtualized flush.
> 
> And making a request won't work without revamping the order of request handling
> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> serviced before KVM_REQ_STEAL_UPDATE.
> 
> Cleaning up and documenting the MMU related requests is on my todo list, but the
> immediate fix should be tiny and I can do my cleanups on top.
> 
> I believe the minimal fix is:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 81ab3b8f22e5..b0072063f9bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>         ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled)
> +               kvm_mmu_sync_roots(vcpu);
>         static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27 16:14     ` Sean Christopherson
@ 2021-05-27 19:28       ` Sean Christopherson
  2021-05-28  1:13         ` Lai Jiangshan
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-05-27 19:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Maxim Levitsky, kvm

On Thu, May 27, 2021, Sean Christopherson wrote:
> > KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> > offset the performance gains of the paravirtualized flush.

Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
between flushing a specific mm and flushing the entire TLB.  The HyperV usage
(via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
isn't necessary as KVM just needs to sync all roots, not blast them away.  For
previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
fix will need to unload those roots.

And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
to sync the MMU on the second flush even though the guest can technically rely
on the first MOV CR3 to have synchronized any previous changes relative to the
first MOV CR3.

Lai, if it's ok with you, I'll massage this patch as discussed and fold it into
a larger series to fix the other bugs and do additional cleanup/improvements.

> > I believe the minimal fix is:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 81ab3b8f22e5..b0072063f9bf 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
> >  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
> >  {
> >         ++vcpu->stat.tlb_flush;
> > +
> > +       if (!tdp_enabled)
> > +               kvm_mmu_sync_roots(vcpu);
> >         static_call(kvm_x86_tlb_flush_guest)(vcpu);
> >  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27 16:13   ` Sean Christopherson
  2021-05-27 16:14     ` Sean Christopherson
@ 2021-05-28  0:18     ` Lai Jiangshan
  2021-05-28  0:26       ` Sean Christopherson
  2021-05-29 22:12     ` Maxim Levitsky
  2 siblings, 1 reply; 21+ messages in thread
From: Lai Jiangshan @ 2021-05-28  0:18 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Radim Krčmář,
	kvm, Maxim Levitsky



On 2021/5/28 00:13, Sean Christopherson wrote:
> +Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
>           issue, this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
> 	 flushing.
> 
> On Thu, May 27, 2021, Paolo Bonzini wrote:
>> On 27/05/21 04:39, Lai Jiangshan wrote:
>>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>>
>>> For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
>>> the hypervisor do the operation that equals to native_flush_tlb_global()
>>> or invpcid_flush_all() in the specified guest CPU.
>>>
>>> When TDP is enabled, there is no problem to just flush the hardware
>>> TLB of the specified guest CPU.
>>>
>>> But when using shadowpaging, the hypervisor should have to sync the
>>> shadow pagetable at first before flushing the hardware TLB so that
>>> it can truely emulate the operation of invpcid_flush_all() in guest.
>>
>> Can you explain why?
> 
> KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
> that require a TLB flush to take effect, e.g. making a writable page read-only,
> KVM waits until the guest explicitly does said flush to propagate the changes to
> the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
> page being writable when the guest expects it to be read-only.
> 
>> Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
>> (tdp_enabled).  This provides also a single, good place to add a comment
>> with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.
> 
> Ya.
> 
> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> offset the performance gains of the paravirtualized flush.
>
> And making a request won't work without revamping the order of request handling
> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> serviced before KVM_REQ_STEAL_UPDATE.

Yes, it just fixes the said problem in the simplest way.
I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
(If the guest is not preempted, it will call invpcid_flush_all() and it will
be handled this way.)


The improvement code will come later, and will not be backported.
The proper way to flush the guest is to use the code in

https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
as:
+		kvm_mmu_sync_roots(vcpu);
+		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
+		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+			vcpu->arch.mmu->prev_roots[i].need_sync = true;

If the need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
roots in prev_roots.
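
i.e. something along these lines (an untested sketch of that alternative; the
exact roots_to_free mask would need checking):

    kvm_mmu_sync_roots(vcpu);
    /* No mechanism to defer the sync of prev_roots yet, so drop them. */
    kvm_mmu_free_roots(vcpu, vcpu->arch.mmu,
                       KVM_MMU_ROOTS_ALL & ~KVM_MMU_ROOT_CURRENT);
    static_call(kvm_x86_tlb_flush_guest)(vcpu);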



> 
> Cleaning up and documenting the MMU related requests is on my todo list, but the
> immediate fix should be tiny and I can do my cleanups on top.
> 
> I believe the minimal fix is:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 81ab3b8f22e5..b0072063f9bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   {
>          ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled)
> +               kvm_mmu_sync_roots(vcpu);

It doesn't handle prev_roots, which also need to be handled, as
shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).

>          static_call(kvm_x86_tlb_flush_guest)(vcpu);

For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
to make it consistent with other shadowpage code.

>   }
>   
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-28  0:18     ` Lai Jiangshan
@ 2021-05-28  0:26       ` Sean Christopherson
  2021-05-28  1:29         ` Lai Jiangshan
  2021-06-02  8:13         ` Lai Jiangshan
  0 siblings, 2 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-05-28  0:26 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Lai Jiangshan, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Radim Krčmář,
	kvm, Maxim Levitsky

On Fri, May 28, 2021, Lai Jiangshan wrote:
> 
> On 2021/5/28 00:13, Sean Christopherson wrote:
> > And making a request won't work without revamping the order of request handling
> > in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> > serviced before KVM_REQ_STEAL_UPDATE.
> 
> Yes, it just fixes the said problem in the simplest way.
> I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
> by this way)

The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
in vcpu_enter_guest() and so the reload request won't be recognized until the
next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
guaranteed to run between the invpcid code and VM-Enter.

> The improvement code will go later, and will not be backported.

I would argue that introducing a potential performance regression is in itself a
bug.  IMO, going straight to kvm_mmu_sync_roots() is not high risk.

> The proper way to flush guest is to use code in
> 
> https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
> as:
> +		kvm_mmu_sync_roots(vcpu);
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
> +		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> +			vcpu->arch.mmu->prev_roots[i].need_sync = true;
> 
> If need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
> to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
> roots in prev_roots.

I like the idea, I just haven't gotten around to reviewing that patch yet.

> > Cleaning up and documenting the MMU related requests is on my todo list, but the
> > immediate fix should be tiny and I can do my cleanups on top.
> > 
> > I believe the minimal fix is:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 81ab3b8f22e5..b0072063f9bf 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
> >   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
> >   {
> >          ++vcpu->stat.tlb_flush;
> > +
> > +       if (!tdp_enabled)
> > +               kvm_mmu_sync_roots(vcpu);
> 
> it doesn't handle prev_roots which are also needed as
> shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).

Ya, I belated realized this :-)

> >          static_call(kvm_x86_tlb_flush_guest)(vcpu);
> 
> For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
> to make it consistent with other shadowpage code.
> 
> >   }
> > 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27 19:28       ` Sean Christopherson
@ 2021-05-28  1:13         ` Lai Jiangshan
  2021-06-02 15:09           ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Lai Jiangshan @ 2021-05-28  1:13 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Maxim Levitsky, kvm



On 2021/5/28 03:28, Sean Christopherson wrote:
> On Thu, May 27, 2021, Sean Christopherson wrote:
>>> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
>>> offset the performance gains of the paravirtualized flush.
> 
> Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
> between flushing a specific mm and flushing the entire TLB.  The HyperV usage
> (via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
> isn't necessary as KVM just needs to sync all roots, not blast them away.  For
> previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
> fix will need to unload those roots.
> 
> And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
> broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
> flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
> to sync the MMU on the second flush even though the guest can technically rely
> on the first MOV CR3 to have synchronized any previous changes relative to the
> fisrt MOV CR3.

Could you elaborate on the problem please?
When would a MOV CR3 need to flush the entire TLB if PCID is enabled?

If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0, the instruction invalidates all TLB entries 
associated with the PCID specified in bits 11:0 of the instruction’s source operand except those for global pages. It 
also invalidates all entries in all paging-structure caches associated with that PCID. It is not required to invalidate 
entries in the TLBs and paging-structure caches that are associated with other PCIDs.

> 
> Lai, if it's ok with you, I'll massage this patch as discussed and fold it into
> a larger series to fix the other bugs and do additional cleanup/improvements.
> 
>>> I believe the minimal fix is:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 81ab3b8f22e5..b0072063f9bf 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>>>   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>>>   {
>>>          ++vcpu->stat.tlb_flush;
>>> +
>>> +       if (!tdp_enabled)
>>> +               kvm_mmu_sync_roots(vcpu);
>>>          static_call(kvm_x86_tlb_flush_guest)(vcpu);
>>>   }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-28  0:26       ` Sean Christopherson
@ 2021-05-28  1:29         ` Lai Jiangshan
  2021-06-02 15:01           ` Sean Christopherson
  2021-06-02  8:13         ` Lai Jiangshan
  1 sibling, 1 reply; 21+ messages in thread
From: Lai Jiangshan @ 2021-05-28  1:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Radim Krčmář,
	kvm, Maxim Levitsky



On 2021/5/28 08:26, Sean Christopherson wrote:
> On Fri, May 28, 2021, Lai Jiangshan wrote:
>>
>> On 2021/5/28 00:13, Sean Christopherson wrote:
>>> And making a request won't work without revamping the order of request handling
>>> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
>>> serviced before KVM_REQ_STEAL_UPDATE.
>>
>> Yes, it just fixes the said problem in the simplest way.
>> I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
>> (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
>> by this way)
> 
> The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
> in vcpu_enter_guest() and so the reload request won't be recognized until the
> next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
> guaranteed to run between the invcpid code and VM-Enter.

KVM will recheck the requests before VM-enter.
See kvm_vcpu_exit_request().
It won't lose any request; it just causes an additional iteration.
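
i.e. roughly (a trimmed sketch; the exact code may differ slightly):

    /* Late in vcpu_enter_guest(), after the requests were serviced: */
    if (kvm_vcpu_exit_request(vcpu)) {
            vcpu->mode = OUTSIDE_GUEST_MODE;
            ...
            r = 1;
            goto cancel_injection;  /* go back and service the new request */
    }

so a request raised by record_steal_time() is still noticed before VM-enter,
at the cost of one more trip through the request-handling code.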

> 
>> The improvement code will go later, and will not be backported.
> 
> I would argue that introducing a potential performance regression is in itself a
> bug.  IMO, going straight to kvm_mmu_sync_roots() is not high risk.

I think we can introduce a kvm_mmu_sync_roots_all() which syncs the current
root and handles prev_roots (marking them need_sync or dropping them) as a
cleanup/preparation patch, and then fix the problem.

Do we need to backport the cleanup/preparation patch, or just backport the
KVM_REQ_MMU_RELOAD approach?


> 
>> The proper way to flush guest is to use code in
>>
>> https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
>> as:
>> +		kvm_mmu_sync_roots(vcpu);
>> +		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
>> +		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>> +			vcpu->arch.mmu->prev_roots[i].need_sync = true;
>>
>> If need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
>> to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
>> roots in prev_roots.
> 
> I like the idea, I just haven't gotten around to reviewing that patch yet.
> 
>>> Cleaning up and documenting the MMU related requests is on my todo list, but the
>>> immediate fix should be tiny and I can do my cleanups on top.
>>>
>>> I believe the minimal fix is:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 81ab3b8f22e5..b0072063f9bf 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>>>    static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>>>    {
>>>           ++vcpu->stat.tlb_flush;
>>> +
>>> +       if (!tdp_enabled)
>>> +               kvm_mmu_sync_roots(vcpu);
>>
>> it doesn't handle prev_roots which are also needed as
>> shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> 
> Ya, I belated realized this :-)
> 
>>>           static_call(kvm_x86_tlb_flush_guest)(vcpu);
>>
>> For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
>> to make it consistent with other shadowpage code.

For !tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
to make it consistent with other shadowpage code.

>>
>>>    }
>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-27 16:13   ` Sean Christopherson
  2021-05-27 16:14     ` Sean Christopherson
  2021-05-28  0:18     ` Lai Jiangshan
@ 2021-05-29 22:12     ` Maxim Levitsky
  2021-05-31 17:22       ` [PATCH V2] " Lai Jiangshan
  2 siblings, 1 reply; 21+ messages in thread
From: Maxim Levitsky @ 2021-05-29 22:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm

On Thu, 2021-05-27 at 16:13 +0000, Sean Christopherson wrote:
> +Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
>          issue, this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
> 	 flushing.

Still crashes with this patch sadly (tested now on my AMD laptop).

This win10 boot bug without TDP paging is 100% reproducible,
although it crashes a bit differently each time: sometimes the VM reboots right
at the start of the boot, sometimes it just hangs forever,
sometimes the VM bluescreens (with various reasons).
This makes sense for random paging-related corruption though.
 
I'll look at this patch more carefully soon.
 
I also have another bug, which I put aside as well due to complexity.
It involves running hyperv itself nested and then migrating
the L1, which leads the hyperv VM to bluescreen sooner or later
(I don't remember anymore if that was both on Intel and AMD or only Intel,
but this didn’t involve any npt/ept disablement),
so I'll see if this patch helps with this bug as well.

Thanks,
	Best regards,
		Maxim Levitsky


> 
> On Thu, May 27, 2021, Paolo Bonzini wrote:
> > On 27/05/21 04:39, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> > > the hypervisor do the operation that equals to native_flush_tlb_global()
> > > or invpcid_flush_all() in the specified guest CPU.
> > > 
> > > When TDP is enabled, there is no problem to just flush the hardware
> > > TLB of the specified guest CPU.
> > > 
> > > But when using shadowpaging, the hypervisor should have to sync the
> > > shadow pagetable at first before flushing the hardware TLB so that
> > > it can truely emulate the operation of invpcid_flush_all() in guest.
> > 
> > Can you explain why?
> 
> KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
> that require a TLB flush to take effect, e.g. making a writable page read-only,
> KVM waits until the guest explicitly does said flush to propagate the changes to
> the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
> page being writable when the guest expects it to be read-only.
> 
> > Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
> > (tdp_enabled).  This provides also a single, good place to add a comment
> > with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.
> 
> Ya.  
> 
> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> offset the performance gains of the paravirtualized flush.
> 
> And making a request won't work without revamping the order of request handling
> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> serviced before KVM_REQ_STEAL_UPDATE.
> 
> Cleaning up and documenting the MMU related requests is on my todo list, but the
> immediate fix should be tiny and I can do my cleanups on top.
> 
> I believe the minimal fix is:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 81ab3b8f22e5..b0072063f9bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>         ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled)
> +               kvm_mmu_sync_roots(vcpu);
>         static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-05-29 22:12     ` Maxim Levitsky
@ 2021-05-31 17:22       ` Lai Jiangshan
  2021-06-02 15:39         ` Sean Christopherson
  2021-06-07 22:38         ` Maxim Levitsky
  0 siblings, 2 replies; 21+ messages in thread
From: Lai Jiangshan @ 2021-05-31 17:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Maxim Levitsky, Paolo Bonzini,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Radim Krčmář,
	kvm

From: Lai Jiangshan <laijs@linux.alibaba.com>

For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
the hypervisor to do an operation equivalent to native_flush_tlb_global()
or invpcid_flush_all() on the specified guest CPU.

When TDP is enabled, it is sufficient to just flush the hardware
TLB of the specified guest CPU.

But when using shadow paging, the hypervisor has to sync the
shadow pagetable first, before flushing the hardware TLB, so that
it can truly emulate the operation of invpcid_flush_all() in the guest.

The problem has existed since the first implementation of KVM_VCPU_FLUSH_TLB
in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
shootdowns").  But I don't think it was a real-world problem at that time,
since the guest flushed the local CPU's TLB first, before queuing
KVM_VCPU_FLUSH_TLB to other CPUs.  That means the hypervisor synced the
shadow pagetable before seeing the corresponding KVM_VCPU_FLUSH_TLBs.

After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
concurrently"), the guest no longer flushes the local CPU's TLB first, so
the hypervisor can handle another VCPU's KVM_VCPU_FLUSH_TLB earlier than
the local VCPU's TLB flush and might flush the hardware TLB without syncing
the shadow pagetable beforehand.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
Changed from V1:
	Use kvm_mmu_unload() instead of KVM_REQ_MMU_RELOAD to avoid
	causing an unneeded iteration of vcpu_enter_guest().

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbc4e04e67ad..27248e330767 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3072,6 +3072,22 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
 static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
 	++vcpu->stat.tlb_flush;
+
+	if (!tdp_enabled) {
+		/*
+		 * When two-dimensional paging is not enabled, the
+		 * operation should be equivalent to native_flush_tlb_global()
+		 * or invpcid_flush_all() on the guest's behalf, achieved by
+		 * synchronizing the shadow pagetable and then flushing.
+		 *
+		 * kvm_mmu_unload() results in a consequent kvm_mmu_load()
+		 * before entering the guest, which will do the required
+		 * pagetable synchronizing and TLB flushing.
+		 */
+		kvm_mmu_unload(vcpu);
+		return;
+	}
+
 	static_call(kvm_x86_tlb_flush_guest)(vcpu);
 }
 
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-28  0:26       ` Sean Christopherson
  2021-05-28  1:29         ` Lai Jiangshan
@ 2021-06-02  8:13         ` Lai Jiangshan
  1 sibling, 0 replies; 21+ messages in thread
From: Lai Jiangshan @ 2021-06-02  8:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Lai Jiangshan, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Radim Krčmář,
	kvm, Maxim Levitsky



On 2021/5/28 08:26, Sean Christopherson wrote:
> On Fri, May 28, 2021, Lai Jiangshan wrote:
>>
>> On 2021/5/28 00:13, Sean Christopherson wrote:
>>> And making a request won't work without revamping the order of request handling
>>> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
>>> serviced before KVM_REQ_STEAL_UPDATE.
>>
>> Yes, it just fixes the said problem in the simplest way.
>> I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
>> (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
>> by this way)
> 
> The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
> in vcpu_enter_guest() and so the reload request won't be recognized until the
> next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
> guaranteed to run between the invcpid code and VM-Enter.
> 
>> The improvement code will go later, and will not be backported.
> 
> I would argue that introducing a potential performance regression is in itself a
> bug.  IMO, going straight to kvm_mmu_sync_roots() is not high risk.

Hello, Sean

Patch V2 address all these concerns. And it uses the minimal fix as you
suggested in your previous reply (fix it directly in kvm_vcpu_flush_tlb_guest())

Could you have a review again please?

Thanks
Lai.

> 
>> The proper way to flush guest is to use code in
>>
>> https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
>> as:
>> +		kvm_mmu_sync_roots(vcpu);
>> +		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
>> +		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>> +			vcpu->arch.mmu->prev_roots[i].need_sync = true;
>>
>> If need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
>> to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
>> roots in prev_roots.
> 
> I like the idea, I just haven't gotten around to reviewing that patch yet.
> 
>>> Cleaning up and documenting the MMU related requests is on my todo list, but the
>>> immediate fix should be tiny and I can do my cleanups on top.
>>>
>>> I believe the minimal fix is:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 81ab3b8f22e5..b0072063f9bf 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>>>    static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>>>    {
>>>           ++vcpu->stat.tlb_flush;
>>> +
>>> +       if (!tdp_enabled)
>>> +               kvm_mmu_sync_roots(vcpu);
>>
>> it doesn't handle prev_roots which are also needed as
>> shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> 
> Ya, I belated realized this :-)
> 
>>>           static_call(kvm_x86_tlb_flush_guest)(vcpu);
>>
>> For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
>> to make it consistent with other shadowpage code.
>>
>>>    }
>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-28  1:29         ` Lai Jiangshan
@ 2021-06-02 15:01           ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-06-02 15:01 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Lai Jiangshan, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Radim Krčmář,
	kvm, Maxim Levitsky

On Fri, May 28, 2021, Lai Jiangshan wrote:
> 
> On 2021/5/28 08:26, Sean Christopherson wrote:
> > On Fri, May 28, 2021, Lai Jiangshan wrote:
> > > 
> > > On 2021/5/28 00:13, Sean Christopherson wrote:
> > > > And making a request won't work without revamping the order of request handling
> > > > in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> > > > serviced before KVM_REQ_STEAL_UPDATE.
> > > 
> > > Yes, it just fixes the said problem in the simplest way.
> > > I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> > > (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
> > > by this way)
> > 
> > The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
> > in vcpu_enter_guest() and so the reload request won't be recognized until the
> > next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
> > guaranteed to run between the invcpid code and VM-Enter.
> 
> Kvm will recheck the request before VM-enter.
> See kvm_vcpu_exit_request().

Ah, right, forgot requests are rechecked.  Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-05-28  1:13         ` Lai Jiangshan
@ 2021-06-02 15:09           ` Sean Christopherson
  2021-06-02 22:07             ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2021-06-02 15:09 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Lai Jiangshan, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Maxim Levitsky, kvm

On Fri, May 28, 2021, Lai Jiangshan wrote:
> 
> 
> On 2021/5/28 03:28, Sean Christopherson wrote:
> > On Thu, May 27, 2021, Sean Christopherson wrote:
> > > > KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> > > > offset the performance gains of the paravirtualized flush.
> > 
> > Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
> > between flushing a specific mm and flushing the entire TLB.  The HyperV usage
> > (via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
> > isn't necessary as KVM just needs to sync all roots, not blast them away.  For
> > previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
> > fix will need to unload those roots.
> > 
> > And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
> > broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
> > flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
> > to sync the MMU on the second flush even though the guest can technically rely
> > on the first MOV CR3 to have synchronized any previous changes relative to the
> > fisrt MOV CR3.
> 
> Could you elaborate the problem please?
> When can a MOV CR3 that needs to flush the entire TLB if PCID is enabled?

Scratch that, I was wrong.  The SDM explicitly states that other PCIDs don't
need to be flushed if CR4.PCIDE=1.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-05-31 17:22       ` [PATCH V2] " Lai Jiangshan
@ 2021-06-02 15:39         ` Sean Christopherson
  2021-06-07 22:38         ` Maxim Levitsky
  1 sibling, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-06-02 15:39 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Maxim Levitsky, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Radim Krčmář,
	kvm

On Tue, Jun 01, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> the hypervisor do the operation that equals to native_flush_tlb_global()
> or invpcid_flush_all() in the specified guest CPU.

I don't like referencing guest code, here and in the comment.  The paravirt
flush isn't limited to Linux guests, the existing kernel code might change, and
it doesn't directly explain KVM's responsibilities in response to a guest TLB
flush.

Something like:

  When using shadow paging, unload the guest MMU when emulating a guest TLB
  flush so that all roots are synchronized.  From the guest's perspective,
  flushing the TLB ensures any and all modifications to its PTEs will be
  recognized by the CPU.

  Note, unloading the MMU is overkill, but is done to mirror KVM's existing
  handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
  can be done to more precisely synchronize roots when servicing a guest
  TLB flush.
  
> When TDP is enabled, there is no problem to just flush the hardware
> TLB of the specified guest CPU.
> 
> But when using shadowpaging, the hypervisor should have to sync the
> shadow pagetable at first before flushing the hardware TLB so that
> it can truely emulate the operation of invpcid_flush_all() in guest.
> 
> The problem exists since the first implementation of KVM_VCPU_FLUSH_TLB
> in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
> shootdowns").  But I don't think it would be a real world problem that
> time since the local CPU's tlb is flushed at first in guest before queuing
> KVM_VCPU_FLUSH_TLB to other CPUs.  It means that the hypervisor syncs the
> shadow pagetable before seeing the corresponding KVM_VCPU_FLUSH_TLBs.
> 
> After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
> concurrently"), the guest doesn't flush local CPU's tlb at first and
> the hypervisor can handle other VCPU's KVM_VCPU_FLUSH_TLB earlier than
> local VCPU's tlb flush and might flush the hardware tlb without syncing
> the shadow pagetable beforehand.

Maybe reword the last two paragraphs to make it clear that a change in the Linux
kernel exposed the KVM bug?

  This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB, but
  was only recently exposed after Linux guests stopped flushing the local
  CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac16,
  "x86/mm/tlb: Flush remote and local TLBs concurrently").

> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
> Changed from V1
> 	Use kvm_mmu_unload() instead of KVM_REQ_MMU_RELOAD to avoid
> 	causing unneeded iteration of vcpu_enter_guest().
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bbc4e04e67ad..27248e330767 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,22 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>  	++vcpu->stat.tlb_flush;
> +
> +	if (!tdp_enabled) {
> +		/*
> +		 * When two dimensional paging is not enabled, the
> +		 * operation should equal to native_flush_tlb_global()
> +		 * or invpcid_flush_all() on the guest's behalf via
> +		 * synchronzing shadow pagetable and flushing.
> +		 *
> +		 * kvm_mmu_unload() results consequent kvm_mmu_load()
> +		 * before entering guest which will do the required
> +		 * pagetable synchronzing and TLB flushing.
> +		 */
> +		kvm_mmu_unload(vcpu);

I don't like the full unload, but I suppose the big hammer does make sense for a
backport since handle_invpcid() and toggling CR4.PGE already nuke everything.  :-/

> +		return;
> +	}
> +
>  	static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] KVM: X86: fix tlb_flush_guest()
  2021-06-02 15:09           ` Sean Christopherson
@ 2021-06-02 22:07             ` Sean Christopherson
  0 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-06-02 22:07 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, Lai Jiangshan, linux-kernel, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin,
	Maxim Levitsky, kvm

On Wed, Jun 02, 2021, Sean Christopherson wrote:
> On Fri, May 28, 2021, Lai Jiangshan wrote:
> > 
> > 
> > On 2021/5/28 03:28, Sean Christopherson wrote:
> > > On Thu, May 27, 2021, Sean Christopherson wrote:
> > > > > KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> > > > > offset the performance gains of the paravirtualized flush.
> > > 
> > > Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
> > > between flushing a specific mm and flushing the entire TLB.  The HyperV usage
> > > (via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
> > > isn't necessary as KVM just needs to sync all roots, not blast them away.  For
> > > previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
> > > fix will need to unload those roots.
> > > 
> > > And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
> > > broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
> > > flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
> > > to sync the MMU on the second flush even though the guest can technically rely
> > > on the first MOV CR3 to have synchronized any previous changes relative to the
> > > fisrt MOV CR3.
> > 
> > Could you elaborate the problem please?
> > When can a MOV CR3 that needs to flush the entire TLB if PCID is enabled?
> 
> Scratch that, I was wrong.  The SDM explicitly states that other PCIDs don't
> need to be flushed if CR4.PCIDE=1.

*sigh*

I was partially right.  If the guest does

  1: MOV    B, %rax
     MOV %rax, %cr3

  2: <modify PTEs in B>

  3: MOV    A, %rax
     MOV %rax, %cr3
 
  4: MOV    B, %rax
     BTS  $63, %rax
     MOV %rax, %cr3

where A and B are CR3 values with the same PCID, then KVM will fail to sync B at
step (4) due to PCID_NOFLUSH, even though the guest can technically rely on
its modifications at step (2) to become visible at step (3) when the PCID is
flushed on CR3 load.

So it's not a full TLB flush, rather a flush of the PCID, which can theoretically
impact previous CR3 values.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-05-31 17:22       ` [PATCH V2] " Lai Jiangshan
  2021-06-02 15:39         ` Sean Christopherson
@ 2021-06-07 22:38         ` Maxim Levitsky
  2021-06-08  0:03           ` Sean Christopherson
  1 sibling, 1 reply; 21+ messages in thread
From: Maxim Levitsky @ 2021-06-07 22:38 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel
  Cc: Lai Jiangshan, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

On Tue, 2021-06-01 at 01:22 +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> the hypervisor do the operation that equals to native_flush_tlb_global()
> or invpcid_flush_all() in the specified guest CPU.
> 
> When TDP is enabled, there is no problem to just flush the hardware
> TLB of the specified guest CPU.
> 
> But when using shadowpaging, the hypervisor should have to sync the
> shadow pagetable at first before flushing the hardware TLB so that
> it can truely emulate the operation of invpcid_flush_all() in guest.
> 
> The problem exists since the first implementation of KVM_VCPU_FLUSH_TLB
> in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
> shootdowns").  But I don't think it would be a real world problem that
> time since the local CPU's tlb is flushed at first in guest before queuing
> KVM_VCPU_FLUSH_TLB to other CPUs.  It means that the hypervisor syncs the
> shadow pagetable before seeing the corresponding KVM_VCPU_FLUSH_TLBs.
> 
> After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
> concurrently"), the guest doesn't flush local CPU's tlb at first and
> the hypervisor can handle other VCPU's KVM_VCPU_FLUSH_TLB earlier than
> local VCPU's tlb flush and might flush the hardware tlb without syncing
> the shadow pagetable beforehand.
> 
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
> Changed from V1
> 	Use kvm_mmu_unload() instead of KVM_REQ_MMU_RELOAD to avoid
> 	causing unneeded iteration of vcpu_enter_guest().
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bbc4e04e67ad..27248e330767 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,22 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>  	++vcpu->stat.tlb_flush;
> +
> +	if (!tdp_enabled) {
> +		/*
> +		 * When two dimensional paging is not enabled, the
> +		 * operation should equal to native_flush_tlb_global()
> +		 * or invpcid_flush_all() on the guest's behalf via
> +		 * synchronzing shadow pagetable and flushing.
> +		 *
> +		 * kvm_mmu_unload() results consequent kvm_mmu_load()
> +		 * before entering guest which will do the required
> +		 * pagetable synchronzing and TLB flushing.
> +		 */
> +		kvm_mmu_unload(vcpu);
> +		return;
> +	}
> +
>  	static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
Hi!
 
So this patch *does* fix the windows boot without TDP!
 
However it feels like either I was extremely unlucky or
something else was fixed recently since:
 
1. I am sure I did test the windows VM without any hyperv enlightenments
(I think with -hypervisor even). I always suspect the HV code to cause
trouble so I disable it.
 
2. When running with single vcpu, even with 'hv-tlbflush' windows doesn't 
use the PV TLB flush much. It uses it but rarely. 
 
As I see now without this patch (which makes the windows boot always), 
with a single vCPU I can boot a VM without EPT, and I am sure I tested this.
without NPT I still can't boot on AMD, as windows seems to use PV TLB flush
more often on AMD, even with a single vCPU.
 
I do remember testing with single vCPU on both Intel and AMD, and I never had
gotten either of them to boot without TDP.
 
But anyway with this patch the bug is gone for good.
Thank you very very much for fixing this, you saved me lot of time which
I would have put it into this bug eventually.
 
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
 
As for the patch itself, it also looks fine in its current form
(it can probably be optimized further, but that can be done later).
So
 
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
 

More notes from the testing I just did:
 
1. On AMD with npt=0, the Windows VM boots very slowly, and then in the task manager
I see that it booted with 1 CPU, although I configured it for 3-28 vCPUs (it doesn't matter how many).
I tested this with several Win10 VMs; the same pattern repeats.
 
2. The Windows nag screen about "we beg you to open a Microsoft account" makes the VM enter a live lock.
I see at least half a million VM exits per second due to page faults, and it is stuck in the 'please wait' screen,
while with NPT=1 it shows up instantly. The VM has 12 GB of RAM, so I don't think RAM is an issue.
 
It's likely that those are just the result of unoptimized code with regard to TLB flushes,
and timeouts in Windows.
On my Intel laptop, the VM is way faster with EPT=0 and it boots with 3 vCPUs just fine
(the laptop has just a dual-core CPU, so I can't really give more than 3 vCPUs to the VM).



Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-06-07 22:38         ` Maxim Levitsky
@ 2021-06-08  0:03           ` Sean Christopherson
  2021-06-08 14:01             ` Lai Jiangshan
                               ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Sean Christopherson @ 2021-06-08  0:03 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

On Tue, Jun 08, 2021, Maxim Levitsky wrote:
> So this patch *does* fix the windows boot without TDP!

Woot!

> Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Lai,

I have a reworded version of your patch sitting in a branch that leverages this
path to fix similar bugs and do additional cleanup.  Any objection to me gathering
Maxim's tags and posting the version below?  I'm more than happy to hold off if
you'd prefer to send your own version, but I don't want to send my own series
without this fix as doing so would introduce bugs.

Thanks!

Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Tue Jun 1 01:22:56 2021 +0800

    KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
    
    When using shadow paging, unload the guest MMU when emulating a guest TLB
    flush so that all roots are synchronized.  From the guest's perspective,
    flushing the TLB ensures any and all modifications to its PTEs will be
    recognized by the CPU.
    
    Note, unloading the MMU is overkill, but is done to mirror KVM's existing
    handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
    can be done to more precisely synchronize roots when servicing a guest
    TLB flush.
    
    If TDP is enabled, synchronizing the MMU is unnecessary even if nested
    TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
    TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
    guest-physical mappings.  For NPT, guest mappings are always tagged with
    an ASID and thus can only be invalidated via the VMCB's ASID control.
    
    This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB, but
    was only recently exposed after Linux guests stopped flushing the local
    CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac16,
    "x86/mm/tlb: Flush remote and local TLBs concurrently").
    
    Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
    Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
    Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    [sean: massaged comment and changelog]
    Signed-off-by: Sean Christopherson <seanjc@google.com>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1cd6d4685932..3b02528d5ee8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3072,6 +3072,18 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
 static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
        ++vcpu->stat.tlb_flush;
+
+       if (!tdp_enabled) {
+               /*
+                * Unload the entire MMU to force a sync of the shadow page
+                * tables.  A TLB flush on behalf of the guest is equivalent
+                * to INVPCID(all), toggling CR4.PGE, etc...  Note, loading the
+                * MMU will also do an actual TLB flush.
+                */
+               kvm_mmu_unload(vcpu);
+               return;
+       }
+
        static_call(kvm_x86_tlb_flush_guest)(vcpu);
 }
 

> More notes from the testing I just did:
>  
> 1. On AMD with npt=0, the Windows VM boots very slowly, and then in the task manager
> I see that it booted with 1 CPU, although I configured it for 3-28 vCPUs (it doesn't matter how many).
> I tested this with several Win10 VMs; the same pattern repeats.

That's very odd.  Maybe it's so slow that the guest gives up on the AP and marks
it as dead?  That seems unlikely though, I can't imagine waking APs would be
_that_ slow.

> 2. The Windows nag screen about "we beg you to open a Microsoft account" makes the VM enter a live lock.
> I see at least half a million VM exits per second due to page faults, and it is stuck in the 'please wait' screen,
> while with NPT=1 it shows up instantly. The VM has 12 GB of RAM, so I don't think RAM is an issue.
>  
> It's likely that those are just the result of unoptimized code with regard to TLB flushes,
> and timeouts in Windows.
> On my Intel laptop, the VM is way faster with EPT=0 and it boots with 3 vCPUs just fine
> (the laptop has just a dual-core CPU, so I can't really give more than 3 vCPUs to the VM).

Any chance your Intel CPU has PCID?  Although the all-contexts INVPCID emulation
nukes everything, the single-context INVPCID emulation in KVM is optimized to
(a) sync the current MMU (if necessary) instead of unloading it and (b) free
only roots with the matching PCID.  I believe all other forms of TLB flushing
that are likely to be used by the guest will lead to KVM unloading the entire
MMU and rebuilding it from scratch.
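
Very roughly, the difference between the two paths looks like this (a
simplified sketch only, not the actual KVM code; the helpers marked as
pseudocode below don't exist under those names):

	/* All-contexts INVPCID, and with the fix above also a guest "flush
	 * everything": drop the whole MMU.  The subsequent kvm_mmu_load()
	 * re-syncs the shadow page tables and flushes the hardware TLB. */
	kvm_mmu_unload(vcpu);

	/* Single-context INVPCID: much cheaper.  Sync the current root if
	 * it uses the target PCID, and free only cached roots tagged with
	 * that PCID, leaving the rest of the MMU intact. */
	if (kvm_get_active_pcid(vcpu) == pcid)
		kvm_mmu_sync_roots(vcpu);
	for_each_cached_prev_root(vcpu, root)              /* pseudocode */
		if (cached_root_pcid(vcpu, root) == pcid)  /* pseudocode */
			free_cached_root(vcpu, root);      /* pseudocode */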

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-06-08  0:03           ` Sean Christopherson
@ 2021-06-08 14:01             ` Lai Jiangshan
  2021-06-08 17:36             ` Paolo Bonzini
  2021-06-08 21:31             ` Maxim Levitsky
  2 siblings, 0 replies; 21+ messages in thread
From: Lai Jiangshan @ 2021-06-08 14:01 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky
  Cc: Lai Jiangshan, linux-kernel, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm



On 2021/6/8 08:03, Sean Christopherson wrote:
> On Tue, Jun 08, 2021, Maxim Levitsky wrote:
>> So this patch *does* fix the windows boot without TDP!
> 
> Woot!
> 
>> Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
>> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Lai,
> 
> I have a reworded version of your patch sitting in a branch that leverages this
> path to fix similar bugs and do additional cleanup.  Any objection to me gathering
> Maxim's tags and posting the version below?  I'm more than happy to hold off if
> you'd prefer to send your own version, but I don't want to send my own series
> without this fix as doing so would introduce bugs.

Sean

Wow, thank you, it is an excellent rewording and I'm happy with it.

I'm looking forward to your series, and I will remake the "need_sync"[1] patch
to more precisely keep & synchronize roots in a few days if there is no other
optimization.  A rough sketch of the direction I have in mind is below.
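
Just a sketch under my own assumptions (the helper that marks the cached
previous roots is hypothetical, and the details may change):

	static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
	{
		++vcpu->stat.tlb_flush;

		if (!tdp_enabled) {
			/*
			 * Instead of unloading the whole MMU, sync the
			 * current shadow root, mark the cached previous
			 * roots as needing a sync before they are reused
			 * (hypothetical helper), and flush the hardware
			 * TLB for the current root.
			 */
			kvm_mmu_sync_roots(vcpu);
			kvm_mark_prev_roots_need_sync(vcpu);	/* hypothetical */
			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
			return;
		}

		static_call(kvm_x86_tlb_flush_guest)(vcpu);
	}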

Thanks
Lai

[1]: https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/

> 
> Thanks!
> 
> Author: Lai Jiangshan <laijs@linux.alibaba.com>
> Date:   Tue Jun 1 01:22:56 2021 +0800
> 
>      KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
>      
>      When using shadow paging, unload the guest MMU when emulating a guest TLB
>      flush so that all roots are synchronized.  From the guest's perspective,
>      flushing the TLB ensures any and all modifications to its PTEs will be
>      recognized by the CPU.
>      
>      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
>      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
>      can be done to more precisely synchronize roots when servicing a guest
>      TLB flush.
>      
>      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
>      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
>      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
>      guest-physical mappings.  For NPT, guest mappings are always tagged with
>      an ASID and thus can only be invalidated via the VMCB's ASID control.
>      
>      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB, but
>      was only recently exposed after Linux guests stopped flushing the local
>      CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac16,
>      "x86/mm/tlb: Flush remote and local TLBs concurrently").
>      
>      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
>      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
>      Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
>      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>      [sean: massaged comment and changelog]
>      Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1cd6d4685932..3b02528d5ee8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,18 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   {
>          ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled) {
> +               /*
> +                * Unload the entire MMU to force a sync of the shadow page
> +                * tables.  A TLB flush on behalf of the guest is equivalent
> +                * to INVPCID(all), toggling CR4.PGE, etc...  Note, loading the
> +                * MMU will also do an actual TLB flush.
> +                */
> +               kvm_mmu_unload(vcpu);
> +               return;
> +       }
> +
>          static_call(kvm_x86_tlb_flush_guest)(vcpu);
>   }
>   
> 
>> More notes from the testing I just did:
>>   
>> 1. On AMD with npt=0, the Windows VM boots very slowly, and then in the task manager
>> I see that it booted with 1 CPU, although I configured it for 3-28 vCPUs (it doesn't matter how many).
>> I tested this with several Win10 VMs; the same pattern repeats.
> 
> That's very odd.  Maybe it's so slow that the guest gives up on the AP and marks
> it as dead?  That seems unlikely though, I can't imagine waking APs would be
> _that_ slow.
> 
>> 2. The Windows nag screen about "we beg you to open a Microsoft account" makes the VM enter a live lock.
>> I see at least half a million VM exits per second due to page faults, and it is stuck in the 'please wait' screen,
>> while with NPT=1 it shows up instantly. The VM has 12 GB of RAM, so I don't think RAM is an issue.
>>  
>> It's likely that those are just the result of unoptimized code with regard to TLB flushes,
>> and timeouts in Windows.
>> On my Intel laptop, the VM is way faster with EPT=0 and it boots with 3 vCPUs just fine
>> (the laptop has just a dual-core CPU, so I can't really give more than 3 vCPUs to the VM).
> 
> Any chance your Intel CPU has PCID?  Although the all-contexts INVPCID emulation
> nukes everything, the single-context INVPCID emulation in KVM is optimized to
> (a) sync the current MMU (if necessary) instead of unloading it and (b) free
> only roots with the matching PCID.  I believe all other forms of TLB flushing
> that are likely to be used by the guest will lead to KVM unloading the entire
> MMU and rebuilding it from scratch.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-06-08  0:03           ` Sean Christopherson
  2021-06-08 14:01             ` Lai Jiangshan
@ 2021-06-08 17:36             ` Paolo Bonzini
  2021-06-08 21:31             ` Maxim Levitsky
  2 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2021-06-08 17:36 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm

On 08/06/21 02:03, Sean Christopherson wrote:
> On Tue, Jun 08, 2021, Maxim Levitsky wrote:
>> So this patch *does* fix the windows boot without TDP!
> 
> Woot!
> 
>> Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
>> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Lai,
> 
> I have a reworded version of your patch sitting in a branch that leverages this
> path to fix similar bugs and do additional cleanup.  Any objection to me gathering
> Maxim's tags and posting the version below?  I'm more than happy to hold off if
> you'd prefer to send your own version, but I don't want to send my own series
> without this fix as doing so would introduce bugs.
> 
> Thanks!
> 
> Author: Lai Jiangshan <laijs@linux.alibaba.com>
> Date:   Tue Jun 1 01:22:56 2021 +0800
> 
>      KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
>      
>      When using shadow paging, unload the guest MMU when emulating a guest TLB
>      flush so that all roots are synchronized.  From the guest's perspective,
>      flushing the TLB ensures any and all modifications to its PTEs will be
>      recognized by the CPU.
>      
>      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
>      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
>      can be done to more precisely synchronize roots when servicing a guest
>      TLB flush.
>      
>      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
>      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
>      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
>      guest-physical mappings.  For NPT, guest mappings are always tagged with
>      an ASID and thus can only be invalidated via the VMCB's ASID control.
>      
>      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB, but
>      was only recently exposed after Linux guests stopped flushing the local
>      CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac16,
>      "x86/mm/tlb: Flush remote and local TLBs concurrently").
>      
>      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
>      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
>      Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
>      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>      [sean: massaged comment and changelog]
>      Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1cd6d4685932..3b02528d5ee8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,18 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   {
>          ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled) {
> +               /*
> +                * Unload the entire MMU to force a sync of the shadow page
> +                * tables.  A TLB flush on behalf of the guest is equivalent
> +                * to INVPCID(all), toggling CR4.PGE, etc...  Note, loading the
> +                * MMU will also do an actual TLB flush.
> +                */

I slightly prefer it in the opposite order (first why, then how):

                /*
                 * A TLB flush on behalf of the guest is equivalent to
                 * INVPCID(all), toggling CR4.PGE, etc., which requires
                 * a forced sync of the shadow page tables.  Unload the
                 * entire MMU here and the subsequent load will sync the
                 * shadow page tables, and also flush the TLB.
                 */

Queued, thanks all!  It's great that this fixes an actual bug.

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] KVM: X86: fix tlb_flush_guest()
  2021-06-08  0:03           ` Sean Christopherson
  2021-06-08 14:01             ` Lai Jiangshan
  2021-06-08 17:36             ` Paolo Bonzini
@ 2021-06-08 21:31             ` Maxim Levitsky
  2 siblings, 0 replies; 21+ messages in thread
From: Maxim Levitsky @ 2021-06-08 21:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, Lai Jiangshan, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm

On Tue, 2021-06-08 at 00:03 +0000, Sean Christopherson wrote:
> On Tue, Jun 08, 2021, Maxim Levitsky wrote:
> > So this patch *does* fix the windows boot without TDP!
> 
> Woot!
> 
> > Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Lai,
> 
> I have a reworded version of your patch sitting in a branch that leverages this
> path to fix similar bugs and do additional cleanup.  Any objection to me gathering
> Maxim's tags and posting the version below?  I'm more than happy to hold off if
> you'd prefer to send your own version, but I don't want to send my own series
> without this fix as doing so would introduce bugs.
> 
> Thanks!
> 
> Author: Lai Jiangshan <laijs@linux.alibaba.com>
> Date:   Tue Jun 1 01:22:56 2021 +0800
> 
>     KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
>     
>     When using shadow paging, unload the guest MMU when emulating a guest TLB
>     flush so that all roots are synchronized.  From the guest's perspective,
>     flushing the TLB ensures any and all modifications to its PTEs will be
>     recognized by the CPU.
>     
>     Note, unloading the MMU is overkill, but is done to mirror KVM's existing
>     handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
>     can be done to more precisely synchronize roots when servicing a guest
>     TLB flush.
>     
>     If TDP is enabled, synchronizing the MMU is unnecessary even if nested
>     TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
>     TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
>     guest-physical mappings.  For NPT, guest mappings are always tagged with
>     an ASID and thus can only be invalidated via the VMCB's ASID control.
>     
>     This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB, but
>     was only recently exposed after Linux guests stopped flushing the local
>     CPU's TLB prior to flushing remote TLBs (see commit 4ce94eabac16,
>     "x86/mm/tlb: Flush remote and local TLBs concurrently").
>     
>     Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
>     Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
>     Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
>     Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>     [sean: massaged comment and changelog]
>     Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1cd6d4685932..3b02528d5ee8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,18 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>         ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled) {
> +               /*
> +                * Unload the entire MMU to force a sync of the shadow page
> +                * tables.  A TLB flush on behalf of the guest is equivalent
> +                * to INVPCID(all), toggling CR4.PGE, etc...  Note, loading the
> +                * MMU will also do an actual TLB flush.
> +                */
> +               kvm_mmu_unload(vcpu);
> +               return;
> +       }
> +
>         static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
> 
> > More notes from the testing I just did:
> >  
> > 1. On AMD with npt=0, the Windows VM boots very slowly, and then in the task manager
> > I see that it booted with 1 CPU, although I configured it for 3-28 vCPUs (it doesn't matter how many).
> > I tested this with several Win10 VMs; the same pattern repeats.
> 
> That's very odd.  Maybe it's so slow that the guest gives up on the AP and marks
> it as dead?  That seems unlikely though, I can't imagine waking APs would be
> _that_ slow.

I also can't say that it is that slow, but it is possible that a live-lock-like
situation, similar to what I see with the nag screen, causes AP bringup to time out.
I also see that using my old workaround for booting with NPT=0, which was to always
keep the shadow MMU in sync, also causes Windows to see a single vCPU.

When I have some time, I will try to artificially slow Windows down in some other way
and see if it still fails to bring up more than one vCPU.


> 
> > 2. The Windows nag screen about "we beg you to open a Microsoft account" makes the VM enter a live lock.
> > I see at least half a million VM exits per second due to page faults, and it is stuck in the 'please wait' screen,
> > while with NPT=1 it shows up instantly. The VM has 12 GB of RAM, so I don't think RAM is an issue.
> > 
> > It's likely that those are just the result of unoptimized code with regard to TLB flushes,
> > and timeouts in Windows.
> > On my Intel laptop, the VM is way faster with EPT=0 and it boots with 3 vCPUs just fine
> > (the laptop has just a dual-core CPU, so I can't really give more than 3 vCPUs to the VM).
> 
> Any chance your Intel CPU has PCID?  Although the all-contexts INVPCID emulation
> nukes everything, the single-context INVPCID emulation in KVM is optimized to
> (a) sync the current MMU (if necessary) instead of unloading it and (b) free
> only roots with the matching PCID.  I believe all other forms of TLB flushing
> that are likely to be used by the guest will lead to KVM unloading the entire
> MMU and rebuilding it from scratch.

Yes, it has it. It is a relatively recent Kaby Lake processor (i7-7600U),
so this could be it. I'll try to disable it as well when I have some time for
this.


Finally, about my testing of running Hyper-V nested: this patch can't fix it,
since it only changes the non-TDP code path, but we also have another
shadow-paging-related patch on the list,
"KVM: X86: MMU: Use the correct inherited permissions to get shadow page".

Sadly, the VM still did eventually bluescreen (with 'critical process died')
after about 400 migration cycles this time. No luck.

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-06-08 21:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27  2:39 [PATCH] KVM: X86: fix tlb_flush_guest() Lai Jiangshan
2021-05-27 12:55 ` Paolo Bonzini
2021-05-27 16:13   ` Sean Christopherson
2021-05-27 16:14     ` Sean Christopherson
2021-05-27 19:28       ` Sean Christopherson
2021-05-28  1:13         ` Lai Jiangshan
2021-06-02 15:09           ` Sean Christopherson
2021-06-02 22:07             ` Sean Christopherson
2021-05-28  0:18     ` Lai Jiangshan
2021-05-28  0:26       ` Sean Christopherson
2021-05-28  1:29         ` Lai Jiangshan
2021-06-02 15:01           ` Sean Christopherson
2021-06-02  8:13         ` Lai Jiangshan
2021-05-29 22:12     ` Maxim Levitsky
2021-05-31 17:22       ` [PATCH V2] " Lai Jiangshan
2021-06-02 15:39         ` Sean Christopherson
2021-06-07 22:38         ` Maxim Levitsky
2021-06-08  0:03           ` Sean Christopherson
2021-06-08 14:01             ` Lai Jiangshan
2021-06-08 17:36             ` Paolo Bonzini
2021-06-08 21:31             ` Maxim Levitsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).