* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
@ 2020-09-03 17:57 ` Jim Mattson
2020-09-03 18:03 ` Paolo Bonzini
2020-09-23 13:45 ` Paolo Bonzini
` (2 subsequent siblings)
3 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2020-09-03 17:57 UTC (permalink / raw)
To: Mohammed Gamal
Cc: kvm list, Paolo Bonzini, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
I think a smaller guest physical address width *should* be allowed.
However, perhaps the pedantic adherence to the architectural
specification could be turned on or off per-VM? And, if we're going to
be pedantic, I think we should go all the way and get MOV-to-CR3
correct.
Does the typical guest care about whether or not setting any of the
bits 51:46 in a PFN results in a fault?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 17:57 ` Jim Mattson
@ 2020-09-03 18:03 ` Paolo Bonzini
2020-09-03 18:32 ` Jim Mattson
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-03 18:03 UTC (permalink / raw)
To: Jim Mattson, Mohammed Gamal
Cc: kvm list, LKML, Sean Christopherson, Vitaly Kuznetsov,
Wanpeng Li, Joerg Roedel
On 03/09/20 19:57, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>>
>> Since smaller physical address spaces are only supported on VMX, the parameter
>> is only exposed in the kvm_intel module.
>> Modifications to VMX page fault and EPT violation handling will depend on whether
>> that parameter is enabled.
>>
>> Also disable support by default, and let the user decide if they want to enable
>> it.
>
> I think a smaller guest physical address width *should* be allowed.
> However, perhaps the pedantic adherence to the architectural
> specification could be turned on or off per-VM? And, if we're going to
> be pedantic, I think we should go all the way and get MOV-to-CR3
> correct.
That would be way too slow. Even the current trapping of present #PF
can introduce some slowdown depending on the workload.
> Does the typical guest care about whether or not setting any of the
> bits 51:46 in a PFN results in a fault?
At least KVM with shadow pages does, which is a bit niche but it shows
that you cannot really rely on no one doing it. As you guessed, the
main usage of the feature is for machines with 5-level page tables where
there are no reserved bits; emulating smaller MAXPHYADDR allows
migrating VMs from 4-level page-table hosts.
Enabling per-VM would not be particularly useful IMO because if you want
to disable this code you can just set host MAXPHYADDR = guest
MAXPHYADDR, which should be the common case unless you want to do that
kind of Skylake to Icelake (or similar) migration.
Paolo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 18:03 ` Paolo Bonzini
@ 2020-09-03 18:32 ` Jim Mattson
2020-09-03 20:02 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2020-09-03 18:32 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mohammed Gamal, kvm list, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On Thu, Sep 3, 2020 at 11:03 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 03/09/20 19:57, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
> >> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
> >>
> >> Since smaller physical address spaces are only supported on VMX, the parameter
> >> is only exposed in the kvm_intel module.
> >> Modifications to VMX page fault and EPT violation handling will depend on whether
> >> that parameter is enabled.
> >>
> >> Also disable support by default, and let the user decide if they want to enable
> >> it.
> >
> > I think a smaller guest physical address width *should* be allowed.
> > However, perhaps the pedantic adherence to the architectural
> > specification could be turned on or off per-VM? And, if we're going to
> > be pedantic, I think we should go all the way and get MOV-to-CR3
> > correct.
>
> That would be way too slow. Even the current trapping of present #PF
> can introduce some slowdown depending on the workload.
Yes, I was concerned about that...which is why I would not want to
enable pedantic mode. But if you're going to be pedantic, why go
halfway?
> > Does the typical guest care about whether or not setting any of the
> > bits 51:46 in a PFN results in a fault?
>
> At least KVM with shadow pages does, which is a bit niche but it shows
> that you cannot really rely on no one doing it. As you guessed, the
> main usage of the feature is for machines with 5-level page tables where
> there are no reserved bits; emulating smaller MAXPHYADDR allows
> migrating VMs from 4-level page-table hosts.
>
> Enabling per-VM would not be particularly useful IMO because if you want
> to disable this code you can just set host MAXPHYADDR = guest
> MAXPHYADDR, which should be the common case unless you want to do that
> kind of Skylake to Icelake (or similar) migration.
I expect that it will be quite common to run 46-bit wide legacy VMs on
Ice Lake hardware, as Ice Lake machines start showing up in
heterogeneous data centers.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 18:32 ` Jim Mattson
@ 2020-09-03 20:02 ` Paolo Bonzini
2020-09-03 21:26 ` Jim Mattson
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-03 20:02 UTC (permalink / raw)
To: Jim Mattson
Cc: Mohammed Gamal, kvm list, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On 03/09/20 20:32, Jim Mattson wrote:
>> [Checking writes to CR3] would be way too slow. Even the current
>> trapping of present #PF can introduce some slowdown depending on the
>> workload.
>
> Yes, I was concerned about that...which is why I would not want to
> enable pedantic mode. But if you're going to be pedantic, why go
> halfway?
Because I am not sure about any guest, even KVM, caring about setting
bits 51:46 in CR3.
>>> Does the typical guest care about whether or not setting any of the
>>> bits 51:46 in a PFN results in a fault?
>>
>> At least KVM with shadow pages does, which is a bit niche but it shows
>> that you cannot really rely on no one doing it. As you guessed, the
>> main usage of the feature is for machines with 5-level page tables where
>> there are no reserved bits; emulating smaller MAXPHYADDR allows
>> migrating VMs from 4-level page-table hosts.
>>
>> Enabling per-VM would not be particularly useful IMO because if you want
>> to disable this code you can just set host MAXPHYADDR = guest
>> MAXPHYADDR, which should be the common case unless you want to do that
>> kind of Skylake to Icelake (or similar) migration.
>
> I expect that it will be quite common to run 46-bit wide legacy VMs on
> Ice Lake hardware, as Ice Lake machines start showing up in
> heterogeneous data centers.
If you'll be okay with running _all_ 46-bit wide legacy VMs without
MAXPHYADDR emulation, that's what this patch is for. If you'll be okay
with running _only_ 46-bit wide VMs without emulation, you still don't
need special enabling per-VM beyond the automatic one based on
CPUID[0x8000_0008]. Do you think you'll need to enable it for some
special 46-bit VMs?
Paolo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 20:02 ` Paolo Bonzini
@ 2020-09-03 21:26 ` Jim Mattson
0 siblings, 0 replies; 19+ messages in thread
From: Jim Mattson @ 2020-09-03 21:26 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mohammed Gamal, kvm list, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On Thu, Sep 3, 2020 at 1:02 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 03/09/20 20:32, Jim Mattson wrote:
> >> [Checking writes to CR3] would be way too slow. Even the current
> >> trapping of present #PF can introduce some slowdown depending on the
> >> workload.
> >
> > Yes, I was concerned about that...which is why I would not want to
> > enable pedantic mode. But if you're going to be pedantic, why go
> > halfway?
>
> Because I am not sure about any guest, even KVM, caring about setting
> bits 51:46 in CR3.
>
> >>> Does the typical guest care about whether or not setting any of the
> >>> bits 51:46 in a PFN results in a fault?
> >>
> >> At least KVM with shadow pages does, which is a bit niche but it shows
> >> that you cannot really rely on no one doing it. As you guessed, the
> >> main usage of the feature is for machines with 5-level page tables where
> >> there are no reserved bits; emulating smaller MAXPHYADDR allows
> >> migrating VMs from 4-level page-table hosts.
> >>
> >> Enabling per-VM would not be particularly useful IMO because if you want
> >> to disable this code you can just set host MAXPHYADDR = guest
> >> MAXPHYADDR, which should be the common case unless you want to do that
> >> kind of Skylake to Icelake (or similar) migration.
> >
> > I expect that it will be quite common to run 46-bit wide legacy VMs on
> > Ice Lake hardware, as Ice Lake machines start showing up in
> > heterogeneous data centers.
>
> If you'll be okay with running _all_ 46-bit wide legacy VMs without
> MAXPHYADDR emulation, that's what this patch is for. If you'll be okay
> with running _only_ 46-bit wide VMs without emulation, you still don't
> need special enabling per-VM beyond the automatic one based on
> CPUID[0x8000_0008]. Do you think you'll need to enable it for some
> special 46-bit VMs?
Yes. From what you've said above, we would only want to enable this
for the niche case of 46-bit KVM guests using shadow paging. I would
expect that to be a very small number of VMs. :-)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
2020-09-03 17:57 ` Jim Mattson
@ 2020-09-23 13:45 ` Paolo Bonzini
[not found] ` <CALMp9eTHbhwfdq4Be=XcUG9z82KK8AapQeVmsdH=mGdQ_Yt2ug@mail.gmail.com>
2020-09-28 15:34 ` Qian Cai
2021-01-16 0:08 ` Jim Mattson
3 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-23 13:45 UTC (permalink / raw)
To: Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli, jmattson, joro
On 03/09/20 16:11, Mohammed Gamal wrote:
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
> +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> + && allow_smaller_maxphyaddr) {
> /*
> * EPT will cause page fault only if we need to
> * detect illegal GPAs.
> @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * would also use advanced VM-exit information for EPT violations to
> * reconstruct the page fault error code.
> */
> - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
> return kvm_emulate_instruction(vcpu, 0);
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> #endif
> vmx_check_vmcs12_offsets();
>
> - /*
> - * Intel processors don't have problems with
> - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> - * it for VMX by default
> - */
> - allow_smaller_maxphyaddr = true;
> -
> return 0;
> }
> module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 26175a4759fa..b859435efa2e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> + if (!allow_smaller_maxphyaddr)
> + return false;
> +
> return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
This needs to return true if !enable_ept.
	if (!enable_ept)
		return true;

	return allow_smaller_maxphyaddr &&
	       cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
Fixed and queued, thanks.
Paolo
> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d39d6cf1d473..982f1d73a884 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
>
> -bool __read_mostly allow_smaller_maxphyaddr;
> +bool __read_mostly allow_smaller_maxphyaddr = 0;
> EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
>
> static u64 __read_mostly host_xss;
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
2020-09-03 17:57 ` Jim Mattson
2020-09-23 13:45 ` Paolo Bonzini
@ 2020-09-28 15:34 ` Qian Cai
2020-09-29 11:59 ` Qian Cai
2021-01-16 0:08 ` Jim Mattson
3 siblings, 1 reply; 19+ messages in thread
From: Qian Cai @ 2020-09-28 15:34 UTC (permalink / raw)
To: Mohammed Gamal, kvm, pbonzini
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On Thu, 2020-09-03 at 16:11 +0200, Mohammed Gamal wrote:
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on
> whether that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to
> enable it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Running a simple SR-IOV test on Intel triggers the warning below:
.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config
P.S.: I did confirm the linux-next included this hunk as well:
https://lore.kernel.org/kvm/8c7ce8ff-a212-a974-3829-c45eb5335651@redhat.com/
[ 1119.752137][ T7441] WARNING: CPU: 27 PID: 7441 at arch/x86/kvm/vmx/vmx.c:4809 handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1119.763312][ T7441] Modules linked in: loop nls_ascii nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ses enclosure efivarfs ip_tables x_tables sd_mod nvme tg3 firmware_class smartpqi nvme_core libphy scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 1119.786660][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted: G I 5.9.0-rc7-next-20200928+ #2
[ 1119.796572][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 1119.805870][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1119.812788][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8 <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
[ 1119.832384][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
[ 1119.838353][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX: 0000000000000000
[ 1119.846247][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2b98940
[ 1119.854124][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09: ffffed11d1fcd071
[ 1119.862012][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12: ffff8888f6129000
[ 1119.869903][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15: 0000000000000000
[ 1119.877795][ T7441] FS: 00007fc3277fe700(0000) GS:ffff88901f640000(0000) knlGS:0000000000000000
[ 1119.886649][ T7441] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1119.893127][ T7441] CR2: 0000000000000000 CR3: 0000000ab7376002 CR4: 00000000007726e0
[ 1119.901022][ T7441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1119.908916][ T7441] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1119.916806][ T7441] PKRU: 55555554
[ 1119.920226][ T7441] Call Trace:
[ 1119.923426][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
[ 1119.928877][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1119.934335][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
[ 1119.941029][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
[ 1119.947177][ T7441] ? rcu_read_unlock+0x40/0x40
[ 1119.951822][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1119.957272][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1119.962441][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1119.968325][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
[ 1119.973433][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
[ 1119.979776][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1119.985945][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1119.991924][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
[ 1119.997047][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
[ 1120.002305][ T7441] ? find_held_lock+0x33/0x1c0
[ 1120.006968][ T7441] ? __fget_files+0x1a4/0x2e0
[ 1120.011533][ T7441] ? lock_downgrade+0x730/0x730
[ 1120.016283][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
[ 1120.021714][ T7441] ? __fget_files+0x1c3/0x2e0
[ 1120.026287][ T7441] __x64_sys_ioctl+0x315/0xfc0
[ 1120.030933][ T7441] ? generic_block_fiemap+0x60/0x60
[ 1120.036034][ T7441] ? up_read+0x1a3/0x730
[ 1120.040158][ T7441] ? down_read_nested+0x420/0x420
[ 1120.045080][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
[ 1120.050862][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.056311][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.061743][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.067629][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
[ 1120.073410][ T7441] do_syscall_64+0x33/0x40
[ 1120.077723][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1120.083505][ T7441] RIP: 0033:0x7fc3368c687b
[ 1120.087813][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
[ 1120.107408][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1120.115739][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX: 00007fc3368c687b
[ 1120.123622][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ 1120.131513][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09: 0000000000000000
[ 1120.139402][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12: 00005613156b4100
[ 1120.147294][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15: 00005613177da760
[ 1120.155185][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted: G I 5.9.0-rc7-next-20200928+ #2
[ 1120.165072][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 1120.174345][ T7441] Call Trace:
[ 1120.177509][ T7441] dump_stack+0x99/0xcb
[ 1120.181548][ T7441] __warn.cold.13+0xe/0x55
[ 1120.185848][ T7441] ? handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1120.192157][ T7441] report_bug+0x1af/0x260
[ 1120.196367][ T7441] handle_bug+0x44/0x80
[ 1120.200400][ T7441] exc_invalid_op+0x13/0x40
[ 1120.204784][ T7441] asm_exc_invalid_op+0x12/0x20
[ 1120.209520][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1120.216435][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8 <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
[ 1120.236016][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
[ 1120.241973][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX: 0000000000000000
[ 1120.249851][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2b98940
[ 1120.257729][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09: ffffed11d1fcd071
[ 1120.265605][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12: ffff8888f6129000
[ 1120.273483][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15: 0000000000000000
[ 1120.281364][ T7441] ? handle_exception_nmi+0x788/0xe60 [kvm_intel]
[ 1120.287695][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
[ 1120.293126][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.298582][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
[ 1120.305262][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
[ 1120.311392][ T7441] ? rcu_read_unlock+0x40/0x40
[ 1120.316038][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.321469][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1120.326639][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.332506][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
[ 1120.337613][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
[ 1120.343942][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1120.350096][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1120.356076][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
[ 1120.361180][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
[ 1120.366436][ T7441] ? find_held_lock+0x33/0x1c0
[ 1120.371082][ T7441] ? __fget_files+0x1a4/0x2e0
[ 1120.375639][ T7441] ? lock_downgrade+0x730/0x730
[ 1120.380371][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
[ 1120.385802][ T7441] ? __fget_files+0x1c3/0x2e0
[ 1120.390360][ T7441] __x64_sys_ioctl+0x315/0xfc0
[ 1120.395006][ T7441] ? generic_block_fiemap+0x60/0x60
[ 1120.400088][ T7441] ? up_read+0x1a3/0x730
[ 1120.404209][ T7441] ? down_read_nested+0x420/0x420
[ 1120.409116][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
[ 1120.414898][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.420329][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.425761][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.431628][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
[ 1120.437410][ T7441] do_syscall_64+0x33/0x40
[ 1120.441704][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1120.447484][ T7441] RIP: 0033:0x7fc3368c687b
[ 1120.451779][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
[ 1120.471361][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1120.479676][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX: 00007fc3368c687b
[ 1120.487552][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ 1120.495429][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09: 0000000000000000
[ 1120.503305][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12: 00005613156b4100
[ 1120.511181][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15: 00005613177da760
[ 1120.519104][ T7441] irq event stamp: 4321673
[ 1120.523400][ T7441] hardirqs last enabled at (4321681): [<ffffffffa6c2a76f>] console_unlock+0x81f/0xa20
[ 1120.532948][ T7441] hardirqs last disabled at (4321690): [<ffffffffa6c2a67b>] console_unlock+0x72b/0xa20
[ 1120.542495][ T7441] softirqs last enabled at (4321672): [<ffffffffa800061b>] __do_softirq+0x61b/0x95d
[ 1120.551864][ T7441] softirqs last disabled at (4321665): [<ffffffffa7e00ec2>] asm_call_irq_on_stack+0x12/0x20
[ 1120.561852][ T7441] ---[ end trace 31c2bbb23abc5aa2 ]---
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
> +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> + && allow_smaller_maxphyaddr) {
> /*
> * EPT will cause page fault only if we need to
> * detect illegal GPAs.
> @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * would also use advanced VM-exit information for EPT violations to
> * reconstruct the page fault error code.
> */
> - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
> return kvm_emulate_instruction(vcpu, 0);
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> #endif
> vmx_check_vmcs12_offsets();
>
> - /*
> - * Intel processors don't have problems with
> - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> - * it for VMX by default
> - */
> - allow_smaller_maxphyaddr = true;
> -
> return 0;
> }
> module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 26175a4759fa..b859435efa2e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> + if (!allow_smaller_maxphyaddr)
> + return false;
> +
> return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d39d6cf1d473..982f1d73a884 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
>
> -bool __read_mostly allow_smaller_maxphyaddr;
> +bool __read_mostly allow_smaller_maxphyaddr = 0;
> EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
>
> static u64 __read_mostly host_xss;
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-28 15:34 ` Qian Cai
@ 2020-09-29 11:59 ` Qian Cai
2020-09-29 12:26 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Qian Cai @ 2020-09-29 11:59 UTC (permalink / raw)
To: Mohammed Gamal, kvm, pbonzini
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On Mon, 2020-09-28 at 11:34 -0400, Qian Cai wrote:
> On Thu, 2020-09-03 at 16:11 +0200, Mohammed Gamal wrote:
> > This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
> >
> > Since smaller physical address spaces are only supported on VMX, the parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will depend on
> > whether that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they want to
> > enable it.
> >
> > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
>
> Running a simple SR-IOV test on Intel triggers the warning below:
>
> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config
>
> P.S.: I did confirm the linux-next included this hunk as well:
> https://lore.kernel.org/kvm/8c7ce8ff-a212-a974-3829-c45eb5335651@redhat.com/
>
>
> [ 1119.752137][ T7441] WARNING: CPU: 27 PID: 7441 at arch/x86/kvm/vmx/vmx.c:4809 handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
That is:
WARN_ON_ONCE(!allow_smaller_maxphyaddr);
I noticed the original patch did not have this WARN_ON_ONCE(), but the mainline
commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address space
support user-configurable") does have it for some reason.
Paolo, any idea?
> [ ... rest of the quoted splat trimmed; it is identical to the one above ... ]
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
> [ 1120.107408][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 1120.115739][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX:
> 00007fc3368c687b
> [ 1120.123622][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000017
> [ 1120.131513][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09:
> 0000000000000000
> [ 1120.139402][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00005613156b4100
> [ 1120.147294][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15:
> 00005613177da760
> [ 1120.155185][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted:
> G I 5.9.0-rc7-next-20200928+ #2
> [ 1120.165072][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560
> Gen10, BIOS U34 11/13/2019
> [ 1120.174345][ T7441] Call Trace:
> [ 1120.177509][ T7441] dump_stack+0x99/0xcb
> [ 1120.181548][ T7441] __warn.cold.13+0xe/0x55
> [ 1120.185848][ T7441] ? handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1120.192157][ T7441] report_bug+0x1af/0x260
> [ 1120.196367][ T7441] handle_bug+0x44/0x80
> [ 1120.200400][ T7441] exc_invalid_op+0x13/0x40
> [ 1120.204784][ T7441] asm_exc_invalid_op+0x12/0x20
> [ 1120.209520][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1120.216435][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00
> 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8
> <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
> [ 1120.236016][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
> [ 1120.241973][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX:
> 0000000000000000
> [ 1120.249851][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffffffc2b98940
> [ 1120.257729][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09:
> ffffed11d1fcd071
> [ 1120.265605][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12:
> ffff8888f6129000
> [ 1120.273483][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15:
> 0000000000000000
> [ 1120.281364][ T7441] ? handle_exception_nmi+0x788/0xe60 [kvm_intel]
> [ 1120.287695][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
> [ 1120.293126][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.298582][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
> [ 1120.305262][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
> [ 1120.311392][ T7441] ? rcu_read_unlock+0x40/0x40
> [ 1120.316038][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.321469][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 1120.326639][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.332506][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
> [ 1120.337613][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
> [ 1120.343942][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1120.350096][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1120.356076][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
> [ 1120.361180][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
> [ 1120.366436][ T7441] ? find_held_lock+0x33/0x1c0
> [ 1120.371082][ T7441] ? __fget_files+0x1a4/0x2e0
> [ 1120.375639][ T7441] ? lock_downgrade+0x730/0x730
> [ 1120.380371][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
> [ 1120.385802][ T7441] ? __fget_files+0x1c3/0x2e0
> [ 1120.390360][ T7441] __x64_sys_ioctl+0x315/0xfc0
> [ 1120.395006][ T7441] ? generic_block_fiemap+0x60/0x60
> [ 1120.400088][ T7441] ? up_read+0x1a3/0x730
> [ 1120.404209][ T7441] ? down_read_nested+0x420/0x420
> [ 1120.409116][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
> [ 1120.414898][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.420329][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.425761][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.431628][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
> [ 1120.437410][ T7441] do_syscall_64+0x33/0x40
> [ 1120.441704][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1120.447484][ T7441] RIP: 0033:0x7fc3368c687b
> [ 1120.451779][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
> [ 1120.471361][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 1120.479676][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX:
> 00007fc3368c687b
> [ 1120.487552][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000017
> [ 1120.495429][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09:
> 0000000000000000
> [ 1120.503305][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00005613156b4100
> [ 1120.511181][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15:
> 00005613177da760
> [ 1120.519104][ T7441] irq event stamp: 4321673
> [ 1120.523400][ T7441] hardirqs last enabled at (4321681):
> [<ffffffffa6c2a76f>] console_unlock+0x81f/0xa20
> [ 1120.532948][ T7441] hardirqs last disabled at (4321690):
> [<ffffffffa6c2a67b>] console_unlock+0x72b/0xa20
> [ 1120.542495][ T7441] softirqs last enabled at (4321672):
> [<ffffffffa800061b>] __do_softirq+0x61b/0x95d
> [ 1120.551864][ T7441] softirqs last disabled at (4321665):
> [<ffffffffa7e00ec2>] asm_call_irq_on_stack+0x12/0x20
> [ 1120.561852][ T7441] ---[ end trace 31c2bbb23abc5aa2 ]---
>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > arch/x86/kvm/vmx/vmx.h | 3 +++
> > arch/x86/kvm/x86.c | 2 +-
> > 3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> > module_param_named(preemption_timer, enable_preemption_timer, bool,
> > S_IRUGO);
> > #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
> > +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> > +
> > #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> > #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> > #define KVM_VM_CR0_ALWAYS_ON \
> > @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> >
> > if (is_page_fault(intr_info)) {
> > cr2 = vmx_get_exit_qual(vcpu);
> > - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> > + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> > + && allow_smaller_maxphyaddr) {
> > /*
> > * EPT will cause page fault only if we need to
> > * detect illegal GPAs.
> > @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > * would also use advanced VM-exit information for EPT violations to
> > * reconstruct the page fault error code.
> > */
> > - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> > + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) &&
> > allow_smaller_maxphyaddr)
> > return kvm_emulate_instruction(vcpu, 0);
> >
> > return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> > #endif
> > vmx_check_vmcs12_offsets();
> >
> > - /*
> > - * Intel processors don't have problems with
> > - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> > - * it for VMX by default
> > - */
> > - allow_smaller_maxphyaddr = true;
> > -
> > return 0;
> > }
> > module_init(vmx_init);
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index 26175a4759fa..b859435efa2e 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
> >
> > static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> > {
> > + if (!allow_smaller_maxphyaddr)
> > + return false;
> > +
> > return !enable_ept || cpuid_maxphyaddr(vcpu) <
> > boot_cpu_data.x86_phys_bits;
> > }
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d39d6cf1d473..982f1d73a884 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> > u64 __read_mostly host_efer;
> > EXPORT_SYMBOL_GPL(host_efer);
> >
> > -bool __read_mostly allow_smaller_maxphyaddr;
> > +bool __read_mostly allow_smaller_maxphyaddr = 0;
> > EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
> >
> > static u64 __read_mostly host_xss;
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 11:59 ` Qian Cai
@ 2020-09-29 12:26 ` Paolo Bonzini
2020-09-29 13:39 ` Qian Cai
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-29 12:26 UTC (permalink / raw)
To: Qian Cai, Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On 29/09/20 13:59, Qian Cai wrote:
>
> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
>
> I noticed the original patch did not have this WARN_ON_ONCE(), but the mainline
> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address space
> support user-configurable") does have it for some reason.
Because that part of the code should not be reached. The exception
bitmap is set up with
if (!vmx_need_pf_intercept(vcpu))
eb &= ~(1u << PF_VECTOR);
where
static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
{
if (!enable_ept)
return true;
return allow_smaller_maxphyaddr &&
cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
}
We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
genuine; I've sent a patch.
Paolo
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 12:26 ` Paolo Bonzini
@ 2020-09-29 13:39 ` Qian Cai
2020-09-29 14:47 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Qian Cai @ 2020-09-29 13:39 UTC (permalink / raw)
To: Paolo Bonzini, Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
> On 29/09/20 13:59, Qian Cai wrote:
> > WARN_ON_ONCE(!allow_smaller_maxphyaddr);
> >
> > I noticed the original patch did not have this WARN_ON_ONCE(), but the
> > mainline
> > commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
> > space
> > support user-configurable") does have it for some reason.
>
> Because that part of the code should not be reached. The exception
> bitmap is set up with
>
> if (!vmx_need_pf_intercept(vcpu))
> eb &= ~(1u << PF_VECTOR);
>
> where
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> if (!enable_ept)
> return true;
>
> return allow_smaller_maxphyaddr &&
> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> }
>
> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
> genuine; I've sent a patch.
Care to provide a link to the patch? Just curious.
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 13:39 ` Qian Cai
@ 2020-09-29 14:47 ` Paolo Bonzini
2020-10-02 17:28 ` Naresh Kamboju
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-29 14:47 UTC (permalink / raw)
To: Qian Cai, Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On 29/09/20 15:39, Qian Cai wrote:
> On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
>> On 29/09/20 13:59, Qian Cai wrote:
>>> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
>>>
>>> I noticed the original patch did not have this WARN_ON_ONCE(), but the
>>> mainline
>>> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
>>> space
>>> support user-configurable") does have it for some reason.
>>
>> Because that part of the code should not be reached. The exception
>> bitmap is set up with
>>
>> if (!vmx_need_pf_intercept(vcpu))
>> eb &= ~(1u << PF_VECTOR);
>>
>> where
>>
>> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
>> {
>> if (!enable_ept)
>> return true;
>>
>> return allow_smaller_maxphyaddr &&
>> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
>> }
>>
>> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
>> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
>> genuine; I've sent a patch.
>
> Care to provide a link to the patch? Just curious.
>
Ok, I haven't sent it yet. :) But here it is:
commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Tue Sep 29 08:31:32 2020 -0400
KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
the #PF intercept bit in the exception bitmap when they do not match.
This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
hypervisor can get a vmexit for #PF exceptions even when the
corresponding bit is clear in the exception bitmap.
This is unexpected and is promptly reported as a WARN_ON_ONCE.
To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0384e93548a..f4e9c310032a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -794,6 +794,18 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
*/
if (is_guest_mode(vcpu))
eb |= get_vmcs12(vcpu)->exception_bitmap;
+ else {
+ /*
+ * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+ * between guest and host. In that case we only care about present
+ * faults. For vmcs02, however, PFEC_MASK and PFEC_MATCH are set in
+ * prepare_vmcs02_rare.
+ */
+ bool selective_pf_trap = enable_ept && (eb & (1u << PF_VECTOR));
+ int mask = selective_pf_trap ? PFERR_PRESENT_MASK : 0;
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, mask);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, mask);
+ }
vmcs_write32(EXCEPTION_BITMAP, eb);
}
@@ -4355,16 +4367,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
-
- /*
- * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
- * between guest and host. In that case we only care about present
- * faults.
- */
- if (enable_ept) {
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
- }
}
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 14:47 ` Paolo Bonzini
@ 2020-10-02 17:28 ` Naresh Kamboju
2020-10-02 17:30 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Naresh Kamboju @ 2020-10-02 17:28 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Qian Cai, Mohammed Gamal, kvm list, open list,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Stephen Rothwell, Linux-Next Mailing List,
Linus Torvalds, lkft-triage
Hi Paolo,
Thanks for the patch.
On Tue, 29 Sep 2020 at 20:17, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/09/20 15:39, Qian Cai wrote:
> > On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
> >> On 29/09/20 13:59, Qian Cai wrote:
> >>> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
> >>>
> >>> I noticed the original patch did not have this WARN_ON_ONCE(), but the
> >>> mainline
> >>> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
> >>> space
> >>> support user-configurable") does have it for some reason.
> >>
> >> Because that part of the code should not be reached. The exception
> >> bitmap is set up with
> >>
> >> if (!vmx_need_pf_intercept(vcpu))
> >> eb &= ~(1u << PF_VECTOR);
> >>
> >> where
> >>
> >> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> >> {
> >> if (!enable_ept)
> >> return true;
> >>
> >> return allow_smaller_maxphyaddr &&
> >> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> >> }
> >>
> >> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
> >> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
> >> genuine; I've sent a patch.
> >
> > Care to provide a link to the patch? Just curious.
> >
>
> Ok, I haven't sent it yet. :) But here it is:
>
> commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date: Tue Sep 29 08:31:32 2020 -0400
>
> KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
>
> The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
> the #PF intercept bit in the exception bitmap when they do not match.
> This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
> hypervisor can get a vmexit for #PF exceptions even when the
> corresponding bit is clear in the exception bitmap.
>
> This is unexpected and is promptly reported as a WARN_ON_ONCE.
> To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
> is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
I have tested this patch on an x86_64 machine and the reported issue is gone.
>
> Reported-by: Qian Cai <cai@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f0384e93548a..f4e9c310032a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -794,6 +794,18 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
> */
> if (is_guest_mode(vcpu))
> eb |= get_vmcs12(vcpu)->exception_bitmap;
> + else {
> + /*
> + * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
> + * between guest and host. In that case we only care about present
> + * faults. For vmcs02, however, PFEC_MASK and PFEC_MATCH are set in
> + * prepare_vmcs02_rare.
> + */
> + bool selective_pf_trap = enable_ept && (eb & (1u << PF_VECTOR));
> + int mask = selective_pf_trap ? PFERR_PRESENT_MASK : 0;
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, mask);
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, mask);
> + }
>
> vmcs_write32(EXCEPTION_BITMAP, eb);
> }
> @@ -4355,16 +4367,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> vmx->pt_desc.guest.output_mask = 0x7F;
> vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
> }
> -
> - /*
> - * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
> - * between guest and host. In that case we only care about present
> - * faults.
> - */
> - if (enable_ept) {
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
> - }
> }
>
> static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
test log link
https://lkft.validation.linaro.org/scheduler/job/1813223
- Naresh
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-10-02 17:28 ` Naresh Kamboju
@ 2020-10-02 17:30 ` Paolo Bonzini
0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2020-10-02 17:30 UTC (permalink / raw)
To: Naresh Kamboju
Cc: Qian Cai, Mohammed Gamal, kvm list, open list,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Stephen Rothwell, Linux-Next Mailing List,
Linus Torvalds, lkft-triage
On 02/10/20 19:28, Naresh Kamboju wrote:
>>
>> commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
>> Author: Paolo Bonzini <pbonzini@redhat.com>
>> Date: Tue Sep 29 08:31:32 2020 -0400
>>
>> KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
>>
>> The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
>> the #PF intercept bit in the exception bitmap when they do not match.
>> This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
>> hypervisor can get a vmexit for #PF exceptions even when the
>> corresponding bit is clear in the exception bitmap.
>>
>> This is unexpected and is promptly reported as a WARN_ON_ONCE.
>> To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
>> is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
> I have tested this patch on an x86_64 machine and the reported issue is gone.
>
Thanks, the issue with my disk is gone too so it'll get to Linus in time
for rc8.
Paolo
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
` (2 preceding siblings ...)
2020-09-28 15:34 ` Qian Cai
@ 2021-01-16 0:08 ` Jim Mattson
2021-01-18 10:18 ` Mohammed Gamal
3 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2021-01-16 0:08 UTC (permalink / raw)
To: Mohammed Gamal
Cc: kvm list, Paolo Bonzini, LKML, Vitaly Kuznetsov, Wanpeng Li,
Joerg Roedel, Aaron Lewis, Sean Christopherson
On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
Since this variable is in the kvm module rather than the kvm_intel
module, its current setting is preserved across "rmmod kvm_intel;
modprobe kvm_intel." That is, if set to true, it doesn't revert to
false after "rmmod kvm_intel." Is that the intended behavior?
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2021-01-16 0:08 ` Jim Mattson
@ 2021-01-18 10:18 ` Mohammed Gamal
2021-06-21 18:01 ` Jim Mattson
0 siblings, 1 reply; 19+ messages in thread
From: Mohammed Gamal @ 2021-01-18 10:18 UTC (permalink / raw)
To: Jim Mattson, Paolo Bonzini
Cc: kvm list, LKML, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
Aaron Lewis, Sean Christopherson
On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com>
> wrote:
> >
> > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > parameter.
> >
> > Since smaller physical address spaces are only supported on VMX,
> > the parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will
> > depend on whether
> > that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they
> > want to enable
> > it.
> >
> > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > arch/x86/kvm/vmx/vmx.h | 3 +++
> > arch/x86/kvm/x86.c | 2 +-
> > 3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly
> > enable_preemption_timer = 1;
> > module_param_named(preemption_timer, enable_preemption_timer,
> > bool, S_IRUGO);
> > #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
>
> Since this variable is in the kvm module rather than the kvm_intel
> module, its current setting is preserved across "rmmod kvm_intel;
> modprobe kvm_intel." That is, if set to true, it doesn't revert to
> false after "rmmod kvm_intel." Is that the intended behavior?
>
IIRC, this is because this setting was indeed not intended to be just
VMX-specific, but since AMD has an issue with PTE accessed-bits being
set by hardware, we can't yet enable this feature on it; it might
make sense to move the variable to the kvm_intel module for now.
Paolo, what do you think?
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2021-01-18 10:18 ` Mohammed Gamal
@ 2021-06-21 18:01 ` Jim Mattson
2021-06-22 12:59 ` Mohammed Gamal
0 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2021-06-21 18:01 UTC (permalink / raw)
To: Mohammed Gamal
Cc: Paolo Bonzini, kvm list, LKML, Vitaly Kuznetsov, Wanpeng Li,
Joerg Roedel, Aaron Lewis, Sean Christopherson
On Mon, Jan 18, 2021 at 2:22 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>
> On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com>
> > wrote:
> > >
> > > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > > parameter.
> > >
> > > Since smaller physical address spaces are only supported on VMX,
> > > the parameter
> > > is only exposed in the kvm_intel module.
> > > Modifications to VMX page fault and EPT violation handling will
> > > depend on whether
> > > that parameter is enabled.
> > >
> > > Also disable support by default, and let the user decide if they
> > > want to enable
> > > it.
> > >
> > > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > > arch/x86/kvm/vmx/vmx.h | 3 +++
> > > arch/x86/kvm/x86.c | 2 +-
> > > 3 files changed, 10 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 819c185adf09..dc778c7b5a06 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -129,6 +129,9 @@ static bool __read_mostly
> > > enable_preemption_timer = 1;
> > > module_param_named(preemption_timer, enable_preemption_timer,
> > > bool, S_IRUGO);
> > > #endif
> > >
> > > +extern bool __read_mostly allow_smaller_maxphyaddr;
> >
> > Since this variable is in the kvm module rather than the kvm_intel
> > module, its current setting is preserved across "rmmod kvm_intel;
> > modprobe kvm_intel." That is, if set to true, it doesn't revert to
> > false after "rmmod kvm_intel." Is that the intended behavior?
> >
>
> IIRC, this is because this setting was indeed not intended to be just
> VMX-specific, but since AMD has an issue with PTE accessed-bits being
> set by hardware, we can't yet enable this feature on it; it might
> make sense to move the variable to the kvm_intel module for now.
Um...
We do allow it for SVM, if NPT is not enabled. In fact, we set it
unconditionally in that case. See commit 3edd68399dc15 ("KVM: x86: Add
a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support").
Perhaps it should be a module parameter for SVM as well?
And, in any case, it would be nice if the parameter reverted to false
when the kvm_intel module is unloaded.
> Paolo, what do you think?
>
>
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2021-06-21 18:01 ` Jim Mattson
@ 2021-06-22 12:59 ` Mohammed Gamal
0 siblings, 0 replies; 19+ messages in thread
From: Mohammed Gamal @ 2021-06-22 12:59 UTC (permalink / raw)
To: Jim Mattson
Cc: Paolo Bonzini, kvm list, LKML, Vitaly Kuznetsov, Wanpeng Li,
Joerg Roedel, Aaron Lewis, Sean Christopherson
On Mon, 2021-06-21 at 11:01 -0700, Jim Mattson wrote:
> On Mon, Jan 18, 2021 at 2:22 AM Mohammed Gamal <mgamal@redhat.com>
> wrote:
> >
> > On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> > > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com>
> > > wrote:
> > > >
> > > > This patch exposes allow_smaller_maxphyaddr to the user as a
> > > > module
> > > > parameter.
> > > >
> > > > Since smaller physical address spaces are only supported on
> > > > VMX,
> > > > the parameter
> > > > is only exposed in the kvm_intel module.
> > > > Modifications to VMX page fault and EPT violation handling will
> > > > depend on whether
> > > > that parameter is enabled.
> > > >
> > > > Also disable support by default, and let the user decide if
> > > > they
> > > > want to enable
> > > > it.
> > > >
> > > > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > > > ---
> > > > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > > > arch/x86/kvm/vmx/vmx.h | 3 +++
> > > > arch/x86/kvm/x86.c | 2 +-
> > > > 3 files changed, 10 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 819c185adf09..dc778c7b5a06 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -129,6 +129,9 @@ static bool __read_mostly
> > > > enable_preemption_timer = 1;
> > > > module_param_named(preemption_timer, enable_preemption_timer,
> > > > bool, S_IRUGO);
> > > > #endif
> > > >
> > > > +extern bool __read_mostly allow_smaller_maxphyaddr;
> > >
> > > Since this variable is in the kvm module rather than the
> > > kvm_intel
> > > module, its current setting is preserved across "rmmod kvm_intel;
> > > modprobe kvm_intel." That is, if set to true, it doesn't revert
> > > to
> > > false after "rmmod kvm_intel." Is that the intended behavior?
> > >
> >
> > IIRC, this is because this setting was indeed not intended to be
> > just VMX-specific, but since AMD has an issue with PTE
> > accessed-bits being set by hardware, we can't yet enable this
> > feature on it; it might make sense to move the variable to the
> > kvm_intel module for now.
>
> Um...
>
> We do allow it for SVM, if NPT is not enabled. In fact, we set it
> unconditionally in that case. See commit 3edd68399dc15 ("KVM: x86:
> Add
> a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support").
>
> Perhaps it should be a module parameter for SVM as well?
Hmmm, given how AMD CPUs behave with NPT enabled, I think it'd
actually be a better idea to move this entirely to VMX for the time
being, and then maybe make it available again on AMD only if the
behavior with NPT changes.
>
> And, in any case, it would be nice if the parameter reverted to false
> when the kvm_intel module is unloaded.
>
> > Paolo, what do you think?
> >
> >
>