* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
@ 2020-09-03 17:57 ` Jim Mattson
2020-09-03 18:03 ` Paolo Bonzini
2020-09-23 13:45 ` Paolo Bonzini
` (2 subsequent siblings)
3 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2020-09-03 17:57 UTC (permalink / raw)
To: Mohammed Gamal
Cc: kvm list, Paolo Bonzini, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
I think a smaller guest physical address width *should* be allowed.
However, perhaps the pedantic adherence to the architectural
specification could be turned on or off per-VM? And, if we're going to
be pedantic, I think we should go all the way and get MOV-to-CR3
correct.
Does the typical guest care about whether or not setting any of the
bits 51:46 in a PFN results in a fault?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 17:57 ` Jim Mattson
@ 2020-09-03 18:03 ` Paolo Bonzini
2020-09-03 18:32 ` Jim Mattson
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-03 18:03 UTC (permalink / raw)
To: Jim Mattson, Mohammed Gamal
Cc: kvm list, LKML, Sean Christopherson, Vitaly Kuznetsov,
Wanpeng Li, Joerg Roedel
On 03/09/20 19:57, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>>
>> Since smaller physical address spaces are only supported on VMX, the parameter
>> is only exposed in the kvm_intel module.
>> Modifications to VMX page fault and EPT violation handling will depend on whether
>> that parameter is enabled.
>>
>> Also disable support by default, and let the user decide if they want to enable
>> it.
>
> I think a smaller guest physical address width *should* be allowed.
> However, perhaps the pedantic adherence to the architectural
> specification could be turned on or off per-VM? And, if we're going to
> be pedantic, I think we should go all the way and get MOV-to-CR3
> correct.
That would be way too slow. Even the current trapping of present #PF
can introduce some slowdown depending on the workload.
> Does the typical guest care about whether or not setting any of the
> bits 51:46 in a PFN results in a fault?
At least KVM with shadow pages does, which is a bit niche but it shows
that you cannot really rely on no one doing it. As you guessed, the
main usage of the feature is for machines with 5-level page tables where
there are no reserved bits; emulating smaller MAXPHYADDR allows
migrating VMs from 4-level page-table hosts.
Enabling per-VM would not be particularly useful IMO because if you want
to disable this code you can just set host MAXPHYADDR = guest
MAXPHYADDR, which should be the common case unless you want to do that
kind of Skylake to Icelake (or similar) migration.
Paolo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 18:03 ` Paolo Bonzini
@ 2020-09-03 18:32 ` Jim Mattson
2020-09-03 20:02 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2020-09-03 18:32 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mohammed Gamal, kvm list, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On Thu, Sep 3, 2020 at 11:03 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 03/09/20 19:57, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
> >> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
> >>
> >> Since smaller physical address spaces are only supported on VMX, the parameter
> >> is only exposed in the kvm_intel module.
> >> Modifications to VMX page fault and EPT violation handling will depend on whether
> >> that parameter is enabled.
> >>
> >> Also disable support by default, and let the user decide if they want to enable
> >> it.
> >
> > I think a smaller guest physical address width *should* be allowed.
> > However, perhaps the pedantic adherence to the architectural
> > specification could be turned on or off per-VM? And, if we're going to
> > be pedantic, I think we should go all the way and get MOV-to-CR3
> > correct.
>
> That would be way too slow. Even the current trapping of present #PF
> can introduce some slowdown depending on the workload.
Yes, I was concerned about that...which is why I would not want to
enable pedantic mode. But if you're going to be pedantic, why go
halfway?
> > Does the typical guest care about whether or not setting any of the
> > bits 51:46 in a PFN results in a fault?
>
> At least KVM with shadow pages does, which is a bit niche but it shows
> that you cannot really rely on no one doing it. As you guessed, the
> main usage of the feature is for machines with 5-level page tables where
> there are no reserved bits; emulating smaller MAXPHYADDR allows
> migrating VMs from 4-level page-table hosts.
>
> Enabling per-VM would not be particularly useful IMO because if you want
> to disable this code you can just set host MAXPHYADDR = guest
> MAXPHYADDR, which should be the common case unless you want to do that
> kind of Skylake to Icelake (or similar) migration.
I expect that it will be quite common to run 46-bit wide legacy VMs on
Ice Lake hardware, as Ice Lake machines start showing up in
heterogeneous data centers.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 18:32 ` Jim Mattson
@ 2020-09-03 20:02 ` Paolo Bonzini
2020-09-03 21:26 ` Jim Mattson
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-03 20:02 UTC (permalink / raw)
To: Jim Mattson
Cc: Mohammed Gamal, kvm list, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On 03/09/20 20:32, Jim Mattson wrote:
>> [Checking writes to CR3] would be way too slow. Even the current
>> trapping of present #PF can introduce some slowdown depending on the
>> workload.
>
> Yes, I was concerned about that...which is why I would not want to
> enable pedantic mode. But if you're going to be pedantic, why go
> halfway?
Because I am not sure about any guest, even KVM, caring about setting
bits 51:46 in CR3.
>>> Does the typical guest care about whether or not setting any of the
>>> bits 51:46 in a PFN results in a fault?
>>
>> At least KVM with shadow pages does, which is a bit niche but it shows
>> that you cannot really rely on no one doing it. As you guessed, the
>> main usage of the feature is for machines with 5-level page tables where
>> there are no reserved bits; emulating smaller MAXPHYADDR allows
>> migrating VMs from 4-level page-table hosts.
>>
>> Enabling per-VM would not be particularly useful IMO because if you want
>> to disable this code you can just set host MAXPHYADDR = guest
>> MAXPHYADDR, which should be the common case unless you want to do that
>> kind of Skylake to Icelake (or similar) migration.
>
> I expect that it will be quite common to run 46-bit wide legacy VMs on
> Ice Lake hardware, as Ice Lake machines start showing up in
> heterogeneous data centers.
If you'll be okay with running _all_ 46-bit wide legacy VMs without
MAXPHYADDR emulation, that's what this patch is for. If you'll be okay
with running _only_ 46-bit wide VMs without emulation, you still don't
need special enabling per-VM beyond the automatic one based on
CPUID[0x8000_0008]. Do you think you'll need to enable it for some
special 46-bit VMs?
Paolo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 20:02 ` Paolo Bonzini
@ 2020-09-03 21:26 ` Jim Mattson
0 siblings, 0 replies; 19+ messages in thread
From: Jim Mattson @ 2020-09-03 21:26 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Mohammed Gamal, kvm list, LKML, Sean Christopherson,
Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel
On Thu, Sep 3, 2020 at 1:02 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 03/09/20 20:32, Jim Mattson wrote:
> >> [Checking writes to CR3] would be way too slow. Even the current
> >> trapping of present #PF can introduce some slowdown depending on the
> >> workload.
> >
> > Yes, I was concerned about that...which is why I would not want to
> > enable pedantic mode. But if you're going to be pedantic, why go
> > halfway?
>
> Because I am not sure about any guest, even KVM, caring about setting
> bits 51:46 in CR3.
>
> >>> Does the typical guest care about whether or not setting any of the
> >>> bits 51:46 in a PFN results in a fault?
> >>
> >> At least KVM with shadow pages does, which is a bit niche but it shows
> >> that you cannot really rely on no one doing it. As you guessed, the
> >> main usage of the feature is for machines with 5-level page tables where
> >> there are no reserved bits; emulating smaller MAXPHYADDR allows
> >> migrating VMs from 4-level page-table hosts.
> >>
> >> Enabling per-VM would not be particularly useful IMO because if you want
> >> to disable this code you can just set host MAXPHYADDR = guest
> >> MAXPHYADDR, which should be the common case unless you want to do that
> >> kind of Skylake to Icelake (or similar) migration.
> >
> > I expect that it will be quite common to run 46-bit wide legacy VMs on
> > Ice Lake hardware, as Ice Lake machines start showing up in
> > heterogeneous data centers.
>
> If you'll be okay with running _all_ 46-bit wide legacy VMs without
> MAXPHYADDR emulation, that's what this patch is for. If you'll be okay
> with running _only_ 46-bit wide VMs without emulation, you still don't
> need special enabling per-VM beyond the automatic one based on
> CPUID[0x8000_0008]. Do you think you'll need to enable it for some
> special 46-bit VMs?
Yes. From what you've said above, we would only want to enable this
for the niche case of 46-bit KVM guests using shadow paging. I would
expect that to be a very small number of VMs. :-)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
2020-09-03 17:57 ` Jim Mattson
@ 2020-09-23 13:45 ` Paolo Bonzini
[not found] ` <CALMp9eTHbhwfdq4Be=XcUG9z82KK8AapQeVmsdH=mGdQ_Yt2ug@mail.gmail.com>
2020-09-28 15:34 ` Qian Cai
2021-01-16 0:08 ` Jim Mattson
3 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-23 13:45 UTC (permalink / raw)
To: Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli, jmattson, joro
On 03/09/20 16:11, Mohammed Gamal wrote:
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
> +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> + && allow_smaller_maxphyaddr) {
> /*
> * EPT will cause page fault only if we need to
> * detect illegal GPAs.
> @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * would also use advanced VM-exit information for EPT violations to
> * reconstruct the page fault error code.
> */
> - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
> return kvm_emulate_instruction(vcpu, 0);
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> #endif
> vmx_check_vmcs12_offsets();
>
> - /*
> - * Intel processors don't have problems with
> - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> - * it for VMX by default
> - */
> - allow_smaller_maxphyaddr = true;
> -
> return 0;
> }
> module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 26175a4759fa..b859435efa2e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> + if (!allow_smaller_maxphyaddr)
> + return false;
> +
> return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
This needs to return true if !enable_ept.
	if (!enable_ept)
		return true;

	return allow_smaller_maxphyaddr &&
	       cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
Fixed and queued, thanks.
Paolo
> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d39d6cf1d473..982f1d73a884 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
>
> -bool __read_mostly allow_smaller_maxphyaddr;
> +bool __read_mostly allow_smaller_maxphyaddr = 0;
> EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
>
> static u64 __read_mostly host_xss;
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
2020-09-03 17:57 ` Jim Mattson
2020-09-23 13:45 ` Paolo Bonzini
@ 2020-09-28 15:34 ` Qian Cai
2020-09-29 11:59 ` Qian Cai
2021-01-16 0:08 ` Jim Mattson
3 siblings, 1 reply; 19+ messages in thread
From: Qian Cai @ 2020-09-28 15:34 UTC (permalink / raw)
To: Mohammed Gamal, kvm, pbonzini
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On Thu, 2020-09-03 at 16:11 +0200, Mohammed Gamal wrote:
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on
> whether that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to
> enable it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Running a simple SR-IOV test on Intel triggers the warning below:
.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config
P.S.: I did confirm the linux-next included this hunk as well:
https://lore.kernel.org/kvm/8c7ce8ff-a212-a974-3829-c45eb5335651@redhat.com/
[ 1119.752137][ T7441] WARNING: CPU: 27 PID: 7441 at arch/x86/kvm/vmx/vmx.c:4809 handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1119.763312][ T7441] Modules linked in: loop nls_ascii nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ses enclosure efivarfs ip_tables x_tables sd_mod nvme tg3 firmware_class smartpqi nvme_core libphy scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 1119.786660][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted: G I 5.9.0-rc7-next-20200928+ #2
[ 1119.796572][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 1119.805870][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1119.812788][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8 <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
[ 1119.832384][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
[ 1119.838353][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX: 0000000000000000
[ 1119.846247][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2b98940
[ 1119.854124][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09: ffffed11d1fcd071
[ 1119.862012][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12: ffff8888f6129000
[ 1119.869903][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15: 0000000000000000
[ 1119.877795][ T7441] FS: 00007fc3277fe700(0000) GS:ffff88901f640000(0000) knlGS:0000000000000000
[ 1119.886649][ T7441] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1119.893127][ T7441] CR2: 0000000000000000 CR3: 0000000ab7376002 CR4: 00000000007726e0
[ 1119.901022][ T7441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1119.908916][ T7441] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1119.916806][ T7441] PKRU: 55555554
[ 1119.920226][ T7441] Call Trace:
[ 1119.923426][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
[ 1119.928877][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1119.934335][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
[ 1119.941029][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
[ 1119.947177][ T7441] ? rcu_read_unlock+0x40/0x40
[ 1119.951822][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1119.957272][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1119.962441][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1119.968325][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
[ 1119.973433][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
[ 1119.979776][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1119.985945][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1119.991924][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
[ 1119.997047][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
[ 1120.002305][ T7441] ? find_held_lock+0x33/0x1c0
[ 1120.006968][ T7441] ? __fget_files+0x1a4/0x2e0
[ 1120.011533][ T7441] ? lock_downgrade+0x730/0x730
[ 1120.016283][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
[ 1120.021714][ T7441] ? __fget_files+0x1c3/0x2e0
[ 1120.026287][ T7441] __x64_sys_ioctl+0x315/0xfc0
[ 1120.030933][ T7441] ? generic_block_fiemap+0x60/0x60
[ 1120.036034][ T7441] ? up_read+0x1a3/0x730
[ 1120.040158][ T7441] ? down_read_nested+0x420/0x420
[ 1120.045080][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
[ 1120.050862][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.056311][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.061743][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.067629][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
[ 1120.073410][ T7441] do_syscall_64+0x33/0x40
[ 1120.077723][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1120.083505][ T7441] RIP: 0033:0x7fc3368c687b
[ 1120.087813][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
[ 1120.107408][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1120.115739][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX: 00007fc3368c687b
[ 1120.123622][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ 1120.131513][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09: 0000000000000000
[ 1120.139402][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12: 00005613156b4100
[ 1120.147294][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15: 00005613177da760
[ 1120.155185][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted: G I 5.9.0-rc7-next-20200928+ #2
[ 1120.165072][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 1120.174345][ T7441] Call Trace:
[ 1120.177509][ T7441] dump_stack+0x99/0xcb
[ 1120.181548][ T7441] __warn.cold.13+0xe/0x55
[ 1120.185848][ T7441] ? handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1120.192157][ T7441] report_bug+0x1af/0x260
[ 1120.196367][ T7441] handle_bug+0x44/0x80
[ 1120.200400][ T7441] exc_invalid_op+0x13/0x40
[ 1120.204784][ T7441] asm_exc_invalid_op+0x12/0x20
[ 1120.209520][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
[ 1120.216435][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8 <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
[ 1120.236016][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
[ 1120.241973][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX: 0000000000000000
[ 1120.249851][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2b98940
[ 1120.257729][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09: ffffed11d1fcd071
[ 1120.265605][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12: ffff8888f6129000
[ 1120.273483][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15: 0000000000000000
[ 1120.281364][ T7441] ? handle_exception_nmi+0x788/0xe60 [kvm_intel]
[ 1120.287695][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
[ 1120.293126][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.298582][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
[ 1120.305262][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
[ 1120.311392][ T7441] ? rcu_read_unlock+0x40/0x40
[ 1120.316038][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.321469][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
[ 1120.326639][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.332506][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
[ 1120.337613][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
[ 1120.343942][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1120.350096][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
[ 1120.356076][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
[ 1120.361180][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
[ 1120.366436][ T7441] ? find_held_lock+0x33/0x1c0
[ 1120.371082][ T7441] ? __fget_files+0x1a4/0x2e0
[ 1120.375639][ T7441] ? lock_downgrade+0x730/0x730
[ 1120.380371][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
[ 1120.385802][ T7441] ? __fget_files+0x1c3/0x2e0
[ 1120.390360][ T7441] __x64_sys_ioctl+0x315/0xfc0
[ 1120.395006][ T7441] ? generic_block_fiemap+0x60/0x60
[ 1120.400088][ T7441] ? up_read+0x1a3/0x730
[ 1120.404209][ T7441] ? down_read_nested+0x420/0x420
[ 1120.409116][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
[ 1120.414898][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.420329][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
[ 1120.425761][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
[ 1120.431628][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
[ 1120.437410][ T7441] do_syscall_64+0x33/0x40
[ 1120.441704][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1120.447484][ T7441] RIP: 0033:0x7fc3368c687b
[ 1120.451779][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
[ 1120.471361][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1120.479676][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX: 00007fc3368c687b
[ 1120.487552][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[ 1120.495429][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09: 0000000000000000
[ 1120.503305][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12: 00005613156b4100
[ 1120.511181][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15: 00005613177da760
[ 1120.519104][ T7441] irq event stamp: 4321673
[ 1120.523400][ T7441] hardirqs last enabled at (4321681): [<ffffffffa6c2a76f>] console_unlock+0x81f/0xa20
[ 1120.532948][ T7441] hardirqs last disabled at (4321690): [<ffffffffa6c2a67b>] console_unlock+0x72b/0xa20
[ 1120.542495][ T7441] softirqs last enabled at (4321672): [<ffffffffa800061b>] __do_softirq+0x61b/0x95d
[ 1120.551864][ T7441] softirqs last disabled at (4321665): [<ffffffffa7e00ec2>] asm_call_irq_on_stack+0x12/0x20
[ 1120.561852][ T7441] ---[ end trace 31c2bbb23abc5aa2 ]---
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
> +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> + && allow_smaller_maxphyaddr) {
> /*
> * EPT will cause page fault only if we need to
> * detect illegal GPAs.
> @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> * would also use advanced VM-exit information for EPT violations to
> * reconstruct the page fault error code.
> */
> - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) && allow_smaller_maxphyaddr)
> return kvm_emulate_instruction(vcpu, 0);
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> #endif
> vmx_check_vmcs12_offsets();
>
> - /*
> - * Intel processors don't have problems with
> - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> - * it for VMX by default
> - */
> - allow_smaller_maxphyaddr = true;
> -
> return 0;
> }
> module_init(vmx_init);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 26175a4759fa..b859435efa2e 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> + if (!allow_smaller_maxphyaddr)
> + return false;
> +
> return !enable_ept || cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d39d6cf1d473..982f1d73a884 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> u64 __read_mostly host_efer;
> EXPORT_SYMBOL_GPL(host_efer);
>
> -bool __read_mostly allow_smaller_maxphyaddr;
> +bool __read_mostly allow_smaller_maxphyaddr = 0;
> EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
>
> static u64 __read_mostly host_xss;
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-28 15:34 ` Qian Cai
@ 2020-09-29 11:59 ` Qian Cai
2020-09-29 12:26 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Qian Cai @ 2020-09-29 11:59 UTC (permalink / raw)
To: Mohammed Gamal, kvm, pbonzini
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On Mon, 2020-09-28 at 11:34 -0400, Qian Cai wrote:
> On Thu, 2020-09-03 at 16:11 +0200, Mohammed Gamal wrote:
> > This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
> >
> > Since smaller physical address spaces are only supported on VMX, the parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will depend on
> > whether that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they want to
> > enable it.
> >
> > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
>
> Running a simple SR-IOV test on Intel triggers the warning below:
>
> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config
>
> P.S.: I did confirm the linux-next included this hunk as well:
> https://lore.kernel.org/kvm/8c7ce8ff-a212-a974-3829-c45eb5335651@redhat.com/
>
>
> [ 1119.752137][ T7441] WARNING: CPU: 27 PID: 7441 at arch/x86/kvm/vmx/vmx.c:4809 handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
That is:
WARN_ON_ONCE(!allow_smaller_maxphyaddr);
I noticed the original patch did not have this WARN_ON_ONCE(), but the mainline
commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address space
support user-configurable") does have it for some reason.
Paolo, any idea?
> [ ... rest of the quoted splat trimmed; it is identical to the one above ... ]
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
> [ 1120.107408][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 1120.115739][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX:
> 00007fc3368c687b
> [ 1120.123622][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000017
> [ 1120.131513][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09:
> 0000000000000000
> [ 1120.139402][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00005613156b4100
> [ 1120.147294][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15:
> 00005613177da760
> [ 1120.155185][ T7441] CPU: 27 PID: 7441 Comm: qemu-kvm Tainted:
> G I 5.9.0-rc7-next-20200928+ #2
> [ 1120.165072][ T7441] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560
> Gen10, BIOS U34 11/13/2019
> [ 1120.174345][ T7441] Call Trace:
> [ 1120.177509][ T7441] dump_stack+0x99/0xcb
> [ 1120.181548][ T7441] __warn.cold.13+0xe/0x55
> [ 1120.185848][ T7441] ? handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1120.192157][ T7441] report_bug+0x1af/0x260
> [ 1120.196367][ T7441] handle_bug+0x44/0x80
> [ 1120.200400][ T7441] exc_invalid_op+0x13/0x40
> [ 1120.204784][ T7441] asm_exc_invalid_op+0x12/0x20
> [ 1120.209520][ T7441] RIP: 0010:handle_exception_nmi+0xbfc/0xe60 [kvm_intel]
> [ 1120.216435][ T7441] Code: 00 00 85 d2 0f 84 9c f5 ff ff c7 83 ac 0e 00 00
> 00 00 00 00 48 83 c4 20 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b4 80 33 f8
> <0f> 0b e9 20 fc ff ff 4c 89 ef e8 25 22 95 dc e9 23 f4 ff ff 4c 89
> [ 1120.236016][ T7441] RSP: 0018:ffffc90027887998 EFLAGS: 00010246
> [ 1120.241973][ T7441] RAX: ffff88800008a000 RBX: ffff888e8fe68040 RCX:
> 0000000000000000
> [ 1120.249851][ T7441] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffffffc2b98940
> [ 1120.257729][ T7441] RBP: 0000000080000b0e R08: ffffed11d1fcd071 R09:
> ffffed11d1fcd071
> [ 1120.265605][ T7441] R10: ffff888e8fe68387 R11: ffffed11d1fcd070 R12:
> ffff8888f6129000
> [ 1120.273483][ T7441] R13: ffff888e8fe68130 R14: ffff888e8fe68380 R15:
> 0000000000000000
> [ 1120.281364][ T7441] ? handle_exception_nmi+0x788/0xe60 [kvm_intel]
> [ 1120.287695][ T7441] vcpu_enter_guest+0x1ef4/0x4850 [kvm]
> [ 1120.293126][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.298582][ T7441] ? kvm_vcpu_reload_apic_access_page+0x50/0x50 [kvm]
> [ 1120.305262][ T7441] ? kvm_arch_vcpu_ioctl_run+0x1de/0x1340 [kvm]
> [ 1120.311392][ T7441] ? rcu_read_unlock+0x40/0x40
> [ 1120.316038][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.321469][ T7441] ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 1120.326639][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.332506][ T7441] ? __local_bh_enable_ip+0xa0/0xe0
> [ 1120.337613][ T7441] ? kvm_load_guest_fpu.isra.128+0x79/0x2d0 [kvm]
> [ 1120.343942][ T7441] ? kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1120.350096][ T7441] kvm_arch_vcpu_ioctl_run+0x377/0x1340 [kvm]
> [ 1120.356076][ T7441] kvm_vcpu_ioctl+0x3f2/0xad0 [kvm]
> [ 1120.361180][ T7441] ? kvm_vcpu_block+0xc40/0xc40 [kvm]
> [ 1120.366436][ T7441] ? find_held_lock+0x33/0x1c0
> [ 1120.371082][ T7441] ? __fget_files+0x1a4/0x2e0
> [ 1120.375639][ T7441] ? lock_downgrade+0x730/0x730
> [ 1120.380371][ T7441] ? rcu_read_lock_sched_held+0xd0/0xd0
> [ 1120.385802][ T7441] ? __fget_files+0x1c3/0x2e0
> [ 1120.390360][ T7441] __x64_sys_ioctl+0x315/0xfc0
> [ 1120.395006][ T7441] ? generic_block_fiemap+0x60/0x60
> [ 1120.400088][ T7441] ? up_read+0x1a3/0x730
> [ 1120.404209][ T7441] ? down_read_nested+0x420/0x420
> [ 1120.409116][ T7441] ? syscall_enter_from_user_mode+0x17/0x50
> [ 1120.414898][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.420329][ T7441] ? rcu_read_lock_sched_held+0x9c/0xd0
> [ 1120.425761][ T7441] ? lockdep_hardirqs_on_prepare+0x32b/0x4d0
> [ 1120.431628][ T7441] ? syscall_enter_from_user_mode+0x1c/0x50
> [ 1120.437410][ T7441] do_syscall_64+0x33/0x40
> [ 1120.441704][ T7441] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1120.447484][ T7441] RIP: 0033:0x7fc3368c687b
> [ 1120.451779][ T7441] Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
> [ 1120.471361][ T7441] RSP: 002b:00007fc3277fd678 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 1120.479676][ T7441] RAX: ffffffffffffffda RBX: 00007fc33bbf2001 RCX:
> 00007fc3368c687b
> [ 1120.487552][ T7441] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000017
> [ 1120.495429][ T7441] RBP: 0000000000000001 R08: 00005613156cbad0 R09:
> 0000000000000000
> [ 1120.503305][ T7441] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00005613156b4100
> [ 1120.511181][ T7441] R13: 0000000000000000 R14: 00007fc33bbf1000 R15:
> 00005613177da760
> [ 1120.519104][ T7441] irq event stamp: 4321673
> [ 1120.523400][ T7441] hardirqs last enabled at (4321681):
> [<ffffffffa6c2a76f>] console_unlock+0x81f/0xa20
> [ 1120.532948][ T7441] hardirqs last disabled at (4321690):
> [<ffffffffa6c2a67b>] console_unlock+0x72b/0xa20
> [ 1120.542495][ T7441] softirqs last enabled at (4321672):
> [<ffffffffa800061b>] __do_softirq+0x61b/0x95d
> [ 1120.551864][ T7441] softirqs last disabled at (4321665):
> [<ffffffffa7e00ec2>] asm_call_irq_on_stack+0x12/0x20
> [ 1120.561852][ T7441] ---[ end trace 31c2bbb23abc5aa2 ]---
>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > arch/x86/kvm/vmx/vmx.h | 3 +++
> > arch/x86/kvm/x86.c | 2 +-
> > 3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> > module_param_named(preemption_timer, enable_preemption_timer, bool,
> > S_IRUGO);
> > #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
> > +module_param(allow_smaller_maxphyaddr, bool, S_IRUGO | S_IWUSR);
> > +
> > #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> > #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> > #define KVM_VM_CR0_ALWAYS_ON \
> > @@ -4798,7 +4801,8 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> >
> > if (is_page_fault(intr_info)) {
> > cr2 = vmx_get_exit_qual(vcpu);
> > - if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> > + if (enable_ept && !vcpu->arch.apf.host_apf_flags
> > + && allow_smaller_maxphyaddr) {
> > /*
> > * EPT will cause page fault only if we need to
> > * detect illegal GPAs.
> > @@ -5331,7 +5335,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > * would also use advanced VM-exit information for EPT violations to
> > * reconstruct the page fault error code.
> > */
> > - if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> > + if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)) &&
> > allow_smaller_maxphyaddr)
> > return kvm_emulate_instruction(vcpu, 0);
> >
> > return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > @@ -8303,13 +8307,6 @@ static int __init vmx_init(void)
> > #endif
> > vmx_check_vmcs12_offsets();
> >
> > - /*
> > - * Intel processors don't have problems with
> > - * GUEST_MAXPHYADDR < HOST_MAXPHYADDR so enable
> > - * it for VMX by default
> > - */
> > - allow_smaller_maxphyaddr = true;
> > -
> > return 0;
> > }
> > module_init(vmx_init);
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index 26175a4759fa..b859435efa2e 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -551,6 +551,9 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
> >
> > static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> > {
> > + if (!allow_smaller_maxphyaddr)
> > + return false;
> > +
> > return !enable_ept || cpuid_maxphyaddr(vcpu) <
> > boot_cpu_data.x86_phys_bits;
> > }
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d39d6cf1d473..982f1d73a884 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -188,7 +188,7 @@ static struct kvm_shared_msrs __percpu *shared_msrs;
> > u64 __read_mostly host_efer;
> > EXPORT_SYMBOL_GPL(host_efer);
> >
> > -bool __read_mostly allow_smaller_maxphyaddr;
> > +bool __read_mostly allow_smaller_maxphyaddr = 0;
> > EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
> >
> > static u64 __read_mostly host_xss;
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 11:59 ` Qian Cai
@ 2020-09-29 12:26 ` Paolo Bonzini
2020-09-29 13:39 ` Qian Cai
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-29 12:26 UTC (permalink / raw)
To: Qian Cai, Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On 29/09/20 13:59, Qian Cai wrote:
>
> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
>
> I noticed the original patch did not have this WARN_ON_ONCE(), but the mainline
> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address space
> support user-configurable") does have it for some reason.
Because that part of the code should not be reached. The exception
bitmap is set up with
if (!vmx_need_pf_intercept(vcpu))
eb &= ~(1u << PF_VECTOR);
where
static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
{
if (!enable_ept)
return true;
return allow_smaller_maxphyaddr &&
cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
}
We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
genuine; I've sent a patch.
Paolo
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 12:26 ` Paolo Bonzini
@ 2020-09-29 13:39 ` Qian Cai
2020-09-29 14:47 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Qian Cai @ 2020-09-29 13:39 UTC (permalink / raw)
To: Paolo Bonzini, Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
> On 29/09/20 13:59, Qian Cai wrote:
> > WARN_ON_ONCE(!allow_smaller_maxphyaddr);
> >
> > I noticed the original patch did not have this WARN_ON_ONCE(), but the
> > mainline
> > commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
> > space
> > support user-configurable") does have it for some reason.
>
> Because that part of the code should not be reached. The exception
> bitmap is set up with
>
> if (!vmx_need_pf_intercept(vcpu))
> eb &= ~(1u << PF_VECTOR);
>
> where
>
> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> {
> if (!enable_ept)
> return true;
>
> return allow_smaller_maxphyaddr &&
> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> }
>
> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
> genuine; I've sent a patch.
Care to provide a link to the patch? Just curious.
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 13:39 ` Qian Cai
@ 2020-09-29 14:47 ` Paolo Bonzini
2020-10-02 17:28 ` Naresh Kamboju
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2020-09-29 14:47 UTC (permalink / raw)
To: Qian Cai, Mohammed Gamal, kvm
Cc: linux-kernel, sean.j.christopherson, vkuznets, wanpengli,
jmattson, joro, Stephen Rothwell, linux-next, Linus Torvalds
On 29/09/20 15:39, Qian Cai wrote:
> On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
>> On 29/09/20 13:59, Qian Cai wrote:
>>> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
>>>
>>> I noticed the original patch did not have this WARN_ON_ONCE(), but the
>>> mainline
>>> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
>>> space
>>> support user-configurable") does have it for some reason.
>>
>> Because that part of the code should not be reached. The exception
>> bitmap is set up with
>>
>> if (!vmx_need_pf_intercept(vcpu))
>> eb &= ~(1u << PF_VECTOR);
>>
>> where
>>
>> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
>> {
>> if (!enable_ept)
>> return true;
>>
>> return allow_smaller_maxphyaddr &&
>> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
>> }
>>
>> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
>> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
>> genuine; I've sent a patch.
>
> Care to provide a link to the patch? Just curious.
>
Ok, I haven't sent it yet. :) But here it is:
commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Tue Sep 29 08:31:32 2020 -0400
KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
the #PF intercept bit in the exception bitmap when they do not match.
This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
hypervisor can get a vmexit for #PF exceptions even when the
corresponding bit is clear in the exception bitmap.
This is unexpected and is promptly reported as a WARN_ON_ONCE.
To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0384e93548a..f4e9c310032a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -794,6 +794,18 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
*/
if (is_guest_mode(vcpu))
eb |= get_vmcs12(vcpu)->exception_bitmap;
+ else {
+ /*
+ * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
+ * between guest and host. In that case we only care about present
+ * faults. For vmcs02, however, PFEC_MASK and PFEC_MATCH are set in
+ * prepare_vmcs02_rare.
+ */
+ bool selective_pf_trap = enable_ept && (eb & (1u << PF_VECTOR));
+ int mask = selective_pf_trap ? PFERR_PRESENT_MASK : 0;
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, mask);
+ vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, mask);
+ }
vmcs_write32(EXCEPTION_BITMAP, eb);
}
@@ -4355,16 +4367,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmx->pt_desc.guest.output_mask = 0x7F;
vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
}
-
- /*
- * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
- * between guest and host. In that case we only care about present
- * faults.
- */
- if (enable_ept) {
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
- vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
- }
}
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-29 14:47 ` Paolo Bonzini
@ 2020-10-02 17:28 ` Naresh Kamboju
2020-10-02 17:30 ` Paolo Bonzini
0 siblings, 1 reply; 19+ messages in thread
From: Naresh Kamboju @ 2020-10-02 17:28 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Qian Cai, Mohammed Gamal, kvm list, open list,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Stephen Rothwell, Linux-Next Mailing List,
Linus Torvalds, lkft-triage
Hi Paolo,
Thanks for the patch.
On Tue, 29 Sep 2020 at 20:17, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/09/20 15:39, Qian Cai wrote:
> > On Tue, 2020-09-29 at 14:26 +0200, Paolo Bonzini wrote:
> >> On 29/09/20 13:59, Qian Cai wrote:
> >>> WARN_ON_ONCE(!allow_smaller_maxphyaddr);
> >>>
> >>> I noticed the original patch did not have this WARN_ON_ONCE(), but the
> >>> mainline
> >>> commit b96e6506c2ea ("KVM: x86: VMX: Make smaller physical guest address
> >>> space
> >>> support user-configurable") does have it for some reason.
> >>
> >> Because that part of the code should not be reached. The exception
> >> bitmap is set up with
> >>
> >> if (!vmx_need_pf_intercept(vcpu))
> >> eb &= ~(1u << PF_VECTOR);
> >>
> >> where
> >>
> >> static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
> >> {
> >> if (!enable_ept)
> >> return true;
> >>
> >> return allow_smaller_maxphyaddr &&
> >> cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
> >> }
> >>
> >> We shouldn't get here if "enable_ept && !allow_smaller_maxphyaddr",
> >> which implies vmx_need_pf_intercept(vcpu) == false. So the warning is
> >> genuine; I've sent a patch.
> >
> > Care to provide a link to the patch? Just curious.
> >
>
> Ok, I haven't sent it yet. :) But here it is:
>
> commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date: Tue Sep 29 08:31:32 2020 -0400
>
> KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
>
> The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
> the #PF intercept bit in the exception bitmap when they do not match.
> This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
> hypervisor can get a vmexit for #PF exceptions even when the
> corresponding bit is clear in the exception bitmap.
>
> This is unexpected and is promptly reported as a WARN_ON_ONCE.
> To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
> is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
I have tested this patch on an x86_64 machine and the reported issue is gone.
>
> Reported-by: Qian Cai <cai@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f0384e93548a..f4e9c310032a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -794,6 +794,18 @@ void update_exception_bitmap(struct kvm_vcpu *vcpu)
> */
> if (is_guest_mode(vcpu))
> eb |= get_vmcs12(vcpu)->exception_bitmap;
> + else {
> + /*
> + * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
> + * between guest and host. In that case we only care about present
> + * faults. For vmcs02, however, PFEC_MASK and PFEC_MATCH are set in
> + * prepare_vmcs02_rare.
> + */
> + bool selective_pf_trap = enable_ept && (eb & (1u << PF_VECTOR));
> + int mask = selective_pf_trap ? PFERR_PRESENT_MASK : 0;
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, mask);
> + vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, mask);
> + }
>
> vmcs_write32(EXCEPTION_BITMAP, eb);
> }
> @@ -4355,16 +4367,6 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> vmx->pt_desc.guest.output_mask = 0x7F;
> vmcs_write64(GUEST_IA32_RTIT_CTL, 0);
> }
> -
> - /*
> - * If EPT is enabled, #PF is only trapped if MAXPHYADDR is mismatched
> - * between guest and host. In that case we only care about present
> - * faults.
> - */
> - if (enable_ept) {
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, PFERR_PRESENT_MASK);
> - vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, PFERR_PRESENT_MASK);
> - }
> }
>
> static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
test log link
https://lkft.validation.linaro.org/scheduler/job/1813223
- Naresh
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-10-02 17:28 ` Naresh Kamboju
@ 2020-10-02 17:30 ` Paolo Bonzini
0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2020-10-02 17:30 UTC (permalink / raw)
To: Naresh Kamboju
Cc: Qian Cai, Mohammed Gamal, kvm list, open list,
Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, Stephen Rothwell, Linux-Next Mailing List,
Linus Torvalds, lkft-triage
On 02/10/20 19:28, Naresh Kamboju wrote:
>>
>> commit 608e2791d7353e7d777bf32038ca3e7d548155a4 (HEAD -> kvm-master)
>> Author: Paolo Bonzini <pbonzini@redhat.com>
>> Date: Tue Sep 29 08:31:32 2020 -0400
>>
>> KVM: VMX: update PFEC_MASK/PFEC_MATCH together with PF intercept
>>
>> The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
>> the #PF intercept bit in the exception bitmap when they do not match.
>> This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
>> hypervisor can get a vmexit for #PF exceptions even when the
>> corresponding bit is clear in the exception bitmap.
>>
>> This is unexpected and is promptly reported as a WARN_ON_ONCE.
>> To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
>> is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
> I have tested this patch on an x86_64 machine and the reported issue is gone.
>
Thanks, the issue with my disk is gone too so it'll get to Linus in time
for rc8.
Paolo
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2020-09-03 14:11 [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable Mohammed Gamal
` (2 preceding siblings ...)
2020-09-28 15:34 ` Qian Cai
@ 2021-01-16 0:08 ` Jim Mattson
2021-01-18 10:18 ` Mohammed Gamal
3 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2021-01-16 0:08 UTC (permalink / raw)
To: Mohammed Gamal
Cc: kvm list, Paolo Bonzini, LKML, Vitaly Kuznetsov, Wanpeng Li,
Joerg Roedel, Aaron Lewis, Sean Christopherson
On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to enable
> it.
>
> Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> arch/x86/kvm/vmx/vmx.h | 3 +++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
> module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;
Since this variable is in the kvm module rather than the kvm_intel
module, its current setting is preserved across "rmmod kvm_intel;
modprobe kvm_intel." That is, if set to true, it doesn't revert to
false after "rmmod kvm_intel." Is that the intended behavior?
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2021-01-16 0:08 ` Jim Mattson
@ 2021-01-18 10:18 ` Mohammed Gamal
2021-06-21 18:01 ` Jim Mattson
0 siblings, 1 reply; 19+ messages in thread
From: Mohammed Gamal @ 2021-01-18 10:18 UTC (permalink / raw)
To: Jim Mattson, Paolo Bonzini
Cc: kvm list, LKML, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
Aaron Lewis, Sean Christopherson
On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com>
> wrote:
> >
> > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > parameter.
> >
> > Since smaller physical address spaces are only supported on VMX,
> > the parameter
> > is only exposed in the kvm_intel module.
> > Modifications to VMX page fault and EPT violation handling will
> > depend on whether
> > that parameter is enabled.
> >
> > Also disable support by default, and let the user decide if they
> > want to enable
> > it.
> >
> > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > arch/x86/kvm/vmx/vmx.h | 3 +++
> > arch/x86/kvm/x86.c | 2 +-
> > 3 files changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 819c185adf09..dc778c7b5a06 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -129,6 +129,9 @@ static bool __read_mostly
> > enable_preemption_timer = 1;
> > module_param_named(preemption_timer, enable_preemption_timer,
> > bool, S_IRUGO);
> > #endif
> >
> > +extern bool __read_mostly allow_smaller_maxphyaddr;
>
> Since this variable is in the kvm module rather than the kvm_intel
> module, its current setting is preserved across "rmmod kvm_intel;
> modprobe kvm_intel." That is, if set to true, it doesn't revert to
> false after "rmmod kvm_intel." Is that the intended behavior?
>
IIRC, this is because this setting was indeed not intended to be just
VMX-specific, but since AMD has an issue with PTE accessed-bits being
set by hardware, we can't yet enable this feature on it; it might
make sense to move the variable to the kvm_intel module for now.
Paolo, what do you think?
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2021-01-18 10:18 ` Mohammed Gamal
@ 2021-06-21 18:01 ` Jim Mattson
2021-06-22 12:59 ` Mohammed Gamal
0 siblings, 1 reply; 19+ messages in thread
From: Jim Mattson @ 2021-06-21 18:01 UTC (permalink / raw)
To: Mohammed Gamal
Cc: Paolo Bonzini, kvm list, LKML, Vitaly Kuznetsov, Wanpeng Li,
Joerg Roedel, Aaron Lewis, Sean Christopherson
On Mon, Jan 18, 2021 at 2:22 AM Mohammed Gamal <mgamal@redhat.com> wrote:
>
> On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com>
> > wrote:
> > >
> > > This patch exposes allow_smaller_maxphyaddr to the user as a module
> > > parameter.
> > >
> > > Since smaller physical address spaces are only supported on VMX,
> > > the parameter
> > > is only exposed in the kvm_intel module.
> > > Modifications to VMX page fault and EPT violation handling will
> > > depend on whether
> > > that parameter is enabled.
> > >
> > > Also disable support by default, and let the user decide if they
> > > want to enable
> > > it.
> > >
> > > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > > arch/x86/kvm/vmx/vmx.h | 3 +++
> > > arch/x86/kvm/x86.c | 2 +-
> > > 3 files changed, 10 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 819c185adf09..dc778c7b5a06 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -129,6 +129,9 @@ static bool __read_mostly
> > > enable_preemption_timer = 1;
> > > module_param_named(preemption_timer, enable_preemption_timer,
> > > bool, S_IRUGO);
> > > #endif
> > >
> > > +extern bool __read_mostly allow_smaller_maxphyaddr;
> >
> > Since this variable is in the kvm module rather than the kvm_intel
> > module, its current setting is preserved across "rmmod kvm_intel;
> > modprobe kvm_intel." That is, if set to true, it doesn't revert to
> > false after "rmmod kvm_intel." Is that the intended behavior?
> >
>
> IIRC, this is because this setting was indeed not intended to be just
> VMX-specific, but since AMD has an issue with PTE accessed-bits being
> set by hardware, we can't yet enable this feature on it; it might
> make sense to move the variable to the kvm_intel module for now.
Um...
We do allow it for SVM, if NPT is not enabled. In fact, we set it
unconditionally in that case. See commit 3edd68399dc15 ("KVM: x86: Add
a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support").
Perhaps it should be a module parameter for SVM as well?
And, in any case, it would be nice if the parameter reverted to false
when the kvm_intel module is unloaded.
> Paolo, what do you think?
>
>
* Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable
2021-06-21 18:01 ` Jim Mattson
@ 2021-06-22 12:59 ` Mohammed Gamal
0 siblings, 0 replies; 19+ messages in thread
From: Mohammed Gamal @ 2021-06-22 12:59 UTC (permalink / raw)
To: Jim Mattson
Cc: Paolo Bonzini, kvm list, LKML, Vitaly Kuznetsov, Wanpeng Li,
Joerg Roedel, Aaron Lewis, Sean Christopherson
On Mon, 2021-06-21 at 11:01 -0700, Jim Mattson wrote:
> On Mon, Jan 18, 2021 at 2:22 AM Mohammed Gamal <mgamal@redhat.com>
> wrote:
> >
> > On Fri, 2021-01-15 at 16:08 -0800, Jim Mattson wrote:
> > > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal <mgamal@redhat.com>
> > > wrote:
> > > >
> > > > This patch exposes allow_smaller_maxphyaddr to the user as a
> > > > module
> > > > parameter.
> > > >
> > > > Since smaller physical address spaces are only supported on
> > > > VMX,
> > > > the parameter
> > > > is only exposed in the kvm_intel module.
> > > > Modifications to VMX page fault and EPT violation handling will
> > > > depend on whether
> > > > that parameter is enabled.
> > > >
> > > > Also disable support by default, and let the user decide if
> > > > they
> > > > want to enable
> > > > it.
> > > >
> > > > Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
> > > > ---
> > > > arch/x86/kvm/vmx/vmx.c | 15 ++++++---------
> > > > arch/x86/kvm/vmx/vmx.h | 3 +++
> > > > arch/x86/kvm/x86.c | 2 +-
> > > > 3 files changed, 10 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 819c185adf09..dc778c7b5a06 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -129,6 +129,9 @@ static bool __read_mostly
> > > > enable_preemption_timer = 1;
> > > > module_param_named(preemption_timer, enable_preemption_timer,
> > > > bool, S_IRUGO);
> > > > #endif
> > > >
> > > > +extern bool __read_mostly allow_smaller_maxphyaddr;
> > >
> > > Since this variable is in the kvm module rather than the
> > > kvm_intel
> > > module, its current setting is preserved across "rmmod kvm_intel;
> > > modprobe kvm_intel." That is, if set to true, it doesn't revert
> > > to
> > > false after "rmmod kvm_intel." Is that the intended behavior?
> > >
> >
> > IIRC, this is because this setting was indeed not intended to be
> > just VMX-specific, but since AMD has an issue with PTE
> > accessed-bits being set by hardware, we can't yet enable this
> > feature on it; it might make sense to move the variable to the
> > kvm_intel module for now.
>
> Um...
>
> We do allow it for SVM, if NPT is not enabled. In fact, we set it
> unconditionally in that case. See commit 3edd68399dc15 ("KVM: x86:
> Add
> a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support").
>
> Perhaps it should be a module parameter for SVM as well?
Hmmm, given how AMD CPUs behave with NPT enabled, I think it'd
actually be a better idea to move this entirely to VMX for the time
being, and then maybe make it available again on AMD only if the
behavior with NPT changes.
>
> And, in any case, it would be nice if the parameter reverted to false
> when the kvm_intel module is unloaded.
>
> > Paolo, what do you think?
> >
> >
>