* [PATCH] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Vitaly Kuznetsov @ 2019-02-22 16:46 UTC
To: kvm
Cc: Paolo Bonzini, Radim Krčmář, Junaid Shahid, linux-kernel

I noticed that fast_cr3_switch() always fails when we switch back from L2
to L1 as it is not able to find a cached root. This is odd: the host's CR3
usually stays the same, so we expect to always follow the fast path. Turns
out the problem is that the page role is always mismatched because
kvm_mmu_get_page() filters out cr4_pae when direct; the value is stored in
the page header and later compared with new_role in
cached_root_available(). As cr4_pae is always set in long mode, the
prev_roots cache is dysfunctional.

The problem appeared after we introduced kvm_calc_mmu_role_common():
previously, we were only setting this bit for the shadow MMU root, but now
we set it for everything. Restore the original behavior.

Fixes: 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC: Alternatively, I can suggest two solutions:
- Do not clear cr4_pae in kvm_mmu_get_page() and check direct on call sites
  (detect_write_misaligned(), get_written_sptes()).
- Filter cr4_pae out when direct in kvm_mmu_new_cr3().

The code in kvm_mmu_get_page() is very ancient, I'm afraid to touch it :-)
---
 arch/x86/kvm/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f2d1d230d5b8..c729e98eee49 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4791,7 +4791,6 @@ static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
 
 	role.base.access = ACC_ALL;
 	role.base.nxe = !!is_nx(vcpu);
-	role.base.cr4_pae = !!is_pae(vcpu);
 	role.base.cr0_wp = is_write_protection(vcpu);
 	role.base.smm = is_smm(vcpu);
 	role.base.guest_mode = is_guest_mode(vcpu);
@@ -4871,6 +4870,8 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
 {
 	union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, base_only);
 
+	role.base.cr4_pae = !!is_pae(vcpu);
+
 	role.base.smep_andnot_wp = role.ext.cr4_smep &&
 		!is_write_protection(vcpu);
 	role.base.smap_andnot_wp = role.ext.cr4_smap &&
-- 
2.20.1
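[For readers reconstructing the failure mode, here is a standalone C model
of the role-word mismatch described above. It is not part of the patch: the
bitfield layout is reduced to the bits that matter here and is an
illustrative assumption, not the kernel's actual kvm_mmu_page_role layout.]

/* role_mismatch.c - standalone model; build with: cc -o role_mismatch role_mismatch.c */
#include <stdio.h>

/* Reduced stand-in for union kvm_mmu_page_role: only the relevant bits. */
union role {
	struct {
		unsigned level   : 4;
		unsigned direct  : 1;
		unsigned cr4_pae : 1;
	};
	unsigned word;
};

int main(void)
{
	union role wanted = {0}, stored = {0};

	/*
	 * After 7dcd57552008, kvm_calc_mmu_role_common() sets cr4_pae for
	 * every MMU, so the role computed on a CR3 switch has the bit set
	 * in long mode even for a direct (TDP/NPT) root.
	 */
	wanted.level = 4;
	wanted.direct = 1;
	wanted.cr4_pae = 1;

	/*
	 * kvm_mmu_get_page() clears cr4_pae for direct roots before the
	 * role is stored in the page header.
	 */
	stored = wanted;
	if (stored.direct)
		stored.cr4_pae = 0;

	/*
	 * cached_root_available() compares the raw role words, so the
	 * lookup misses on every L2->L1 switch.
	 */
	printf("cached root match: %s\n",
	       wanted.word == stored.word ? "yes" : "no");
	return 0;
}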
* Re: [PATCH] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Paolo Bonzini @ 2019-02-22 17:17 UTC
To: Vitaly Kuznetsov, kvm
Cc: Radim Krčmář, Junaid Shahid, linux-kernel

On 22/02/19 17:46, Vitaly Kuznetsov wrote:
> I noticed that fast_cr3_switch() always fails when we switch back from L2
> to L1 as it is not able to find a cached root. This is odd: the host's CR3
> usually stays the same, so we expect to always follow the fast path. Turns
> out the problem is that the page role is always mismatched because
> kvm_mmu_get_page() filters out cr4_pae when direct; the value is stored in
> the page header and later compared with new_role in
> cached_root_available(). As cr4_pae is always set in long mode, the
> prev_roots cache is dysfunctional.

Really cr4_pae means "are the PTEs 8 bytes". So I think your patch is
correct, but on top we should set it to 1 (not zero!!) for
kvm_calc_shadow_ept_root_page_role, init_kvm_nested_mmu and
kvm_calc_tdp_mmu_root_page_role. Or maybe everything breaks with that
change.

> - Do not clear cr4_pae in kvm_mmu_get_page() and check direct on call sites
>   (detect_write_misaligned(), get_written_sptes()).

These only run with shadow page tables, by the way.

Paolo
* Re: [PATCH] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Vitaly Kuznetsov @ 2019-02-22 18:49 UTC
To: Paolo Bonzini, kvm
Cc: Radim Krčmář, Junaid Shahid, linux-kernel

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 22/02/19 17:46, Vitaly Kuznetsov wrote:
>> I noticed that fast_cr3_switch() always fails when we switch back from L2
>> to L1 as it is not able to find a cached root. This is odd: the host's CR3
>> usually stays the same, so we expect to always follow the fast path. Turns
>> out the problem is that the page role is always mismatched because
>> kvm_mmu_get_page() filters out cr4_pae when direct; the value is stored in
>> the page header and later compared with new_role in
>> cached_root_available(). As cr4_pae is always set in long mode, the
>> prev_roots cache is dysfunctional.
>
> Really cr4_pae means "are the PTEs 8 bytes". So I think your patch is
> correct, but on top we should set it to 1 (not zero!!) for
> kvm_calc_shadow_ept_root_page_role, init_kvm_nested_mmu and
> kvm_calc_tdp_mmu_root_page_role. Or maybe everything breaks with that
> change.

Yes, exactly. If we put '1' there, kvm_mmu_get_page() will again filter
it out and we won't be able to find the root in the prev_roots cache :-(

>> - Do not clear cr4_pae in kvm_mmu_get_page() and check direct on call sites
>>   (detect_write_misaligned(), get_written_sptes()).
>
> These only run with shadow page tables, by the way.

Yes, and that's why I think it may make sense to move the filtering
logic there. At least in other cases cr4_pae will always be equal to
is_pae().

It seems I know too little about shadow paging and all these corner
cases :-(

-- 
Vitaly
* Re: [PATCH] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Paolo Bonzini @ 2019-02-22 20:29 UTC
To: Vitaly Kuznetsov, kvm
Cc: Radim Krčmář, Junaid Shahid, linux-kernel

On 22/02/19 19:49, Vitaly Kuznetsov wrote:
>> Really cr4_pae means "are the PTEs 8 bytes". So I think your patch is
>> correct, but on top we should set it to 1 (not zero!!) for
>> kvm_calc_shadow_ept_root_page_role, init_kvm_nested_mmu and
>> kvm_calc_tdp_mmu_root_page_role. Or maybe everything breaks with that
>> change.
>
> Yes, exactly. If we put '1' there, kvm_mmu_get_page() will again filter
> it out and we won't be able to find the root in the prev_roots cache :-(

Well, of course then kvm_mmu_get_page() would have to remove the
filtering.

Paolo
* [PATCH v2 RFC] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Vitaly Kuznetsov @ 2019-02-23 11:15 UTC
To: kvm
Cc: Paolo Bonzini, Radim Krčmář, Junaid Shahid, linux-kernel

Alternative patch: remove the filtering from kvm_mmu_get_page() and check
for direct at the call sites. The cr4_pae setting in
kvm_calc_mmu_role_common() can be preserved for consistency.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/mmu.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f2d1d230d5b8..7fb8118f2af6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2420,8 +2420,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	role = vcpu->arch.mmu->mmu_role.base;
 	role.level = level;
 	role.direct = direct;
-	if (role.direct)
-		role.cr4_pae = 0;
 	role.access = access;
 	if (!vcpu->arch.mmu->direct_map
 	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
@@ -5176,7 +5174,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
 		 gpa, bytes, sp->role.word);
 
 	offset = offset_in_page(gpa);
-	pte_size = sp->role.cr4_pae ? 8 : 4;
+	pte_size = (sp->role.cr4_pae && !sp->role.direct) ? 8 : 4;
 
 	/*
 	 * Sometimes, the OS only writes the last one bytes to update status
@@ -5200,7 +5198,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
 	page_offset = offset_in_page(gpa);
 	level = sp->role.level;
 	*nspte = 1;
-	if (!sp->role.cr4_pae) {
+	if (!sp->role.cr4_pae || sp->role.direct) {
 		page_offset <<= 1;	/* 32->64 */
 		/*
 		 * A 32-bit pde maps 4MB while the shadow pdes map
-- 
2.20.1
* Re: [PATCH v2 RFC] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Vitaly Kuznetsov @ 2019-03-07 14:07 UTC
To: Paolo Bonzini, Radim Krčmář
Cc: kvm, Junaid Shahid, linux-kernel

Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Alternative patch: remove the filtering from kvm_mmu_get_page() and check
> for direct at the call sites. The cr4_pae setting in
> kvm_calc_mmu_role_common() can be preserved for consistency.
>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/mmu.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f2d1d230d5b8..7fb8118f2af6 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2420,8 +2420,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	role = vcpu->arch.mmu->mmu_role.base;
>  	role.level = level;
>  	role.direct = direct;
> -	if (role.direct)
> -		role.cr4_pae = 0;
>  	role.access = access;
>  	if (!vcpu->arch.mmu->direct_map
>  	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
> @@ -5176,7 +5174,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
>  		 gpa, bytes, sp->role.word);
>  
>  	offset = offset_in_page(gpa);
> -	pte_size = sp->role.cr4_pae ? 8 : 4;
> +	pte_size = (sp->role.cr4_pae && !sp->role.direct) ? 8 : 4;
>  
>  	/*
>  	 * Sometimes, the OS only writes the last one bytes to update status
> @@ -5200,7 +5198,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
>  	page_offset = offset_in_page(gpa);
>  	level = sp->role.level;
>  	*nspte = 1;
> -	if (!sp->role.cr4_pae) {
> +	if (!sp->role.cr4_pae || sp->role.direct) {
>  		page_offset <<= 1;	/* 32->64 */
>  		/*
>  		 * A 32-bit pde maps 4MB while the shadow pdes map

While I personally prefer this approach to not setting role.cr4_pae in
kvm_calc_mmu_role_common(), I'd like to get the maintainers' opinion (I
did test the patch with ept=off, but I have to admit I don't know much
about shadow page tables, which actually use detect_write_misaligned()/
get_written_sptes()).

Paolo, Radim, (anyone else) - any thoughts?

-- 
Vitaly
* Re: [PATCH v2 RFC] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Sean Christopherson @ 2019-03-07 16:06 UTC
To: Vitaly Kuznetsov
Cc: Paolo Bonzini, Radim Krčmář, kvm, Junaid Shahid, linux-kernel

On Thu, Mar 07, 2019 at 03:07:05PM +0100, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov <vkuznets@redhat.com> writes:
>
> > Alternative patch: remove the filtering from kvm_mmu_get_page() and check
> > for direct at the call sites. The cr4_pae setting in
> > kvm_calc_mmu_role_common() can be preserved for consistency.
> >
> > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > ---
> >  arch/x86/kvm/mmu.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index f2d1d230d5b8..7fb8118f2af6 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -2420,8 +2420,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >  	role = vcpu->arch.mmu->mmu_role.base;
> >  	role.level = level;
> >  	role.direct = direct;
> > -	if (role.direct)
> > -		role.cr4_pae = 0;
> >  	role.access = access;
> >  	if (!vcpu->arch.mmu->direct_map
> >  	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
> > @@ -5176,7 +5174,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
> >  		 gpa, bytes, sp->role.word);
> >  
> >  	offset = offset_in_page(gpa);
> > -	pte_size = sp->role.cr4_pae ? 8 : 4;
> > +	pte_size = (sp->role.cr4_pae && !sp->role.direct) ? 8 : 4;
> >  
> >  	/*
> >  	 * Sometimes, the OS only writes the last one bytes to update status
> > @@ -5200,7 +5198,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
> >  	page_offset = offset_in_page(gpa);
> >  	level = sp->role.level;
> >  	*nspte = 1;
> > -	if (!sp->role.cr4_pae) {
> > +	if (!sp->role.cr4_pae || sp->role.direct) {
> >  		page_offset <<= 1;	/* 32->64 */
> >  		/*
> >  		 * A 32-bit pde maps 4MB while the shadow pdes map
>
> While I personally prefer this approach to not setting role.cr4_pae in
> kvm_calc_mmu_role_common(), I'd like to get the maintainers' opinion (I
> did test the patch with ept=off, but I have to admit I don't know much
> about shadow page tables, which actually use detect_write_misaligned()/
> get_written_sptes()).
>
> Paolo, Radim, (anyone else) - any thoughts?

The changes to detect_write_misaligned() and get_written_sptes() are
wrong/unnecessary, as those functions should only be called for indirect
shadow PTEs, e.g.:

	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
		if (detect_write_misaligned(sp, gpa, bytes) ||
		      detect_write_flooding(sp)) {
			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
			++vcpu->kvm->stat.mmu_flooded;
			continue;
		}

		spte = get_written_sptes(sp, gpa, &npte);
		if (!spte)
			continue;

		...
	}

If anything, they could WARN_ON(sp->role.direct), but IMO that's overkill
since they're static helpers with a single call site (above).

I'm missing the background for this patch: why does clearing role.cr4_pae
for direct SPTEs cause problems with the prev_roots cache?
* Re: [PATCH v2 RFC] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Vitaly Kuznetsov @ 2019-03-07 16:41 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, Radim Krčmář, kvm, Junaid Shahid, linux-kernel

Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, Mar 07, 2019 at 03:07:05PM +0100, Vitaly Kuznetsov wrote:
>> Vitaly Kuznetsov <vkuznets@redhat.com> writes:
>>
>> > Alternative patch: remove the filtering from kvm_mmu_get_page() and check
>> > for direct at the call sites. The cr4_pae setting in
>> > kvm_calc_mmu_role_common() can be preserved for consistency.
>> >
>> > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> > ---
>> >  arch/x86/kvm/mmu.c | 6 ++----
>> >  1 file changed, 2 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> > index f2d1d230d5b8..7fb8118f2af6 100644
>> > --- a/arch/x86/kvm/mmu.c
>> > +++ b/arch/x86/kvm/mmu.c
>> > @@ -2420,8 +2420,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>> >  	role = vcpu->arch.mmu->mmu_role.base;
>> >  	role.level = level;
>> >  	role.direct = direct;
>> > -	if (role.direct)
>> > -		role.cr4_pae = 0;
>> >  	role.access = access;
>> >  	if (!vcpu->arch.mmu->direct_map
>> >  	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
>> > @@ -5176,7 +5174,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
>> >  		 gpa, bytes, sp->role.word);
>> >  
>> >  	offset = offset_in_page(gpa);
>> > -	pte_size = sp->role.cr4_pae ? 8 : 4;
>> > +	pte_size = (sp->role.cr4_pae && !sp->role.direct) ? 8 : 4;
>> >  
>> >  	/*
>> >  	 * Sometimes, the OS only writes the last one bytes to update status
>> > @@ -5200,7 +5198,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
>> >  	page_offset = offset_in_page(gpa);
>> >  	level = sp->role.level;
>> >  	*nspte = 1;
>> > -	if (!sp->role.cr4_pae) {
>> > +	if (!sp->role.cr4_pae || sp->role.direct) {
>> >  		page_offset <<= 1;	/* 32->64 */
>> >  		/*
>> >  		 * A 32-bit pde maps 4MB while the shadow pdes map
>>
>> While I personally prefer this approach to not setting role.cr4_pae in
>> kvm_calc_mmu_role_common(), I'd like to get the maintainers' opinion (I
>> did test the patch with ept=off, but I have to admit I don't know much
>> about shadow page tables, which actually use detect_write_misaligned()/
>> get_written_sptes()).
>>
>> Paolo, Radim, (anyone else) - any thoughts?
>
> The changes to detect_write_misaligned() and get_written_sptes() are
> wrong/unnecessary, as those functions should only be called for indirect
> shadow PTEs, e.g.:
>
> 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> 		if (detect_write_misaligned(sp, gpa, bytes) ||
> 		      detect_write_flooding(sp)) {
> 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> 			++vcpu->kvm->stat.mmu_flooded;
> 			continue;
> 		}
>
> 		spte = get_written_sptes(sp, gpa, &npte);
> 		if (!spte)
> 			continue;
>
> 		...
> 	}
>
> If anything, they could WARN_ON(sp->role.direct), but IMO that's overkill
> since they're static helpers with a single call site (above).
>
> I'm missing the background for this patch: why does clearing role.cr4_pae
> for direct SPTEs cause problems with the prev_roots cache?

Oh, thank you for taking a look! You've probably missed my original v1
patch where I tried to explain the issue:

"I noticed that fast_cr3_switch() always fails when we switch back from
L2 to L1 as it is not able to find a cached root. This is odd: the
host's CR3 usually stays the same, so we expect to always follow the
fast path. Turns out the problem is that the page role is always
mismatched because kvm_mmu_get_page() filters out cr4_pae when direct;
the value is stored in the page header and later compared with new_role
in cached_root_available(). As cr4_pae is always set in long mode, the
prev_roots cache is dysfunctional.

The problem appeared after we introduced kvm_calc_mmu_role_common():
previously, we were only setting this bit for the shadow MMU root, but
now we set it for everything. Restore the original behavior.

Fixes: 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
"

and the patch was just removing the role.cr4_pae setting and moving it
to kvm_calc_shadow_mmu_root_page_role(). This is an alternative
approach: always set cr4_pae, but count on kvm_mmu_get_page() filtering
out cr4_pae when direct is set.

-- 
Vitaly
* Re: [PATCH v2 RFC] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Sean Christopherson @ 2019-03-07 18:59 UTC
To: Vitaly Kuznetsov
Cc: Paolo Bonzini, Radim Krčmář, kvm, Junaid Shahid, linux-kernel

On Thu, Mar 07, 2019 at 05:41:39PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
>
> > On Thu, Mar 07, 2019 at 03:07:05PM +0100, Vitaly Kuznetsov wrote:
> >> Vitaly Kuznetsov <vkuznets@redhat.com> writes:
> >>
> >> > Alternative patch: remove the filtering from kvm_mmu_get_page() and check
> >> > for direct at the call sites. The cr4_pae setting in
> >> > kvm_calc_mmu_role_common() can be preserved for consistency.
> >> >
> >> > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> > ---
> >> >  arch/x86/kvm/mmu.c | 6 ++----
> >> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> > index f2d1d230d5b8..7fb8118f2af6 100644
> >> > --- a/arch/x86/kvm/mmu.c
> >> > +++ b/arch/x86/kvm/mmu.c
> >> > @@ -2420,8 +2420,6 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> >> >  	role = vcpu->arch.mmu->mmu_role.base;
> >> >  	role.level = level;
> >> >  	role.direct = direct;
> >> > -	if (role.direct)
> >> > -		role.cr4_pae = 0;
> >> >  	role.access = access;
> >> >  	if (!vcpu->arch.mmu->direct_map
> >> >  	    && vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
> >> > @@ -5176,7 +5174,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
> >> >  		 gpa, bytes, sp->role.word);
> >> >  
> >> >  	offset = offset_in_page(gpa);
> >> > -	pte_size = sp->role.cr4_pae ? 8 : 4;
> >> > +	pte_size = (sp->role.cr4_pae && !sp->role.direct) ? 8 : 4;
> >> >  
> >> >  	/*
> >> >  	 * Sometimes, the OS only writes the last one bytes to update status
> >> > @@ -5200,7 +5198,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
> >> >  	page_offset = offset_in_page(gpa);
> >> >  	level = sp->role.level;
> >> >  	*nspte = 1;
> >> > -	if (!sp->role.cr4_pae) {
> >> > +	if (!sp->role.cr4_pae || sp->role.direct) {
> >> >  		page_offset <<= 1;	/* 32->64 */
> >> >  		/*
> >> >  		 * A 32-bit pde maps 4MB while the shadow pdes map
> >>
> >> While I personally prefer this approach to not setting role.cr4_pae in
> >> kvm_calc_mmu_role_common(), I'd like to get the maintainers' opinion (I
> >> did test the patch with ept=off, but I have to admit I don't know much
> >> about shadow page tables, which actually use detect_write_misaligned()/
> >> get_written_sptes()).
> >>
> >> Paolo, Radim, (anyone else) - any thoughts?
> >
> > The changes to detect_write_misaligned() and get_written_sptes() are
> > wrong/unnecessary, as those functions should only be called for indirect
> > shadow PTEs, e.g.:
> >
> > 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> > 		if (detect_write_misaligned(sp, gpa, bytes) ||
> > 		      detect_write_flooding(sp)) {
> > 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> > 			++vcpu->kvm->stat.mmu_flooded;
> > 			continue;
> > 		}
> >
> > 		spte = get_written_sptes(sp, gpa, &npte);
> > 		if (!spte)
> > 			continue;
> >
> > 		...
> > 	}
> >
> > If anything, they could WARN_ON(sp->role.direct), but IMO that's overkill
> > since they're static helpers with a single call site (above).
> >
> > I'm missing the background for this patch: why does clearing role.cr4_pae
> > for direct SPTEs cause problems with the prev_roots cache?
>
> Oh, thank you for taking a look! You've probably missed my original v1
> patch where I tried to explain the issue:
>
> "I noticed that fast_cr3_switch() always fails when we switch back from
> L2 to L1 as it is not able to find a cached root. This is odd: the
> host's CR3 usually stays the same, so we expect to always follow the
> fast path. Turns out the problem is that the page role is always
> mismatched because kvm_mmu_get_page() filters out cr4_pae when direct;
> the value is stored in the page header and later compared with new_role
> in cached_root_available(). As cr4_pae is always set in long mode, the
> prev_roots cache is dysfunctional.
>
> The problem appeared after we introduced kvm_calc_mmu_role_common():
> previously, we were only setting this bit for the shadow MMU root, but
> now we set it for everything. Restore the original behavior.
>
> Fixes: 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
> "
>
> and the patch was just removing the role.cr4_pae setting and moving it
> to kvm_calc_shadow_mmu_root_page_role(). This is an alternative
> approach: always set cr4_pae, but count on kvm_mmu_get_page() filtering
> out cr4_pae when direct is set.

Ah, but neither approach solves the underlying problem that KVM can't
identify a shadow page that is associated with a nested EPT entry. For
example, kvm_calc_shadow_ept_root_page_role() should set cr4_pae=1,
i.e. it always has 8-byte guest PTEs regardless of PAE. But that would
break e.g. __kvm_sync_page(), as it would always mismatch if L2 is a
non-PAE guest (I think I got that right...).

I think what we could do is repurpose role's nxe, cr0_wp, and
sm{a,e}p_andnot_wp bits to uniquely identify a nested EPT/NPT entry.
E.g. cr0_wp=1 and sm{a,e}p_andnot_wp=1 are an impossible combination.
I'll throw together a patch to see what breaks. In fact, I think we
could revamp kvm_calc_shadow_ept_root_page_role() to completely ignore
all legacy paging bits, i.e. handling changes in L2's configuration is
L1's problem.
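[The patch Sean mentions is not part of this thread. Purely as an
illustration of the "impossible combination" idea, here is a hypothetical
sketch; the function signature and role fields follow the kernel of this
era approximately, and this is not what was eventually merged.]

/*
 * Hypothetical sketch, not a posted patch: tag nested EPT roots with a
 * bit combination that legacy paging can never produce (CR0.WP=1 forces
 * the *_andnot_wp bits to 0), so the role word of a nested EPT page can
 * never collide with a legacy-paging role word in the prev_roots cache.
 */
static union kvm_mmu_role
kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty)
{
	union kvm_mmu_role role = {0};

	/* Impossible combination under legacy paging: */
	role.base.cr0_wp = 1;
	role.base.smep_andnot_wp = 1;
	role.base.smap_andnot_wp = 1;

	/* Ignore L2's legacy paging configuration entirely. */
	role.base.level = PT64_ROOT_4LEVEL;
	role.base.direct = false;
	role.base.ad_disabled = !accessed_dirty;
	role.base.guest_mode = true;
	role.base.access = ACC_ALL;

	return role;
}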
* Re: [PATCH v2 RFC] x86/kvm/mmu: make mmu->prev_roots cache work for NPT case
From: Sean Christopherson @ 2019-03-07 19:02 UTC
To: Vitaly Kuznetsov
Cc: Paolo Bonzini, Radim Krčmář, kvm, Junaid Shahid, linux-kernel

On Thu, Mar 07, 2019 at 10:59:46AM -0800, Sean Christopherson wrote:
> I think what we could do is repurpose role's nxe, cr0_wp, and
> sm{a,e}p_andnot_wp bits to uniquely identify a nested EPT/NPT entry.

Ignore the "NPT" comment; this would only apply to EPT.

> E.g. cr0_wp=1 and sm{a,e}p_andnot_wp=1 are an impossible combination.
> I'll throw together a patch to see what breaks. In fact, I think we
> could revamp kvm_calc_shadow_ept_root_page_role() to completely ignore
> all legacy paging bits, i.e. handling changes in L2's configuration is
> L1's problem.