Re: [PATCH 2/2] KVM: X86/MMU: Introduce role.passthrough for shadowing 5-level NPT for 4-level NPT L1 guest

From: Paolo Bonzini <pbonzini@redhat.com>
To: Lai Jiangshan <jiangshanlai@gmail.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Sean Christopherson <seanjc@google.com>
Cc: Lai Jiangshan <jiangshan.ljs@antgroup.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH 2/2] KVM: X86/MMU: Introduce role.passthrough for shadowing 5-level NPT for 4-level NPT L1 guest
Date: Wed, 20 Apr 2022 19:15:04 +0200	[thread overview]
Message-ID: <3e6018ad-06b7-f465-7b74-aeb20bca9489@redhat.com> (raw)
In-Reply-To: <20220420131204.2850-3-jiangshanlai@gmail.com>

On 4/20/22 15:12, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
> allocated with role.level = 5 and the guest pagetable's root gfn.
> 
> And root_sp->spt[0] is also allocated with the same gfn and the same
> role except role.level = 4.  Luckily that they are different shadow
> pages, but only root_sp->spt[0] is the real translation of the guest
> pagetable.
> 
> Here comes a problem:
> 
> If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
> and uses the same gfn as the root page for nested NPT before and after
> switching gCR4_LA57.  The host (hCR4_LA57=1) might use the same root_sp
> for the guest even the guest switches gCR4_LA57.  The guest will see
> unexpected page mapped and L2 may exploit the bug and hurt L1.  It is
> lucky that the problem can't hurt L0.
> 
> And three special cases need to be handled:
> 
> The root_sp should be like role.direct=1 sometimes: its contents are
> not backed by gptes, root_sp->gfns is meaningless.  (For a normal high
> level sp in shadow paging, sp->gfns is often unused and kept zero, but
> it could be relevant and meaningful if sp->gfns is used because they
> are backed by concrete gptes.)
> 
> For such root_sp in the case, root_sp is just a portal to contribute
> root_sp->spt[0], and root_sp->gfns should not be used and
> root_sp->spt[0] should not be dropped if gpte[0] of the guest root
> pagetable is changed.
> 
> Such root_sp should not be accounted too.
> 
> So add role.passthrough to distinguish the shadow pages in the hash
> when gCR4_LA57 is toggled and fix above special cases by using it in
> kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Looks good, and is easy to backport; thanks.

However, once we have also your series "KVM: X86/MMU: Use one-off 
special shadow page for special roots", perhaps the passthrough sp can 
be converted to a special root (but now with a shadow_page) as well?

Paolo

> ---
>   Documentation/virt/kvm/mmu.rst  |  3 +++
>   arch/x86/include/asm/kvm_host.h |  5 +++--
>   arch/x86/kvm/mmu/mmu.c          | 16 ++++++++++++++++
>   arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
>   4 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
> index 5b1ebad24c77..4018b9d7a0d3 100644
> --- a/Documentation/virt/kvm/mmu.rst
> +++ b/Documentation/virt/kvm/mmu.rst
> @@ -202,6 +202,9 @@ Shadow pages contain the following information:
>       Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
>       bits before Haswell; shadow EPT page tables also cannot use A/D bits
>       if the L1 hypervisor does not enable them.
> +  role.passthrough:
> +    Is 1 if role.level = 5 when shadowing 5-level shadow NPT for
> +    4-level NPT L1.
>     gfn:
>       Either the guest page table containing the translations shadowed by this
>       page, or the base page frame for linear translations.  See role.direct.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9694dd5e6ccc..d4f8f4784d87 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -285,7 +285,7 @@ struct kvm_kernel_irq_routing_entry;
>    * minimize the size of kvm_memory_slot.arch.gfn_track, i.e. allows allocating
>    * 2 bytes per gfn instead of 4 bytes per gfn.
>    *
> - * Indirect upper-level shadow pages are tracked for write-protection via
> + * Upper-level shadow pages having gptes are tracked for write-protection via
>    * gfn_track.  As above, gfn_track is a 16 bit counter, so KVM must not create
>    * more than 2^16-1 upper-level shadow pages at a single gfn, otherwise
>    * gfn_track will overflow and explosions will ensure.
> @@ -331,7 +331,8 @@ union kvm_mmu_page_role {
>   		unsigned smap_andnot_wp:1;
>   		unsigned ad_disabled:1;
>   		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned passthrough:1;
> +		unsigned :5;
>   
>   		/*
>   		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1bdff55218ef..d14cb6f99cb1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -737,6 +737,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>   
>   static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>   {
> +	if (sp->role.passthrough)
> +		return sp->gfn;
> +
>   	if (!sp->role.direct)
>   		return sp->gfns[index];
>   
> @@ -745,6 +748,11 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>   
>   static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>   {
> +	if (sp->role.passthrough) {
> +		WARN_ON_ONCE(gfn != sp->gfn);
> +		return;
> +	}
> +
>   	if (!sp->role.direct) {
>   		sp->gfns[index] = gfn;
>   		return;
> @@ -1861,6 +1869,9 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
>   	if (sp->role.direct)
>   		return false;
>   
> +	if (sp->role.passthrough)
> +		return false;
> +
>   	return true;
>   }
>   
> @@ -2059,6 +2070,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>   		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>   		role.quadrant = quadrant;
>   	}
> +	if (level <= vcpu->arch.mmu->root_level)
> +		role.passthrough = 0;
>   
>   	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>   	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -4890,6 +4903,9 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>   
>   	role.base.direct = false;
>   	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level == PT64_ROOT_5LEVEL &&
> +	    role_regs_to_root_level(regs) == PT64_ROOT_4LEVEL)
> +		role.base.passthrough = 1;
>   
>   	return role;
>   }
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 8621188b46df..c1b975fb85a2 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1042,6 +1042,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>   		.level = 0xf,
>   		.access = 0x7,
>   		.quadrant = 0x3,
> +		.passthrough = 0x1,
>   	};
>   
>   	/*