linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
@ 2020-11-20  9:55 Lai Jiangshan
  2020-11-26  0:05 ` Sean Christopherson
  2021-06-03  5:24 ` [PATCH V2] KVM: X86: MMU: Use " Lai Jiangshan
  0 siblings, 2 replies; 10+ messages in thread
From: Lai Jiangshan @ 2020-11-20  9:55 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Lai Jiangshan, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Avi Kivity, linux-doc

From: Lai Jiangshan <laijs@linux.alibaba.com>

Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
guest pte updates") said role.access is common access permissions for
all ptes in this shadow page, which is the inherited permissions from
the parent ptes.

But the commit did not enforce this definition when kvm_mmu_get_page()
is called in FNAME(fetch). Rather, it uses a random (last level pte's
combined) access permissions. And the permissions won't be checked again
in next FNAME(fetch) since the spte is present. It might fail to meet
guest's expectation when guest sets up spaghetti pagetables.

Fixes: 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated guest pte updates")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 Documentation/virt/kvm/mmu.rst |  4 ++--
 arch/x86/kvm/mmu/paging_tmpl.h | 14 +++++++++-----
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 1c030dbac7c4..b31586504a9a 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -171,8 +171,8 @@ Shadow pages contain the following information:
     shadow pages) so role.quadrant takes values in the range 0..3.  Each
     quadrant maps 1GB virtual address space.
   role.access:
-    Inherited guest access permissions in the form uwx.  Note execute
-    permission is positive, not negative.
+    Inherited guest access permissions from the parent ptes in the form uwx.
+    Note execute permission is positive, not negative.
   role.invalid:
     The page is invalid and should not be used.  It is a root page that is
     currently pinned (by a cpu hardware register pointing to it); once it is
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 50e268eb8e1a..00a0bfaed6e8 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -90,8 +90,8 @@ struct guest_walker {
 	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
 	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
 	bool pte_writable[PT_MAX_FULL_LEVELS];
-	unsigned pt_access;
-	unsigned pte_access;
+	unsigned int pt_access[PT_MAX_FULL_LEVELS];
+	unsigned int pte_access;
 	gfn_t gfn;
 	struct x86_exception fault;
 };
@@ -418,13 +418,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		}
 
 		walker->ptes[walker->level - 1] = pte;
+
+		/* Convert to ACC_*_MASK flags for struct guest_walker.  */
+		walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pt_access ^ walk_nx_mask);
 	} while (!is_last_gpte(mmu, walker->level, pte));
 
 	pte_pkey = FNAME(gpte_pkeys)(vcpu, pte);
 	accessed_dirty = have_ad ? pte_access & PT_GUEST_ACCESSED_MASK : 0;
 
 	/* Convert to ACC_*_MASK flags for struct guest_walker.  */
-	walker->pt_access = FNAME(gpte_access)(pt_access ^ walk_nx_mask);
 	walker->pte_access = FNAME(gpte_access)(pte_access ^ walk_nx_mask);
 	errcode = permission_fault(vcpu, mmu, walker->pte_access, pte_pkey, access);
 	if (unlikely(errcode))
@@ -463,7 +465,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	}
 
 	pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
-		 __func__, (u64)pte, walker->pte_access, walker->pt_access);
+		 __func__, (u64)pte, walker->pte_access,
+		 walker->pt_access[walker->level - 1]);
 	return 1;
 
 error:
@@ -635,7 +638,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	bool huge_page_disallowed = exec && nx_huge_page_workaround_enabled;
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
-	unsigned direct_access, access = gw->pt_access;
+	unsigned int direct_access, access;
 	int top_level, level, req_level, ret;
 	gfn_t base_gfn = gw->gfn;
 
@@ -667,6 +670,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 		sp = NULL;
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
+			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
 					      false, access);
 		}
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
  2020-11-20  9:55 [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page Lai Jiangshan
@ 2020-11-26  0:05 ` Sean Christopherson
  2020-11-27 16:48   ` Paolo Bonzini
  2021-06-03  5:24 ` [PATCH V2] KVM: X86: MMU: Use " Lai Jiangshan
  1 sibling, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2020-11-26  0:05 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Avi Kivity, linux-doc

On Fri, Nov 20, 2020, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
> guest pte updates") said role.access is common access permissions for
> all ptes in this shadow page, which is the inherited permissions from
> the parent ptes.
> 
> But the commit did not enforce this definition when kvm_mmu_get_page()
> is called in FNAME(fetch). Rather, it uses a random (last level pte's
> combined) access permissions.

I wouldn't say it's random, the issue is specifically that all shadow pages end
up using the combined set of permissions of the entire walk, as opposed to the
only combined permissions of its parents.

> And the permissions won't be checked again in next FNAME(fetch) since the
> spte is present. It might fail to meet guest's expectation when guest sets up
> spaghetti pagetables.

Can you provide details on the exact failure scenario?  It would be very helpful
for documentation and understanding.  I can see how using the full combined
permissions will cause weirdness for upper level SPs in kvm_mmu_get_page(), but
I'm struggling to connect the dots to understand how that will cause incorrect
behavior for the guest.  AFAICT, outside of the SP cache, KVM only consumes
role.access for the final/last SP.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
  2020-11-26  0:05 ` Sean Christopherson
@ 2020-11-27 16:48   ` Paolo Bonzini
  2020-11-28  2:04     ` Lai Jiangshan
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2020-11-27 16:48 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, Avi Kivity, linux-doc

On 26/11/20 01:05, Sean Christopherson wrote:
> On Fri, Nov 20, 2020, Lai Jiangshan wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
>> guest pte updates") said role.access is common access permissions for
>> all ptes in this shadow page, which is the inherited permissions from
>> the parent ptes.
>>
>> But the commit did not enforce this definition when kvm_mmu_get_page()
>> is called in FNAME(fetch). Rather, it uses a random (last level pte's
>> combined) access permissions.
> 
> I wouldn't say it's random, the issue is specifically that all shadow pages end
> up using the combined set of permissions of the entire walk, as opposed to the
> only combined permissions of its parents.
> 
>> And the permissions won't be checked again in next FNAME(fetch) since the
>> spte is present. It might fail to meet guest's expectation when guest sets up
>> spaghetti pagetables.
> 
> Can you provide details on the exact failure scenario?  It would be very helpful
> for documentation and understanding.  I can see how using the full combined
> permissions will cause weirdness for upper level SPs in kvm_mmu_get_page(), but
> I'm struggling to connect the dots to understand how that will cause incorrect
> behavior for the guest.  AFAICT, outside of the SP cache, KVM only consumes
> role.access for the final/last SP.
> 

Agreed, a unit test would be even better, but just a description in the 
commit message would be enough.

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
  2020-11-27 16:48   ` Paolo Bonzini
@ 2020-11-28  2:04     ` Lai Jiangshan
  2020-11-30 17:41       ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Lai Jiangshan @ 2020-11-28  2:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	X86 ML, H. Peter Anvin, Avi Kivity, linux-doc

On Sat, Nov 28, 2020 at 12:48 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 26/11/20 01:05, Sean Christopherson wrote:
> > On Fri, Nov 20, 2020, Lai Jiangshan wrote:
> >> From: Lai Jiangshan <laijs@linux.alibaba.com>
> >>
> >> Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
> >> guest pte updates") said role.access is common access permissions for
> >> all ptes in this shadow page, which is the inherited permissions from
> >> the parent ptes.
> >>
> >> But the commit did not enforce this definition when kvm_mmu_get_page()
> >> is called in FNAME(fetch). Rather, it uses a random (last level pte's
> >> combined) access permissions.
> >
> > I wouldn't say it's random, the issue is specifically that all shadow pages end
> > up using the combined set of permissions of the entire walk, as opposed to the
> > only combined permissions of its parents.
> >
> >> And the permissions won't be checked again in next FNAME(fetch) since the
> >> spte is present. It might fail to meet guest's expectation when guest sets up
> >> spaghetti pagetables.
> >
> > Can you provide details on the exact failure scenario?  It would be very helpful
> > for documentation and understanding.  I can see how using the full combined
> > permissions will cause weirdness for upper level SPs in kvm_mmu_get_page(), but
> > I'm struggling to connect the dots to understand how that will cause incorrect
> > behavior for the guest.  AFAICT, outside of the SP cache, KVM only consumes
> > role.access for the final/last SP.
> >
>
> Agreed, a unit test would be even better, but just a description in the
> commit message would be enough.
>
> Paolo
>

Something in my mind, but I haven't test it:

pgd[]pud[]  pmd[]        pte[]            virtual address pointers
 (same hpa as pmd2\)  /->pte1(u--)->page1 <- ptr1 (u--)
         /->pmd1(uw-)--->pte2(uw-)->page2 <- ptr2 (uw-)
pgd->pud-|           (shared pte[] as above)
         \->pmd2(u--)--->pte1(u--)->page1 <- ptr3 (u--)
 (same hpa as pmd1/)  \->pte2(uw-)->page2 <- ptr4 (u--)


pmd1 and pmd2 point to the same pte table, so:
ptr1 and ptr3 points to the same page.
ptr2 and ptr4 points to the same page.

  The guess read-accesses to ptr1 first. So the hypervisor gets the
shadow pte page table with role.access=u-- among other things.
   (Note the shadowed pmd1's access is uwx)

  And then the guest write-accesses to ptr2, and the hypervisor
set up shadow page for ptr2.
   (Note the hypervisor silencely accepts the role.access=u--
    shadow pte page table in FNAME(fetch))

  After that, the guess read-accesses to ptr3, the hypervisor
reused the same shadow pte page table as above.

  At last, the guest writes to ptr4 without vmexit nor pagefault,
Which should cause vmexit as the guest expects.

In theory, guest userspace can trick the guest kernel if the guest
kernel sets up page table like this.

Such spaghetti pagetables are unlikely to be seen in the guest.

But when the guest is using KPTI and not using SMEP. KPTI means
all pgd entries are marked NX on the lower/userspace part of
the kernel pagetable. Which means SMEP is not needed.
(see arch/x86/mm/pti.c)

Assume the guest does disable SMEP and the guest has the flaw
that the guest user can trick guest kernel to execute on lower
part of the address space.

Normally, NX bit marked on the kernel pagetable's lower pgd
entries can help in this case. But when in guest with shadowpage
in hypervisor, the guest user can make those NX bit useless.

Again, I haven't tested it neither. I will try it later and
update the patch including adding some more checks in the mmu.c.

Thanks,
Lai

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
  2020-11-28  2:04     ` Lai Jiangshan
@ 2020-11-30 17:41       ` Sean Christopherson
  2020-11-30 17:53         ` Paolo Bonzini
  2020-12-01  1:31         ` Lai Jiangshan
  0 siblings, 2 replies; 10+ messages in thread
From: Sean Christopherson @ 2020-11-30 17:41 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	X86 ML, H. Peter Anvin, Avi Kivity, linux-doc

On Sat, Nov 28, 2020, Lai Jiangshan wrote:
> On Sat, Nov 28, 2020 at 12:48 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 26/11/20 01:05, Sean Christopherson wrote:
> > > On Fri, Nov 20, 2020, Lai Jiangshan wrote:
> > >> From: Lai Jiangshan <laijs@linux.alibaba.com>
> > >>
> > >> Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
> > >> guest pte updates") said role.access is common access permissions for
> > >> all ptes in this shadow page, which is the inherited permissions from
> > >> the parent ptes.
> > >>
> > >> But the commit did not enforce this definition when kvm_mmu_get_page()
> > >> is called in FNAME(fetch). Rather, it uses a random (last level pte's
> > >> combined) access permissions.
> > >
> > > I wouldn't say it's random, the issue is specifically that all shadow pages end
> > > up using the combined set of permissions of the entire walk, as opposed to the
> > > only combined permissions of its parents.
> > >
> > >> And the permissions won't be checked again in next FNAME(fetch) since the
> > >> spte is present. It might fail to meet guest's expectation when guest sets up
> > >> spaghetti pagetables.
> > >
> > > Can you provide details on the exact failure scenario?  It would be very helpful
> > > for documentation and understanding.  I can see how using the full combined
> > > permissions will cause weirdness for upper level SPs in kvm_mmu_get_page(), but
> > > I'm struggling to connect the dots to understand how that will cause incorrect
> > > behavior for the guest.  AFAICT, outside of the SP cache, KVM only consumes
> > > role.access for the final/last SP.
> > >
> >
> > Agreed, a unit test would be even better, but just a description in the
> > commit message would be enough.
> >
> > Paolo
> >
> 
> Something in my mind, but I haven't test it:
> 
> pgd[]pud[]  pmd[]        pte[]            virtual address pointers
>  (same hpa as pmd2\)  /->pte1(u--)->page1 <- ptr1 (u--)
>          /->pmd1(uw-)--->pte2(uw-)->page2 <- ptr2 (uw-)
> pgd->pud-|           (shared pte[] as above)
>          \->pmd2(u--)--->pte1(u--)->page1 <- ptr3 (u--)
>  (same hpa as pmd1/)  \->pte2(uw-)->page2 <- ptr4 (u--)
> 
> 
> pmd1 and pmd2 point to the same pte table, so:
> ptr1 and ptr3 points to the same page.
> ptr2 and ptr4 points to the same page.
> 
>   The guess read-accesses to ptr1 first. So the hypervisor gets the
> shadow pte page table with role.access=u-- among other things.
>    (Note the shadowed pmd1's access is uwx)
> 
>   And then the guest write-accesses to ptr2, and the hypervisor
> set up shadow page for ptr2.
>    (Note the hypervisor silencely accepts the role.access=u--
>     shadow pte page table in FNAME(fetch))
> 
>   After that, the guess read-accesses to ptr3, the hypervisor
> reused the same shadow pte page table as above.
> 
>   At last, the guest writes to ptr4 without vmexit nor pagefault,
> Which should cause vmexit as the guest expects.

Hmm, yes, KVM would incorrectly handle this scenario.  But, the proposed patch
would not address the issue as KVM always maps non-leaf shadow pages with full
access permissions.

> In theory, guest userspace can trick the guest kernel if the guest
> kernel sets up page table like this.

I doubt any kernel is affected.  Providing a RO or NX view by splitting the VA
space at the PMD level is doable, but it would be much more awkward to deal with
than splitting the VAs at the PGD level (kernel vs. userspace)

E.g. Linux uses constant[*] protections for page tables, with different constant
protections for kernel v. userspace.

[*] Ignoring encryption, which is technically an address bit anyways.

> Such spaghetti pagetables are unlikely to be seen in the guest.
> 
> But when the guest is using KPTI and not using SMEP. KPTI means
> all pgd entries are marked NX on the lower/userspace part of
> the kernel pagetable. Which means SMEP is not needed.
> (see arch/x86/mm/pti.c)
> 
> Assume the guest does disable SMEP and the guest has the flaw
> that the guest user can trick guest kernel to execute on lower
> part of the address space.
> 
> Normally, NX bit marked on the kernel pagetable's lower pgd
> entries can help in this case. But when in guest with shadowpage
> in hypervisor, the guest user can make those NX bit useless.

This NX use case won't be affected.  The example above requires ptr2 and ptr4 to
use the same PGD and PUD.  If ptr2 and ptr4 use different PGDs, i.e. kernel vs.
userspace, KVM will use different shadow pages for the two PGDs, and the kernel
variant will have role.NX=1 in the leaf SPTEs.
 
> Again, I haven't tested it neither. I will try it later and
> update the patch including adding some more checks in the mmu.c.
> 
> Thanks,
> Lai

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
  2020-11-30 17:41       ` Sean Christopherson
@ 2020-11-30 17:53         ` Paolo Bonzini
  2020-12-01  1:31         ` Lai Jiangshan
  1 sibling, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2020-11-30 17:53 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: LKML, kvm, Lai Jiangshan, Jonathan Corbet, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML,
	H. Peter Anvin, Avi Kivity, linux-doc

On 30/11/20 18:41, Sean Christopherson wrote:
>>
>> pmd1 and pmd2 point to the same pte table, so:
>> ptr1 and ptr3 points to the same page.
>> ptr2 and ptr4 points to the same page.
>>
>>    The guess read-accesses to ptr1 first. So the hypervisor gets the
>> shadow pte page table with role.access=u-- among other things.
>>     (Note the shadowed pmd1's access is uwx)
>>
>>    And then the guest write-accesses to ptr2, and the hypervisor
>> set up shadow page for ptr2.
>>     (Note the hypervisor silencely accepts the role.access=u--
>>      shadow pte page table in FNAME(fetch))
>>
>>    After that, the guess read-accesses to ptr3, the hypervisor
>> reused the same shadow pte page table as above.
>>
>>    At last, the guest writes to ptr4 without vmexit nor pagefault,
>> Which should cause vmexit as the guest expects.
>
> Hmm, yes, KVM would incorrectly handle this scenario.  But, the proposed patch
> would not address the issue as KVM always maps non-leaf shadow pages with full
> access permissions.

Can we have a testcase in kvm-unit-tests?  It's okay of course if it 
only fails with ept=0.

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page
  2020-11-30 17:41       ` Sean Christopherson
  2020-11-30 17:53         ` Paolo Bonzini
@ 2020-12-01  1:31         ` Lai Jiangshan
  1 sibling, 0 replies; 10+ messages in thread
From: Lai Jiangshan @ 2020-12-01  1:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	X86 ML, H. Peter Anvin, Avi Kivity, linux-doc

On Tue, Dec 1, 2020 at 1:41 AM Sean Christopherson <seanjc@google.com> wrote:

>
> Hmm, yes, KVM would incorrectly handle this scenario.  But, the proposed patch
> would not address the issue as KVM always maps non-leaf shadow pages with full
> access permissions.
>

Is it possible to exactly copy the access permissions from the guest
for non-leaf
shadow pages? Any protection from hypervisor (such as dirty track,
rmap_write_protect)
can only play on the leaf shadow ptes.

> Can we have a testcase in kvm-unit-tests?  It's okay of course if it
> only fails with ept=0.

Yes, it may have a flaw with ept=0. I don't get what "It's okay of course"
means. Is it related to kvm-unit-tests? Or no cloud provider uses
ept=0?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH V2] KVM: X86: MMU: Use the correct inherited permissions to get shadow page
  2020-11-20  9:55 [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page Lai Jiangshan
  2020-11-26  0:05 ` Sean Christopherson
@ 2021-06-03  5:24 ` Lai Jiangshan
  2021-06-03 17:59   ` Sean Christopherson
  1 sibling, 1 reply; 10+ messages in thread
From: Lai Jiangshan @ 2021-06-03  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm, linux-doc

From: Lai Jiangshan <laijs@linux.alibaba.com>

Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
guest pte updates") said role.access is common access permissions for
all ptes in this shadow page, which is the inherited permissions from
the parent ptes, and should not from any children pte.

But the commit did not enforce this definition when kvm_mmu_get_page()
is called in FNAME(fetch). Rather, it uses a whole combined access of
the first accessing vitual address except the ternimating pte. And the
permissions won't be checked again in next FNAME(fetch) since the spte
is present. It might fail to meet guest's expectation when guest uses
shared pagetables.

For example, here is a shared pagetable:
   pgd[]   pud[]        pmd[]            virtual address pointers
                     /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
        /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
   pgd-|           (shared pmd[] as above)
        \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                     \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
  pud1 and pud2 point to the same pmd table, so:
  ptr1 and ptr3 points to the same page.
  ptr2 and ptr4 points to the same page.

(pud1 and pud2 here are pud entries, not pud pagtable pointer
 pmd1 and pmd2 here are pmd entries, not pmd pagtable pointer)

  The guess read-accesses to ptr1 first. So the hypervisor gets the
shadow page table with role.access=u-- for ptr1's pud1 and ptr1's pmd1.
(Note: current pt->access is the combined access of pgd, pud1 and
pmd1, so it is "u--".  But the current code uses this pt->access to
get pagetable for pud1 which violate the definition in the comment
which should be the combined access of pgd, pud1, a.k.a "uw-".)

  And then the guest write-accesses to ptr2, and the hypervisor
set up shadow page for ptr2.
(Note: current pt->access=uw-, but pud1 points to a shadow pmd
table with role.access=u--.  Since pud1 is present, the hypervisor
silencely accepts it without recheck the access in FNAME(fetch))

  After that, the guess read-accesses to ptr3, the hypervisor
reused the same shadow pmd page table for pud2 as ptr1.
(Note: because current pt->access=u--, which is the access of pgd, pud2
and pmd1)

  At last, the guest writes to ptr4 without vmexit nor pagefault.
Which should cause pagefault as the guest expects.

Any kind of shared pagetable might have the similar problem when in
virtual machine without TDP enabled if the permissions are different
from different ancestors.

In order to fix the problem, we change pt->access to be an array, and
any access in it will not combind accesses from child ptes.

The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/ 
Remember to test it with TDP disabled.

The problem had existed long before the commit 41074d07c78b ("KVM: MMU:
Fix inherited permissions for emulated guest pte updates"), and it
is hard to find which is the culprit.  So there is no fixes tag here.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
Changed from V1:
	Update changelog only

 Documentation/virt/kvm/mmu.rst |  4 ++--
 arch/x86/kvm/mmu/paging_tmpl.h | 14 +++++++++-----
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 5bfe28b0728e..20d85daed395 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -171,8 +171,8 @@ Shadow pages contain the following information:
     shadow pages) so role.quadrant takes values in the range 0..3.  Each
     quadrant maps 1GB virtual address space.
   role.access:
-    Inherited guest access permissions in the form uwx.  Note execute
-    permission is positive, not negative.
+    Inherited guest access permissions from the parent ptes in the form uwx.
+    Note execute permission is positive, not negative.
   role.invalid:
     The page is invalid and should not be used.  It is a root page that is
     currently pinned (by a cpu hardware register pointing to it); once it is
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 70b7e44e3035..823a5919f9fa 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -90,8 +90,8 @@ struct guest_walker {
 	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
 	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
 	bool pte_writable[PT_MAX_FULL_LEVELS];
-	unsigned pt_access;
-	unsigned pte_access;
+	unsigned int pt_access[PT_MAX_FULL_LEVELS];
+	unsigned int pte_access;
 	gfn_t gfn;
 	struct x86_exception fault;
 };
@@ -418,13 +418,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		}
 
 		walker->ptes[walker->level - 1] = pte;
+
+		/* Convert to ACC_*_MASK flags for struct guest_walker.  */
+		walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pt_access ^ walk_nx_mask);
 	} while (!is_last_gpte(mmu, walker->level, pte));
 
 	pte_pkey = FNAME(gpte_pkeys)(vcpu, pte);
 	accessed_dirty = have_ad ? pte_access & PT_GUEST_ACCESSED_MASK : 0;
 
 	/* Convert to ACC_*_MASK flags for struct guest_walker.  */
-	walker->pt_access = FNAME(gpte_access)(pt_access ^ walk_nx_mask);
 	walker->pte_access = FNAME(gpte_access)(pte_access ^ walk_nx_mask);
 	errcode = permission_fault(vcpu, mmu, walker->pte_access, pte_pkey, access);
 	if (unlikely(errcode))
@@ -463,7 +465,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	}
 
 	pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
-		 __func__, (u64)pte, walker->pte_access, walker->pt_access);
+		 __func__, (u64)pte, walker->pte_access,
+		 walker->pt_access[walker->level - 1]);
 	return 1;
 
 error:
@@ -643,7 +646,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	bool huge_page_disallowed = exec && nx_huge_page_workaround_enabled;
 	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator it;
-	unsigned direct_access, access = gw->pt_access;
+	unsigned int direct_access, access;
 	int top_level, level, req_level, ret;
 	gfn_t base_gfn = gw->gfn;
 
@@ -675,6 +678,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 		sp = NULL;
 		if (!is_shadow_present_pte(*it.sptep)) {
 			table_gfn = gw->table_gfn[it.level - 2];
+			access = gw->pt_access[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
 					      false, access);
 		}
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] KVM: X86: MMU: Use the correct inherited permissions to get shadow page
  2021-06-03  5:24 ` [PATCH V2] KVM: X86: MMU: Use " Lai Jiangshan
@ 2021-06-03 17:59   ` Sean Christopherson
  2021-06-08 17:09     ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2021-06-03 17:59 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Paolo Bonzini, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm, linux-doc

On Thu, Jun 03, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
> guest pte updates") said role.access is common access permissions for
> all ptes in this shadow page, which is the inherited permissions from
> the parent ptes, and should not from any children pte.
> 
> But the commit did not enforce this definition when kvm_mmu_get_page()
> is called in FNAME(fetch). Rather, it uses a whole combined access of

Nit: I'd like to avoid "combined access" as "combined" has a specific meaning
in Intel's EPT terminology.

> the first accessing vitual address except the ternimating pte. And the
                      ^^^^^^                    ^^^^^^^^^^^
		      virtual                   terminating

> permissions won't be checked again in next FNAME(fetch) since the spte
> is present. It might fail to meet guest's expectation when guest uses
> shared pagetables.

Alternatively, I'd be completely ok not even mentioning commit 41074d07c78b.
It's definitely a similar bug, but not directly related. 

> For example, here is a shared pagetable:
>    pgd[]   pud[]        pmd[]            virtual address pointers
>                      /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
>         /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
>    pgd-|           (shared pmd[] as above)
>         \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
>                      \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
>   pud1 and pud2 point to the same pmd table, so:
>   ptr1 and ptr3 points to the same page.
>   ptr2 and ptr4 points to the same page.
> 
> (pud1 and pud2 here are pud entries, not pud pagtable pointer
>  pmd1 and pmd2 here are pmd entries, not pmd pagtable pointer)
> 
>   The guess read-accesses to ptr1 first. So the hypervisor gets the
> shadow page table with role.access=u-- for ptr1's pud1 and ptr1's pmd1.
> (Note: current pt->access is the combined access of pgd, pud1 and
> pmd1, so it is "u--".  But the current code uses this pt->access to
> get pagetable for pud1 which violate the definition in the comment
> which should be the combined access of pgd, pud1, a.k.a "uw-".)
> 
>   And then the guest write-accesses to ptr2, and the hypervisor
> set up shadow page for ptr2.
> (Note: current pt->access=uw-, but pud1 points to a shadow pmd
> table with role.access=u--.  Since pud1 is present, the hypervisor
> silencely accepts it without recheck the access in FNAME(fetch))
> 
>   After that, the guess read-accesses to ptr3, the hypervisor
> reused the same shadow pmd page table for pud2 as ptr1.
> (Note: because current pt->access=u--, which is the access of pgd, pud2
> and pmd1)
> 
>   At last, the guest writes to ptr4 without vmexit nor pagefault.
> Which should cause pagefault as the guest expects.
> 
> Any kind of shared pagetable might have the similar problem when in
> virtual machine without TDP enabled if the permissions are different
> from different ancestors.
> 
> In order to fix the problem, we change pt->access to be an array, and
> any access in it will not combind accesses from child ptes.
                            ^^^^^^^
			    combine

This typo aside, can we rewrite this paragraph and hoist it to the top?  I love
the in-depth analysis, but the downside is that describing the change after the
analysis makes it difficult to understand the actual code change.

I'd also like to provide a more detailed explanation of why the fix works, which
ties in a bit with the first two paragraphs.

Maybe drop the first two paragraphs and combine the info into something like this?

  When computing the access permissions of a shadow page, use the effective
  permissions of the walk up to that point, i.e. the logic AND of its parents'
  permissions.  Two guest PxE entries that point at the same table gfn need to
  be shadowed with different shadow pages if their parents' permissions are
  different.  KVM currently uses the effective permissions of the last
  non-leaf entry for all non-leaf entries, which can lead to incorrectly
  reusing a shadow page if a lower-level entry has more restrictve permissions,
  and eventually result in a missing guest protection page fault.

> The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/ 
> Remember to test it with TDP disabled.
> 
> The problem had existed long before the commit 41074d07c78b ("KVM: MMU:
> Fix inherited permissions for emulated guest pte updates"), and it
> is hard to find which is the culprit.  So there is no fixes tag here.

We definitely want a Fixes or manual Cc: stable@.  I think this is appropriate:

  Fixes: cea0f0e7ea54 ("[PATCH] KVM: MMU: Shadow page table caching")

AFAICT, KVM didn't reuse shadow pages for PxEs with different parents.

> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
> Changed from V1:
> 	Update changelog only


Apologies for neglecting to respond to your questions many months ago.  

On Mon, Nov 30, 2020 at 5:31 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> On Tue, Dec 1, 2020 at 1:41 AM Sean Christopherson <seanjc@google.com> wrote:
>
> > Hmm, yes, KVM would incorrectly handle this scenario.  But, the proposed patch
> > would not address the issue as KVM always maps non-leaf shadow pages with full
> > access permissions.
> >
>
> Is it possible to exactly copy the access permissions from the guest
> for non-leaf shadow pages?

The answer is: my analysis was wrong.  KVM does map non-leaf shadow pages with
full permissions, but changing the shadow page's access permissions causes KVM
to use a different shadow page, which is key.  Using a different shadow page
means that KVM creates a completely different tree and thus a different leaf
SPTE (with more restricted permissions), even though the guest PTEs are one and
the same.

> Any protection from hypervisor (such as dirty track, rmap_write_protect) can
> only play on the leaf shadow ptes.
>
> > Can we have a testcase in kvm-unit-tests?  It's okay of course if it
> > only fails with ept=0.
>
> Yes, it may have a flaw with ept=0. I don't get what "It's okay of course"
> means. Is it related to kvm-unit-tests? Or no cloud provider uses
> ept=0?

Paolo was saying that it's okay if the unit test relies on legacy shadow paging,
as opposed to setting up nested EPT or neset NPT.

> @@ -418,13 +418,15 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  		}
>  
>  		walker->ptes[walker->level - 1] = pte;
> +
> +		/* Convert to ACC_*_MASK flags for struct guest_walker.  */

I'd drop the comment about converting the mask, that's mostly self-explanatory
given the rest of the code.  What I'd like to do instead is explain why
walker->pt_access[] needs to compute the parent permissiona and only the parent
permissions.

> +		walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pt_access ^ walk_nx_mask);

I think it makes sense to hoist this up to where table_gfn is set.  There's no
harm in filling pt_access[] on a fault, e.g. if this point isn't reached because
the pte is not-present.

That will allow dropping pt_access, associate pt_access more closely with
table_gfn, and make it more difficult to incorrectly incorporate the pte's
permission in the pt_access, e.g. if someone in the future thinks using pt_access
instead of pte_access is a typo.

E.g. for the comment and dropping of pt_access;

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 823a5919f9fa..cefbaa917ad8 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -316,7 +316,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
        pt_element_t pte;
        pt_element_t __user *ptep_user;
        gfn_t table_gfn;
-       u64 pt_access, pte_access;
+       u64 pte_access;
        unsigned index, accessed_dirty, pte_pkey;
        unsigned nested_access;
        gpa_t pte_gpa;
@@ -362,7 +362,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
        do {
                unsigned long host_addr;
 
-               pt_access = pte_access;
                --walker->level;
 
                index = PT_INDEX(addr, walker->level);
@@ -374,6 +373,17 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
                walker->table_gfn[walker->level - 1] = table_gfn;
                walker->pte_gpa[walker->level - 1] = pte_gpa;
 
+               /*
+                * The access permissions of the table are the logical AND of
+                * its parent's permissions, i.e. everything that's been
+                * collected so far.  The PxE's pt_access is combined with its
+                * its table_gfn (and other data) to generate the tag/key for
+                * the cache of shadow pages.  Two guest PxEs that point at the
+                * same table_gfn but have different parent permissions must
+                * use different shadow pages.
+                */
+               walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pte_access ^ walk_nx_mask);
+
                real_gpa = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
                                              nested_access,
                                              &walker->fault);
@@ -407,7 +417,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
                 * Inverting the NX it lets us AND it like other
                 * permission bits.
                 */
-               pte_access = pt_access & (pte ^ walk_nx_mask);
+               pte_access &= (pte ^ walk_nx_mask);
 
                if (unlikely(!FNAME(is_present_gpte)(pte)))
                        goto error;
@@ -418,9 +428,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
                }
 
                walker->ptes[walker->level - 1] = pte;
-
-               /* Convert to ACC_*_MASK flags for struct guest_walker.  */
-               walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pt_access ^ walk_nx_mask);
        } while (!is_last_gpte(mmu, walker->level, pte));
 
        pte_pkey = FNAME(gpte_pkeys)(vcpu, pte);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V2] KVM: X86: MMU: Use the correct inherited permissions to get shadow page
  2021-06-03 17:59   ` Sean Christopherson
@ 2021-06-08 17:09     ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2021-06-08 17:09 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm,
	linux-doc

On 03/06/21 19:59, Sean Christopherson wrote:
> Maybe drop the first two paragraphs and combine the info into something like this?
> 
>    When computing the access permissions of a shadow page, use the effective
>    permissions of the walk up to that point, i.e. the logic AND of its parents'
>    permissions.  Two guest PxE entries that point at the same table gfn need to
>    be shadowed with different shadow pages if their parents' permissions are
>    different.  KVM currently uses the effective permissions of the last
>    non-leaf entry for all non-leaf entries, which can lead to incorrectly
>    reusing a shadow page if a lower-level entry has more restrictve permissions,
>    and eventually result in a missing guest protection page fault.

And also a rewritten description of the sequence leading to the bug:

- First, the guest reads from ptr1 first and KVM prepares a shadow
   page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
   "u--" comes from the effective permissions of pgd, pud1 and
   pmd1, which are stored in pt->access.  "u--" is used also to get
   the pagetable for pud1, instead of "uw-".

- Then the guest writes to ptr2 and KVM reuses pud1 which is present.
   The hypervisor set up a shadow page for ptr2 with pt->access is "uw-".
   However the pud1 pmdthe pud1 pmd (because of the incorrect argument to
   kvm_mmu_get_page in the previous step) has role.access="u--".

- Then the guest reads from ptr3.  The hypervisor reuses pud1's
   shadow pmd for pud2, because both use "u--" for their permissions.
   Thus, the shadow pmd already includes entries for both pmd1 and pmd2.

- At last, the guest writes to ptr4.  This causes no vmexit or pagefault,
   because pud1's shadow page structures included an "uw-" page even though
   its role.access was "u--".

Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-06-08 17:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-20  9:55 [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page Lai Jiangshan
2020-11-26  0:05 ` Sean Christopherson
2020-11-27 16:48   ` Paolo Bonzini
2020-11-28  2:04     ` Lai Jiangshan
2020-11-30 17:41       ` Sean Christopherson
2020-11-30 17:53         ` Paolo Bonzini
2020-12-01  1:31         ` Lai Jiangshan
2021-06-03  5:24 ` [PATCH V2] KVM: X86: MMU: Use " Lai Jiangshan
2021-06-03 17:59   ` Sean Christopherson
2021-06-08 17:09     ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).