* [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
@ 2022-03-27 20:35 Mingwei Zhang
  2022-03-27 20:43 ` Mingwei Zhang
  0 siblings, 1 reply; 15+ messages in thread
From: Mingwei Zhang @ 2022-03-27 20:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack,
	Mingwei Zhang

Add a lockdep check before invoking lookup_address_in_mm().
lookup_address_in_mm() walks all levels of the host page table without
acquiring any lock. This is usually unsafe unless we are walking
kernel addresses (see the other use cases of lookup_address_in_mm()
and lookup_address_in_pgd()).

Walking the host page table (especially for guest addresses) usually
requires holding two types of locks: 1) the mmap_lock in mm, or the
lock that protects the reverse maps of the host memory in range; 2)
the lock for the leaf paging structures.

One exception is when we take the mmu_lock of the secondary MMU.
Holding the mmu_lock of the KVM MMU in either read mode or write mode
prevents host-level entities from modifying the host page table
concurrently, because all of them have to invoke the KVM mmu_notifier
first before doing the actual work. Since KVM mmu_notifier
invalidation operations always take the mmu write lock, we are safe if
we hold the mmu lock here.

Note: this means that KVM cannot allow multiple concurrent
mmu_notifier invalidation callbacks by taking the KVM mmu read lock,
since otherwise any host-level entity could race with this walk.
Walking the host page table here may return stale information or
trigger a NULL pointer dereference that is hard to reproduce.

Having a lockdep check here will prevent, or at least warn about,
future code that directly walks the host page table from a KVM ioctl
function. In addition, it provides a record for any future development
on the KVM mmu_notifier.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..342aa184c0a2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	 */
 	hva = __gfn_to_hva_memslot(slot, gfn);
 
+	/*
+	 * lookup_address_in_mm() walks all levels of host page table without
+	 * acquiring any lock. This is not safe when KVM does not take the
+	 * mmu_lock. Holding mmu_lock in either read mode or write mode prevents
+	 * host level entities from modifying the host page table. This is
+	 * because all of them will have to invoke KVM mmu_notifier first before
+	 * doing the actual work. Since KVM mmu_notifier invalidation operations
+	 * always take the mmu write lock, we are safe if we hold the mmu lock
+	 * here.
+	 *
+	 * Note: this means that KVM cannot allow concurrent multiple
+	 * mmu_notifier invalidation callbacks by using KVM mmu read lock.
+	 * Otherwise, any host level entity can cause race conditions with this
+	 * one. Walking host page table here may get us stale information or may
+	 * trigger NULL ptr dereference that is hard to reproduce.
+	 */
+	lockdep_is_held(kvm->mmu_lock);
+
 	pte = lookup_address_in_mm(kvm->mm, hva, &level);
 	if (unlikely(!pte))
 		return PG_LEVEL_4K;
-- 
2.35.1.1021.g381101b075-goog



* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-03-27 20:35 [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm() Mingwei Zhang
@ 2022-03-27 20:43 ` Mingwei Zhang
  0 siblings, 0 replies; 15+ messages in thread
From: Mingwei Zhang @ 2022-03-27 20:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

Please ignore this one; I will provide an updated patch shortly.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-27  2:56                   ` Mingwei Zhang
@ 2022-04-27 14:08                     ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2022-04-27 14:08 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> On Tue, Apr 26, 2022 at 6:30 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > > On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > > > > > I completely agree that lookup_address() and friends are unnecessarily fragile,
> > > > > > but I think that attempting to harden them to fix this KVM bug will open a can
> > > > > > of worms and end up delaying getting KVM fixed.
> > > > >
> > > > > So basically, we need to:
> > > > >  - choose perf_get_page_size() instead of using any of the
> > > > > lookup_address*() in mm.
> > > > >  - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size
> > > > > -> level translation.
> > > > >
> > > > > Agree?
> > > >
> > > > Drat, I didn't see that it returns the page size, not the level.  That's a bit
> > > > unfortunate.  It definitely makes me less averse to fixing lookup_address_in_pgd().
> > > >
> > > > Hrm.  I guess since we know there's at least one broken user, and in theory
> > > > fixing lookup_address_in_pgd() should do no harm to users that don't need protection,
> > > > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers
> > > > push back.
> > >
> > > Yeah, fixing lookup_address_in_pgd() should be cleaner, since the
> > > page fault use case does not need irq save/restore, but the other
> > > one does. So we can easily fix the function with READ_ONCE and
> > > lockless stuff, but wrap the function with irq save/restore on the
> > > KVM side.
> >
> > I think it makes sense to do the save/restore in lookup_address_in_pgd().
> > Those helpers are exported, so odds are good there are broken users that
> > will benefit from fixing all paths.
> 
> No, lookup_address_in_pgd() is probably just broken for KVM. Some other
> call sites may already disable IRQs, so doing that again inside
> lookup_address_in_pgd() would be bad.

No, it's not bad.  local_irq_save/restore() is intended precisely for cases
where IRQs need to be disabled but may or may not have already been disabled
by the caller.  PUSHF+POPF is not expensive, relatively speaking.
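
A minimal sketch of that nesting behavior (illustrative only, not code
from the thread):

	/*
	 * local_irq_save() stashes the current IRQ state (x86 FLAGS, via
	 * PUSHF) in 'flags' before disabling IRQs, and local_irq_restore()
	 * (POPF) puts back exactly that saved state.  So if the caller has
	 * already disabled IRQs, the inner pair is a harmless no-op rather
	 * than a spurious re-enable.
	 */
	static void nested_irq_off_sketch(void)
	{
		unsigned long flags;

		local_irq_disable();		/* caller-level disable */

		local_irq_save(flags);		/* IRQs stay off */
		/* ... e.g. a lockless page-table walk ... */
		local_irq_restore(flags);	/* restores "off", not "on" */

		local_irq_enable();		/* IRQs back on only here */
	}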

> I am looking here:
> https://elixir.bootlin.com/linux/latest/source/arch/arm/kernel/traps.c#L304

That's arm code; lookup_address_in_pgd() is x86-specific.  :-) That said, I'm
sure there exists at least one caller that runs with IRQs disabled.  But as
above, it's not a problem.

> So the save/restore is done in oops_begin() and oops_end(), which wrap
> show_fault_oops(), which in turn calls lookup_address_in_pgd().
> 
> So I think we need to ensure the READ_ONCE semantics.
> 
> Hmm, regarding the lockless macros, Paolo is right: for x86 they make
> no difference. s390 seems to have a different implementation, but
> kvm_mmu_max_mapping_level() and host_pfn_mapping_level() are both
> functions in the x86 MMU.

Yep, all of this is x86 specific.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-27  1:30                 ` Sean Christopherson
@ 2022-04-27  2:56                   ` Mingwei Zhang
  2022-04-27 14:08                     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Mingwei Zhang @ 2022-04-27  2:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Tue, Apr 26, 2022 at 6:30 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > > > > I completely agree that lookup_address() and friends are unnecessarily fragile,
> > > > > but I think that attempting to harden them to fix this KVM bug will open a can
> > > > > of worms and end up delaying getting KVM fixed.
> > > >
> > > > So basically, we need to:
> > > >  - choose perf_get_page_size() instead of using any of the
> > > > lookup_address*() in mm.
> > > >  - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size
> > > > -> level translation.
> > > >
> > > > Agree?
> > >
> > > Drat, I didn't see that it returns the page size, not the level.  That's a bit
> > > unfortunate.  It definitely makes me less averse to fixing lookup_address_in_pgd().
> > >
> > > Hrm.  I guess since we know there's at least one broken user, and in theory
> > > fixing lookup_address_in_pgd() should do no harm to users that don't need protection,
> > > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers
> > > push back.
> >
> > Yeah, fixing lookup_address_in_pgd() should be cleaner, since the
> > page fault use case does not need irq save/restore, but the other
> > one does. So we can easily fix the function with READ_ONCE and
> > lockless stuff, but wrap the function with irq save/restore on the
> > KVM side.
>
> I think it makes sense to do the save/restore in lookup_address_in_pgd().
> Those helpers are exported, so odds are good there are broken users that
> will benefit from fixing all paths.

No, lookup_address_in_pgd() is probably just broken for KVM. Some other
call sites may already disable IRQs, so doing that again inside
lookup_address_in_pgd() would be bad.

I am looking here:
https://elixir.bootlin.com/linux/latest/source/arch/arm/kernel/traps.c#L304

So the save/restore is done in oops_begin() and oops_end(), which wrap
show_fault_oops(), which in turn calls lookup_address_in_pgd().

So I think we need to ensure the READ_ONCE semantics.

Hmm, regarding the lockless macros, Paolo is right: for x86 they make
no difference. s390 seems to have a different implementation, but
kvm_mmu_max_mapping_level() and host_pfn_mapping_level() are both
functions in the x86 MMU.

Thanks.
-Mingwei


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-27  1:24               ` Mingwei Zhang
@ 2022-04-27  1:30                 ` Sean Christopherson
  2022-04-27  2:56                   ` Mingwei Zhang
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2022-04-27  1:30 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > > > I completely agree that lookup_address() and friends are unnecessarily fragile,
> > > > but I think that attempting to harden them to fix this KVM bug will open a can
> > > > of worms and end up delaying getting KVM fixed.
> > >
> > > So basically, we need to:
> > >  - choose perf_get_page_size() instead of using any of the
> > > lookup_address*() in mm.
> > >  - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size
> > > -> level translation.
> > >
> > > Agree?
> >
> > Drat, I didn't see that it returns the page size, not the level.  That's a bit
> > unfortunate.  It definitely makes me less averse to fixing lookup_address_in_pgd().
> >
> > Hrm.  I guess since we know there's at least one broken user, and in theory
> > fixing lookup_address_in_pgd() should do no harm to users that don't need protection,
> > it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers
> > push back.
> 
> Yeah, fixing lookup_address_in_pgd() should be cleaner, since the
> page fault use case does not need irq save/restore, but the other
> one does. So we can easily fix the function with READ_ONCE and
> lockless stuff, but wrap the function with irq save/restore on the
> KVM side.

I think it makes sense to do the save/restore in lookup_address_in_pgd().
Those helpers are exported, so odds are good there are broken users that
will benefit from fixing all paths.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-27  1:16             ` Sean Christopherson
@ 2022-04-27  1:24               ` Mingwei Zhang
  2022-04-27  1:30                 ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Mingwei Zhang @ 2022-04-27  1:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Tue, Apr 26, 2022 at 6:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > > I completely agree that lookup_address() and friends are unnecessarily fragile,
> > > but I think that attempting to harden them to fix this KVM bug will open a can
> > > of worms and end up delaying getting KVM fixed.
> >
> > So basically, we need to:
> >  - choose perf_get_page_size() instead of using any of the
> > lookup_address*() in mm.
> >  - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size
> > -> level translation.
> >
> > Agree?
>
> Drat, I didn't see that it returns the page size, not the level.  That's a bit
> unfortunate.  It definitely makes me less averse to fixing lookup_address_in_pgd().
>
> Hrm.  I guess since we know there's at least one broken user, and in theory
> fixing lookup_address_in_pgd() should do no harm to users that don't need protection,
> it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers
> push back.

Yeah, fixing lookup_address_in_pgd() should be cleaner, since the
page fault use case does not need irq save/restore, but the other
one does. So we can easily fix the function with READ_ONCE and
lockless stuff, but wrap the function with irq save/restore on the
KVM side.

OK, I think I can proceed and upload a version for that.

Thanks.
-Mingwei


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-26 18:48           ` Mingwei Zhang
@ 2022-04-27  1:16             ` Sean Christopherson
  2022-04-27  1:24               ` Mingwei Zhang
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2022-04-27  1:16 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Tue, Apr 26, 2022, Mingwei Zhang wrote:
> > I completely agree that lookup_address() and friends are unnecessarily fragile,
> > but I think that attempting to harden them to fix this KVM bug will open a can
> > of worms and end up delaying getting KVM fixed.
> 
> So basically, we need to:
>  - choose perf_get_page_size() instead of using any of the
> lookup_address*() in mm.
>  - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size
> -> level translation.
> 
> Agree?

Drat, I didn't see that it returns the page size, not the level.  That's a bit
unfortunate.  It definitely makes me less averse to fixing lookup_address_in_pgd().

Hrm.  I guess since we know there's at least one broken user, and in theory
fixing lookup_address_in_pgd() should do no harm to users that don't need protection,
it makes sense to just fix lookup_address_in_pgd() and see if the x86 maintainers
push back.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-26 18:10         ` Sean Christopherson
@ 2022-04-26 18:48           ` Mingwei Zhang
  2022-04-27  1:16             ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Mingwei Zhang @ 2022-04-26 18:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

> I completely agree that lookup_address() and friends are unnecessarily fragile,
> but I think that attempting to harden them to fix this KVM bug will open a can
> of worms and end up delaying getting KVM fixed.

So basically, we need to:
 - choose perf_get_page_size() instead of using any of the
lookup_address*() in mm.
 - add a wrapper layer to adapt: 1) irq disabling/enabling and 2) size
-> level translation.

Agree?


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-04-26 17:49       ` Paolo Bonzini
@ 2022-04-26 18:10         ` Sean Christopherson
  2022-04-26 18:48           ` Mingwei Zhang
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2022-04-26 18:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Mingwei Zhang, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Tue, Apr 26, 2022, Paolo Bonzini wrote:
> On 3/28/22 20:15, Sean Christopherson wrote:
> > > lookup_address_in_mm() walks the host page table as if it is a
> > > sequence of _static_ memory chunks. This is clearly dangerous.
> > Yeah, it's broken.  The proper fix is to do something like what perf uses, or maybe
> > just genericize and reuse the code from commit 8af26be06272
> > ("perf/core: Fix arch_perf_get_page_size()").
> > 
> 
> Indeed, KVM could use perf_get_pgtable_size().  The conversion from the
> result of *_leaf_size() to level is basically (ctz(size) - 12) / 9.
> 
> Alternatively, there are three differences between perf_get_page_size()
> and lookup_address_in_pgd():
> 
> * the *_offset_lockless() macros, which are unnecessary on x86
> 
> * READ_ONCE, which is important but in practice unlikely to make a
> difference

It can make a difference for this specific case.  I can't find the bug/patch, but
a year or two back there was a bug in a similar mm/ path where lack of READ_ONCE()
led to dereferencing garbage due to re-reading an upper-level entry.  IIRC, it was a
page promotion (to huge page) case, where the p*d_large() check came back false
(saw the old value) and then p*d_offset() walked into the weeds because it used
the new value (huge page offset).
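
A sketch of that re-read hazard and the READ_ONCE() fix (illustrative
shape only; this is not the actual buggy code):

	static pte_t *pte_lookup_sketch(pmd_t *pmdp, unsigned long addr)
	{
		/*
		 * Snapshot the PMD exactly once.  Without READ_ONCE(), the
		 * p*d_large() check and the offset calculation may each
		 * reload *pmdp and observe different values if the PMD is
		 * concurrently promoted to a huge page, so the walk would
		 * treat a huge-page PFN as a page-table pointer.
		 */
		pmd_t pmd = READ_ONCE(*pmdp);

		if (pmd_none(pmd) || pmd_large(pmd))
			return NULL;

		/* Derive the PTE from the snapshot, never from *pmdp again. */
		return pte_offset_kernel(&pmd, addr);
	}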

> * local_irq_{save,restore} around the walk
> 
> 
> The last is the important one and it should be added to
> lookup_address_in_pgd().

I don't think so.  The issue is that, similar to adding a lockdep here, simply
disabling IRQs is not sufficient to ensure the resolved pfn is valid.  And again,
like this case, disabling IRQs is not actually required when sufficient protections
are in place, e.g. in KVM's page fault case, the mmu_notifier invalidate_start
event must occur before the primary MMU modifies its PTEs.

In other words, disabling IRQs is both unnecessary and gives a false sense of security.

I completely agree that lookup_address() and friends are unnecessarily fragile,
but I think that attempting to harden them to fix this KVM bug will open a can
of worms and end up delaying getting KVM fixed.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-03-28 18:15     ` Sean Christopherson
  2022-04-26 17:19       ` Mingwei Zhang
@ 2022-04-26 17:49       ` Paolo Bonzini
  2022-04-26 18:10         ` Sean Christopherson
  1 sibling, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2022-04-26 17:49 UTC (permalink / raw)
  To: Sean Christopherson, Mingwei Zhang
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	LKML, Ben Gardon, David Matlack

On 3/28/22 20:15, Sean Christopherson wrote:
>> lookup_address_in_mm() walks the host page table as if it is a
>> sequence of _static_ memory chunks. This is clearly dangerous.
> Yeah, it's broken.  The proper fix is to do something like what perf uses, or maybe
> just genericize and reuse the code from commit 8af26be06272
> ("perf/core: Fix arch_perf_get_page_size()").
> 

Indeed, KVM could use perf_get_pgtable_size().  The conversion from the 
result of *_leaf_size() to level is basically (ctz(size) - 12) / 9.
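
For concreteness, a sketch of that conversion (editor's illustration;
note that KVM's PG_LEVEL_* enum starts at 1 for 4KiB pages, so the raw
formula is offset by one):

	/*
	 * Map a leaf mapping size (e.g. from perf_get_pgtable_size()) to an
	 * x86 page-table level, assuming 4KiB base pages and 9 bits per
	 * level: 4KiB -> PG_LEVEL_4K, 2MiB -> PG_LEVEL_2M, 1GiB -> PG_LEVEL_1G.
	 */
	static int pgtable_size_to_level(u64 size)
	{
		return (__builtin_ctzll(size) - 12) / 9 + PG_LEVEL_4K;
	}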

Alternatively, there are three differences between perf_get_page_size()
and lookup_address_in_pgd():

* the *_offset_lockless() macros, which are unnecessary on x86

* READ_ONCE, which is important but in practice unlikely to make a 
difference

* local_irq_{save,restore} around the walk


The last is the important one and it should be added to 
lookup_address_in_pgd().

Paolo


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-03-28 18:15     ` Sean Christopherson
@ 2022-04-26 17:19       ` Mingwei Zhang
  2022-04-26 17:49       ` Paolo Bonzini
  1 sibling, 0 replies; 15+ messages in thread
From: Mingwei Zhang @ 2022-04-26 17:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Mon, Mar 28, 2022 at 11:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 28, 2022, Mingwei Zhang wrote:
> > With that, I am starting to feel this is a bug. The issue is just so rare
> > that it has never triggered a problem.
> >
> > lookup_address_in_mm() walks the host page table as if it is a
> > sequence of _static_ memory chunks. This is clearly dangerous.
>
> Yeah, it's broken.  The proper fix is to do something like what perf uses, or maybe
> just genericize and reuse the code from commit 8af26be06272
> ("perf/core: Fix arch_perf_get_page_size()").

Hmm, I am thinking about this. We clearly need an adapter layer if we
choose to use this function, e.g., for the size -> level conversion and
for whether to disable IRQs. Alternatively, I am wondering if we can
just modify lookup_address_in_mm() to make the code compatible with a
"lockless" walk?

On top of that, kvm_mmu_max_mapping_level() is used in two places: 1)
EPT violation handling and 2) disabling dirty logging. The former does
not require disabling/enabling IRQs since it is already safe. So maybe
add a parameter to this function and plumb it through to
host_pfn_mapping_level()?
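
A sketch of what that plumbing might look like; the added parameter and
its name are illustrative assumptions, not posted code:

	/*
	 * Hypothetical: let callers say whether the host page-table walk
	 * still needs IRQ protection.  The EPT-violation (fault) path,
	 * which rechecks mmu_notifier_retry afterwards, could pass
	 * walk_is_safe=true; kvm_mmu_zap_collapsible_spte() would pass
	 * false so the walk disables IRQs itself.
	 */
	int kvm_mmu_max_mapping_level(struct kvm *kvm,
				      const struct kvm_memory_slot *slot,
				      gfn_t gfn, kvm_pfn_t pfn, int max_level,
				      bool walk_is_safe);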
>
> > But right now, kvm_mmu_max_mapping_level() is used in other places
> > as well: kvm_mmu_zap_collapsible_spte(), which does not satisfy the
> > strict requirement of walking the host page table.
>
> The host pfn size is used only as a heuristic, so false positives/negatives are
> ok; the only race that needs to be avoided is dereferencing freed page table
> memory.  lookup_address_in_pgd() is really broken because it doesn't even ensure
> a given PxE is read with READ_ONCE().  I suppose one could argue the caller is
> broken, but I doubt KVM is the only user that doesn't provide the necessary
> protections.

Right. Since lookup_address_in_pgd() is so broken, I am thinking about
just fixing it in place instead of switching to a different function.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-03-28 17:41   ` Mingwei Zhang
@ 2022-03-28 18:15     ` Sean Christopherson
  2022-04-26 17:19       ` Mingwei Zhang
  2022-04-26 17:49       ` Paolo Bonzini
  0 siblings, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2022-03-28 18:15 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

On Mon, Mar 28, 2022, Mingwei Zhang wrote:
> With that, I am starting to feel this is a bug. The issue is just so rare
> that it has never triggered a problem.
>
> lookup_address_in_mm() walks the host page table as if it is a
> sequence of _static_ memory chunks. This is clearly dangerous.

Yeah, it's broken.  The proper fix is to do something like what perf uses, or maybe
just genericize and reuse the code from commit 8af26be06272
("perf/core: Fix arch_perf_get_page_size()").

> But right now, kvm_mmu_max_mapping_level() is used in other places
> as well: kvm_mmu_zap_collapsible_spte(), which does not satisfy the
> strict requirement of walking the host page table.

The host pfn size is used only as a heuristic, so false positives/negatives are
ok; the only race that needs to be avoided is dereferencing freed page table
memory.  lookup_address_in_pgd() is really broken because it doesn't even ensure
a given PxE is read with READ_ONCE().  I suppose one could argue the caller is
broken, but I doubt KVM is the only user that doesn't provide the necessary
protections.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-03-28 15:16 ` Sean Christopherson
@ 2022-03-28 17:41   ` Mingwei Zhang
  2022-03-28 18:15     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Mingwei Zhang @ 2022-03-28 17:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Ben Gardon, David Matlack

Hi Sean,


On Mon, Mar 28, 2022 at 8:16 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sun, Mar 27, 2022, Mingwei Zhang wrote:
> > Add a lockdep check before invoking lookup_address_in_mm().
> > lookup_address_in_mm() walks all levels of the host page table without
> > acquiring any lock. This is usually unsafe unless we are walking
> > kernel addresses (see the other use cases of lookup_address_in_mm()
> > and lookup_address_in_pgd()).
> >
> > Walking the host page table (especially for guest addresses) usually
> > requires holding two types of locks: 1) the mmap_lock in mm, or the
> > lock that protects the reverse maps of the host memory in range; 2)
> > the lock for the leaf paging structures.
> >
> > One exception is when we take the mmu_lock of the secondary MMU.
> > Holding the mmu_lock of the KVM MMU in either read mode or write mode
> > prevents host-level entities from modifying the host page table
> > concurrently, because all of them have to invoke the KVM mmu_notifier
> > first before doing the actual work. Since KVM mmu_notifier
> > invalidation operations always take the mmu write lock, we are safe if
> > we hold the mmu lock here.
> >
> > Note: this means that KVM cannot allow multiple concurrent
> > mmu_notifier invalidation callbacks by taking the KVM mmu read lock,
> > since otherwise any host-level entity could race with this walk.
> > Walking the host page table here may return stale information or
> > trigger a NULL pointer dereference that is hard to reproduce.
> >
> > Having a lockdep check here will prevent, or at least warn about,
> > future code that directly walks the host page table from a KVM ioctl
> > function. In addition, it provides a record for any future development
> > on the KVM mmu_notifier.
> >
> > Cc: Sean Christopherson <seanjc@google.com>
> > Cc: Ben Gardon <bgardon@google.com>
> > Cc: David Matlack <dmatlack@google.com>
> >
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1361eb4599b4..066bb5435156 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >        */
> >       hva = __gfn_to_hva_memslot(slot, gfn);
> >
> > +     /*
> > +      * lookup_address_in_mm() walks all levels of host page table without
> > +      * acquiring any lock. This is not safe when KVM does not take the
> > +      * mmu_lock. Holding mmu_lock in either read mode or write mode prevents
> > +      * host level entities from modifying the host page table. This is
> > +      * because all of them will have to invoke KVM mmu_notifier first before
> > +      * doing the actual work. Since KVM mmu_notifier invalidation operations
> > +      * always take the mmu write lock, we are safe if we hold the mmu lock
> > +      * here.
> > +      *
> > +      * Note: this means that KVM cannot allow concurrent multiple
> > +      * mmu_notifier invalidation callbacks by using KVM mmu read lock.
> > +      * Otherwise, any host level entity can cause race conditions with this
> > +      * one. Walking host page table here may get us stale information or may
> > +      * trigger NULL ptr dereference that is hard to reproduce.
> > +      */
> > +     lockdep_assert_held(&kvm->mmu_lock);
>
> Holding mmu_lock isn't strictly required.  It would also be safe to use this helper
> if mmu_notifier_retry_hva() were checked after grabbing the mapping level, before
> consuming it.  E.g. we could theoretically move this to kvm_faultin_pfn().
>
> And simply holding the lock isn't sufficient, i.e. the lockdep gives a false sense
> of security.  E.g. calling this while holding mmu_lock but without first checking
> mmu_notifier_count would let it run concurrently with host PTE modifications.

Right, even holding the kvm->mmu_lock is not safe, since we may have
several concurrent invalidations ongoing that have finished zapping
entries in the EPT (so they could have already released the
kvm->mmu_lock) and started working on the host page table. If we want
to make it safe, we also have to check mmu_notifier_count (and
potentially mmu_seq as well).

With that, I am starting to feel this is a bug. The issue is just so rare
that it has never triggered a problem.

lookup_address_in_mm() walks the host page table as if it is a
sequence of _static_ memory chunks. This is clearly dangerous. If we
look at hva_to_pfn(), which is the right way to walk the host page table:

hva_to_pfn() =>
  hva_to_pfn_fast() =>
    get_user_page_fast_only() =>
      internal_get_user_pages_fast() =>
        lockless_pages_from_mm() =>
          local_irq_save(flags); /* Disable interrupts here. */
          gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
  ... ...
  hva_to_pfn_slow() =>
    get_user_pages_unlocked() =>
      mmap_read_lock(mm); /* taking the mm lock here. */

The above code has two branches to walk the host page table: 1) the
fast one and 2) the slow one. The slower one takes the mmap lock, while
the faster one simply disables interrupts.

I think we might have to mimic what lockless_pages_from_mm() does,
i.e. wrap local_irq_{save,restore}(flags) around the
lookup_address_in_mm() call.
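
A minimal sketch of that wrapping (hypothetical helper name; note the
PTE has to be consumed before IRQs are re-enabled, or the race comes
back):

	static int host_level_lockless(struct mm_struct *mm, unsigned long hva)
	{
		unsigned long flags;
		unsigned int level;
		pte_t *pte;
		int ret = PG_LEVEL_4K;

		/*
		 * Mimic lockless_pages_from_mm(): with IRQs off, a racing
		 * munmap/THP-collapse cannot finish freeing the page tables
		 * out from under the walk, since x86 defers that freeing
		 * until all IRQs-off CPUs have been interrupted.
		 */
		local_irq_save(flags);
		pte = lookup_address_in_mm(mm, hva, &level);
		if (pte && pte_present(*pte))
			ret = level;
		local_irq_restore(flags);

		return ret;
	}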

Alternatively, we have to specify that lookup_address_in_mm(), as well
as its callers host_pfn_mapping_level() and kvm_mmu_max_mapping_level(),
CANNOT be called from generic places in KVM, but only in the fault path
and AFTER the is_page_fault_stale() check.

But right now, kvm_mmu_max_mapping_level() is used in other places
as well: kvm_mmu_zap_collapsible_spte(), which does not satisfy the
strict requirement of walking the host page table.

>
> I'm definitely in favor of adding a comment to document the mmu_notifier
> interactions, but I don't like adding a lockdep.


* Re: [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
  2022-03-27 20:58 Mingwei Zhang
@ 2022-03-28 15:16 ` Sean Christopherson
  2022-03-28 17:41   ` Mingwei Zhang
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2022-03-28 15:16 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

On Sun, Mar 27, 2022, Mingwei Zhang wrote:
> Add a lockdep check before invoking lookup_address_in_mm().
> lookup_address_in_mm() walks all levels of the host page table without
> acquiring any lock. This is usually unsafe unless we are walking
> kernel addresses (see the other use cases of lookup_address_in_mm()
> and lookup_address_in_pgd()).
> 
> Walking the host page table (especially for guest addresses) usually
> requires holding two types of locks: 1) the mmap_lock in mm, or the
> lock that protects the reverse maps of the host memory in range; 2)
> the lock for the leaf paging structures.
> 
> One exception is when we take the mmu_lock of the secondary MMU.
> Holding the mmu_lock of the KVM MMU in either read mode or write mode
> prevents host-level entities from modifying the host page table
> concurrently, because all of them have to invoke the KVM mmu_notifier
> first before doing the actual work. Since KVM mmu_notifier
> invalidation operations always take the mmu write lock, we are safe if
> we hold the mmu lock here.
> 
> Note: this means that KVM cannot allow multiple concurrent
> mmu_notifier invalidation callbacks by taking the KVM mmu read lock,
> since otherwise any host-level entity could race with this walk.
> Walking the host page table here may return stale information or
> trigger a NULL pointer dereference that is hard to reproduce.
> 
> Having a lockdep check here will prevent, or at least warn about,
> future code that directly walks the host page table from a KVM ioctl
> function. In addition, it provides a record for any future development
> on the KVM mmu_notifier.
> 
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: David Matlack <dmatlack@google.com>
> 
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1361eb4599b4..066bb5435156 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	 */
>  	hva = __gfn_to_hva_memslot(slot, gfn);
>  
> +	/*
> +	 * lookup_address_in_mm() walks all levels of host page table without
> +	 * acquiring any lock. This is not safe when KVM does not take the
> +	 * mmu_lock. Holding mmu_lock in either read mode or write mode prevents
> +	 * host level entities from modifying the host page table. This is
> +	 * because all of them will have to invoke KVM mmu_notifier first before
> +	 * doing the actual work. Since KVM mmu_notifier invalidation operations
> +	 * always take the mmu write lock, we are safe if we hold the mmu lock
> +	 * here.
> +	 *
> +	 * Note: this means that KVM cannot allow concurrent multiple
> +	 * mmu_notifier invalidation callbacks by using KVM mmu read lock.
> +	 * Otherwise, any host level entity can cause race conditions with this
> +	 * one. Walking host page table here may get us stale information or may
> +	 * trigger NULL ptr dereference that is hard to reproduce.
> +	 */
> +	lockdep_assert_held(&kvm->mmu_lock);

Holding mmu_lock isn't strictly required.  It would also be safe to use this helper
if mmu_notifier_retry_hva() were checked after grabbing the mapping level, before
consuming it.  E.g. we could theoretically move this to kvm_faultin_pfn().
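
Roughly the shape that ordering takes, paraphrasing the existing
fault-path pattern (a fragment sketch, not a drop-in patch; variable
declarations omitted):

	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();	/* order the seq read before the host PTE walk */

	level = host_pfn_mapping_level(kvm, gfn, pfn, slot);

	write_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry_hva(kvm, mmu_seq, hva))
		goto retry;	/* a host PTE changed; 'level' may be stale */
	/* ... 'level' is safe to consume until mmu_lock is dropped ... */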

And simply holding the lock isn't sufficient, i.e. the lockdep gives a false sense
of security.  E.g. calling this while holding mmu_lock but without first checking
mmu_notifier_count would let it run concurrently with host PTE modifications.

I'm definitely in favor of adding a comment to document the mmu_notifier
interactions, but I don't like adding a lockdep.


* [PATCH] KVM: x86/mmu: add lockdep check before lookup_address_in_mm()
@ 2022-03-27 20:58 Mingwei Zhang
  2022-03-28 15:16 ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Mingwei Zhang @ 2022-03-27 20:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack,
	Mingwei Zhang

Add a lockdep check before invoking lookup_address_in_mm().
lookup_address_in_mm() walks all levels of the host page table without
acquiring any lock. This is usually unsafe unless we are walking
kernel addresses (see the other use cases of lookup_address_in_mm()
and lookup_address_in_pgd()).

Walking the host page table (especially for guest addresses) usually
requires holding two types of locks: 1) the mmap_lock in mm, or the
lock that protects the reverse maps of the host memory in range; 2)
the lock for the leaf paging structures.

One exception is when we take the mmu_lock of the secondary MMU.
Holding the mmu_lock of the KVM MMU in either read mode or write mode
prevents host-level entities from modifying the host page table
concurrently, because all of them have to invoke the KVM mmu_notifier
first before doing the actual work. Since KVM mmu_notifier
invalidation operations always take the mmu write lock, we are safe if
we hold the mmu lock here.

Note: this means that KVM cannot allow multiple concurrent
mmu_notifier invalidation callbacks by taking the KVM mmu read lock,
since otherwise any host-level entity could race with this walk.
Walking the host page table here may return stale information or
trigger a NULL pointer dereference that is hard to reproduce.

Having a lockdep check here will prevent, or at least warn about,
future code that directly walks the host page table from a KVM ioctl
function. In addition, it provides a record for any future development
on the KVM mmu_notifier.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..066bb5435156 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2820,6 +2820,24 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	 */
 	hva = __gfn_to_hva_memslot(slot, gfn);
 
+	/*
+	 * lookup_address_in_mm() walks all levels of host page table without
+	 * acquiring any lock. This is not safe when KVM does not take the
+	 * mmu_lock. Holding mmu_lock in either read mode or write mode prevents
+	 * host level entities from modifying the host page table. This is
+	 * because all of them will have to invoke KVM mmu_notifier first before
+	 * doing the actual work. Since KVM mmu_notifier invalidation operations
+	 * always take the mmu write lock, we are safe if we hold the mmu lock
+	 * here.
+	 *
+	 * Note: this means that KVM cannot allow concurrent multiple
+	 * mmu_notifier invalidation callbacks by using KVM mmu read lock.
+	 * Otherwise, any host level entity can cause race conditions with this
+	 * one. Walking host page table here may get us stale information or may
+	 * trigger NULL ptr dereference that is hard to reproduce.
+	 */
+	lockdep_assert_held(&kvm->mmu_lock);
+
 	pte = lookup_address_in_mm(kvm->mm, hva, &level);
 	if (unlikely(!pte))
 		return PG_LEVEL_4K;
-- 
2.35.1.1021.g381101b075-goog


