* [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
@ 2024-01-12 20:38 Woodhouse, David
  2024-01-17 16:09 ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Woodhouse, David @ 2024-01-12 20:38 UTC (permalink / raw)
  To: kvm; +Cc: pbonzini, seanjc, Durrant, Paul


This function can race with kvm_gpc_deactivate(). Since that function
does not take the ->refresh_lock, it can wipe and unmap the pfn and
khva while hva_to_pfn_retry() has dropped its write lock on gpc->lock.

Then if hva_to_pfn_retry() determines that the PFN hasn't changed and
that it can re-use the old pfn and khva, they get assigned back to
gpc->pfn and gpc->khva even though the khva was already unmapped by
kvm_gpc_deactivate(). This leaves the cache in an apparently valid
state but with ->khva pointing to an address which has been unmapped.
Which in turn leads to oopses in e.g. __kvm_xen_has_interrupt() and
set_shinfo_evtchn_pending().
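
To illustrate how that bites, here is a rough sketch of the consumer side
(paraphrased from the shape of __kvm_xen_has_interrupt(), not a verbatim
copy): the check still passes because ->valid is set, so the stale ->khva
gets dereferenced.

	read_lock_irqsave(&gpc->lock, flags);
	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
		read_unlock_irqrestore(&gpc->lock, flags);

		/* Slow path: remap, then retake the lock and re-check. */
		if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
			return 0;

		read_lock_irqsave(&gpc->lock, flags);
	}

	/* Oopses here if ->valid is set but ->khva was already unmapped. */
	rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
	read_unlock_irqrestore(&gpc->lock, flags);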

It may be possible to fix this just by making kvm_gpc_deactivate()
take the ->refresh_lock, but that still leaves ->refresh_lock being
basically redundant with the write lock on ->lock, which frankly
makes my skin itch, given the way that hva_to_pfn_retry() operates on
fields in the gpc without holding ->lock.

Instead, fix it by cleaning up the semantics of hva_to_pfn_retry(). It
no longer operates on the gpc object at all; it's called with a uhva
and returns the corresponding pfn (pinned), and a mapped khva for it.

The calling function __kvm_gpc_refresh() now drops ->lock before calling
hva_to_pfn_retry(), then retakes the lock before checking for changes,
and discards the new mapping if it lost a race. It will also correctly
note the old pfn/khva to be unmapped at the right time, instead of
preserving them in a local variable while dropping the lock.

The optimisation in hva_to_pfn_retry() where it attempts to use the
old mapping if the pfn doesn't change is dropped, since it makes the
pinning more complex. It's a pointless optimisation anyway, since the
odds of the pfn ending up the same when the uhva has changed (i.e.
the odds of the two userspace addresses both pointing to the same
underlying physical page) are negligible.

I remain slightly confused because although this is clearly a race in
the gfn_to_pfn_cache code, I don't quite know how the Xen support code
actually managed to trigger it. We've seen oopses from dereferencing a
valid-looking ->khva in both __kvm_xen_has_interrupt() (the vcpu_info)
and in set_shinfo_evtchn_pending() (the shared_info). But surely the
race shouldn't happen for the vcpu_info gpc because all calls to both
refresh and deactivate hold the vcpu mutex, and it shouldn't happen
for the shared_info gpc because all calls to both will hold the
kvm->arch.xen.xen_lock mutex.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---

This is based on (and in) my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenfv
which has all the other outstanding KVM/Xen fixes.

 virt/kvm/pfncache.c | 181 +++++++++++++++++++++-----------------------
 1 file changed, 85 insertions(+), 96 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 70394d7c9a38..adca709a5884 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -135,110 +135,67 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
        return kvm->mmu_invalidate_seq != mmu_seq;
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
+/*
+ * Given a user virtual address, obtain a pinned host PFN and kernel mapping
+ * for it. The caller will release the PFN after installing it into the GPC
+ * so that the MMU notifier invalidation mechanism is active.
+ */
+static int hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva,
+                            kvm_pfn_t *pfn, void **khva)
 {
        /* Note, the new page offset may be different than the old! */
-       void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
        kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
        void *new_khva = NULL;
        unsigned long mmu_seq;
 
-       lockdep_assert_held(&gpc->refresh_lock);
-
-       lockdep_assert_held_write(&gpc->lock);
-
-       /*
-        * Invalidate the cache prior to dropping gpc->lock, the gpa=>uhva
-        * assets have already been updated and so a concurrent check() from a
-        * different task may not fail the gpa/uhva/generation checks.
-        */
-       gpc->valid = false;
-
-       do {
-               mmu_seq = gpc->kvm->mmu_invalidate_seq;
+       for (;;) {
+               mmu_seq = kvm->mmu_invalidate_seq;
                smp_rmb();
 
-               write_unlock_irq(&gpc->lock);
-
-               /*
-                * If the previous iteration "failed" due to an mmu_notifier
-                * event, release the pfn and unmap the kernel virtual address
-                * from the previous attempt.  Unmapping might sleep, so this
-                * needs to be done after dropping the lock.  Opportunistically
-                * check for resched while the lock isn't held.
-                */
-               if (new_pfn != KVM_PFN_ERR_FAULT) {
-                       /*
-                        * Keep the mapping if the previous iteration reused
-                        * the existing mapping and didn't create a new one.
-                        */
-                       if (new_khva != old_khva)
-                               gpc_unmap(new_pfn, new_khva);
-
-                       kvm_release_pfn_clean(new_pfn);
-
-                       cond_resched();
-               }
-
                /* We always request a writeable mapping */
-               new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+               new_pfn = hva_to_pfn(uhva, false, false, NULL, true, NULL);
                if (is_error_noslot_pfn(new_pfn))
-                       goto out_error;
+                       return -EFAULT;
 
                /*
-                * Obtain a new kernel mapping if KVM itself will access the
-                * pfn.  Note, kmap() and memremap() can both sleep, so this
-                * too must be done outside of gpc->lock!
+                * Always obtain a new kernel mapping. Trying to reuse an
+                * existing one is more complex than it's worth.
                 */
-               if (new_pfn == gpc->pfn)
-                       new_khva = old_khva;
-               else
-                       new_khva = gpc_map(new_pfn);
-
+               new_khva = gpc_map(new_pfn);
                if (!new_khva) {
                        kvm_release_pfn_clean(new_pfn);
-                       goto out_error;
+                       return -EFAULT;
                }
 
-               write_lock_irq(&gpc->lock);
+               if (!mmu_notifier_retry_cache(kvm, mmu_seq))
+                       break;
 
                /*
-                * Other tasks must wait for _this_ refresh to complete before
-                * attempting to refresh.
+                * If this iteration "failed" due to an mmu_notifier event,
+                * release the pfn and unmap the kernel virtual address, and
+                * loop around again.
                 */
-               WARN_ON_ONCE(gpc->valid);
-       } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
-
-       gpc->valid = true;
-       gpc->pfn = new_pfn;
-       gpc->khva = new_khva + offset_in_page(gpc->uhva);
+               if (new_pfn != KVM_PFN_ERR_FAULT) {
+                       gpc_unmap(new_pfn, new_khva);
+                       kvm_release_pfn_clean(new_pfn);
+               }
+       }
 
-       /*
-        * Put the reference to the _new_ pfn.  The pfn is now tracked by the
-        * cache and can be safely migrated, swapped, etc... as the cache will
-        * invalidate any mappings in response to relevant mmu_notifier events.
-        */
-       kvm_release_pfn_clean(new_pfn);
+       *pfn = new_pfn;
+       *khva = new_khva;
 
        return 0;
-
-out_error:
-       write_lock_irq(&gpc->lock);
-
-       return -EFAULT;
 }
 
-static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
-                            unsigned long len)
+static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
+                            unsigned long uhva, unsigned long len)
 {
        struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
        unsigned long page_offset = (gpa != KVM_XEN_INVALID_GPA) ?
                offset_in_page(gpa) :
                offset_in_page(uhva);
-       bool unmap_old = false;
        unsigned long old_uhva;
-       kvm_pfn_t old_pfn;
-       bool hva_change = false;
+       kvm_pfn_t old_pfn = KVM_PFN_ERR_FAULT;
        void *old_khva;
        int ret;
 
@@ -251,8 +208,9 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
        /*
         * If another task is refreshing the cache, wait for it to complete.
-        * There is no guarantee that concurrent refreshes will see the same
-        * gpa, memslots generation, etc..., so they must be fully serialized.
+        * This is purely an optimisation, to avoid concurrent mappings from
+        * hva_to_pfn_retry(), all but one of which will be discarded after
+        * losing a race to install them in the GPC.
         */
        mutex_lock(&gpc->refresh_lock);
 
@@ -272,7 +230,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
                gpc->uhva = PAGE_ALIGN_DOWN(uhva);
 
                if (gpc->uhva != old_uhva)
-                       hva_change = true;
+                       gpc->valid = false;
        } else if (gpc->gpa != gpa ||
                   gpc->generation != slots->generation ||
                   kvm_is_error_hva(gpc->uhva)) {
@@ -285,7 +243,11 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
                if (kvm_is_error_hva(gpc->uhva)) {
                        ret = -EFAULT;
-                       goto out;
+
+                       gpc->valid = false;
+                       gpc->pfn = KVM_PFN_ERR_FAULT;
+                       gpc->khva = NULL;
+                       goto out_unlock;
                }
 
                /*
@@ -293,7 +255,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
                 * HVA may still be the same.
                 */
                if (gpc->uhva != old_uhva)
-                       hva_change = true;
+                       gpc->valid = false;
        } else {
                gpc->uhva = old_uhva;
        }
@@ -305,9 +267,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
         * If the userspace HVA changed or the PFN was already invalid,
         * drop the lock and do the HVA to PFN lookup again.
         */
-       if (!gpc->valid || hva_change) {
-               ret = hva_to_pfn_retry(gpc);
-       } else {
+       if (gpc->valid) {
                /*
                 * If the HVA→PFN mapping was already valid, don't unmap it.
                 * But do update gpc->khva because the offset within the page
@@ -315,30 +275,59 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
                 */
                gpc->khva = old_khva + page_offset;
                ret = 0;
-               goto out_unlock;
-       }
 
- out:
-       /*
-        * Invalidate the cache and purge the pfn/khva if the refresh failed.
-        * Some/all of the uhva, gpa, and memslot generation info may still be
-        * valid, leave it as is.
-        */
-       if (ret) {
+               /* old_pfn must not be unmapped because it was reused. */
+               old_pfn = KVM_PFN_ERR_FAULT;
+       } else {
+               kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+               unsigned long new_uhva = gpc->uhva;
+               void *new_khva = NULL;
+
+               /*
+                * Invalidate the cache prior to dropping gpc->lock; the
+                * gpa=>uhva assets have already been updated and so a
+                * concurrent check() from a different task may not fail
+                * the gpa/uhva/generation checks as it should.
+                */
                gpc->valid = false;
-               gpc->pfn = KVM_PFN_ERR_FAULT;
-               gpc->khva = NULL;
-       }
 
-       /* Detect a pfn change before dropping the lock! */
-       unmap_old = (old_pfn != gpc->pfn);
+               write_unlock_irq(&gpc->lock);
+
+               ret = hva_to_pfn_retry(gpc->kvm, new_uhva, &new_pfn, &new_khva);
+
+               write_lock_irq(&gpc->lock);
+
+               if (ret || gpc->uhva != new_uhva) {
+                       /*
+                        * On failure or if another update occurred while the
+                        * lock was dropped, just purge the new mapping. */
+                       old_pfn = new_pfn;
+                       old_khva = new_khva;
+               } else {
+                       old_pfn = gpc->pfn;
+                       old_khva = gpc->khva;
+
+                       gpc->pfn = new_pfn;
+                       gpc->khva = new_khva + offset_in_page(gpc->uhva);
+                       gpc->valid = true;
+               }
+
+               /*
+                * Put the reference to the _new_ pfn. On success, the
+                * pfn is now tracked by the cache and can safely be
+                * migrated, swapped, etc. as the cache will invalidate
+                * any mappings in response to relevant mmu_notifier
+                * events.
+                */
+               kvm_release_pfn_clean(new_pfn);
+       }
 
 out_unlock:
        write_unlock_irq(&gpc->lock);
 
        mutex_unlock(&gpc->refresh_lock);
 
-       if (unmap_old)
+       if (old_pfn != KVM_PFN_ERR_FAULT)
                gpc_unmap(old_pfn, old_khva);
 
        return ret;
-- 
2.41.0




* Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-12 20:38 [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues Woodhouse, David
@ 2024-01-17 16:09 ` Sean Christopherson
  2024-01-17 16:32   ` Woodhouse, David
  2024-01-23 15:06   ` David Woodhouse
  0 siblings, 2 replies; 8+ messages in thread
From: Sean Christopherson @ 2024-01-17 16:09 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm, pbonzini, Paul Durrant

On Fri, Jan 12, 2024, David Woodhouse wrote:
> This function can race with kvm_gpc_deactivate(). Since that function
> does not take the ->refresh_lock, it can wipe and unmap the pfn and
> khva while hva_to_pfn_retry() has dropped its write lock on gpc->lock.
> 
> Then if hva_to_pfn_retry() determines that the PFN hasn't changed and
> that it can re-use the old pfn and khva, they get assigned back to
> gpc->pfn and gpc->khva even though the khva was already unmapped by
> kvm_gpc_deactivate(). This leaves the cache in an apparently valid
> state but with ->khva pointing to an address which has been unmapped.
> Which in turn leads to oopses in e.g. __kvm_xen_has_interrupt() and
> set_shinfo_evtchn_pending().
> 
> It may be possible to fix this just by making kvm_gpc_deactivate()
> take the ->refresh_lock, but that still leaves ->refresh_lock being
> basically redundant with the write lock on ->lock, which frankly
> makes my skin itch, given the way that hva_to_pfn_retry() operates on
> fields in the gpc without holding ->lock.
> 
> Instead, fix it by cleaning up the semantics of hva_to_pfn_retry(). It
> no longer operates on the gpc object at all; it's called with a uhva
> and returns the corresponding pfn (pinned), and a mapped khva for it.
> 
> The calling function __kvm_gpc_refresh() now drops ->lock before calling
> hva_to_pfn_retry(), then retakes the lock before checking for changes,
> and discards the new mapping if it lost a race. And will correctly
> note the old pfn/khva to be unmapped at the right time, instead of
> preserving them in a local variable while dropping the lock.
> 
> The optimisation in hva_to_pfn_retry() where it attempts to use the
> old mapping if the pfn doesn't change is dropped, since it makes the
> pinning more complex. It's a pointless optimisation anyway, since the
> odds of the pfn ending up the same when the uhva has changed (i.e.
> the odds of the two userspace addresses both pointing to the same
> underlying physical page) are negligible.
> 
> I remain slightly confused because although this is clearly a race in
> the gfn_to_pfn_cache code, I don't quite know how the Xen support code
> actually managed to trigger it. We've seen oopses from dereferencing a
> valid-looking ->khva in both __kvm_xen_has_interrupt() (the vcpu_info)
> and in set_shinfo_evtchn_pending() (the shared_info). But surely the
> race shouldn't happen for the vcpu_info gpc because all calls to both
> refresh and deactivate hold the vcpu mutex, and it shouldn't happen

FWIW, neither kvm_xen_destroy_vcpu() nor kvm_xen_destroy_vm() holds the appropriate
mutex.  

> for the shared_info gpc because all calls to both will hold the
> kvm->arch.xen.xen_lock mutex.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> 
> This is based on (and in) my tree at
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenfv
> which has all the other outstanding KVM/Xen fixes.
> 
>  virt/kvm/pfncache.c | 181 +++++++++++++++++++++-----------------------
>  1 file changed, 85 insertions(+), 96 deletions(-)

NAK, at least as a bug fix.  We've already shuffled deck chairs on the Titanic
several times; I have zero confidence that doing so one more time is going to
truly solve the underlying mess.

The contract with the gfn_to_pfn_cache, or rather the lack thereof, is all kinds
of screwed up.  E.g. I added the mutex in commit 93984f19e7bc ("KVM: Fully serialize
gfn=>pfn cache refresh via mutex") to guard against concurrent unmap(), but the
unmap() API has since been removed.  We need to define an actual contract instead
of continuing to throw noodles at the wall in the hope that something sticks.

As you note above, some other mutex _should_ be held.  I think we should lean
into that.  E.g.

  1. Pass in the guarding mutex to kvm_gpc_init() and assert that said mutex is
     held for __refresh(), activate(), and deactivate().
  2. Fix the cases where that doesn't hold true.
  3. Drop refresh_mutex
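
Something like this, as a rough sketch of the idea only (the 'guard' field
and the extra kvm_gpc_init() parameter are hypothetical, and the existing
parameters and fields are elided):

struct gfn_to_pfn_cache {
	...
	struct mutex *guard;	/* hypothetical: the caller's guarding mutex */
};

void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, ..., struct mutex *guard)
{
	...
	gpc->guard = guard;	/* e.g. &vcpu->mutex or &kvm->arch.xen.xen_lock */
}

static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, ...)
{
	/* Enforce the contract instead of papering over its absence. */
	lockdep_assert_held(gpc->guard);
	...
}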


* Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-17 16:09 ` Sean Christopherson
@ 2024-01-17 16:32   ` Woodhouse, David
  2024-01-17 16:51     ` Sean Christopherson
  2024-01-23 15:06   ` David Woodhouse
  1 sibling, 1 reply; 8+ messages in thread
From: Woodhouse, David @ 2024-01-17 16:32 UTC (permalink / raw)
  To: seanjc; +Cc: kvm, pbonzini, Durrant, Paul


On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> On Fri, Jan 12, 2024, David Woodhouse wrote:
> > This function can race with kvm_gpc_deactivate(). Since that function
> > does not take the ->refresh_lock, it can wipe and unmap the pfn and
> > khva while hva_to_pfn_retry() has dropped its write lock on gpc->lock.
> > 
> > Then if hva_to_pfn_retry() determines that the PFN hasn't changed and
> > that it can re-use the old pfn and khva, they get assigned back to
> > gpc->pfn and gpc->khva even though the khva was already unmapped by
> > kvm_gpc_deactivate(). This leaves the cache in an apparently valid
> > state but with ->khva pointing to an address which has been unmapped.
> > Which in turn leads to oopses in e.g. __kvm_xen_has_interrupt() and
> > set_shinfo_evtchn_pending().
> > 
> > It may be possible to fix this just by making kvm_gpc_deactivate()
> > take the ->refresh_lock, but that still leaves ->refresh_lock being
> > basically redundant with the write lock on ->lock, which frankly
> > makes my skin itch, given the way that hva_to_pfn_retry() operates on
> > fields in the gpc without holding ->lock.
> > 
> > Instead, fix it by cleaning up the semantics of hva_to_pfn_retry(). It
> > no longer operates on the gpc object at all; it's called with a uhva
> > and returns the corresponding pfn (pinned), and a mapped khva for it.
> > 
> > The calling function __kvm_gpc_refresh() now drops ->lock before calling
> > hva_to_pfn_retry(), then retakes the lock before checking for changes,
> > and discards the new mapping if it lost a race. And will correctly
> > note the old pfn/khva to be unmapped at the right time, instead of
> > preserving them in a local variable while dropping the lock.
> > 
> > The optimisation in hva_to_pfn_retry() where it attempts to use the
> > old mapping if the pfn doesn't change is dropped, since it makes the
> > pinning more complex. It's a pointless optimisation anyway, since the
> > odds of the pfn ending up the same when the uhva has changed (i.e.
> > the odds of the two userspace addresses both pointing to the same
> > underlying physical page) are negligible.
> > 
> > I remain slightly confused because although this is clearly a race in
> > the gfn_to_pfn_cache code, I don't quite know how the Xen support code
> > actually managed to trigger it. We've seen oopses from dereferencing a
> > valid-looking ->khva in both __kvm_xen_has_interrupt() (the vcpu_info)
> > and in set_shinfo_evtchn_pending() (the shared_info). But surely the
> > race shouldn't happen for the vcpu_info gpc because all calls to both
> > refresh and deactivate hold the vcpu mutex, and it shouldn't happen
> 
> FWIW, neither kvm_xen_destroy_vcpu() nor kvm_xen_destroy_vm() holds the appropriate
> mutex.  

Those shouldn't be implicated in the cases where we've seen it happen.
And I think it needs the GPC to be left in a !active but still 'valid'
state due to the race, and then *reactivated* while still marked
'valid', which can't happen after the destroy paths.

> 
> > for the shared_info gpc because all calls to both will hold the
> > kvm->arch.xen.xen_lock mutex.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> > 
> > This is based on (and in) my tree at
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenfv
> > which has all the other outstanding KVM/Xen fixes.
> > 
> >  virt/kvm/pfncache.c | 181 +++++++++++++++++++++-----------------------
> >  1 file changed, 85 insertions(+), 96 deletions(-)
> 
> NAK, at least as a bug fix.  We've already shuffled deck chairs on the Titanic
> several times, I have zero confidence that doing so one more time is going to
> truly solve the underlying mess.

Agreed, but as it stands, especially with refresh_lock, this is just
overly complex. We should make the rwlock stand alone, and not have
code which drops the lock and then makes assumptions that things won't
change.


> The contract with the gfn_to_pfn_cache, or rather the lack thereof, is all kinds
> of screwed up.  E.g. I added the mutex in commit 93984f19e7bc ("KVM: Fully serialize
> gfn=>pfn cache refresh via mutex") to guard against concurrent unmap(), but the
> unmap() API has since been removed.  We need to define an actual contract instead
> of continuing to throw noodles at the wall in the hope that something sticks.
> 
> As you note above, some other mutex _should_ be held.  I think we should lean
> into that.  E.g.

I don't. I'd like this code to stand alone *without* making the caller
depend on "some other lock" just for its own internal consistency.


>   1. Pass in the guarding mutex to kvm_gpc_init() and assert that said mutex is
>      held for __refresh(), activate(), and deactivate().
>   2. Fix the cases where that doesn't hold true.
>   3. Drop refresh_mutex
> 

I'll go for (3) but I disagree about (1) and (2). Just let the rwlock
work as $DEITY intended, which is what this patch is doing. It's a
cleanup.

(And I didn't drop refresh_lock so far partly because it wants to be
done in a separate commit, but also because it does provide an
optimisation, as noted.)


* Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-17 16:32   ` Woodhouse, David
@ 2024-01-17 16:51     ` Sean Christopherson
  2024-01-17 16:59       ` David Woodhouse
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2024-01-17 16:51 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm, pbonzini, Paul Durrant

On Wed, Jan 17, 2024, David Woodhouse wrote:
> On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> > On Fri, Jan 12, 2024, David Woodhouse wrote:
> > As you note above, some other mutex _should_ be held.  I think we should lean
> > into that.  E.g.
> 
> I don't. I'd like this code to stand alone *without* making the caller
> depend on "some other lock" just for its own internal consistency.

Hmm, I get where you're coming from, but protecting a per-vCPU asset with
vcpu->mutex is completely sane/reasonable.  Xen's refresh from a completely
different task is the oddball.  And unnecessarily taking multiple mutexes muddies
the water, e.g. it's not clear what role kvm->arch.xen.xen_lock plays when it's
acquired by kvm_xen_set_evtchn().

> >   1. Pass in the guarding mutex to kvm_gpc_init() and assert that said mutex is
> >      held for __refresh(), activate(), and deactivate().
> >   2. Fix the cases where that doesn't hold true.
> >   3. Drop refresh_mutex
> > 
> 
> I'll go for (3) but I disagree about (1) and (2). Just let the rwlock
> work as $DEITY intended, which is what this patch is doing. It's a
> cleanup.
> 
> (And I didn't drop refresh_lock so far partly because it wants to be
> done in a separate commit, but also because it does provide an
> optimisation, as noted.


* Re:  [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-17 16:51     ` Sean Christopherson
@ 2024-01-17 16:59       ` David Woodhouse
  2024-01-17 18:14         ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: David Woodhouse @ 2024-01-17 16:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, pbonzini, Paul Durrant

On Wed, 2024-01-17 at 08:51 -0800, Sean Christopherson wrote:
> On Wed, Jan 17, 2024, David Woodhouse wrote:
> > On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> > > On Fri, Jan 12, 2024, David Woodhouse wrote:
> > > As you note above, some other mutex _should_ be held.  I think we should lean
> > > into that.  E.g.
> > 
> > I don't. I'd like this code to stand alone *without* making the caller
> > depend on "some other lock" just for its own internal consistency.
> 
> Hmm, I get where you're coming from, but protecting a per-vCPU asset with
> vcpu->mutex is completely sane/reasonable.  Xen's refresh from a completely
> different task is the oddball.  

Well yes, because that's what the gfn_to_pfn_cache is *for*, surely?

If we were in a context where we could sleep and take mutexes like the
vcpu mutex (and without deadlocking with a thread actually running that
vCPU), then we could mostly just use kvm_write_guest().

The gfn_to_pfn_cache exists specifically to handle that 'oddball' case.
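
As a rough illustration (a sketch only, with the surrounding variables
assumed rather than taken from the tree): a sleepable path can just write
through the guest address, while the non-sleepable path has to go through
the cache.

	/* Sleepable context (e.g. a vCPU ioctl): let the write fault pages in. */
	if (kvm_write_guest(kvm, gpa, &val, sizeof(val)))
		return -EFAULT;

	/*
	 * Non-sleepable context (e.g. delivering an event channel): cannot
	 * fault anything in, so use the khva that the gfn_to_pfn_cache keeps
	 * mapped and revalidated for us.
	 */
	read_lock_irqsave(&gpc->lock, flags);
	if (kvm_gpc_check(gpc, sizeof(val)))
		*(u32 *)gpc->khva = val;
	read_unlock_irqrestore(&gpc->lock, flags);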

> And unnecessarily taking multiple mutexes muddies
> the water, e.g. it's not clear what role kvm->arch.xen.xen_lock plays when it's
> acquired by kvm_xen_set_evtchn().

Right, I was frowning at that the other day. I believe it's there
purely because the gfn_to_pfn_cache *wasn't* self-contained with its
own consistent locking, and did this awful "caller must do my locking
for me" thing.

I'd like to fix the gfn_to_pfn_cache locking to be internally complete
and consistent, then I think we probably don't need arch.xen.xen_lock
in kvm_xen_set_evtchn(). I'm going to give that a lot more thought,
though, and not attempt to shoe-horn it into this patch.



* Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-17 16:59       ` David Woodhouse
@ 2024-01-17 18:14         ` Sean Christopherson
  2024-01-17 18:33           ` David Woodhouse
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2024-01-17 18:14 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm, pbonzini, Paul Durrant

On Wed, Jan 17, 2024, David Woodhouse wrote:
> On Wed, 2024-01-17 at 08:51 -0800, Sean Christopherson wrote:
> > On Wed, Jan 17, 2024, David Woodhouse wrote:
> > > On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> > > > On Fri, Jan 12, 2024, David Woodhouse wrote:
> > > > As you note above, some other mutex _should_ be held.  I think we should lean
> > > > into that.  E.g.
> > > 
> > > I don't. I'd like this code to stand alone *without* making the caller
> > > depend on "some other lock" just for its own internal consistency.
> > 
> > Hmm, I get where you're coming from, but protecting a per-vCPU asset with
> > vcpu->mutex is completely sane/reasonable.  Xen's refresh from a completely
> > different task is the oddball.  
> 
> Well yes, because that's what the gfn_to_pfn_cache is *for*, surely?
> 
> If we were in a context where we could sleep and take mutexes like the
> vcpu mutex (and without deadlocking with a thread actually running that
> vCPU), then we could mostly just use kvm_write_guest().
> 
> The gfn_to_pfn_cache exists specifically to handle that 'oddball' case.

No?  As I see it, the main role of the gpc code is to handle the mmu_notifier
interactions.  I am not at all convinced that the common code should take on
supporting "oh, and by the way, any task can you use the cache from any context".

That's the "oddball" I am referring to.  I'm not entirely opposed to making the
gpc code fully standalone, but I would like to at least get to the point where
we have very high confidence that arch.xen.xen_lock can be fully excised before
committing to handling that use case in the common gpc code.

> > And unnecessarily taking multiple mutexes muddies
> > the water, e.g. it's not clear what role kvm->arch.xen.xen_lock plays when it's
> > acquired by kvm_xen_set_evtchn().
> 
> Right, I was frowning at that the other day. I believe it's there
> purely because the gfn_to_pfn_cache *wasn't* self-contained with its
> own consistent locking, and did this awful "caller must do my locking
> for me" thing.
> 
> I'd like to fix the gfn_to_pfn_cache locking to be internally complete
> and consistent, then I think we probably don't need arch.xen.xen_lock
> in kvm_xen_set_evtchn(). I'm going to give that a lot more thought
> though and not attempt to shoe-horn it into this patch though.
> 




* Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-17 18:14         ` Sean Christopherson
@ 2024-01-17 18:33           ` David Woodhouse
  0 siblings, 0 replies; 8+ messages in thread
From: David Woodhouse @ 2024-01-17 18:33 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, pbonzini, Paul Durrant

On Wed, 2024-01-17 at 10:14 -0800, Sean Christopherson wrote:
> On Wed, Jan 17, 2024, David Woodhouse wrote:
> > On Wed, 2024-01-17 at 08:51 -0800, Sean Christopherson wrote:
> > > On Wed, Jan 17, 2024, David Woodhouse wrote:
> > > > On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> > > > > On Fri, Jan 12, 2024, David Woodhouse wrote:
> > > > > As you note above, some other mutex _should_ be held.  I think we should lean
> > > > > into that.  E.g.
> > > > 
> > > > I don't. I'd like this code to stand alone *without* making the caller
> > > > depend on "some other lock" just for its own internal consistency.
> > > 
> > > Hmm, I get where you're coming from, but protecting a per-vCPU asset with
> > > vcpu->mutex is completely sane/reasonable.  Xen's refresh from a completely
> > > different task is the oddball.  
> > 
> > Well yes, because that's what the gfn_to_pfn_cache is *for*, surely?
> > 
> > If we were in a context where we could sleep and take mutexes like the
> > vcpu mutex (and without deadlocking with a thread actually running that
> > vCPU), then we could mostly just use kvm_write_guest().
> > 
> > The gfn_to_pfn_cache exists specifically to handle that 'oddball' case.
> 
> No?  As I see it, the main role of the gpc code is to handle the mmu_notifier
> interactions.  I am not at all convinced that the common code should take on
> supporting "oh, and by the way, any task can you use the cache from any context".

But that really was its raison d'être. It exists to allow interrupts to
be delivered, runstate to be updated — all of which runs in a context
that can't just access the userspace address.

The MMU notifiers are just a means to that end — to know when the cache
becomes invalid and we need to take a slow path to refresh it.
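
Conceptually it's no more than this (a sketch; the list walk is paraphrased
and overlaps() is a stand-in for the real range check):

	/*
	 * mmu_notifier invalidation: mark any cache covering the range
	 * invalid, so the next kvm_gpc_check() fails and the consumer
	 * falls back to the refresh slow path.
	 */
	list_for_each_entry(gpc, &kvm->gpc_list, list) {
		write_lock_irq(&gpc->lock);
		if (gpc->valid && overlaps(gpc->uhva, start, end))
			gpc->valid = false;
		write_unlock_irq(&gpc->lock);
	}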

> That's the "oddball" I am referring to.  I'm not entirely opposed to making the
> gpc code fully standalone, but I would like to at least get to the point where
> have very high confidence that arch.xen.xen_lock can be fully excised before
> committing to handling that use case in the common gpc code.

I fully agree that we should have high confidence that the
arch.xen.xen_lock isn't needed before we actually elide it.

This patch is very much moving us in the right direction. Even if it
doesn't actually fix a bug, it's a cleanup of some fairly awful locking
abuse. We *shouldn't* drop the rwlock and then assume that fields will
be unchanged when we take it again "because our caller will save us by
doing our locking for us".

And while I can't see *how* the caller is failing to do that locking
for us, I'm definitely looking at a trace where the ->khva being
dereferenced from an allegedly valid gfn_to_pfn_cache is a valid-
looking but not present kernel address.

That's why I was staring at the code until my brain hurt and trying to
work out how it could possibly happen. And I think the awful locking
abuse in __kvm_gpc_refresh() has to be part of it.

I still want to do a full audit of all the possible parallel code paths
before actually removing xen_lock from the code paths where the
gfn_to_pfn_cache is all it protects — but this patch is a very clear
cleanup of some fairly awful locking abuse.

I'm OK with you nacking it as a "bug fix", given the uncertainty. But
let's take it as a cleanup and a simplification, one which makes the
API easier to use correctly.


* Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
  2024-01-17 16:09 ` Sean Christopherson
  2024-01-17 16:32   ` Woodhouse, David
@ 2024-01-23 15:06   ` David Woodhouse
  1 sibling, 0 replies; 8+ messages in thread
From: David Woodhouse @ 2024-01-23 15:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, pbonzini, Paul Durrant

On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> 
> NAK, at least as a bug fix.

OK. Not a bug fix.

> 
> The contract with the gfn_to_pfn_cache, or rather the lack thereof,
> is all kinds of screwed up.

Well yes, that's my point. How's this for a reworked commit message
which doesn't claim it's a bug fix....


    KVM: pfncache: clean up and simplify locking mess
    
    The locking on the gfn_to_pfn_cache is... interesting. And awful.
    
    There is a rwlock in ->lock which readers take to ensure protection
    against concurrent changes. But __kvm_gpc_refresh() makes assumptions
    that certain fields will not change even while it drops the write lock
    and performs MM operations to revalidate the target PFN and kernel
    mapping.
    
    Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via
    mutex") partly addressed that — not by fixing it, but by adding a new
    mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
    calls on a given gfn_to_pfn_cache, but is still only a partial solution.
    
    There is still a theoretical race where __kvm_gpc_refresh() runs in
    parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
    dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
    and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
    ->pfn and ->khva are still valid, and reinstalls those values into the
    structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
    but ->active clear. And a ->khva which looks like a reasonable kernel
    address but is actually unmapped.
    
    All it takes is a subsequent reactivation to cause that ->khva to be
    dereferenced. This would theoretically cause an oops which would look
    something like this:
    
    [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
    [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
    
    I say "theoretically" because theoretically, that oops that was seen in
    production cannot happen. The code which uses the gfn_to_pfn_cache is
    supposed to have its *own* locking, to further paper over the fact that
    the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
    rwlock abuse is not sufficient.
    
    For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
    shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
    the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
    but the cases where the vcpu or kvm object is being *destroyed*, in
    which case the subsequent reactivation should never happen.
    
    Theoretically.
    
    Nevertheless, this locking abuse is awful and should be fixed, even if
    no clear explanation can be found for how the oops happened. So...
    
    Clean up the semantics of hva_to_pfn_retry() so that it no longer does
    any locking gymnastics because it no longer operates on the gpc object
    at all. It is now called with a uhva and simply returns the
    corresponding pfn (pinned), and a mapped khva for it.
    
    Its caller __kvm_gpc_refresh() now sets gpc->uhva and clears gpc->valid
    before dropping ->lock, calling hva_to_pfn_retry() and retaking ->lock
    for write.
    
    If hva_to_pfn_retry() fails, *or* if the ->uhva or ->active fields in
    the gpc changed while the lock was dropped, the new mapping is discarded
    and the gpc is not modified. On success with an unchanged gpc, the new
    mapping is installed and the current ->pfn and ->khva are taken into the
    local old_pfn and old_khva variables to be unmapped once the locks are
    all released.
    
    This simplification means that ->refresh_lock is no longer needed for
    correctness, but it does still provide a minor optimisation because it
    will prevent two concurrent __kvm_gpc_refresh() calls from mapping a
    given PFN, only for one of them to lose the race and discard its
    mapping.
    
    The optimisation in hva_to_pfn_retry() where it attempts to use the old
    mapping if the pfn doesn't change is dropped, since it makes the pinning
    more complex. It's a pointless optimisation anyway, since the odds of
    the pfn ending up the same when the uhva has changed (i.e. the odds of
    the two userspace addresses both pointing to the same underlying
    physical page) are negligible.
    
    The 'hva_change' local variable in __kvm_gpc_refresh() is also removed,
    since it's simpler just to clear gpc->valid if the uhva changed.
    Likewise the unmap_old variable is dropped because it's just as easy to
    check the old_pfn variable for KVM_PFN_ERR_FAULT.
    
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: Paul Durrant <pdurrant@amazon.com>




